
Reinforcement Learning of Control Policy for Linear Temporal Logic Specifications Using Limit-Deterministic Generalized Büchi Automata

Ryohei Oura, Ami Sakakibara,  and Toshimitsu Ushio This work was partially supported by JST-ERATO HASUO Project Grant Number JPMJER1603, Japan, and JST-Mirai Program Grant Number JPMJMI18B4, Japan. The work of A. Sakakibara was supported by the Grant-in-Aid for Japan Society for the Promotion of Science Research Fellow under Grant JP19J13487. The authors are with the Graduate School of Engineering Science, Osaka University, Toyonaka 560-8531, Japan (e-mail: r-oura, [email protected]; [email protected]).
Abstract

This letter proposes a novel reinforcement learning method for the synthesis of a control policy satisfying a control specification described by a linear temporal logic formula. We assume that the controlled system is modeled by a Markov decision process (MDP). We convert the specification to a limit-deterministic generalized Büchi automaton (LDGBA) with several accepting sets that accepts all infinite sequences satisfying the formula. The LDGBA is augmented so that it explicitly records the previous visits to accepting sets. We take a product of the augmented LDGBA and the MDP, based on which we define a reward function. The agent gets rewards whenever state transitions are in an accepting set that has not been visited for a certain number of steps. Consequently, sparsity of rewards is relaxed and optimal circulations among the accepting sets are learned. We show that the proposed method can learn an optimal policy when the discount factor is sufficiently close to one.

Index Terms:
Reinforcement Learning, Linear Temporal Logic, Limit-Deterministic Büchi Automata.

I Introduction

Reinforcement learning (RL) [1] is a useful approach to learning an optimal policy from sample behaviors of a controlled system with inherent stochasticity, e.g., a Markov decision process (MDP) [2], when the probabilities associated with the controlled system are unknown a priori. In RL, we use a reward function that assigns a reward to each transition in the behaviors and evaluate a control policy by the return, namely an expected (discounted) sum of the rewards along the behaviors. One of the recent trends is to apply RL to synthesis problems under linear temporal logic (LTL) constraints. LTL is a formal language with rich expressivity and thus suitable for describing complex missions assigned to a controlled system [3, 4].

It is known that any LTL formula can be converted into an ω\omega-automaton with a Büchi or a Rabin acceptance condition [3]. In many studies on LTL synthesis problems using RL, reward functions are formed systematically from automata corresponding to the LTL specification. This direction was first investigated by Sadigh et al. [5], where they defined a reward function based on the acceptance condition of a deterministic Rabin automaton [3] that accepts all words satisfying the LTL constraint. Reward functions defined on specification automata were also proposed for a deep reinforcement learning method [6] and for an extension of LTL in collaboration with a control barrier function [7].

Recently, limit-deterministic Büchi automata (LDBAs) and their generalized counterparts (LDGBAs) have attracted much attention as ω\omega-automata corresponding to LTL specifications [8]. RL-based approaches to the synthesis of control policies using LDBAs or LDGBAs have been proposed in [9, 10, 11, 12]. An (LD)GBA has multiple accepting sets and accepts behaviors visiting all accepting sets infinitely often. One possible approach to generalized Büchi acceptance conditions is to convert a GBA into a corresponding BA, which has a single accepting set. The conversion, however, fixes the order of visits to the accepting sets of the GBA [3] and makes the reward sparse, which is a critical issue in RL-based controller synthesis. Another approach to RL-based synthesis for generalized Büchi conditions is the accepting frontier function introduced in [9, 10], on which the reward function is based. However, the function is memoryless, that is, it does not retain which accepting sets have already been visited, although this information is important for improving learning performance.

In this letter, we propose a novel method to augment an LDGBA converted from a given LTL formula. Then, we define a reward function based on the acceptance condition of the product MDP of the augmented LDGBA and the controlled system represented as the MDP. The rest of the letter is organized as follows. Section II reviews an MDP, LTL, and automata. Section III proposes a novel RL-based method for the synthesis of a control policy. Section IV presents a numerical example to show the effectiveness of our proposed method.

Notations

0\mathbb{N}_{0} is the set of nonnegative integers. 0\mathbb{R}_{\geq 0} is the set of nonnegative real numbers. For sets AA and BB, ABAB denotes their concatenation, i.e., AB={ab;aA,bB}AB=\{ab;a\in A,b\in B\}. AωA^{\omega} denotes the infinite concatenation of the set AA and AA^{\ast} denotes the finite one, i.e., Aω={a0a1;anA,n0}A^{\omega}=\{a_{0}a_{1}\ldots;a_{n}\in A,n\in\mathbb{N}_{0}\} and A={a0a1an;anA,n0}A^{\ast}=\{a_{0}a_{1}\ldots a_{n};a_{n}\in A,n\in\mathbb{N}_{0}\}, respectively. εA\varepsilon\in A^{*} is the empty string.

II Preliminaries

II-A Markov Decision Process

Definition 1

A (labeled) Markov decision process (MDP) is a tuple MM = (S,A,P,sinit,AP,L)(S,A,P,s_{init},AP,L), where S is a finite set of states, AA is a finite set of actions, P:S×S×A[0,1]P:S\times S\times A\rightarrow[0,1] is a transition probability function, sinitSs_{init}\in S is the initial state, APAP is a finite set of atomic propositions, and L:S×A×S 2APL:S\times A\times S\ \rightarrow\ 2^{AP} is a labeling function that assigns a set of atomic propositions to each transition. Let 𝒜(s)={aA;sS s.t. P(s|s,a)0}\mathcal{A}(s)=\{a\in A;\exists s^{\prime}\in S\text{ s.t. }P(s^{\prime}|s,a)\neq 0\}. Note that sSP(s|s,a)=1\sum_{s^{\prime}\in S}P(s^{\prime}|s,a)=1 for any state sSs\in S and action a𝒜(s)a\in\mathcal{A}(s).

In the MDP MM, an infinite path starting from a state s0Ss_{0}\in S is defined as a sequence ρ=s0a0s1S(AS)ω\rho\ =\ s_{0}a_{0}s_{1}\ldots\ \in S(AS)^{\omega} such that P(si+1|si,ai)>0P(s_{i+1}|s_{i},a_{i})>0 for any i0i\in\mathbb{N}_{0}. A finite path is a finite sequence in S(AS)S(AS)^{\ast}. In addition, we sometimes represent ρ\rho as ρinit\rho_{init} to emphasize that ρ\rho starts from s0=sinits_{0}=s_{init}. For a path ρ=s0a0s1\rho\ =\ s_{0}a_{0}s_{1}\ldots, we define the corresponding labeled path L(ρ)=L(s0,a0,s1)L(s1,a1,s2)(2AP)ωL(\rho)\ =\ L(s_{0},a_{0},s_{1})L(s_{1},a_{1},s_{2})\ldots\in(2^{AP})^{\omega}. InfPathM(resp., FinPathM)InfPath^{M}\ (\text{resp., }FinPath^{M}) is defined as the set of infinite (resp., finite) paths starting from s0=sinits_{0}=s_{init} in the MDP MM. For each finite path ρ\rho, last(ρ)last(\rho) denotes its last state.
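As a concrete, purely illustrative reading of Definition 1 and of the path notions above, a labeled MDP and path sampling can be sketched in Python as follows; the dictionary-based representation, the class name, and the helper functions are our own assumptions, not part of the letter.

```python
import random

class MDP:
    """A labeled MDP (Definition 1): P[s][a] maps next states to probabilities,
    label[(s, a, s_next)] is the set of atomic propositions on that transition."""
    def __init__(self, P, s_init, label):
        self.P = P
        self.s_init = s_init
        self.label = label

    def actions(self, s):
        """A(s): the actions enabled at state s."""
        return list(self.P[s].keys())

    def prob(self, s, a, s_next):
        """P(s_next | s, a)."""
        return self.P[s][a].get(s_next, 0.0)

    def step(self, s, a):
        """Sample a next state according to P(. | s, a)."""
        succs, probs = zip(*self.P[s][a].items())
        return random.choices(succs, weights=probs)[0]

def sample_finite_path(mdp, horizon):
    """Sample a finite path s0 a0 s1 ... in S(AS)^* under uniformly random actions."""
    s, path = mdp.s_init, [mdp.s_init]
    for _ in range(horizon):
        a = random.choice(mdp.actions(s))
        s_next = mdp.step(s, a)
        path += [a, s_next]
        s = s_next
    return path
```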

Definition 2

A policy on an MDP MM is defined as a mapping π:FinPathM×𝒜(last(ρ))[0,1]\pi:FinPath^{M}\times\mathcal{A}(last(\rho))\rightarrow[0,1]. A policy π\pi is a positional policy if for any ρFinPathM\rho\in FinPath^{M} and any a𝒜(last(ρ))a\in\mathcal{A}(last(\rho)), it holds that π(ρ,a)=π(last(ρ),a)\pi(\rho,a)=\pi(last(\rho),a); and there exists one action a𝒜(last(ρ))a^{\prime}\in\mathcal{A}(last(\rho)) such that π(ρ,a)=1\pi(\rho,a)=1 if a=aa=a^{\prime}, and π(ρ,a)=0\pi(\rho,a)=0 for any a𝒜(last(ρ))a\in\mathcal{A}(last(\rho)) with aaa\neq a^{\prime}.

Let InfPathπMInfPath^{M}_{\pi} (resp., FinPathπMFinPath^{M}_{\pi}) be the set of infinite (resp., finite) paths starting from s0=sinits_{0}=s_{init} in the MDP MM under a policy π\pi. The behavior of the MDP MM under a policy π\pi is defined on a probability space (InfPathπM,InfPathπM,PrπM)(InfPath^{M}_{\pi},\mathcal{F}_{InfPath^{M}_{\pi}},Pr^{M}_{\pi}).

A Markov chain induced by the MDP MM with a positional policy π\pi is a tuple MCπ=(Sπ,Pπ,sinit,AP,L)MC_{\pi}=(S_{\pi},P_{\pi},s_{init},AP,L), where Sπ=SS_{\pi}=S, Pπ(s|s)=P(s|s,a)P_{\pi}(s^{\prime}|s)=P(s^{\prime}|s,a) for s,sSs,s^{\prime}\in S and a𝒜(s)a\in\mathcal{A}(s) such that π(s,a)=1\pi(s,a)=1. The state set SπS_{\pi} of MCπMC_{\pi} can be represented as a disjoint union of a set of transient states TπT_{\pi} and closed irreducible sets of recurrent states RπjR^{j}_{\pi} with j{1,,h}j\in\{1,\ldots,h\}, i.e., Sπ=TπRπ1RπhS_{\pi}=T_{\pi}\cup R^{1}_{\pi}\cup\ldots\cup R^{h}_{\pi} [13]. In the following, we say a “recurrent class” instead of a “closed irreducible set of recurrent states” for simplicity.

In the MDP MM, we define a reward function R:S×A×S0R:S\times A\times S\rightarrow\mathbb{R}_{\geq 0}. The function returns the immediate reward received after the agent performs an action aa at a state ss and reaches a next state ss^{\prime} as a result.

Definition 3

For a policy π\pi on an MDP MM, any state sSs\in S, and a reward function RR, we define the expected discounted reward as

V^{\pi}(s)=\mathbb{E}^{\pi}\Big[\sum_{n=0}^{\infty}\gamma^{n}R(S_{n},A_{n},S_{n+1})\,\Big|\,S_{0}=s\Big],

where 𝔼π\mathbb{E}^{\pi} denotes the expected value given that the agent follows the policy π\pi from the state ss and γ[0,1)\gamma\in[0,1) is a discount factor. The function Vπ(s)V^{\pi}(s) is often referred to as a state-value function under the policy π\pi. For any state-action pair (s,a)S×A(s,a)\in S\times A, we define an action-value function Qπ(s,a)Q^{\pi}(s,a) under the policy π\pi as follows.

Q^{\pi}(s,a)=\mathbb{E}^{\pi}\Big[\sum_{n=0}^{\infty}\gamma^{n}R(S_{n},A_{n},S_{n+1})\,\Big|\,S_{0}=s,A_{0}=a\Big].
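For intuition, the expected discounted reward of Definition 3 can be approximated empirically by averaging truncated discounted returns over sampled episodes. The sketch below reuses the MDP class sketched earlier and assumes a positional policy given as a state-to-action dictionary and a reward given as a function of a transition; all of this is illustrative, not the authors' code.

```python
def estimate_value(mdp, policy, reward, s0, gamma=0.95, episodes=1000, horizon=200):
    """Monte Carlo estimate of V^pi(s0) using truncated discounted returns."""
    total = 0.0
    for _ in range(episodes):
        s, ret, discount = s0, 0.0, 1.0
        for _ in range(horizon):
            a = policy[s]                      # positional policy: state -> action
            s_next = mdp.step(s, a)
            ret += discount * reward(s, a, s_next)
            discount *= gamma
            s = s_next
        total += ret
    return total / episodes
```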
Definition 4

For any state sSs\in S, a policy π\pi^{\ast} is optimal if

\pi^{\ast}\in\mathop{\mathrm{arg\,max}}\limits_{\pi\in\Pi^{pos}}V^{\pi}(s),

where Πpos\Pi^{pos} is the set of positional policies over the state set SS.

II-B Linear Temporal Logic and Automata

In our proposed method, we use linear temporal logic (LTL) formulas to describe various constraints or properties and to systematically assign corresponding rewards. LTL formulas are constructed from a set of atomic propositions, Boolean operators, and temporal operators. We use the standard notations for the Boolean operators: \top (true), ¬\neg (negation), and \land (conjunction). LTL formulas over a set of atomic propositions APAP are defined as

\varphi::=\top\ |\ \alpha\in AP\ |\ \varphi_{1}\land\varphi_{2}\ |\ \neg\varphi\ |\ \text{{\bf X}}\varphi\ |\ \varphi_{1}\text{{\bf U}}\varphi_{2},

where φ\varphi, φ1\varphi_{1}, and φ2\varphi_{2} are LTL formulas. Additional Boolean operators are defined as :=¬\perp:=\neg\top, φ1φ2:=¬(¬φ1¬φ2)\varphi_{1}\lor\varphi_{2}:=\neg(\neg\varphi_{1}\land\neg\varphi_{2}), and φ1φ2:=¬φ1φ2\varphi_{1}\Rightarrow\varphi_{2}:=\neg\varphi_{1}\lor\varphi_{2}. The operators X and U are called “next” and “until”, respectively. Using the operator U, we define two temporal operators: (1) eventually, Fφ:=Uφ\text{{\bf F}}\varphi:=\top\text{{\bf U}}\varphi and (2) always, Gφ:=¬F¬φ\text{{\bf G}}\varphi:=\neg\text{{\bf F}}\neg\varphi.

Definition 5

For an LTL formula φ\varphi and an infinite path ρ=s0a0s1\rho=s_{0}a_{0}s_{1}\ldots of an MDP MM with s0Ss_{0}\in S, the satisfaction relation M,ρφM,\rho\models\varphi is recursively defined as follows.

M,\rho\models\top,
M,\rho\models\alpha\in AP \Leftrightarrow \alpha\in L(s_{0},a_{0},s_{1}),
M,\rho\models\varphi_{1}\land\varphi_{2} \Leftrightarrow M,\rho\models\varphi_{1}\land M,\rho\models\varphi_{2},
M,\rho\models\neg\varphi \Leftrightarrow M,\rho\not\models\varphi,
M,\rho\models\text{{\bf X}}\varphi \Leftrightarrow M,\rho[1:]\models\varphi,
M,\rho\models\varphi_{1}\text{{\bf U}}\varphi_{2} \Leftrightarrow \exists j\geq 0,\ M,\rho[j:]\models\varphi_{2}\land\forall i,0\leq i<j,\ M,\rho[i:]\models\varphi_{1},

where ρ[i:]\rho[i:] denotes the ii-th suffix ρ[i:]=siaisi+1\rho[i:]=s_{i}a_{i}s_{i+1}\ldots.

The next operator X requires that φ\varphi be satisfied by the suffix of ρ\rho starting from the next state. The until operator U requires that φ1\varphi_{1} hold true until φ2\varphi_{2} becomes true along the path ρ\rho. In the following, we write ρφ\rho\models\varphi for simplicity, without referring to MM.

For any policy π\pi, the probability of all paths starting from sinits_{init} on the MDP MM that satisfy an LTL formula φ\varphi under the policy π\pi, or the satisfaction probability under π\pi is defined as

Pr^{M}_{\pi}(s_{init}\models\varphi):=Pr^{M}_{\pi}(\{\rho_{init}\in InfPath^{M}_{\pi};\rho_{init}\models\varphi\}).

We say that an LTL formula φ\varphi is positively satisfied by a positional policy π\pi if

Pr^{M}_{\pi}(s_{init}\models\varphi)>0.

Any LTL formula φ\varphi can be converted into various ω\omega-automata, namely finite state machines that recognize all infinite words satisfying φ\varphi. We review a generalized Büchi automaton at the beginning, and then introduce a limit-deterministic generalized Büchi automaton [10].

Definition 6

A transition-based generalized Büchi automaton (tGBA) is a tuple B=(X,xinit,Σ,δ,)B=(X,\ x_{init},\ \Sigma,\ \delta,\ \mathcal{F}), where XX is a finite set of states, xinitXx_{init}\in X is the initial state, Σ\Sigma is an input alphabet including ε\varepsilon, δX×Σ×X\delta\subset X\times\Sigma\times X is a set of transitions, and ={F1,,Fn}\mathcal{F}=\{F_{1},\ldots,F_{n}\} is an acceptance condition, where for each j{1,,n}j\in\{1,\ldots,n\}, FjδF_{j}\subset\delta is a set of accepting transitions and called an accepting set. We refer to a tGBA with one accepting set as a tBA.

An infinite sequence r=x0σ0x1X(ΣX)ωr=x_{0}\sigma_{0}x_{1}\ldots\in X(\Sigma X)^{\omega} is called an infinite run if (xi,σi,xi+1)δ(x_{i},\sigma_{i},x_{i+1})\in\delta\ for any i0i\in\mathbb{N}_{0}. An infinite word w=σ0σ1Σωw=\sigma_{0}\sigma_{1}\ldots\in\Sigma^{\omega} is accepted by BB if and only if there exists an infinite run r=x0σ0x1r=x_{0}\sigma_{0}x_{1}\ldots starting from x0=xinitx_{0}=x_{init} such that inf(r)Fjinf(r)\cap F_{j}\neq\emptyset\ for each FjF_{j}\in\mathcal{F}, where inf(r)inf(r) is the set of transitions that occur infinitely often in the run rr.
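For ultimately periodic runs of the form r = prefix·(cycle)^ω, the set inf(r) is exactly the set of transitions on the cycle, so the generalized Büchi condition reduces to a simple intersection test. A minimal sketch (ours, not from the letter):

```python
def accepts_lasso(cycle_transitions, accepting_sets):
    """Generalized Büchi acceptance for a lasso run: every accepting set F_j
    must share at least one transition with the cycle, i.e., inf(r) ∩ F_j ≠ ∅."""
    return all(cycle_transitions & F_j for F_j in accepting_sets)
```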

Definition 7

A transition-based limit-deterministic generalized Büchi automaton (tLDGBA) is a tGBA B=(X,xinit,Σ,δ,)B=(X,x_{init},\Sigma,\delta,\mathcal{F}) whose state set XX is partitioned into two disjoint sets XinitialX_{initial} and XfinalX_{final} such that

  • FjXfinal×Σ×XfinalF_{j}\subset X_{final}\times\Sigma\times X_{final}, j{1,,n}\forall j\in\{1,...,n\},

  • |{(x,σ,x)δ;σΣ,xXfinal}|=1|\{(x,\sigma,x^{\prime})\in\delta;\sigma\!\in\!\Sigma,x^{\prime}\in X_{final}\}|\!=\!1, xXfinal\forall x\!\in\!X_{final},

  • |{(x,σ,x)δ;σΣ,xXinitial}||\{(x,\sigma,x^{\prime})\in\delta;\sigma\!\in\!\Sigma,x^{\prime}\in X_{initial}\}|=0, xXfinal\forall x\!\in\!X_{final},

  • (x,ε,x)δ,xXinitialxXfinal\forall(x,\varepsilon,x^{\prime})\in\delta,\ x\in X_{initial}\land x^{\prime}\in X_{final}.

An ε\varepsilon-transition enables the tLDGBA to change its state with no input and reflects a single “guess” from XinitialX_{initial} to XfinalX_{final}. Note that by the construction in [8], the transitions in each part are deterministic except for ε\varepsilon-transitions from XinitialX_{initial} to XfinalX_{final}. It is known that, for any LTL formula φ\varphi, there exists a tLDGBA that accepts all words satisfying φ\varphi [8]. We refer to a tLDGBA with one accepting set as a tLDBA. In particular, we represent a tLDGBA recognizing an LTL formula φ\varphi as BφB_{\varphi}, whose input alphabet is given by Σ=2AP{ε}\Sigma=2^{AP}\cup\{\varepsilon\}.

III Reinforcement-Learning-Based Synthesis of Control Policy

We introduce an automaton augmented with binary-valued vectors. The augmentation records which accepting sets have been visited, which is later used to encourage transitions in every accepting set to occur infinitely often.

Let V={(v1,,vn)T;vi{0,1},i{1,,n}}V=\{(v_{1},\ldots,v_{n})^{T}\ ;\ v_{i}\in\{0,1\},\ i\in\{1,\ldots,n\}\} be a set of binary-valued vectors, and let 𝟏\bm{1} and 𝟎\bm{0} be the nn-dimensional vectors with all elements 1 and 0, respectively. In order to augment a tLDGBA BφB_{\varphi}, we introduce three functions visitf:δVvisitf:\delta\rightarrow V, reset:VVreset:V\rightarrow V, and Max:V×VVMax:V\times V\rightarrow V as follows. For any eδe\in\delta, visitf(e)=(v1,,vn)Tvisitf(e)=(v_{1},\ldots,v_{n})^{T}, where vi=1v_{i}=1 if eFie\in F_{i} and vi=0v_{i}=0 otherwise. For any vVv\in V, reset(v)=𝟎reset(v)=\bm{0} if v=𝟏v=\bm{1} and reset(v)=vreset(v)=v otherwise. For any v,uVv,u\in V, Max(v,u)=(l1,,ln)TMax(v,u)=(l_{1},\ldots,l_{n})^{T}, where li=max{vi,ui}l_{i}=max\{v_{i},u_{i}\} for any i{1,,n}i\in\{1,\ldots,n\}.

Each vector vv is called a memory vector and represents which accepting sets have been visited. The function visitfvisitf returns a binary vector whose ii-th element is 1 if and only if a transition in the accepting set FiF_{i} occurs. The function resetreset returns the zero vector 𝟎\bm{0} if at least one transition in each accepting set has occurred after the latest reset. Otherwise, it returns the input vector without change.
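These three functions translate directly into code. The sketch below represents memory vectors as tuples of 0/1 and assumes the accepting sets F_i are given as sets of transitions; it is illustrative only.

```python
def visitf(e, accepting_sets):
    """Binary vector whose i-th entry is 1 iff the transition e belongs to F_i."""
    return tuple(1 if e in F_i else 0 for F_i in accepting_sets)

def reset(v):
    """Return the zero vector once every accepting set has been visited (v = 1...1)."""
    return tuple(0 for _ in v) if all(v) else v

def Max(v, u):
    """Element-wise maximum of two memory vectors."""
    return tuple(max(vi, ui) for vi, ui in zip(v, u))
```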

Definition 8

For a tLDGBA Bφ=(X,xinit,Σ,δ,)B_{\varphi}=(X,x_{init},\Sigma,\delta,\mathcal{F}), its augmented automaton is a tLDGBA B¯φ=(X¯,x¯init,\bar{B}_{\varphi}=(\bar{X},\bar{x}_{init}, Σ¯,δ¯,¯)\bar{\Sigma},\bar{\delta},\bar{\mathcal{F}}), where X¯=X×V\bar{X}=X\times V, x¯init=(xinit,𝟎)\bar{x}_{init}=(x_{init},\bm{0}), Σ¯=Σ\bar{\Sigma}=\Sigma, δ¯\bar{\delta} is defined as δ¯\bar{\delta} = {((x,v),σ¯,(x,v))X¯×Σ¯×X¯;(x,σ¯,x)δ,v=reset(Max(v,visitf((x,σ¯,x))))}\{((x,v),\bar{\sigma},(x^{\prime},v^{\prime}))\in\bar{X}\times\bar{\Sigma}\times\bar{X};\ (x,\bar{\sigma},x^{\prime})\in\delta,v^{\prime}=reset(Max(v,visitf((x,\bar{\sigma},x^{\prime}))))\}, and ¯={F1¯,\mathcal{\bar{F}}=\{\bar{F_{1}}, ,Fn¯}\ldots,\bar{F_{n}}\} is defined as Fj¯={((x,v),σ¯,(x,v))δ¯;(x,σ¯,x)Fj,\bar{F_{j}}=\{((x,v),\bar{\sigma},(x^{\prime},v^{\prime}))\in\bar{\delta}\ ;(x,\bar{\sigma},x^{\prime})\in F_{j}, vj=0}v_{j}=0\} for each j{1,,n}j\in\{1,...,n\}.

The augmented tLDGBA B¯φ\bar{B}_{\varphi} keeps track of previous visits to the accepting sets of BφB_{\varphi}. Intuitively, along a run of B¯φ\bar{B}_{\varphi}, a memory vector vv is reset to 𝟎\bm{0} when at least one transition in each accepting set of the original tLDGBA BφB_{\varphi} has occurred. We now show that a tLDGBA and its augmented automaton accept the same language. Let (B)Σω\mathcal{L}(B)\subseteq\Sigma^{\omega} be the accepted language of an automaton BB with the alphabet Σ\Sigma, namely the set of all infinite words accepted by BB.
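Definition 8 amounts to pairing every automaton state with a memory vector and updating the vector by reset(Max(v, visitf(e))) along each transition. The following sketch enumerates the reachable part of the augmented automaton; it reuses the helper functions above and assumes the original transitions are given as a set delta of triples (x, sigma, x'). It is a sketch under these assumptions, not the authors' implementation.

```python
from collections import deque

def augment(x_init, delta, accepting_sets):
    """Construct the reachable fragment of the augmented tLDGBA of Definition 8."""
    n = len(accepting_sets)
    x_bar_init = (x_init, (0,) * n)
    delta_bar = set()
    F_bar = [set() for _ in range(n)]          # augmented accepting sets F_bar_j
    queue, seen = deque([x_bar_init]), {x_bar_init}
    while queue:
        x, v = queue.popleft()
        for (x1, sigma, x2) in delta:
            if x1 != x:
                continue
            e = (x1, sigma, x2)
            v2 = reset(Max(v, visitf(e, accepting_sets)))
            src, tgt = (x, v), (x2, v2)
            delta_bar.add((src, sigma, tgt))
            for j, F_j in enumerate(accepting_sets):
                if e in F_j and v[j] == 0:     # accepting iff e is in F_j and v_j = 0
                    F_bar[j].add((src, sigma, tgt))
            if tgt not in seen:
                seen.add(tgt)
                queue.append(tgt)
    return x_bar_init, delta_bar, F_bar
```

For the formula in Eq. (2), starting from (x0, (0,0)^T), reading {a} updates the memory vector to (1,0)^T, and then reading {b} yields (1,1)^T, which is immediately reset to (0,0)^T; this is consistent with the augmented automaton in Fig. 2.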

Proposition 1

Let B=(X,xinit,Σ,δ,)B=(X,x_{init},\Sigma,\delta,\mathcal{F}) and B¯=(X¯,x¯init,Σ¯,δ¯,¯)\bar{B}=(\bar{X},\bar{x}_{init},\bar{\Sigma},\bar{\delta},\bar{\mathcal{F}}) be an arbitrary tLDGBA and its augmentation, respectively. Then, we have (B)=(B¯)\mathcal{L}(B)=\mathcal{L}(\bar{B}).

Proof:

Recall that Σ=Σ¯\Sigma=\bar{\Sigma}. We prove set inclusions in both directions.

  • \subset:

    Consider any w=σ0σ1(B)w=\sigma_{0}\sigma_{1}\ldots\in\mathcal{L}(B). Then, there exists a run r=x0σ0x1r=x_{0}\sigma_{0}x_{1} σ1x2X(ΣX)ω\sigma_{1}x_{2}\ldots\in X(\Sigma X)^{\omega} of BB such that x0=xinitx_{0}=x_{init} and inf(r)Fjinf(r)\cap F_{j}\neq\emptyset for each FjF_{j}\in\mathcal{F}. For the run rr, we construct a sequence r¯=x¯0σ¯0x¯1σ¯1x¯2X¯(Σ¯X¯)ω\bar{r}=\bar{x}_{0}\bar{\sigma}_{0}\bar{x}_{1}\bar{\sigma}_{1}\bar{x}_{2}\ldots\in\bar{X}(\bar{\Sigma}\bar{X})^{\omega} satisfying x¯i=(xi,vi)\bar{x}_{i}=(x_{i},{v}_{i}) and σ¯i=σi\bar{\sigma}_{i}=\sigma_{i} for any ii\in\mathbb{N}, where v0=𝟎{v}_{0}=\mathbf{0} and

    v_{i+1}=reset\Big(Max\big(v_{i},visitf((x_{i},\bar{\sigma}_{i},x_{i+1}))\big)\Big),\quad\forall i\in\mathbb{N}.

    Clearly from the construction, we have (x¯i,σ¯i,x¯i+1)δ¯(\bar{x}_{i},\bar{\sigma}_{i},\bar{x}_{i+1})\in\bar{\delta} for any ii\in\mathbb{N}. Thus, r¯\bar{r} is a run of B¯\bar{B} starting from x¯0=(xinit,𝟎)=x¯init\bar{x}_{0}=(x_{init},\mathbf{0})=\bar{x}_{init}.

    We now show that inf(r¯)F¯jinf(\bar{r})\cap\bar{F}_{j}\neq\emptyset for each F¯j¯\bar{F}_{j}\in\bar{\mathcal{F}}. Since inf(r)Fjinf(r)\cap F_{j}\neq\emptyset for each FjF_{j}\in\mathcal{F}, we have

    inf(\bar{r})\cap\{(\bar{x},\bar{\sigma},\bar{x}^{\prime})\in\bar{\delta};\ visitf(([\![\bar{x}]\!]_{X},\bar{\sigma},[\![\bar{x}^{\prime}]\!]_{X}))_{j}=1\}\neq\emptyset

    for any j{1,,n}j\in\{1,\ldots,n\}, where [[(x,v)]]X=x[\![(x,v)]\!]_{X}=x for each (x,v)X¯(x,v)\in\bar{X}. By the construction of r¯\bar{r}, therefore, there are infinitely many indices ll\in\mathbb{N} with vl=𝟎v_{l}=\mathbf{0}. Let l1,l2l_{1},l_{2}\in\mathbb{N} be arbitrary nonnegative integers such that l1<l2l_{1}<l_{2}, vl1=vl2=𝟎v_{l_{1}}=v_{l_{2}}=\mathbf{0}, and vl𝟎v_{l^{\prime}}\neq\mathbf{0} for any l{l1+1,,l21}l^{\prime}\in\{l_{1}+1,\ldots,l_{2}-1\}. Then,

    \forall j\in\{1,\ldots,n\},\ \exists k\in\{l_{1},l_{1}+1,\ldots,l_{2}-1\},\ (x_{k},\sigma_{k},x_{k+1})\in F_{j}\land(v_{k})_{j}=0,

    where (vk)j(v_{k})_{j} is the jj-th element of vkv_{k}. Hence, we have inf(r¯)F¯jinf(\bar{r})\cap\bar{F}_{j}\neq\emptyset for each F¯j¯\bar{F}_{j}\in\bar{\mathcal{F}}, which implies w(B¯)w\in\mathcal{L}(\bar{B}).

  • \supset:

    Consider any w¯=σ¯0σ¯1(B¯)\bar{w}=\bar{\sigma}_{0}\bar{\sigma}_{1}\ldots\in\mathcal{L}(\bar{B}). Then, there exists a run r¯=x¯0σ¯0x¯1σ¯1x¯2X¯(Σ¯X¯)ω\bar{r}=\bar{x}_{0}\bar{\sigma}_{0}\bar{x}_{1}\bar{\sigma}_{1}\bar{x}_{2}\ldots\in\bar{X}(\bar{\Sigma}\bar{X})^{\omega} of B¯\bar{B} such that x¯0=x¯init\bar{x}_{0}=\bar{x}_{init} and inf(r¯)F¯jinf(\bar{r})\cap\bar{F}_{j}\neq\emptyset for each F¯j¯\bar{F}_{j}\in\bar{\mathcal{F}}, i.e.,

    \forall j\in\{1,\ldots,n\},\ \forall k\in\mathbb{N},\ \exists l\geq k,\ ([\![\bar{x}_{l}]\!]_{X},\bar{\sigma}_{l},[\![\bar{x}_{l+1}]\!]_{X})\in F_{j}\land(\bar{v}_{l})_{j}=0. \quad (1)

    For the run r¯\bar{r}, we construct a sequence r=x0σ0x1σ1x2X(ΣX)ωr=x_{0}\sigma_{0}x_{1}\sigma_{1}x_{2}\ldots\in X(\Sigma X)^{\omega} such that xi=[[x¯i]]Xx_{i}=[\![\bar{x}_{i}]\!]_{X} and σi=σ¯i\sigma_{i}=\bar{\sigma}_{i} for any ii\in\mathbb{N}. It is clear that rr is a run of BB starting from x0=xinitx_{0}=x_{init}. It holds by Eq. (1) that inf(r)Fjinf(r)\cap F_{j}\neq\emptyset for each FjF_{j}\in\mathcal{F}, which implies w¯(B)\bar{w}\in\mathcal{L}(B). ∎

For example, shown in Figs. 1 and 2 are a tLDGBA and its augmented automaton, respectively, associated with the following LTL formula.

\varphi=\text{{\bf GF}}a\land\text{{\bf GF}}b\land\text{{\bf G}}\neg c. \quad (2)

The acceptance condition {\mathcal{F}} of the tLDGBA is given by ={\mathcal{F}}= {F1,F2}\{F_{1},F_{2}\}, where F1={(x0,{a},x0),(x0,{a,b},x0)}F_{1}=\{(x_{0},\{a\},x_{0}),\ (x_{0},\{a,b\},x_{0})\} and F2={(x0,{b},x0),(x0,{a,b},x0)}F_{2}=\{(x_{0},\{b\},x_{0}),\ (x_{0},\{a,b\},x_{0})\}. Practically, states in a strongly connected component that contains no accepting transitions can be merged as shown in Fig. 2.

Figure 1: The tLDGBA recognizing the LTL formula GFaGFbG¬c\text{{\bf GF}}a\wedge\text{{\bf GF}}b\wedge\text{{\bf G}}\neg c, where X={x0,x1}=XfinalX=\{x_{0},x_{1}\}=X_{final} and xinit=x0x_{init}=x_{0}. Red arcs are accepting transitions that are numbered in accordance with the accepting sets they belong to.
Figure 2: The augmented automaton for the tLDGBA in Fig. 1 recognizing the LTL formula GFaGFbG¬c\text{{\bf GF}}a\wedge\text{{\bf GF}}b\wedge\text{{\bf G}}\neg c, where the initial state is (x0,(0,0)T)(x_{0},(0,0)^{T}). Red arcs are accepting transitions that are numbered in accordance with the accepting sets they belong to. All states corresponding to x1x_{1} are merged into (x1,(,)T)(x_{1},(*,*)^{T}).

We modify the standard definition of a product MDP to deal with ε\varepsilon-transitions in the augmented automaton.

Definition 9

Given an augmented tLDGBA B¯φ\bar{B}_{\varphi} and an MDP MM, a tuple MB¯φ=M=(S,A,sinit,P,δ,M\otimes\bar{B}_{\varphi}=M^{\otimes}=(S^{\otimes},A^{\otimes},s_{init}^{\otimes},P^{\otimes},\delta^{\otimes}, ){\mathcal{F}}^{\otimes}) is a product MDP, where S=S×X¯S^{\otimes}=S\times\bar{X} is the finite set of states; AA^{\otimes} is the finite set of actions such that A=A{εx¯;x¯X s.t. (x¯,ε,x¯)δ¯}A^{\otimes}=A\cup\{\varepsilon_{\bar{x}^{\prime}};\exists\bar{x}^{\prime}\!\in\!X\text{ s.t. }(\bar{x},\varepsilon,\bar{x}^{\prime})\in\bar{\delta}\}, where εx¯\varepsilon_{\bar{x}^{\prime}} is the action for the ε\varepsilon-transition to the state x¯X¯\bar{x}^{\prime}\!\in\!\bar{X}; sinit=(sinit,x¯init)s_{init}^{\otimes}=(s_{init},\bar{x}_{init}) is the initial state; P:S×S×A[0,1]P^{\otimes}:S^{\otimes}\times S^{\otimes}\times A^{\otimes}\rightarrow[0,1] is the transition probability function defined as

P^{\otimes}(s^{\otimes\prime}|s^{\otimes},a)=
\begin{cases}
P(s^{\prime}|s,a) & \text{if}\ (\bar{x},L((s,a,s^{\prime})),\bar{x}^{\prime})\in\bar{\delta},\ a\in\mathcal{A}(s),\\
1 & \text{if}\ s=s^{\prime},\ (\bar{x},\varepsilon,\bar{x}^{\prime})\in\bar{\delta},\ a=\varepsilon_{\bar{x}^{\prime}},\\
0 & \text{otherwise},
\end{cases}

where s=(s,(x,v))s^{\otimes}=(s,(x,v)) and s=(s,(x,v))s^{\otimes\prime}=(s^{\prime},(x^{\prime},v^{\prime})); δ={(s,a,s)S×A×S;P(s|s,a)>0}\delta^{\otimes}=\{(s^{\otimes},a,s^{\otimes\prime})\in S^{\otimes}\times A^{\otimes}\times S^{\otimes};P^{\otimes}(s^{\otimes\prime}|s^{\otimes},a)>0\} is the set of transitions; and ={F¯1,,F¯n}{\mathcal{F}}^{\otimes}=\{\bar{F}^{\otimes}_{1},\ldots,\bar{F}^{\otimes}_{n}\} is the acceptance condition, where F¯i={((s,x¯),a,(s,x¯))δ;(x¯,L(s,a,s),x¯)F¯i}\bar{F}^{\otimes}_{i}=\{((s,\bar{x}),a,(s^{\prime},\bar{x}^{\prime}))\in\delta^{\otimes};(\bar{x},L(s,a,s^{\prime}),\bar{x}^{\prime})\in\bar{F}_{i}\} for each i{1,,n}i\in\{1,\ldots,n\}.
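As a sketch of Definition 9, the product transition probability can be computed on the fly from the MDP and the augmented transition relation. Here the ε-transitions of the augmented automaton are assumed to carry the symbol "eps" in delta_bar, the corresponding product actions are encoded as tuples ("eps", x_bar'), and the MDP interface (actions, prob, label) follows the earlier sketch; all of these names are our own assumptions.

```python
def product_prob(mdp, delta_bar, s_prod, a, s_prod_next):
    """P_otimes(s_prod_next | s_prod, a) of Definition 9 (illustrative sketch)."""
    (s, x_bar), (s2, x_bar2) = s_prod, s_prod_next
    if isinstance(a, tuple) and a[0] == "eps":
        # epsilon action eps_{x_bar2}: the MDP state stays put, the automaton jumps.
        ok = s2 == s and a[1] == x_bar2 and (x_bar, "eps", x_bar2) in delta_bar
        return 1.0 if ok else 0.0
    if a in mdp.actions(s):
        sigma = mdp.label.get((s, a, s2), frozenset())
        if (x_bar, sigma, x_bar2) in delta_bar:
            return mdp.prob(s, a, s2)
    return 0.0
```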

Definition 10

The reward function :S×A×S0\mathcal{R}:S^{\otimes}\times A^{\otimes}\times S^{\otimes}\rightarrow{\mathbb{R}}_{\geq 0} is defined as

\mathcal{R}(s^{\otimes},a,s^{\otimes\prime})=
\begin{cases}
r_{p} & \text{if}\ \exists i\in\{1,\ldots,n\},\ (s^{\otimes},a,s^{\otimes\prime})\in\bar{F}^{\otimes}_{i},\\
0 & \text{otherwise},
\end{cases}

where rpr_{p} is a positive value.
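The reward of Definition 10 is a membership test against the accepting sets of the product MDP. A minimal sketch, assuming the accepting sets are available as sets of product transitions F_bar_prod and using the example's value r_p = 2 as a default:

```python
def accepting_reward(transition, F_bar_prod, r_p=2.0):
    """R(s_prod, a, s_prod') = r_p iff the product transition is in some F_bar_i."""
    return r_p if any(transition in F_i for F_i in F_bar_prod) else 0.0
```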

Remark 1

When constructing a tBA from a tGBA, the order of visits to the accepting sets of the tGBA is fixed. Consequently, the rewards based on the acceptance condition of the tBA tend to be sparse. This sparsity is critical in RL-based controller synthesis problems. The augmentation of the tGBA relaxes the sparsity since the augmented tGBA admits any order of visits to the accepting sets of the original tGBA. For the acceptance condition \mathcal{F} of the tGBA, the size of the state space of the augmented tGBA is about 2||1||\frac{2^{|\mathcal{F}|}-1}{|\mathcal{F}|} times larger than that of the tBA constructed from the tGBA; e.g., the factor is 1.5 when the tGBA has two accepting sets. However, the ratio of accepting transitions to all transitions in the augmented tGBA is much greater than in the tBA. Therefore, our proposed method is expected to outperform the method using the tLDBA in terms of sample efficiency.

In the following, we show that a positional policy positively satisfying φ\varphi on the product MDP MM^{\otimes} is synthesized by using the reward function \mathcal{R} in Definition 10, which is based on the acceptance condition of MM^{\otimes}.

For a Markov chain MCπMC^{\otimes}_{\pi} induced by a product MDP MM^{\otimes} with a positional policy π\pi, let Sπ=TπRπ1RπhS^{\otimes}_{\pi}=T^{\otimes}_{\pi}\cup R^{\otimes 1}_{\pi}\cup\ldots\cup R^{\otimes h}_{\pi} be the set of states in MCπMC^{\otimes}_{\pi}, where TπT^{\otimes}_{\pi} is the set of transient states and RπiR^{\otimes i}_{\pi} is the recurrent class for each i{1,,h}i\in\{1,\ldots,h\}, and let R(MCπ)R(MC^{\otimes}_{\pi}) be the set of all recurrent states in MCπMC^{\otimes}_{\pi}. Let δπi\delta^{\otimes i}_{\pi} be the set of transitions in a recurrent class RπiR^{\otimes i}_{\pi}, namely δπi={(s,a,s)δ;sRπi,P(s|s,a)>0}\delta^{\otimes i}_{\pi}=\{(s^{\otimes},a,s^{\otimes\prime})\in\delta^{\otimes};s^{\otimes}\in R^{\otimes i}_{\pi},\ P^{\otimes}(s^{\otimes\prime}|s^{\otimes},a)>0\}, and let PπP^{\otimes}_{\pi} : Sπ×Sπ[0,1]S^{\otimes}_{\pi}\times S^{\otimes}_{\pi}\rightarrow[0,1] be the transition probability function under π\pi.

Lemma 1

For any policy π\pi and any recurrent class RπiR^{\otimes i}_{\pi} in the Markov chain MCπMC^{\otimes}_{\pi}, one of the following conditions holds.

  1.

    δπiF¯j\delta^{\otimes i}_{\pi}\cap\bar{F}^{\otimes}_{j}\neq\emptyset\ , j{1,,n}\forall j\in\{1,\ldots,n\},

  2.

    δπiF¯j=\delta^{\otimes i}_{\pi}\cap\bar{F}^{\otimes}_{j}=\emptyset\ , j{1,,n}\forall j\in\{1,\ldots,n\}.

Proof:

Suppose that MCπMC^{\otimes}_{\pi} satisfies neither condition 1 nor condition 2. Then, there exist i{1,,h}i\in\{1,\ldots,h\} and j1j_{1}, j2j_{2} {1,,n}\in\{1,\ldots,n\} such that δπiF¯j1=\delta^{\otimes i}_{\pi}\cap\bar{F}^{\otimes}_{j_{1}}=\emptyset and δπiF¯j2\delta^{\otimes i}_{\pi}\cap\bar{F}^{\otimes}_{j_{2}}\neq\emptyset. In other words, there exists a nonempty and proper subset J2{1,,n}{{1,,n},}J\in 2^{\{1,\ldots,n\}}\setminus\{\{1,\ldots,n\},\emptyset\} such that δπiF¯j\delta^{\otimes i}_{\pi}\cap\bar{F}^{\otimes}_{j}\neq\emptyset for any jJj\in J. For any transition (s,a,s)δπiF¯j(s,a,s^{\prime})\in\delta^{\otimes i}_{\pi}\cap\bar{F}^{\otimes}_{j}, the following equation holds by the properties of the recurrent states in MCπMC^{\otimes}_{\pi} [13].

\sum_{k=0}^{\infty}p^{k}((s,a,s^{\prime}),(s,a,s^{\prime}))=\infty, \quad (3)

where pk((s,a,s),(s,a,s))p^{k}((s,a,s^{\prime}),(s,a,s^{\prime})) is the probability that the transition (s,a,s)(s,a,s^{\prime}) occurs again kk time steps after it occurs. Eq. (3) means that each transition in RπiR^{\otimes i}_{\pi} occurs infinitely often. However, the memory vector vv is never reset in RπiR^{\otimes i}_{\pi} by the assumption. Hence, once a transition in δπiF¯j2\delta^{\otimes i}_{\pi}\cap\bar{F}^{\otimes}_{j_{2}} occurs, the j2j_{2}-th entry of the memory vector remains 1 forever, so this transition, which requires that entry to be 0 at its source state, can never reoccur. This directly contradicts Eq. (3). ∎

Lemma 1 implies that, for an LTL formula φ\varphi, if a policy π\pi does not satisfy φ\varphi, then the agent obtains no reward in recurrent classes; otherwise there exists at least one recurrent class where the agent obtains rewards infinitely often.

Theorem 1

For a product MDP MM^{\otimes} of an MDP MM and an augmented tLDGBA B¯φ\bar{B}_{\varphi} corresponding to a given LTL formula φ\varphi and the reward function given by Definition 10, if there exists a positional policy positively satisfying φ\varphi on MM^{\otimes}, then there exists a discount factor γ\gamma^{\ast} such that any algorithm that maximizes the expected discounted reward with γ>γ\gamma>\gamma^{\ast} will find a positional policy positively satisfying φ\varphi on MM^{\otimes}.

Proof:

Suppose that π\pi^{\ast} is an optimal policy but does not positively satisfy the LTL formula φ\varphi. Then, for any recurrent class RπiR^{\otimes i}_{{\pi}^{\ast}} in the Markov chain MCπMC^{\otimes}_{{\pi}^{\ast}} and any accepting set F¯j\bar{F}^{\otimes}_{j} of the product MDP MM^{\otimes}, δπiF¯j=\delta^{\otimes i}_{\pi^{\ast}}\cap\bar{F}^{\otimes}_{j}=\emptyset holds by Lemma 1. Thus, the agent under the policy π\pi^{\ast} can obtain rewards only in the set of transient states. We consider the best case under this assumption. Let pk(s,s)p^{k}(s,s^{\prime}) be the probability of going to a state ss^{\prime} in kk time steps after leaving the state ss, and let Post(Tπ)Post(T^{\otimes}_{\pi}) be the set of states in recurrent classes that can be reached from states in TπT^{\otimes}_{\pi} by one action.

V^{\pi^{\ast}}(s_{init})=\sum_{k=0}^{\infty}\sum_{s\in T^{\otimes}_{\pi^{\ast}}}\gamma^{k}p^{k}(s_{init},s)\sum_{s^{\prime}\in T^{\otimes}_{\pi^{\ast}}\cup Post(T^{\otimes}_{\pi^{\ast}})}P^{\otimes}_{\pi^{\ast}}(s^{\prime}|s)\mathcal{R}(s,a,s^{\prime})
\leq r_{p}\sum_{k=0}^{\infty}\sum_{s\in T^{\otimes}_{\pi^{\ast}}}\gamma^{k}p^{k}(s_{init},s),

where the action aa is selected by π\pi^{\ast}. By the property of the transient states, for any state ss^{\otimes} in TπT^{\otimes}_{\pi^{\ast}}, there exists a bounded positive value mm such that k=0γkpk(sinit,s)k=0pk(sinit,s)<m\sum_{k=0}^{\infty}\gamma^{k}p^{k}(s_{init},s)\leq\sum_{k=0}^{\infty}p^{k}(s_{init},s)<m [13]. Therefore, there exists a bounded positive value m¯\bar{m} such that Vπ(sinit)<m¯V^{\pi^{\ast}}(s_{init})<\bar{m}. Let π¯\bar{\pi} be a positional policy satisfying φ\varphi. We consider the following two cases.

  1.

    Assume that the initial state sinits_{init} is in a recurrent class Rπ¯iR^{\otimes i}_{\bar{\pi}} for some i{1,,h}i\in\{1,\ldots,h\}. For any accepting set F¯j\bar{F}^{\otimes}_{j}, δπ¯iF¯j\delta^{\otimes i}_{\bar{\pi}}\cap\bar{F}^{\otimes}_{j}\neq\emptyset holds by the definition of π¯\bar{\pi}. The expected discounted reward for sinits_{init} is given by

    V^{\bar{\pi}}(s_{init})=\sum_{k=0}^{\infty}\sum_{s\in R^{\otimes i}_{\bar{\pi}}}\gamma^{k}p^{k}(s_{init},s)\sum_{s^{\prime}\in R^{\otimes i}_{\bar{\pi}}}P^{\otimes}_{\bar{\pi}}(s^{\prime}|s)\mathcal{R}(s,a,s^{\prime}),

    where the action aa is selected by π¯\bar{\pi}. Since sinits_{init} is in Rπ¯iR^{\otimes i}_{\bar{\pi}}, there exists a positive number k¯=min{k;kn,pk(sinit,sinit)>0}\bar{k}=\min\{k\ ;\ k\geq n,p^{k}(s_{init},s_{init})>0\} [13]. We consider the worst scenario in this case. It holds that

    V^{\bar{\pi}}(s_{init})\geq\sum_{k=n}^{\infty}p^{k}(s_{init},s_{init})\sum_{i=1}^{n}\gamma^{k-i}r_{p}
    \geq\sum_{k=1}^{\infty}p^{k\bar{k}}(s_{init},s_{init})\sum_{i=0}^{n-1}\gamma^{k\bar{k}-i}r_{p}
    >r_{p}\sum_{k=1}^{\infty}\gamma^{k\bar{k}}p^{k\bar{k}}(s_{init},s_{init}),

    where all states in R(MCπ¯)R(MC^{\otimes}_{\bar{\pi}}) are positive recurrent because |S|<|S^{\otimes}|<\infty [14]. Obviously, pkk¯(sinit,sinit)(pk¯(sinit,sinit))k>0p^{k\bar{k}}(s_{init},s_{init})\geq(p^{\bar{k}}(s_{init},s_{init}))^{k}>0 holds for any k(0,)k\in(0,\infty) by the Chapman-Kolmogorov equation [13]. Furthermore, we have limkpkk¯(sinit,sinit)>0\lim_{k\rightarrow\infty}p^{k\bar{k}}(s_{init},s_{init})>0 by the property of irreducibility and positive recurrence [15]. Hence, there exists p¯\bar{p} such that 0<p¯<pkk¯(sinit,sinit)0<\bar{p}<p^{k\bar{k}}(s_{init},s_{init}) for any k(0,]k\in(0,\infty] and we have

    V^{\bar{\pi}}(s_{init})>r_{p}\bar{p}\frac{\gamma^{\bar{k}}}{1-\gamma^{\bar{k}}}.

    Therefore, for any rp<r_{p}<\infty and any m¯(Vπ(sinit),)\bar{m}\in(V^{\pi^{\ast}}(s_{init}),\infty), there exists γ<1\gamma^{\ast}<1 such that γ>γ\gamma>\gamma^{\ast} implies Vπ¯(sinit)>rpp¯γk¯1γk¯>m¯.V^{\bar{\pi}}(s_{init})>r_{p}\bar{p}\frac{\gamma^{\bar{k}}}{1-\gamma^{\bar{k}}}>\bar{m}.

  2.

    Assume that the initial state sinits_{init} is in the set of transient states Tπ¯T_{\bar{\pi}}^{\otimes}. Pπ¯M(sinitφ)>0P^{M^{\otimes}}_{\bar{\pi}}(s_{init}\models\varphi)>0 holds by the definition of π¯\bar{\pi}. For a recurrent class Rπ¯iR^{\otimes i}_{\bar{\pi}} such that δπ¯iF¯j\delta^{\otimes i}_{\bar{\pi}}\cap\bar{F}^{\otimes}_{j}\neq\emptyset for each accepting set F¯j\bar{F}^{\otimes}_{j}, there exist a number l¯>0\bar{l}>0, a state s^\hat{s} in Post(Tπ¯)Rπ¯iPost(T^{\otimes}_{\bar{\pi}})\cap R^{\otimes i}_{\bar{\pi}}, and a subset of transient states {s1,,sl¯1}Tπ¯\{s_{1},\ldots,s_{\bar{l}-1}\}\subset T^{\otimes}_{\bar{\pi}} such that p(sinit,s1)>0p(s_{init},s_{1})>0, p(si,si+1)>0p(s_{i},s_{i+1})>0 for i{1,,i\in\{1,\ldots, l¯2}\bar{l}-2\}, and p(sl¯1,s^)>0p(s_{\bar{l}-1},\hat{s})>0 by the property of transient states. Hence, it holds that pl¯(sinit,s^)>0p^{\bar{l}}(s_{init},\hat{s})>0 for the state s^\hat{s}. Thus, by ignoring rewards in Tπ¯T^{\otimes}_{\bar{\pi}}, we have

    V^{\bar{\pi}}(s_{init})\geq\gamma^{\bar{l}}p^{\bar{l}}(s_{init},\hat{s})\sum_{k=0}^{\infty}\sum_{s^{\prime}\in R^{\otimes i}_{\bar{\pi}}}\gamma^{k}p^{k}(\hat{s},s^{\prime})\sum_{s^{\prime\prime}\in R^{\otimes i}_{\bar{\pi}}}P^{\otimes}_{\bar{\pi}}(s^{\prime\prime}|s^{\prime})\mathcal{R}(s^{\prime},a,s^{\prime\prime})
    >\gamma^{\bar{l}}p^{\bar{l}}(s_{init},\hat{s})\,r_{p}\bar{p}\frac{\gamma^{\bar{k}^{\prime}}}{1-\gamma^{\bar{k}^{\prime}}},

    where k¯n\bar{k}^{\prime}\geq n is a constant and 0<p¯<pkk¯(s^,s^)0<\bar{p}<p^{k\bar{k}^{\prime}}(\hat{s},\hat{s}) for any k(0,]k\in(0,\infty]. Therefore, for any rp<r_{p}<\infty and any m¯(Vπ(sinit),)\bar{m}\in(V^{\pi^{\ast}}(s_{init}),\infty), there exists γ<1\gamma^{\ast}<1 such that γ>γ\gamma>\gamma^{\ast} implies Vπ¯(sinit)>γl¯pl¯(sinit,s^)rpp¯γk¯1γk¯>m¯V^{\bar{\pi}}(s_{init})>\gamma^{\bar{l}}p^{\bar{l}}(s_{init},\hat{s})\frac{r_{p}\bar{p}\gamma^{\bar{k}^{\prime}}}{1-\gamma^{\bar{k}^{\prime}}}>\bar{m}.

The results contradict the optimality assumption of π\pi^{\ast}. ∎

Theorem 1 implies that, for the product MDP MM^{\otimes} of an MDP MM and an augmented tLDGBA corresponding to a given LTL formula φ\varphi, we obtain a positional policy positively satisfying φ\varphi on MM^{\otimes} by an algorithm maximizing the expected discounted reward with a discount factor sufficiently close to 1 if there exists a positional policy on MM^{\otimes} satisfying φ\varphi.

IV Example

Figure 3: The environment consisting of eight rooms and one corridor. Red arcs are the transitions that we want to occur infinitely often, while blue arcs are the transitions that we never want to occur. s7s_{7} is the initial state.

In this section, we apply the proposed method to a path planning problem of a robot in an environment consisting of eight rooms and one corridor, as shown in Fig. 3. The state s7s_{7} is the initial state, and the action space is specified by 𝒜(s)={Right,Left,Up,Down}\mathcal{A}(s)=\{Right,Left,Up,Down\} for any state ss4s\neq s_{4} and 𝒜(s4)={to_s0,to_s1,to_s2,to_s3,to_s5,to_s6,to_s7,to_s8}\mathcal{A}(s_{4})=\{to\_s_{0},to\_s_{1},to\_s_{2},to\_s_{3},to\_s_{5},to\_s_{6},to\_s_{7},to\_s_{8}\}, where to_sito\_s_{i} means attempting to go to the state sis_{i} for i{0,1,2,3,5,6,7,8}i\in\{0,1,2,3,5,6,7,8\}. If the robot is in the state s4s_{4}, it moves in the intended direction with probability 0.9 and stays in the same state with probability 0.1. In the states other than s4s_{4}, it moves in the intended direction with probability 0.9 and in the opposite direction with probability 0.1. If the robot tries to go outside the environment, it stays in the same state. The labeling function is as follows.

L((s,act,s^{\prime}))=
\begin{cases}
\{c\} & \text{if }s^{\prime}=s_{i},\ i\in\{2,3,5,6\},\\
\{a\} & \text{if }(s,act,s^{\prime})=(s_{4},to\_s_{0},s_{0}),\\
\{b\} & \text{if }(s,act,s^{\prime})=(s_{4},to\_s_{8},s_{8}),\\
\emptyset & \text{otherwise}.
\end{cases}

In the example, the robot tries to take two transitions that we want to occur infinitely often, represented by arcs labeled by {aa} and {bb}, while avoiding unsafe transitions represented by the arcs labeled by {c}. This is formally specified by the LTL formula given by Eq. (2). The LTL formula requires the robot to keep on entering the two rooms s0s_{0} and s8s_{8} from the corridor s4s_{4} regardless of the order of entries, while avoiding entering the four rooms s2s_{2}, s3s_{3}, s5s_{5}, and s6s_{6}. The tLDGBA BφB_{\varphi} and its augmented automaton B¯φ\bar{B}_{\varphi} are shown in Figs. 1 and 2, respectively.
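The fully specified parts of this setup, namely the labeling function above and the stochastic dynamics of the corridor s4, can be encoded as follows (a sketch in our own notation; the room adjacency of Fig. 3 is omitted because it is only given graphically).

```python
UNSAFE = {"s2", "s3", "s5", "s6"}   # entering these rooms emits the label {c}

def label(s, act, s_next):
    """Labeling function L of the example in Section IV."""
    if s_next in UNSAFE:
        return frozenset({"c"})
    if (s, act, s_next) == ("s4", "to_s0", "s0"):
        return frozenset({"a"})
    if (s, act, s_next) == ("s4", "to_s8", "s8"):
        return frozenset({"b"})
    return frozenset()

def corridor_dynamics(act):
    """From the corridor s4, action to_si reaches s_i w.p. 0.9 and stays at s4 w.p. 0.1."""
    target = act.replace("to_", "")
    return {target: 0.9, "s4": 0.1}
```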

Through the above scenario, we compare our approach with 1) a case where we first convert the tLDGBA into a tLDBA, for which the augmentation makes no change, so that the reward function in Definition 10 is based on a single accepting set; and 2) the method using a reward function based on the accepting frontier function [9, 10]. For the three methods, we use Q-learning (we employ Q-learning here, but any algorithm that maximizes the expected discounted reward can be applied to our proposed method) with an epsilon-greedy policy. The epsilon-greedy parameter is given by 0.95nt(s)\frac{0.95}{n_{t}(s^{\otimes})}, where nt(s)n_{t}(s^{\otimes}) is the number of visits to the state ss^{\otimes} within tt time steps [16], so that it gradually decreases to 0 and an optimal policy is learned asymptotically. We set the positive reward rp=2r_{p}=2 and the discount factor γ=0.95\gamma=0.95. The learning rate α\alpha varies in accordance with the Robbins-Monro condition. We train the agent for 1000 episodes of 10000 iterations each, over 100 learning sessions.
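A sketch of the resulting learning loop: tabular Q-learning on the product MDP with the visit-count-based epsilon-greedy exploration 0.95/n_t(s⊗) and a Robbins-Monro step size. The simulator interface (prod_actions, prod_step), the reward taken as a function of a product transition, and the per-pair 1/N learning-rate schedule are our own assumptions; the hyperparameters follow the text.

```python
import random
from collections import defaultdict

def q_learning(prod_init, prod_actions, prod_step, reward,
               gamma=0.95, episodes=1000, steps_per_episode=10000):
    """Tabular Q-learning on the product MDP (illustrative sketch)."""
    Q = defaultdict(float)        # Q[(s_prod, action)]
    visits = defaultdict(int)     # n_t(s_prod), used for epsilon-greedy exploration
    updates = defaultdict(int)    # per state-action counts for the learning rate
    for _ in range(episodes):
        s = prod_init
        for _ in range(steps_per_episode):
            visits[s] += 1
            acts = prod_actions(s)
            if random.random() < 0.95 / visits[s]:          # explore
                a = random.choice(acts)
            else:                                           # exploit
                a = max(acts, key=lambda b: Q[(s, b)])
            s_next = prod_step(s, a)
            r = reward((s, a, s_next))
            updates[(s, a)] += 1
            alpha = 1.0 / updates[(s, a)]                   # Robbins-Monro step size
            target = r + gamma * max(Q[(s_next, b)] for b in prod_actions(s_next))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```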

Figure 4: The mean of the average reward in each episode over 100 learning sessions obtained with our proposed method (left) and with the method using the tLDBA (right). The means are plotted every 50 episodes, and the green areas represent the range of standard deviations.
Figure 5: The optimal policies obtained with our proposed method (left) and with the method in [9, 10] (right).

Results

1) Fig. 4 shows the average rewards obtained by our proposed method and by the case using a tLDBA BφB^{\prime}_{\varphi} converted from φ\varphi, respectively. Both methods eventually acquire an optimal policy satisfying φ\varphi. As shown in Fig. 4, however, our proposed method converges faster. This is because, with the tLDBA, the order of entries to the rooms s0s_{0} and s8s_{8} is fixed in advance. Moreover, the number of transitions with a positive reward in B¯φ\bar{B}_{\varphi} is larger than that in BφB_{\varphi}^{\prime}.

2) We use the accepting frontier function Acc:δ×2δ2δAcc:\delta\times 2^{\delta}\rightarrow 2^{\delta} [9, 10] for the tLDGBA. Initializing a set of transitions 𝔽\mathbb{F} with the set of all accepting transitions in BφB_{\varphi}, the function receives the transition (x,σ,x)(x,\sigma,x^{\prime}) that occurs and the set 𝔽\mathbb{F}. If (x,σ,x)(x,\sigma,x^{\prime}) is in 𝔽\mathbb{F}, then AccAcc removes the accepting sets containing (x,σ,x)(x,\sigma,x^{\prime}) from 𝔽\mathbb{F}. For the product MDP of the MDP MM and the tLDGBA BφB_{\varphi}, the reward function is based on the removed sets of BφB_{\varphi}. Then, we synthesize a positional policy on the product MDP derived from the tLDGBA BφB_{\varphi}. A sketch of this baseline is given below.
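The following is one way to read that description in code; representing the frontier as the list of accepting sets not yet visited, giving a reward whenever the frontier shrinks, and reinitializing it once it becomes empty are our assumptions about [9, 10] rather than details stated in this letter.

```python
def accepting_frontier(transition, frontier, accepting_sets):
    """Acc of [9, 10] (sketch): remove every accepting set containing the
    transition that occurred; a reward is given whenever something is removed."""
    hit = [F_j for F_j in frontier if transition in F_j]
    if not hit:
        return frontier, False          # no accepting set visited, no reward
    new_frontier = [F_j for F_j in frontier if transition not in F_j]
    if not new_frontier:                # all accepting sets visited: reinitialize
        new_frontier = list(accepting_sets)
    return new_frontier, True
```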

Fig. 5 shows the optimal policies obtained by our proposed method and by the method in [9, 10], respectively. The policy obtained by the method with the accepting frontier function fails to satisfy the LTL specification (we obtain the same result even with a state-based LDGBA) because it is impossible for the transitions labeled with {a}\{a\} and {b}\{b\} to both occur from s4s_{4} infinitely often under any positional policy on the product MDP with BφB_{\varphi} shown in Fig. 1. More specifically, as shown in Fig. 6, the agent cannot take both accepting transitions colored in red by any deterministic stationary action selection at the product state (s4,x0)(s_{4},x_{0}). In our proposed method, the augmented tLDGBA explicitly includes the information carried by the (path-dependent) accepting frontier set as memory vectors, which enables us to synthesize a positional policy satisfying φ\varphi on the product MDP.

Figure 6: A part of the product MDP of the MDP shown in Fig. 3 and the tLDGBA shown in Fig. 1, where each transition is labeled with either an action or the transition probability.

V Conclusion

This letter proposed a novel RL-based method for the synthesis of a control policy for an LTL specification using an augmented tLDGBA. The proposed method improved the learning performance compared to existing methods. Investigating a method that maximizes the satisfaction probability is left for future work.

References

  • [1] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed.   MIT press, 2018.
  • [2] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming.   John Wiley & Sons, 1994.
  • [3] C. Baier and J.-P. Katoen, Principles of Model Checking.   MIT Press, 2008.
  • [4] C. Belta, B. Yordanov, and E. Aydin Gol, Formal Methods for Discrete-Time Dynamical Systems.   Springer International Publishing, 2017.
  • [5] D. Sadigh, E. S. Kim, S. Coogan, S. S. Sastry, and S. A. Seshia, “A learning based approach to control synthesis of Markov decision processes for linear temporal logic specifications,” in 53rd IEEE Conference on Decision and Control, 2014, pp. 1091–1096.
  • [6] Q. Gao, D. Hajinezhad, Y. Zhang, Y. Kantaros, and M. M. Zavlanos, “Reduced variance deep reinforcement learning with temporal logic specifications,” in Proceedings of the 10th ACM/IEEE International Conference on Cyber-Physical Systems, 2019, pp. 237–248.
  • [7] X. Li, Z. Serlin, G. Yang, and C. Belta, “A formal methods approach to interpretable reinforcement learning for robotic planning,” Science Robotics, vol. 4, no. 37, pp. 1–16, 2019.
  • [8] S. Sickert, J. Esparza, S. Jaax, and J. Křetínskỳ, “Limit-deterministic Büchi automata for linear temporal logic,” in International Conference on Computer Aided Verification.   Springer, 2016, pp. 312–332.
  • [9] M. Hasanbeig, A. Abate, and D. Kroening, “Logically-constrained reinforcement learning,” arXiv preprint arXiv:1801.08099, 2018.
  • [10] M. Hasanbeig, Y. Kantaros, A. Abate, D. Kroening, G. J. Pappas, and I. Lee, “Reinforcement learning for temporal logic control synthesis with probabilistic satisfaction guarantees,” arXiv preprint arXiv:1909.05304, 2019.
  • [11] E. M. Hahn, M. Perez, S. Schewe, F. Somenzi, A. Trivedi, and D. Wojtczak, “Omega-regular objectives in model-free reinforcement learning,” in International Conference on Tools and Algorithms for the Construction and Analysis of Systems.   Springer, 2019, pp. 395–412.
  • [12] A. K. Bozkurt, Y. Wang, M. M. Zavlanos, and M. Pajic, “Control synthesis from linear temporal logic specifications using model-free reinforcement learning,” arXiv preprint arXiv:1909.07299, 2019.
  • [13] R. Durrett, Essentials of Stochastic Processes, 2nd ed.   Springer, 2012.
  • [14] L. Breuer, “Introduction to stochastic processes,” [Online]. Available: https://www.kent.ac.uk/smsas/personal/lb209/files/sp07.pdf.
  • [15] S. M. Ross, Stochastic Processes, 2nd ed.   Wiley New York, 1996.
  • [16] S. Singh, T. Jaakkola, M. L. Littman, and C. Szepesvári, “Convergence results for single-step on-policy reinforcement-learning algorithms,” Machine learning, vol. 38, no. 3, pp. 287–308, 2000.