
Sample-Efficient Reinforcement Learning with Temporal Logic Objectives: Leveraging the Task Specification to Guide Exploration

Yiannis Kantaros and Jun Wang

Yiannis Kantaros (ioannisk@wustl.edu) and Jun Wang (junw@wustl.edu) are with the Department of Electrical and Systems Engineering, Washington University in St. Louis, St. Louis, MO, 63130, USA. This work was supported in part by the NSF award CNS #2231257.
Abstract

This paper addresses the problem of learning optimal control policies for systems with uncertain dynamics and high-level control objectives specified as Linear Temporal Logic (LTL) formulas. Uncertainty is considered in the workspace structure and the outcomes of control decisions, giving rise to an unknown Markov Decision Process (MDP). Existing reinforcement learning (RL) algorithms for LTL tasks typically rely on exploring a product MDP state-space uniformly (using, e.g., an $\epsilon$-greedy policy), compromising sample efficiency. This issue becomes more pronounced as the rewards get sparser and the MDP size or the task complexity increases. In this paper, we propose an accelerated RL algorithm that can learn control policies significantly faster than competitive approaches. Its sample efficiency relies on a novel task-driven exploration strategy that biases exploration towards directions that may contribute to task satisfaction. We provide theoretical analysis and extensive comparative experiments demonstrating the sample efficiency of the proposed method. The benefit of our method becomes more evident as the task complexity or the MDP size increases.

Index Terms:
Reinforcement Learning, Temporal Logic Planning, Stochastic Systems

I Introduction

Reinforcement learning (RL) has been successfully applied to synthesize control policies for systems with highly nonlinear, stochastic, or unknown dynamics and complex tasks [1]. Typically, in RL, control objectives are specified as reward functions. However, specifying reward-based objectives can be highly non-intuitive, especially for complex tasks, while poorly designed rewards can significantly compromise system performance [2]. To address this challenge, Linear Temporal Logic (LTL) has recently been employed to specify tasks that would be very hard to define using Markovian rewards [3]; consider, e.g., a navigation task that requires visiting regions of interest in a specific order.

Several model-free RL methods with LTL-encoded tasks have been proposed recently; see, e.g., [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]. Common to the majority of these works is that they randomly explore a product state space that grows exponentially as the size of the MDP and/or the complexity of the assigned temporal logic task increases. This results in sample inefficiency and a slow training/learning process. The issue is exacerbated by the fact that LTL specifications are converted into sparse rewards in order to synthesize control policies with probabilistic satisfaction guarantees [9, 14, 16].

Sample inefficiency is a well-known limitation in RL, whether control objectives are specified directly as reward functions or via LTL. To address this limitation, reward engineering approaches that augment the reward signal have been proposed [17, 18, 19, 20, 21, 22, 23]. Such methods often require a user to manually decompose the global task into sub-tasks, followed by assigning additional rewards to these intermediate sub-tasks. Nevertheless, this may result in control policies that are sub-optimal with respect to the original task [24], while their efficiency highly depends on the task decomposition (i.e., the density of the rewards) [25]. Also, augmenting the reward signal for temporal logic tasks may compromise the probabilistic correctness of the synthesized controllers [9]. To alleviate these limitations, intelligent exploration strategies have been proposed, such as Boltzmann/softmax [26, 27] and upper confidence bound (UCB) [28], which do not require knowledge or modification of the rewards; a recent survey is available in [29]. Their sample efficiency relies on guiding exploration using a continuously learned value function (e.g., Boltzmann), which, however, can be inaccurate in early training episodes, or on how many times a state-action pair has been visited (e.g., UCB), which might not always guide exploration towards directions contributing to task satisfaction.

Another approach to enhance sample-efficiency is through model-based methods [30, 31]. These works continuously learn an unknown Markov Decision Process (MDP), modeling the system, that is composed with automaton representations of LTL tasks. This gives rise to a product MDP (PMDP). Then, approximately optimal policies are constructed for the PMDP in a finite number of iterations. However, saving the associated data structures for the PMDP results in excessive memory requirements. Also, the quality of the generated policy critically depends on the accuracy of the learned PMDP. Finally, model-based methods require the computation of accepting maximum end components (AMECs) of PMDPs that has a quadratic time complexity in the PMDP size. This computation is avoided in related model-free methods; see e.g., [6].

In this paper, we propose a novel approach to enhance the sample efficiency of model-free RL methods. Unlike the aforementioned works, the key idea is to leverage the (known) task specification in order to extract promising exploration directions that contribute to mission progress. We consider robots modeled as unknown MDPs with discrete state and action spaces, capturing uncertainty in the workspace and in the outcome of control decisions, and high-level LTL-encoded control objectives. The proposed algorithm relies on the following three steps. First, the LTL formula is converted into a Deterministic Rabin Automaton (DRA). Second, similar to [6], the product between the MDP and the DRA is constructed on-the-fly, giving rise to a PMDP over which rewards are assigned based on the DRA acceptance condition. We note that the PMDP is not explicitly constructed/stored in our approach. The first two steps are common in related model-free algorithms. Third, a new RL method is applied over the PMDP to learn policies that maximize the expected accumulated reward capturing the satisfaction probability. The proposed RL algorithm relies on a new stochastic policy, called the $(\epsilon,\delta)$-greedy policy, that exploits the DRA representation of the LTL formula to bias exploration towards directions that may contribute to task satisfaction. Particularly, according to the proposed policy, the greedy action is selected with probability $1-\epsilon$ (exploitation phase) while exploration is triggered with probability $\epsilon$, as in the $\epsilon$-greedy policy. Unlike the $\epsilon$-greedy policy, when exploration is enabled, either a random or a biased action is selected probabilistically (determined by the $\delta$ parameters), where the latter action guides the system towards directions that will most likely result in mission progress. For instance, consider a simple scenario where a robot with uncertain/unknown dynamics is required to eventually safely reach a region of interest. In this case, intuitively, exploration in the vicinity of the shortest dynamically feasible path (which is initially unknown but continuously learned) connecting the current robot position to the desired region should be prioritized to accelerate control design. We emphasize that the proposed task-driven exploration strategy does not require knowledge or modification of the reward structure. As a result, it can be coupled with sparse rewards, as e.g., in [9, 13], resulting in probabilistically correct control policies, as well as with augmented rewards, as e.g., in [20, 25, 22], to further accelerate the learning phase.

Our approach is inspired by transfer learning algorithms that leverage external teacher policies for 'similar' tasks to bias exploration [32]. To design a biased exploration strategy in the absence of external policies, we build upon [33, 34], which propose a biased sampling-based strategy to synthesize temporal logic controllers for large-scale, but deterministic, multi-robot systems. Particularly, computation of the biased action requires (i) a distance function over the DRA state space, constructed similarly to [33, 34, 35, 36], that measures how far the system is from satisfying the assigned LTL task, and (ii) a continuously learned MDP model. The latter renders the proposed exploration strategy model-based. Thus, we would like to emphasize the following key differences with respect to the related model-based RL methods discussed earlier. First, unlike existing model-based algorithms, the proposed method does not learn/store the PMDP model to compute the optimal policy. Instead, it learns only the MDP modeling the system, making it more memory efficient. Second, the quality of the learned policy is not contingent on the quality of the learned MDP model, distinguishing it from model-based methods. This is because our approach utilizes the MDP model solely for designing the biased action and, in fact, as will be discussed in Section III-C, does not even require learning all MDP transition probabilities accurately. This is also supported by our numerical experiments, where we empirically demonstrate sample efficiency of the proposed method against model inaccuracies. We provide comparative experiments demonstrating that the proposed learning algorithm outperforms, in terms of sample efficiency, model-free RL methods that employ random (e.g., [6, 8, 9]), Boltzmann, and UCB exploration. The benefit of our approach becomes more pronounced as the size of the PMDP increases. We also provide comparisons against model-based methods showing that our method, as well as model-free baselines, are more memory-efficient and, therefore, scalable to large MDPs. A preliminary version of this work was presented in [37]. We extend [37] by (i) providing theoretical results that help understand when the proposed approach is, probabilistically, more sample efficient than random exploration methods; (ii) providing more comprehensive comparative experiments that do not exist in [37]; and (iii) demonstrating how the biased sampling strategy can be extended to Limit Deterministic Büchi Automata (LDBA), which have a smaller state space than DRA and, therefore, can further expedite the learning process [38, 39, 8]. We also release software implementing our proposed algorithm, which can be found in [40].

Contribution: First, we propose a novel RL algorithm to quickly learn control policies for unknown MDPs with LTL tasks. Second, we provide conditions under which the proposed algorithm is, probabilistically, more sample-efficient than related works that rely on random exploration. Third, we show that the proposed exploration strategy can be employed for various automaton representations of LTL formulas such as DRA and LDBA. Fourth, we provide extensive comparative experiments demonstrating the sample efficiency of the proposed method compared to related works.

II Problem Definition

II-A Robot & Environment Model

Consider a robot that resides in a partitioned environment with a finite number of states. To capture uncertainty in the robot motion and the workspace, we model the interaction of the robot with the environment as a Markov Decision Process (MDP) of unknown structure, which is defined as follows.

Definition II.1 (MDP)

A Markov Decision Process (MDP) is a tuple $\mathfrak{M}=(\mathcal{X},x_{0},\mathcal{A},P,\mathcal{AP},L)$, where $\mathcal{X}$ is a finite set of states; $x_{0}\in\mathcal{X}$ is an initial state; $\mathcal{A}$ is a finite set of actions (with slight abuse of notation, $\mathcal{A}(x)$ denotes the set of available actions at state $x\in\mathcal{X}$); $P:\mathcal{X}\times\mathcal{A}\times\mathcal{X}\rightarrow[0,1]$ is the transition probability function, so that $P(x,a,x')$ is the transition probability from state $x\in\mathcal{X}$ to state $x'\in\mathcal{X}$ via control action $a\in\mathcal{A}$ and $\sum_{x'\in\mathcal{X}}P(x,a,x')=1$, for all $a\in\mathcal{A}(x)$; $\mathcal{AP}$ is a set of atomic propositions; and $L:\mathcal{X}\rightarrow 2^{\mathcal{AP}}$ is the labeling function that returns the atomic propositions satisfied at a state $x\in\mathcal{X}$.

Assumption II.2 (Fully Observable MDP)

We assume that the MDP $\mathfrak{M}$ is fully observable, i.e., at any time step $t$ the current state, denoted by $x_{t}$, and the observations $L(x_{t})\in 2^{\mathcal{AP}}$ in state $x_{t}$ are known.

Assumption II.3 (Static Environment)

We assume that the environment is static in the sense that the atomic propositions that are satisfied at an MDP state $x$ are fixed over time.

For instance, Assumption II.3 implies that obstacles and regions of interest in the environment are static. This assumption can be relaxed using probabilistically labeled MDPs as in [8].

II-B LTL-encoded Task Specification

The robot is responsible for accomplishing a task expressed as an LTL formula, such as sequencing, coverage, surveillance, data gathering, or connectivity tasks [41, 42, 43, 44, 45, 46, 47]. LTL is a formal language that comprises a set of atomic propositions $\mathcal{AP}$, the Boolean operators, i.e., conjunction $\wedge$ and negation $\neg$, and two temporal operators, next $\bigcirc$ and until $\cup$. LTL formulas over a set $\mathcal{AP}$ can be constructed based on the following grammar: $\phi::=\text{true}~|~\pi~|~\phi_{1}\wedge\phi_{2}~|~\neg\phi~|~\bigcirc\phi~|~\phi_{1}\cup\phi_{2}$, where $\pi\in\mathcal{AP}$. The other Boolean and temporal operators, e.g., always $\square$, have their standard syntax and meaning [3]. An infinite word $w$ over the alphabet $2^{\mathcal{AP}}$ is defined as an infinite sequence $w=\pi_{0}\pi_{1}\pi_{2}\dots\in(2^{\mathcal{AP}})^{\omega}$, where $\omega$ denotes infinite repetition and $\pi_{t}\in 2^{\mathcal{AP}}$, $\forall t\in\mathbb{N}$. The language $\{w\in(2^{\mathcal{AP}})^{\omega}~|~w\models\phi\}$ is defined as the set of words that satisfy the LTL formula $\phi$, where $\models\subseteq(2^{\mathcal{AP}})^{\omega}\times\phi$ is the satisfaction relation [3]. In what follows, we consider atomic propositions of the form $\pi^{i}$ that are true if the robot is in state $x_{i}\in\mathcal{X}$ and false otherwise.

II-C From LTL formulas to DRA

Any LTL formula can be translated into a Deterministic Rabin Automaton (DRA) defined as follows.

Definition II.4 (DRA [3])

A DRA over $2^{\mathcal{AP}}$ is a tuple $\mathfrak{D}=(\mathcal{Q}_{D},q_{D}^{0},\Sigma,\delta_{D},\mathcal{F})$, where $\mathcal{Q}_{D}$ is a finite set of states; $q_{D}^{0}\in\mathcal{Q}_{D}$ is the initial state; $\Sigma=2^{\mathcal{AP}}$ is the input alphabet; $\delta_{D}:\mathcal{Q}_{D}\times\Sigma\rightarrow\mathcal{Q}_{D}$ is the transition function; and $\mathcal{F}=\{(\mathcal{G}_{1},\mathcal{B}_{1}),\dots,(\mathcal{G}_{f},\mathcal{B}_{f})\}$ is a set of accepting pairs, where $\mathcal{G}_{i},\mathcal{B}_{i}\subseteq\mathcal{Q}_{D}$, $\forall i\in\{1,\dots,f\}$. $\Box$

An infinite run $\rho_{D}=q_{D}^{0}q_{D}^{1}\dots q_{D}^{t}\dots$ of $\mathfrak{D}$ over an infinite word $w=\sigma_{0}\sigma_{1}\sigma_{2}\dots$, where $\sigma_{t}\in\Sigma$, $\forall t\in\mathbb{N}$, is an infinite sequence of DRA states $q_{D}^{t}$, $\forall t\in\mathbb{N}$, such that $\delta_{D}(q_{D}^{t},\sigma_{t})=q_{D}^{t+1}$. An infinite run $\rho_{D}$ is called accepting if there exists at least one pair $(\mathcal{G}_{i},\mathcal{B}_{i})$ such that $\texttt{Inf}(\rho_{D})\cap\mathcal{G}_{i}\neq\varnothing$ and $\texttt{Inf}(\rho_{D})\cap\mathcal{B}_{i}=\varnothing$, where $\texttt{Inf}(\rho_{D})$ represents the set of states that appear in $\rho_{D}$ infinitely often; see also Ex. II.5.
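The Rabin acceptance condition above reduces to a simple set test once the set $\texttt{Inf}(\rho_{D})$ is available. The following Python sketch illustrates this check; the variable names are ours and are not taken from the paper's released code.

# Minimal sketch of the Rabin acceptance check in Def. II.4.
# `inf_states` is the set Inf(rho_D) of DRA states visited infinitely often;
# `pairs` is the list F = [(G_1, B_1), ..., (G_f, B_f)] of accepting pairs.
def is_accepting(inf_states, pairs):
    # Accepting iff some pair (G_i, B_i) has Inf ∩ G_i nonempty and Inf ∩ B_i empty.
    return any(inf_states & G_i and not (inf_states & B_i) for G_i, B_i in pairs)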

Example II.5 (DRA)

Consider the LTL formula $\phi=\Diamond(\pi^{\text{Exit1}}\vee\pi^{\text{Exit2}})$ that is true if a robot eventually reaches either Exit1 or Exit2 of a building. The corresponding DRA is illustrated in Figure 1.

Figure 1: DRA corresponding to $\phi=\Diamond(\pi^{\text{Exit1}}\vee\pi^{\text{Exit2}})$. There is only one accepting pair, defined as $\mathcal{G}_{1}=\{q_{D}^{F}\}$ and $\mathcal{B}_{1}=\{q_{D}^{0}\}$. A transition is enabled if the robot generates a symbol satisfying the Boolean formula noted on top of the transition. All transitions are feasible as per Def. III.1. The function $d_{F}$ in (3) is defined as $d_{F}(q_{D}^{0},\mathcal{F})=1$ and $d_{F}(q_{D}^{F},\mathcal{F})=0$.

II-D Product MDP

Given the MDP $\mathfrak{M}$ and the DRA $\mathfrak{D}$, we define the product MDP (PMDP) $\mathfrak{P}=\mathfrak{M}\times\mathfrak{D}$ as follows.

Definition II.6 (PMDP)

Given an MDP $\mathfrak{M}=(\mathcal{X},x_{0},\mathcal{A},P,\mathcal{AP},L)$ and a DRA $\mathfrak{D}=(\mathcal{Q}_{D},q_{D}^{0},\Sigma,\delta_{D},\mathcal{F})$, we define the product MDP (PMDP) $\mathfrak{P}=\mathfrak{M}\times\mathfrak{D}$ as $\mathfrak{P}=(\mathcal{S},s_{0},\mathcal{A}_{\mathfrak{P}},P_{\mathfrak{P}},\mathcal{F}_{\mathfrak{P}})$, where (i) $\mathcal{S}=\mathcal{X}\times\mathcal{Q}_{D}$ is the set of states, so that $s=(x,q_{D})\in\mathcal{S}$, $x\in\mathcal{X}$, and $q_{D}\in\mathcal{Q}_{D}$; (ii) $s_{0}=(x_{0},q_{D}^{0})$ is the initial state; (iii) $\mathcal{A}_{\mathfrak{P}}$ is the set of actions inherited from the MDP, so that $\mathcal{A}_{\mathfrak{P}}(s)=\mathcal{A}(x)$, where $s=(x,q_{D})$; (iv) $P_{\mathfrak{P}}:\mathcal{S}\times\mathcal{A}_{\mathfrak{P}}\times\mathcal{S}\rightarrow[0,1]$ is the transition probability function, so that $P_{\mathfrak{P}}(s,a_{P},s')=P(x,a,x')$, where $s=(x,q_{D})\in\mathcal{S}$, $s'=(x',q_{D}')\in\mathcal{S}$, $a_{P}\in\mathcal{A}_{\mathfrak{P}}(s)$, and $q_{D}'=\delta_{D}(q_{D},L(x))$; (v) $\mathcal{F}_{\mathfrak{P}}=\{(\mathcal{G}_{i}^{\mathfrak{P}},\mathcal{B}_{i}^{\mathfrak{P}})\}_{i=1}^{f}$ is the set of accepting pairs, where $\mathcal{G}_{i}^{\mathfrak{P}}=\mathcal{X}\times\mathcal{G}_{i}$ and $\mathcal{B}_{i}^{\mathfrak{P}}=\mathcal{X}\times\mathcal{B}_{i}$. $\Box$

Given any policy $\boldsymbol{\mu}:\mathcal{S}\rightarrow\mathcal{A}_{\mathfrak{P}}$ for $\mathfrak{P}$, we define an infinite run $\rho_{\mathfrak{P}}^{\boldsymbol{\mu}}$ of $\mathfrak{P}$ to be an infinite sequence of states of $\mathfrak{P}$, i.e., $\rho_{\mathfrak{P}}^{\boldsymbol{\mu}}=s_{0}s_{1}s_{2}\dots$, where $P_{\mathfrak{P}}(s_{t},\boldsymbol{\mu}(s_{t}),s_{t+1})>0$. By the acceptance condition of the DRA $\mathfrak{D}$, an infinite run $\rho_{\mathfrak{P}}^{\boldsymbol{\mu}}$ is accepting if there exists at least one pair $i\in\{1,\dots,f\}$ such that $\texttt{Inf}(\rho_{\mathfrak{P}}^{\boldsymbol{\mu}})\cap\mathcal{G}^{\mathfrak{P}}_{i}\neq\emptyset$ and $\texttt{Inf}(\rho_{\mathfrak{P}}^{\boldsymbol{\mu}})\cap\mathcal{B}^{\mathfrak{P}}_{i}=\emptyset$.
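Because the DRA component evolves deterministically from the label of the current MDP state, a product state can be advanced on-the-fly without ever storing the PMDP. A minimal Python sketch is given below, assuming generic interfaces `mdp_step`, `delta_D`, and `L` for the MDP simulator, the DRA transition function, and the labeling function (these names are ours, not the paper's code).

# On-the-fly PMDP step (cf. Def. II.6): the product is never built explicitly.
def pmdp_step(s, a, mdp_step, delta_D, L):
    """Advance the product state s = (x, q_D) by executing MDP action a."""
    x, q_D = s
    x_next = mdp_step(x, a)        # sample x' according to P(x, a, .)
    q_next = delta_D(q_D, L(x))    # deterministic DRA update q_D' = delta_D(q_D, L(x))
    return (x_next, q_next)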

II-E Problem Statement

Our goal is to compute a policy for the PMDP that maximizes the satisfaction probability $\mathbb{P}(\boldsymbol{\mu}\models\phi~|~s_{0})$ of an LTL-encoded task $\phi$. A formal definition of this probability can be found in [3, 48, 49]. To this end, we first adopt existing reward functions $R:\mathcal{S}\times\mathcal{A}_{\mathfrak{P}}\times\mathcal{S}\rightarrow\mathbb{R}$ defined based on the acceptance condition of the PMDP, as e.g., in [6]. Then, our goal is to compute a policy $\boldsymbol{\mu}^{*}$ that maximizes the expected accumulated return, i.e., $\boldsymbol{\mu}^{*}(s)=\arg\max_{\boldsymbol{\mu}\in\mathcal{D}}U^{\boldsymbol{\mu}}(s)$, where $\mathcal{D}$ is the set of all stationary deterministic policies over $\mathcal{S}$, and

$U^{\boldsymbol{\mu}}(s)=\mathbb{E}^{\boldsymbol{\mu}}\big[\textstyle\sum_{t=0}^{\infty}\gamma^{t}R(s_{t},\boldsymbol{\mu}(s_{t}),s_{t+1})~\big|~s=s_{0}\big].$ (1)

In (1), $\mathbb{E}^{\boldsymbol{\mu}}[\cdot]$ denotes the expected value given that the PMDP follows the policy $\boldsymbol{\mu}$ [50], $0\leq\gamma<1$ is the discount factor, and $s_{0},\dots,s_{t}$ is the sequence of states generated by $\boldsymbol{\mu}$ up to time $t$, initialized at $s_{0}$. Since the PMDP has a finite state/action space and $\gamma<1$, there exists a stationary deterministic optimal policy $\boldsymbol{\mu}^{*}$ [50]. The reward function $R$ and the discount factor $\gamma$ should be designed so that maximization of (1) is equivalent to maximization of the satisfaction probability. Efforts towards this direction are presented, e.g., in [6, 8], while provably correct rewards and discount factors for PMDPs constructed using LDBA, instead of DRA, are proposed in [9, 14, 16]. However, as discussed in Section I, due to the sparsity of these rewards, these methods are sample-inefficient. This is the main challenge that this paper aims to address.

Problem 1

Given (i) an MDP $\mathfrak{M}$ with unknown transition probabilities and underlying graph structure; (ii) a task specification captured by an LTL formula $\phi$; and (iii) a reward function $R$ for the PMDP motivating satisfaction of its acceptance condition, develop a sample-efficient RL algorithm that can learn a deterministic control policy $\boldsymbol{\mu}^{*}$ that maximizes (1).

III Accelerated Reinforcement Learning for Temporal Logic Control

To solve Problem 1, we propose a new reinforcement learning (RL) algorithm that can quickly synthesize control policies that maximize (1). The proposed algorithm is summarized in Algorithm 1 and described in detail in the following subsections. First, in Section III-A, we define a distance function over the DRA state space. In Sections III-B and III-C, we describe the proposed logically-guided RL algorithm for LTL control objectives. To accelerate the learning phase, the distance function defined in Section III-A is utilized to guide exploration. A discussion on how the proposed algorithm can be applied to LDBA, which typically have a smaller state space than DRA, is provided in Appendix A.

III-A Distance Function over the DRA State Space

First, the LTL task $\phi$ is converted into a DRA; see Definition II.4 [line 2, Alg. 1]. Then, we define a distance-like function over the DRA state space that measures how 'far' the robot is from accomplishing the assigned LTL task [line 3, Alg. 1]. In other words, this function returns how far any given DRA state is from the sets of accepting states $\mathcal{G}_{i}$. To define this function, we first remove from the DRA all infeasible transitions, i.e., transitions that cannot be physically enabled. To this end, we first define feasible symbols as follows [33]; see Fig. 1.

Definition III.1 (Feasible symbols σΣ\sigma\in\Sigma)

A symbol $\sigma\in\Sigma$ is feasible if and only if $\sigma\not\models b^{\text{inf}}$, where $b^{\text{inf}}$ is the Boolean formula $b^{\text{inf}}=\vee_{\forall x_{i}\in\mathcal{X}}\big(\vee_{\forall x_{e}\in\mathcal{X}\setminus\{x_{i}\}}(\pi^{x_{i}}\wedge\pi^{x_{e}})\big)$, which requires the robot to be present in more than one MDP state simultaneously. All feasible symbols $\sigma$ are collected in a set denoted by $\Sigma_{\text{feas}}\subseteq\Sigma$. $\Box$

Then, we prune the DRA by removing infeasible DRA transitions defined as follows:

Definition III.2 (Feasibility of DRA transitions)

A DRA transition from $q_{D}$ to $q_{D}'$ is feasible if there exists at least one feasible symbol $\sigma\in\Sigma_{\text{feas}}$ such that $\delta_{D}(q_{D},\sigma)=q_{D}'$; otherwise, it is infeasible. $\Box$

Next, we define a function $d:\mathcal{Q}_{D}\times\mathcal{Q}_{D}\rightarrow\mathbb{N}$ that returns the minimum number of feasible DRA transitions required to reach a state $q_{D}'\in\mathcal{Q}_{D}$ starting from a state $q_{D}\in\mathcal{Q}_{D}$. Particularly, we define this function as follows [33, 35]:

$d(q_{D},q_{D}')=\begin{cases}|SP_{q_{D},q_{D}'}|, & \text{if } SP_{q_{D},q_{D}'} \text{ exists},\\ \infty, & \text{otherwise},\end{cases}$ (2)

where $SP_{q_{D},q_{D}'}$ denotes the shortest path (in terms of hops) in the pruned DRA from $q_{D}$ to $q_{D}'$ and $|SP_{q_{D},q_{D}'}|$ stands for its cost (number of hops). Note that if $d(q_{D}^{0},q_{D})=\infty$ for all $q_{D}\in\mathcal{G}_{i}$ and for all $i\in\{1,\dots,f\}$, then the LTL formula cannot be satisfied, since in the pruning process only the DRA transitions that are impossible to enable are removed. Next, using (2), we define the following distance function (observe that, unlike [36, 51], $d_{F}(q_{D},\mathcal{F})$ may not be equal to $0$ even if $q_{D}\in\mathcal{G}_{i}$; the latter may happen if $q_{D}$ does not have a feasible self-loop):

$d_{F}(q_{D},\mathcal{F})=\min_{q_{D}^{G}\in\cup_{i\in\{1,\dots,f\}}\mathcal{G}_{i}}d(q_{D},q_{D}^{G}).$ (3)

In words, (3) measures the distance from any DRA state $q_{D}$ to the set of accepting pairs, i.e., the distance to the closest DRA state $q_{D}^{G}$ that belongs to $\cup_{i\in\{1,\dots,f\}}\mathcal{G}_{i}$; see also Fig. 1.
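Since $d$ counts hops in the pruned DRA, $d_F$ can be obtained for every state with a single breadth-first search from the accepting states on the reversed pruned automaton. The Python sketch below assumes the pruned DRA is given as an adjacency dictionary `feas_trans` and that `accepting` collects the states in $\cup_{i}\mathcal{G}_{i}$ (our own hypothetical representation); it also ignores the self-loop corner case noted in the parenthetical above.

from collections import deque

def dra_distances(feas_trans, accepting):
    """Compute d_F(q, F) in (3) for every DRA state via backward BFS.

    feas_trans[q] is the set of states reachable from q through at least one
    feasible symbol (the pruned DRA); accepting is the union of the G_i sets.
    """
    # Reverse the pruned DRA so one multi-source BFS from the accepting
    # states yields the minimum number of feasible hops to reach them.
    reverse = {q: set() for q in feas_trans}
    for q, succs in feas_trans.items():
        for q_next in succs:
            reverse.setdefault(q_next, set()).add(q)

    dist = {q: float("inf") for q in reverse}
    queue = deque()
    for q in accepting:
        dist[q] = 0          # note: the footnote's self-loop case is not handled here
        queue.append(q)
    while queue:
        q = queue.popleft()
        for q_prev in reverse.get(q, ()):
            if dist[q_prev] == float("inf"):
                dist[q_prev] = dist[q] + 1
                queue.append(q_prev)
    return dist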

Algorithm 1 Accelerated RL for LTL Control Objectives
1: Initialize: (i) $Q^{\boldsymbol{\mu}}(s,a)$ arbitrarily, (ii) $\hat{P}(x,a,x')=0$, (iii) $c(x,a,x')=0$, (iv) $n(x,a)=0$, for all $x,x'\in\mathcal{X}$ and $a\in\mathcal{A}(x)$, and (v) $n_{\mathfrak{P}}(s,a)=0$ for all $s\in\mathcal{S}$ and $a\in\mathcal{A}_{\mathfrak{P}}(s)$;
2: Convert $\phi$ to a DRA $\mathfrak{D}$;
3: Construct the distance function $d_{F}$ over the DRA as per (3);
4: $\boldsymbol{\mu}=(\epsilon,\delta)\text{-greedy}(Q)$;
5: $\texttt{episode-number}=0$;
6: while $Q$ has not converged do
7:     $\texttt{episode-number}=\texttt{episode-number}+1$;
8:     Initialize time step $t=0$;
9:     Initialize $s_{t}=(x_{0},q_{D}^{0})$ for a randomly selected $x_{0}$;
10:    while $t<\tau$ do
11:        Pick action $a_{t}$ as per (8);
12:        Execute $a_{t}$ and observe $s_{t+1}=(x_{t+1},q_{t+1})$ and $R(s_{t},a_{t},s_{t+1})$;
13:        $n(x_{t},a_{t})=n(x_{t},a_{t})+1$;
14:        $c(x_{t},a_{t},x_{t+1})=c(x_{t},a_{t},x_{t+1})+1$;
15:        Update $\hat{P}(x_{t},a_{t},x_{t+1})$ as per (6);
16:        $n_{\mathfrak{P}}(s_{t},a_{t})=n_{\mathfrak{P}}(s_{t},a_{t})+1$;
17:        Update $Q^{\boldsymbol{\mu}}(s_{t},a_{t})$ as per (7);
18:        $s_{t}=s_{t+1}$;
19:        $t=t+1$;
20:        Update $\epsilon,\delta_{b},\delta_{e}$;
21:    end while
22: end while

III-B Learning Optimal Temporal Logic Control Policies

In this section, we present the proposed accelerated RL algorithm for LTL control synthesis [lines 4-20, Alg. 1]. The output of the proposed algorithm is a stationary deterministic policy $\boldsymbol{\mu}^{*}$ for $\mathfrak{P}$ maximizing (1). To construct $\boldsymbol{\mu}^{*}$, we employ episodic Q-learning (QL). Similar to standard QL, starting from an initial PMDP state, we define learning episodes over which the robot picks actions as per a stationary stochastic control policy $\boldsymbol{\mu}:\mathcal{S}\times\mathcal{A}_{\mathfrak{P}}\rightarrow[0,1]$ that eventually converges to $\boldsymbol{\mu}^{*}$ [lines 4-5, Alg. 1]. Each episode terminates after a user-specified number of time steps $\tau$ or if the robot reaches a deadlock PMDP state, i.e., a state with no outgoing transitions [lines 7-20, Alg. 1]. Notice that the hyper-parameter $\tau$ should be selected large enough to ensure that the agent learns how to repetitively visit the accepting states [9, 8, 13]. The RL algorithm terminates once the action value function $Q^{\boldsymbol{\mu}}(s,a)$ has converged. This action value function is defined as the expected return for taking action $a$ at state $s$ and then following policy $\boldsymbol{\mu}$ [52], i.e.,

$Q^{\boldsymbol{\mu}}(s,a)=\mathbb{E}^{\boldsymbol{\mu}}\big[\textstyle\sum_{t=0}^{\infty}\gamma^{t}R(s_{t},\boldsymbol{\mu}(s_{t}),s_{t+1})~\big|~s_{0}=s,a_{0}=a\big].$ (4)

We have that $U^{\boldsymbol{\mu}}(s)=\max_{a\in\mathcal{A}_{\mathfrak{P}}(s)}Q^{\boldsymbol{\mu}}(s,a)$ [52]. The action-value function $Q^{\boldsymbol{\mu}}(s,a)$ can be initialized arbitrarily.

During any learning episode, the following process is repeated until the episode terminates. First, given the PMDP state $s_{t}$ at the current time step $t$, initialized as $s_{t}=s_{0}$ [line 9, Alg. 1], an action $a_{t}$ is selected as per a policy $\boldsymbol{\mu}$ [line 11, Alg. 1]; the detailed definition of $\boldsymbol{\mu}$ will be given later. The selected action is executed, yielding the next state $s_{t+1}=(x_{t+1},q_{t+1})$ and a reward $R(s_{t},a_{t},s_{t+1})$. For instance, the reward function $R$ can be constructed as in [6], defined as follows:

$R(s,a_{\mathfrak{P}},s')=\begin{cases}r_{\mathcal{G}}, & \text{if } s'\in\mathcal{G}_{i}^{\mathfrak{P}},\\ r_{\mathcal{B}}, & \text{if } s'\in\mathcal{B}_{i}^{\mathfrak{P}},\\ r_{d}, & \text{if } s' \text{ is a deadlock state},\\ r_{0}, & \text{otherwise}.\end{cases}$ (5)

In (5), we have that $r_{\mathcal{G}}>0$, for all $i\in\{1,\dots,f\}$, and $r_{d}<r_{\mathcal{B}}<r_{0}\leq 0$. This reward function motivates the robot to satisfy the PMDP acceptance condition, i.e., to visit the states $\mathcal{G}_{i}^{\mathfrak{P}}$ as often as possible and minimize the number of times it visits $\mathcal{B}_{i}^{\mathfrak{P}}$ and deadlock states while following the shortest possible path; deadlock states are visited when the LTL task is violated, e.g., when collision with an obstacle occurs.
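For concreteness, the following Python sketch encodes the reward in (5); the membership tests and the numeric values are illustrative placeholders (they only need to satisfy $r_{\mathcal{G}}>0$ and $r_{d}<r_{\mathcal{B}}<r_{0}\leq 0$) and are not taken from the paper's released code.

# Illustrative constants: r_G > 0 and r_d < r_B < r_0 <= 0, as required by (5).
R_G, R_B, R_D, R_0 = 10.0, -0.5, -1.0, 0.0

def reward(s_next, in_accepting_G, in_accepting_B, is_deadlock):
    """Reward for landing in product state s_next, cf. (5).

    The three callables are assumed membership tests for the sets G_i^P,
    B_i^P, and the deadlock states of the PMDP.
    """
    if in_accepting_G(s_next):
        return R_G
    if in_accepting_B(s_next):
        return R_B
    if is_deadlock(s_next):
        return R_D
    return R_0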

Given the new state $s_{t+1}$, the MDP model of the robot is updated. In particular, every time an MDP transition is enabled, the corresponding transition probability is updated. Let $\hat{P}(x_{t},a_{t},x_{t+1})$ denote the estimated MDP transition probability from state $x_{t}\in\mathcal{X}$ to state $x_{t+1}\in\mathcal{X}$ when action $a_{t}$ is taken. These estimated MDP transition probabilities are initialized so that $\hat{P}(x,a,x')=0$, for all combinations of states and actions, and they are continuously updated at every time step $t$ of each episode as [lines 13-15]:

$\hat{P}(x_{t},a_{t},x_{t+1})=\frac{c(x_{t},a_{t},x_{t+1})}{n(x_{t},a_{t})},$ (6)

where (i) $n:\mathcal{X}\times\mathcal{A}\rightarrow\mathbb{N}$ is a function that returns the number of times action $a$ has been taken at an MDP state $x$, and (ii) $c:\mathcal{X}\times\mathcal{A}\times\mathcal{X}\rightarrow\mathbb{N}$ is a function that returns the number of times an MDP state $x'$ has been visited after taking action $a$ at a state $x$. Note that as $n(x,a)\to\infty$, the estimated transition probabilities $\hat{P}(x,a,x')$ converge asymptotically to the true transition probabilities $P(x,a,x')$, for all transitions.
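A minimal sketch of the count-based estimate in (6), mirroring lines 13-15 of Algorithm 1; the dictionary-based counters are our own representation, not the paper's implementation.

from collections import defaultdict

# Counters mirroring lines 13-15 of Algorithm 1.
n = defaultdict(int)   # n[(x, a)]      : times action a was taken at state x
c = defaultdict(int)   # c[(x, a, x_)]  : times state x_ followed the pair (x, a)

def record_transition(x, a, x_next):
    """Update the counters after observing the MDP transition (x, a) -> x_next."""
    n[(x, a)] += 1
    c[(x, a, x_next)] += 1

def P_hat(x, a, x_next):
    """Estimated transition probability as in (6); 0 before (x, a) is ever tried."""
    return c[(x, a, x_next)] / n[(x, a)] if n[(x, a)] > 0 else 0.0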

Next, the action value function is updated as follows [52] [line 17, Alg. 1]:

$Q^{\boldsymbol{\mu}}(s_{t},a_{t})=Q^{\boldsymbol{\mu}}(s_{t},a_{t})+\big(1/n_{\mathfrak{P}}(s_{t},a_{t})\big)\big[R(s_{t},a_{t},s_{t+1})-Q^{\boldsymbol{\mu}}(s_{t},a_{t})+\gamma\max_{a'}Q^{\boldsymbol{\mu}}(s_{t+1},a')\big],$ (7)

where $n_{\mathfrak{P}}:\mathcal{S}\times\mathcal{A}_{\mathfrak{P}}\rightarrow\mathbb{N}$ counts the number of times that action $a$ has been taken at the PMDP state $s$. Once the action value function is updated, the current PMDP state is updated as $s_{t}=s_{t+1}$, the time step $t$ is increased by one, and the policy $\boldsymbol{\mu}$ is updated [lines 18-20, Alg. 1].
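The update in (7) is an ordinary Q-learning update with the visit-count based learning rate $1/n_{\mathfrak{P}}(s_{t},a_{t})$; a minimal sketch follows, where `Q`, `n_P`, and `actions` are assumed to be a Q-table dictionary (e.g., a defaultdict over $(s,a)$ pairs), the visit counter, and a function returning $\mathcal{A}_{\mathfrak{P}}(s)$.

def q_update(Q, n_P, s, a, r, s_next, gamma, actions):
    """Q-learning update of (7) with learning rate 1/n_P(s, a)."""
    n_P[(s, a)] += 1                                   # line 16 of Algorithm 1
    alpha = 1.0 / n_P[(s, a)]
    best_next = max(Q[(s_next, a_)] for a_ in actions(s_next))
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]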

As the policy $\boldsymbol{\mu}$, we propose an extension of the $\epsilon$-greedy policy, called the $(\epsilon,\delta)$-greedy policy, which selects an action $a$ at a PMDP state $s$ by using the learned action-value function $Q^{\boldsymbol{\mu}}(s,a)$ and the continuously learned transition probabilities $\hat{P}(x,a,x')$. Formally, the $(\epsilon,\delta)$-greedy policy $\boldsymbol{\mu}$ is defined as

$\boldsymbol{\mu}(s,a)=\begin{cases}1-\epsilon+\frac{\delta_{e}}{|\mathcal{A}_{\mathfrak{P}}(s)|} & \text{if } a=a^{*} \text{ and } a\neq a_{b},\\ 1-\epsilon+\frac{\delta_{e}}{|\mathcal{A}_{\mathfrak{P}}(s)|}+\delta_{b} & \text{if } a=a^{*} \text{ and } a=a_{b},\\ \delta_{e}/|\mathcal{A}_{\mathfrak{P}}(s)| & \text{if } a\in\mathcal{A}_{\mathfrak{P}}(s)\setminus\{a^{*},a_{b}\},\\ \delta_{b}+\delta_{e}/|\mathcal{A}_{\mathfrak{P}}(s)| & \text{if } a=a_{b} \text{ and } a\neq a^{*},\end{cases}$ (8)

where $\delta_{b},\delta_{e}\in[0,1]$ and $\epsilon=\delta_{b}+\delta_{e}\in[0,1]$. In words, according to this policy, (i) with probability $1-\epsilon$, the greedy action $a^{*}=\operatorname{argmax}_{a\in\mathcal{A}_{\mathfrak{P}}(s)}Q(s,a)$ is taken (as in the standard $\epsilon$-greedy policy); and (ii) an exploratory action is selected with probability $\epsilon=\delta_{b}+\delta_{e}$. The exploration strategy is defined as follows: (ii.1) with probability $\delta_{e}$ a random action $a$ is selected (random exploration); and (ii.2) with probability $\delta_{b}$ the action, denoted by $a_{b}$, that is most likely to drive the robot towards an accepting product state in $\mathcal{G}_{i}^{\mathfrak{P}}$ is taken (biased exploration). The action $a_{b}$ will be defined formally in Section III-C. As in standard QL, $\epsilon$ should asymptotically converge to $0$ while ensuring that eventually all actions have been applied infinitely often at all states. This ensures that $\boldsymbol{\mu}$ asymptotically converges to the optimal greedy policy

$\boldsymbol{\mu}^{*}(s)=\operatorname{argmax}_{a\in\mathcal{A}_{\mathfrak{P}}(s)}Q^{*}(s,a),$ (9)

where $Q^{*}$ is the optimal action value function; see Sec. IV-A. We note that $Q^{\boldsymbol{\mu}^{*}}(s,\boldsymbol{\mu}^{*}(s))=U^{\boldsymbol{\mu}^{*}}(s)=V^{*}(s)$, where $V^{*}(s)$ is the optimal value function that could have been computed if the MDP were fully known [52, 53]. Given $\epsilon$, selection of the parameters $\delta_{e}$ and $\delta_{b}$ is discussed in Sec. IV.
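Sampling from (8) can be implemented by drawing one uniform random number and splitting the unit interval into an exploitation part of mass $1-\epsilon$, a biased part of mass $\delta_{b}$, and a uniform part of mass $\delta_{e}$; the resulting marginal over actions matches (8). The sketch below assumes helper callables `greedy_action` and `biased_action` that return $a^{*}$ and $a_{b}$ (names are ours).

import random

def eps_delta_greedy_action(s, actions, greedy_action, biased_action,
                            delta_b, delta_e):
    """Sample an action from the (epsilon, delta)-greedy policy in (8)."""
    epsilon = delta_b + delta_e
    u = random.random()
    if u < 1.0 - epsilon:
        return greedy_action(s)            # exploit: a* = argmax_a Q(s, a)
    if u < 1.0 - epsilon + delta_b:
        return biased_action(s)            # biased exploration towards a_b
    return random.choice(actions(s))       # uniform exploration over A_P(s)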

III-C Specification-guided Exploration for Accelerated Learning

Next, we describe the design of the biased action $a_{b}$ in (8). First, we need to introduce the following definitions; see Fig. 2. Let $s_{t}=(x_{t},q_{t})$ denote the current PMDP state at the current learning episode and time step $t$ of Algorithm 1. Let $\mathcal{Q}_{\text{goal}}(q_{t})\subset\mathcal{Q}_{D}$ be a set that collects all DRA states that are one-hop reachable from $q_{t}$ in the pruned DRA and are closer to the accepting DRA states than $q_{t}$ is, as per (3). Formally, $\mathcal{Q}_{\text{goal}}(q_{t})$ is defined as follows:

$\mathcal{Q}_{\text{goal}}(q_{t})=\{q'\in\mathcal{Q}_{D}~|~(\exists\sigma\in\Sigma_{\text{feas}}~\text{such that}~\delta_{D}(q_{t},\sigma)=q')\wedge(d_{F}(q',\mathcal{F})=d_{F}(q_{t},\mathcal{F})-1)\}.$ (10)
Figure 2: Graphical depiction of the set $\mathcal{X}_{\text{goal}}(q_{t})$. The disks represent MDP states, and an arrow between two states means that there exists at least one action for which the transition probability from one state to the other is non-zero. The length of the shortest path from $x_{t}$ to $\mathcal{X}_{\text{goal}}$ is $3$ hops, i.e., $J_{x_{t},\mathcal{X}_{\text{goal}}}=3$; see (12). Also, the paths $p_{j}^{t}$, $j\in\mathcal{J}=\{1,2\}$, are highlighted with thick green lines. The numbers on top of the green edges represent $\max_{a}P(p_{j}^{t}(e),a,p_{j}^{t}(e+1))$; see (14). Observe that $p^{*}$ is the green path highlighted with gray color.

Also, let $\mathcal{X}_{\text{goal}}(q_{t})\subseteq\mathcal{X}$ be the set of MDP states, denoted by $x_{\text{goal}}$, such that if the robot eventually reaches one of them, then a transition from $s_{t}$ to a product state $s_{\text{goal}}=[x_{\text{goal}},q_{\text{goal}}]$ will occur, where $q_{\text{goal}}\in\mathcal{Q}_{\text{goal}}(q_{t})$; see also Ex. III.6. Formally, $\mathcal{X}_{\text{goal}}(q_{t})$ is defined as follows:

$\mathcal{X}_{\text{goal}}(q_{t})=\{x\in\mathcal{X}~|~\delta_{D}(q_{t},L(x))\in\mathcal{Q}_{\text{goal}}(q_{t})\}.$ (11)

Next, we view the continuously learned MDP as a weighted directed graph $\mathcal{G}=(\mathcal{V},\mathcal{E},w)$, where $\mathcal{V}$ is the set of MDP states, $\mathcal{E}$ is the set of edges, and $w:\mathcal{E}\rightarrow\mathbb{R}_{+}$ is a function assigning weights to each edge. Specifically, an edge from node (MDP state) $x$ to $x'$ exists if there exists at least one action $a\in\mathcal{A}(x)$ such that $\hat{P}(x,a,x')>0$. Hereafter, we assign a weight equal to $1$ to each edge; see also Remarks III.4-III.5. We denote the cost of the shortest path from $x$ to $x'$ by $J_{x,x'}$. Next, we define the cost of the shortest path connecting a state $x$ to the set $\mathcal{X}_{\text{goal}}$ as follows:

$J_{x,\mathcal{X}_{\text{goal}}}=\min_{x'\in\mathcal{X}_{\text{goal}}}J_{x,x'}.$ (12)

Let $J$ be the total number of paths from $x$ to $\mathcal{X}_{\text{goal}}$ whose length (i.e., number of hops) is $J_{x,\mathcal{X}_{\text{goal}}}$. We denote such a path by $p_{j}^{t}$, $j\in\mathcal{J}:=\{1,\dots,J\}$, and the $e$-th MDP state in this path by $p_{j}^{t}(e)$. Then, among all the paths $p_{j}^{t}$, we select the one with the least uncertainty, as measured by the cost $C(p_{j}^{t})$; see Fig. 2. We define this cost as

$C(p_{j}^{t})=\prod_{e=1}^{J_{x,\mathcal{X}_{\text{goal}}}}\Big[\max_{a}\hat{P}(p_{j}^{t}(e),a,p_{j}^{t}(e+1))\Big],$ (13)

where the maximization is over all actions $a\in\mathcal{A}(p_{j}^{t}(e))$. We denote by $p^{*}$ the path with the maximum cost $C(p_{j}^{t})$ (i.e., the least uncertainty), i.e., $p^{*}=p_{j^{*}}^{t}$, where $j^{*}=\operatorname{argmax}_{j}C(p_{j}^{t})$. Thus, we have that:

$C(p^{*})\geq C(p_{j}^{t}),~\forall j\in\mathcal{J}.$ (14)

Once $p^{*}$ is constructed, the action $a_{b}$ is defined as follows:

$a_{b}=\operatorname{argmax}_{a\in\mathcal{A}(x_{t})}\hat{P}(x_{t},a,x_{b}),$ (15)

where $x_{b}=p^{*}(2)$; see Fig. 2. In words, $a_{b}$ is the action with the highest (estimated) probability of allowing the system to reach the state $p^{*}(2)$, i.e., to move along the best path $p^{*}$. Observe that computation of the biased action depends neither on the employed reward structure nor on perfectly learning all MDP transition probabilities.
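Putting (11)-(15) together, the biased action can be computed with a breadth-first enumeration of the minimum-hop paths on the learned graph, followed by the cost comparison in (14). The following Python sketch assumes the learned model is stored as a nested dictionary `P_hat[x][a][x']` and that `X_goal` and `X_avoid` are the sets from (11) and Remark III.4; all names are ours, and the enumeration is kept simple rather than efficient.

from collections import deque

def min_hop_paths(x_t, P_hat, X_goal, X_avoid):
    """Enumerate all minimum-hop paths from x_t to X_goal on the learned graph."""
    best_len, best_paths = None, []
    queue = deque([(x_t, [x_t])])
    while queue:
        x, path = queue.popleft()
        if best_len is not None and len(path) > best_len:
            break                                   # only shortest paths are kept
        if x in X_goal and len(path) > 1:
            best_len = len(path)
            best_paths.append(path)
            continue
        for a, succ in P_hat.get(x, {}).items():
            for x_next, p in succ.items():
                if p > 0 and x_next not in path and x_next not in X_avoid:
                    queue.append((x_next, path + [x_next]))
    return best_paths

def biased_action(x_t, P_hat, X_goal, X_avoid):
    """Return a_b as per (13)-(15), or None if no path exists (cf. Remark III.3)."""
    paths = min_hop_paths(x_t, P_hat, X_goal, X_avoid)
    if not paths:
        return None                                 # caller falls back to a random action

    def cost(path):                                 # uncertainty-based cost C(p) in (13)
        c = 1.0
        for x, x_next in zip(path, path[1:]):
            c *= max(P_hat[x][a].get(x_next, 0.0) for a in P_hat[x])
        return c

    p_star = max(paths, key=cost)                   # path p* satisfying (14)
    x_b = p_star[1]                                 # x_b = p*(2)
    return max(P_hat[x_t], key=lambda a: P_hat[x_t][a].get(x_b, 0.0))   # (15)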

Remark III.3 (Initialization)

Selection of the biased action $a_{b}$ requires knowledge of (i) the MDP states $x$ in (11) that need to be visited to enable transitions to DRA states in $\mathcal{Q}_{\text{goal}}$; and (ii) the underlying MDP graph structure, determined by the (unknown) transition probabilities, to compute (12). However, neither may be available in early episodes. In this case, we randomly initialize $\mathcal{X}_{\text{goal}}$ for (i). Similarly, for (ii), the estimated transition probabilities are randomly initialized (or, simply, set equal to $0$ [line 1, Alg. 1]), initializing in this way the MDP graph structure. If no paths can be computed to determine $J_{x_{t},\mathcal{X}_{\text{goal}}}$ in (12), we select a random biased action.

Remark III.4 (Computing Shortest Path)

It is possible that the shortest path from $x_{t}$ to $x_{\text{goal}}\in\mathcal{X}_{\text{goal}}(q_{t})$ goes through states/nodes $x$ that, if visited, may enable a transition to a new state $q\neq q_{t}$ that does not belong to $\mathcal{Q}_{\text{goal}}(q_{t})$. Therefore, when we compute the shortest paths, we treat all such nodes $x$ as 'obstacles' that should not be crossed. These states are collected in the set $\mathcal{X}_{\text{avoid}}=\{x\in\mathcal{X}~|~\delta_{D}(q_{t},L(x))\notin\mathcal{Q}_{\text{goal}}(q_{t})\}$.

Remark III.5 (Weights & Shortest Paths)

To design the biased action $a_{b}$, the MDP is viewed as a weighted graph where a weight $w=1$ is assigned to all edges. In Section IV, this definition of weights allows us to show how the probability of making progress towards satisfying the assigned task (i.e., reaching the DRA states $\mathcal{Q}_{\text{goal}}$) within the minimum number of time steps (i.e., $J_{x_{t},\mathcal{X}_{\text{goal}}}$ time steps) is positively affected by introducing bias in the exploration phase. Alternative weight assignments can be used that may further improve sample efficiency in practice; see also Ex. III.6. For instance, the assigned weights can be equal to the reciprocal of the estimated transition probabilities. In this case, the shortest path between two MDP states models the path with the least uncertainty that connects these two states. However, in this case the theoretical results shown in Section IV do not hold.

Figure 3: MDP-based representation of the interaction of a ground robot with a corridor-like environment. The square cells represent MDP states, i.e., $\mathcal{X}=\{\text{Exit1},\text{Exit2},\text{A},\text{B},\text{C},\text{D},\text{E}\}$. An action enabling a transition between adjacent cells with non-zero probability exists for all MDP states.
Example III.6 (Biased Exploration)

Consider a robot operating in a corridor of a building as in Figure 3. The robot is tasked with exiting the building, i.e., eventually reaching one of the two exits. This can be captured by the following LTL formula: $\phi=\Diamond(\pi^{\text{Exit1}}\vee\pi^{\text{Exit2}})$. The DRA of this specification is illustrated in Figure 1. Assume that $q_{t}=q_{D}^{0}$. Then, $\mathcal{X}_{\text{goal}}=\{\text{Exit1},\text{Exit2}\}$. The robot can take two actions at each state (besides the 'exit' states): $a_{1}=\text{'left'}$ and $a_{2}=\text{'right'}$. (i) Assume that $x_{t}=\text{C}$. Observe that $J_{x_{t},\mathcal{X}_{\text{goal}}}=3$ and that $J=2$. Specifically, the following two paths $p_{j}^{t}$ can be defined: $p_{1}^{t}=\text{C,D,E,Exit1}$ and $p_{2}^{t}=\text{C,B,A,Exit2}$. Consider also transition probabilities that satisfy $\max_{a}P(\text{C},a,\text{D})=0.51$, $\max_{a}P(\text{D},a,\text{E})=0.9$, $\max_{a}P(\text{E},a,\text{Exit1})=1$, $\max_{a}P(\text{C},a,\text{B})=0.9$, $\max_{a}P(\text{B},a,\text{A})=0.6$, $\max_{a}P(\text{A},a,\text{Exit2})=0.6$. In this case, we have that $C(p_{1}^{t})=0.459$ and $C(p_{2}^{t})=0.324$. According to (14), we have that $j^{*}=1$ and, therefore, $x_{b}=p_{1}^{t}(2)=\text{D}$. The biased action $a_{b}$ at $x_{t}$ is $a_{b}=a_{2}$ as per (15). (ii) Assume that $x_{t}=\text{D}$. Then, we have that $J_{x_{t},\mathcal{X}_{\text{goal}}}=2$. Notice that there is only one path that reaches $\mathcal{X}_{\text{goal}}$ within $J_{x_{t},\mathcal{X}_{\text{goal}}}=2$ hops/time steps, defined as $p_{1}^{t}=\text{D,E,Exit1}$. Consider also transition probabilities that satisfy $\max_{a}P(\text{D},a,\text{E})=0.7$, $\max_{a}P(\text{E},a,\text{Exit1})=0.7$, $\max_{a}P(\text{D},a,\text{C})=1$, $\max_{a}P(\text{C},a,\text{B})=1$, $\max_{a}P(\text{B},a,\text{A})=1$, $\max_{a}P(\text{A},a,\text{Exit2})=1$. In this case, we have that $C(p_{1}^{t})=0.49$. The biased action $a_{b}$ at $x_{t}$ is selected as follows. Assume $P(\text{D},a_{1},\text{E})=0.3$ and $P(\text{D},a_{2},\text{E})=0.7$. Given that $x_{b}=p_{1}^{t}(2)=\text{E}$, we have that $a_{b}=a_{2}$ as per (15). Observe that although there is a 'deterministic' path of length $4$ from $x_{t}$ to Exit2 that can be followed with probability $1$, the biased action aims to drive the robot towards Exit1. This happens because the proposed algorithm is biased towards the shortest paths (of length $2$ here), in terms of number of MDP transitions/hops, that lead to DRA states that are closer to the accepting states, by definition of the weights $w$. We note that the paths stemming from the biased action are not necessarily the paths with the least uncertainty; see also Rem. III.5. Also, we highlight that we do not claim any optimality of $a_{b}$ with respect to the task satisfaction probability; intuitively, in (ii), the biased action is 'sub-optimal' with respect to the task satisfaction probability.
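As a quick arithmetic check of case (i) above, the two path costs in (13) can be reproduced directly from the listed transition probabilities:

# Case (i) of Example III.6: costs C(p_j^t) from (13) using the listed values.
C_p1 = 0.51 * 0.9 * 1.0    # path C -> D -> E -> Exit1
C_p2 = 0.9 * 0.6 * 0.6     # path C -> B -> A -> Exit2
assert abs(C_p1 - 0.459) < 1e-9 and abs(C_p2 - 0.324) < 1e-9
# C_p1 > C_p2, so j* = 1 and x_b = p_1^t(2) = D, as stated in the example.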

IV Algorithm Analysis

In this section, we show that any $(\epsilon,\delta)$-greedy policy achieves policy improvement; see Proposition IV.1. Also, we provide conditions that $\delta_{b}$ and $\delta_{e}$ should satisfy under which the proposed biased exploration strategy results in learning control policies faster, in a probabilistic sense, than policies that rely on uniform exploration. We emphasize that these results should be interpreted primarily in an existential way, as they rely on the unknown MDP transition probabilities. First, we provide 'myopic' sample-efficiency guarantees. Specifically, we show that, starting from $s_{t}=(x_{t},q_{t})$, the probability of reaching a PMDP state $s_{t+1}=(x_{t+1},q_{t+1})$, where $x_{t+1}$ is closer to $\mathcal{X}_{\text{goal}}$ (see (11)) than $x_{t}$ is, is higher when bias is introduced in the exploration phase; see Section IV-B. Then, we provide non-myopic guarantees ensuring that, starting from $s_{t}$, the probability of reaching a PMDP state $s_{t'}=(x_{t'},q_{t'})$, where $t'>t$ and $q_{t'}\in\mathcal{Q}_{\text{goal}}$ (see (10)), in the minimum number of time steps (as determined by $J_{x_{t},\mathcal{X}_{\text{goal}}}$) is higher when bias is introduced in the exploration phase; see Section IV-C.

IV-A Policy Improvement

Proposition IV.1 (Policy Improvement)

For any $(\epsilon,\delta)$-greedy policy $\boldsymbol{\mu}$, the updated $(\epsilon,\delta)$-greedy policy $\boldsymbol{\mu}'$ obtained after updating the state-action value function $Q^{\boldsymbol{\mu}}(s,a)$ satisfies $U^{\boldsymbol{\mu}'}(s)\geq U^{\boldsymbol{\mu}}(s)$, for all $s\in\mathcal{S}$. $\Box$

Proof:

To show this result, we follow the same steps as in the policy improvement result for the $\epsilon$-greedy policy [52]. For simplicity of notation, hereafter we use $A=|\mathcal{A}_{\mathfrak{P}}(s)|$. Thus, we have that:

$U^{\boldsymbol{\mu}'}(s)=\sum_{a\in\mathcal{A}_{\mathfrak{P}}(s)}\boldsymbol{\mu}'(s,a)Q^{\boldsymbol{\mu}}(s,a)=\frac{\delta_{e}}{A}\sum_{a}Q^{\boldsymbol{\mu}}(s,a)+(1-\epsilon)\max_{a\in\mathcal{A}_{\mathfrak{P}}(s)}Q^{\boldsymbol{\mu}}(s,a)+\delta_{b}Q^{\boldsymbol{\mu}}(s,a_{b})$
$\geq\frac{\delta_{e}}{A}\sum_{a}Q^{\boldsymbol{\mu}}(s,a)+(1-\epsilon)\sum_{a\in\mathcal{A}_{\mathfrak{P}}(s)}\Big(\frac{\boldsymbol{\mu}(s,a)-\frac{\delta_{e}}{A}-\mathbb{I}_{a=a_{b}}\delta_{b}}{1-\epsilon}\Big)Q^{\boldsymbol{\mu}}(s,a)+\delta_{b}Q^{\boldsymbol{\mu}}(s,a_{b})$
$=\sum_{a\in\mathcal{A}_{\mathfrak{P}}(s)}\boldsymbol{\mu}(s,a)Q^{\boldsymbol{\mu}}(s,a)=U^{\boldsymbol{\mu}}(s),$

where the inequality holds because the second summation is a weighted average of $Q^{\boldsymbol{\mu}}(s,a)$ with non-negative weights summing to $1$, and such an average cannot exceed the largest value averaged. ∎

In Proposition IV.1, the equality $U^{\boldsymbol{\mu}'}(s)=U^{\boldsymbol{\mu}}(s)$, $\forall s\in\mathcal{S}$, holds if $\boldsymbol{\mu}=\boldsymbol{\mu}'=\boldsymbol{\mu}^{*}$, where $\boldsymbol{\mu}^{*}$ is the optimal policy [52].

IV-B Myopic Effect of Biased Exploration

In this section, we demonstrate the myopic benefit of biased exploration; the proofs can be found in Appendix B. To formally describe it, we first introduce the following definitions. Let $s_{t}=(x_{t},q_{t})$ be the PMDP state at the current time step $t$ of an RL episode of Algorithm 1. Also, let $\mathcal{R}(x_{t})\subseteq\mathcal{X}$ denote the set collecting all MDP states that can be reached within one hop from $x_{t}$, i.e.,

$\mathcal{R}(x_{t})=\{x\in\mathcal{X}~|~\exists a\in\mathcal{A}(x_{t})~\text{such that}~\hat{P}(x_{t},a,x)>0\}.$ (16)

Then, we can define the set $\mathcal{X}_{\text{closer}}(x_{t})$ that collects all MDP states that are one-hop reachable from $x_{t}$ and are closer to $\mathcal{X}_{\text{goal}}(q_{t})$ than $x_{t}$ is, i.e.,

$\mathcal{X}_{\text{closer}}(x_{t})=\{x\in\mathcal{R}(x_{t})~|~J_{x,\mathcal{X}_{\text{goal}}}=J_{x_{t},\mathcal{X}_{\text{goal}}}-1\}.$ (17)

The following result shows that the probability that $x_{t+1}\in\mathcal{X}_{\text{closer}}(x_{t})$ is higher when biased exploration is employed.

Proposition IV.2

Let $s_{t}=(x_{t},q_{t})$ be the PMDP state at the current time step $t$ of an RL episode of Algorithm 1. Let also $x_{b}\in\mathcal{X}_{\text{closer}}(x_{t})$ denote the MDP state towards which the action $a_{b}$ is biased. If $\delta_{b}>0$ and (18) holds,

$P(x_{t},a_{b},x)\geq\max_{\bar{x}\in\mathcal{X}_{\text{closer}}(x_{t})}\sum_{a}\frac{P(x_{t},a,\bar{x})}{|\mathcal{A}(x_{t})|},~\forall x\in\mathcal{X}_{\text{closer}}(x_{t}),$ (18)

where the summation is over $a\in\mathcal{A}(x_{t})$, then we have that

$\mathbb{P}_{b}(x_{t+1}\in\mathcal{X}_{\text{closer}}(x_{t}))\geq\mathbb{P}_{g}(x_{t+1}\in\mathcal{X}_{\text{closer}}(x_{t})).$ (19)

In (19), $\mathbb{P}_{g}(x_{t+1}\in\mathcal{X}_{\text{closer}}(x_{t}))$ and $\mathbb{P}_{b}(x_{t+1}\in\mathcal{X}_{\text{closer}}(x_{t}))$ denote the probability of reaching any state $x_{t+1}\in\mathcal{X}_{\text{closer}}(x_{t})$ starting from $x_{t}$ without and with bias introduced in the exploration phase, respectively. $\Box$

Next, we provide a 'weaker' result which, however, does not require the strong condition (18). The following result shows that the probability that the next state $x_{t+1}$ will be equal to $x_{b}\in\mathcal{X}_{\text{closer}}$ (as opposed to any state in $\mathcal{X}_{\text{closer}}$, as in Prop. IV.2) is greater when bias is introduced in the exploration phase.

Proposition IV.3

Let $s_{t}=(x_{t},q_{t})$ be the PMDP state at the current time step $t$ of an RL episode of Algorithm 1. Let also $x_{b}\in\mathcal{X}_{\text{closer}}(x_{t})$ denote the MDP state towards which the action $a_{b}$ is biased. If $\delta_{b}>0$, then

$\mathbb{P}_{b}(x_{t+1}=x_{b})\geq\mathbb{P}_{g}(x_{t+1}=x_{b}),$ (20)

where $\mathbb{P}_{g}(x_{t+1}=x_{b})$ and $\mathbb{P}_{b}(x_{t+1}=x_{b})$ denote the probability of reaching at time $t+1$ the state $x_{b}$ starting from $x_{t}$ without and with bias introduced in the exploration phase, respectively.

IV-C Non-Myopic Effect of Biased Exploration

In this section, we demonstrate the non-myopic effect of biased exploration; the proofs can be found in Appendix C. To present our main results, we need to introduce the following definitions. Let $s_{t}=(x_{t},q_{t})$ be the current PMDP state. Also, let $t^{*}=J_{x_{t},\mathcal{X}_{\text{goal}}}$ denote the length (i.e., the number of hops/MDP transitions) of the paths $p_{j}^{t}$. Recall that all paths $p_{j}^{t}$, $j\in\mathcal{J}$, share the same length, in terms of number of hops, by construction. Second, we define a function $\beta:\mathcal{J}\rightarrow[0,1]$ that maps every path $p_{j}^{t}$, $j\in\mathcal{J}$, into $[0,1]$ as follows:

$\beta(p_{j}^{t})=\prod_{m=0}^{t^{*}-1}\Big\{P(x_{t+m},a_{b},x_{t+m+1})\delta_{b}+P(x_{t+m},a^{*},x_{t+m+1})(1-\epsilon)+\frac{\delta_{e}}{|\mathcal{A}(x_{t+m})|}\Big\}.$ (21)

In (21), we have that $x_{t+m}=p_{j}^{t}(m+1)$, for all $m\in\{0,\dots,t^{*}-1\}$, and $a_{b}$ is the biased action computed at state $s_{t+m}=(x_{t+m},q_{t})$ as discussed in Section III-C, i.e., using the path $p_{j^{*}}^{t+m}$.

Proposition IV.4 (Most Likely Path)

At time step $t$ of the current RL episode, let (i) $s_{t}=(x_{t},q_{t})$ be the current PMDP state; and (ii) $p_{j^{*}}^{t}$ be the path used to design the biased action at time step $t$. Let $R_{j}$ be a (Bernoulli) random variable that is true if, after $t^{*}$ time steps (i.e., at time step $t+t^{*}$), the path $p_{j}^{t}$ has been generated, for some $j\in\mathcal{J}$. If there exist $\delta_{b}$ and $\delta_{e}$ satisfying the following condition

$\beta(p_{j^{*}}^{t})\geq\max_{j\in\mathcal{J}}\beta(p_{j}^{t}),$ (22)

then we have that $\mathbb{P}_{b}(R_{j^{*}}=1)\geq\max_{j\in\mathcal{J}}\mathbb{P}_{b}(R_{j}=1)$, where $\mathbb{P}_{b}(R_{j^{*}}=1)$ and $\mathbb{P}_{b}(R_{j}=1)$ stand for the probability that $R_{j^{*}}=1$ and $R_{j}=1$, respectively, if the MDP evolves as per the proposed $(\epsilon,\delta)$-greedy policy. $\Box$

Remark IV.5 (Prop. IV.4)

Prop. IV.4 implies that there exist $\delta_{b}$ and $\delta_{e}$ such that, among all paths $p_{j}^{t}$, $j\in\mathcal{J}$, designed at time $t$, the most likely path that the MDP will generate over the next $t^{*}$ time steps is $p_{j^{*}}^{t}$. For instance, if $\delta_{b}=1$ and, therefore, $\delta_{e}=0$ and $\epsilon=1$, then (22) is equivalent to $\prod_{m=0}^{t^{*}-1}P(x_{t+m},a_{b},x_{t+m+1})\geq\max_{j\in\mathcal{J}}\prod_{m=0}^{t^{*}-1}P(\bar{x}_{t+m},\bar{a}_{b},\bar{x}_{t+m+1})$, due to (14)-(15), where $x_{t+m}=p_{j^{*}}^{t}(m+1)$ and $\bar{x}_{t+m}=p_{j}^{t}(m+1)$ for all $m\in\{0,\dots,t^{*}-1\}$, and $a_{b}$ and $\bar{a}_{b}$ denote the biased actions at states $x_{t+m}$ and $\bar{x}_{t+m}$ using the path $p_{j^{*}}^{t+m}$.

In what follows, we show that there exist δb\delta_{b} and δe\delta_{e} that ensure that the probability of generating the path pjtp_{j^{*}}^{t} under the (ϵ,δ)(\epsilon,\delta)-greedy policy (captured by b(Rj=1)\mathbb{P}_{b}(R_{j^{*}}=1)) is larger than the probability of generating any path pjtp_{j}^{t}, j𝒥j\in{\mathcal{J}}, under the ϵ\epsilon-greedy policy. To make this comparative analysis meaningful, hereafter, we assume that the probability of exploration ϵ=δb+δe\epsilon=\delta_{b}+\delta_{e} is the same for both policies; thus, the probability of selecting the greedy action is the same for both policies, as well. Recall again that the ϵ\epsilon-greedy policy can be recovered by removing bias from the (ϵ,δ)(\epsilon,\delta)-greedy policy, i.e., by setting δb=0\delta_{b}=0. To present this result, we need to define a function η:𝒥[0,1]\eta:{\mathcal{J}}\rightarrow[0,1] mapping every path pjtp_{j}^{t}, j𝒥j\in{\mathcal{J}}, into [0,1][0,1] as follows:

η(pjt)=m=0t1{P(xt+m,a,xt+m+1)(1ϵ)+ϵ|𝒜(xt+m)|}.\displaystyle\eta(p_{j}^{t})=\prod_{m=0}^{t^{*}-1}\{P(x_{t+m},a^{*},x_{t+m+1})(1-\epsilon)+\frac{\epsilon}{|{\mathcal{A}}(x_{t+m})|}\}. (23)

In (23), we have that xt+m=pjt(m+1)x_{t+m}=p_{j}^{t}(m+1), for all m{0,,t1}m\in\{0,\dots,t^{*}-1\} and aa^{*} is the greedy action computed at state st+m=(xt+m,qt)s_{t+m}=(x_{t+m},q_{t}).
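Similarly, (23) only involves the greedy action; a minimal sketch under the same assumptions as the β\beta sketch above:

```python
def eta_path(path, P_hat, actions, greedy_action, eps):
    """Evaluate eta(p_j^t) in (23): the probability of following the path
    under the plain epsilon-greedy policy (no bias)."""
    value = 1.0
    for x, x_next in zip(path[:-1], path[1:]):
        value *= (P_hat[x][greedy_action(x)].get(x_next, 0.0) * (1.0 - eps)
                  + eps / len(actions[x]))
    return value
```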

Proposition IV.6 (Random vs Biased Exploration)

At time step tt of the current RL episode, let (i) st=(xt,qt)s_{t}=(x_{t},q_{t}) be the current product state; and (ii) pjtp_{j^{*}}^{t} be the path used to design the current biased action. Let RjR_{j} be a (Bernoulli) random variable that is true if after tt^{*} time steps (i.e., at the time step t+t)t+t^{*}), a path pjtp_{j}^{t} has been generated for some j𝒥j\in{\mathcal{J}} under a policy 𝛍\boldsymbol{\mu}. If there exist δb\delta_{b} and δe\delta_{e} satisfying the following condition

β(pjt)maxj𝒥η(pjt)\beta(p_{j^{*}}^{t})\geq\max_{j\in{\mathcal{J}}}\eta(p_{j}^{t}) (24)

then, we have that b(Rj=1)maxj𝒥g(Rj=1)\mathbb{P}_{b}(R_{j^{*}}=1)\geq\max_{j\in{\mathcal{J}}}\mathbb{P}_{g}(R_{j}=1), where b(Rj=1)\mathbb{P}_{b}(R_{j^{*}}=1) and g(Rj=1)\mathbb{P}_{g}(R_{j}=1) stand for the probability that Rj=1R_{j^{*}}=1 and Rj=1R_{j}=1, if the MDP evolves as per the proposed (ϵ,δ)(\epsilon,\delta)-greedy and ϵ\epsilon-greedy policy, respectively. \hfill\Box

Remark IV.7 (Prop. IV.6)

Prop. IV.6 states that among all paths pjtp_{j}^{t} of length tt^{*}, j𝒥j\in{\mathcal{J}}, there exists values for δb\delta_{b} and δe\delta_{e} under which there exists an MDP path (the one with index jj^{*}) that is more likely to be generated over the next tt^{*} time steps under the (ϵ,δ)(\epsilon,\delta)-greedy than any path pjtp_{j}^{t}, j𝒥j\in{\mathcal{J}} that can be generated under the ϵ\epsilon-greedy policy. For instance, if δb=1\delta_{b}=1 and δe=0\delta_{e}=0, (i.e., ϵ=1\epsilon=1) then (24) is equivalent to m=0t1P(xt+m,ab,xt+m+1)maxj𝒥m=0t11|𝒜(x¯t+m)|\prod_{m=0}^{t^{*}-1}P(x_{t+m},a_{b},x_{t+m+1})\geq\max_{j\in{\mathcal{J}}}\prod_{m=0}^{t^{*}-1}\frac{1}{|{\mathcal{A}}(\bar{x}_{t+m})|}, where xt+m=pjt(m+1)x_{t+m}=p_{j^{*}}^{t}(m+1), and x¯t+m=pjt(m+1)\bar{x}_{t+m}=p_{j}^{t}(m+1) for all m{0,,t1}m\in\{0,\dots,t^{*}-1\}. Let Amin=minx𝒳|𝒜(x)|A_{\text{min}}=\min_{x\in{\mathcal{X}}}|{\mathcal{A}}(x)|. Then, for δb=1\delta_{b}=1, the result in Proposition IV.6 holds if m=0t1P(xt+m,ab,xt+m+1)(1Amin)t\prod_{m=0}^{t^{*}-1}P(x_{t+m},a_{b},x_{t+m+1})\geq(\frac{1}{A_{\text{min}}})^{t^{*}}. The latter is true if e.g., P(xt+m,ab,xt+m+1)1AminP(x_{t+m},a_{b},x_{t+m+1})\geq\frac{1}{A_{\text{min}}} for all m{0,,t1}m\in\{0,\dots,t^{*}-1\}. We note that a similar result is presented in [33] which employs a similar biased exploration to address deterministic temporal logic planning problems (see Remark 4.5 in [33]).

Proposition IV.6 compares the sample-efficiency of (ϵ,δ)(\epsilon,\delta)-greedy and ϵ\epsilon-greedy policies with respect to a specific path pjtp_{j^{*}}^{t}. Building upon Proposition IV.6, we next provide a more general result. Specifically, we show that the probability that after t+1t^{*}+1 time steps a PMDP state s=(x,q)s=(x,q), where q𝒬goalq\in{\mathcal{Q}}_{\text{goal}} (see (10)), will be reached is higher when bias is introduced in the exploration phase. We emphasize again that given the current PMDP state st=(xt,qt)s_{t}=(x_{t},q_{t}) in an RL episode, the earliest that a PMDP state s=(x,q)s=(x,q), where q𝒬goalq\in{\mathcal{Q}}_{\text{goal}}, can be reached is after t+1t^{*}+1 time steps, where t=Jxt,𝒳goalt^{*}=J_{x_{t},{\mathcal{X}}_{\text{goal}}}. The reason is that the length of the shortest path from xtx_{t} to states 𝒳goal{\mathcal{X}}_{\text{goal}} that can enable the transition from qtq_{t} to 𝒬goal{\mathcal{Q}}_{\text{goal}} is t=Jxt,𝒳goalt^{*}=J_{x_{t},{\mathcal{X}}_{\text{goal}}}.

Proposition IV.8 (Sample Efficiency)

Let st=(xt,qt)s_{t}=(x_{t},q_{t}) be the product state reached at the tt-th time step of the current RL episode. A state sgoal=(x,qgoal)s_{\text{goal}}=(x,q_{\text{goal}}), where qgoal𝒬goalq_{\text{goal}}\in{\mathcal{Q}}_{\text{goal}} can be reached after at least t+1t^{*}+1 time steps, where t=Jxt,𝒳goalt^{*}=J_{x_{t},{\mathcal{X}}_{\text{goal}}}. If there exist δb\delta_{b} and δe\delta_{e} satisfying the following condition:

j𝒥β(pjt)j𝒥η(pjt),\displaystyle\sum_{j\in{\mathcal{J}}}\beta(p_{j}^{t})\geq\sum_{j\in{\mathcal{J}}}\eta(p_{j}^{t}), (25)

where jj^{*} stands for the index to the path selected as per (14), then b(qt+t+1𝒬goal)g(qt+t+1𝒬goal)\mathbb{P}_{b}(q_{t+t^{*}+1}\in{\mathcal{Q}}_{\text{goal}})\geq\mathbb{P}_{g}(q_{t+t^{*}+1}\in{\mathcal{Q}}_{\text{goal}}), where b(qt+t+1𝒬goal)\mathbb{P}_{b}(q_{t+t^{*}+1}\in{\mathcal{Q}}_{\text{goal}}) and g(qt+t+1𝒬goal)\mathbb{P}_{g}(q_{t+t^{*}+1}\in{\mathcal{Q}}_{\text{goal}}) stand for the probability that a PMDP state with a DRA state in 𝒬goal{\mathcal{Q}}_{\text{goal}} will be reached after exactly t+1t^{*}+1 time steps using the (ϵ,δ)(\epsilon,\delta)-greedy and ϵ\epsilon-greedy policy, respectively. \hfill\Box

Remark IV.9 (Selecting parameters δb\delta_{b} and δe\delta_{e})

(i) The result in Proposition IV.8 shows that there exist δb\delta_{b} and δe\delta_{e} to potentially improve sample efficiency compared to uniform/random exploration. However, selection of δb\delta_{b} and δe\delta_{e} as per Proposition IV.8 requires knowledge of the actual MDP transition probabilities along all paths pjtp_{j}^{t}, j𝒥j\in{\mathcal{J}}, which are not available. To address this, the estimated transition probabilities, computed in (6), can be used instead. To mitigate the fact that the initial estimated probabilities may be rather inaccurate, δe\delta_{e} can be selected so that δe>δb\delta_{e}>\delta_{b} for the first few episodes. Intuitively, this allows the agent to initially perform random exploration and learn accurate enough MDP transition probabilities across all directions. Once this happens and given ϵ\epsilon, values for δb\delta_{b} and δe\delta_{e} that satisfy the requirement (25) (using the estimated probabilities) can be computed by applying a simple line search algorithm over all possible values for δb[0,ϵ]\delta_{b}\in[0,\epsilon], since δe+δb=ϵ\delta_{e}+\delta_{b}=\epsilon. (ii) A more efficient approach would be to pick δb\delta_{b} based on Proposition IV.6 instead of IV.8. The reason is that searching for δb\delta_{b} that satisfies (24) requires fewer computations than (25); see also Remark IV.7. (iii) An even more computationally efficient, but heuristic, approach to pick δb\delta_{b} and δe\delta_{e} is the following. We select δb\delta_{b} and δe\delta_{e} so that δe>δb\delta_{e}>\delta_{b} for the first few episodes to learn an accurate enough MDP model and then allow δe<δb\delta_{e}<\delta_{b} to prioritize exploration towards directions that may contribute to mission progress while letting both δb\delta_{b} and δe\delta_{e} asymptotically converge to 0. Nevertheless, the values for δb\delta_{b} and δe\delta_{e} selected in this way may not satisfy the requirements mentioned in Propositions IV.4, IV.6, and IV.8.
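As an illustration of item (i) above, the snippet below sketches the line search over δb[0,ϵ]\delta_{b}\in[0,\epsilon] that checks condition (25) on the estimated model, reusing the beta_path and eta_path helpers sketched in Section IV-C; the grid resolution and the choice to return the largest feasible δb\delta_{b} are our own assumptions, not part of the reference implementation.

```python
import numpy as np

def select_delta_b(paths, P_hat, actions, greedy_action, biased_action,
                   eps, grid_size=50):
    """Line search over delta_b in [0, eps] for condition (25), evaluated
    on the estimated model; returns the largest grid value satisfying (25),
    or 0.0 (i.e., plain epsilon-greedy exploration) if none is found."""
    # The right-hand side of (25) does not depend on delta_b.
    rhs = sum(eta_path(p, P_hat, actions, greedy_action, eps) for p in paths)
    for delta_b in np.linspace(eps, 0.0, grid_size):
        delta_e = eps - delta_b
        lhs = sum(beta_path(p, P_hat, actions, greedy_action, biased_action,
                            delta_b, delta_e, eps) for p in paths)
        if lhs >= rhs:
            return float(delta_b)
    return 0.0
```

Note that at δb=0\delta_{b}=0 the left- and right-hand sides of (25) coincide, so the search always terminates with a valid (possibly zero) amount of bias.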

Remark IV.10 (Limitations)

Alternative definitions of dFd_{F} may affect the construction of the set 𝒬goal{\mathcal{Q}}_{\text{goal}} in (10). Currently, dFd_{F} captures the shortest path, in terms of number of hops, between a DRA state and the set of accepting states. However, this definition neglects the underlying MDP structure which may compromise sample-efficiency. Specifically, the shortest DRA-based path may be harder for the MDP to realize than a longer DRA-based path, depending on the MDP transition probabilities. The result presented in Proposition IV.8 shows that given a distance function dFd_{F} and, consequently, 𝒬goal{\mathcal{Q}}_{\text{goal}}, there exist conditions that the parameters δb\delta_{b} and δe\delta_{e} should satisfy, so that the probability of reaching 𝒬goal{\mathcal{Q}}_{\text{goal}} within the minimum possible number of time steps (i.e., Jx,𝒳goalJ_{x,{\mathcal{X}}_{\text{goal}}} time steps) is larger when the (ϵ,δ)(\epsilon,\delta)-greedy policy is used. This does not necessarily imply that the probability of eventually reaching accepting states is also larger as this depends on the definition of dFd_{F} and, consequently, 𝒬goal{\mathcal{Q}}_{\text{goal}}. Designing dFd_{F} that optimizes sample-efficiency is a future research direction. However, our comparative experiments in Section V demonstrate sample-efficiency of the proposed method under various settings.

V Numerical Experiments

To demonstrate the sample-efficiency of our method, we provide extensive comparisons against existing model-free and model-based RL algorithms. All methods have been implemented in Python 3.8 and evaluated on a computer with an Nvidia RTX 3080 GPU, 12th Gen Intel(R) Core(TM) i7-12700K CPU, and 8GB RAM.

V-A Setting up Experiments & Baselines

MDP: We consider environments represented as 10×1010\times 10, 20×2020\times 20, and 50×5050\times 50 discrete grid worlds, resulting in MDPs with |𝒳|=100,400|{\mathcal{X}}|=100,400, and 2,5002,500 states denoted by 𝔐1\mathfrak{M}_{1}, 𝔐2\mathfrak{M}_{2}, and 𝔐3\mathfrak{M}_{3}, respectively. The robot has nine actions: ‘left’, ‘right’, ‘up’, ‘down’, ‘idle’, as well as the corresponding four diagonal actions. At any MDP state xx, excluding the boundary ones, the set of actions 𝒜(x){\mathcal{A}}(x) that the robot can apply includes eight of these nine actions, randomly selected while ensuring that the idle action is available at any state. The set of actions at boundary MDP states excludes those that drive the robot outside the environment. The transition probabilities are designed so that given any action, besides ‘idle’, the probability of reaching the intended state is randomly selected from the interval [0.7,0.8][0.7,0.8], while the probability of reaching neighboring MDP states is randomly selected as long as the summation of transition probabilities over the next states xx^{\prime} is equal to 11, for a fixed action aa and starting state xx. The action ‘idle’ is applied deterministically.
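For concreteness, the following is a minimal sketch of how a grid-world MDP of this type could be generated; the way the residual probability mass (one minus the intended-transition probability) is spread over the remaining in-bounds cells (including staying in place) is our own assumption, since the description above only requires each row of the transition function to sum to 11.

```python
import itertools
import numpy as np

def make_grid_mdp(n, seed=0):
    """Random n x n grid-world MDP in the spirit of Section V-A (a sketch)."""
    rng = np.random.default_rng(seed)
    moves = {'idle': (0, 0), 'left': (0, -1), 'right': (0, 1),
             'up': (-1, 0), 'down': (1, 0), 'up-left': (-1, -1),
             'up-right': (-1, 1), 'down-left': (1, -1), 'down-right': (1, 1)}
    states = list(itertools.product(range(n), range(n)))
    A, P = {}, {}
    for x in states:
        # Keep only actions that stay inside the grid; then keep 'idle' plus a
        # random subset of seven others (eight actions per interior state).
        feasible = [a for a, (dr, dc) in moves.items()
                    if 0 <= x[0] + dr < n and 0 <= x[1] + dc < n]
        others = [a for a in feasible if a != 'idle']
        k = min(7, len(others))
        A[x] = ['idle'] + list(rng.choice(others, size=k, replace=False))
        P[x] = {}
        for a in A[x]:
            dr, dc = moves[a]
            target = (x[0] + dr, x[1] + dc)
            P[x][a] = {}
            if a == 'idle':
                P[x][a][x] = 1.0                      # applied deterministically
                continue
            p_hit = rng.uniform(0.7, 0.8)             # intended transition
            spill = [(x[0] + d1, x[1] + d2) for d1, d2 in moves.values()
                     if (x[0] + d1, x[1] + d2) != target
                     and 0 <= x[0] + d1 < n and 0 <= x[1] + d2 < n]
            weights = rng.random(len(spill))
            weights = (1.0 - p_hit) * weights / weights.sum()
            P[x][a][target] = p_hit                   # rows sum to 1 by design
            for cell, w in zip(spill, weights):
                P[x][a][cell] = P[x][a].get(cell, 0.0) + w
    return states, A, P
```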

Refer to caption
Figure 4: Decay rates of the parameters δe\delta_{e}, δb\delta_{b}, and ϵ\epsilon considered in Section V for 𝔐1\mathfrak{M}_{1} and 𝔐2\mathfrak{M}_{2}. The rate at which 1ϵ1-\epsilon (red) increases is the same in all figures. As the number of episodes goes to infinity, 1ϵ1-\epsilon converges to 11 and both δb\delta_{b} and δe\delta_{e} converge to 0. Notice that, in the bottom right figure, δb\delta_{b} is always equal to 0 resulting in random exploration (ϵ\epsilon-greedy policy).

Baselines: In the following case studies, we demonstrate the performance of Algorithm 1 when it is equipped with the proposed (ϵ,δ)(\epsilon,\delta)-greedy policy (8), the ϵ\epsilon-greedy policy, the Boltzmann policy, and the UCB1 policy. Notice that Alg. 1 is model-free when it is equipped with these baselines, as it does not require learning the MDP. We also compare it against a standard model-based approach that explicitly computes and stores the product MDP (PMDP) [3]. Computing the PMDP requires learning the underlying MDP model, which can be achieved, e.g., by simply letting the agent randomly explore the environment and then estimating the transition probabilities as in (6). (This would result in learning the transition probabilities of 𝔐1\mathfrak{M}_{1} and 𝔐2\mathfrak{M}_{2} in 1.11.1 and 9090 minutes, respectively, with a maximum error equal to 0.050.05.) In our implementation, we directly use the ground-truth MDP transition probabilities, giving an ‘unfair’ advantage to the model-based approach over the proposed one. Given the resulting PMDP, we apply dynamic programming to compute the optimal policy and its satisfaction probability [3].

Refer to caption
Figure 5: A Simple Coverage Task (Section V-B): Comparison of average satisfaction probability ¯\bar{\mathbb{P}} when Algorithm 1 is applied with the proposed (ϵ,δ)(\epsilon,\delta)-greedy policy, ϵ\epsilon-greedy policy, Boltzmann policy, and UCB1 policy over the MDPs 𝔐1\mathfrak{M}_{1}, 𝔐2\mathfrak{M}_{2}, and 𝔐3\mathfrak{M}_{3}. Biased 130\text{Biased 1}-30 and Biased 1100\text{Biased 1}-100 refer to the cases where the Biased 1 exploration method is applied under the constraint that the MDP transition probabilities are updated only during the first 3030 and 100100 episodes, respectively. The legend also includes the total runtime per method. The black stars on top of each curve denote the training episode that the corresponding method has reached when the fastest method finishes training over the total number of episodes.

To examine sensitivity of the proposed algorithm with respect to the parameters δe\delta_{e} and δb\delta_{b}, we have considered three different decay rates for δe\delta_{e} and δb\delta_{b}, as per (iii) in Remark IV.9. Hereafter, we refer to the corresponding exploration strategies as ‘Biased 1’, ‘Biased 2’, and ‘Biased 3’, and ‘Random’, where the latter corresponds to the ϵ\epsilon-greedy policy. The rate at which δb\delta_{b} decreases over time gets smaller as we proceed from ‘Biased 1’, ‘Biased 2’, ‘Biased 3’, to ‘Random’. In other words, ‘Biased 1’ incurs the most ’aggressive’ bias in the exploration phase. The evolution of these parameters for the MDPs 𝔐1\mathfrak{M}_{1} and 𝔐2\mathfrak{M}_{2} is illustrated in Fig. 4. Similar biased strategies were selected for 𝔐3\mathfrak{M}_{3}. The only difference is that δb\delta_{b} is designed so that it converges to 0 slower due to the larger size of the state space. The corresponding mathematical formulas are provided in Appendix D. To make the comparison between the (ϵ,δ)(\epsilon,\delta)- and the ϵ\epsilon-greedy policy fair, we select the same ϵ\epsilon for both. The Boltzmann control policy is defined as follows: 𝝁B(s)=eQ𝝁B(s,a)/Ta𝒜𝔓eQ𝝁B(s,a)/T\boldsymbol{\mu}_{B}(s)=\frac{e^{Q^{\boldsymbol{\mu}_{B}}(s,a)/T}}{\sum_{a^{\prime}\in{\mathcal{A}}_{\mathfrak{P}}}e^{Q^{\boldsymbol{\mu}_{B}}(s,a^{\prime})/T}}, where T0T\geq 0 is the temperature parameter. The UCB1 control policy is defined as: 𝝁U(s)=argmaxa𝒜𝔓[Q𝝁U(s,a)+C×2log(N(s))n(s,a)]\boldsymbol{\mu}_{U}(s)=\operatornamewithlimits{argmax}_{a\in{\mathcal{A}}_{\mathfrak{P}}}\left[Q^{\boldsymbol{\mu}_{U}}(s,a)+C\times\sqrt{\frac{2\log(N(s))}{n(s,a)}}\right], where (i) N(s)N(s) and n(s,a)n(s,a) denote the number of times state ss has been visited and the number of times action aa has been selected at state ss and (ii) CC is an exploration parameter. This control policy is biased towards the least explored directions. In each case study, we pick values for CC and TT from a fixed set that yield the best performance. In all case studies, we adopt the reward function in (5) with γ=0.99\gamma=0.99 and r𝒢=10r_{{\mathcal{G}}}=10, r=0.1r_{{\mathcal{B}}}=-0.1, rd=100r_{d}=-100, and ro=0r_{o}=0. To convert the LTL formulas into DRA, we have used the ltl2dstar toolbox [54].
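For completeness, a minimal sketch of the two baseline exploration rules defined above; the tabular Q-function layout and the visit-count bookkeeping are assumptions of this sketch rather than details of the baselines' original implementations.

```python
import math
import random

def boltzmann_action(Q, s, actions, T):
    """Sample an action from the softmax distribution over Q(s, .) with
    temperature T, as in the Boltzmann policy above."""
    weights = [math.exp(Q[(s, a)] / T) for a in actions]
    total = sum(weights)
    r, acc = random.random() * total, 0.0
    for a, w in zip(actions, weights):
        acc += w
        if r <= acc:
            return a
    return actions[-1]

def ucb1_action(Q, s, actions, N, n, C):
    """Pick the action maximizing Q(s,a) + C * sqrt(2 log N(s) / n(s,a));
    unvisited actions are tried first to avoid division by zero."""
    unvisited = [a for a in actions if n[(s, a)] == 0]
    if unvisited:
        return random.choice(unvisited)
    return max(actions,
               key=lambda a: Q[(s, a)]
               + C * math.sqrt(2 * math.log(N[s]) / n[(s, a)]))
```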

Performance Metrics: We utilize the satisfaction probabilities of the policies learned at various stages during training to assess the performance of our algorithm and the baselines. Specifically, given a learned/fixed policy 𝝁\boldsymbol{\mu} and an initial PMDP state s=(x,qD0)s=(x,q_{D}^{0}), we compute the probability (𝝁ϕ|s=(x,qD0))\mathbb{P}(\boldsymbol{\mu}\models\phi|s=(x,q_{D}^{0})) using dynamic programming. We compute this probability for all x𝒳x\in{\mathcal{X}} and then we compute the average satisfaction probability ¯=[x𝒳(𝝁ϕ|s=(x,qD0))]/|𝒳|\bar{\mathbb{P}}=[\sum_{\forall x\in{\mathcal{X}}}\mathbb{P}(\boldsymbol{\mu}\models\phi|s=(x,q_{D}^{0}))]/{|{\mathcal{X}}|}. We report the average ¯\bar{\mathbb{P}} over five runs; see Figs. 5-8. The satisfaction probabilities are computed using the unknown-to-the-agent MDP transition probabilities. Since runtimes for a training episode may differ across methods, we also report runtime metrics; see Figs. 5-8. Specifically, we document the runtimes required for all methods to complete a predetermined maximum number of episodes, as well as the training episode each method reaches when the fastest one completes the training process. This allows us to compare satisfaction probabilities over the policies more fairly based on fixed runtimes rather than a fixed number of training episodes.
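The satisfaction probability of a fixed policy reduces to a reachability computation on the Markov chain that the policy induces on the PMDP [3]. The sketch below assumes that the accepting condition has already been reduced to reaching a set of winning PMDP states (e.g., accepting end components); this reduction, the dense-matrix representation, and the variable names are assumptions of this illustration, not a description of our implementation.

```python
import numpy as np

def satisfaction_probabilities(P_pi, winning, tol=1e-8, max_iter=10_000):
    """Probability of eventually reaching `winning` under a fixed policy.

    P_pi    : (S, S) array, PMDP transition matrix induced by the policy
    winning : boolean array of length S marking the winning PMDP states
    """
    v = winning.astype(float)
    for _ in range(max_iter):
        v_new = P_pi @ v
        v_new[winning] = 1.0          # winning states are treated as absorbing
        if np.max(np.abs(v_new - v)) < tol:
            break
        v = v_new
    return v

# Averaging over all initial PMDP states (x, q_D^0) then gives P_bar, e.g.,
# P_bar = satisfaction_probabilities(P_pi, winning)[initial_states].mean(),
# where `initial_states` is a hypothetical index set of the states (x, q_D^0).
```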

Summary of Comparisons: Our experiments show that the proposed (ϵ,δ)(\epsilon,\delta)-greedy policy outperforms the model-free baselines, learning policies with higher satisfaction probabilities over the same timeframe. This performance gap widens significantly as the size of the PMDP increases. Specifically, our method begins learning policies with non-zero satisfaction probabilities within the first few hundred training episodes. The baselines can catch up relatively quickly, narrowing the performance gap, typically after a few thousand episodes, but only in small PMDPs (fewer than 10,00010,000 states). In larger PMDPs (more than 10,00010,000 states), our method significantly outperforms the model-free baselines. Additionally, the proposed (ϵ,δ)(\epsilon,\delta)-greedy policy and the ϵ\epsilon-greedy policy have similar runtimes, while they tend to be faster than UCB and, especially, Boltzmann. The model-based approach, on the other hand, demonstrates faster computation of the optimal policy compared to model-free baselines, including ours, when applied to small PMDPs (e.g., with fewer than 5,0005,000 states). However, this approach is memory inefficient, requiring storage of the PMDP and the action value function Q𝝁Q^{\boldsymbol{\mu}}. As a result, it failed to handle case studies with large PMDPs (more than 15,00015,000 states). In contrast, our method was able to handle PMDPs with hundreds of thousands of states; see e.g., Section V-E.

Remark V.1 (Limitations & Implementation Improvements)

A limitation of our method compared to model-free baselines is that it requires learning an MDP model, which can become memory-inefficient over large-scale MDPs. However, we believe that this limitation can be mitigated by more efficient implementations of our approach. For instance, in our current implementation [40], we store all learned MDP transition probabilities used to compute the biased action. However, the selection of the biased action does not require learning all transition probabilities; see (15). Instead, it only requires learning which action is most likely to drive the system from a state xx to a neighboring state xx^{\prime}. Once this property is learned for a pair of states xx and xx^{\prime}, the estimated transition probabilities P^(x,a,x)\hat{P}(x,a,x^{\prime}) in (6) can be discarded.
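As a concrete illustration of this remark, the sketch below keeps transition counts only until the most-likely action for a pair (x,x)(x,x^{\prime}) is considered learned and then stores just that action; the freeze_after threshold and the data layout are our own assumptions.

```python
from collections import defaultdict

class BiasedActionTable:
    """Keep transition counts only until the most-likely action for a pair
    (x, x') is considered learned, then store just that action."""

    def __init__(self, freeze_after=100):
        self.n_sa = defaultdict(int)     # (x, a)     -> times a was taken at x
        self.n_sas = defaultdict(int)    # (x, a, x') -> observed transitions
        self.frozen = {}                 # (x, x')    -> learned best action
        self.freeze_after = freeze_after

    def update(self, x, a, x_next, actions):
        if (x, x_next) in self.frozen:
            return                       # estimates already discarded
        self.n_sa[(x, a)] += 1
        self.n_sas[(x, a, x_next)] += 1
        # Freeze once every action at x has been tried often enough.
        if all(self.n_sa[(x, b)] >= self.freeze_after for b in actions):
            best = max(actions,
                       key=lambda b: self.n_sas[(x, b, x_next)] /
                                     self.n_sa[(x, b)])
            self.frozen[(x, x_next)] = best
            for b in actions:            # discard the per-pair estimates
                self.n_sas.pop((x, b, x_next), None)

    def biased_action(self, x, x_next):
        """Action most likely (so far) to move the agent from x to x_next."""
        return self.frozen.get((x, x_next))
```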

Refer to caption
Figure 6: A More Complex Coverage Task (Section V-C): Comparison of average accumulated reward (top row) and satisfaction probability ¯\bar{\mathbb{P}} (bottom row) when Algorithm 1 is applied with the proposed (ϵ,δ)(\epsilon,\delta)-greedy policy, ϵ\epsilon-greedy policy, Boltzmann policy, and UCB1 policy over the MDPs 𝔐1\mathfrak{M}_{1}, 𝔐2\mathfrak{M}_{2}, and 𝔐3\mathfrak{M}_{3}. The legend also includes the total runtime per method. The black stars on top of each curve denote the training episode that the corresponding method has reached when the fastest method finishes training over the total number of episodes.

V-B Case Study I: A Simple Coverage Task

First, we consider a coverage/sequencing mission requiring the agent to eventually reach the state 9999, the state 4646 or 9090, and the state 3333, while avoiding 9999 until 3333 is reached, and always avoiding the obstacle states 7373, 2424, 1515, and 8888. This task is captured by the following LTL formula: ϕ=(π99)(π46π90)(π33)(¬π99𝒰π33)¬πobs\phi=(\Diamond\pi^{99})\wedge\Diamond(\pi^{46}\vee\pi^{90})\wedge(\Diamond\pi^{33})\wedge(\neg\pi^{99}{\mathcal{U}}\pi^{33})\wedge\square\neg\pi^{\text{obs}}, where πobs\pi^{\text{obs}} is satisfied when the robot visits one of the obstacle states. This formula corresponds to a DRA with 77 states and 11 accepting pair. Thus, the PMDP constructed using 𝔐1\mathfrak{M}_{1}, 𝔐2\mathfrak{M}_{2}, and 𝔐3\mathfrak{M}_{3} has 700700, 2,8002,800, and 17,50017,500 states, respectively.

The comparative results are shown in Fig. 5. Our algorithm achieves the best performance when applied to 𝔐1\mathfrak{M}_{1}, regardless of the biased strategy. As for 𝔐2\mathfrak{M}_{2}, the best performance is achieved by our (ϵ,δ)(\epsilon,\delta)-greedy policy coupled with ’Biased 3’, followed closely by the ϵ\epsilon-greedy policy, ’Biased 2’, and ’Biased 1’. Notice that ϵ\epsilon-greedy can catch up quickly when applied to 𝔐1\mathfrak{M}_{1} and 𝔐2\mathfrak{M}_{2} due to the relatively small size of the resulting PMDPs. This figure also shows the performance of ’Biased 1’ when the MDP transition probabilities are updated only for the first 3030 and 100100 episodes for 𝔐1\mathfrak{M}_{1} and 𝔐2\mathfrak{M}_{2}. The performance of our approach is not significantly affected by this choice, demonstrating robustness against model inaccuracies. This occurs because our algorithm does not require learning the ground truth values of the transition probabilities to compute the biased action; see (15).

The benefit of our method becomes more pronounced when 𝔐3\mathfrak{M}_{3} is considered, resulting in a larger PMDP. In this case, the average satisfaction probability ¯\bar{\mathbb{P}} of the policies learned by all baselines is close to 0 after 100,000100,000 training episodes (approximately 15 minutes). In contrast, the proposed (ϵ,δ)(\epsilon,\delta)-greedy policy, coupled with ‘Biased 2’ and ‘Biased 3’, learned policies with ¯=0.58\bar{\mathbb{P}}=0.58 and ¯=0.71\bar{\mathbb{P}}=0.71, respectively, within the same timeframe. Also, notice that ‘Biased 1’ failed to yield a satisfactory policy for 𝔐3\mathfrak{M}_{3} within the same timeframe; recall that ‘Biased 1’ for 𝔐3\mathfrak{M}_{3} is more aggressive than ‘Biased 1 for 𝔐1\mathfrak{M}_{1} and 𝔐2\mathfrak{M}_{2}. We attribute this to the aggressive nature of this exploration strategy towards a desired high-reward path, which possibly does not allow the agent to sufficiently explore a significant portion of the PMDP state space, resulting in a low average satisfaction probability. This shows that increasing the amount of bias does not necessarily yield policies with higher satisfaction probabilities.

The model-based approach was able to compute the optimal policy for the MDPs 𝔐1\mathfrak{M}_{1} and 𝔐2\mathfrak{M}_{2}, but failed for 𝔐3\mathfrak{M}_{3} due to excessive memory requirements. Specifically, the optimal policy for 𝔐1\mathfrak{M}_{1} was computed in 0.50.5 minutes, and for 𝔐2\mathfrak{M}_{2}, it took 5.985.98 minutes (without including the time to learn the MDP model). The corresponding optimal average satisfaction probabilities ¯\bar{\mathbb{P}} were 0.9160.916 for 𝔐1\mathfrak{M}_{1} and 0.9110.911 for 𝔐2\mathfrak{M}_{2}. We noticed that the model-based approach tends to be faster than the model-free baselines, particularly for smaller PMDPs.

V-C Case Study II: A More Complex Coverage Task

Second, we consider a more complex sequencing task compared to the one in Section V-B, which involves visiting a larger number of MDP states. The goal is to eventually reach the MDP states x=81,95,80,88x=81,95,80,88, and 9292 in any order, while always avoiding the states x=5,15,54,32,24,66,42,70x=5,15,54,32,24,66,42,70, and 7171, representing obstacles in the environment. This task can be formulated using the following LTL formula: ϕ=π81π95π80π88π92¬πobs\phi=\Diamond\pi^{81}\wedge\Diamond\pi^{95}\wedge\Diamond\pi^{80}\wedge\Diamond\pi^{88}\wedge\Diamond\pi^{92}\wedge\square\neg\pi^{\text{obs}}, where πobs\pi^{\text{obs}} is true if the robot visits any of the obstacle states. This formula corresponds to a DRA with 3333 states and 11 accepting pair. Therefore, the PMDPs constructed using 𝔐1\mathfrak{M}_{1}, 𝔐2\mathfrak{M}_{2}, and 𝔐3\mathfrak{M}_{3} have 3,3003,300, 13,20013,200, and 82,50082,500 states, respectively, which are significantly larger than those of Section V-B.

Overall, our method, especially when coupled with ‘Biased 2’ and ‘Biased 3’, learns policies with higher satisfaction probabilities faster than the baselines; see Fig. 6. The benefit of our method is more pronounced as the PMDP size increases, as shown in the cases of 𝔐2\mathfrak{M}_{2} and 𝔐3\mathfrak{M}_{3}. For example, when considering the MDP 𝔐3\mathfrak{M}_{3}, our method equipped with ’Biased 2’ and ’Biased 3’ learns policies with ¯=0.55\bar{\mathbb{P}}=0.55 and ¯=0.64\bar{\mathbb{P}}=0.64, respectively, while ¯<0.05\bar{\mathbb{P}}<0.05 for all other baselines, given the same amount of training time. Also, as in Section V-B, observe that ‘Biased 1’ failed to learn a satisfactory policy for 𝔐3\mathfrak{M}_{3}.

The model-based baseline computed the optimal policy 𝝁\boldsymbol{\mu}^{*} for the MDPs 𝔐1\mathfrak{M}_{1} and 𝔐2\mathfrak{M}_{2} in 1.11.1 and 112.15112.15 minutes, respectively, while it failed to compute the optimal policy for 𝔐3\mathfrak{M}_{3} due to excessive memory requirements. The average optimal satisfaction probability of the learned policies for 𝔐1\mathfrak{M}_{1} and 𝔐2\mathfrak{M}_{2} is 0.98540.9854 and 0.94660.9466, respectively.

Refer to caption
Figure 7: Surveillance Task (Section V-D): Comparison of average satisfaction probability ¯\bar{\mathbb{P}} when Algorithm 1 is applied with the proposed (ϵ,δ)(\epsilon,\delta)-greedy policy, ϵ\epsilon-greedy policy, Boltzmann policy, and UCB1 policy over the MDPs 𝔐1\mathfrak{M}_{1}, 𝔐2\mathfrak{M}_{2}, and 𝔐3\mathfrak{M}_{3}. The legend also includes the total runtime per method. The black stars on top of each curve denote the training episode that the corresponding method has reached when the fastest method finishes training over the total number of episodes.
Refer to caption
Figure 8: Disjoint Surveillance Task (Section V-E): Comparison of average satisfaction probability ¯\bar{\mathbb{P}} when Algorithm 1 is applied with the proposed (ϵ,δ)(\epsilon,\delta)-greedy policy, ϵ\epsilon-greedy policy, Boltzmann policy, and UCB1 policy over the MDPs 𝔐1\mathfrak{M}_{1}, 𝔐2\mathfrak{M}_{2}, and 𝔐3\mathfrak{M}_{3}. The legend includes the total runtime per method. The black stars on top of each curve denote the training episode that the corresponding method has reached when the fastest method finishes training over the total number of episodes.

V-D Case Study III: Surveillance Task

Third, we consider a surveillance/recurrence mission captured by the following LTL formula: ϕ=π90π70(π80π63)π88(¬π88𝒰π90)¬πobs.\phi=\square\Diamond\pi^{90}\wedge\square\Diamond\pi^{70}\wedge\square\Diamond(\pi^{80}\vee\pi^{63})\wedge\square\Diamond\pi^{88}\wedge(\neg\pi^{88}{\mathcal{U}}\pi^{90})\wedge\square\neg\pi^{\text{obs}}. This formula requires the robot to (i) visit infinitely often and in any order the states 9090, 7070, 8080 or 6363, and 8888; (ii) avoid reaching 8888 until 9090 is visited; and (iii) always avoid the obstacle in state 3333. The corresponding DRA has 1616 states and 11 accepting pair. Thus, the PMDP constructed using 𝔐1\mathfrak{M}_{1}, 𝔐2\mathfrak{M}_{2}, and 𝔐3\mathfrak{M}_{3} has 1,6001,600, 6,4006,400, and 40,00040,000 states, respectively.

The comparative performance results are shown in Figure 7. Observe that the (ϵ,δ)(\epsilon,\delta)-greedy policy, especially when paired with ’Biased 2’ and ’Biased 3’, performs better than the model-free baselines in terms of sample-efficiency across all considered MDPs. For instance, in the case of 𝔐1\mathfrak{M}_{1}, our proposed algorithm learns policies with average satisfaction probabilities ranging from 0.750.75 to 0.90.9, depending on the biased exploration strategy, within 2,5002,500 training episodes. In contrast, the average satisfaction probability for the baselines is around 0.40.4 after the same number of episodes. As the number of episodes increases, the baselines manage to catch up due to the relatively small PMDP size. Similar trends are observed for 𝔐2\mathfrak{M}_{2}. As in the other case studies, the benefit of our method becomes more evident when considering 𝔐3\mathfrak{M}_{3}, which yields a significantly larger PMDP. In this scenario, our proposed algorithm, coupled with ’Biased 2’ and ’Biased 3’, learns control policies with satisfaction probabilities of ¯=0.51\bar{\mathbb{P}}=0.51 and ¯=0.45\bar{\mathbb{P}}=0.45 within 100,000100,000 episodes (or approximately 2020 minutes), respectively. In contrast, the baselines achieve satisfaction probabilities ¯<0.2\bar{\mathbb{P}}<0.2 within the same timeframe. As discussed in Section V-C, ’Biased 1’ performs poorly in 𝔐3\mathfrak{M}_{3}, possibly due to its aggressive bias.

The model-based approach can compute the optimal policy only for the MDPs 𝔐1\mathfrak{M}_{1} and 𝔐2\mathfrak{M}_{2} while it failed in the case of 𝔐3\mathfrak{M}_{3} due to excessive memory requirements. Regarding the MDP 𝔐1\mathfrak{M}_{1} it computed an optimal policy corresponding to ¯=0.989\bar{\mathbb{P}}=0.989 within 2.312.31 minutes. As for the MDP 𝔐2\mathfrak{M}_{2}, it computed the optimal policy with ¯=0.981\bar{\mathbb{P}}=0.981 within 7.717.71 minutes.

V-E Case Study IV: Disjoint Task

Finally, we consider a mission ϕ\phi with two disjoint sub-tasks, i.e., ϕ=ϕ1ϕ2\phi=\phi_{1}\vee\phi_{2} requiring the robot to accomplish either ϕ1\phi_{1} or ϕ2\phi_{2}. The sub-tasks are defined as ϕ1=(π99π45π32¬π64)\phi_{1}=(\Diamond\pi^{99}\wedge\Diamond\pi^{45}\wedge\Diamond\pi^{32}\wedge\square\neg\pi^{64}) and ϕ2=(π18π72π4)\phi_{2}=(\Diamond\pi^{18}\wedge\Diamond\pi^{72}\wedge\Diamond\pi^{4}). The LTL formula ϕ\phi corresponds to a DRA with 6464 states and 22 accepting pairs. As a result, the PMDP constructed using 𝔐1\mathfrak{M}_{1}, 𝔐2\mathfrak{M}_{2}, and 𝔐3\mathfrak{M}_{3} has 6,4006,400, 25,60025,600, and 160,000160,000 states, respectively. This task requires the robot to eventually either visit the states 9999, 4545, and 3232 while always avoiding 6464 or visit the states 1818, 7272, and 44. Notice that the optimal satisfaction probability of ϕ1\phi_{1} and ϕ2\phi_{2} is 11 and less than 11, respectively.

The comparative results are reported in Figure 8. In 𝔐1\mathfrak{M}_{1}, our method coupled with ’Biased 1’ achieves the best performance, closely followed by ’Biased 3’. Both biased exploration strategies result in a control policy satisfying ϕ\phi with probability very close to 11 in approximately 0.50.5 minutes. Additionally, all other baselines, except UCB, perform satisfactorily, learning policies with [0.8,0.9]\mathbb{P}\in[0.8,0.9] in the same time frame. The performance gap between our method and the baselines becomes more pronounced with the larger PMDPs constructed using 𝔐2\mathfrak{M}_{2} and 𝔐3\mathfrak{M}_{3}. Specifically, in 𝔐2\mathfrak{M}_{2}, ‘Biased 1’ and ‘Biased 2’ achieve the best performance, followed by ‘Boltzmann’, ‘UCB’, ‘Biased 3’, and ‘ϵ\epsilon-greedy’. In fact, ‘Biased 1’ and ‘Biased 2’ still manage to learn a policy with ¯\bar{\mathbb{P}} very close to 11 in 2.402.40 mins, while for the other baselines it holds that ¯<0.8\bar{\mathbb{P}}<0.8. It is worth noting that the performance of ‘Biased 3’ has dropped significantly compared to 𝔐1\mathfrak{M}_{1}. This drop may be attributed to δb\delta_{b} converging quite fast to 0 relative to the large size of the PMDP. In fact, once δb\delta_{b} is almost equal to 0, the (ϵ,δ)(\epsilon,\delta)-greedy policy closely resembles the standard ϵ\epsilon-greedy policy, which in this case has also learned a policy with very low average satisfaction probability. Recall that 𝔐2\mathfrak{M}_{2} shared exactly the same biased exploration strategies (‘Biased 1’, ‘Biased 2’, and ‘Biased 3’) across all case studies regardless of the PMDP size. However, the PMDP for 𝔐2\mathfrak{M}_{2} is significantly larger than the ones considered in the other case studies, which may explain the poor performance of ‘Biased 3’ compared to the 𝔐2\mathfrak{M}_{2} cases in the other case studies. Observe in 𝔐3\mathfrak{M}_{3} that our method outperforms all baselines. Specifically, within 20.5520.55 mins, the average satisfaction probability corresponding to ‘Biased 1’, ‘Biased 2’, ‘Biased 3’, and ‘Boltzmann’ is 0.80.8, 0.710.71, 0.780.78, and 0.60.6, respectively. The Boltzmann policy requires in total 37.7837.78 minutes to eventually yield a policy with ¯=0.76\bar{\mathbb{P}}=0.76. Finally, the model-based approach was able to compute an optimal policy only for 𝔐1\mathfrak{M}_{1} within 6.16.1 minutes with ¯=0.9772\bar{\mathbb{P}}=0.9772; interestingly, model-free methods are faster in this case study.

VI Conclusions

In this paper, we proposed a new accelerated reinforcement learning (RL) algorithm for temporal logic control objectives. The proposed RL method relies on a new control policy, called (ϵ,δ)(\epsilon,\delta)-greedy, that prioritizes exploration in the vicinity of task-related regions. This results in enhanced sample-efficiency, as supported by theoretical results and comparative experiments. Our future work will focus on enhancing scalability by using function approximations (e.g., neural networks).

Appendix A Extensions: Biased Exploration over LDBA

In this appendix, we show that the proposed exploration strategy can be extended to Limit Deterministic Büchi Automata (LDBA), which typically have a smaller state space than DRA and can, therefore, further accelerate learning [38]. First, any LTL formula can be converted into an LDBA, defined as follows:

Definition A.1 (LDBA [38])

An LDBA is defined as 𝔄=(𝒬,q0,Σ,,δ)\mathfrak{A}=({\mathcal{Q}},q_{0},\Sigma,{\mathcal{F}},\delta) where 𝒬{\mathcal{Q}} is a finite set of states, q0𝒬q_{0}\in{\mathcal{Q}} is the initial state, Σ=2𝒜𝒫\Sigma=2^{\mathcal{AP}} is a finite alphabet, ={1,,f}{\mathcal{F}}=\{{\mathcal{F}}_{1},\dots,{\mathcal{F}}_{f}\} is the set of accepting conditions where j𝒬{\mathcal{F}}_{j}\subset{\mathcal{Q}}, 1jf1\leq j\leq f, and δ:𝒬×Σ2𝒬\delta:{\mathcal{Q}}\times\Sigma\rightarrow 2^{{\mathcal{Q}}} is a transition relation. The set of states 𝒬{\mathcal{Q}} can be partitioned into two disjoint sets 𝒬=𝒬N𝒬D{\mathcal{Q}}={\mathcal{Q}}_{N}\cup{\mathcal{Q}}_{D}, so that (i) δ(q,π)𝒬D\delta(q,\pi)\subset{\mathcal{Q}}_{D} and |δ(q,π)|=1|\delta(q,\pi)|=1, for every state q𝒬Dq\in{\mathcal{Q}}_{D} and πΣ\pi\in\Sigma; and (ii) for every j{\mathcal{F}}_{j}\in{\mathcal{F}}, it holds that j𝒬D{\mathcal{F}}_{j}\subset{\mathcal{Q}}_{D} and there are ε\varepsilon-transitions from 𝒬N{\mathcal{Q}}_{N} to 𝒬D{\mathcal{Q}}_{D}. \hfill\Box

An infinite run ρ\rho of 𝔄\mathfrak{A} over an infinite word w=σ0σ1σ2Σωw=\sigma_{0}\sigma_{1}\sigma_{2}\dots\in\Sigma^{\omega}, σtΣ=2𝒜𝒫\sigma_{t}\in\Sigma=2^{\mathcal{AP}} t\forall t\in\mathbb{N}, is an infinite sequence of states qt𝒬q_{t}\in{\mathcal{Q}}, i.e., ρ=q0q1qt\rho=q_{0}q_{1}\dots q_{t}\dots, such that qt+1δ(qt,σt)q_{t+1}\in\delta(q_{t},\sigma_{t}). The infinite run ρ\rho is called accepting (and the respective word ww is accepted by the LDBA) if Inf(ρ)j,j{1,,f},\texttt{Inf}(\rho)\cap{\mathcal{F}}_{j}\neq\emptyset,\forall j\in\{1,\dots,f\}, where Inf(ρ)\texttt{Inf}(\rho) is the set of states that are visited infinitely often by ρ\rho. Also, an ε\varepsilon-transition allows the automaton to change its state without reading any specific input. In practice, the ε\varepsilon-transitions between 𝒬N\mathcal{Q}_{N} and 𝒬D\mathcal{Q}_{D} reflect the “guess” on reaching 𝒬D\mathcal{Q}_{D}: accordingly, if after an ε\varepsilon-transition the associated labels in the accepting LDBA set cannot be read, or if the accepting states cannot be visited, then the guess is deemed to be wrong, and the trace is disregarded and is not accepted by the automaton. However, if the trace is accepting, then the trace will stay in 𝒬D\mathcal{Q}_{D} ever after, i.e. 𝒬D\mathcal{Q}_{D} is invariant.

Given a (non-pruned) LDBA, we construct the product MDP (PMDP), similarly to Definition II.6. The formal definition of this PMDP can be found in [9, 8]. To synthesize a policy that satisfies the LDBA accepting condition, we can adopt any reward function for the product MDP proposed in the literature [8, 9]. Once the LDBA is constructed, it is pruned exactly as discussed in Section III-A. The ε\varepsilon-transitions are not pruned. Given the resulting automaton, similar to (3), we define the distance to an accepting set of states j{\mathcal{F}}_{j} as dF(q,j)=minqGjd(q,qG)d_{F}(q,{\mathcal{F}}_{j})=\min_{q_{G}\in{\mathcal{F}}_{j}}d(q,q_{G}), where d(q,qG)d(q,q_{G}) is defined as in (2). This function is used to bias exploration so that each set j{\mathcal{F}}_{j} is visited infinitely often. To design a biased exploration strategy that can account for the LDBA accepting condition, we first define the set 𝒱{\mathcal{V}} that collects the indices jj of the sets of accepting states j{\mathcal{F}}_{j} that have not been visited during the current RL episode. Then, among all non-visited sets of accepting states j{\mathcal{F}}_{j}, we pick one randomly, based on which we define the set 𝒬goal(qt){\mathcal{Q}}_{\text{goal}}(q_{t}). Similar to (10), we define the set 𝒬goal(qt){\mathcal{Q}}_{\text{goal}}(q_{t}) as: 𝒬goal(qt)={q𝒬|(σΣfeassuch thatqδ(qt,σ))(dF(q,j)=dF(qt,j)1)}{\mathcal{Q}}_{\text{goal}}(q_{t})=\{q^{\prime}\in{\mathcal{Q}}~{}|~{}(\exists\sigma\in\Sigma_{\text{feas}}~{}\text{such that}~{}q^{\prime}\in\delta(q_{t},\sigma))\wedge(d_{F}(q^{\prime},{\mathcal{F}}_{j})=d_{F}(q_{t},{\mathcal{F}}_{j})-1)\}, where j𝒱j\in{\mathcal{V}}. Recall that all ε\varepsilon-transitions in the LDBA are feasible. Thus, by definition, 𝒬goal(qt){\mathcal{Q}}_{\text{goal}}(q_{t}) includes all states qq where the transition from qtq_{t} to qq is an ε\varepsilon-transition. Given 𝒬goal(qt){\mathcal{Q}}_{\text{goal}}(q_{t}), the biased action is selected exactly as described in Section III-C. Once the set of states j{\mathcal{F}}_{j} is visited, the set 𝒱{\mathcal{V}} is updated as 𝒱=𝒱{j}{\mathcal{V}}={\mathcal{V}}\setminus\{j\}, and then the set 𝒬goal(qt){\mathcal{Q}}_{\text{goal}}(q_{t}) is updated accordingly.
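To make this construction concrete, the following sketch computes dFd_{F} by a backward breadth-first search over the (pruned) automaton graph and assembles 𝒬goal(qt){\mathcal{Q}}_{\text{goal}}(q_{t}) for a randomly selected non-visited accepting set; the successor-list representation succ (assumed to contain only feasible successors and to already include the ε\varepsilon-successors) and the function names are assumptions of this sketch.

```python
import random
from collections import deque

def distance_to_set(succ, targets):
    """Hop distance d_F(q, F_j) from every automaton state to `targets`,
    via backward BFS over the successor relation `succ`."""
    pred = {q: set() for q in succ}
    for q, nexts in succ.items():
        for q2 in nexts:
            pred.setdefault(q2, set()).add(q)
    dist = {q: float('inf') for q in pred}
    queue = deque()
    for q in targets:
        dist[q] = 0
        queue.append(q)
    while queue:
        q = queue.popleft()
        for p in pred[q]:
            if dist[p] == float('inf'):
                dist[p] = dist[q] + 1
                queue.append(p)
    return dist

def goal_states(q_t, succ, accepting_sets, visited_idx):
    """Q_goal(q_t): feasible successors of q_t that decrease d_F towards a
    randomly chosen accepting set F_j not yet visited in this episode."""
    unvisited = [j for j in range(len(accepting_sets)) if j not in visited_idx]
    if not unvisited:
        return set()
    j = random.choice(unvisited)
    dist = distance_to_set(succ, accepting_sets[j])
    if dist[q_t] == float('inf'):
        return set()
    return {q for q in succ[q_t] if dist[q] == dist[q_t] - 1}
```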

Appendix B Proof for Results of Section IV-B

B-A Proof Proposition IV.2

The probability of reaching any state st+1=(xt+1,qt+1)s_{t+1}=(x_{t+1},q_{t+1}) where xt+1𝒳closer(xt)x_{t+1}\in{\mathcal{X}}_{\text{closer}}(x_{t}) under a stochastic policy 𝝁(s,a)\boldsymbol{\mu}(s,a) is: x𝒳closera𝒜(xt)𝝁(st,a)P(xt,a,x).\sum_{x\in{\mathcal{X}}_{\text{closer}}}\sum_{a\in{\mathcal{A}}(x_{t})}\boldsymbol{\mu}(s_{t},a)P(x_{t},a,x). Thus, we have that (note that qt+1q_{t+1} is selected deterministically, due to the DRA structure, i.e., qt+1=δD(qt,L(xt))q_{t+1}=\delta_{D}(q_{t},L(x_{t}))):

b(xt+1𝒳closer(xt))g(xt+1𝒳closer(xt))=\displaystyle\mathbb{P}_{b}(x_{t+1}\in{\mathcal{X}}_{\text{closer}}(x_{t}))-\mathbb{P}_{g}(x_{t+1}\in{\mathcal{X}}_{\text{closer}}(x_{t}))=
x𝒳closer(xt)a𝒜(xt)P(xt,a,x)(𝝁b(st,a)𝝁g(st,a)),\displaystyle\sum_{x\in{\mathcal{X}}_{\text{closer}}(x_{t})}\sum_{a\in{\mathcal{A}}(x_{t})}P(x_{t},a,x)(\boldsymbol{\mu}_{b}(s_{t},a)-\boldsymbol{\mu}_{g}(s_{t},a)), (26)

where 𝝁g\boldsymbol{\mu}_{g} and 𝝁b\boldsymbol{\mu}_{b} refer to the ϵ\epsilon-greedy (no biased exploration) and (ϵ,δ)(\epsilon,\delta)-greedy policy (biased exploration), respectively. In what follows, we compute 𝝁b(st,a)𝝁g(st,a)\boldsymbol{\mu}_{b}(s_{t},a)-\boldsymbol{\mu}_{g}(s_{t},a), for all a𝒜𝔓a\in{\mathcal{A}}_{\mathfrak{P}}. Recall, that 𝝁b(st,a)\boldsymbol{\mu}_{b}(s_{t},a) is the probability of selecting the action aa at state sts_{t}. Also, hereafter, we assume that the greedy action aa^{*} is different from the biased action aba_{b}; however, the same logic applies if ab=aa_{b}=a^{*}, leading to the same result. For simplicity of notation, we use A=|𝒜𝔓(s)|A=|{\mathcal{A}}_{\mathfrak{P}}(s)|.

First, for the action a=aa=a^{*}, we have that (a) 𝝁b(st,a)𝝁g(st,a)=(1ϵ+δeA)(1ϵ+ϵA)=(1ϵ+δeA)(1ϵ+δb+δeA)=δbA\boldsymbol{\mu}_{b}(s_{t},a^{*})-\boldsymbol{\mu}_{g}(s_{t},a^{*})=(1-\epsilon+\frac{\delta_{e}}{A})-(1-\epsilon+\frac{\epsilon}{A})=(1-\epsilon+\frac{\delta_{e}}{A})-(1-\epsilon+\frac{\delta_{b}+\delta_{e}}{A})=-\frac{\delta_{b}}{A}. Similarly, for a=aba=a_{b}, we have that (b) 𝝁b(st,ab)𝝁g(st,ab)=(δb+δeA)ϵA=δb(A1)A.\boldsymbol{\mu}_{b}(s_{t},a_{b})-\boldsymbol{\mu}_{g}(s_{t},a_{b})=(\delta_{b}+\frac{\delta_{e}}{A})-\frac{\epsilon}{A}=\frac{\delta_{b}(A-1)}{A}. Also, for all other actions aab,aa\neq a_{b},a^{*}, we have that (c) 𝝁b(st,a)𝝁g(st,a)=δeAϵA=δbA\boldsymbol{\mu}_{b}(s_{t},a)-\boldsymbol{\mu}_{g}(s_{t},a)=\frac{\delta_{e}}{A}-\frac{\epsilon}{A}=-\frac{\delta_{b}}{A}. Substituting the above equations (a)-(c) into (B-A) yields: b(xt+1𝒳closer)g(xt+1𝒳closer)=δbx𝒳closer(P(xt,ab,x)a𝒜(xt)P(xt,a,x)A)\mathbb{P}_{b}(x_{t+1}\in{\mathcal{X}}_{\text{closer}})-\mathbb{P}_{g}(x_{t+1}\in{\mathcal{X}}_{\text{closer}})=\delta_{b}\sum_{x\in{\mathcal{X}}_{\text{closer}}}(P(x_{t},a_{b},x)-\sum_{a\in{\mathcal{A}}(x_{t})}\frac{P(x_{t},a,x)}{A}). Due to (18) and that δb>0\delta_{b}>0, we conclude that b(xt+1𝒳closer)g(xt+1𝒳closer)0\mathbb{P}_{b}(x_{t+1}\in{\mathcal{X}}_{\text{closer}})-\mathbb{P}_{g}(x_{t+1}\in{\mathcal{X}}_{\text{closer}})\geq 0, completing the proof.
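The algebra in (a)-(c) can be checked numerically; below is a small sketch with arbitrary values of ϵ\epsilon, δb\delta_{b}, δe\delta_{e}, and AA chosen by us purely for illustration.

```python
# Numeric sanity check of the policy differences (a)-(c) above,
# with arbitrary parameter values (an illustration, not part of the proof).
eps, delta_b = 0.3, 0.2
delta_e = eps - delta_b
A = 5                                   # |A_P(s)|

diff_greedy = (1 - eps + delta_e / A) - (1 - eps + eps / A)   # (a) = -delta_b/A
diff_biased = (delta_b + delta_e / A) - eps / A               # (b) = delta_b*(A-1)/A
diff_other = delta_e / A - eps / A                            # (c) = -delta_b/A

assert abs(diff_greedy - (-delta_b / A)) < 1e-12
assert abs(diff_biased - delta_b * (A - 1) / A) < 1e-12
assert abs(diff_other - (-delta_b / A)) < 1e-12
# The differences over all A actions sum to zero, as they must:
assert abs(diff_greedy + diff_biased + (A - 2) * diff_other) < 1e-12
```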

B-B Proof Of Proposition IV.3

The probability of reaching a state st+1s_{t+1} where xt+1=xbx_{t+1}=x_{b} under a policy μ(s,a)\mu(s,a) is: a𝒜(xt)μ(st,a)P(xt,a,xb).\sum_{a\in{\mathcal{A}}(x_{t})}\mu(s_{t},a)P(x_{t},a,x_{b}). Thus, we have b(xt+1=xb)g(xt+1=xb)=a𝒜(xt)P(xt,a,xb)(μb(st,a)μg(st,a))\mathbb{P}_{b}(x_{t+1}=x_{b})-\mathbb{P}_{g}(x_{t+1}=x_{b})=\sum_{a\in{\mathcal{A}}(x_{t})}P(x_{t},a,x_{b})(\mu_{b}(s_{t},a)-\mu_{g}(s_{t},a)) Following the same steps as in the proof of Proposition IV.2, we get that b(xt+1=xb)g(xt+1=xb)0\mathbb{P}_{b}(x_{t+1}=x_{b})-\mathbb{P}_{g}(x_{t+1}=x_{b})\geq 0 if δb>0\delta_{b}>0, which holds by assumption, and P(xt,ab,xb)a𝒜(xt)P(xt,a,xb)|𝒜(xt)|P(x_{t},a_{b},x_{b})\geq\sum_{a\in{\mathcal{A}}(x_{t})}\frac{P(x_{t},a,x_{b})}{|{\mathcal{A}}(x_{t})|} which holds by definition of aba_{b} in (15). Specifically, given xtx_{t} and xbx_{b}, we have that P(xt,ab,xb)P(xt,a,xb)P(x_{t},a_{b},x_{b})\geq P(x_{t},a,x_{b}) for all a𝒜(xt)a\in{\mathcal{A}}(x_{t}) due to (15). Thus, P(xt,ab,xb)P(x_{t},a_{b},x_{b}) must be greater than or equal to the average transition probability over the actions aa i.e., a𝒜(xt)P(xt,a,xb)|𝒜(xt)|\sum_{a\in{\mathcal{A}}(x_{t})}\frac{P(x_{t},a,x_{b})}{|{\mathcal{A}}(x_{t})|} completing the proof.

Appendix C Proof of Results of Section IV-C

C-A Proof of Proposition IV.4

By definition of RjR_{j^{*}} and RjR_{j}, we can rewrite the inequality b(Rj=1)maxj𝒥b(Rj=1)\mathbb{P}_{b}(R_{j^{*}}=1)\geq\max_{j\in{\mathcal{J}}}\mathbb{P}_{b}(R_{j}=1) as

m=0t1a𝒜(xt+m)μb(st+m,a)P(xt+m,a,xt+m+1)\displaystyle\prod_{m=0}^{t^{*}-1}\sum_{a\in{\mathcal{A}}(x_{t+m})}\mu_{b}(s_{t+m},a)P(x_{t+m},a,x_{t+m+1})\geq (27)
maxj𝒥m=0t1a¯𝒜(x¯t+m)μb(s¯t+m,a¯)P(x¯t+m,a¯,x¯t+m+1).\displaystyle\max_{j\in{\mathcal{J}}}\prod_{m=0}^{t^{*}-1}\sum_{\bar{a}\in{\mathcal{A}}(\bar{x}_{t+m})}\mu_{b}(\bar{s}_{t+m},\bar{a})P(\bar{x}_{t+m},\bar{a},\bar{x}_{t+m+1}).

where st+m=(xt+m,qt)s_{t+m}=(x_{t+m},q_{t}), xt+m=pjt(m+1)x_{t+m}=p_{j^{*}}^{t}(m+1), s¯t+m=(x¯t+m,qt)\bar{s}_{t+m}=(\bar{x}_{t+m},q_{t}), x¯t+m=pjt(m+1)\bar{x}_{t+m}=p_{j}^{t}(m+1), for all m{0,,t1}m\in\{0,\dots,t^{*}-1\}. Recall that by construction of the paths pjtp_{j}^{t}, the DRA state will remain equal to qtq_{t} as the MDP agent moves along any of the paths pjtp_{j}^{t}, for all j𝒥j\in{\mathcal{J}}; see Remark III.4. We will show that (27) holds by contradiction. Assume that there exists at least one path pj¯tp_{\bar{j}}^{t}, j¯𝒥\bar{j}\in{\mathcal{J}}, that does not satisfy (27), i.e.,

m=0t1a𝒜(xt+m)μb(st+m,a)P(xt+m,a,xt+m+1)<\displaystyle\prod_{m=0}^{t^{*}-1}\sum_{a\in{\mathcal{A}}(x_{t+m})}\mu_{b}(s_{t+m},a)P(x_{t+m},a,x_{t+m+1})< (28)
m=0t1a¯𝒜(x¯t+m)μb(s¯t+m,a¯)P(x¯t+m,a¯,x¯t+m+1),\displaystyle\prod_{m=0}^{t^{*}-1}\sum_{\bar{a}\in{\mathcal{A}}(\bar{x}_{t+m})}\mu_{b}(\bar{s}_{t+m},\bar{a})P(\bar{x}_{t+m},\bar{a},\bar{x}_{t+m+1}),

where s¯t+m\bar{s}_{t+m} and x¯t+m\bar{x}_{t+m} are defined as per pj¯tp_{\bar{j}}^{t}.

Next, we assume that aaba^{*}\neq a_{b} and a¯a¯b\bar{a}^{*}\neq\bar{a}_{b}; the same logic applies even if this is not the case leading to the same result. Using (8), we plug the values of μb(st+m,a)\mu_{b}(s_{t+m},a) and μb(s¯t+m,a¯)\mu_{b}(\bar{s}_{t+m},\bar{a}) for all a𝒜(xt+m)a\in{\mathcal{A}}(x_{t+m}) and a¯𝒜(x¯t+m)\bar{a}\in{\mathcal{A}}(\bar{x}_{t+m}) in (28) which yields:

m=0t1{P(xt+m,ab,xt+m+1)(δb+δe|𝒜(xt+m)|)+\displaystyle\prod_{m=0}^{t^{*}-1}\{P(x_{t+m},a_{b},x_{t+m+1})(\delta_{b}+\frac{\delta_{e}}{|{\mathcal{A}}(x_{t+m})|})+
P(xt+m,a,xt+m+1)(1ϵ+δe|𝒜(xt+m)|)+\displaystyle P(x_{t+m},a^{*},x_{t+m+1})(1-\epsilon+\frac{\delta_{e}}{|{\mathcal{A}}(x_{t+m})|})+
aa,abP(xt+m,a,xt+m+1)(δe|𝒜(xt+m)|)}<\displaystyle\sum_{a\neq a^{*},a_{b}}P(x_{t+m},a,x_{t+m+1})(\frac{\delta_{e}}{|{\mathcal{A}}(x_{t+m})|})\}<
m=0t1{P(x¯t+m,a¯b,x¯t+m+1)(δb+δe|𝒜(x¯t+m)|)+\displaystyle\prod_{m=0}^{t^{*}-1}\{P(\bar{x}_{t+m},\bar{a}_{b},\bar{x}_{t+m+1})(\delta_{b}+\frac{\delta_{e}}{|{\mathcal{A}}(\bar{x}_{t+m})|})+
P(x¯t+m,a¯,x¯t+m+1)(1ϵ+δe|𝒜(x¯t+m)|)+\displaystyle P(\bar{x}_{t+m},\bar{a}^{*},\bar{x}_{t+m+1})(1-\epsilon+\frac{\delta_{e}}{|{\mathcal{A}}(\bar{x}_{t+m})|})+
a¯a¯,a¯bP(x¯t+m,a¯,x¯t+m+1)(δe|𝒜(x¯t+m)|)}.\displaystyle\sum_{\bar{a}\neq\bar{a}^{*},\bar{a}_{b}}P(\bar{x}_{t+m},\bar{a},\bar{x}_{t+m+1})(\frac{\delta_{e}}{|{\mathcal{A}}(\bar{x}_{t+m})|})\}. (29)

In (C-A), aba_{b} and a¯b\bar{a}_{b} stand for the biased action computed when the PMDP state is st+ms_{t+m} and s¯t+m\bar{s}_{t+m} (using the optimal path pjt+mp_{j^{*}}^{t+m}, as per (14), as discussed in Section III-C). The same notation extends to all other actions. The purpose of this notation is only to emphasize that the biased and greedy actions at st+ms_{t+m} and s¯t+m\bar{s}_{t+m} are not necessarily the same. By rearranging the terms in (C-A), we get the following result

m=0t1{P(xt+m,ab,xt+m+1)δb+\displaystyle\prod_{m=0}^{t^{*}-1}\{P(x_{t+m},a_{b},x_{t+m+1})\delta_{b}+
P(xt+m,a,xt+m+1)(1ϵ)+δe|𝒜(xt+m)|}<\displaystyle P(x_{t+m},a^{*},x_{t+m+1})(1-\epsilon)+\frac{\delta_{e}}{|{\mathcal{A}}(x_{t+m})|}\}<
m=0t1{P(x¯t+m,a¯b,x¯t+m+1)δb+\displaystyle\prod_{m=0}^{t^{*}-1}\{P(\bar{x}_{t+m},\bar{a}_{b},\bar{x}_{t+m+1})\delta_{b}+
P(x¯t+m,a¯,x¯t+m+1)(1ϵ)+δe|𝒜(x¯t+m)|}.\displaystyle P(\bar{x}_{t+m},\bar{a}^{*},\bar{x}_{t+m+1})(1-\epsilon)+\frac{\delta_{e}}{|{\mathcal{A}}(\bar{x}_{t+m})|}\}. (30)

Due to (IV-C), (C-A) can be expressed as β(pjt)<β(pj¯t)\beta(p_{j^{*}}^{t})<\beta(p_{\bar{j}}^{t}), which contradicts (22), completing the proof. (Notice that β(pjt)\beta(p_{j}^{t}) is equal to the probability that, starting from xtx_{t}, the MDP path pjtp_{j}^{t}, j𝒥j\in{\mathcal{J}}, will be generated by the end of the time step t+tt+t^{*}, under the proposed (ϵ,δ)(\epsilon,\delta)-greedy policy.)

C-B Proof of Proposition IV.6

This proof follows the same steps as the proof of Proposition IV.4. The inequality b(Rj=1)maxj𝒥g(Rj=1)\mathbb{P}_{b}(R_{j^{*}}=1)\geq\max_{j\in{\mathcal{J}}}\mathbb{P}_{g}(R_{j}=1) can be re-written as

m=0t1(a𝒜(xt+m)μb(st+m,a)P(xt+m,a,xt+m+1))\displaystyle\prod_{m=0}^{t^{*}-1}\left(\sum_{a\in{\mathcal{A}}(x_{t+m})}\mu_{b}(s_{t+m},a)P(x_{t+m},a,x_{t+m+1})\right)\geq (31)
maxj𝒥m=0t1(a¯𝒜(x¯t+m)μg(s¯t+m,a¯)P(x¯t+m,a¯,x¯t+m+1))\displaystyle\max_{j\in{\mathcal{J}}}\prod_{m=0}^{t^{*}-1}\left(\sum_{\bar{a}\in{\mathcal{A}}(\bar{x}_{t+m})}\mu_{g}(\bar{s}_{t+m},\bar{a})P(\bar{x}_{t+m},\bar{a},\bar{x}_{t+m+1})\right)

where st+m=(xt+m,qt)s_{t+m}=(x_{t+m},q_{t}), xt+m=pjt(m+1)x_{t+m}=p_{j^{*}}^{t}(m+1), s¯t+m=(x¯t+m,qt)\bar{s}_{t+m}=(\bar{x}_{t+m},q_{t}), and x¯t+m=pjt(m+1)\bar{x}_{t+m}=p_{j}^{t}(m+1), for all m{0,,t1}m\in\{0,\dots,t^{*}-1\}. We will show this result by contradiction. Assume that there exists at least one path pj¯tp_{\bar{j}}^{t}, j¯𝒥\bar{j}\in{\mathcal{J}}, that does not satisfy (31), i.e.,

m=0t1(a𝒜(xt+m)μb(st+m,a)P(xt+m,a,xt+m+1))<\displaystyle\prod_{m=0}^{t^{*}-1}\left(\sum_{a\in{\mathcal{A}}(x_{t+m})}\mu_{b}(s_{t+m},a)P(x_{t+m},a,x_{t+m+1})\right)< (32)
m=0t1(a¯𝒜(x¯t+m)μg(s¯t+m,a¯)P(x¯t+m,a,x¯t+m+1)),\displaystyle\prod_{m=0}^{t^{*}-1}\left(\sum_{\bar{a}\in{\mathcal{A}}(\bar{x}_{t+m})}\mu_{g}(\bar{s}_{t+m},\bar{a})P(\bar{x}_{t+m},a,\bar{x}_{t+m+1})\right),

where s¯t+m\bar{s}_{t+m} and x¯t+m\bar{x}_{t+m} are defined as per pj¯tp_{\bar{j}}^{t}.

In what follows, we denote by aa^{*} and aba_{b} the greedy and the biased action as per μb\mu_{b}, and a¯\bar{a}^{*} the greedy action as per μg\mu_{g}. We assume that aaba^{*}\neq a_{b}; the same logic applies even if this is not the case leading to the same final result. We plug the values of μb(st+m,a)\mu_{b}(s_{t+m},a) and μg(s¯t+m,a¯)\mu_{g}(\bar{s}_{t+m},\bar{a}) for all a𝒜(xt+m)a\in{\mathcal{A}}(x_{t+m}) and a¯𝒜(x¯t+m)\bar{a}\in{\mathcal{A}}(\bar{x}_{t+m}) in (32) yielding:

m=0t1{P(xt+m,ab,xt+m+1)δb+\displaystyle\prod_{m=0}^{t^{*}-1}\{P(x_{t+m},a_{b},x_{t+m+1})\delta_{b}+
P(xt+m,a,xt+m+1)(1ϵ)+δe|𝒜(xt+m)|}<\displaystyle P(x_{t+m},a^{*},x_{t+m+1})(1-\epsilon)+\frac{\delta_{e}}{|{\mathcal{A}}(x_{t+m})|}\}<
m=0t1{P(x¯t+m,a¯,x¯t+m+1)(1ϵ)+ϵ|𝒜(x¯t+m)|}\displaystyle\prod_{m=0}^{t^{*}-1}\{P(\bar{x}_{t+m},\bar{a}^{*},\bar{x}_{t+m+1})(1-\epsilon)+\frac{\epsilon}{|{\mathcal{A}}(\bar{x}_{t+m})|}\} (33)

Due to (IV-C) and (23), the result in (C-B) is equivalent to β(pjt)<η(pj¯t)\beta(p_{j^{*}}^{t})<\eta(p_{\bar{j}}^{t}), which contradicts (24), completing the proof. (Notice that η(pjt)\eta(p_{j}^{t}) is equal to the probability that, starting from xtx_{t}, the MDP path pjtp_{j}^{t} will be generated by the end of the time step t+tt+t^{*}, if the PMDP evolves as per the ϵ\epsilon-greedy policy.)

C-C Proof of Proposition IV.8

To show this result, it suffices to show that

b(xt+t𝒳goal)g(xt+t𝒳goal).\mathbb{P}_{b}(x_{t+t^{*}}\in{\mathcal{X}}_{\text{goal}})\geq\mathbb{P}_{g}(x_{t+t^{*}}\in{\mathcal{X}}_{\text{goal}}). (34)

The reason is that if at the time step t+tt+t^{*} an MDP state in 𝒳goal{\mathcal{X}}_{\text{goal}} is reached, then at the next time step t+t+1t+t^{*}+1, a DRA state in 𝒬goal{\mathcal{Q}}_{\text{goal}} will be reached. Notice that the MDP states in 𝒳goal{\mathcal{X}}_{\text{goal}} can be reached at the time step t+tt+t^{*} if any of the MDP paths pjtp_{j}^{t}, j𝒥j\in{\mathcal{J}}, originating at xtx_{t}, are followed. Let RjR_{j} be a (Bernoulli) random variable that is true if after tt^{*} time steps (i.e., at the time step t+t)t+t^{*}), a path pjtp_{j}^{t}, j𝒥j\in{\mathcal{J}}, has been generated under a policy μ\mu. Then, (34) can be equivalently expressed as:

j𝒥b(Rj=1)j𝒥g(Rj=1).\sum_{j\in{\mathcal{J}}}\mathbb{P}_{b}(R_{j}=1)\geq\sum_{j\in{\mathcal{J}}}\mathbb{P}_{g}(R_{j}=1). (35)

The rest of the proof follows the same logic as the proof of Proposition IV.6. First, we can rewrite (35) as follows:

j𝒥(m=0t1(a𝒜(xt+m)μb(st+m,a)P(xt+m,a,xt+m+1)))\displaystyle\sum_{j\in{\mathcal{J}}}\left(\prod_{m=0}^{t^{*}-1}\left(\sum_{a\in{\mathcal{A}}(x_{t+m})}\mu_{b}(s_{t+m},a)P(x_{t+m},a,x_{t+m+1})\right)\right)\geq (36)
j𝒥(m=0t1(a¯𝒜(x¯t+m)μg(s¯t+m,a¯)P(x¯t+m,a¯,x¯t+m+1))).\displaystyle\sum_{j\in{\mathcal{J}}}\left(\prod_{m=0}^{t^{*}-1}\left(\sum_{\bar{a}\in{\mathcal{A}}(\bar{x}_{t+m})}\mu_{g}(\bar{s}_{t+m},\bar{a})P(\bar{x}_{t+m},\bar{a},\bar{x}_{t+m+1})\right)\right).

Next, as in the proof of Proposition IV.6, we show that (36) holds by contradiction. Specifically, assume that (36) does not hold. Then, after plugging the values of μb(st+m,a)\mu_{b}(s_{t+m},a) and μg(s¯t+m,a¯)\mu_{g}(\bar{s}_{t+m},\bar{a}) for all a𝒜(xt+m)a\in{\mathcal{A}}(x_{t+m}) and a¯𝒜(x¯t+m)\bar{a}\in{\mathcal{A}}(\bar{x}_{t+m}) in (36) and after rearranging the terms, we get that j𝒥β(pjt)<j𝒥η(pjt)\sum_{j\in{\mathcal{J}}}\beta(p_{j}^{t})<\sum_{j\in{\mathcal{J}}}\eta(p_{j}^{t}). This contradicts (25) completing the proof.

Appendix D Decay Rates in Numerical Simulations

In this section, we mathematically define the decay rates used for ϵ,δb\epsilon,\delta_{b}, and δe\delta_{e} in Section V. The parameter ϵ\epsilon evolves over episodes epi, as ϵ(epi)=1/(epiα)\epsilon(\texttt{epi})=1/(\texttt{epi}^{\alpha}), where α\alpha is selected to be equal to 0.10.1 for 𝔐1\mathfrak{M}_{1} and 𝔐2\mathfrak{M}_{2} and 0.050.05 for 𝔐3\mathfrak{M}_{3}. In ‘Biased 1’, δb\delta_{b} and δe\delta_{e} evolve over episodes, as δb(epi)=(11epiβ)ϵ(epi)\delta_{b}(\texttt{epi})=(1-\frac{1}{\texttt{epi}^{\beta}})\epsilon(\texttt{epi}) and δe(epi)=ϵ(epi)epiβ\delta_{e}(\texttt{epi})=\frac{\epsilon(\texttt{epi})}{\texttt{epi}^{\beta}}. We select β=0.4\beta=0.4 for 𝔐1\mathfrak{M}_{1} and 𝔐2\mathfrak{M}_{2} and β=0.15\beta=0.15 for 𝔐3\mathfrak{M}_{3}. Observe that δb(epi)+δe(epi)=ϵ(epi)\delta_{b}(\texttt{epi})+\delta_{e}(\texttt{epi})=\epsilon(\texttt{epi}). To define ‘Biased 2’ and ‘Biased 3’, we first need to define the following function, denoted by g(epi)g(\texttt{epi}). If epi<100\texttt{epi}<100, then g(epi)=10.9exp(Aepi)g(\texttt{epi})=1-0.9\exp(-A\texttt{epi}). Otherwise, we have that g(epi)=10.1exp(Aepi)g(\texttt{epi})=1-0.1\exp(-A\texttt{epi}) for some AA. Then, we have that δb(epi)\delta_{b}(\texttt{epi}) and δe(epi)\delta_{e}(\texttt{epi}) evolve as δb(epi)=(1g(epi))ϵ(epi)\delta_{b}(\texttt{epi})=(1-g(\texttt{epi}))\epsilon(\texttt{epi}) and δe(epi)=g(epi)ϵ(epi)\delta_{e}(\texttt{epi})=g(\texttt{epi})\epsilon(\texttt{epi}). This choice prioritizes random exploration during the first 100100 episodes. The larger the AA, the faster δb\delta_{b} converges to 0. Regarding 𝔐1\mathfrak{M}_{1} and 𝔐2\mathfrak{M}_{2}, we select A=0.00015A=0.00015 for ‘Biased 2’, A=0.0015A=0.0015 for ‘Biased 3’, and A=A=\infty for ‘Random’. As for 𝔐3\mathfrak{M}_{3}, we choose A=0.000015A=0.000015 for ‘Biased 2’ and A=0.00015A=0.00015 for ‘Biased 3’.
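A minimal sketch of the ‘Biased 1’ schedule defined above (the ‘Biased 2’ and ‘Biased 3’ schedules follow the same pattern, with the gg-based split of ϵ\epsilon in place of the epiβ\texttt{epi}^{\beta} split):

```python
def biased1_schedule(epi, alpha=0.1, beta=0.4):
    """Exploration parameters of 'Biased 1' at episode epi (>= 1),
    with alpha and beta as used for M1 and M2."""
    eps = 1.0 / (epi ** alpha)
    delta_e = eps / (epi ** beta)
    delta_b = eps - delta_e          # equals (1 - 1/epi**beta) * eps
    return eps, delta_b, delta_e

# e.g., biased1_schedule(1) -> (1.0, 0.0, 1.0): purely random exploration at
# the first episode; delta_b then grows relative to delta_e as epi increases,
# while all three parameters decay to 0 as epi goes to infinity.
```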

References

  • [1] B. R. Kiran, I. Sobh, V. Talpaert, P. Mannion, A. A. Al Sallab, S. Yogamani, and P. Pérez, “Deep reinforcement learning for autonomous driving: A survey,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 6, pp. 4909–4926, 2021.
  • [2] D. Dewey, “Reinforcement learning and the reward engineering principle,” in AAAI Spring Symposium Series, 2014.
  • [3] C. Baier and J.-P. Katoen, Principles of model checking.   MIT Press, 2008.
  • [4] J. Wang, X. Ding, M. Lahijanian, I. C. Paschalidis, and C. A. Belta, “Temporal logic motion control using actor–critic methods,” The International Journal of Robotics Research, vol. 34, no. 10, pp. 1329–1344, 2015.
  • [5] E. M. Hahn, M. Perez, S. Schewe, F. Somenzi, A. Trivedi, and D. Wojtczak, “Omega-regular objectives in model-free reinforcement learning,” International Conference on Tools and Algorithms for the Construction and Analysis of Systems, 2018.
  • [6] Q. Gao, D. Hajinezhad, Y. Zhang, Y. Kantaros, and M. M. Zavlanos, “Reduced variance deep reinforcement learning with temporal logic specifications,” in ACM/IEEE International Conference on Cyber-Physical Systems, Montreal, Canada, 2019.
  • [7] M. Bouton, J. Karlsson, A. Nakhaei, K. Fujimura, M. J. Kochenderfer, and J. Tumova, “Reinforcement learning with probabilistic guarantees for autonomous driving,” arXiv preprint arXiv:1904.07189, 2019.
  • [8] M. Hasanbeig, Y. Kantaros, A. Abate, D. Kroening, G. J. Pappas, and I. Lee, “Reinforcement learning for temporal logic control synthesis with probabilistic satisfaction guarantees,” in 2019 IEEE 58th Conference on Decision and Control (CDC), Nice, France, 2019, pp. 5338–5343.
  • [9] A. K. Bozkurt, Y. Wang, M. M. Zavlanos, and M. Pajic, “Control synthesis from linear temporal logic specifications using model-free reinforcement learning,” in 2020 IEEE International Conference on Robotics and Automation (ICRA), 2020, pp. 10 349–10 355.
  • [10] M. Cai, H. Peng, Z. Li, and Z. Kan, “Learning-based probabilistic LTL motion planning with environment and motion uncertainties,” IEEE Transactions on Automatic Control, vol. 66, no. 5, pp. 2386–2392, 2020.
  • [11] A. Lavaei, F. Somenzi, S. Soudjani, A. Trivedi, and M. Zamani, “Formal controller synthesis for continuous-space MDPs via model-free reinforcement learning,” in ACM/IEEE 11th International Conference on Cyber-Physical Systems (ICCPS).   IEEE, 2020, pp. 98–107.
  • [12] K. Jothimurugan, S. Bansal, O. Bastani, and R. Alur, “Compositional reinforcement learning from logical specifications,” in Thirty-Fifth Conference on Neural Information Processing Systems, 2021.
  • [13] M. Hasanbeig, D. Kroening, and A. Abate, “Lcrl: Certified policy synthesis via logically-constrained reinforcement learning,” in Quantitative Evaluation of Systems: 19th International Conference, QEST 2022, Warsaw, Poland, September 12–16, 2022, Proceedings.   Springer, 2022, pp. 217–231.
  • [14] H. Hasanbeig, D. Kroening, and A. Abate, “Certified reinforcement learning with logic guidance,” Artificial Intelligence, vol. 322, 2023.
  • [15] A. K. Bozkurt, Y. Wang, M. M. Zavlanos, and M. Pajic, “Learning optimal strategies for temporal tasks in stochastic games,” IEEE Transactions on Automatic Control, 2024.
  • [16] Z. Xuan, A. K. Bozkurt, M. Pajic, and Y. Wang, “On the uniqueness of solution for the Bellman equation of LTL objectives,” in Learning for Dynamics and Control, 2024.
  • [17] M. Hasanbeig, N. Y. Jeppu, A. Abate, T. Melham, and D. Kroening, “Deepsynth: Automata synthesis for automatic task segmentation in deep reinforcement learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 9, 2021, pp. 7647–7656.
  • [18] R. T. Icarte, T. Klassen, R. Valenzano, and S. McIlraith, “Using reward machines for high-level task specification and decomposition in reinforcement learning,” in International Conference on Machine Learning.   PMLR, 2018, pp. 2107–2116.
  • [19] Z. Wen, D. Precup, M. Ibrahimi, A. Barreto, B. Van Roy, and S. Singh, “On efficiency in hierarchical reinforcement learning,” Advances in Neural Information Processing Systems, vol. 33, pp. 6708–6718, 2020.
  • [20] R. Toro Icarte, T. Q. Klassen, R. Valenzano, and S. A. McIlraith, “Reward machines: Exploiting reward function structure in reinforcement learning,” Journal of Artificial Intelligence Research, vol. 73, pp. 173–208, 2022.
  • [21] M. Cai, M. Mann, Z. Serlin, K. Leahy, and C.-I. Vasile, “Learning minimally-violating continuous control for infeasible linear temporal logic specifications,” in American Control Conference (ACC), 2023, pp. 1446–1452.
  • [22] A. Balakrishnan, S. Jakšić, E. A. Aguilar, D. Ničković, and J. V. Deshmukh, “Model-free reinforcement learning for spatiotemporal tasks using symbolic automata,” in 62nd IEEE Conference on Decision and Control (CDC), Singapore, 2023, pp. 6834–6840.
  • [23] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell, “Curiosity-driven exploration by self-supervised prediction,” in International conference on machine learning.   PMLR, 2017, pp. 2778–2787.
  • [24] Y. Zhai, C. Baek, Z. Zhou, J. Jiao, and Y. Ma, “Computational benefits of intermediate rewards for goal-reaching policy learning,” Journal of Artificial Intelligence Research, vol. 73, pp. 847–896, 2022.
  • [25] M. Cai, E. Aasi, C. Belta, and C.-I. Vasile, “Overcoming exploration: Deep reinforcement learning for continuous control in cluttered environments from temporal logic specifications,” IEEE Robotics and Automation Letters, vol. 8, no. 4, pp. 2158–2165, 2023.
  • [26] L. P. Kaelbling, M. L. Littman, and A. W. Moore, “Reinforcement learning: A survey,” Journal of artificial intelligence research, vol. 4, pp. 237–285, 1996.
  • [27] N. Cesa-Bianchi, C. Gentile, G. Lugosi, and G. Neu, “Boltzmann exploration done right,” Advances in neural information processing systems, vol. 30, 2017.
  • [28] R. Y. Chen, S. Sidor, P. Abbeel, and J. Schulman, “UCB exploration via Q-ensembles,” arXiv preprint arXiv:1706.01502, 2017.
  • [29] S. Amin, M. Gomrokchi, H. Satija, H. van Hoof, and D. Precup, “A survey of exploration methods in reinforcement learning,” arXiv preprint arXiv:2109.00157, 2021.
  • [30] J. Fu and U. Topcu, “Probably approximately correct MDP learning and control with temporal logic constraints,” arXiv preprint arXiv:1404.7073, 2014.
  • [31] T. Brázdil, K. Chatterjee, M. Chmelik, V. Forejt, J. Křetínskỳ, M. Kwiatkowska, D. Parker, and M. Ujma, “Verification of Markov decision processes using learning algorithms,” in Automated Technology for Verification and Analysis: 12th International Symposium, ATVA 2014, Sydney, NSW, Australia, November 3-7, 2014, Proceedings 12. Springer, 2014, pp. 98–114.
  • [32] F. Fernández, J. García, and M. Veloso, “Probabilistic policy reuse for inter-task transfer learning,” Robotics and Autonomous Systems, vol. 58, no. 7, pp. 866–871, 2010.
  • [33] Y. Kantaros and M. M. Zavlanos, “Stylus*: A temporal logic optimal control synthesis algorithm for large-scale multi-robot systems,” The International Journal of Robotics Research, vol. 39, no. 7, pp. 812–836, 2020.
  • [34] Y. Kantaros, S. Kalluraya, Q. Jin, and G. J. Pappas, “Perception-based temporal logic planning in uncertain semantic maps,” IEEE Transactions on Robotics, 2022.
  • [35] X. Ding, M. Lazar, and C. Belta, “LTL receding horizon control for finite deterministic systems,” Automatica, vol. 50, no. 2, pp. 399–408, 2014.
  • [36] B. Lacerda, D. Parker, and N. Hawes, “Optimal policy generation for partially satisfiable co-safe LTL specifications,” in International Joint Conference on Artificial Intelligence, vol. 15. Citeseer, 2015, pp. 1587–1593.
  • [37] Y. Kantaros, “Accelerated reinforcement learning for temporal logic control objectives,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, Kyoto, Japan, October 2022.
  • [38] S. Sickert, J. Esparza, S. Jaax, and J. Křetínskỳ, “Limit-deterministic Büchi automata for linear temporal logic,” in CAV, 2016, pp. 312–332.
  • [39] M. Cai, S. Xiao, Z. Li, and Z. Kan, “Optimal probabilistic motion planning with potential infeasible LTL constraints,” IEEE Transactions on Automatic Control, 2021.
  • [40] Software: https://github.com/kantaroslab/AccRL.
  • [41] M. Kloetzer and C. Belta, “A fully automated framework for control of linear systems from temporal logic specifications,” IEEE Transactions on Automatic Control, vol. 53, no. 1, pp. 287–297, 2008.
  • [42] G. E. Fainekos, A. Girard, H. Kress-Gazit, and G. J. Pappas, “Temporal logic motion planning for dynamic robots,” Automatica, vol. 45, no. 2, pp. 343–352, 2009.
  • [43] K. Leahy, D. Zhou, C.-I. Vasile, K. Oikonomopoulos, M. Schwager, and C. Belta, “Persistent surveillance for unmanned aerial vehicles subject to charging and temporal logic constraints,” Autonomous Robots, vol. 40, no. 8, pp. 1363–1378, 2016.
  • [44] M. Guo and M. M. Zavlanos, “Distributed data gathering with buffer constraints and intermittent communication,” in International Conference on Robotics and Automation, May-June 2017, pp. 279–284.
  • [45] Y. Kantaros and M. M. Zavlanos, “Distributed intermittent connectivity control of mobile robot networks,” IEEE Transactions on Automatic Control, vol. 62, no. 7, pp. 3109–3121, 2017.
  • [46] J. Fang, Z. Zhang, and R. V. Cowlagi, “Decentralized route-planning for multi-vehicle teams to satisfy a subclass of linear temporal logic specifications,” Automatica, vol. 140, p. 110228, 2022.
  • [47] C. I. Vasile, X. Li, and C. Belta, “Reactive sampling-based path planning with temporal logic specifications,” The International Journal of Robotics Research, vol. 39, no. 8, pp. 1002–1028, 2020.
  • [48] X. C. D. Ding, S. L. Smith, C. Belta, and D. Rus, “LTL control in uncertain environments with probabilistic satisfaction guarantees,” IFAC Proceedings Volumes, vol. 44, no. 1, pp. 3515–3520, 2011.
  • [49] M. Guo and M. M. Zavlanos, “Probabilistic motion planning under temporal tasks and soft constraints,” IEEE Transactions on Automatic Control, 2018.
  • [50] M. L. Puterman, Markov decision processes: Discrete stochastic dynamic programming.   John Wiley & Sons, 2014.
  • [51] X. Ding, S. L. Smith, C. Belta, and D. Rus, “Optimal control of Markov decision processes with linear temporal logic constraints,” IEEE Trans. on Automatic Control, vol. 59, no. 5, pp. 1244–1257, 2014.
  • [52] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction.   MIT press, 2018.
  • [53] A. Abate, M. Prandini, J. Lygeros, and S. Sastry, “Probabilistic reachability and safety for controlled discrete time stochastic hybrid systems,” Automatica, vol. 44, no. 11, pp. 2724–2734, 2008.
  • [54] ltl2dstar, https://www.ltl2dstar.de/.
Yiannis Kantaros (S’14-M’18) is an Assistant Professor in the Department of Electrical and Systems Engineering, Washington University in St. Louis (WashU), St. Louis, MO, USA. He received the Diploma in Electrical and Computer Engineering in 2012 from the University of Patras, Patras, Greece. He also received the M.Sc. and the Ph.D. degrees in mechanical engineering from Duke University, Durham, NC, in 2017 and 2018, respectively. Prior to joining WashU, he was a postdoctoral associate in the Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA. His current research interests include machine learning, distributed control and optimization, and formal methods with applications in robotics. He received the Best Student Paper Award at the IEEE Global Conference on Signal and Information Processing (GlobalSIP) in 2014, was a finalist for the Best Multi-Robot Systems Paper Award at the IEEE International Conference on Robotics and Automation (ICRA) in 2024, and received the 2017-18 Outstanding Dissertation Research Award from the Department of Mechanical Engineering and Materials Science, Duke University, as well as a 2024 NSF CAREER Award.
Jun Wang (S’22) is a PhD candidate in the Department of Electrical and Systems Engineering at Washington University in St. Louis. He received his B.Eng. degree in Software Engineering from Sun Yat-Sen University in 2019 and his MSE degree in Robotics from the University of Pennsylvania in 2021. His research interests include robotics, machine learning, and control theory.