
Convergence Guarantee of Dynamic Programming for LTL Surrogate Reward

Zetong Xuan and Yu Wang

Zetong Xuan and Yu Wang are with the Department of Mechanical & Aerospace Engineering, University of Florida, Gainesville, FL 32611, USA.
Abstract

Linear Temporal Logic (LTL) is a formal way of specifying complex objectives for planning problems modeled as Markov Decision Processes (MDPs). The planning problem aims to find the optimal policy that maximizes the satisfaction probability of the LTL objective. One way to solve the planning problem is to use a surrogate reward with two discount factors and dynamic programming, which bypasses the graph analysis used in traditional model-checking. The surrogate reward is designed such that its value function represents the satisfaction probability. However, in cases where one of the discount factors is set to $1$ for higher accuracy, the convergence of dynamic programming to the value function is not guaranteed. This work shows that a multi-step contraction always exists during dynamic programming updates, guaranteeing that the approximate value function converges exponentially to the true value function. Thus, the computation of the satisfaction probability is guaranteed.

I Introduction

Modern autonomous systems need to solve planning problems for complex rule-based tasks that are usually expressible in linear temporal logic (LTL) [1]. LTL is a symbolic language for describing high-level tasks like reaching a sequence of goals or ordering a set of events. The planning problem with an LTL objective is to find the optimal policy that maximizes the probability of satisfying the given LTL objective. This problem can be formally treated as a quantitative model-checking problem [2] when the environment is modeled as an MDP with known transition probabilities. That is, given an MDP and an LTL objective, find the maximum satisfaction probability over all possible policies.

Although an LTL objective can be complex, its maximal satisfaction probability and optimal policy can be computed by quantitative (or probabilistic) model checking [2, 3, 4] via reachability. First, a specific set of states is selected so that the reachability probability of this set equals the maximum satisfaction probability of the LTL objective. This set of states is identified by graph analysis, such as a depth-first search, on the product MDP. The product MDP is the product of the original MDP and an $\omega$-regular automaton, encoding all the information necessary for quantitative model-checking. Then, dynamic programming or linear programming is applied to compute the reachability probability of the set.

An alternative approach finds the optimal policy via a surrogate reward function without the graph analysis, and is thus more amenable to model-free reinforcement learning (RL) [5, 6, 7, 8]. The surrogate reward is a reward function automatically derived from the LTL objective whose value function represents the satisfaction probability of the LTL objective. The satisfaction probability can then be computed using dynamic programming, which iteratively updates an approximate value function by the Bellman equation. Ideally, the approximate value function converges to the value function during the updates. Furthermore, the optimal policy and the corresponding maximum satisfaction probability can be found by policy or value iteration. Another advantage of the surrogate reward is that it transfers smoothly to model-free RL when the MDP transitions are unknown; in that case, the two-phase model-checking approach has to be turned into model-based RL and requires additional computation [9].

This work focuses on a widely used surrogate reward [8] based on the limit-deterministic Büchi automaton. It assigns a constant reward to “good” states and uses two discount factors. Given a policy, the probability of visiting the “good” states infinitely often equals the satisfaction probability of the LTL objective. The surrogate reward yields a value function equal to the satisfaction probability in the limit as both discount factors go to one.

Nevertheless, whether dynamic programming based on this surrogate reward can find the satisfaction probability has still not been fully studied. We notice that more recent works [10, 11, 12, 13] allow one discount factor to equal $1$ while using the surrogate reward with two discount factors from [8]. Their Bellman equations admit multiple solutions, as the discount factor of $1$ applies in many states. The approximate value function updated by these Bellman equations may not converge to the value function. [14] proposes a sufficient condition to identify the value function among the multiple solutions satisfying the Bellman equation. However, when discounting is absent in many states, whether dynamic programming or RL based on the Bellman equation yields the value function has still not been fully answered.

This work gives a convergence guarantee for the approximate value function during dynamic programming updates. We find an upper bound on the infinity norm of the approximation error that decays exponentially. Even though a one-step dynamic programming update does not provide a contraction on the approximation error, we show that a multi-step contraction always occurs once the dynamic programming update is applied for enough steps. We verify our result in a case study where our upper bound holds and the approximation error goes to zero. The intuition behind our result is that, although only a few states have discounting, the discounting applied to them propagates to the other states via the reachability between states.

II Preliminaries

This section introduces how an LTL objective can be translated into a surrogate reward function. First, we formulate the LTL planning problem as an MDP with a Büchi objective, which is a standard approach used by probabilistic model-checking [2, 15] and other LTL planning works (e.g., [8]). Then, we show that the satisfaction probability of the Büchi objective can be expressed in another form, i.e., as the value function of a surrogate reward.

II-A Modeling planning problems with LTL objectives as MDPs with Büchi objectives

We propose a model called a Markov decision process with Büchi objective, which augments a general MDP [2] with a set of accepting states.

Definition 1

A Markov decision process with Büchi objective is a tuple $\mathcal{M}=(S,A,P,s_{\mathrm{init}},B)$ where

  • $S$ is a finite set of states and $s_{\mathrm{init}}\in S$ is the initial state,

  • $A$ is a finite set of actions, where $A(s)$ denotes the set of actions allowed in state $s\in S$,

  • $P:S\times A\times S\to[0,1]$ is the transition probability function such that for all $s\in S$, we have

    \sum_{s^{\prime}\in S}P(s,a,s^{\prime})=\begin{cases}1,&a\in A(s)\\ 0,&a\notin A(s),\end{cases}
  • $B\subseteq S$ is the set of accepting states; the set of rejecting states is $\neg B:=S\backslash B$.
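To make Definition 1 concrete, the following minimal Python sketch shows one possible encoding of an MDP with a Büchi objective; the class name BuchiMDP and its field names are our own illustration, not notation from the paper.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class BuchiMDP:
    """An MDP with a Buchi objective (Definition 1), states indexed 0..|S|-1.

    P[a, s, t] is the probability of moving from s to t under action a;
    rows of disallowed actions are all zeros, as in the definition.
    """
    P: np.ndarray          # shape (|A|, |S|, |S|)
    s_init: int            # index of the initial state
    accepting: np.ndarray  # boolean mask of length |S| marking the set B

# toy instance: two states, one action, state 1 is the only accepting state
P = np.array([[[0.5, 0.5],
               [0.0, 1.0]]])
mdp = BuchiMDP(P=P, s_init=0, accepting=np.array([False, True]))
assert np.allclose(mdp.P.sum(axis=2), 1.0)  # here every action is allowed
```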

A path of the MDP $\mathcal{M}$ is an infinite state sequence $\sigma=s_{0}s_{1}s_{2}\cdots$ such that for all $i\geq 0$, there exists $a_{i}\in A(s_{i})$ with $P(s_{i},a_{i},s_{i+1})>0$. Given a path $\sigma$, the $i$th state is denoted by $\sigma[i]=s_{i}$. We denote the prefix by $\sigma[{:}i]=s_{0}s_{1}\cdots s_{i}$ and the suffix by $\sigma[i{+}1{:}]=s_{i+1}s_{i+2}\cdots$. We say a path $\sigma$ satisfies the Büchi objective $\varphi_{B}$ if $\mathrm{inf}(\sigma)\cap B\neq\emptyset$, where $\mathrm{inf}(\sigma)$ denotes the set of states visited infinitely many times on $\sigma$.

Our model is valid because an LTL objective can be translated into a Büchi objective. The translation is done by constructing a product MDP from the original MDP and a limit-deterministic Büchi automaton [15] generated from the LTL objective. Instead of looking for an optimal memory-dependent policy for the LTL objective, one can find an optimal memoryless policy on the product MDP [15], since the maximum satisfaction probability of a Büchi objective can be achieved by a memoryless deterministic policy.

II-B Policy evaluation for Büchi objective via reachability

We have recast the LTL planning problem as seeking the policy that maximizes the satisfaction probability of the Büchi objective. Evaluating a policy requires calculating the satisfaction probability, which can be done by calculating a reachability probability.

Definition 2

A memoryless policy $\pi$ is a function $\pi:S\to A$ such that $\pi(s)\in A(s)$ for all $s\in S$. Given an MDP $\mathcal{M}=(S,A,P,s_{\mathrm{init}},B)$ and a memoryless policy $\pi$, the Markov chain (MC) induced by $\pi$ is the tuple $\mathcal{M}_{\pi}=(S,P_{\pi},s_{\mathrm{init}},B)$ where $P_{\pi}(s,s^{\prime})=P(s,\pi(s),s^{\prime})$ for all $s,s^{\prime}\in S$.

Under the policy $\pi$, for a path $\sigma$ starting at state $s$, the satisfaction probability of the Büchi objective $\varphi_{B}$ is defined as

\mathbb{P}_{\pi}(s,B):=Pr_{\sigma\sim\mathcal{M}_{\pi}}\big(\mathrm{inf}(\sigma)\cap B\neq\emptyset\mid\exists t:\sigma[t]=s\big). (1)

The satisfaction probability can be computed as a reachability probability: the probability that a path under the policy $\pi$ satisfies the Büchi objective equals the probability that the path enters an accepting bottom strongly connected component (BSCC) of the induced MC $\mathcal{M}_{\pi}$.

Definition 3

A bottom strongly connected component (BSCC) of an MC is a strongly connected component without outgoing transitions. A strongly connected component of an MC is a communicating class, i.e., a maximal set of states that communicate with each other. A BSCC is rejecting if all of its states satisfy $s\notin B$; otherwise, we call it an accepting BSCC. (Here we call a state $s\in B$ an accepting state and a state $s\notin B$ a rejecting state. Note that an accepting state cannot exist in a rejecting BSCC, while a rejecting state may exist in an accepting BSCC.)

Any path on the MC will eventually enter a BSCC and stay there. Any path entering an accepting BSCC will visit accepting states infinitely many times, therefore satisfying the Büchi objective.

The calculation of the reachability probability can be done via dynamic programming after one detects all the BSCCs. This approach is adopted by probabilistic model-checking and requires complete model knowledge.
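As a point of reference, this model-checking route could be sketched as follows in Python (using numpy and networkx); the function names are hypothetical, and the induced chain is assumed to be given as a row-stochastic matrix P_pi with a boolean mask accepting for the set B.

```python
import numpy as np
import networkx as nx

def accepting_bscc_states(P_pi, accepting):
    """States that lie in some accepting BSCC of the induced Markov chain."""
    G = nx.from_numpy_array((P_pi > 0).astype(int), create_using=nx.DiGraph)
    good = set()
    for scc in nx.strongly_connected_components(G):
        is_bottom = all(set(G.successors(s)) <= scc for s in scc)  # no outgoing transition
        if is_bottom and any(accepting[s] for s in scc):
            good |= scc
    return good

def satisfaction_probability(P_pi, accepting):
    """Probability of reaching an accepting BSCC, computed from a linear system."""
    n = P_pi.shape[0]
    G = nx.from_numpy_array((P_pi > 0).astype(int), create_using=nx.DiGraph)
    target = accepting_bscc_states(P_pi, accepting)
    reach = set(target)
    for t in target:                      # states that can reach an accepting BSCC
        reach |= nx.ancestors(G, t)
    unknown = sorted(reach - target)      # every other state has probability 0 or 1
    prob = np.zeros(n)
    prob[list(target)] = 1.0
    if unknown:
        A = np.eye(len(unknown)) - P_pi[np.ix_(unknown, unknown)]
        b = P_pi[np.ix_(unknown, sorted(target))].sum(axis=1)
        prob[unknown] = np.linalg.solve(A, b)
    return prob
```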

II-C Surrogate reward that approximates the satisfaction probability

A surrogate reward allows one to calculate the satisfaction probability even without knowledge of the BSCCs. This equivalent representation of the Büchi objective allows one to recast an LTL objective as a discounted reward.

We study the surrogate reward for the Büchi objective proposed in [8], which is also widely used in [10, 11, 12, 13, 16]. This surrogate reward is designed so that its value function approximates the satisfaction probability. It consists of a reward function $R:S\to\mathbb{R}$ and a state-dependent discount factor function $\Gamma:S\to(0,1]$ with two discount factors $0<\gamma_{B}<\gamma\leq 1$,

R(s):=\begin{cases}1-\gamma_{B}&s\in B\\ 0&s\notin B\end{cases},\quad\Gamma(s):=\begin{cases}\gamma_{B}&s\in B\\ \gamma&s\notin B\end{cases}. (2)

A positive reward is collected only when an accepting state is visited along the path. For this surrogate reward, the $K$-step return ($K\in\mathbb{N}$ or $K=\infty$) of a path from time $t\in\mathbb{N}$ is

G_{t:K}(\sigma):=\sum_{i=0}^{K}R(\sigma[t+i])\cdot\prod_{j=0}^{i-1}\Gamma(\sigma[t+j]),
G_{t}(\sigma):=\lim_{K\to\infty}G_{t:K}(\sigma). (3)

Here, the definition follows a standard discounted reward setting [17] but allows state-dependent discounting. Suppose the discount factor $\gamma=1$. If a path satisfies the Büchi objective, its return is the sum of a geometric series, $\sum_{i=0}^{\infty}(1-\gamma_{B})\gamma_{B}^{i}=\frac{1-\gamma_{B}}{1-\gamma_{B}}=1$.
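To make the return (3) concrete, here is a small sketch of the surrogate reward (2) and the $K$-step return of a sampled path; the helper names are ours, and the path is assumed to be given as a list of state indices.

```python
import numpy as np

def surrogate_reward(accepting, gamma_B, gamma):
    """Reward R(s) and state-dependent discount Gamma(s) from equation (2)."""
    R = np.where(accepting, 1.0 - gamma_B, 0.0)
    Gam = np.where(accepting, gamma_B, gamma)
    return R, Gam

def k_step_return(path, R, Gam, K):
    """K-step return G_{0:K} of a path prefix, following equation (3)."""
    G, discount = 0.0, 1.0
    for s in path[:K + 1]:
        G += discount * R[s]   # R(sigma[i]) times the accumulated discount
        discount *= Gam[s]
    return G

# sanity check: with gamma = 1, a path staying in an accepting state has return -> 1
R, Gam = surrogate_reward(np.array([True]), gamma_B=0.99, gamma=1.0)
print(k_step_return([0] * 2000, R, Gam, K=1999))  # 1 - 0.99**2000, close to 1
```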

The value function $V_{\pi}(s)$ is the expected return conditioned on the path starting at $s$ under the policy $\pi$. It is related to the Büchi objective as follows,

V_{\pi}(s)=\mathbb{E}_{\pi}[G_{t}(\sigma)\,|\,\sigma[t]=s]
=\mathbb{E}_{\pi}[G_{t}(\sigma)\mid\sigma[t]=s,\mathrm{inf}(\sigma)\cap B\neq\emptyset]\cdot\mathbb{P}_{\pi}(s,B)
\quad+\mathbb{E}_{\pi}[G_{t}(\sigma)\mid\sigma[t]=s,\mathrm{inf}(\sigma)\cap B=\emptyset]\cdot(1-\mathbb{P}_{\pi}(s,B)), (4)

where $1-\mathbb{P}_{\pi}(s,B)$ is the probability that a path starting at $s$ does not satisfy the Büchi objective. As $\gamma_{B}$ and $\gamma$ get close to $1$, the value function becomes close to $\mathbb{P}_{\pi}(s,B)$, since

\lim_{\gamma\to 1^{-}}\mathbb{E}_{\pi}[G_{t}(\sigma)\mid\sigma[t]=s,\mathrm{inf}(\sigma)\cap B\neq\emptyset]=1,
\lim_{\gamma_{B}\to 1^{-}}\mathbb{E}_{\pi}[G_{t}(\sigma)\mid\sigma[t]=s,\mathrm{inf}(\sigma)\cap B=\emptyset]=0. (5)

The choice of $\gamma_{B}$ and $\gamma$ is critical when solving a discounted reward problem by value iteration or Q-learning. These methods are guaranteed to find the optimal policy for the surrogate reward when $\gamma<1$. To make sure the optimal policy for the surrogate reward is also the optimal policy for the LTL objective, one has to take $\gamma$ and $\gamma_{B}$ as close to $1$ as possible to reduce the error between the value function and the satisfaction probability. One can never take $\gamma_{B}=1$, since $1-\gamma_{B}=0$ cannot serve as a positive reward. Setting $\gamma=1$ appears to work in several works. However, [14] shows that setting $\gamma=1$ breaks the uniqueness of the solution of the Bellman equation, thus hindering a correct evaluation of the satisfaction probability.

Remark 1

Other surrogate rewards based on Büchi and Rabin automata have been studied but have flaws. The surrogate reward in [6], based on limit-deterministic Büchi automata, assigns a constant reward to “good” states with a constant discount factor; this approach is technically flawed, as demonstrated by [7]. The surrogate reward in [5], based on Rabin automata, assigns constant positive rewards to certain “good” states and negative rewards to “bad” states; this surrogate reward is also not technically correct, as demonstrated in [18].

III Problem Formulation and main result

This section introduces a key challenge in calculating the satisfaction probability and our answer to it. The calculation relies on the recursive formulation given by the Bellman equation, which classically requires two conditions: $\mathrm{(i)}$ the Bellman equation has a unique solution, and $\mathrm{(ii)}$ a discount acts on every dynamic programming update. Our surrogate reward breaks both conditions when $\gamma=1$ is allowed. Here, we investigate how dynamic programming calculates the satisfaction probability under state-dependent discounting.

III-A Bellman equation and sufficient condition for the uniqueness of the solution

Given a policy, the value function satisfies the Bellman equation. (We call $V_{\pi}(s)=R(s)+\Gamma(s)\sum_{s^{\prime}\in S}P_{\pi}(s,s^{\prime})V_{\pi}(s^{\prime})$ the Bellman equation and $V_{\pi}^{*}(s)=\max_{a\in A(s)}\{R(s)+\Gamma(s)\sum_{s^{\prime}\in S}P(s,a,s^{\prime})V_{\pi}^{*}(s^{\prime})\}$ the Bellman optimality equation.) The Bellman equation follows from the fact that the value of the current state equals the expectation of the current reward plus the discounted value of the next state. For the surrogate reward in equation (2), the Bellman equation is given as follows:

V_{\pi}(s)=\begin{cases}1-\gamma_{B}+\gamma_{B}\sum_{s^{\prime}\in S}P_{\pi}(s,s^{\prime})V_{\pi}(s^{\prime})&s\in B\\ \gamma\sum_{s^{\prime}\in S}P_{\pi}(s,s^{\prime})V_{\pi}(s^{\prime})&s\notin B\end{cases}. (6)

Previous work [10, 13, 11] allows $\gamma=1$. However, setting $\gamma=1$ yields multiple solutions of the Bellman equation, raising concerns about applying dynamic programming or RL. A sufficient condition that guarantees the uniqueness of the solution is proposed in [14].

Lemma 1

The Bellman equation (6) has the value function as its unique solution if and only if the solution is zero for every state in a rejecting BSCC [14].
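As an illustration of how Lemma 1 can be used, one could pin the solution to zero on the rejecting BSCCs and solve the remaining linear system for the value function; the following is a sketch under the assumption that the rejecting-BSCC states are already known (e.g., from a BSCC detection as in Section II-B), with hypothetical function and argument names.

```python
import numpy as np

def value_from_bellman(P_pi, accepting, rejecting_bscc, gamma_B, gamma):
    """Solve the Bellman equation (6) with V pinned to zero on rejecting BSCCs."""
    n = P_pi.shape[0]
    R = np.where(accepting, 1.0 - gamma_B, 0.0)
    Gam = np.where(accepting, gamma_B, gamma)
    GP = Gam[:, None] * P_pi                 # row s is Gamma(s) * P_pi(s, .)
    free = [s for s in range(n) if not rejecting_bscc[s]]
    V = np.zeros(n)                          # zero on rejecting BSCCs, per Lemma 1
    A = np.eye(len(free)) - GP[np.ix_(free, free)]
    V[free] = np.linalg.solve(A, R[free])    # unique once the zeros are pinned
    return V
```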

III-B Open question on the convergence of dynamic programming

Based on Lemma 1, dynamic programming is a way to compute the value function using the Bellman equation. Given an initialization $U_{(0)}$ of the approximate value function and the Bellman equation (6), we iteratively apply the dynamic programming update

U_{(k+1)}=(1-\gamma_{B})\begin{bmatrix}\mathbb{I}_{m}\\ \mathbb{O}_{n}\end{bmatrix}+\begin{bmatrix}\gamma_{B}I_{m\times m}&\\ &\gamma I_{n\times n}\end{bmatrix}P_{\pi}U_{(k)}, (7)

where $m=|B|$ is the number of accepting states, $n=|\neg B|$ is the number of rejecting states, and $\mathbb{I}$ and $\mathbb{O}$ are column vectors of all $1$ and all $0$ elements, respectively. We expect the approximate value function $U_{(k)}$ to converge to the value function $V$ during the updates.

Suppose we initialize $U_{(0)}$ as a zero vector. Then the sufficient condition for the uniqueness of the solution holds for all $U_{(k)}$, and the value function is the only fixed point of the above update. However, a convergence guarantee is still needed to show that we can compute the satisfaction probability using the above dynamic programming update.
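A direct sketch of the update (7) for a fixed policy is given below; dp_update and evaluate are our own names, and the discounting is applied elementwise so the states may appear in any order.

```python
import numpy as np

def dp_update(U, P_pi, accepting, gamma_B, gamma):
    """One dynamic programming update (7): U_{k+1} = R + diag(Gamma) P_pi U_k."""
    R = np.where(accepting, 1.0 - gamma_B, 0.0)
    Gam = np.where(accepting, gamma_B, gamma)
    return R + Gam * (P_pi @ U)

def evaluate(P_pi, accepting, gamma_B, gamma, num_iters):
    """Iterate from U_(0) = 0; U_(k) approximates the satisfaction probability."""
    U = np.zeros(P_pi.shape[0])
    for _ in range(num_iters):
        U = dp_update(U, P_pi, accepting, gamma_B, gamma)
    return U
```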

When $\gamma<1$, the convergence is guaranteed by the one-step contraction

\|U_{(k+1)}-V\|_{\infty}\leq\gamma\|U_{(k)}-V\|_{\infty}. (8)

When $\gamma=1$, this contraction no longer holds, and the convergence to the value function has remained an open question. This motivates us to study the following problem.

Problem Formulation: Consider an MDP with Büchi objective $\mathcal{M}$ given by Definition 1, the surrogate reward (2), and a policy $\pi$. Starting with $U_{(0)}=\mathbb{O}$, show that the approximate value function $U_{(k)}$ updated by dynamic programming (7) converges to the value function $V$ as $k\to\infty$, and give an upper bound on the error $\|U_{(k)}-V_{\pi}\|_{\infty}$.

In the following, we assume a fixed policy $\pi$ and omit the $\pi$ subscript from most notation when it is clear from the context.

III-C Overview on main result

When $\gamma=1$, we find that a multi-step contraction

\|U_{(k+N)}-V\|_{\infty}\leq c\|U_{(k)}-V\|_{\infty} (9)

holds for the dynamic programming update, where $N\in\mathbb{N}^{+}$ and $c\in(0,1)$ is a constant. Even though rejecting states lack discounting, their reachability to accepting states still provides contraction. With this finding, we state the convergence guarantee as follows.

Theorem 1

Given an MDP $\mathcal{M}$, the surrogate reward (2), and a policy $\pi$, and starting with $U_{(0)}=\mathbb{O}$, the approximate value function $U_{(k)}$ updated by dynamic programming (7) converges to the value function $V$ as

  • if $\gamma<1$,

    \|U_{(k)}-V\|_{\infty}\leq\gamma^{k}\|V\|_{\infty}, (10)
  • if $\gamma=1$,

    \|U_{(k)}-V\|_{\infty}\leq(1-(1-\gamma_{B})\varepsilon^{n^{\prime}})^{\left\lfloor\frac{k}{n^{\prime}+1}\right\rfloor}\|V\|_{\infty}, (11)

    where $\varepsilon$ is a lower bound on all possible transitions, that is, $P_{\pi}(s,s^{\prime})\geq\varepsilon>0$ for all $(s,s^{\prime})$ with $P_{\pi}(s,s^{\prime})>0$, and $n^{\prime}$ is the number of rejecting states outside the rejecting BSCCs (defined in Section IV).

As the theorem claims, one can use the surrogate reward and dynamic programming to compute the satisfaction probability of the Büchi objective on the MDP. Thus, the following corollary holds.

Corollary 1

Consider the product MDP constructed from the MDP and the limit-deterministic Büchi automaton [15] generated from the LTL objective. Given a policy $\pi$ on the product MDP, the surrogate reward and dynamic programming can be used to compute the satisfaction probability of the LTL objective on the MDP.

Here, the convergence guarantee is given in terms of the discount factors $\gamma,\gamma_{B}$ and the reachability to the discounted states, expressed as $\varepsilon^{n^{\prime}}$. The first bound is commonly seen for dynamic programming, as each update provides a one-step contraction when $\gamma<1$. The second bound relies on the multi-step contraction shown in the following section.
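As a practical reading of the bound (11), one can invert it to estimate how many updates suffice for a desired accuracy; the following sketch simply evaluates the bound (assuming $\gamma=1$, $\gamma_{B}<1$, and $\varepsilon>0$) and is not an algorithm from the paper.

```python
import math

def iterations_for_tolerance(gamma_B, eps, n_prime, v_inf, tol):
    """Smallest k for which the right-hand side of (11) falls below tol."""
    c = 1.0 - (1.0 - gamma_B) * eps ** n_prime   # multi-step contraction factor
    if v_inf <= tol:
        return 0
    n_blocks = math.ceil(math.log(tol / v_inf) / math.log(c))
    return n_blocks * (n_prime + 1)              # each block is n' + 1 updates

# e.g. the case study of Section V: gamma_B = 0.99, eps = 1, n' = 2
print(iterations_for_tolerance(0.99, 1.0, 2, v_inf=1.0, tol=1e-3))  # 2064
```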

IV Multi-step contraction and proof of the main result

In this section, we formally prove the main result by exploiting the reachability from undiscounted rejecting states to discounted accepting states. First, we simplify the dynamic programming update (7) using Lemma 1. Then, we show that the infinity norm of the error vector is surely contracted once the update is applied enough times.

IV-A Dynamic programming for states outside rejecting BSCCs

The sufficient condition in Lemma 1 always holds as we initialize $U_{(0)}=\mathbb{O}$, since the approximate value function on the rejecting BSCCs stays at zero. We can therefore simplify the dynamic programming update (7) by dropping all states inside rejecting BSCCs,

\begin{bmatrix}U^{B}\\ U^{\neg B_{T,A}}\end{bmatrix}_{(k+1)}=(1-\gamma_{B})\begin{bmatrix}\mathbb{I}_{m}\\ \mathbb{O}_{n^{\prime}}\end{bmatrix}+H\begin{bmatrix}U^{B}\\ U^{\neg B_{T,A}}\end{bmatrix}_{(k)}, (12)

where

H=\begin{bmatrix}\gamma_{B}I_{m\times m}&\\ &\gamma I_{n^{\prime}\times n^{\prime}}\end{bmatrix}\underbrace{\begin{bmatrix}P_{B\rightarrow B}&P_{B\rightarrow\neg B_{T,A}}\\ P_{\neg B_{T,A}\rightarrow B}&P_{\neg B_{T,A}\rightarrow\neg B_{T,A}}\end{bmatrix}}_{T}=\begin{bmatrix}\gamma_{B}P_{B\rightarrow B}&\gamma_{B}P_{B\rightarrow\neg B_{T,A}}\\ \gamma P_{\neg B_{T,A}\rightarrow B}&\gamma P_{\neg B_{T,A}\rightarrow\neg B_{T,A}}\end{bmatrix}. (13)

Here, the state space is partitioned into three sets $B$, $\neg B_{R}$, and $\neg B_{T,A}$: the set of accepting states, the set of states in rejecting BSCCs, and the set of remaining rejecting states, respectively. $U^{B}\in\mathbb{R}^{m}$ and $U^{\neg B_{T,A}}\in\mathbb{R}^{n^{\prime}}$ are the vectors listing the approximate value function for all $s\in B$ and all $s\in\neg B_{T,A}$, respectively. The matrix $T$ represents the transitions between states in $X:=B\cup\neg B_{T,A}$. The sub-matrices $P_{B\rightarrow B}$, $P_{B\rightarrow\neg B_{T,A}}$, etc., contain the transition probabilities from one set of states to another.

At each dynamic programming update, the approximation error is updated by a linear mapping as

D_{(k+1)}=HD_{(k)}, (14)

where

D_{(k)}=\begin{bmatrix}U^{B}\\ U^{\neg B_{T,A}}\end{bmatrix}_{(k)}-\begin{bmatrix}V^{B}\\ V^{\neg B_{T,A}}\end{bmatrix}. (15)

$V^{B}\in\mathbb{R}^{m}$ and $V^{\neg B_{T,A}}\in\mathbb{R}^{n^{\prime}}$ are the vectors listing the value function for states in $B$ and $\neg B_{T,A}$, respectively. When $\gamma=1$, the requirement for the one-step contraction (8), namely $\|H\|_{\infty}<1$, does not hold.
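The matrices $T$ and $H$ of (13) can be assembled directly from the induced chain once the partition is known; a sketch (the index lists and the function name are ours):

```python
import numpy as np

def build_T_H(P_pi, B_idx, notB_TA_idx, gamma_B, gamma):
    """Assemble T and H of (13); states in rejecting BSCCs are simply dropped."""
    keep = list(B_idx) + list(notB_TA_idx)          # accepting states first
    T = P_pi[np.ix_(keep, keep)]
    discounts = np.array([gamma_B] * len(B_idx) + [gamma] * len(notB_TA_idx))
    H = discounts[:, None] * T                      # multiply row i by its discount
    return T, H
```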

IV-B Multi-step contraction

Even with $\gamma=1$, we can still show that there exists an $N\in\mathbb{N}^{+}$ such that $\|H^{N}\|_{\infty}<1$. The multi-step update is given as

D_{(k+N)}=H^{N}D_{(k)}. (16)

By showing that each row sum of $H^{N}$ is strictly less than the corresponding row sum of $T^{N}$, we guarantee $\|H^{N}\|_{\infty}<1$. In $T^{N}$, each element $\{T^{N}\}_{ij}$ is the probability that a path starting at $i$ stays in $X$ and reaches $j$ after $N$ steps, so $\|T^{N}\|_{\infty}\leq 1$. In $H^{N}$, each element $\{H^{N}\}_{ij}$ is a “discounted” version of $\{T^{N}\}_{ij}$,

\{H^{N}\}_{ij}=\sum_{s_{1}\in X}P_{\pi,\gamma_{B}}(i,s_{1})\sum_{s_{2}\in X}P_{\pi,\gamma_{B}}(s_{1},s_{2})\cdots\sum_{s_{N-1}\in X}P_{\pi,\gamma_{B}}(s_{N-2},s_{N-1})P_{\pi,\gamma_{B}}(s_{N-1},j), (17)

where

P_{\pi,\gamma_{B}}(s,s^{\prime})=\begin{cases}\gamma_{B}P_{\pi}(s,s^{\prime})&s\in B\\ P_{\pi}(s,s^{\prime})&s\in X\backslash B\end{cases}. (18)

Whenever a one-step transition starts from an accepting state, a discount $\gamma_{B}$ is applied in (18). Hence $\{H^{N}\}_{ij}\leq\{T^{N}\}_{ij}$, with strict inequality whenever a contributing path passes through an accepting state.

A row sum of $H^{N}$ is strictly less than the corresponding row sum of $T^{N}$ if the probability of visiting an accepting state within the first $N-1$ steps is positive. Thus, the contraction $\|H^{N}\|_{\infty}<1$ is brought about by the reachability from rejecting states to accepting states, which is formalized in the following lemma.

Lemma 2

Starting from a state in $\neg B_{T,A}$, the probability that a path does not visit the set $B$ within $n^{\prime}:=|\neg B_{T,A}|$ steps is upper bounded by $1-\varepsilon^{n^{\prime}}$.

Proof:

A transition leaving $\neg B_{T,A}$ must exist since every state in $\neg B_{T,A}$ is either

  1. a recurrent state inside an accepting BSCC: any path starting from such a recurrent state will eventually reach an accepting state, so at least one path leaves the set $\neg B_{T,A}$; or

  2. a transient state that will enter a BSCC: any path starting from such a transient state will eventually enter an accepting BSCC or a rejecting BSCC, and either way the path eventually leaves the set $\neg B_{T,A}$.

Since for every state $i\in\neg B_{T,A}$ there exists at least one path that leaves $\neg B_{T,A}$ within $n^{\prime}$ steps,

1-\mathbb{P}(s_{1},\cdots,s_{n^{\prime}}\in\neg B_{T,A}\,|\,s_{0}=i)\geq\varepsilon^{n^{\prime}}
\Rightarrow\ \mathbb{P}(s_{1},\cdots,s_{n^{\prime}}\in\neg B_{T,A},s_{n^{\prime}+1}\in S\,|\,s_{0}=i)\leq 1-\varepsilon^{n^{\prime}}. (19)

The existence of such a path can be shown by constructing a path that starts at $i$ and never visits any state in $\neg B_{T,A}$ more than once. One can use the diameter of a graph to prove this existence formally; we omit it for space considerations. ∎

By this reachability property, we show the multi-step contraction, which is technically described by the following lemma.

Lemma 3

When $\gamma=1$, given the approximation error $D_{(k)}$ updated by equation (14), there always exist a constant $c\in(0,1)$ and a positive integer $N\leq|\neg B_{T,A}|+1$ such that

\|D_{(k+N)}\|_{\infty}\leq c\|D_{(k)}\|_{\infty}. (20)
Proof:

For the first $m$ rows of $H^{N}$, every path starts at an accepting state and is discounted at the first step. For all $i\leq m$, each element satisfies $\{H^{N}\}_{ij}\leq\gamma_{B}\{T^{N}\}_{ij}$, so we have $\sum_{j\in X}\{H^{N}\}_{ij}\leq\gamma_{B}\sum_{j\in X}\{T^{N}\}_{ij}\leq\gamma_{B}$.

The remaining $n^{\prime}$ rows need additional treatment, as discounting may not happen. However, we can rule out the case in which no discount happens by setting $N=n^{\prime}+1$.

We split the sum of each row of $T^{n^{\prime}+1}$ into two parts,

\sum_{j\in X}\{T^{n^{\prime}+1}\}_{ij}=\mathbb{P}(s_{1},\cdots,s_{n^{\prime}+1}\in X\,|\,s_{0}=i)
=\underbrace{\eta}_{\text{no accepting state is visited in the first }n^{\prime}\text{ steps}}+\underbrace{\mathbb{P}(s_{1},\cdots,s_{n^{\prime}+1}\in X\,|\,s_{0}=i)-\eta}_{\text{at least one accepting state is visited in the first }n^{\prime}\text{ steps}}\leq 1, (21)

where $\eta$ is the probability of the paths that do not visit an accepting state in the first $n^{\prime}$ steps and thus are not discounted. If $\eta=1$, the corresponding row sum of $H^{n^{\prime}+1}$ would be $1$. However, Lemma 2 prohibits $\eta=1$ from happening,

\eta=\mathbb{P}(s_{1},\cdots,s_{n^{\prime}}\in\neg B_{T,A},s_{n^{\prime}+1}\in X\,|\,s_{0}=i)\leq\mathbb{P}(s_{1},\cdots,s_{n^{\prime}}\in\neg B_{T,A},s_{n^{\prime}+1}\in S\,|\,s_{0}=i)\leq 1-\varepsilon^{n^{\prime}}. (22)

The second part in (21) represents all paths that visit an accepting state at least once in the first $n^{\prime}$ steps. Thus, in $\sum_{j\in X}\{H^{n^{\prime}+1}\}_{ij}$, which is the sum of the “discounted” probabilities (17), the first part stays the same, while the second part is discounted at least once,

\sum_{j\in X}\{H^{n^{\prime}+1}\}_{ij}\leq\underbrace{\eta}_{\text{no discounting}}+\gamma_{B}\underbrace{\big(\mathbb{P}(s_{1},\cdots,s_{n^{\prime}+1}\in X\,|\,s_{0}=i)-\eta\big)}_{\text{at least one accepting state is visited in the first }n^{\prime}\text{ steps}}\leq\eta+\gamma_{B}(1-\eta)\leq 1-(1-\gamma_{B})\varepsilon^{n^{\prime}}. (23)

Thus, each row sum of $H^{n^{\prime}+1}$ is at most $c$, where $c=1-(1-\gamma_{B})\varepsilon^{n^{\prime}}$, and we have

\|D_{(k+N)}\|_{\infty}=\|H^{N}D_{(k)}\|_{\infty}\leq c\|D_{(k)}\|_{\infty}. (24) ∎

The multi-step update always provides a multi-step contraction, so we can upper bound the convergence of the approximation error.
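Lemma 3 can also be checked numerically by computing the induced infinity norms of the powers of $H$; a small sketch:

```python
import numpy as np

def multi_step_norms(H, n_prime):
    """Infinity norms of H^N for N = 1, ..., n'+1; Lemma 3 predicts the last is < 1."""
    norms, HN = [], np.eye(H.shape[0])
    for _ in range(n_prime + 1):
        HN = HN @ H
        norms.append(np.abs(HN).sum(axis=1).max())  # induced infinity norm
    return norms
```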

IV-C Proof of Theorem 1

In the case of $\gamma=1$, the $N$-step update of the approximation error is

D_{(k+N)}=H^{N}D_{(k)}. (25)

By Lemma 3, after every $N$ updates, the infinity norm of the error must shrink by a constant factor $c<1$,

\|D_{(nN)}\|_{\infty}\leq c^{n}\|D_{(0)}\|_{\infty}. (26)

As $n\to\infty$, the error $\|D_{(nN)}\|_{\infty}\to 0$.

Meanwhile, since each row sum of $H$ is at most $1$, the error does not grow at any one-step update. Thus, we have the convergence

\|D_{(k)}\|_{\infty}\leq c^{\left\lfloor\frac{k}{N}\right\rfloor}\|D_{(0)}\|_{\infty}. (27)

Theorem 1 also considers the case $\gamma<1$, where we get the upper bound for the approximation error

\|D_{(k)}\|_{\infty}\leq\gamma^{k}\|D_{(0)}\|_{\infty}. (28)

Since $U_{(0)}=\mathbb{O}$, we have $\|D_{(0)}\|_{\infty}\leq\|V\|_{\infty}$, which gives the bounds (10) and (11). The upper bound on $\|D_{(k)}\|_{\infty}$ immediately holds for $\|U_{(k)}-V\|_{\infty}$, as starting the approximate value function at $\mathbb{O}$ guarantees $U^{\neg B_{R}}_{(k)}-V^{\neg B_{R}}=\mathbb{O}$, where $V^{\neg B_{R}}$ is the vector listing the value function for all states inside rejecting BSCCs. $\blacksquare$

V Case Study

In this section, we show that our upper bound holds for the three-state Markov chain shown in Fig. 1. States $s_{b}$ and $s_{c}$ are in $\neg B_{T,A}$, and $s_{a}$ is the only accepting state. All transitions are deterministic. The discount factors are $\gamma_{B}=0.99$ and $\gamma=1$. The dynamic programming update is given as

U_{(k+1)}=\begin{bmatrix}.01\\ 0\\ 0\end{bmatrix}+\begin{bmatrix}.99&&\\ &1&\\ &&1\end{bmatrix}\begin{bmatrix}0&1&0\\ 1&0&0\\ 0&1&0\end{bmatrix}U_{(k)}.

The infinity norms of the powers of the matrix $H$ in (16) are

\|H\|_{\infty}=\|H^{2}\|_{\infty}=1,\quad\|H^{3}\|_{\infty}=0.99.

The approximate value function is initialized as $U_{(0)}=\mathbb{O}$; since $V=\mathbb{I}$ in this example, the approximation error (in absolute value) is $D_{(0)}=\mathbb{I}$, and during the first three updates we have

D_{(1)}=\begin{bmatrix}.99\\ 1\\ 1\end{bmatrix},\quad D_{(2)}=\begin{bmatrix}.99\\ .99\\ 1\end{bmatrix},\quad D_{(3)}=\begin{bmatrix}.99^{2}\\ .99\\ .99\end{bmatrix}.

Since the discount $\gamma_{B}$ acts only on $s_{a}$, $D(s_{c})$ can decrease only after $D(s_{b})$ has decreased. It takes three updates for $\|D\|_{\infty}$ to decrease from $1$ to $0.99$.

The upper bound for the estimation error provided by Theorem 1 is $\|U_{(k)}-V\|_{\infty}\leq(1-(1-\gamma_{B})\varepsilon^{n^{\prime}})^{\left\lfloor\frac{k}{n^{\prime}+1}\right\rfloor}$, where $\varepsilon=1$ since all transitions are deterministic and $n^{\prime}=2$ as $|\neg B_{T,A}|=2$. Thus, our theorem yields the three-step contraction

\|U_{(k)}-V\|_{\infty}\leq 0.99^{\left\lfloor\frac{k}{3}\right\rfloor},

which captures the fact that $\|H^{3}\|_{\infty}=0.99$. The infinity norm of the error is contracted by $\gamma_{B}$ after every three dynamic programming updates.

In Fig. 2, the error $\|D_{(k)}\|_{\infty}$ decreases by $\gamma_{B}$ over the first three updates, matching our upper bound. However, the error decreases by $\gamma_{B}$ every two updates once $k>3$. The reason is that our bound on the infinity norm only considers the worst case, whereas the two-step update

H^{2}=\begin{bmatrix}.99&0&0\\ 0&.99&0\\ 1&0&0\end{bmatrix}

decreases $D(s_{a})$ and $D(s_{b})$ by $\gamma_{B}$ every two updates while making $D(s_{c})_{(k+2)}=D(s_{a})_{(k)}$. Since $D(s_{a})_{(3)}$ is the smallest element of $D_{(3)}$, $D_{(3)}$ serves as a good starting point that makes $\|D_{(k)}\|_{\infty}$ decrease after every two updates.
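The numbers above can be reproduced with a few lines of numpy; this sketch builds the chain of Fig. 1, checks the norms of $H$, $H^{2}$, $H^{3}$, and runs the first three dynamic programming updates.

```python
import numpy as np

# chain of Fig. 1 with states ordered (s_a, s_b, s_c): s_a <-> s_b, s_c -> s_b
P_pi = np.array([[0., 1., 0.],
                 [1., 0., 0.],
                 [0., 1., 0.]])
gamma_B = 0.99
H = np.diag([gamma_B, 1.0, 1.0]) @ P_pi          # gamma = 1 on s_b and s_c

for N in (1, 2, 3):
    print(N, np.abs(np.linalg.matrix_power(H, N)).sum(axis=1).max())
# prints 1.0, 1.0, 0.99: matches ||H|| = ||H^2|| = 1 and ||H^3|| = 0.99

U, V = np.zeros(3), np.ones(3)                   # V = 1 for every state here
for k in range(1, 4):
    U = np.array([1 - gamma_B, 0., 0.]) + H @ U  # update (7)
    print(k, np.abs(U - V))                      # reproduces D_(1), D_(2), D_(3)
```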

Figure 1: Example of a three-state Markov chain. A discount factor $\gamma_{B}<1$ and a reward $1-\gamma_{B}$ apply at $s_{a}$, while no reward is gained at $s_{b}$ and $s_{c}$. The discount $\gamma$ is applied at $s_{b}$ and $s_{c}$, but $\gamma$ can be set to $1$.
Figure 2: Exponential convergence of the estimation error $\|D_{(k)}\|_{\infty}$ during dynamic programming updates. Our theorem yields an upper bound that shrinks by $\gamma_{B}$ after every three dynamic programming updates. This upper bound captures the convergence in this example.

VI Conclusion and future work

This work answers a challenge in using a surrogate reward with two discount factors for complex LTL objectives. Specifically, we show that dynamic programming always finds the value function even though discounting is absent in many states: a multi-step contraction arises once enough dynamic programming updates are applied. Our findings have implications for the correct policy evaluation of LTL objectives.

Our future effort is to investigate whether the convergence result still holds when we apply value iteration and Q-learning. The challenge is that once the policy is updated, it may induce a new MC, which may have different rejecting BSCCs. The sufficient condition for the uniqueness of the solution of the Bellman equation may then no longer hold during policy updates.

References

  • [1] A. Pnueli, “The temporal logic of programs,” in Annual Symposium on Foundations of Computer Science, 1977.
  • [2] C. Baier and J.-P. Katoen, Principles of Model Checking.   The MIT Press, 2008.
  • [3] G. Fainekos, H. Kress-Gazit, and G. Pappas, “Temporal Logic Motion Planning for Mobile Robots,” in Proceedings of the 2005 IEEE International Conference on Robotics and Automation.   IEEE, 2005, pp. 2020–2025.
  • [4] H. Kress-Gazit, G. E. Fainekos, and G. J. Pappas, “Temporal-logic-based reactive mission and motion planning,” IEEE Transactions on Robotics, vol. 25, no. 6, pp. 1370–1381, 2009.
  • [5] D. Sadigh, E. S. Kim, S. Coogan, S. S. Sastry, and S. A. Seshia, “A learning based approach to control synthesis of Markov decision processes for linear temporal logic specifications,” in 53rd IEEE Conference on Decision and Control, 2014, pp. 1091–1096.
  • [6] M. Hasanbeig, Y. Kantaros, A. Abate, D. Kroening, G. J. Pappas, and I. Lee, “Reinforcement learning for temporal logic control synthesis with probabilistic satisfaction guarantees,” in 2019 IEEE 58th Conference on Decision and Control (CDC), 2019, pp. 5338–5343.
  • [7] E. M. Hahn, M. Perez, S. Schewe, F. Somenzi, A. Trivedi, and D. Wojtczak, “Faithful and effective reward schemes for model-free reinforcement learning of omega-regular objectives,” in Automated Technology for Verification and Analysis: 18th International Symposium, ATVA 2020, Hanoi, Vietnam, October 19–23, 2020, Proceedings.   Springer-Verlag, 2020, pp. 108–124.
  • [8] A. K. Bozkurt, Y. Wang, M. M. Zavlanos, and M. Pajic, “Control Synthesis from Linear Temporal Logic Specifications using Model-Free Reinforcement Learning,” in 2020 IEEE International Conference on Robotics and Automation (ICRA), 2020, pp. 10 349–10 355.
  • [9] P. Ashok, J. Křetínský, and M. Weininger, “PAC Statistical Model Checking for Markov Decision Processes and Stochastic Games,” in Computer Aided Verification.   Springer International Publishing, 2019, pp. 497–519.
  • [10] C. Voloshin, A. Verma, and Y. Yue, “Eventual Discounting Temporal Logic Counterfactual Experience Replay,” in Proceedings of the 40th International Conference on Machine Learning.   PMLR, 2023, pp. 35 137–35 150.
  • [11] D. Shao and M. Kwiatkowska, “Sample Efficient Model-free Reinforcement Learning from LTL Specifications with Optimality Guarantees,” in Thirty-Second International Joint Conference on Artificial Intelligence, vol. 4, 2023, pp. 4180–4189.
  • [12] M. Cai, M. Hasanbeig, S. Xiao, A. Abate, and Z. Kan, “Modular deep reinforcement learning for continuous motion planning with temporal logic,” IEEE Robotics and Automation Letters, vol. 6, no. 4, pp. 7973–7980, 2021.
  • [13] H. Hasanbeig, D. Kroening, and A. Abate, “Certified reinforcement learning with logic guidance,” Artificial Intelligence, vol. 322, no. C, 2023.
  • [14] Z. Xuan, A. Bozkurt, M. Pajic, and Y. Wang, “On the uniqueness of solution for the Bellman equation of LTL objectives,” in Proceedings of the 6th Annual Learning for Dynamics & Control Conference.   PMLR, Jun. 2024, pp. 428–439.
  • [15] S. Sickert, J. Esparza, S. Jaax, and J. Křetínský, “Limit-Deterministic Büchi Automata for Linear Temporal Logic,” in Computer Aided Verification, vol. 9780.   Springer International Publishing, 2016, pp. 312–332.
  • [16] M. Cai, S. Xiao, J. Li, and Z. Kan, “Safe reinforcement learning under temporal logic with reward design and quantum action selection,” Scientific Reports, vol. 13, no. 1, p. 1925, 2023.
  • [17] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, second edition ed.   The MIT Press, 2018.
  • [18] E. M. Hahn, M. Perez, S. Schewe, F. Somenzi, A. Trivedi, and D. Wojtczak, “Omega-regular objectives in model-free reinforcement learning,” in Tools and Algorithms for the Construction and Analysis of Systems.   Springer International Publishing, 2019, pp. 395–412.