Using Large Language Models to Automate and Expedite Reinforcement Learning with Reward Machine
Abstract
We present LARL-RM (Large language model-generated Automaton for Reinforcement Learning with Reward Machine), an algorithm that encodes high-level knowledge into reinforcement learning in the form of an automaton in order to expedite learning. Our method uses Large Language Models (LLMs) to obtain high-level domain-specific knowledge through prompt engineering, instead of providing the reinforcement learning algorithm with this knowledge directly, which would require an expert to encode the automaton. We use chain-of-thought and few-shot methods for prompt engineering and demonstrate that our method works using these approaches. Additionally, LARL-RM allows for fully closed-loop reinforcement learning without the need for an expert to guide and supervise the learning, since LARL-RM can use the LLM directly to generate the required high-level knowledge for the task at hand. We also show that our algorithm is theoretically guaranteed to converge to an optimal policy. Finally, we demonstrate that LARL-RM speeds up convergence by implementing our method in two case studies.
1 Introduction
Large Language Models (LLMs) encode various types of information, including domain-specific knowledge, which enables them to be used as a source for extracting specific information about a domain. This knowledge can be used for planning and control of various systems, such as autonomous systems, by synthesizing control policies Seff et al. (2023). These policies essentially dictate the agent's actions in the environment at a high level.
In general, extracting domain-specific knowledge may not be a straightforward task, since prompting the LLM without a proper prompting technique may result in hallucinations Jang et al. (2023); even with proper techniques, LLMs are known to hallucinate. Several methods have been suggested for prompt engineering, ranging from zero-shot prompting Kojima et al. (2022) all the way to training the pre-trained language model on a domain-specific dataset (fine-tuning) Rafailov et al. (2023) to obtain more accurate responses. The latter approach is more involved and resource-intensive, since it requires a dataset related to the domain as well as enough computation power to train the pre-trained language model on that dataset Ding et al. (2023). Moreover, not all language models are available for fine-tuning due to their providers' policies. Hence, fine-tuning is a less desirable approach as long as other prompt engineering methods can produce accurate responses.
Our algorithm LARL-RM (large language model-generated automaton for reinforcement learning with reward machine) uses prompt engineering methods to extract the relevant knowledge from the LLM. This knowledge is then incorporated into the reinforcement learning (RL) algorithm in the form of high-level knowledge to expedite the RL and guide it to the optimal policy faster. We integrate the LLM-encoded high-level knowledge into the RL using deterministic finite automata (DFA). We make the following contributions: (a) LARL-RM allows for integration of high-level knowledge from a specific domain into the RL using LLM outputs directly. (b) Our proposed method is capable of adjusting the instructions to the LLM by closing the loop in learning and updating the prompt, based on the chain-of-thought method, to generate new automata when counterexamples are encountered during the policy update. (c) LARL-RM is capable of learning the reward machine and expediting the RL by providing the LLM-generated DFA to the RL algorithm. We also show that LARL-RM speeds up the RL by implementing it in two case studies.
Related work. Generative language models (GLMs) such as the GPT models have been shown to encode world knowledge and to be capable of generating instructions for a task Hendrycks et al. (2020). Researchers in West et al. (2021) have demonstrated that advances in GLMs allow the construction of task-relevant knowledge graphs. Generating automata from GLMs. Generating an automaton from a generative language model for a specific task is shown in Yang et al. (2022); however, the proposed method, called GLM2FSA, uses verbs and verb phrases and maps them onto a predefined set of verbs in order to convert the instructions given by the GLM into an automaton. This reduces the generality of the method when applied to other domains and still requires an expert to pre-define this set. GLMs are also used to extract task-relevant semantic knowledge and generate plans to complete a task in robotics Vemprala et al. (2023). Learning reward machines. Researchers in Neider et al. (2021) have proposed a method called AdvisoRL for RL in which the algorithm is capable of learning the reward machine; however, an expert is required to guide the learning process through the provision of so-called advice. In our method, this expert is replaced by an LLM.
2 Preliminaries
In this section, we define our notation and introduce the necessary background on reinforcement learning, reward machines, and deterministic finite automata (DFA).
Definition 1 (labeled Markov decision process).
A labeled Markov decision process is a tuple $\mathcal{M} = (S, s_I, A, p, R, \gamma, \mathcal{P}, L)$ where $S$ is a finite set of states, $s_I \in S$ is the agent's initial state, $A$ is a finite set of actions, $p \colon S \times A \times S \to [0, 1]$ is the probabilistic transition function, $R$ is the (possibly non-Markovian) reward function, $\gamma \in (0, 1)$ is the discount factor, $\mathcal{P}$ is a finite set of propositions, and $L \colon S \times A \times S \to 2^{\mathcal{P}}$ is a labeling function that determines the high-level events that the agent encounters in the environment.
A policy $\pi$ is a function mapping states in $S$ to a probability distribution over actions in $A$: at state $s_t$, the agent chooses action $a_t$ with probability $\pi(s_t, a_t)$, which leads to a new state $s_{t+1}$. The probability of reaching $s_{t+1}$ is given by $p(s_t, a_t, s_{t+1})$. The resulting sequence of states and actions of this stochastic process is a trajectory $s_0 a_0 s_1 a_1 \cdots a_{k-1} s_k$. Its corresponding label sequence is $\ell_1 \ell_2 \cdots \ell_k$, where $\ell_{i+1} = L(s_i, a_i, s_{i+1})$. A trace is the pair $(\ell_1 \ell_2 \cdots \ell_k, r_1 r_2 \cdots r_k)$ of labels and rewards, where the payoff of the agent is $\sum_{i=1}^{k} \gamma^{i-1} r_i$.
We exploit deterministic finite automata (DFA) to encode the high-level knowledge obtained from the LLM and pass it to reinforcement learning in order to speed up the process of obtaining the optimal policy $\pi^*$. A DFA reads sequences over an input alphabet and classifies them as accepted or rejected, which we can use to steer the learning process.
Definition 2 (Deterministic finite automata).
A DFA is a tuple $\mathcal{A} = (Q, q_I, \Sigma, \delta, F)$ where $Q$ is the finite set of states, $q_I \in Q$ is the initial state, $\Sigma$ is the input alphabet (here we consider $\Sigma = 2^{\mathcal{P}}$), $\delta \colon Q \times \Sigma \to Q$ is the transition function between the states, and $F \subseteq Q$ is the set of accepting states.
We use deterministic finite automata to model the LLM-generated advice. The main idea of the LLM-generated DFA is to assist the algorithm in learning the reward machine by providing information about which label sequences may result in a positive reward and which ones will definitely result in a non-positive reward Neider et al. (2021).
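To make Definition 2 concrete, the following is a minimal Python sketch of a DFA and its acceptance check; the states, labels, and transitions are illustrative placeholders (single symbols rather than subsets of the proposition set), not the automata used in the case studies.

```python
from dataclasses import dataclass

@dataclass
class DFA:
    states: set        # Q
    initial: str       # q_I
    transitions: dict  # delta: (state, label) -> state
    accepting: set     # F

    def run(self, labels):
        """Return the state sequence q_0 ... q_k induced by a label sequence."""
        q = self.initial
        sequence = [q]
        for ell in labels:
            q = self.transitions[(q, ell)]
            sequence.append(q)
        return sequence

    def accepts(self, labels):
        """A label sequence is in L(A) iff its run ends in an accepting state."""
        return self.run(labels)[-1] in self.accepting

# Toy DFA over two illustrative labels: 'g' (green light) and 'p' (pedestrian).
dfa = DFA(
    states={"q0", "q1"},
    initial="q0",
    transitions={("q0", "g"): "q1", ("q0", "p"): "q0",
                 ("q1", "g"): "q1", ("q1", "p"): "q0"},
    accepting={"q1"},
)
print(dfa.accepts(["p", "g"]))  # True: the run q0 -> q0 -> q1 ends in an accepting state
```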
Reward machines are a type of Mealy machine used to encode a non-Markovian reward function. They are automata that receive a label from the environment and respond with a reward as well as a transition to their next state. Their input alphabet is the set of subsets of the propositional variables in $\mathcal{P}$ (i.e., $2^{\mathcal{P}}$) and their output alphabet is a set of real numbers.
Definition 3 (Reward machine).
A reward machine is a tuple $\mathcal{R} = (V, v_I, 2^{\mathcal{P}}, \mathbb{R}, \delta, \sigma)$ where $V$ is a finite set of states, $v_I \in V$ is the initial state, the input alphabet is $2^{\mathcal{P}}$, the output alphabet is $\mathbb{R}$, $\delta \colon V \times 2^{\mathcal{P}} \to V$ is the transition function, and $\sigma \colon V \times 2^{\mathcal{P}} \to \mathbb{R}$ is the output function.
Applying the reward machine $\mathcal{R}$ to a sequence of labels $\ell_1 \ell_2 \cdots \ell_k$ creates a sequence of states $v_0 v_1 \cdots v_k$ where $v_0 = v_I$ and $v_{i+1} = \delta(v_i, \ell_{i+1})$ for all $0 \le i < k$. Each transition to the next state of the reward machine results in a reward $r_{i+1} = \sigma(v_i, \ell_{i+1})$. Hence, a label sequence $\ell_1 \ell_2 \cdots \ell_k$ corresponds to a reward sequence $\mathcal{R}(\ell_1 \cdots \ell_k) = r_1 r_2 \cdots r_k$ produced by the reward machine $\mathcal{R}$. Therefore, the reward function $R$ of an MDP is encoded by the reward machine $\mathcal{R}$ if, for every trajectory with label sequence $\ell_1 \cdots \ell_k$, the agent receives the reward sequence $\mathcal{R}(\ell_1 \cdots \ell_k)$.
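To illustrate how a reward machine maps a label sequence to a reward sequence, here is a small Python sketch of Definition 3; the states, labels, and reward values are illustrative only.

```python
class RewardMachine:
    def __init__(self, states, initial, delta, sigma):
        self.states = states      # V
        self.initial = initial    # v_I
        self.delta = delta        # transition function: (v, label) -> v'
        self.sigma = sigma        # output function: (v, label) -> reward

    def rewards(self, labels):
        """Map a label sequence l_1 ... l_k to the reward sequence r_1 ... r_k."""
        v, out = self.initial, []
        for ell in labels:
            out.append(self.sigma[(v, ell)])
            v = self.delta[(v, ell)]
        return out

# Toy reward machine: reward 1 is paid once when the event 'goal' is first observed.
rm = RewardMachine(
    states={"v0", "v1"},
    initial="v0",
    delta={("v0", "goal"): "v1", ("v0", "step"): "v0",
           ("v1", "goal"): "v1", ("v1", "step"): "v1"},
    sigma={("v0", "goal"): 1, ("v0", "step"): 0,
           ("v1", "goal"): 0, ("v1", "step"): 0},
)
print(rm.rewards(["step", "goal", "step"]))  # [0, 1, 0]
```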

Motivating Example. We consider an autonomous car on American roads with a given set of actions (see Figure 1). The agent must reach the first intersection and then make a right turn when the traffic light is green, provided there is no car on the left and no pedestrian on the right. The agent then reaches the next intersection, where it needs to go straight when the traffic light is green, provided there are no cars and no pedestrians. The corresponding DFA for this motivating example is shown in Figure 2.
3 Generating Domain Specific Knowledge Using LLM
Large Language Models (LLMs) encode rich world and domain-specific knowledge Yang et al. (2023), which allows for the extraction of relevant information through appropriate prompting; prompting has even been shown to surpass fine-tuning of LLMs in some cases Wei et al. (2022). Hence, we use prompt engineering to extract the relevant domain knowledge from the LLM and provide it in the form of an LLM-generated DFA Neider et al. (2021) to expedite reinforcement learning. An LLM-generated DFA can be thought of as a means to provide the process of learning the reward machine with information about which label sequences are promising in the sense that they may result in a reward.
Applying a DFA $\mathcal{A}$ to a sequence $\ell_1 \ell_2 \cdots \ell_k$ results in a sequence of states $q_0 q_1 \cdots q_k$ where $q_0 = q_I$ and $q_{i+1} = \delta(q_i, \ell_{i+1})$. We define $L(\mathcal{A}) = \{\ell_1 \cdots \ell_k \mid q_k \in F\}$, also commonly referred to as the formal language accepted by $\mathcal{A}$. A DFA $\mathcal{A}$ can serve as an LLM-generated DFA if the label sequences in $L(\mathcal{A})$ could result in a positive reward, while the label sequences outside $L(\mathcal{A})$ should not receive a positive reward; it is therefore a binary classifier that points out the promising explorations Neider et al. (2021). Our proposed algorithm uses the LLM-generated DFAs to narrow down the search space for learning the ground truth reward machine, hence expediting reinforcement learning; in our method, however, the LLM-generated DFA is generated automatically for a specific domain using the LLM. It is known that LLMs hallucinate; therefore, our proposed algorithm is capable of ignoring the LLM-generated DFA if it is incompatible with the underlying ground truth reward machine.
Definition 4 (LLM-generated DFA compatibility).
An LLM-generated DFA $\mathcal{A}$ is compatible with the ground truth reward machine $\mathcal{R}$ if for all label sequences $\ell_1 \cdots \ell_k$ with reward sequence $\mathcal{R}(\ell_1 \cdots \ell_k) = r_1 \cdots r_k$ it holds that $r_k > 0$ implies $\ell_1 \cdots \ell_k \in L(\mathcal{A})$.
We demonstrate an LLM-generated DFA for the motivating example in Figure 3.
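Definition 4 can be checked against a finite set of observed traces. The sketch below assumes the illustrative DFA class from Section 2: a trace whose final reward is positive but whose label sequence the DFA rejects witnesses incompatibility, and it is exactly this kind of counterexample that LARL-RM later uses to re-prompt the LLM.

```python
def incompatible_traces(dfa, traces):
    """Return traces witnessing that an LLM-generated DFA violates Definition 4.

    Each trace is a pair (labels, rewards) observed in the environment. The DFA
    is incompatible if some label sequence earned a positive final reward but is
    not accepted by the DFA; such traces are the counterexamples LARL-RM feeds
    back into the prompt.
    """
    witnesses = []
    for labels, rewards in traces:
        if rewards and rewards[-1] > 0 and not dfa.accepts(labels):
            witnesses.append((labels, rewards))
    return witnesses
```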
In our proposed method we generate the domain-specific knowledge using the GPT series Liu et al. (2023). There are several methods of prompting LLMs, such as the zero-shot method Kojima et al. (2022), where the prompt uses a step-by-step approach to obtain better results, or the few-shot method Brown et al. (2020), where the prompt is given similar examples encouraging the LLM to produce similar reasoning to reach the correct answer. We use a combination of these methods to encourage the GPT model to produce the correct information. First, we adopt a persona for the domain from which we plan to extract relevant information; in our example, the persona is an expert in American road traffic rules. Adopting a persona allows for more coherent and higher-quality responses from the language model. Figure 4 demonstrates the GPT response to a question about making a right turn at a traffic light while an expert persona is adopted.

Figure 4 demonstrates that we can obtain the knowledge for a specific domain using the appropriate prompting method. In general, LLMs are text-based, which means that their output is text and cannot directly be used in an RL algorithm; therefore, we require a method to convert this textual output into an automaton. The next step is to ask the LLM to convert the provided steps into relevant propositions so that they can be used for the generation of the automaton. We perform this by mapping the steps onto the set of propositions, as illustrated in Figure 5.
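A minimal sketch of this prompting step, assuming the OpenAI Python SDK; the persona text, task wording, and proposition names are illustrative, not the exact prompts shown in Figures 4 and 5.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PERSONA = "You are an expert in American road traffic rules."
TASK = ("List the steps for making a right turn at a traffic light, "
        "then map each step onto the propositions: green_light, "
        "car_on_left, pedestrian_on_right.")

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": PERSONA},       # adopt the domain persona
        {"role": "user", "content": TASK + " Think step by step."},
    ],
    temperature=0.0,   # low temperature for more deterministic answers
    top_p=1.0,
)
print(response.choices[0].message.content)
```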

We use the mapped steps (instructions) to generate the DFA, since actions and propositions are available; however, transitions and states are yet to be defined. We can use the labels for the transitions between states, and to obtain the states we circle back to the task and its instructions. Therefore, we use the task description and the mapped steps to obtain the transitions and states; Figure 6 shows the prompting used to create them.

We can construct a DFA from the states $Q$, the propositions $\mathcal{P}$, and the transition function $\delta$; therefore, we can now use this information to generate the DFA for the task and use it to expedite convergence to the optimal policy $\pi^*$.
Algorithm 1: Generating a DFA using the LLM
Input: prompt
Parameter: temperature, top-p (OpenAI LLM parameters)
Output: DFA
Algorithm 1 can be updated to not even require propositions and actions to obtain the DFA. Generating a DFA this way requires a different prompting technique, which combines the few-shot and chain-of-thought methods; we call it MixedR (mixed reasoning). We exploit this approach for tasks that are more involved, since obtaining the propositions and actions requires more information about the system. MixedR only requires the task description and example DFAs, which do not even need to be relevant to the task, as long as descriptions are provided for the example DFAs, even at a high level, since LLMs are capable of extracting complex patterns from textual data. We use MixedR to generate the DFA. Figure 7 shows the prompt template used for MixedR.

We can use the motivating example and the MixedR prompting technique to obtain the relevant DFA for a smaller instance of the problem. We can then provide the DFA obtained from MixedR as the LLM-generated DFA for the RL algorithm to use in order to learn the ground truth reward machine in a smaller search space.
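The sketch below shows one way a MixedR-style prompt could be assembled and its reply parsed into a DFA, reusing the OpenAI client and the illustrative DFA class from above; the template wording, the JSON output format, and the parsing logic are assumptions made for illustration, not the exact template of Figure 7.

```python
import json

MIXEDR_TEMPLATE = """You are an expert in American road traffic rules.
Here is an example DFA, given as JSON with keys "states", "initial",
"accepting", and "transitions" (a list of [state, label, next_state]):
{example_dfa}
Now, for the following task, reason step by step and then output only a
JSON object in the same format.
Task: {task}"""

def generate_dfa(client, task, example_dfa, temperature=0.0, top_p=1.0):
    prompt = MIXEDR_TEMPLATE.format(example_dfa=json.dumps(example_dfa), task=task)
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        top_p=top_p,
    )
    text = response.choices[0].message.content
    # Assume the final JSON object in the reply is the DFA; extract and parse it.
    spec = json.loads(text[text.find("{"): text.rfind("}") + 1])
    return DFA(states=set(spec["states"]),
               initial=spec["initial"],
               transitions={(s, l): t for s, l, t in spec["transitions"]},
               accepting=set(spec["accepting"]))
```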
4 Expediting Reinforcement Learning Using LLM-generated DFA
We use the LLM output as instructions to expedite the convergence of reinforcement learning to the optimal policy. We use the LLM to narrow down the search space for the RL algorithm, since in a typical reinforcement learning algorithm the search space is considerable, leading to cases where training requires significant time and resources. Hence, using pre-trained LLMs we can reduce this search space and converge to the optimal policy faster. One main issue to address is that LLMs may generate outputs that are not aligned with facts; therefore, it is necessary to close the loop in the RL algorithm so that, if a counterexample is encountered, the algorithm is capable of adjusting and updating the prompt to create an updated LLM-generated DFA that can be used by the RL algorithm. Figure 8 illustrates the LARL-RM algorithm.

Algorithm 2: LARL-RM
Input: prompt, episode length
Parameter: learning rate, discount factor, epsilon-greedy parameter, temperature, top-p, LLM query budget
Our proposed method LARL-RM uses the prompt provided by the user to run the RL algorithm for a specific task that requires domain knowledge; by using the LLM, the need for an expert is minimized, and the algorithm itself can update the prompt using counterexamples to update its DFA. LARL-RM first initializes an empty set for the traces, then uses the prompt to generate the relevant DFA for the RL (Lines 1-2). The algorithm then initializes the reward machine based on the LLM-generated DFA obtained from the LLM and also initializes QRM (q-learning for reward machines) before starting the episodes (Lines 3-4). LARL-RM stores the traces and DFAs in order to reinitialize the reward machine and q-values if counterexamples are encountered; QRM is then called to update the q-values, and if any counterexample is found, the corresponding trace is stored so that it can be used to update the reward machine and the DFA (Lines 5-10). If the trace encountered by the agent is not compatible with the ground truth reward machine, the corresponding DFA is removed from the LLM-generated DFA set, and the algorithm uses this incompatible DFA, the initial prompt, and the counterexample label sequence to obtain an updated prompt. LARL-RM also uses an LLM query budget to ensure that, if the responses are not compatible with the ground truth reward machine, the algorithm does not get stuck in a loop: once the budget is depleted, it starts to learn the ground truth reward machine without the LLM-generated DFA (Lines 13-16; this further highlights the importance of prompt engineering). Afterward, LARL-RM uses the stored DFA and trace sets to reinitialize the reward machine and q-values (Lines 18-19).
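The following Python-style sketch summarizes the control flow described above; the helper functions (DFA generation, reward machine learning, the QRM episode, prompt refinement) are hypothetical placeholders for the components discussed in Sections 4.1 and 4.2, so this is an outline of the loop rather than a faithful reproduction of Algorithm 2.

```python
def larl_rm(env, llm, prompt, episodes, episode_length, llm_budget):
    traces = []                                       # sample of observed traces
    dfas = [generate_dfa_from_prompt(llm, prompt)]    # placeholder: Algorithm 1 / MixedR
    rm = learn_reward_machine(traces, dfas)           # placeholder: SAT-based learner (Section 4.1)
    q = initialize_q_values(rm)                       # placeholder: QRM q-value tables

    for _ in range(episodes):
        labels, rewards = run_episode_with_qrm(env, rm, q, episode_length)  # placeholder: QRM episode
        if rm.rewards(labels) != rewards:             # counterexample: hypothesis disagrees with environment
            traces.append((labels, rewards))
            for dfa in [d for d in dfas if rewards[-1] > 0 and not d.accepts(labels)]:
                dfas.remove(dfa)                      # drop DFAs contradicted by the counterexample
                if llm_budget > 0:                    # re-prompt the LLM while budget remains
                    llm_budget -= 1
                    prompt = refine_prompt(prompt, labels)   # placeholder: Section 4.2
                    dfas.append(generate_dfa_from_prompt(llm, prompt))
            rm = learn_reward_machine(traces, dfas)   # reinitialize reward machine and q-values
            q = initialize_q_values(rm)
    return q
```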
4.1 Using LLM-generated DFA to Learn the Reward Machine
LARL-RM uses the counterexamples to create a minimal reward machine while being guided by the LLM-generated DFA. Our method uses an approach similar to that of Neider et al. (2021) in that it maintains a finite sample set of traces and a finite set of LLM-generated DFAs. We assume that each LLM-generated DFA is compatible with the sample, i.e., every label sequence in the sample whose final reward is positive is accepted by the DFA; if this criterion is not fulfilled, the DFA is ignored by LARL-RM.
The reward machine learning, which relies on the LLM-generated DFA, generates a minimal reward machine that is consistent with the traces and compatible with the DFAs. The requirement that the reward machine be minimal is important for the convergence of LARL-RM to the optimal policy. LARL-RM uses SAT-based automata learning, originally developed to verify parametric systems, since the learning task can be broken down into a series of satisfiability checks of propositional logic formulas. Essentially, we construct and solve a sequence of propositional formulas of increasing size Neider et al. (2021); Neider (2014).
We construct the formulas over a set of propositional variables using the Boolean connectives $\wedge$, $\vee$, $\neg$, and $\rightarrow$; an interpretation is a mapping from propositional variables to Boolean values, and an interpretation that makes a formula true is said to satisfy it. Using this approach we can obtain a minimal reward machine, similar to Neider et al. (2021). Using the LLM, we are capable of constructing an LLM-generated DFA in a format that is compatible with SAT-based automata learning methods. LARL-RM is also capable of adjusting the initial prompt so that, in case the generated DFA is not compatible with the ground truth reward machine, the prompt can be updated. The updated prompt uses the counterexample label sequence to call the LLM and obtain an updated DFA that is more likely to be compatible with the ground truth reward machine.
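A schematic of the SAT-driven search for a minimal reward machine using python-sat; the helpers encode (which would emit the clauses of the constraints described in Appendix B) and decode (which would read the transition and output functions off a satisfying assignment) are hypothetical placeholders, and only the outer loop and the solver calls are concrete.

```python
from pysat.formula import IDPool
from pysat.solvers import Glucose3

def learn_minimal_reward_machine(traces, dfas, max_states=20):
    """Increase the number of RM states n until the SAT encoding becomes satisfiable."""
    for n in range(1, max_states + 1):
        pool = IDPool()                               # maps variables like d_{v,l,v'} to solver integers
        clauses = encode(n, traces, dfas, pool)       # hypothetical: builds the constraints for n states
        with Glucose3(bootstrap_with=clauses) as solver:
            if solver.solve():
                model = solver.get_model()            # satisfying assignment
                return decode(n, model, pool)         # hypothetical: read off the transition/output functions
    raise RuntimeError("no consistent and compatible reward machine with at most max_states states")
```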
4.2 Refinement of Prompt and DFA
We use the prompt to generate the DFA for a specific domain. A transition of the DFA exists for a proposition only if the prompt induces it; hence, if the LLM-generated DFA produced by the prompt is incorrect, it cannot have the correct transitions, which leads to a counterexample trajectory. Therefore, we use the counterexample to update the prompt in order to generate an updated DFA. The updated DFA then has a chance of becoming compatible with the counterexample. This process continues, and each time the algorithm encounters a counterexample it uses it to update the prompt again.
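A small sketch of the prompt-refinement step; the wording of the appended feedback is an assumption, while the mechanism (append the counterexample label sequence and request a corrected DFA) is the one described above.

```python
def refine_prompt(prompt, counterexample_labels):
    """Augment the prompt with a counterexample so the LLM can repair its DFA."""
    feedback = (
        "The DFA you produced is inconsistent with the observed behavior: "
        f"the label sequence {counterexample_labels} received a positive reward "
        "but is not accepted by your DFA. Update the DFA so that this sequence "
        "is accepted, and output it in the same JSON format."
    )
    return prompt + "\n" + feedback
```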
5 Convergence to Optimal Policy
The learned reward machine will ultimately be equivalent to the ground truth reward machine under the condition that every label sequence is admissible by the underlying MDP, i.e., physically allowed. If this assumption does not hold, it is still ensured that the learned reward machine and the ground truth reward machine agree on every admissible label sequence.
Because we use an LLM to generate the DFA and LLMs are known to produce outputs that are not factual, we cannot make any assumptions about the quality of the LLM's output. For the purpose of proving guarantees of LARL-RM, we therefore treat the LLM as adversarial, that is, we consider the worst case.
Lemma 1.
Let $\mathcal{M}$ be a labeled MDP, $\mathcal{R}$ the ground truth reward machine encoding the rewards of $\mathcal{M}$, and $\mathcal{D}$ the set of all LLM-generated DFAs that are added during the run of LARL-RM. Then, with a sufficiently large episode length (depending on the sizes of $\mathcal{M}$, $\mathcal{R}$, and the DFAs in $\mathcal{D}$), LARL-RM almost surely learns a reward machine that is equivalent to $\mathcal{R}$.
Lemma 1 provides us with an upper bound on the length of episodes that needs to be explored. Additionally, the correctness of the LARL-RM algorithm follows from Lemma 1 and the correctness of the QRM algorithm Icarte et al. (2018). Using the QRM guarantee, we can now show convergence to the optimal policy.
Theorem 1.
Let the labeled MDP, the ground truth reward machine, the set of LLM-generated DFAs, and the episode length be as in Lemma 1. Then, LARL-RM will almost surely converge to an optimal policy.
6 Case Studies
We implement LARL-RM using the GPT series of LLMs; specifically, we focus on the GPT-3.5-Turbo model. We set up a chat with the model using the APIs provided by OpenAI. For our example, we consider LARL-RM applied to an autonomous car example similar to the one found in Xu et al. (2020a).
In our example, the autonomous car must navigate to reach a destination while avoiding pedestrians and obeying traffic laws. It is worth mentioning that our algorithm is capable of running even without any LLM-generated DFA, since in that case it can learn the ground truth reward machine unaided. We consider two case studies for our motivating example to demonstrate the capability of LARL-RM to expedite the RL. In both cases, we demonstrate the effect of an incompatible LLM-generated DFA (one that is not compatible with the ground truth reward machine) on the algorithm's convergence to the optimal policy.
6.1 Case Study 1
The agent in our traffic example has a fixed set of actions for moving through the environment. The layout of the environment is shown in Figure 1.
We use the LLM-generated DFA to guide the RL process. If the advice from the LLM is not compatible with the ground truth reward machine, it is ignored by the algorithm; the prompt is then updated and fed back to the LLM in order to obtain a new DFA that has a chance of becoming compatible with the counterexample. We show the ground truth reward machine for Case Study 1 in Figure 9.
The LLM-generated DFA produced by the prompt for this case study is compatible with the ground truth reward machine (Figure 9) and helps the algorithm converge to the optimal policy faster. The DFA generated using GPT is shown in Figure 10.
We use the DFA in Figure 10 to guide the learning process. The reward obtained by the LARL-RM algorithm shows that it reaches the optimal policy faster when the LLM-generated DFA is available than when it is not.

We average the results of independent runs using a rolling mean with a fixed window size. Figure 11 demonstrates that the algorithm converges to the optimal policy using the LLM-generated DFA. In this example, the prompt is once compatible and once incompatible with the ground truth reward machine; even when it is not compatible, LARL-RM can use the counterexample to update the prompt and generate a new DFA that is more likely to be compatible with the counterexample.
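For reference, the smoothing applied to the reward curves is a standard rolling mean over the per-step average of independent runs; a short sketch with pandas, where the window size stands in for the value used in the experiments.

```python
import numpy as np
import pandas as pd

def smooth_rewards(reward_runs, window):
    """Average rewards over independent runs, then apply a rolling mean."""
    mean_per_step = np.mean(np.asarray(reward_runs), axis=0)   # average over runs at each step
    return pd.Series(mean_per_step).rolling(window, min_periods=1).mean()
```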
6.2 Case Study 2
In Case Study 2 we show that if the suggested DFA is not compatible with the ground truth reward machine, then LARL-RM uses the counterexample to adjust the prompt and obtain a new DFA that is compatible with the counterexample. In this case study we modify the MDP environment so that multiple reward machines, as well as multiple DFAs, can be considered equivalent. This way, the output of the LLM may not necessarily match the ground truth reward machine; however, LARL-RM can still use this DFA as long as it is not incompatible with the ground truth reward machine but is equivalent to it.
In this environment, we extend the autonomous car example and provide it with more options to reach the destination. The ground truth reward machine does not prescribe a specific route but rather emphasizes the target destination; however, the LLM-generated DFA might favor a certain route over another, which is not incompatible with the ground truth reward machine but rather equivalent to it. We show the ground truth reward machine for Case Study 2 in Figure 12.
The LLM-generated DFA for this case study specifies a route which leads to the target, and it is equivalent to the ground truth reward machine, since this solution DFA can be considered a subset of the larger DFA which is compatible with the ground truth reward machine. Figure 13 shows the LLM-generated DFA for Case Study 2.
We run LARL-RM for Case Study 2 using a prompt which generates a DFA that is incompatible with the ground truth reward machine; LARL-RM takes the resulting counterexample and generates an updated DFA which is compatible with the ground truth reward machine. Figure 14 demonstrates the reward of LARL-RM using this DFA.

Figure 14 shows the reward obtained by LARL-RM for Case Study 2 using the prompt and the DFA it generates. The reward is averaged over independent runs with a rolling mean. For both the compatible and the incompatible LLM-generated DFA, LARL-RM converges to the optimal policy faster than when there is no LLM-generated DFA.
7 Conclusion
We proposed a novel algorithm, LARL-RM, that uses a prompt to obtain an LLM-generated DFA to expedite reinforcement learning. LARL-RM uses counterexamples to automatically generate a new prompt and consequently, a new DFA that is compatible with the counterexamples to close the loop in RL. We showed that LARL-RM is guaranteed to converge to an optimal policy using the LLM-generated DFA. We showed that RL can be expedited using the LLM-generated DFA and in case the output of the LLM is not compatible with the ground truth reward machine, LARL-RM is capable of adjusting the prompt to obtain a more accurate DFA. In future work, we plan to extend the proposed framework to RL for multi-agent systems.
Acknowledgments
This research is partially supported by the National Science Foundation under grant NSF CNS 2304863 and the Office of Naval Research under grant ONR N00014-23-1-2505.
References
- Brown et al. [2020] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- Ding et al. [2023] Ning Ding, Yujia Qin, Guang Yang, Fuchao Wei, Zonghan Yang, Yusheng Su, Shengding Hu, Yulin Chen, Chi-Min Chan, Weize Chen, et al. Parameter-efficient fine-tuning of large-scale pre-trained language models. Nature Machine Intelligence, 5(3):220–235, 2023.
- Hendrycks et al. [2020] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
- Icarte et al. [2018] Rodrigo Toro Icarte, Toryn Klassen, Richard Valenzano, and Sheila McIlraith. Using reward machines for high-level task specification and decomposition in reinforcement learning. In International Conference on Machine Learning, pages 2107–2116. PMLR, 2018.
- Jang et al. [2023] Joel Jang, Seonghyeon Ye, and Minjoon Seo. Can large language models truly understand prompts? a case study with negated prompts. In Transfer Learning for Natural Language Processing Workshop, pages 52–62. PMLR, 2023.
- Kojima et al. [2022] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213, 2022.
- Liu et al. [2023] Yiheng Liu, Tianle Han, Siyuan Ma, Jiayue Zhang, Yuanyuan Yang, Jiaming Tian, Hao He, Antong Li, Mengshen He, Zhengliang Liu, et al. Summary of chatgpt-related research and perspective towards the future of large language models. Meta-Radiology, page 100017, 2023.
- Neider et al. [2021] Daniel Neider, Jean-Raphael Gaglione, Ivan Gavran, Ufuk Topcu, Bo Wu, and Zhe Xu. Advice-guided reinforcement learning in a non-markovian environment. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 9073–9080, 2021.
- Neider [2014] Daniel Neider. Applications of automata learning in verification and synthesis. PhD thesis, Aachen, Techn. Hochsch., Diss., 2014, 2014.
- Rafailov et al. [2023] Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290, 2023.
- Seff et al. [2023] Ari Seff, Brian Cera, Dian Chen, Mason Ng, Aurick Zhou, Nigamaa Nayakanti, Khaled S Refaat, Rami Al-Rfou, and Benjamin Sapp. Motionlm: Multi-agent motion forecasting as language modeling. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8579–8590, 2023.
- Shallit [2008] Jeffrey O. Shallit. A Second Course in Formal Languages and Automata Theory. Cambridge University Press, 2008.
- Vemprala et al. [2023] Sai Vemprala, Rogerio Bonatti, Arthur Bucker, and Ashish Kapoor. Chatgpt for robotics: Design principles and model abilities. Microsoft Auton. Syst. Robot. Res, 2:20, 2023.
- Wei et al. [2022] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
- West et al. [2021] Peter West, Chandra Bhagavatula, Jack Hessel, Jena D Hwang, Liwei Jiang, Ronan Le Bras, Ximing Lu, Sean Welleck, and Yejin Choi. Symbolic knowledge distillation: from general language models to commonsense models. arXiv preprint arXiv:2110.07178, 2021.
- Xu et al. [2020a] Zhe Xu, Ivan Gavran, Yousef Ahmad, Rupak Majumdar, Daniel Neider, Ufuk Topcu, and Bo Wu. Joint inference of reward machines and policies for reinforcement learning. In Proceedings of the International Conference on Automated Planning and Scheduling, volume 30, pages 590–598, 2020.
- Xu et al. [2020b] Zhe Xu, Ivan Gavran, Yousef Ahmad, Rupak Majumdar, Daniel Neider, Ufuk Topcu, and Bo Wu. Joint inference of reward machines and policies for reinforcement learning. In ICAPS, pages 590–598. AAAI Press, 2020.
- Yang et al. [2022] Yunhao Yang, Jean-Raphaël Gaglione, Cyrus Neary, and Ufuk Topcu. Automaton-based representations of task knowledge from generative language models. arXiv preprint arXiv:2212.01944, 2022.
- Yang et al. [2023] Yunhao Yang, Neel P Bhatt, Tyler Ingebrand, William Ward, Steven Carr, Zhangyang Wang, and Ufuk Topcu. Fine-tuning language models using formal methods feedback. arXiv preprint arXiv:2310.18239, 2023.
Supplementary Materials
Appendix A Comparison of Closed-loop and Open-loop LARL-RM
Our proposed algorithm can be modified to perform in an open-loop or closed-loop manner with respect to the LLM-generated DFA: if a counterexample is found, LARL-RM can either keep the original LLM-generated DFA or update it. In the first option, it continues learning the reward machine without updating the LLM-generated DFA, which we refer to as open-loop. In the second option, it calls the LLM again and generates a new DFA based on the counterexample, which we refer to as closed-loop. We show our proposed algorithm's convergence to the optimal policy under both options. We demonstrate the open-loop configuration in Figure 15.

A.1 Case Study 1
We investigate the open-loop and closed-loop methods in Case Study 1. First, we compare the effect of an incompatible LLM-generated DFA with that of a compatible LLM-generated DFA in the open-loop setting. Figure 16 illustrates the convergence to the optimal policy for the open-loop configuration.

Figure 16 demonstrates that in the open-loop configuration of LARL-RM, if the LLM-generated DFA is compatible, the algorithm converges to the optimal policy faster than when there is no LLM-generated DFA. However, if the LLM-generated DFA is incompatible, it might take longer to converge to the optimal policy. We now consider the closed-loop configuration, in which an incompatible DFA is used to update the prompt. Figure 17 demonstrates the convergence of LARL-RM to the optimal policy when the closed-loop configuration is applied.

Figure 17 demonstrates that the closed-loop configuration is capable of updating the prompt based on the counterexample such that LARL-RM has a better chance of converging faster than when there is no LLM-generated DFA.
A.2 Case Study 2
We further investigate the effect of the open-loop and closed-loop configurations. The same open-loop configuration as illustrated in Figure 15 is applied here. As shown in Figure 18, the open-loop configuration converges to the optimal policy faster than the case with no LLM-generated DFA, provided the LLM-generated DFA is compatible with the ground truth reward machine.

If we apply the closed-loop configuration to Case Study 2, then, as expected, it converges to the optimal policy faster than when there is no LLM-generated DFA, even if the initial LLM-generated DFA is incompatible with the ground truth reward machine. Figure 19 demonstrates the closed-loop configuration.

As illustrated in Figure 19, if the closed-loop configuration is applied, then LARL-RM has a better chance of converging to the optimal policy faster than when there is no LLM-generated DFA.
Appendix B Theoretical Guarantee of the LARL-RM
Here we demonstrate the convergence of the proposed algorithm to the optimal policy.
Proof of Lemma 1.
We first show that the set of LLM-generated DFAs remains stationary after a finite amount of time (i.e., that no new LLM-generated DFA is added or removed), assuming that all trajectories are visited infinitely often. To this end, we observe the following: if an LLM-generated DFA added to the set is compatible, then it will remain in the set indefinitely. The reason is that there is no counterexample contradicting the LLM-generated DFA, meaning that the check in Line 12 is never triggered. On the other hand, if an LLM-generated DFA is not compatible, then the algorithm eventually detects a trajectory witnessing this fact. Once this happens, it removes the corresponding LLM-generated DFA from the set.
Observe now that the algorithm decrements the LLM query budget by one every time it adds an LLM-generated DFA to the set, and it only does so as long as the budget is positive. Thus, the total number of LLM-generated DFAs generated during the run of the algorithm is bounded by the budget. Consequently, the set no longer changes after a finite period, because the algorithm has either identified the true reward machine or all incompatible LLM-generated DFAs have been removed and the budget is exhausted.
Once the set becomes stationary, an argument analogous to that in the proof of Lemma 1 in Neider et al.'s work Neider et al. [2021] shows that the algorithm will eventually learn the true reward machine. Intuitively, as long as the current hypothesis is not equivalent to the true reward machine, there exists a "short" trajectory witnessing this fact. The episode length is chosen carefully such that such a witness will eventually be encountered and added to the sample. In the worst case, all trajectories will eventually be added to the sample, at which point the learning algorithm is guaranteed to learn the true reward machine. ∎
Since our proposed algorithm learns a minimal reward machine that is consistent with the sample $X$ and compatible with the LLM-generated DFAs, we use an automata learning method based on SAT solving, originally developed to verify parametric systems Neider [2014]. The underlying idea is to reduce the learning task to a series of satisfiability checks of propositional logic formulas. In other words, we build and solve a sequence of propositional formulas $\Phi_n$ for increasing $n$ such that the following properties hold Neider et al. [2021]:
- $\Phi_n$ is satisfiable if and only if there exists a reward machine with $n$ states that is consistent with the sample $X$ and compatible with every LLM-generated DFA; and
- if $\Phi_n$ is satisfiable, then a satisfying assignment contains enough information to construct a reward machine with $n$ states that is both consistent and compatible.
We can now build and solve $\Phi_n$ starting from $n = 1$ and increasing $n$ until the formula becomes satisfiable. Using this method, our proposed algorithm constructs a minimal reward machine that is consistent with $X$ and compatible with the LLM-generated DFAs.
To ease the notation in the remainder, we write $v \xrightarrow{\ell_1 \cdots \ell_k} v'$ to abbreviate a run of the reward machine on the input sequence $\ell_1 \cdots \ell_k$ that starts in $v$ and leads to $v'$. By definition, we have $v \xrightarrow{\varepsilon} v$ for the empty sequence $\varepsilon$ and every state $v$. We later also use this notation for DFAs.
To encode a reward machine with a fixed set of states $V$ and a fixed initial state $v_I$ in propositional logic, note that such a reward machine is uniquely determined by its transition function $\delta$ and its output function $\sigma$. To encode these two functions, we introduce two families of propositional variables: $d_{v, \ell, v'}$ for $v, v' \in V$ and $\ell \in 2^{\mathcal{P}}$, which is set to true if $\delta(v, \ell) = v'$, and $o_{v, \ell, r}$ for $v \in V$, $\ell \in 2^{\mathcal{P}}$, and each reward $r$ occurring in the sample, which is set to true if $\sigma(v, \ell) = r$. We apply the constraints in (1) and (2) to ensure that the variables $d_{v, \ell, v'}$ and $o_{v, \ell, r}$ encode deterministic functions.
$\bigwedge_{v \in V} \bigwedge_{\ell \in 2^{\mathcal{P}}} \Big[ \bigvee_{v' \in V} d_{v, \ell, v'} \Big] \wedge \Big[ \bigwedge_{v' \neq v''} \big( \neg d_{v, \ell, v'} \vee \neg d_{v, \ell, v''} \big) \Big] \quad (1)$

$\bigwedge_{v \in V} \bigwedge_{\ell \in 2^{\mathcal{P}}} \Big[ \bigvee_{r} o_{v, \ell, r} \Big] \wedge \Big[ \bigwedge_{r \neq r'} \big( \neg o_{v, \ell, r} \vee \neg o_{v, \ell, r'} \big) \Big] \quad (2)$
Constraints (1) and (2) ensure that the variables $d_{v, \ell, v'}$ and $o_{v, \ell, r}$ indeed encode deterministic transition and output functions. To enforce consistency with the sample $X$ in propositional logic, we introduce auxiliary variables $x_{\lambda, v}$ for every $v \in V$ and every prefix $\lambda$ of a label sequence in $X$; for a trace $(\ell_1 \cdots \ell_k, r_1 \cdots r_k)$, its prefixes are $\ell_1 \cdots \ell_i$ for $0 \le i \le k$ (the empty prefix $\varepsilon$ is always included), and we write $\mathit{Pref}(X)$ for the set of all prefixes of label sequences in $X$. The variable $x_{\lambda, v}$ is set to true if and only if the prospective reward machine reaches state $v$ after reading $\lambda$. To obtain the desired meaning, we add the following constraints:
$x_{\varepsilon, v_I} \quad (3)$

$\bigwedge_{\lambda \ell \in \mathit{Pref}(X)} \bigwedge_{v, v' \in V} \big( x_{\lambda, v} \wedge d_{v, \ell, v'} \big) \rightarrow x_{\lambda \ell, v'} \quad (4)$

$\bigwedge_{(\ell_1 \cdots \ell_k, r_1 \cdots r_k) \in X} \bigwedge_{i \in \{1, \ldots, k\}} \bigwedge_{v \in V} x_{\ell_1 \cdots \ell_{i-1}, v} \rightarrow o_{v, \ell_i, r_i} \quad (5)$
We denote the conjunction of constraints (3)-(5) by $\Phi_n^X$, which we later use to show the consistency of the learned reward machine with $X$. In order to ensure that the prospective reward machine and an LLM-generated DFA $\mathcal{A} = (Q, q_I, 2^{\mathcal{P}}, \delta_{\mathcal{A}}, F)$ are synchronized, we add auxiliary variables $y_{v, q}$ for $v \in V$ and $q \in Q$; the variable $y_{v, q}$ is set to true if there is a label sequence on which the prospective reward machine reaches $v$ and the DFA reaches $q$. To obtain this behavior, we add the following constraints:
$y_{v_I, q_I} \quad (6)$

$\bigwedge_{v, v' \in V} \bigwedge_{q \in Q} \bigwedge_{\ell \in 2^{\mathcal{P}}} \big( y_{v, q} \wedge d_{v, \ell, v'} \big) \rightarrow y_{v', \delta_{\mathcal{A}}(q, \ell)} \quad (7)$

$\bigwedge_{v, v' \in V} \bigwedge_{q \in Q} \bigwedge_{\ell \in 2^{\mathcal{P}} \colon \delta_{\mathcal{A}}(q, \ell) \notin F} \bigwedge_{r > 0} \big( y_{v, q} \wedge d_{v, \ell, v'} \big) \rightarrow \neg o_{v, \ell, r} \quad (8)$
We denote the conjunction of constraints (6)-(8) by $\Phi_n^{\mathcal{A}}$, which we later use to show the compatibility of the learned reward machine with the LLM-generated DFA $\mathcal{A}$. The overall formula $\Phi_n$ is the conjunction of constraints (1) and (2), the consistency constraints $\Phi_n^X$, and one copy of $\Phi_n^{\mathcal{A}}$ for each LLM-generated DFA $\mathcal{A}$. Theorem 2 states the correctness of this encoding.
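To make the encoding concrete, the sketch below turns constraint (1), which requires every state and label to have exactly one successor, into CNF clauses with python-sat; the variable bookkeeping via IDPool, the label set, and the number of states are illustrative choices.

```python
from itertools import combinations
from pysat.formula import IDPool

def transition_function_clauses(n_states, labels, pool):
    """CNF clauses forcing the variables d_{v,l,v'} to encode a deterministic transition function."""
    clauses = []
    for v in range(n_states):
        for label in labels:
            targets = [pool.id(("d", v, label, v2)) for v2 in range(n_states)]
            clauses.append(targets)                   # at least one successor state
            for a, b in combinations(targets, 2):
                clauses.append([-a, -b])              # at most one successor state
    return clauses

pool = IDPool()
clauses = transition_function_clauses(3, ["g", "p"], pool)  # 3 states, 2 illustrative labels
```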
Theorem 2.
Let $X$ be a sample and let $\mathcal{A}_1, \ldots, \mathcal{A}_m$ be a finite set of LLM-generated DFAs that are compatible with $X$. Then, the following holds:
1. If an interpretation satisfies $\Phi_n$, then the reward machine constructed from it is consistent with $X$ and compatible with each LLM-generated DFA $\mathcal{A}_i$.
2. If there exists a reward machine with $n$ states that is consistent with $X$ as well as compatible with each $\mathcal{A}_i$, then $\Phi_n$ is satisfiable.
Appendix C Learning Reward Machines
In this section, we prove the correctness of our SAT-based algorithm for learning reward machines. To this end, we show that the propositional formulas $\Phi_n^X$ and $\Phi_n^{\mathcal{A}}$ have the desired meaning. We begin with the formula $\Phi_n^X$, which is designed to enforce that the learned reward machine is consistent with the sample $X$.
Lemma 2.
Let the reward machine be constructed from a satisfying assignment of $\Phi_n$ as described above. Then, it is consistent with $X$ (i.e., it produces the reward sequence $r_1 \cdots r_k$ on every label sequence $\ell_1 \cdots \ell_k$ with $(\ell_1 \cdots \ell_k, r_1 \cdots r_k) \in X$).
Proof.
Let the reward machine be constructed from a satisfying assignment of $\Phi_n$ as above. To prove Lemma 2, we show the following, more general statement by induction over the length of prefixes: if $\lambda = \ell_1 \cdots \ell_i$ is a prefix of a trace $(\ell_1 \cdots \ell_k, r_1 \cdots r_k) \in X$, then
1. the state reached by the constructed reward machine after reading $\lambda$ is the unique state $v$ for which $x_{\lambda, v}$ holds; and
2. the rewards produced by the reward machine while reading $\lambda$ are $r_1 \cdots r_i$.
- Base case: Let $\lambda = \varepsilon$. By definition of runs, the trivial run is the only run of the reward machine on the empty word. Similarly, Formula (3) guarantees that the initial state is the unique state for which $x_{\varepsilon, v}$ holds. Both observations immediately prove Part 1. Moreover, Part 2 holds by definition of the semantics of reward machines.
- Induction step: Let $\lambda = \ell_1 \cdots \ell_{i+1}$, and consider the unique run of the reward machine on $\lambda$. By applying the induction hypothesis to the prefix $\ell_1 \cdots \ell_i$ together with Formulas (4) and (5), we obtain that both Part 1 and Part 2 hold for $\lambda$.
Thus, the constructed reward machine is consistent with $X$. ∎
Next, we show that the formula $\Phi_n^{\mathcal{A}}$ ensures that the learned reward machine is compatible with the LLM-generated DFA $\mathcal{A}$.
Lemma 3.
Let the reward machine be constructed from a satisfying assignment of $\Phi_n$ as described above. Then, it is compatible with the LLM-generated DFA $\mathcal{A}$ (i.e., for every nonempty label sequence, a positive final reward produced by the reward machine implies that the label sequence is in $L(\mathcal{A})$).
Proof.
Let the reward machine be constructed from a satisfying assignment of $\Phi_n$ as described above. We first show that if the reward machine reaches state $v$ and the DFA $\mathcal{A}$ reaches state $q$ on the same label sequence, then $y_{v, q}$ holds. The proof of this claim proceeds by induction over the length of label sequences, similar to Part 2 in the proof of Lemma 2.
- Base case: For the empty label sequence, the only runs of the reward machine and the DFA are the trivial ones ending in $v_I$ and $q_I$, respectively. Moreover, Formula (6) ensures that $y_{v_I, q_I}$ holds, which proves the claim.
- Induction step: Consider a label sequence $\ell_1 \cdots \ell_{i+1}$, and let the runs of the reward machine and the DFA on it end in states $v'$ and $q'$, reached from $v$ and $q$ after reading $\ell_1 \cdots \ell_i$. By the induction hypothesis, we know that $y_{v, q}$ holds. Moreover, the satisfying assignment sets $d_{v, \ell_{i+1}, v'}$ to true because this transition was used in the last step of the run of the reward machine. In this situation, Formula (7) ensures $y_{v', q'}$ with $q' = \delta_{\mathcal{A}}(q, \ell_{i+1})$ (since also $y_{v, q}$ holds), which proves the claim.
Let now be a nonempty label sequence (i.e., ). Moreover, let be the run of on and the run of on . Our induction shows that holds in this case.
Towards a contradiction, assume that , , and . In particular, this means . Since (which was used in the last step in the run of on ) and (due to the induction above), Formula (8) ensures that for all with . However, Formula (2) ensures that there is exactly one with . Thus, there has to exist an such that and . By construction of , this means that the last output of on reading must have been . However, our assumption was , which is a contradiction. Thus, is compatible with the LLM-generated DFA . ∎
We can now prove Theorem 2 (i.e., the correctness of our SAT-based learning algorithm for reward machines).
Proof of Theorem 2.
Part 1 follows from Lemmas 2 and 3. To prove Part 2, let there be a reward machine with $n$ states that is consistent with $X$ and compatible with each LLM-generated DFA. From this reward machine, we can derive a valuation for the variables $d_{v, \ell, v'}$ and $o_{v, \ell, r}$ in a straightforward way (e.g., setting $d_{v, \ell, v'}$ to true if and only if $\delta(v, \ell) = v'$). Moreover, we obtain a valuation for the variables $x_{\lambda, v}$ from the runs of the reward machine on the prefixes of traces in the sample $X$, and valuations for the variables $y_{v, q}$ from the synchronized runs of the reward machine and each LLM-generated DFA. Then, this valuation indeed satisfies $\Phi_n$. ∎
Appendix D Convergence to an Optimal Policy
In this section, we prove that LARM-RM almost surely converges to an optimal policy in the limit. We begin by defining attainable trajectories—trajectories that can possibly appear in the exploration of an agent.
Definition 5.
Let $\mathcal{M}$ be a labeled MDP and $m$ a natural number. A trajectory $s_0 a_0 s_1 \cdots a_{k-1} s_k$ is said to be $m$-attainable if $k \le m$ and $p(s_i, a_i, s_{i+1}) > 0$ for each $0 \le i < k$. Moreover, a trajectory is called attainable if there exists an $m \in \mathbb{N}$ such that it is $m$-attainable. Analogously, we call a label sequence $\ell_1 \cdots \ell_k$ ($m$-)attainable if there exists an ($m$-)attainable trajectory such that $\ell_{i+1} = L(s_i, a_i, s_{i+1})$ for each $0 \le i < k$.
An induction shows that LARM-RM almost surely explores every attainable trajectory in the limit (i.e., with probability 1 when the number of episodes goes to infinity). We refer the reader to the extended version of Xu et al. [2020b] (available on arXiv at https://arxiv.org/abs/1909.05912) for a detailed proof.
Lemma 4.
Given $m \in \mathbb{N}$, LARM-RM with episode length at least $m$ almost surely explores every $m$-attainable trajectory in the limit.
As an immediate consequence of Lemma 4, we obtain that LARM-RM almost surely explores every ($m$-)attainable label sequence in the limit as well.
Corollary 1.
Given $m \in \mathbb{N}$, LARM-RM with episode length at least $m$ almost surely explores every $m$-attainable label sequence in the limit.
In this paper, we assume that all possible label sequences are attainable. We relax this restriction later, as in Neider et al. [2021].
To prove Lemma 1, we show a series of intermediate results. We begin by reminding the reader of a well-known fact from automata theory, which can be found in most textbooks Shallit [2008].
Theorem 3.
Let $\mathcal{A}_1$ and $\mathcal{A}_2$ be two DFAs with $L(\mathcal{A}_1) \neq L(\mathcal{A}_2)$. Then, there exists a word $w$, whose length is bounded by the number of states of $\mathcal{A}_1$ and $\mathcal{A}_2$, such that $w \in L(\mathcal{A}_1)$ if and only if $w \notin L(\mathcal{A}_2)$.
We now observe that every reward machine can be translated into a DFA that is "equivalent" to the reward machine Xu et al. [2020b]. This DFA operates over the combined alphabet $2^{\mathcal{P}} \times \mathbb{R}$ and accepts a sequence $(\ell_1, r_1) \cdots (\ell_k, r_k)$ if and only if the reward machine outputs the reward sequence $r_1 \cdots r_k$ on reading the label sequence $\ell_1 \cdots \ell_k$.
Lemma 5 (Xu et al. [2020b]).
Given a reward machine $\mathcal{R}$ with state set $V$, one can construct a DFA $\mathcal{A}_{\mathcal{R}}$ with $|V| + 1$ states such that $L(\mathcal{A}_{\mathcal{R}}) = \{(\ell_1, r_1) \cdots (\ell_k, r_k) \mid \mathcal{R}(\ell_1 \cdots \ell_k) = r_1 \cdots r_k\}$.
Proof.
Let $\mathcal{R} = (V, v_I, 2^{\mathcal{P}}, \mathbb{R}, \delta, \sigma)$ be a reward machine. Then, we define a DFA $\mathcal{A}_{\mathcal{R}} = (Q', q_I', \Sigma', \delta', F')$ by
- $Q' = V \cup \{\bot\}$, where $\bot$ is a new sink state;
- $q_I' = v_I$;
- $\Sigma' = 2^{\mathcal{P}} \times \mathbb{R}$;
- $\delta'(v, (\ell, r)) = \delta(v, \ell)$ if $\sigma(v, \ell) = r$ and $\delta'(v, (\ell, r)) = \bot$ otherwise, with $\delta'(\bot, (\ell, r)) = \bot$; and
- $F' = V$.
In this definition, $\bot$ is a new sink state to which $\mathcal{A}_{\mathcal{R}}$ moves if its input does not correspond to a valid input-output pair produced by $\mathcal{R}$. A straightforward induction over the length of inputs to $\mathcal{A}_{\mathcal{R}}$ shows that it indeed accepts the desired language. In total, $\mathcal{A}_{\mathcal{R}}$ has $|V| + 1$ states. ∎
Next, we show that if two reward machines are semantically different, then we can bound the length of a trace witnessing this fact.
Lemma 6.
Let $\mathcal{R}_1$ and $\mathcal{R}_2$ be two reward machines. If $\mathcal{R}_1$ and $\mathcal{R}_2$ are not equivalent, then there exists a label sequence $\ell_1 \cdots \ell_k$, whose length is bounded by the sizes of $\mathcal{R}_1$ and $\mathcal{R}_2$, such that $\mathcal{R}_1(\ell_1 \cdots \ell_k) \neq \mathcal{R}_2(\ell_1 \cdots \ell_k)$.
Proof of Lemma 6. By Lemma 5, we can translate $\mathcal{R}_1$ and $\mathcal{R}_2$ into DFAs accepting exactly their valid input-output sequences. Since $\mathcal{R}_1$ and $\mathcal{R}_2$ are not equivalent, these DFAs accept different languages, and Theorem 3 yields a short word distinguishing them; the label components of this word form the desired label sequence. ∎
Similar to Lemma 6, we can show that if an LLM-generated DFA is not compatible with a reward machine, we can also bound the length of a label sequence witnessing this fact.
Lemma 7.
Let $\mathcal{R}$ be a reward machine and $\mathcal{A}$ an LLM-generated DFA. If $\mathcal{A}$ is not compatible with $\mathcal{R}$, then there exists a label sequence $\ell_1 \cdots \ell_k$ of bounded length with $\mathcal{R}(\ell_1 \cdots \ell_k) = r_1 \cdots r_k$ such that $r_k > 0$ and $\ell_1 \cdots \ell_k \notin L(\mathcal{A})$.
Proof.
Let be a reward machine and the finite set of rewards that can output. Moreover, let be an LLM-generated DFA. Our proof proceeds by constructing four DFAs and a subsequent analysis of the final DFA to derive the desired bound.
First, we construct the DFA according to Lemma 5. Recall that if and only if . Moreover, has states.
Second, we modify the DFA such that it only accepts sequences with . To this end, we augment the state space with an additional bit , which tracks whether the most recent reward was or greater than . More formally, we define a DFA by
-
•
;
-
•
;
-
•
where if and only if ; and
-
•
.
It is not hard to verify that indeed has the desired property. Moreover, by Lemma 5, has states.
Third, we apply “cylindrification” to the LLM-generated DFA , which works over the alphabet , to match the input alphabet of . Our goal is to construct a new DFA that accepts a sequence if and only if for every reward sequence (cylindrification can be thought of as the inverse of the classical projection operation). We achieve this by replacing each transition in with many transitions of the form where . Formally, we define the DFA by
-
•
;
-
•
;
-
•
for each ; and
-
•
.
It is not hard to verify that has indeed the desired property and its size is .
Fourth, we construct the simple product DFA of and . This DFA is given by where
-
•
;
-
•
;
-
•
; and
-
•
.
By construction of and , it is not hard to see that accepts a sequence if and only if with and —in other words, contains all sequences that witness that is not compatible with . Moreover, has states.
It is left to show that if is not compatible with , then we can find a witness with the desired length. To this end, it is sufficient to show that if , then there exists a sequence with . This fact can be established using a simple pumping argument. To this end, assume that with . Then, there exists a state such that the unique accepting run of on visits twice, say at the positions with . In this situation, however, the DFA also accepts the sequence , where we have removed the “loop” between the repeating visits of . Since this new sequence is shorter than the original sequence, we can repeatedly apply this argument until we arrive at a sequence with . By construction of , this means that , , and , which proves the claim. ∎
With these intermediate results at hand, we are now ready to prove Lemma 1.
Proof of Lemma 1.
Let be the sequence of samples and sets of LLM-generated DFAs that arise in the run of LARM-RM whenever a new counterexample is added to (in Lines 9 and 10 of Algorithm 2) or an LLM-generated DFA is removed from the set (Lines 12 and 13 of Algorithm 2). Moreover, let be the corresponding sequence of reward machines that are computed from . Note that constructing a new reward machine is always possible because LARM-RM makes sure that all LLM-generated DFAs in the current set are compatible with the traces in the sample .
We first observe three properties of these sequences:
-
1.
The true reward machine (i.e., the one that encodes the reward function ) is consistent with every sample that is generated during the run of LARM-RM. This is due to the fact that each counterexample is obtained from an actual exploration of the MDP and, hence, corresponds to the “ground truth”.
-
2.
The sequence grows monotonically (i.e., ) because LARM-RM always adds counterexamples to and never removes them (Lines 9 and 10). In fact, whenever a counterexample is added to to form , then (i.e., ). To see why this is the case, remember that LARM-RM always constructs hypotheses that are consistent with the current sample (and the current set of LLM-generated DFAs). Thus, the current reward machine is consistent with , but the counterexample was added because . Thus, cannot have been an element of .
- 3. The set of LLM-generated DFAs becomes stationary after a finite amount of time, as shown at the beginning of this proof.
Similar to Property 1, we now show that each LLM-generated DFA in the stationary set is compatible with the true reward machine. Towards a contradiction, assume that some LLM-generated DFA in this set is not compatible with the true reward machine. Then, Lemma 7 guarantees the existence of a label sequence of bounded length whose final reward is positive but which is not accepted by the DFA. Since we assume all label sequences to be attainable and have chosen the episode length large enough, Corollary 1 guarantees that LARM-RM almost surely explores this label sequence in the limit. Once this happens, LARM-RM removes the DFA from the set (Lines 12 and 13), which is a contradiction to the fact that the sequence of sets is stationary (Property 3). Hence, we obtain the following:
-
4.
Every LLM-generated DFA in , , is compatible with the true reward machine .
Next, we establish the three additional properties about the sub-sequence of hypotheses starting at position :
- 5. For every recomputation after the set of LLM-generated DFAs has become stationary, the true reward machine is consistent with the current sample (Property 1) and compatible with every DFA in the current set (Property 4); since LARM-RM constructs minimal reward machines, every hypothesis it constructs from this point on has at most as many states as the true reward machine.
-
6.
We have for all . Towards a contradiction, assume that . Since LARM-RM always computes consistent reward machines and if (see Property 2), we know that is not only consistent with but also with (by definition of consistency). Moreover, LARM-RM computes minimal consistent reward machines. Hence, since is consistent with and , the reward machine is not minimal, which is a contradiction.
-
7.
The reward machines generated during the run of LARM-RM after this point are pairwise semantically distinct. This is a consequence of the facts that each newly added trace was a counterexample to the previous hypothesis and that LARM-RM always constructs reward machines consistent with the current sample.
Properties 5 and 6 now provide an upper bound on the size of any reward machine that LARM-RM constructs after this point. Since there are only finitely many reward machines of bounded size, Property 7 implies that there exists a point after which no new reward machine is learned. Hence, it is left to show that the final reward machine is equivalent to the true reward machine (i.e., produces the same reward sequence for all label sequences).
Towards a contradiction, let us assume that . Then, Lemma 6 guarantees the existence of a label sequence with
such that .
Since we assume all label sequences to be attainable and have chosen , Corollary 1 guarantees that LARM-RM almost surely explores this label sequence in the limit. Thus, the trace , where , is almost surely returned as a new counterexample, resulting in a new sample . This, in turn, causes the construction of a new reward machine, which contradicts the assumption that no further reward machine is generated. Thus, the reward machine is equivalent to the true reward machine . ∎
Let us finally turn to the proof of Theorem 1 (i.e., that LARM-RM almost surely converges to the optimal policy in the limit). The key idea is to construct the product of the given MDP and the true reward machine . In this new MDP , the reward function is in fact Markovian since the reward machine encodes all necessary information in its states (and, hence, in the product ). Thus, the fact that classical Q-learning almost surely converges to an optimal policy in the limit also guarantees the same for LARM-RM. We refer the reader to the extended version of Xu et al. [2020b] for detailed proof.
Appendix E Unattainable Label Sequences
Let us now consider the case that not all labels can be explored by the agent (e.g., because certain label sequences are not possible in the MDP). In this situation, two complications arise:
-
1.
It is no longer possible to uniquely learn the true reward machine . In fact, any reward machine that agrees with on the attainable traces becomes a valid solution.
-
2.
The notion of compatibility needs to reflect the attainable traces. More precisely, the condition that a positive final reward implies acceptance by the DFA now has to hold only for attainable label sequences.
Both complications add a further layer of complexity, which makes the overall learning problem—both Q-learning and the learning of reward machines—harder. In particular, we have to adapt the episode length, and the size of the SAT encoding grows as well. In total, we obtain the following result.
Theorem 4.
Let $\mathcal{M}$ be a labeled MDP in which not necessarily all label sequences are attainable, and let the true reward machine encode the rewards of $\mathcal{M}$. Moreover, let a finite set of LLM-generated DFAs be given. Then, LARM-RM with a sufficiently large episode length
almost surely converges to an optimal policy in the limit. The size of the formula grows linearly in the size of $\mathcal{M}$.
In order to prove Theorem 4, specifically the bound on the length of episodes, we first need to introduce the nondeterministic version of DFAs. Formally, a nondeterministic finite automaton (NFA) is a tuple $\mathcal{N} = (Q, q_I, \Sigma, \Delta, F)$ where $Q$, $q_I$, $\Sigma$, and $F$ are as in DFAs and $\Delta \subseteq Q \times \Sigma \times Q$ is the transition relation. Similar to DFAs, a run of an NFA on a word $\ell_1 \cdots \ell_k$ is a sequence of states $q_0 q_1 \cdots q_k$ such that $q_0 = q_I$ and $(q_i, \ell_{i+1}, q_{i+1}) \in \Delta$ for each $0 \le i < k$. In contrast to DFAs, however, NFAs permit multiple runs on the same input or even no run at all. Accepting runs and the language of an NFA are defined analogously to DFAs. Note that we use NFAs instead of DFAs because the former can be exponentially more succinct than the latter.
Similar to Lemma 5, one can translate an MDP into an NFA that accepts exactly the attainable label sequences of the MDP.
Lemma 8.
Given a labeled MDP $\mathcal{M}$, one can construct an NFA, with at most as many states as $\mathcal{M}$, that accepts exactly the admissible label sequences of $\mathcal{M}$.
This construction is similar to Remark 1 of the extended version of Xu et al. [2020b] and proceeds as follows.
Proof.
Given a labeled MDP $\mathcal{M} = (S, s_I, A, p, R, \gamma, \mathcal{P}, L)$, we construct an NFA $\mathcal{N}_{\mathcal{M}} = (S, s_I, 2^{\mathcal{P}}, \Delta, S)$ by letting $(s, \ell, s') \in \Delta$ if and only if there exists an action $a \in A$ with $p(s, a, s') > 0$ and $L(s, a, s') = \ell$; every state of $\mathcal{N}_{\mathcal{M}}$ is accepting.
A straightforward induction shows that $\mathcal{N}_{\mathcal{M}}$ accepts a label sequence if and only if it is an attainable label sequence of $\mathcal{M}$. ∎
We are now ready to prove Theorem 4.
Proof of Theorem 4.
We first modify Lemma 7 slightly. More precisely, we add another fifth step, where we build the product of with the cylindrification of the DFA . It is not hard to verify that this results in an NFA that accepts a trace if and only if is admissible and not in the language , , and . This NFA has states. A similar pumping argument as in the proof of Lemma 7 can now be used to show the existence of a witness of length at most .
Analogously, we can modify Lemma 6. We build the input-synchronized product of the two DFAs and and the DFA . This results in an NFA with states. If on an attainable label sequence, then we can find a sequence such that leads in this product to a state where accepts (the label sequence is attainable), but exactly one of and accepts. Using a pumping argument as above, we can bound the length of such an input to at most .
Moreover, we have to modify the formula to account for the new situation. We first introduce new auxiliary variables $y_{v, q, s}$, where $v$ is a state of the prospective reward machine, $q$ a state of the LLM-generated DFA, and $s$ a state of the NFA $\mathcal{N}_{\mathcal{M}}$; we use these variables instead of the variables $y_{v, q}$. Second, we replace Formulas (6), (7), and (8) with the following three formulas:
$y_{v_I, q_I, s_I} \quad (9)$

$\bigwedge_{v, v' \in V} \bigwedge_{q \in Q} \bigwedge_{(s, \ell, s') \in \Delta} \big( y_{v, q, s} \wedge d_{v, \ell, v'} \big) \rightarrow y_{v', \delta_{\mathcal{A}}(q, \ell), s'} \quad (10)$

$\bigwedge_{v, v' \in V} \bigwedge_{q \in Q} \bigwedge_{(s, \ell, s') \in \Delta \colon \delta_{\mathcal{A}}(q, \ell) \notin F} \bigwedge_{r > 0} \big( y_{v, q, s} \wedge d_{v, \ell, v'} \big) \rightarrow \neg o_{v, \ell, r} \quad (11)$
Clearly, the size of the formula grows linearly in the size of $\mathcal{M}$ as compared to the case that all label sequences are attainable. Moreover, it is not hard to verify that the modified formula has indeed the desired meaning. More precisely, we obtain a result analogous to Lemma 1, but for the case that not all label sequences are attainable.
Once we have established that our learning algorithm learns consistent and compatible reward machines, the overall convergence of LARM-RM to an optimal policy and the bound on the length of episodes can then be proven analogously to Theorem 1. ∎