
Using Large Language Models to Automate and Expedite Reinforcement Learning with Reward Machine

Shayan Meshkat Alsadat1    Jean-Raphaël Gaglione2    Daniel Neider3    Ufuk Topcu2 & Zhe Xu1
1Arizona State University
2University of Texas at Austin
3Technical University of Dortmund
{smeshka1, xzhe1}@asu.edu, {jr.gaglione, utopcu}@utexas.edu, [email protected]
Abstract

We present LARL-RM (Large language model-generated Automaton for Reinforcement Learning with Reward Machine), an algorithm that encodes high-level knowledge into reinforcement learning in the form of an automaton in order to expedite learning. Our method uses Large Language Models (LLMs) to obtain high-level domain-specific knowledge through prompt engineering, instead of requiring an expert to hand-encode the automaton and provide it to the reinforcement learning algorithm directly. We use chain-of-thought and few-shot methods for prompt engineering and demonstrate that our method works with these approaches. Additionally, LARL-RM enables fully closed-loop reinforcement learning without the need for an expert to guide and supervise the learning, since LARL-RM can query the LLM directly to generate the required high-level knowledge for the task at hand. We also provide a theoretical guarantee that our algorithm converges to an optimal policy. We demonstrate that LARL-RM speeds up convergence by 30% by implementing our method in two case studies.

1 Introduction

Large Language Models (LLMs) encode various types of information, including domain-specific knowledge, which enables them to be used as a source for extracting specific information about a domain. This knowledge can be used for planning and control of various systems, such as autonomous systems, by synthesizing control policies Seff et al. (2023). These policies essentially dictate the agent's actions in the environment at a high level.

In general, extracting domain-specific knowledge is not a straightforward task, since prompting the LLM without a proper prompting technique may result in hallucinations Jang et al. (2023); even with proper techniques, LLMs are known to hallucinate. Several methods have been suggested for prompt engineering, ranging from zero-shot prompting Kojima et al. (2022) to training the pre-trained language model on a domain-specific dataset (fine-tuning) Rafailov et al. (2023) to obtain more accurate responses. The latter approach is more involved and resource-intensive, since it requires a dataset related to the domain as well as enough computational power to train the pre-trained language model on that dataset Ding et al. (2023). Moreover, not all language models are available for fine-tuning due to their providers' policies. Hence, fine-tuning is a less desirable approach as long as other prompt engineering methods can produce accurate responses.

Our algorithm LARL-RM (large language model-generated automaton for reinforcement learning with reward machine) uses prompt engineering methods to extract the relevant knowledge from the LLM. This knowledge is then incorporated into the reinforcement learning (RL) algorithm in the form of high-level knowledge to expedite the RL and guide it to reach the optimal policy faster. We integrate the LLM-encoded high-level knowledge into the RL using deterministic finite automata (DFA). Our contributions are as follows: (a) LARL-RM allows for integration of high-level knowledge from a specific domain into RL using LLM outputs directly. (b) Our proposed method can adjust the instructions given to the LLM by closing the loop in learning: when counterexamples are encountered during the policy update, it updates the prompt based on the chain-of-thought method to regenerate the automaton. (c) LARL-RM learns the reward machine and expedites the RL by providing the LLM-generated DFA to the RL algorithm. We also show that LARL-RM speeds up the RL by 30% by implementing it in two case studies.

Related work. Generative language models (GLMs) such as the GPT models have been shown to encode world knowledge and to be capable of generating instructions for a task Hendrycks et al. (2020). Researchers in West et al. (2021) have demonstrated that advances in GLMs allow the construction of task-relevant knowledge graphs. Generating automata from GLMs. Generating an automaton from a generative language model for a specific task is shown in Yang et al. (2022); however, the proposed method, called GLM2FSA, uses verbs and verb phrases and maps them onto a predefined set of verbs in order to convert the instructions given by the GLM into an automaton, which reduces the generality of the method across domains and still requires an expert to pre-define this set. GLMs are also used to extract task-relevant semantic knowledge and generate plans to complete a task in robotics Vemprala et al. (2023). Learning reward machines. Researchers in Neider et al. (2021) proposed a method called AdvisoRL for RL in which the algorithm learns the reward machine; however, an expert is required to guide the learning process through the provision of so-called advice. In our method, this expert is replaced by an LLM.

2 Preliminaries

In this section, we define our notation and introduce the necessary background on reinforcement learning, reward machines, and deterministic finite automata (DFA).

Definition 1 (labeled Markov decision process).

A labeled Markov decision process is a tuple $\mathcal{M}=\langle S,s_{I},A,p,R,\gamma,\mathcal{P},L\rangle$ where $S$ is a finite set of states, $s_{I}\in S$ is the agent's initial state, $A$ is a finite set of actions, $p:S\times A\times S\mapsto[0,1]$ is the transition probability function, $R:S\times A\times S\mapsto\mathbb{R}$ is the reward function, $\gamma\in[0,1]$ is the discount factor, $\mathcal{P}$ is a finite set of propositions, and $L:S\times A\times S\mapsto 2^{\mathcal{P}}$ is a labeling function that determines the high-level events the agent encounters in the environment.

A policy is a function mapping states in $S$ to a probability distribution over actions in $A$: at state $s\in S$, the agent chooses action $a\in A$ with probability $\pi(s,a)$ under policy $\pi$, which leads to a new state $s^{\prime}$. The probability of reaching state $s^{\prime}$ is given by $p(s,a,s^{\prime})$. A trajectory $s_{0}a_{0}s_{1}\ldots s_{k}a_{k}s_{k+1}$, where $k\in\mathbb{N}$, is the resulting sequence of states and actions during this stochastic process. Its corresponding label sequence is $l_{0}l_{1}\ldots l_{k}$ where $L(s_{i},a_{i},s_{i+1})=l_{i}$ for all $i\leq k$. A trace is the pair of label and reward sequences $(\lambda,\rho)\coloneqq(l_{1}l_{2}\ldots l_{k},r_{1}r_{2}\ldots r_{k})$, and the payoff of the agent is $\sum_{i}\gamma^{i}r_{i}$.
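For illustration, the following minimal Python sketch rolls out a policy in a labeled MDP and records the resulting trace. The interface of the hypothetical `mdp` object (`step`, `L`, `s_I`) and of `policy` is an assumption of this sketch, not part of the formal definition above.

import random

def sample_trace(mdp, policy, horizon):
    """Roll out a policy in a labeled MDP and return the trace (lambda, rho).

    Assumed interface: mdp.step(s, a) -> (s', reward), mdp.L(s, a, s') -> frozenset
    of propositions, mdp.s_I is the initial state, and policy(s) returns a dict
    {action: probability}.
    """
    s = mdp.s_I
    labels, rewards = [], []
    for _ in range(horizon):
        actions, probs = zip(*policy(s).items())
        a = random.choices(actions, weights=probs)[0]   # drawn with probability pi(s, a)
        s_next, r = mdp.step(s, a)                      # next state drawn from p(s, a, .)
        labels.append(mdp.L(s, a, s_next))              # l_i = L(s_i, a_i, s_{i+1})
        rewards.append(r)
        s = s_next
    return labels, rewards                              # trace (lambda, rho)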

We exploit deterministic finite automata (DFA) to encode the high-level knowledge obtained from the LLM into reinforcement learning in order to speed up the process of obtaining the optimal policy $\pi^{*}$. A DFA has an input alphabet and a set of accepting states, which we can use to steer the learning process.

Definition 2 (Deterministic finite automata).

A DFA is a tuple $\mathcal{D}=\langle H,h_{I},\Sigma,\delta,F\rangle$ where $H$ is a finite set of states, $h_{I}\in H$ is the initial state, $\Sigma$ is the input alphabet (here we consider $\Sigma=2^{\mathcal{P}}$), $\delta:H\times\Sigma\mapsto H$ is the transition function between states, and $F\subseteq H$ is the set of accepting states.

We use deterministic finite automata to model the LLM-generated DFA. The main idea of the LLM-generated DFA is to assist the algorithm in learning the reward machine by providing information about which label sequences may result in a positive reward and which will definitely result in a non-positive reward Neider et al. (2021).

Reward machines are a type of Mealy machine used to encode a non-Markovian reward function. They are automata that receive a label from the environment, respond with a reward, and transition to their next state. Their input alphabet is $2^{\mathcal{P}}$, the subsets of the propositions $\mathcal{P}$, and their output alphabet is a set of real numbers.

Definition 3 (Reward machine).

A reward machine is a tuple $\mathcal{A}=\left(V,\nu_{I},2^{\mathcal{P}},M,\delta,\sigma\right)$ where $V$ is a finite set of states, $\nu_{I}\in V$ is the initial state, $2^{\mathcal{P}}$ is the input alphabet, $M\subseteq\mathbb{R}$ is the output alphabet, $\delta:V\times 2^{\mathcal{P}}\mapsto V$ is the transition function, and $\sigma:V\times 2^{\mathcal{P}}\mapsto M$ is the output function.

Applying the reward machine $\mathcal{A}$ to a sequence of labels $l_{1}l_{2}\ldots l_{k}\in(2^{\mathcal{P}})^{*}$ creates a sequence of states $\nu_{0}(l_{1},r_{1})\nu_{1}(l_{2},r_{2})\ldots\nu_{k-1}(l_{k},r_{k})\nu_{k}$ where $\nu_{0}=\nu_{I}$ and, for all $i\in\{0,\ldots,k-1\}$, the transition $\delta(\nu_{i},l_{i+1})=\nu_{i+1}$ to the next state of the reward machine yields a reward of $\sigma(\nu_{i},l_{i+1})=r_{i+1}$. Hence, a label sequence $l_{1}l_{2}\ldots l_{k}$ corresponds to a sequence of rewards produced by the reward machine, $\mathcal{A}(l_{1}l_{2}\ldots l_{k})=r_{1}r_{2}\ldots r_{k}$. Therefore, the reward function $R$ of an MDP is encoded by the reward machine $\mathcal{A}$: every trajectory $s_{0}a_{0}s_{1}\ldots s_{k}a_{k}s_{k+1}$ corresponds to a label sequence $l_{1}l_{2}\ldots l_{k}$, and the agent then receives the reward sequence $\mathcal{A}(l_{1}l_{2}\ldots l_{k})$.
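The following minimal Python sketch applies a reward machine to a label sequence as just described. The dictionary-based representation of the reward machine is an assumption made for illustration only.

def run_reward_machine(rm, labels):
    """Apply a reward machine A = (V, v_I, 2^P, M, delta, sigma) to a label
    sequence l_1 ... l_k and return the induced reward sequence A(l_1 ... l_k).

    Assumed representation: rm is a dict with keys 'v_I', 'delta', 'sigma',
    where delta and sigma are dicts keyed by (state, frozenset_of_propositions).
    """
    v = rm["v_I"]
    rewards = []
    for label in labels:                      # label is a frozenset of propositions
        key = (v, frozenset(label))
        rewards.append(rm["sigma"][key])      # r_{i+1} = sigma(v_i, l_{i+1})
        v = rm["delta"][key]                  # v_{i+1} = delta(v_i, l_{i+1})
    return rewards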

Refer to caption
Figure 1: An autonomous car (agent) must first go to intersection $J$ and then make a right turn to go to intersection $b$. The agent must check for the green traffic light $g$, car $c$, and pedestrian $p$.

Motivating Example. We consider an autonomous car on American roads with the set of actions $A=\{\textit{up},\textit{down},\textit{right},\textit{left},\textit{stay}\}$ (see Figure 1). The agent must reach intersection $J$ and then make a right turn when the traffic light $g$ is green, there is no car $c$ on the left, and there is no pedestrian $p$ on the right. The agent then reaches intersection $b$, where it must go straight when the traffic light is green and there are no cars or pedestrians. The corresponding DFA for this motivating example is shown in Figure 2.

Refer to caption
Figure 2: DFA of the motivating example where the agent must reach intersection $J$ while avoiding cars and pedestrians.

3 Generating Domain Specific Knowledge Using LLM

Large Language Models (LLMs) encode rich world and domain-specific knowledge Yang et al. (2023), which allows for the extraction of relevant information through appropriate prompting; in some cases this has been shown to even surpass fine-tuning the LLM Wei et al. (2022). Hence, we use prompt engineering to extract the relevant domain knowledge from the LLM and provide it in the form of an LLM-generated DFA Neider et al. (2021) to expedite reinforcement learning. An LLM-generated DFA can be thought of as a means to provide the process of learning the reward machine with information about which label sequences are promising, in the sense that they may result in a reward.

Applying a DFA $\mathcal{D}$ to a sequence $\lambda=l_{1}l_{2}\ldots l_{n}\in\Sigma^{\star}$ results in a sequence of states $h_{0}h_{1}\ldots h_{n}$ where $h_{0}=h_{I}$ and $h_{i}=\delta(h_{i-1},l_{i})$ for all $i\in\{1,\ldots,n\}$. We define $\mathcal{L}(\mathcal{D})=\{l_{1}l_{2}\ldots l_{n}\in\Sigma^{\star}\mid h_{n}\in F\}$, commonly referred to as the formal language accepted by $\mathcal{D}$. A DFA $\mathcal{D}$ can be considered an LLM-generated DFA if every label sequence $\lambda\in\mathcal{L}(\mathcal{D})$ could result in a positive reward, while label sequences $\lambda\notin\mathcal{L}(\mathcal{D})$ should not receive a positive reward; it is therefore a binary classifier that points out the promising explorations Neider et al. (2021). Our proposed algorithm uses the LLM-generated DFAs to narrow down the search space for learning the ground truth reward machine, hence expediting the reinforcement learning; in our method, however, the LLM-generated DFA is generated automatically for a specific domain using the LLM. It is known that LLMs hallucinate; therefore, our proposed algorithm is capable of ignoring the LLM-generated DFA $\mathcal{D}$ if it is incompatible with the underlying ground truth reward machine.

Definition 4 (LLM-generated DFA compatibility).

An LLM-generated DFA $\mathcal{D}$ is compatible with the ground truth reward machine $\mathcal{A}$ if, for all label sequences $l_{1}l_{2}\ldots l_{k}\in(2^{\mathcal{P}})^{*}$ with reward sequence $\mathcal{A}(l_{1}l_{2}\ldots l_{k})=r_{1}r_{2}\ldots r_{k}$, it holds that $r_{k}>0$ implies $l_{1}l_{2}\ldots l_{k}\in\mathcal{L}(\mathcal{D})$.
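As a minimal illustration, the Python sketch below runs a DFA on a label sequence and checks Definition 4 against a finite sample of observed traces (over a finite sample this check is necessary but not sufficient for compatibility with the full reward machine). The dictionary representation of the DFA is an assumption of the sketch.

def dfa_accepts(dfa, labels):
    """Run a DFA D = (H, h_I, Sigma, delta, F) on a label sequence and report
    whether the run ends in an accepting state.  Assumed representation: dfa is
    a dict with keys 'h_I', 'delta' (keyed by (state, frozenset)), and 'F' (set)."""
    h = dfa["h_I"]
    for label in labels:
        h = dfa["delta"][(h, frozenset(label))]
    return h in dfa["F"]

def compatible_with_sample(dfa, sample):
    """Check Definition 4 on a finite sample X of traces (labels, rewards):
    every trace whose final reward is positive must be accepted by the DFA."""
    return all(dfa_accepts(dfa, labels)
               for labels, rewards in sample
               if rewards and rewards[-1] > 0)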

We demonstrate an LLM-generated DFA for the motivating example in Figure 3.

Refer to caption
Figure 3: An LLM-generated DFA for our motivating example.

In our proposed method, we generate the domain-specific knowledge using the GPT series Liu et al. (2023). There are several methods of prompting LLMs, such as the zero-shot method Kojima et al. (2022), where the prompt uses a step-by-step approach to obtain better results, or the few-shot method Brown et al. (2020), where the prompt is given similar examples encouraging the LLM to produce similar reasoning to reach the correct answer. We use a combination of these methods to encourage the GPT model to produce the correct information. First, we adopt a persona for the domain from which we plan to extract relevant information; in our example, the persona is an expert in American road traffic rules. Adopting a persona allows for more coherent and higher-quality responses from the language model. Figure 4 demonstrates the GPT response to a question about making a right turn at a traffic light while an expert persona is adopted.

Refer to caption
Figure 4: Prompting GPT-3.5-Turbo to adopt a persona as an expert in traffic rules (GPT response and prompt).
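A persona prompt such as the one in Figure 4 can be issued programmatically through the OpenAI API. The sketch below is illustrative only: the sampling parameter values and the prompt wording are assumptions, not the exact settings used to produce Figure 4.

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def prompt_llm(question, temperature=0.2, top_p=1.0):
    """Query GPT-3.5-Turbo with an expert persona (system message) and return the reply text."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=temperature,   # LLM parameter T
        top_p=top_p,               # LLM parameter p
        messages=[
            {"role": "system",
             "content": "You are an expert in American road traffic rules."},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content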

Figure 4 demonstrates that we can obtain the knowledge for a specific domain using an appropriate prompting method. In general, LLMs are text-based, which means that their output is text and cannot directly be used in an RL algorithm; therefore, we require a method to convert this textual output into an automaton. The next step is to ask the LLM to convert the provided steps into the relevant propositions $\mathcal{P}$ so they can be used for the generation of the automaton. We perform this by mapping the steps onto the set of propositions, as illustrated in Figure 5.

Refer to caption
Figure 5: Mapping the output of the LLM to a specific set of propositions.

We use the mapped steps (instructions) to generate the DFA, since the actions and propositions are now available; however, the transitions and states are yet to be defined. We can use the labels for the transitions between states, and to obtain the states we circle back to the task and use the instructions. Therefore, we use the task description and the mapped steps to obtain the transitions and states; Figure 6 shows the prompt used to create them.

Refer to caption
Figure 6: We use the mapped instructions to propositions and the task description to obtain the states and transitions.

Having the states $h_{i}$, the propositions $\mathcal{P}$, and the transition function $\delta(h_{i},l_{i})$, we can construct the DFA for the task and use it to expedite convergence to the optimal policy $\pi^{*}(s,a)$.

Algorithm 1 PromptLLMforDFA: Constructing task DFA for a domain specific task using LLM

Input: prompt $f$
Parameter: temperature $T$, top-$p$ value $p$ (OpenAI LLM parameters)
Output: DFA

1:  $output \leftarrow \texttt{promptLLM}(f)$
2:  for $step \in output$ do
3:     $M_{\text{prop}} \leftarrow \texttt{MapToProp}(step)$
4:     $W_{\text{action}} \leftarrow \texttt{MapToActions}(step)$
5:  end for
6:  $output \leftarrow \texttt{UpdatePrompt}(f, steps, M_{\text{prop}}, W_{\text{action}})$
7:  $H \leftarrow \{h_{0}\}$
8:  $\delta \leftarrow \{\delta_{0}\}$
9:  $\mathcal{P} \leftarrow \{\mathscr{P}_{0}\}$
10:  $i \leftarrow 0$
11:  for $instruction$ in $output$ do
12:     $i \leftarrow i+1$
13:     $h_{i} \leftarrow \texttt{GetStates}(instruction)$
14:     $\delta_{i} \leftarrow \texttt{GetTransition}(instruction)$
15:     $\mathscr{P}_{i} \leftarrow \texttt{GetProposition}(instruction)$
16:     $H \leftarrow H \cup \{h_{i}\}$
17:     $\delta \leftarrow \delta \cup \{\delta_{i}\}$
18:     $\mathcal{P} \leftarrow \mathcal{P} \cup \{\mathscr{P}_{i}\}$
19:  end for
20:  return $\langle H, h_{0}, 2^{\mathcal{P}}, \delta, \{h_{i}\}\rangle$
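To make the extraction step of Algorithm 1 (Lines 11-19) concrete, the sketch below parses an LLM response into a DFA tuple. The line format it expects from the model is an assumption chosen for illustration; any machine-readable format the prompt instructs the model to use would work equally well.

import re

def parse_dfa_response(text):
    """Parse an LLM response into a DFA tuple (H, h_0, Sigma, delta, F).

    Assumed response format (instructed via the prompt): one transition per line,
        h_i --[g & !c & !p]--> h_j
    plus a line 'accepting: h_j, ...' listing the accepting states.
    """
    states, delta, accepting = set(), {}, set()
    for line in text.splitlines():
        m = re.match(r"\s*(h\d+)\s*--\[(.+?)\]-->\s*(h\d+)", line)
        if m:
            src, guard, dst = m.groups()
            states.update({src, dst})
            delta[(src, guard.strip())] = dst      # transition on the guard label
        elif line.lower().startswith("accepting:"):
            accepting.update(s.strip() for s in line.split(":", 1)[1].split(","))
    return states, "h0", None, delta, accepting    # Sigma left implicit (2^P)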

Algorithm 1 can be modified so that it does not even require the propositions $\mathcal{P}$ and actions $A$ to obtain the DFA. Generating a DFA this way requires a different prompting technique that combines the few-shot and chain-of-thought methods, which we call MixedR (mixed reasoning). We exploit this approach for tasks that are more involved, since obtaining the propositions and actions requires more information about the system. MixedR only requires the task description and example DFAs; the examples do not even need to be relevant to the task, as long as descriptions are provided for them, even at a high level, since LLMs are capable of extracting complex patterns from textual data. We use MixedR to generate the DFA. Figure 7 shows the prompt template used for MixedR.

Refer to caption
Figure 7: We use MixedR to obtain more complex DFAs.

We can use the motivating example and the MixedR prompting technique to obtain the relevant DFA for a smaller instance of the problem. We then provide the DFA obtained from MixedR as the LLM-generated DFA for the RL algorithm to use, so that it can learn the ground truth reward machine in a smaller search space.
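The following sketch assembles a MixedR-style prompt in Python. The wording of the template is an illustrative assumption; the template actually used in our experiments (Figure 7) may differ in details.

def build_mixedr_prompt(task_description, example_dfas):
    """Assemble a MixedR-style prompt: example DFAs with descriptions (few-shot)
    plus an explicit request to reason step by step (chain-of-thought).

    example_dfas is a list of (description, dfa_text) pairs; they need not be
    related to the target task.
    """
    examples = "\n\n".join(
        f"Task: {desc}\nDFA: {dfa}" for desc, dfa in example_dfas
    )
    return (
        "You are an expert in American road traffic rules.\n\n"
        "Here are examples of tasks and their DFAs:\n\n"
        f"{examples}\n\n"
        f"Task: {task_description}\n"
        "Think step by step: list the propositions, then the states, then the "
        "transitions, and finally output the DFA in the same format as the examples."
    )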

4 Expediting Reinforcement Learning Using LLM-generated DFA

We use the LLM output as instructions to expedite the convergence of reinforcement learning to the optimal policy $\pi^{*}(s,a)$. We use the LLM to narrow down the search space for the RL algorithm, since in a typical reinforcement learning setting the search space is considerable, leading to cases where training requires significant time and resources. Hence, using pre-trained LLMs, we can reduce this search space and converge to the optimal policy faster. One main issue to address is that LLMs may generate outputs that are not aligned with facts; therefore, it is necessary to close the loop in the RL algorithm so that, if a counterexample is encountered, the algorithm is capable of adjusting and updating the prompt to create an updated LLM-generated DFA that can be used by the RL algorithm. Figure 8 illustrates the LARL-RM algorithm.

Refer to caption
Figure 8: LARL-RM uses the LLM to generate the automata. The LLM-generated automata will be used by the RL algorithm to expedite the convergence to optimal policy.
Algorithm 2 LARL-RM algorithm for incorporating high-level domain-specific knowledge from LLM into RL

Input: prompt $f$, episode length $Episode_{length}$
Parameter: learning rate $\alpha$, discount factor $\gamma$, epsilon-greedy $\epsilon$, temperature $T$, top-$p$ value $p$, LLM query budget $\mathcal{J}\in\mathbb{N}$

1:  $X \leftarrow \emptyset$ (empty sample)
2:  $\mathcal{D} \leftarrow \texttt{PromptLLMforDFA}(f)$
3:  $\mathcal{A} \leftarrow \texttt{InitializeRewardMachine}()$ (compatible with $\mathcal{D}$)
4:  $q(s,v,a) \leftarrow \texttt{InitializeQFunction}()$, $Q=\{q^{v}\mid v\in V\}$
5:  for $episode$ in $1,\dots,Episode_{length}$ do
6:     $X_{\text{init}} \leftarrow X$
7:     $\mathcal{D}_{\text{init}} \leftarrow \mathcal{D}$
8:     $(\lambda,\rho,Q) \leftarrow \texttt{QRM-episode}(\mathcal{A},Q)$
9:     if $\mathcal{A}(\lambda)\neq\rho$ then
10:        $X \leftarrow X\cup\{(\lambda,\rho)\}$
11:     end if
12:     if $\mathcal{D}\neq\mathcal{D}_{\text{init}}$ or $X\neq X_{\text{init}}$ then
13:        if $\mathcal{J}>0$ and $\exists(\lambda^{\prime},\rho^{\prime})\in X,\ \rho^{\prime}>0$ and $\lambda^{\prime}\notin\mathcal{L}(\mathcal{D})$ then
14:           $f \leftarrow \texttt{UpdatePrompt}(f,\lambda^{\prime})$
15:           $\mathcal{D} \leftarrow \texttt{PromptLLMforDFA}(f)$
16:           $\mathcal{J} \leftarrow \mathcal{J}-1$
17:        end if
18:        $\mathcal{A} \leftarrow \texttt{LearnRewardMachine}()$ (compatible with $\mathcal{D}$ and $X$)
19:        $q(s,v,a) \leftarrow \texttt{InitializeQFunction}()$
20:     end if
21:  end for

Our proposed method LARL-RM uses the prompt provided by the user to run the RL algorithm for a specific task that requires domain knowledge; by using the LLM, the need for an expert is minimized, and the algorithm itself can update the prompt using the counterexamples in order to update its DFA. LARL-RM first initializes an empty set for the traces $X$, then uses the prompt to generate the relevant DFA for the RL (Lines 1-2). The algorithm then initializes the reward machine based on the LLM-generated DFA obtained from the LLM and also initializes QRM (q-learning for reward machines) before starting the episodes (Lines 3-4). LARL-RM stores the trace set $X$ and the DFA $\mathcal{D}$ in order to reinitialize the reward machine and q-values if counterexamples are encountered; QRM is then called to update the q-values, and if a counterexample is found, that trace is stored so that it can be used to update the reward machine and the DFA (Lines 5-10). If a trace encountered by the agent is not compatible with the ground truth reward machine, the corresponding DFA is discarded, and the algorithm uses the initial prompt $f$ together with the counterexample label sequence to obtain an updated prompt $f$. LARL-RM also uses an LLM query budget $\mathcal{J}$ to ensure that, if the responses are not compatible with the ground truth reward machine, it does not get stuck in a loop: once the budget $\mathcal{J}$ is depleted, it starts to learn the ground truth reward machine without an LLM-generated DFA (Lines 13-16; this further highlights the importance of prompt engineering). Afterward, LARL-RM uses the stored DFA and trace sets to reinitialize the reward machine and q-values (Lines 18-19).
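The control flow of Algorithm 2 can be summarized in the following Python sketch, which reuses the helpers sketched earlier (run_reward_machine, dfa_accepts). QRM_episode, learn_reward_machine, init_q, prompt_llm_for_dfa, and update_prompt are placeholders for the corresponding subroutines; their exact signatures and data representations are assumptions of this sketch, not a definitive implementation.

def larl_rm(prompt, episodes, llm_budget, env):
    """High-level sketch of the LARL-RM loop (Algorithm 2)."""
    X = set()                                     # counterexample traces
    D = prompt_llm_for_dfa(prompt)                # LLM-generated DFA
    A = learn_reward_machine(X, D)                # initial hypothesis RM (compatible with D)
    Q = init_q(A)
    for _ in range(episodes):
        X_init, D_init = set(X), D
        labels, rewards, Q = QRM_episode(env, A, Q)
        if list(run_reward_machine(A, labels)) != list(rewards):
            X.add((tuple(labels), tuple(rewards)))            # store counterexample
        if D != D_init or X != X_init:
            bad = next(((l, r) for l, r in X
                        if r[-1] > 0 and not dfa_accepts(D, l)), None)
            if llm_budget > 0 and bad is not None:
                prompt = update_prompt(prompt, bad[0])        # refine prompt f with lambda'
                D = prompt_llm_for_dfa(prompt)
                llm_budget -= 1
            A = learn_reward_machine(X, D)        # re-learn RM consistent with X, compatible with D
            Q = init_q(A)
    return A, Q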

4.1 Using LLM-generated DFA to Learn the Reward Machine

LARL-RM uses the counterexamples to create a minimal reward machine while being guided by the LLM-generated DFA. Our method uses a similar approach to that of Neider et al. (2021) in that it maintains a finite sample set $X\subset(2^{\mathcal{P}})^{+}\times\mathbb{R}^{+}$ and a finite set of DFAs $\mathscr{D}=\{\mathcal{D}_{1},\ldots,\mathcal{D}_{l}\}$. We assume that an LLM-generated DFA is compatible with the sample $X$ in the sense that every $(l_{1}\ldots l_{k},r_{1}\ldots r_{k})\in X$ with positive reward $r_{k}>0$ satisfies $l_{1}\ldots l_{k}\in\mathcal{L}(\mathcal{D})$; if this criterion is not fulfilled, the DFA is ignored by LARL-RM.

The reward machine learning, which relies on the LLM-generated DFA, generates a minimal reward machine $\mathcal{A}$ that is consistent with the sample $X$ and compatible with $\mathcal{D}$. The requirement that the reward machine be minimal is important for the convergence of LARL-RM to the optimal policy. LARL-RM uses SAT-based automata learning, originally developed for the verification of parametric systems, since the learning task can be broken down into a series of satisfiability checks of propositional logic formulas. Essentially, we construct and solve the propositional formulas $\Phi^{X,\mathcal{D}}_{n}$ for values of $n>0$ Neider et al. (2021); Neider (2014).

We construct the formula $\Phi^{X,\mathcal{D}}_{n}$ over a set of propositional variables $\{x,y,z,\ldots\}$ using the Boolean connectives $\neg$, $\vee$, $\wedge$, and $\rightarrow$; an interpretation is a mapping $\mathcal{I}$ from the propositional variables to Boolean values $\{0,1\}$. If the propositional formula is satisfied, then $\mathcal{I}\models\Phi$, which reads as "$\mathcal{I}$ satisfies formula $\Phi$". Using this approach, we can obtain a minimal reward machine similar to Neider et al. (2021). Using the LLM, we are capable of constructing an LLM-generated DFA in a format that is compatible with SAT-based automata learning methods. LARL-RM is also capable of adjusting the initial prompt $f$ so that, in case the resulting DFA is not compatible with the ground truth reward machine, the prompt can be updated. The updated prompt $f$ uses the counterexample label sequence $\lambda^{\prime}$ to call the LLM and obtain an updated DFA $\mathcal{D}$ that has a chance of being compatible with the ground truth reward machine.
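The search over increasing automaton sizes can be sketched as follows. The helpers encode (which builds the CNF clauses of $\Phi^{X,D}_{n}$) and decode (which reads a reward machine off a satisfying assignment) are hypothetical; the SAT call uses the python-sat (pysat) package. This is a sketch of the overall loop, not the exact implementation used in our experiments.

from itertools import count
from pysat.solvers import Glucose3

def learn_minimal_rm(sample, dfa, encode, decode, n_max=10):
    """Find the smallest n such that Phi_n^{X,D} is satisfiable and return the
    reward machine extracted from a satisfying model.

    encode(n, sample, dfa) -> list of clauses (lists of signed ints) for Phi_n^{X,D};
    decode(model, n)       -> reward machine built from the satisfying assignment.
    """
    for n in count(1):
        if n > n_max:
            return None                       # give up beyond the size budget
        clauses = encode(n, sample, dfa)
        with Glucose3(bootstrap_with=clauses) as solver:
            if solver.solve():                # Phi_n^{X,D} satisfiable
                return decode(solver.get_model(), n)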

4.2 Refinement of Prompt and DFA

We use the prompt $f$ to generate the DFA $\mathcal{D}=\langle H,h_{I},\Sigma,\delta,F\rangle$ for a specific domain. A transition exists for a label $l_{i}$ at state $h_{i}$ if $\delta(h_{i},l_{i})$ is defined; hence, if the LLM-generated DFA produced by the prompt $f$ is incorrect, it cannot have the correct transitions and trajectories, which leads to a counterexample $\lambda^{\prime}$.

Therefore, we use the counterexample $\lambda^{\prime}=l_{1}l_{2}\ldots l_{k}$ to update the prompt in order to generate an updated DFA.

Hence, the updated DFA has a chance of becoming compatible with the counterexample. This process continues: each time the algorithm encounters a counterexample $\lambda^{\prime}$, it uses it to update the prompt $f$ again.
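A minimal sketch of this refinement step is shown below. The exact wording appended to the prompt is an illustrative assumption; it corresponds to the UpdatePrompt subroutine in Algorithm 2.

def update_prompt(prompt, counterexample):
    """Append the counterexample label sequence lambda' to the original prompt so
    the LLM can repair its DFA (Section 4.2)."""
    trace = " -> ".join("{" + ", ".join(sorted(l)) + "}" for l in counterexample)
    return (
        f"{prompt}\n\n"
        "The DFA you produced is inconsistent with the following observed label "
        f"sequence, which received a positive reward: {trace}.\n"
        "Revise the states and transitions so that this sequence is accepted, and "
        "output the corrected DFA in the same format."
    )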

5 Convergence to Optimal Policy

The learned reward machine will ultimately be equivalent to the ground truth reward machine under the condition that every label sequence is admissible by the underlying MDP, i.e., physically possible. If this assumption does not hold, it is still ensured that the learned reward machine and the ground truth reward machine agree on every admissible label sequence.

Because we use an LLM to generate the DFA $\mathcal{D}$, and LLMs are known to produce outputs that are not factual, we face an issue: we cannot make any assumption on the quality of the LLM's output. For the purpose of proving the guarantees of LARL-RM, we therefore treat the LLM as adversarial, i.e., we consider the worst case.

Lemma 1.

Let $\mathcal{M}$ be a labeled MDP, $\mathcal{A}$ the ground truth reward machine encoding the rewards of $\mathcal{M}$, and $\mathscr{D}^{\star}=\{\mathcal{D}_{1},\ldots,\mathcal{D}_{m}\}$ the set of all LLM-generated DFAs that are added to $\mathscr{D}$ during the run of LARL-RM. Additionally, let $n_{\text{max}}=\max_{\mathcal{D}\in\mathscr{D}^{\star}}\{|\mathcal{D}|\}$ and $m=\max\left\{2\lvert\mathcal{M}\rvert\cdot\left(\lvert\mathcal{A}\rvert+1\right)\cdot n_{\text{max}},\ \lvert\mathcal{M}\rvert\cdot\left(\lvert\mathcal{A}\rvert+1\right)^{2}\right\}$. Then, LARL-RM with $Episode_{length}\geq m$ almost surely learns a reward machine that is equivalent to $\mathcal{A}$.
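As an illustration of the bound (with hypothetical sizes, not taken from the case studies): for an MDP with $\lvert\mathcal{M}\rvert=100$ states, a ground truth reward machine with $\lvert\mathcal{A}\rvert=4$ states, and a largest LLM-generated DFA of size $n_{\text{max}}=3$, the bound evaluates to

$m=\max\{2\cdot 100\cdot(4+1)\cdot 3,\ 100\cdot(4+1)^{2}\}=\max\{3000,\ 2500\}=3000,$

so any $Episode_{length}\geq 3000$ satisfies the condition of Lemma 1.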

Lemma 1 provides an upper bound on the episode length that needs to be explored. Additionally, the correctness of the LARL-RM algorithm follows from Lemma 1 and the correctness of the QRM algorithm Icarte et al. (2018). Using the QRM guarantee, we can now show convergence to the optimal policy.

Theorem 1.

Let $\mathcal{M}$, $\mathcal{A}$, $\mathscr{D}^{\star}$, and $m$ be as in Lemma 1. Then, LARL-RM converges to an optimal policy almost surely if $Episode_{\text{length}}\geq m$.

Theorem 1 guarantees the convergence of LARL-RM to an optimal policy if sufficient episode length is given for exploration, with the required bound given in Lemma 1.

6 Case Studies

We implement LARL-RM using the GPT series of LLMs, specifically GPT-3.5-Turbo. We set up a chat with the model using the APIs provided by OpenAI. For our experiments, we apply LARL-RM to an autonomous car example similar to the one found in Xu et al. (2020a).

In our example, the autonomous car must navigate to a destination while avoiding pedestrians and obeying traffic laws. It is worth mentioning that our algorithm can run even without any LLM-generated DFA, in which case it learns the ground truth reward machine on its own. We consider two case studies based on our motivating example to demonstrate the ability of LARL-RM to expedite the RL. In both cases, we also demonstrate the effect of an incompatible LLM-generated DFA (one not compatible with the ground truth reward machine) on the convergence of the algorithm to the optimal policy.

6.1 Case Study 1

The agent in our traffic example has the set of actions $A=\{\textit{up},\textit{down},\textit{right},\textit{left},\textit{stay}\}$. The layout of the environment is shown in Figure 1.

We use the LLM-generated DFA $\mathcal{D}$ to guide the RL process. If the advice from the LLM is not compatible with the ground truth reward machine, it is ignored by the algorithm; the prompt is then updated and fed back to the LLM in order to obtain a new DFA that has a chance of becoming compatible with the counterexample. We show the ground truth reward machine for Case Study 1 in Figure 9.

Refer to caption
Figure 9: Ground truth reward machine for Case Study 1. The agent must first navigate to intersection $J$ and then to $b$.

The LLM-generated DFA produced by the prompt $f$ for this case study is compatible with the ground truth reward machine (Figure 9) and helps the algorithm converge to the optimal policy faster. The DFA generated using GPT is shown in Figure 10.

Refer to caption
Figure 10: LLM-generated DFA $\mathcal{D}$ with transition label $\{J\wedge g\wedge\neg c\wedge\neg p\}$, generated by the LLM for Case Study 1.

We use the DFA in Figure 10 to guide the learning process. The reward obtained by the LARL-RM algorithm shows that it reaches the optimal policy faster when the LLM-generated DFA is available than when it is not.

Refer to caption
Figure 11: LARL-RM uses the LLM-generated DFA 𝒟\mathcal{D} to obtain a reward machine that is aligned with the ground truth reward machine in order to converge to an optimal policy faster.

We average the results of 5 independent runs using a rolling mean with a window size of 10. Figure 11 demonstrates that the algorithm converges to the optimal policy using the LLM-generated DFA. In this example, the prompt $f$ once produces a DFA compatible with the ground truth reward machine and once an incompatible one; even when it is not compatible, LARL-RM can use the counterexample to update the prompt and generate a new DFA that is more likely to be compatible with the counterexample.

6.2 Case Study 2

In Case Study 2, we show that if the suggested DFA $\mathcal{D}$ is not compatible with the ground truth reward machine, then LARL-RM uses the counterexample to adjust the prompt $f$ and obtain a new DFA that is compatible with the counterexample. In this case study, we update the MDP environment to allow multiple reward machines, as well as DFAs, to be considered equivalent. This way, the output of the LLM may not necessarily match the ground truth reward machine; however, LARL-RM can still use this DFA as long as it is not incompatible with the ground truth reward machine but equivalent to it.

In this environment, we extend the autonomous car example and provide it with more options to reach the destination. The ground truth reward machine does not prescribe a specific route but rather emphasizes the target destination; the LLM-generated DFA, however, might specify one route over another, which is not incompatible with the ground truth reward machine but rather equivalent to it. We show the ground truth reward machine for Case Study 2 in Figure 12.

Refer to caption
Figure 12: Ground truth reward machine for Case Study 2. The agent must reach the destination regardless of the route ($\mathcal{B}=b\wedge g\wedge\neg p\wedge\neg c$).

The LLM-generated DFA $\mathcal{D}$ with transition labels $\{J\wedge g\wedge\neg c\wedge\neg p,\ b\wedge g\wedge\neg c\wedge\neg p\}$ for this case study specifies a route that leads to the target, and it is equivalent to the ground truth reward machine, since this solution DFA can be considered a subset of the larger DFA that is compatible with the ground truth reward machine. Figure 13 shows the LLM-generated DFA for Case Study 2.

Refer to caption
Figure 13: LLM-generated DFA $\mathcal{D}$ containing $\{b\wedge g\wedge\neg c\wedge\neg p\}$, generated by the LLM for Case Study 2.

We run LARL-RM for Case Study 2 using a prompt $f$ that generates the DFA $\mathcal{D}$ with $\{J,\ b\wedge g\wedge\neg c\wedge\neg p\}$, which is incompatible with the ground truth reward machine; LARL-RM takes this counterexample and generates an updated DFA $\mathcal{D}$ with $\{b\wedge g\wedge\neg c\wedge\neg p\}$, which is compatible with the ground truth reward machine. Figure 14 demonstrates the reward of LARL-RM using the DFA $\mathcal{D}$.

Refer to caption
Figure 14: LARL-RM updates the prompt ff in order to obtain a DFA that is compatible with the ground truth reward machine.

Figure 14 shows the reward obtained by LARL-RM for Case Study 2 using the prompt $f$ that generates the DFA $\mathcal{D}$. The reward is averaged over 5 independent runs at every 10th step. For both compatible and incompatible LLM-generated DFAs, LARL-RM converges to the optimal policy faster than when there is no LLM-generated DFA.

7 Conclusion

We proposed a novel algorithm, LARL-RM, that uses a prompt to obtain an LLM-generated DFA to expedite reinforcement learning. LARL-RM uses counterexamples to automatically generate a new prompt and consequently, a new DFA that is compatible with the counterexamples to close the loop in RL. We showed that LARL-RM is guaranteed to converge to an optimal policy using the LLM-generated DFA. We showed that RL can be expedited using the LLM-generated DFA and in case the output of the LLM is not compatible with the ground truth reward machine, LARL-RM is capable of adjusting the prompt to obtain a more accurate DFA. In future work, we plan to extend the proposed framework to RL for multi-agent systems.

Acknowledgments

This research is partially supported by the National Science Foundation under grant NSF CNS 2304863 and the Office of Naval Research under grant ONR N00014-23-1-2505.

References

  • Brown et al. [2020] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  • Ding et al. [2023] Ning Ding, Yujia Qin, Guang Yang, Fuchao Wei, Zonghan Yang, Yusheng Su, Shengding Hu, Yulin Chen, Chi-Min Chan, Weize Chen, et al. Parameter-efficient fine-tuning of large-scale pre-trained language models. Nature Machine Intelligence, 5(3):220–235, 2023.
  • Hendrycks et al. [2020] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
  • Icarte et al. [2018] Rodrigo Toro Icarte, Toryn Klassen, Richard Valenzano, and Sheila McIlraith. Using reward machines for high-level task specification and decomposition in reinforcement learning. In International Conference on Machine Learning, pages 2107–2116. PMLR, 2018.
  • Jang et al. [2023] Joel Jang, Seonghyeon Ye, and Minjoon Seo. Can large language models truly understand prompts? a case study with negated prompts. In Transfer Learning for Natural Language Processing Workshop, pages 52–62. PMLR, 2023.
  • Kojima et al. [2022] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213, 2022.
  • Liu et al. [2023] Yiheng Liu, Tianle Han, Siyuan Ma, Jiayue Zhang, Yuanyuan Yang, Jiaming Tian, Hao He, Antong Li, Mengshen He, Zhengliang Liu, et al. Summary of chatgpt-related research and perspective towards the future of large language models. Meta-Radiology, page 100017, 2023.
  • Neider et al. [2021] Daniel Neider, Jean-Raphael Gaglione, Ivan Gavran, Ufuk Topcu, Bo Wu, and Zhe Xu. Advice-guided reinforcement learning in a non-markovian environment. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 9073–9080, 2021.
  • Neider [2014] Daniel Neider. Applications of automata learning in verification and synthesis. PhD thesis, RWTH Aachen University, 2014.
  • Rafailov et al. [2023] Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290, 2023.
  • Seff et al. [2023] Ari Seff, Brian Cera, Dian Chen, Mason Ng, Aurick Zhou, Nigamaa Nayakanti, Khaled S Refaat, Rami Al-Rfou, and Benjamin Sapp. Motionlm: Multi-agent motion forecasting as language modeling. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8579–8590, 2023.
  • Shallit [2008] Jeffrey O. Shallit. A Second Course in Formal Languages and Automata Theory. Cambridge University Press, 2008.
  • Vemprala et al. [2023] Sai Vemprala, Rogerio Bonatti, Arthur Bucker, and Ashish Kapoor. Chatgpt for robotics: Design principles and model abilities. Microsoft Auton. Syst. Robot. Res, 2:20, 2023.
  • Wei et al. [2022] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
  • West et al. [2021] Peter West, Chandra Bhagavatula, Jack Hessel, Jena D Hwang, Liwei Jiang, Ronan Le Bras, Ximing Lu, Sean Welleck, and Yejin Choi. Symbolic knowledge distillation: from general language models to commonsense models. arXiv preprint arXiv:2110.07178, 2021.
  • Xu et al. [2020a] Zhe Xu, Ivan Gavran, Yousef Ahmad, Rupak Majumdar, Daniel Neider, Ufuk Topcu, and Bo Wu. Joint inference of reward machines and policies for reinforcement learning. In Proceedings of the International Conference on Automated Planning and Scheduling, volume 30, pages 590–598, 2020.
  • Xu et al. [2020b] Zhe Xu, Ivan Gavran, Yousef Ahmad, Rupak Majumdar, Daniel Neider, Ufuk Topcu, and Bo Wu. Joint inference of reward machines and policies for reinforcement learning. In ICAPS, pages 590–598. AAAI Press, 2020.
  • Yang et al. [2022] Yunhao Yang, Jean-Raphaël Gaglione, Cyrus Neary, and Ufuk Topcu. Automaton-based representations of task knowledge from generative language models. arXiv preprint arXiv:2212.01944, 2022.
  • Yang et al. [2023] Yunhao Yang, Neel P Bhatt, Tyler Ingebrand, William Ward, Steven Carr, Zhangyang Wang, and Ufuk Topcu. Fine-tuning language models using formal methods feedback. arXiv preprint arXiv:2310.18239, 2023.

Supplementary Materials

Appendix A Comparison of closed-loop and Open-loop LARL-RM

Our proposed algorithm can be modified to operate in an open-loop or closed-loop manner with respect to the LLM-generated DFA: if a counterexample is found, LARL-RM can either update the LLM-generated DFA $\mathcal{D}$ or keep the original one. LARL-RM thus has two options: it can continue learning the reward machine without updating the LLM-generated DFA, referred to as open-loop, or it can call the LLM again and generate a new DFA based on the counterexample, referred to as closed-loop. We show our proposed algorithm's convergence to the optimal policy under both options. We illustrate the open-loop configuration in Figure 15.

Refer to caption
Figure 15: LARL-RM open-loop configuration, counterexample of LLM-generated DFAs are not used to update the prompt (dashed arrow).

A.1 Case Study 1

We investigate the open-loop and closed-loop methods in Case Study 1. First, we compare the effect of an incompatible LLM-generated DFA with that of a compatible one under the open-loop configuration. Figure 16 illustrates the convergence to the optimal policy for the open-loop configuration.

Refer to caption
Figure 16: Averaged rewards of the LARL-RM open-loop and closed-loop configurations for Case Study 1 when the LLM-generated DFA is compatible with the ground truth reward machine. Rewards are for 5 independent runs, averaged at every 20 steps.

Figure 16 demonstrates that, in the open-loop configuration, if the LLM-generated DFA is compatible, LARL-RM converges to the optimal policy faster than when there is no LLM-generated DFA. However, if the LLM-generated DFA is incompatible, it might take longer to converge. We now consider the closed-loop configuration, in which an incompatible DFA is used to update the prompt $f$. Figure 17 demonstrates the convergence of LARL-RM to the optimal policy under the closed-loop configuration.

Refer to caption
Figure 17: Averaged rewards of the LARL-RM open-loop and closed-loop configurations for Case Study 1 when the LLM-generated DFA is incompatible with the ground truth reward machine. Rewards are for 5 independent runs, averaged at every 20 steps.

Figure 17 demonstrates that the closed-loop configuration is capable of updating the prompt $f$ using the counterexample, so that LARL-RM has a better chance of converging faster than when there is no LLM-generated DFA.

A.2 Case Study 2

We further investigate the effect of the open-loop and closed-loop configurations in Case Study 2. The same open-loop configuration as illustrated in Figure 15 is applied here. As shown in Figure 18, the open-loop configuration converges to the optimal policy faster than the case with no LLM-generated DFA when the LLM-generated DFA is compatible with the ground truth reward machine.

Refer to caption
Figure 18: Rewards of the LARL-RM open-loop and closed-loop configurations for Case Study 2 when the LLM-generated DFA is compatible with the ground truth reward machine. Rewards are for 5 independent runs, averaged at every 20 steps.

If we apply the closed-loop configuration to Case Study 2, then, as expected, it converges to the optimal policy faster than when there is no LLM-generated DFA, even if the initial LLM-generated DFA is incompatible with the ground truth reward machine. Figure 19 demonstrates the closed-loop configuration.

Refer to caption
Figure 19: Rewards of the LARL-RM open-loop and closed-loop configurations for Case Study 2 when the LLM-generated DFA is incompatible with the ground truth reward machine. Rewards are for 5 independent runs, averaged at every 20 steps.

As illustrated in Figure 19, if the closed-loop configuration is applied then the LARL-RM has a better chance to converge to the optimal policy faster than when there is no LLM-generated DFA.

Appendix B Theoretical Guarantee of the LARL-RM

Here we demonstrate the convergence of the proposed algorithm to the optimal policy.

Proof of Lemma 1.

We first show that the set $\mathscr{D}$ remains stationary after a finite amount of time (i.e., that no new LLM-generated DFA is added or removed), assuming that all trajectories are visited infinitely often. To this end, we observe the following: if an LLM-generated DFA added to $\mathscr{D}$ is compatible, then it will remain in $\mathscr{D}$ indefinitely, because there is no counterexample contradicting it, meaning that the check in Line 12 is never triggered. On the other hand, if an LLM-generated DFA is not compatible, then the algorithm eventually detects a trajectory witnessing this fact. Once this happens, it removes the corresponding LLM-generated DFA from the set $\mathscr{D}$.

Observe now that the algorithm decrements $\mathcal{J}$ by one every time it adds an LLM-generated DFA to $\mathscr{D}$, and it only does so as long as $\mathcal{J}>0$. Thus, the total number of LLM-generated DFAs that are generated during the run of the algorithm is bounded by $\mathcal{J}$. Consequently, the set $\mathscr{D}$ no longer changes after a finite period, because the algorithm has either identified the true reward machine or all incompatible LLM-generated DFAs have been removed and the budget $\mathcal{J}$ is exhausted.

Once $\mathscr{D}$ becomes stationary, an argument analogous to that in the proof of Lemma 1 in Neider et al.'s work Neider et al. [2021] shows that the algorithm will eventually learn the true reward machine. Intuitively, as long as the current hypothesis is not equivalent to the true reward machine, there exists a "short" trajectory witnessing this fact. The episode length is chosen carefully such that such a witness will eventually be encountered and added to $X$. In the worst case, all trajectories of length $Episode_{length}$ will eventually be added to $X$, at which point the learning algorithm is guaranteed to learn the true reward machine. ∎

Since our proposed algorithm learns the minimal reward machine that is consistent with $X$ and compatible with each $D\in\mathscr{D}$, we use an automata learning method based on SAT solving, originally developed to verify parametric systems Neider [2014]. The underlying idea is to reduce the learning task to a series of satisfiability checks of propositional logic formulas. In other words, we build and solve a sequence of propositional formulas $\Phi^{X,D}_{n}$ for $n>0$ such that the following properties hold Neider et al. [2021]:

  • the formula $\Phi_{n}^{X,D}$ is satisfiable if and only if there exists a reward machine with $n$ states that is consistent with $X$ and compatible with $D\in\mathscr{D}$; and

  • if $\Phi_{n}^{X,D}$ is satisfiable, then a satisfying assignment contains enough information to construct a reward machine with $n$ states that is both consistent and compatible.

We can now build and solve $\Phi^{X,D}_{n}$ starting from $n=1$ and increasing $n$ until the formula becomes satisfiable. Using this method, our proposed algorithm constructs a minimal reward machine that is consistent with $X$ and compatible with the LLM-generated DFA $D$.

To ease notation in the remainder, we use $\mathcal{A}\colon\mathsf{p}\xrightarrow{w}\mathsf{q}$ to abbreviate a run of the reward machine $\mathcal{A}$ on the input sequence $w$ that starts in $\mathsf{p}$ and leads to $\mathsf{q}$. By definition, we have $\mathcal{A}\colon\mathsf{p}\xrightarrow{\varepsilon}\mathsf{p}$ for the empty sequence $\varepsilon$ and every state $\mathsf{p}$. We later also use this notation for DFAs.

We encode the reward machine $\mathcal{A}=\langle V,v_{I},2^{\mathcal{P}},R_{X},\delta,\sigma\rangle$ in propositional logic. With the set $V$ of states, $\lvert V\rvert=n$, and the initial state $v_{I}$ fixed, the reward machine $\mathcal{A}$ is uniquely determined by the transition function $\delta$ and the output function $\sigma$. To encode these two functions, we introduce two kinds of propositional variables, namely $d_{\mathsf{p},l,\mathsf{q}}$ for $\mathsf{p},\mathsf{q}\in V$ and $l\in 2^{\mathcal{P}}$, and $o_{\mathsf{p},l,r}$ for $\mathsf{p}\in V$, $l\in 2^{\mathcal{P}}$, and $r\in R_{X}$. We apply the constraints in (1) and (2) to ensure that the variables $d_{\mathsf{p},l,\mathsf{q}}$ and $o_{\mathsf{p},l,r}$ encode deterministic functions.

$\bigwedge_{\mathsf{p}\in V}\bigwedge_{l\in 2^{\mathcal{P}}}\Bigl[\bigl[\bigvee_{\mathsf{q}\in V}d_{\mathsf{p},l,\mathsf{q}}\bigr]\land\bigl[\bigwedge_{\mathsf{q}\neq\mathsf{q^{\prime}}\in V}\lnot d_{\mathsf{p},l,\mathsf{q}}\lor\lnot d_{\mathsf{p},l,\mathsf{q^{\prime}}}\bigr]\Bigr]$ (1)
$\bigwedge_{\mathsf{p}\in V}\bigwedge_{l\in 2^{\mathcal{P}}}\Bigl[\bigl[\bigvee_{r\in R_{X}}o_{\mathsf{p},l,r}\bigr]\land\bigl[\bigwedge_{r\neq r^{\prime}\in R_{X}}\lnot o_{\mathsf{p},l,r}\lor\lnot o_{\mathsf{p},l,r^{\prime}}\bigr]\Bigr]$ (2)

We denote the conjunction of constraints (1) and (2) by $\Phi^{\mathit{RM}}_{n}$. To enforce consistency with the sample in propositional logic, we introduce auxiliary variables $x_{\lambda,\mathsf{q}}$ for every $(\lambda,\rho)\in\mathit{Pref}(X)$ and $\mathsf{q}\in V$, where, for a trace $\tau=(l_{1}\ldots l_{k},r_{1}\ldots r_{k})\in(2^{\mathcal{P}})^{\ast}\times\mathbb{R}^{\ast}$, the set of prefixes of $\tau$ is defined by $\mathit{Pref}(\tau)=\{(l_{1}\ldots l_{i},r_{1}\ldots r_{i})\in(2^{\mathcal{P}})^{\ast}\times\mathbb{R}^{\ast}\mid 0\leq i\leq k\}$ (note that $(\varepsilon,\varepsilon)\in\mathit{Pref}(\tau)$ always holds). The variable $x_{\lambda,\mathsf{q}}$ is set to true if and only if the prospective reward machine reaches state $\mathsf{q}$ after reading $\lambda$. To obtain the desired meaning, we add the following constraints:

$x_{\varepsilon,\mathsf{q_{I}}}\land\bigwedge_{\mathsf{p}\in V\setminus\{\mathsf{q_{I}}\}}\lnot x_{\varepsilon,\mathsf{p}}$ (3)
$\bigwedge_{(\lambda l,\rho r)\in\mathit{Pref}(X)}\bigwedge_{\mathsf{p},\mathsf{q}\in V}(x_{\lambda,\mathsf{p}}\land d_{\mathsf{p},l,\mathsf{q}})\rightarrow x_{\lambda l,\mathsf{q}}$ (4)
$\bigwedge_{(\lambda l,\rho r)\in\mathit{Pref}(X)}\bigwedge_{\mathsf{p}\in V}x_{\lambda,\mathsf{p}}\rightarrow o_{\mathsf{p},l,r}$ (5)

We denote the conjunction of constraints (3)-(5) by $\Phi^{X}_{n}$, which we later use to show the consistency of the learned reward machine with $X$. In order to ensure that the prospective reward machine $\mathcal{A}_{\mathcal{I}}$ and the LLM-generated DFA $D$ are synchronized, we add auxiliary variables $y^{D}_{v,v^{\prime}}$ for $v\in V$ and $v^{\prime}\in V_{D}$; the variable $y^{D}_{v,v^{\prime}}$ is set to true if there is a label sequence $\lambda$ with $\mathcal{A}_{\mathcal{I}}:v_{I}\xrightarrow{\lambda}v$ and $D:q_{I,D}\xrightarrow{\lambda}v^{\prime}$. To obtain this behavior, we add the following constraints:

$y_{\mathsf{q_{I}},\mathsf{q_{I,D}}}^{D}$ (6)
$\bigwedge_{\mathsf{p},\mathsf{q}\in V}\bigwedge_{l\in 2^{\mathcal{P}}}\bigwedge_{\delta_{D}(\mathsf{p^{\prime}},l)=\mathsf{q^{\prime}}}(y_{\mathsf{p},\mathsf{p^{\prime}}}^{D}\land d_{\mathsf{p},l,\mathsf{q}})\rightarrow y_{\mathsf{q},\mathsf{q^{\prime}}}^{D}$ (7)
$\bigwedge_{\mathsf{p}\in V}\bigwedge_{\substack{\delta_{D}(\mathsf{p^{\prime}},l)=\mathsf{q^{\prime}}\\ \mathsf{q^{\prime}}\notin F_{D}}}y_{\mathsf{p},\mathsf{p^{\prime}}}^{D}\rightarrow\lnot\bigvee_{\substack{r\in R_{X}\\ r>0}}o_{\mathsf{p},l,r}$ (8)

We denote the conjunction of constraints (6)-(8) by $\Phi^{D}_{n}$, which we later use to show the compatibility of the learned reward machine with the LLM-generated DFA $D$. Theorem 2 establishes the correctness of this encoding as used in Algorithm 2.

Theorem 2.

Let $X\subset(2^{\mathcal{P}})^{+}\times\mathbb{R}^{+}$ be a sample and $\mathscr{D}$ a finite set of LLM-generated DFAs that are compatible with $X$. Then, the following holds:

  1. If we have $\mathcal{I}\models\Phi_{n}^{X,D}$, then the reward machine $\mathcal{A}_{\mathcal{I}}$ is consistent with $X$ and compatible with the LLM-generated DFA $D\in\mathscr{D}$.

  2. If there exists a reward machine with $n$ states that is consistent with $X$ as well as compatible with each $D\in\mathscr{D}$, then $\Phi_{n}^{X,D}$ is satisfiable.

The proof of Theorem 2 follows in Appendix C after a discussion of the prerequisites.

Appendix C Learning Reward Machines

In this section, we prove the correctness of our SAT-based algorithm for learning reward machines. To this end, we show that the propositional formulas $\Phi_{n}^{X}$ and $\Phi_{n}^{D}$ have the desired meaning. We begin with the formula $\Phi_{n}^{X}$, which is designed to enforce that the learned reward machine is consistent with the sample $X$.

Lemma 2.

Let $\mathcal{I}\models\Phi_{n}^{\mathit{RM}}\land\Phi_{n}^{X}$ and let $\mathcal{A}_{\mathcal{I}}$ be the reward machine defined above. Then, $\mathcal{A}_{\mathcal{I}}$ is consistent with $X$ (i.e., $\mathcal{A}_{\mathcal{I}}(\lambda)=\rho$ for each $(\lambda,\rho)\in X$).

Proof.

Let $\mathcal{I}\models\Phi_{n}^{\mathit{RM}}\land\Phi_{n}^{X}$ and let $\mathcal{A}_{\mathcal{I}}$ be the reward machine constructed as above. To prove Lemma 2, we show the following, more general statement by induction over the length of prefixes $(\lambda,\rho)\in\mathit{Pref}(X)$: if $\mathcal{A}_{\mathcal{I}}\colon\mathsf{q_{I}}\xrightarrow{\lambda}\mathsf{q}$, then

  1. $\mathcal{I}(x_{\lambda,\mathsf{q}})=1$; and

  2. $\mathcal{A}_{\mathcal{I}}(\lambda)=\rho$.

Lemma 2 then follows immediately from Part 2, since $X\subseteq\mathit{Pref}(X)$.

Base case:

Let $(\varepsilon,\varepsilon)\in\mathit{Pref}(X)$. By the definition of runs, we know that $\mathcal{A}_{\mathcal{I}}\colon\mathsf{q}_{I}\xrightarrow{\varepsilon}\mathsf{q}_{I}$ is the only run of $\mathcal{A}_{\mathcal{I}}$ on the empty word. Similarly, Formula (3) guarantees that the initial state $\mathsf{q}_{I}$ is the unique state $\mathsf{q}\in V$ for which $\mathcal{I}(x_{\varepsilon,\mathsf{q}})=1$ holds. Both observations immediately prove Part 1. Moreover, $\mathcal{A}_{\mathcal{I}}(\varepsilon)=\varepsilon$ holds by the definition of the semantics of reward machines, which proves Part 2.

Induction step:

Let $(\lambda l,\rho r)\in\mathit{Pref}(X)$. Moreover, let $\mathcal{A}_{\mathcal{I}}\colon\mathsf{q}_{I}\xrightarrow{\lambda}\mathsf{p}\xrightarrow{l}\mathsf{q}$ be the unique run of $\mathcal{A}_{\mathcal{I}}$ on $\lambda l$. By applying the induction hypothesis, we obtain that both $\mathcal{I}(x_{\lambda,\mathsf{p}})=1$ and $\mathcal{A}_{\mathcal{I}}(\lambda)=\rho$ hold.

To prove Part 1, note that $\mathcal{A}_{\mathcal{I}}$ contains the transition $\delta(\mathsf{p},l)=\mathsf{q}$, since this transition was used in the last step of the run on $\lambda l$. By construction of $\mathcal{A}_{\mathcal{I}}$, this can only be the case if $\mathcal{I}(d_{\mathsf{p},l,\mathsf{q}})=1$. Then, Formula (4) implies that $\mathcal{I}(x_{\lambda l,\mathsf{q}})=1$ because $\mathcal{I}(x_{\lambda,\mathsf{p}})=1$ (which holds by the induction hypothesis). This proves Part 1.

To prove Part 2, we exploit Formula (5). More precisely, Formula (5) guarantees that if (xλ,p)=1\mathcal{I}(x_{\lambda,p})=1 and the next input is ll, then (op,l,r)=1\mathcal{I}(o_{p,l,r})=1. By construction of 𝒜\mathcal{A}_{\mathcal{I}}, this means that σ(𝗉,l)=r\sigma(\mathsf{p},l)=r. Hence, 𝒜\mathcal{A}_{\mathcal{I}} outputs rr in the last step of the run on λl\lambda l. Since 𝒜(λ)=ρ\mathcal{A}_{\mathcal{I}}(\lambda)=\rho (which holds by induction hypothesis), we obtain 𝒜(λl)=ρr\mathcal{A}_{\mathcal{I}}(\lambda l)=\rho r. This proves Part 2.

Thus, 𝒜\mathcal{A}_{\mathcal{I}} is consistent with XX. ∎
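
For illustration, consistency can be checked mechanically: the following minimal Python sketch (our own encoding, not the implementation used in the paper) represents a reward machine by a transition map and an output map and verifies that it reproduces the reward sequence of every trace in a sample. All identifiers are illustrative.

# Minimal sketch (not the paper's code): a reward machine is given by an initial
# state, a transition map delta[(state, label)] -> state, and an output map
# sigma[(state, label)] -> reward. Labels are frozensets of propositions.

def run_reward_machine(q_init, delta, sigma, labels):
    """Return the reward sequence produced while reading the label sequence."""
    q, rewards = q_init, []
    for l in labels:
        rewards.append(sigma[(q, l)])
        q = delta[(q, l)]
    return rewards

def is_consistent(q_init, delta, sigma, sample):
    """Check that A(lambda) = rho for every (lambda, rho) in the sample X."""
    return all(run_reward_machine(q_init, delta, sigma, lam) == list(rho)
               for lam, rho in sample)

# Toy example with one proposition 'a': reward 1 is paid the first time 'a' holds.
a, empty = frozenset({'a'}), frozenset()
delta = {('q0', a): 'q1', ('q0', empty): 'q0', ('q1', a): 'q1', ('q1', empty): 'q1'}
sigma = {('q0', a): 1, ('q0', empty): 0, ('q1', a): 0, ('q1', empty): 0}
X = [((empty, a), (0, 1)), ((a, a), (1, 0))]
assert is_consistent('q0', delta, sigma, X)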

Next, we show that the formula ΦnD\Phi_{n}^{D} ensures that the learned reward machine is compatible with the LLM-generated DFA DD.

Lemma 3.

Let Φn𝑅𝑀ΦnD\mathcal{I}\models\Phi_{n}^{\mathit{RM}}\land\Phi_{n}^{D} and 𝒜\mathcal{A}_{\mathcal{I}} the reward machine defined above. Then, 𝒜\mathcal{A}_{\mathcal{I}} is compatible with DD (i.e., 𝒜(12k)=r1r2rk\mathcal{A}_{\mathcal{I}}(\ell_{1}\ell_{2}\ldots\ell_{k})=r_{1}r_{2}\ldots r_{k} and rk>0r_{k}>0 implies 12kL(D)\ell_{1}\ell_{2}\ldots\ell_{k}\in L(D) for every nonempty label sequence 12k\ell_{1}\ell_{2}\ldots\ell_{k}).

Proof.

Let Φn𝑅𝑀ΦnD\mathcal{I}\models\Phi_{n}^{\mathit{RM}}\land\Phi_{n}^{D} and 𝒜\mathcal{A}_{\mathcal{I}} the reward machine defined above. We first show that 𝒜:𝗊I𝜆𝗊\mathcal{A}_{\mathcal{I}}\colon{\mathsf{q}_{I}}\xrightarrow{\lambda}\mathsf{q} and D:𝗊𝖨,𝖣𝜆𝗊D\colon\mathsf{q_{I,D}}\xrightarrow{\lambda}\mathsf{q^{\prime}} imply (y𝗊,𝗊D)=1\mathcal{I}(y_{\mathsf{q},\mathsf{q^{\prime}}}^{D})=1 for all label sequences λ2𝒫\lambda\in{2^{\mathcal{P}}}^{\ast}. The proof of this claim proceeds by induction over the length of label sequences, similar to Part 2 in the proof of Lemma 2.

Base case:

Let λ=ε\lambda=\varepsilon. By definition of runs, the only runs on the empty label sequence are 𝒜:𝗊I𝜀𝗊I\mathcal{A}_{\mathcal{I}}\colon{\mathsf{q}_{I}}\xrightarrow{\varepsilon}{\mathsf{q}_{I}} and D:𝗊𝖨,𝖣𝜀𝗊𝖨,𝖣D\colon\mathsf{q_{I,D}}\xrightarrow{\varepsilon}\mathsf{q_{I,D}}. Moreover, Formula (6) ensures that (y𝗊𝖨,𝗊𝖨,𝖣D)=1\mathcal{I}(y_{\mathsf{q_{I}},\mathsf{q_{I,D}}}^{D})=1, which proves the claim.

Induction step:

Let λ=λ′′l\lambda=\lambda^{\prime\prime}l. Moreover, let 𝒜:𝗊𝖨λ′′𝗉𝑙𝗊\mathcal{A}_{\mathcal{I}}\colon\mathsf{q_{I}}\xrightarrow{\lambda^{\prime\prime}}\mathsf{p}\xrightarrow{l}\mathsf{q} and D:𝗊𝖨,𝖣λ′′𝗉𝑙𝗊D\colon\mathsf{q_{I,D}}\xrightarrow{\lambda^{\prime\prime}}\mathsf{p^{\prime}}\xrightarrow{l}\mathsf{q^{\prime}} be the runs of 𝒜\mathcal{A}_{\mathcal{I}} and DD on λ=λ′′l\lambda=\lambda^{\prime\prime}l, respectively. By induction hypothesis, we then know that (y𝗉,𝗉D)=1\mathcal{I}(y_{\mathsf{p},\mathsf{p^{\prime}}}^{D})=1. Moreover, 𝒜\mathcal{A}_{\mathcal{I}} contains the transition δ(𝗉,l)=𝗊\delta(\mathsf{p},l)=\mathsf{q} because this transition was used in the last step of the run of 𝒜\mathcal{A}_{\mathcal{I}} on λ\lambda. By construction of 𝒜\mathcal{A}_{\mathcal{I}}, this can only be the case if (d𝗉,l,𝗊)=1\mathcal{I}(d_{\mathsf{p},l,\mathsf{q}})=1. In this situation, Formula 7 ensures (y𝗊,𝗊D)=1\mathcal{I}(y_{\mathsf{q},\mathsf{q^{\prime}}}^{D})=1 (since also δD(𝗉,l)=𝗊\delta_{D}(\mathsf{p^{\prime}},l)=\mathsf{q^{\prime}}), which proves the claim.

Let now λ=12k\lambda=\ell_{1}\ell_{2}\ldots\ell_{k} be a nonempty label sequence (i.e., k1k\geq 1). Moreover, let 𝒜:𝗊𝖨l1lk1𝗉lk𝗊\mathcal{A}_{\mathcal{I}}\colon\mathsf{q_{I}}\xrightarrow{l_{1}\ldots l_{k-1}}\mathsf{p}\xrightarrow{l_{k}}\mathsf{q} be the run of 𝒜\mathcal{A}_{\mathcal{I}} on λ\lambda and D:𝗊𝖨,𝖣l1lk1𝗉lk𝗊D\colon\mathsf{q_{I,D}}\xrightarrow{l_{1}\ldots l_{k-1}}\mathsf{p^{\prime}}\xrightarrow{l_{k}}\mathsf{q^{\prime}} the run of DD on λ\lambda. Our induction shows that (y𝗉,𝗉)=1\mathcal{I}(y_{\mathsf{p},\mathsf{p^{\prime}}})=1 holds in this case.

Towards a contradiction, assume that 𝒜(12k)=r1r2rk\mathcal{A}_{\mathcal{I}}(\ell_{1}\ell_{2}\ldots\ell_{k})=r_{1}r_{2}\ldots r_{k}, rk>0r_{k}>0, and 12kL(D)\ell_{1}\ell_{2}\ldots\ell_{k}\notin L(D). In particular, this means 𝗊𝖥D\mathsf{q^{\prime}}\notin\mathsf{F}_{D}. Since δD(𝗉,lk)=𝗊\delta_{D}(\mathsf{p^{\prime}},l_{k})=\mathsf{q^{\prime}} (which was used in the last step in the run of DD on 12k\ell_{1}\ell_{2}\ldots\ell_{k}) and (y𝗉,𝗉D)=1\mathcal{I}(y_{\mathsf{p},\mathsf{p^{\prime}}}^{D})=1 (due to the induction above), Formula (8) ensures that (o𝗉,l,r)=0\mathcal{I}(o_{\mathsf{p},l,r})=0 for all rRXr\in R_{X} with r>0r>0. However, Formula (2) ensures that there is exactly one rRXr\in R_{X} with (o𝗉,l,r)=1\mathcal{I}(o_{\mathsf{p},l,r})=1. Thus, there has to exist an rRXr\in R_{X} such that r0r\leq 0 and (o𝗉,l,r)=1\mathcal{I}(o_{\mathsf{p},l,r})=1. By construction of 𝒜\mathcal{A}_{\mathcal{I}}, this means that the last output rkr_{k} of 𝒜\mathcal{A}_{\mathcal{I}} on reading lkl_{k} must have been rk0r_{k}\leq 0. However, our assumption was rk>0r_{k}>0, which is a contradiction. Thus, 𝒜\mathcal{A}_{\mathcal{I}} is compatible with the LLM-generated DFA DD. ∎
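
The compatibility condition of Lemma 3 can be tested analogously by simulating the reward machine and the DFA D in lockstep. The sketch below (again our own illustration, with the same dictionary-based encoding as above) checks the condition on a given finite set of label sequences; an exhaustive check over all label sequences would instead use the product construction of Lemma 7.

def compatible_on(label_sequences, q0_rm, delta, sigma, q0_d, delta_d, accepting):
    """Whenever the last reward of a sequence is positive, D must accept it."""
    for labels in label_sequences:
        q, q_d, last_reward = q0_rm, q0_d, None
        for l in labels:
            last_reward = sigma[(q, l)]   # reward emitted on this transition
            q = delta[(q, l)]             # advance the reward machine
            q_d = delta_d[(q_d, l)]       # advance the DFA D in lockstep
        if last_reward is not None and last_reward > 0 and q_d not in accepting:
            return False                  # witness that D is not compatible
    return True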

We can now prove Theorem 2 (i.e., the correctness of our SAT-based learning algorithm for reward machines).

Proof of Theorem 2.

The proof of Part 1 follows immediately from Lemma 2 and Lemma 3.

To prove Part 2, let 𝒜=(V,𝗊I,2𝒫,M,δ,σ)\mathcal{A}=(V,{\mathsf{q}_{I}},{2^{\mathcal{P}}},M,\delta,\sigma) be a reward machine with nn states that is consistent with XX and compatible with each D𝒟D\in\mathscr{D}. From this reward machine, we can derive a valuation \mathcal{I} for the variables d𝗉,l,𝗊d_{\mathsf{p},l,\mathsf{q}} and o𝗉,l,ro_{\mathsf{p},l,r} in a straightforward way (e.g., setting (d𝗉,l,𝗊)=1\mathcal{I}(d_{\mathsf{p},l,\mathsf{q}})=1 if and only if δ(𝗉,l)=𝗊\delta(\mathsf{p},l)=\mathsf{q}). Moreover, we obtain a valuation for the variables xλ,𝗉x_{\lambda,\mathsf{p}} from the runs on (prefixes of) traces in the sample XX, and valuations for the variables y𝗉,𝗉Dy_{\mathsf{p},\mathsf{p^{\prime}}}^{D} from the synchronized runs of 𝒜\mathcal{A} and DD for each D𝒟D\in\mathscr{D}. Then, \mathcal{I} indeed satisfies ΦnX,D\Phi_{n}^{X,D}. ∎
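
The two directions of this proof correspond to decoding a reward machine from a satisfying assignment and, conversely, reading off an assignment from a given reward machine. A minimal sketch of both directions, under the assumption (ours, for illustration) that the variable values are stored in 0/1-valued dictionaries keyed by their indices:

def decode_reward_machine(model_d, model_o):
    """model_d[(p, l, q)] and model_o[(p, l, r)] are the 0/1 values of the
    variables d_{p,l,q} and o_{p,l,r} in a satisfying assignment."""
    delta = {(p, l): q for (p, l, q), v in model_d.items() if v == 1}
    sigma = {(p, l): r for (p, l, r), v in model_o.items() if v == 1}
    return delta, sigma

def encode_reward_machine(states, labels, rewards, delta, sigma):
    """Valuation with d_{p,l,q} = 1 iff delta(p, l) = q and o_{p,l,r} = 1 iff
    sigma(p, l) = r; delta and sigma are assumed to be total."""
    model_d = {(p, l, q): int(delta[(p, l)] == q)
               for p in states for l in labels for q in states}
    model_o = {(p, l, r): int(sigma[(p, l)] == r)
               for p in states for l in labels for r in rewards}
    return model_d, model_o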

Appendix D Convergence to an Optimal Policy

In this section, we prove that LARM-RM almost surely converges to an optimal policy in the limit. We begin by defining attainable trajectories—trajectories that can possibly appear in the exploration of an agent.

Definition 5.

Let =(S,sI,A,p,R,γ,𝒫,L)\mathcal{M}=(S,s_{I},A,p,R,\gamma,\mathcal{P},L) be a labeled MDP and mm\in\mathbb{N} a natural number. A trajectory ζ=s0a1s1aksk(S×A)×S\zeta=s_{0}a_{1}s_{1}\ldots a_{k}s_{k}\in(S\times A)^{\ast}\times S is said to be mm-attainable if kmk\leq m and p(si1,ai,si)>0p(s_{i-1},a_{i},s_{i})>0 for each i{1,,k}i\in\{1,\ldots,k\}. Moreover, a trajectory ζ\zeta is called attainable if there exists an mm\in\mathbb{N} such that ζ\zeta is mm-attainable. Analogously, we call a label sequence λ=l1lk\lambda=l_{1}\ldots l_{k} (mm-)attainable if there exists an (mm-)attainable trajectory s0a1s1sk1aksks_{0}a_{1}s_{1}\ldots s_{k-1}a_{k}s_{k} such that li=L(si1,ai,si)l_{i}=L(s_{i-1},a_{i},s_{i}) for each i{1,,k}i\in\{1,\ldots,k\}.
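
For illustration, the following sketch checks m-attainability of a trajectory. The dictionary p, keyed by (state, action, next state), is an assumed representation of the transition probabilities and not a data structure from the paper.

def is_m_attainable(trajectory, p, m):
    """trajectory = [s0, a1, s1, ..., ak, sk]; it is m-attainable if k <= m and
    every step has positive probability."""
    states, actions = trajectory[0::2], trajectory[1::2]
    if len(actions) > m:
        return False
    return all(p.get((states[i], actions[i], states[i + 1]), 0.0) > 0
               for i in range(len(actions)))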

An induction shows that LARM-RM almost surely explores every attainable trajectory in the limit (i.e., with probability 1 as the number of episodes goes to infinity). We refer the reader to the extended version of Xu et al. [2020b], available on arXiv at https://arxiv.org/abs/1909.05912, for a detailed proof.

Lemma 4.

Given mm\in\mathbb{N}, LARM-RM with 𝑒𝑝𝑖𝑠𝑜𝑑𝑒𝑙𝑒𝑛𝑔𝑡ℎm\mathit{episode_{length}}\geq m almost surely explores every mm-attainable trajectory in the limit.

As an immediate consequence of Lemma 4, we obtain that LARM-RM almost surely explores every (mm-)attainable label sequence in the limit as well.

Corollary 1.

Given mm\in\mathbb{N}, LARM-RM with 𝑒𝑝𝑖𝑠𝑜𝑑𝑒𝑙𝑒𝑛𝑔𝑡ℎm\mathit{episode_{length}}\geq m almost surely explores every mm-attainable label sequence in the limit.

In this paper, we assume that all possible label sequences are attainable. We relax this restriction later, following Neider et al. [2021].

To prove Lemma 1, we show a series of intermediate results. We begin by reminding the reader of a well-known fact from automata theory, which can be found in most textbooks (e.g., Shallit [2008]).

Theorem 3.

Let 𝔄1\mathfrak{A}_{1} and 𝔄2\mathfrak{A}_{2} be two DFAs with L(𝔄1)L(𝔄2)L(\mathfrak{A}_{1})\neq L(\mathfrak{A}_{2}). Then, there exists a word ww of length at most |𝔄1|+|𝔄2|1|\mathfrak{A}_{1}|+|\mathfrak{A}_{2}|-1 such that wL(𝔄1)w\in L(\mathfrak{A}_{1}) if and only if wL(𝔄2)w\notin L(\mathfrak{A}_{2}).

We now observe that every reward machine 𝒜\mathcal{A} can be translated into a DFA 𝔄𝒜\mathfrak{A}_{\mathcal{A}} that is “equivalent” to the reward machine Xu et al. [2020b]. This DFA operates over the combined alphabet 2𝒫×M2^{\mathcal{P}}\times M and accepts a sequence (1,r1)(k,rk)(\ell_{1},r_{1})\ldots(\ell_{k},r_{k}) if and only if 𝒜\mathcal{A} outputs the reward sequence r1r2rkr_{1}r_{2}\ldots r_{k} on reading the label sequence 12k\ell_{1}\ell_{2}\ldots\ell_{k}.

Lemma 5 (Xu et al. [2020b]).

Given a reward machine 𝒜=(V,𝗊I,2𝒫,M,δ,σ)\mathcal{A}=(V,{\mathsf{q}_{I}},{2^{\mathcal{P}}},M,\delta,\sigma), one can construct a DFA 𝔄𝒜\mathfrak{A}_{\mathcal{A}} with |𝒜|+1|\mathcal{A}|+1 states such that L(𝔄𝒜)={(l1,r1)(lk,rk)(2𝒫×)𝒜(l1lk)=r1rk}L(\mathfrak{A}_{\mathcal{A}})=\{(l_{1},r_{1})\ldots(l_{k},r_{k})\in(2^{\mathcal{P}}\times\mathbb{R})\mid\mathcal{A}(l_{1}\ldots l_{k})=r_{1}\ldots r_{k}\}.

Proof.

Let 𝒜=(𝖰𝒜,𝗊𝖨,𝖠,2𝒫,,δ𝒜,σ𝒜)\mathcal{A}=(\mathsf{Q}_{\mathcal{A}},\mathsf{q_{I,A}},2^{\mathcal{P}},\mathbb{R},\delta_{\mathcal{A}},\sigma_{\mathcal{A}}) be a reward machine. Then, we define a DFA 𝔄𝒜=(𝖰,𝗊𝖨,2𝒫×,δ,F)\mathfrak{A}_{\mathcal{A}}=(\mathsf{Q},\mathsf{q_{I}},2^{\mathcal{P}}\times\mathbb{R},\delta,F) by

  • 𝖰=𝖰𝖠{}\mathsf{Q}=\mathsf{Q_{A}}\cup\{\bot\}, where 𝖰𝖠\bot\notin\mathsf{Q_{A}};

  • 𝗊𝖨=𝗊𝖨,𝖠\mathsf{q_{I}}=\mathsf{q_{I,A}};

  • δ(𝗉,(,r))={𝗊if δ𝒜(𝗉,)=𝗊 and σ𝒜(𝗉,)=r;otherwise\delta\bigl{(}\mathsf{p},(\ell,r)\bigr{)}=\begin{cases}\mathsf{q}&\text{if $\delta_{\mathcal{A}}(\mathsf{p},\ell)=\mathsf{q}$ and $\sigma_{\mathcal{A}}(\mathsf{p},\ell)=r$;}\\ \bot&\text{otherwise}\end{cases}

  • F=𝖰𝖠F=\mathsf{Q_{A}}.

In this definition, \bot is a new sink state to which 𝔄𝒜\mathfrak{A}_{\mathcal{A}} moves if its input does not correspond to a valid input-output pair produced by 𝒜\mathcal{A}. A straightforward induction over the length of inputs to 𝔄𝒜\mathfrak{A}_{\mathcal{A}} shows that it indeed accepts the desired language. In total, 𝔄𝒜\mathfrak{A}_{\mathcal{A}} has |𝒜|+1|\mathcal{A}|+1 states. ∎
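
For illustration, the construction of Lemma 5 can be sketched in a few lines of Python (our own encoding, not code from the paper). The string 'BOT' plays the role of the sink state ⊥ and is assumed not to clash with any state of the reward machine; delta and sigma are assumed to be total maps.

def reward_machine_to_dfa(states, q_init, labels, rewards, delta, sigma):
    """DFA over (label, reward) pairs that accepts exactly the input/output
    sequences of the reward machine."""
    BOT = 'BOT'                                   # fresh sink state
    dfa_states = set(states) | {BOT}
    dfa_delta = {}
    for q in dfa_states:
        for l in labels:
            for r in rewards:
                if q != BOT and sigma[(q, l)] == r:
                    dfa_delta[(q, (l, r))] = delta[(q, l)]
                else:
                    dfa_delta[(q, (l, r))] = BOT  # mismatching pair: go to sink
    return dfa_states, q_init, dfa_delta, set(states)   # all original states accept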

Next, we show that if two reward machines are semantically different, then we can bound the length of a trace witnessing this fact.

Lemma 6.

Let 𝒜1\mathcal{A}_{1} and 𝒜2\mathcal{A}_{2} be two reward machines. If 𝒜1𝒜2\mathcal{A}_{1}\neq\mathcal{A}_{2}, then there exists a label sequence λ\lambda of length at most |𝒜1|+|𝒜2|+1|\mathcal{A}_{1}|+|\mathcal{A}_{2}|+1 such that 𝒜1(λ)𝒜2(λ)\mathcal{A}_{1}(\lambda)\neq\mathcal{A}_{2}(\lambda).

Proof of Lemma 6.

Consider the DFAs 𝔄𝒜1\mathfrak{A}_{\mathcal{A}_{1}} and 𝔄𝒜2\mathfrak{A}_{\mathcal{A}_{2}} obtained from Lemma 5. These DFAs have |𝒜1|+1|\mathcal{A}_{1}|+1 and |𝒜2|+1|\mathcal{A}_{2}|+1 states, respectively. If 𝒜1𝒜2\mathcal{A}_{1}\neq\mathcal{A}_{2}, then L(𝔄𝒜1)L(𝔄𝒜2)L(\mathfrak{A}_{\mathcal{A}_{1}})\neq L(\mathfrak{A}_{\mathcal{A}_{2}}). Thus, applying Theorem 3 yields a sequence w=(l1,r1)(lk,rk)w=(l_{1},r_{1})\ldots(l_{k},r_{k}) with

k|𝒜1|+1+|𝒜2|+11=|𝒜1|+|𝒜2|+1k\leq|\mathcal{A}_{1}|+1+|\mathcal{A}_{2}|+1-1=|\mathcal{A}_{1}|+|\mathcal{A}_{2}|+1

such that wL(𝔄𝒜1)w\in L(\mathfrak{A}_{\mathcal{A}_{1}}) if and only if wL(𝔄𝒜2)w\notin L(\mathfrak{A}_{\mathcal{A}_{2}}). By Lemma 5, the label sequence λ=l1lk\lambda=l_{1}\ldots l_{k} has the desired property. ∎
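
In practice, a shortest label sequence on which two (complete) reward machines differ can be found by a breadth-first search over pairs of their states, which mirrors the product of the two DFAs used above; Lemma 6 bounds the length of the returned sequence. A minimal sketch under our own dictionary-based encoding:

from collections import deque

def distinguishing_sequence(q1_init, delta1, sigma1, q2_init, delta2, sigma2, labels):
    """Return a shortest label sequence on which the two reward machines
    produce different reward sequences, or None if they are equivalent."""
    queue, seen = deque([(q1_init, q2_init, [])]), {(q1_init, q2_init)}
    while queue:
        p1, p2, prefix = queue.popleft()
        for l in labels:
            if sigma1[(p1, l)] != sigma2[(p2, l)]:
                return prefix + [l]               # outputs differ here
            nxt = (delta1[(p1, l)], delta2[(p2, l)])
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt[0], nxt[1], prefix + [l]))
    return None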

Similar to Lemma 6, we can show that if an LLM-generated DFA is not compatible with a reward machine, we can also bound the length of a label sequence witnessing this fact.

Lemma 7.

Let 𝒜\mathcal{A} be a reward machine and DD an LLM-generated DFA. If DD is not compatible with 𝒜\mathcal{A}, then there exists a label sequence l1lkl_{1}\ldots l_{k} with k2(|𝒜|+1)|D|k\leq 2(|\mathcal{A}|+1)\cdot|D| such that 𝒜(l1lk)=r1rk\mathcal{A}(l_{1}\ldots l_{k})=r_{1}\ldots r_{k}, rk>0r_{k}>0, and l1lkL(D)l_{1}\ldots l_{k}\notin L(D).

Proof.

Let 𝒜\mathcal{A} be a reward machine and RR\subset\mathbb{R} the finite set of rewards that 𝒜\mathcal{A} can output. Moreover, let D=(VD,vI,D,2𝒫,δD,FD)D=(V_{D},v_{I,D},2^{\mathcal{P}},\delta_{D},F_{D}) be an LLM-generated DFA. Our proof proceeds by constructing four DFAs and a subsequent analysis of the final DFA to derive the desired bound.

First, we construct the DFA 𝔄𝒜=(V𝒜,vI,𝒜,2𝒫×R,δ𝒜,F𝒜)\mathfrak{A}_{\mathcal{A}}=(V_{\mathcal{A}},v_{I,\mathcal{A}},2^{\mathcal{P}}\times R,\delta_{\mathcal{A}},F_{\mathcal{A}}) according to Lemma 5. Recall that (l1,r1)(lk,rk)L(𝔄𝒜)(l_{1},r_{1})\ldots(l_{k},r_{k})\in L(\mathfrak{A}_{\mathcal{A}}) if and only if 𝒜(l1lk)=r1rk\mathcal{A}(l_{1}\ldots l_{k})=r_{1}\dots r_{k}. Moreover, 𝔄𝒜\mathfrak{A}_{\mathcal{A}} has |𝒜|+1|\mathcal{A}|+1 states.

Second, we modify the DFA 𝔄𝒜\mathfrak{A}_{\mathcal{A}} such that it only accepts sequences (l1,r1)(lk,rk)(l_{1},r_{1})\ldots(l_{k},r_{k}) with rk>0r_{k}>0. To this end, we augment the state space with an additional bit b{0,1}b\in\{0,1\}, which tracks whether the most recent reward was positive. More formally, we define a DFA 𝔄𝒜=(V𝒜,vI,𝒜,2𝒫×R,δ𝒜,F𝒜)\mathfrak{A}^{\prime}_{\mathcal{A}}=(V^{\prime}_{\mathcal{A}},v^{\prime}_{I,\mathcal{A}},2^{\mathcal{P}}\times R,\delta^{\prime}_{\mathcal{A}},F^{\prime}_{\mathcal{A}}) by

  • V𝒜=V𝒜×{0,1}V^{\prime}_{\mathcal{A}}=V_{\mathcal{A}}\times\{0,1\};

  • vI,𝒜=(vI,𝒜,0)v^{\prime}_{I,\mathcal{A}}=(v_{I,\mathcal{A}},0);

  • δ𝒜((v,b),(l,r))=(δ𝒜(v,(l,r)),b′)\delta_{\mathcal{A}}^{\prime}\bigl{(}(v,b),(l,r)\bigr{)}=\bigl{(}\delta_{\mathcal{A}}(v,(l,r)),b^{\prime}\bigr{)} where b′=1b^{\prime}=1 if and only if r>0r>0; and

  • F𝒜=F𝒜×{1}F^{\prime}_{\mathcal{A}}=F_{\mathcal{A}}\times\{1\}.

It is not hard to verify that 𝔄𝒜\mathfrak{A}^{\prime}_{\mathcal{A}} indeed has the desired property. Moreover, since 𝔄𝒜\mathfrak{A}_{\mathcal{A}} has |𝒜|+1|\mathcal{A}|+1 states by Lemma 5, 𝔄𝒜\mathfrak{A}^{\prime}_{\mathcal{A}} has 2(|𝒜|+1)2(|\mathcal{A}|+1) states.

Third, we apply “cylindrification” to the LLM-generated DFA DD, which works over the alphabet 2𝒫2^{\mathcal{P}}, to match the input alphabet 2𝒫×R2^{\mathcal{P}}\times R of 𝔄𝒜\mathfrak{A}^{\prime}_{\mathcal{A}}. Our goal is to construct a new DFA D=(VD,vI,D,2𝒫×R,δD,FD)D^{\prime}=(V^{\prime}_{D},v^{\prime}_{I,D},2^{\mathcal{P}}\times R,\delta^{\prime}_{D},F^{\prime}_{D}) that accepts a sequence (l1,r1)(lk,rk)(l_{1},r_{1})\ldots(l_{k},r_{k}) if and only if l1lkL(D)l_{1}\ldots l_{k}\in L(D) for every reward sequence r1rkr_{1}\ldots r_{k} (cylindrification can be thought of as the inverse of the classical projection operation). We achieve this by replacing each transition δD(v,l)=v\delta_{D}(v,l)=v^{\prime} in DD with |R||R| many transitions δD(v,(l,r))=v′\delta^{\prime}_{D}\bigl{(}v,(l,r)\bigr{)}=v^{\prime}, one for each rRr\in R. Formally, we define the DFA DD^{\prime} by

  • VD=VDV^{\prime}_{D}=V_{D};

  • vI,D=vI,Dv^{\prime}_{I,D}=v_{I,D};

  • δD(v,(l,r))=δD(v,l)\delta^{\prime}_{D}\bigl{(}v,(l,r)\bigr{)}=\delta_{D}(v,l) for each rRr\in R; and

  • FD=FDF^{\prime}_{D}=F_{D}.

It is not hard to verify that DD^{\prime} has indeed the desired property and its size is |D||D|.

Fourth, we construct the product DFA of 𝔄𝒜\mathfrak{A}^{\prime}_{\mathcal{A}} and DD^{\prime}. This DFA is given by 𝔄=(V,vI,2𝒫×R,δ,F)\mathfrak{A}=(V,v_{I},2^{\mathcal{P}}\times R,\delta,F) where

  • V=V𝒜×VDV=V^{\prime}_{\mathcal{A}}\times V^{\prime}_{D};

  • vI=(vI,𝒜,vI,D)v_{I}=(v^{\prime}_{I,\mathcal{A}},v^{\prime}_{I,D});

  • δ((v1,v2),(l,r))=(δ𝒜(v1,(l,r)),δD(v2,(l,r)))\delta\bigl{(}(v_{1},v_{2}),(l,r)\bigr{)}=\bigl{(}\delta^{\prime}_{\mathcal{A}}(v_{1},(l,r)),\delta^{\prime}_{D}(v_{2},(l,r))\bigr{)}; and

  • F=F𝒜×(VDFD)F=F^{\prime}_{\mathcal{A}}\times(V^{\prime}_{D}\setminus F^{\prime}_{D}).

By construction of 𝔄𝒜\mathfrak{A}^{\prime}_{\mathcal{A}} and DD^{\prime}, it is not hard to see that 𝔄\mathfrak{A} accepts a sequence (l1,r1)(lk,rk)(l_{1},r_{1})\ldots(l_{k},r_{k}) if and only if 𝒜(l1lk)=r1rk\mathcal{A}(l_{1}\ldots l_{k})=r_{1}\ldots r_{k} with rk>0r_{k}>0 and l1lkL(D)l_{1}\ldots l_{k}\notin L(D)—in other words, L(𝔄)L(\mathfrak{A}) contains all sequences that witness that DD is not compatible with 𝒜\mathcal{A}. Moreover, 𝔄\mathfrak{A} has 2(|𝒜|+1)|D|2(|\mathcal{A}|+1)\cdot|D| states.

It is left to show that if DD is not compatible with 𝒜\mathcal{A}, then we can find a witness with the desired length. To this end, it is sufficient to show that if L(𝔄)L(\mathfrak{A})\neq\emptyset, then there exists a sequence (l1,r1)(lk,rk)L(𝔄)(l_{1},r_{1})\ldots(l_{k},r_{k})\in L(\mathfrak{A}) with k2(|𝒜|+1)|D|k\leq 2(|\mathcal{A}|+1)\cdot|D|. This fact can be established using a simple pumping argument. To this end, assume that (l1,r1)(lk,rk)L(𝔄)(l_{1},r_{1})\ldots(l_{k},r_{k})\in L(\mathfrak{A}) with k>2(|𝒜|+1)|D|k>2(|\mathcal{A}|+1)\cdot|D|. Then, there exists a state vVv\in V such that the unique accepting run of 𝔄\mathfrak{A} on (l1,r1)(lk,rk)(l_{1},r_{1})\ldots(l_{k},r_{k}) visits vv twice, say at the positions i,j{0,k}i,j\in\{0,\ldots k\} with i<ji<j. In this situation, however, the DFA 𝔄\mathfrak{A} also accepts the sequence (l1,r1)(li,ri)(lj+1,rj+1)(lk,rk)(l_{1},r_{1})\ldots(l_{i},r_{i})(l_{j+1},r_{j+1})\ldots(l_{k},r_{k}), where we have removed the “loop” between the repeating visits of vv. Since this new sequence is shorter than the original sequence, we can repeatedly apply this argument until we arrive at a sequence (l1,r1)(l,r)L(𝔄)(l^{\prime}_{1},r^{\prime}_{1})\ldots(l^{\prime}_{\ell},r^{\prime}_{\ell})\in L(\mathfrak{A}) with 2(|𝒜|+1)|D|\ell\leq 2(|\mathcal{A}|+1)\cdot|D|. By construction of 𝔄\mathfrak{A}, this means that 𝒜(l1l)=r1r\mathcal{A}(l^{\prime}_{1}\ldots l^{\prime}_{\ell})=r^{\prime}_{1}\ldots r^{\prime}_{\ell}, r>0r_{\ell}>0, and l1lL(D)l^{\prime}_{1}\ldots l^{\prime}_{\ell}\notin L(D), which proves the claim. ∎
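
For illustration, the witness guaranteed by Lemma 7 can also be searched for directly on the product of the reward machine and D, without materializing the four DFAs above. The following sketch (our own simplification, reusing the dictionary-based encoding from before) performs a breadth-first search over reachable state pairs; the number of such pairs also bounds the length of the returned witness.

from collections import deque

def incompatibility_witness(q0_rm, delta, sigma, q0_d, delta_d, accepting_d, labels):
    """Return a label sequence whose last reward is positive but which D
    rejects, or None if D is compatible with the reward machine."""
    queue, seen = deque([(q0_rm, q0_d, [])]), {(q0_rm, q0_d)}
    while queue:
        q, q_d, prefix = queue.popleft()
        for l in labels:
            r, q2, q_d2 = sigma[(q, l)], delta[(q, l)], delta_d[(q_d, l)]
            if r > 0 and q_d2 not in accepting_d:
                return prefix + [l]               # witness of incompatibility
            if (q2, q_d2) not in seen:
                seen.add((q2, q_d2))
                queue.append((q2, q_d2, prefix + [l]))
    return None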

With these intermediate results at hand, we are now ready to prove Lemma 1.

Proof of Lemma 1.

Let (X0,D0),(X1,D1),(X_{0},D_{0}),(X_{1},D_{1}),\ldots be the sequence of samples and sets of LLM-generated DFAs that arise in the run of LARM-RM whenever a new counterexample is added to XX (in Lines 9 and 10 of Algorithm 2) or an LLM-generated DFA is removed from the set DD (Lines 12 and 13 of Algorithm 2). Moreover, let 𝒜0,𝒜1,\mathcal{A}_{0},\mathcal{A}_{1},\ldots be the corresponding sequence of reward machines that are computed from (Xi,Di)(X_{i},D_{i}). Note that constructing a new reward machine is always possible because LARM-RM makes sure that all LLM-generated DFAs in the current set 𝒟\mathscr{D} are compatible with the traces in the sample XX.

We first observe three properties of these sequences:

  1. 1.

    The true reward machine 𝒜\mathcal{A} (i.e., the one that encodes the reward function RR) is consistent with every sample XiX_{i} that is generated during the run of LARM-RM. This is due to the fact that each counterexample is obtained from an actual exploration of the MDP and, hence, corresponds to the “ground truth”.

  2. 2.

    The sequence X0,X1,X_{0},X_{1},\ldots grows monotonically (i.e., X0X1X_{0}\subseteq X_{1}\subseteq\cdots) because LARM-RM always adds counterexamples to XX and never removes them (Lines 9 and 10). In fact, whenever a counterexample (λ,ρ)(\lambda^{\prime},\rho) is added to XiX_{i} to form Xi+1X_{i+1}, then (λ,ρ)Xi(\lambda^{\prime},\rho)\notin X_{i} (i.e., XiXi+1X_{i}\subsetneq X_{i+1}). To see why this is the case, remember that LARM-RM always constructs hypotheses that are consistent with the current sample (and the current set of LLM-generated DFAs). Thus, the current reward machine 𝒜i\mathcal{A}_{i} is consistent with XiX_{i}, but the counterexample (λ,ρ)(\lambda^{\prime},\rho) was added because 𝒜i(λ)ρ\mathcal{A}_{i}(\lambda^{\prime})\neq\rho. Thus, (λ,ρ)(\lambda^{\prime},\rho) cannot have been an element of XiX_{i}.

  3. 3.

    The sequence D0,D1,D_{0},D_{1},\ldots decreases monotonically (i.e., D0D1D_{0}\supseteq D_{1}\supseteq\cdots) because LARM-RM always removes LLM-generated DFAs from DD and never adds any (Lines 12 and 13). Thus, there exists a position ii^{\star}\in\mathbb{N} at which this sequence becomes stationary, implying that Di=Di+1D_{i}=D_{i+1} for iii\geq i^{\star}.

Similar to Property 1, we now show that each LLM-generated DFA in the set DiD_{i}, iii\geq i^{\star}, is compatible with the true reward machine 𝒜\mathcal{A}. Towards a contradiction, let D𝒟iD\in\mathscr{D}_{i} be an LLM-generated DFA and assume that DD is not compatible with 𝒜\mathcal{A}. Then, Lemma 7 guarantees the existence of a label sequence l1lkl_{1}\ldots l_{k} with

k\displaystyle k 2(|𝒜|+1)|D|\displaystyle\leq 2(|\mathcal{A}|+1)\cdot|D|
2(|𝒜|+1)nmax\displaystyle\leq 2(|\mathcal{A}|+1)\cdot n_{\text{max}}

such that 𝒜(l1lk)=r1rk\mathcal{A}(l_{1}\ldots l_{k})=r_{1}\ldots r_{k}, rk>0r_{k}>0, and l1lkL(D)l_{1}\ldots l_{k}\notin L(D). Since we assume all label sequences to be attainable and have chosen 𝑒𝑝𝑖𝑠𝑜𝑑𝑒𝑙𝑒𝑛𝑔𝑡ℎ2(|𝒜|+1)nmax\mathit{episode_{length}}\geq 2(|\mathcal{A}|+1)\cdot n_{\text{max}}, Corollary 1 guarantees that LARM-RM almost surely explores this label sequence in the limit. Once this happens, LARM-RM removes DD from the set 𝒟\mathscr{D} (Lines 12 and 13), which is a contradiction to the fact that the sequence Di,Di+1,D_{i^{\star}},D_{i^{\star}+1},\ldots is stationary (Property 3). Hence, we obtain the following:

  1. 4.

    Every LLM-generated DFA in DiD_{i}, iii\geq i^{\star}, is compatible with the true reward machine 𝒜\mathcal{A}.

Next, we establish three additional properties of the sub-sequence 𝒜i,𝒜i+1,\mathcal{A}_{i^{\star}},\mathcal{A}_{i^{\star}+1},\ldots of hypotheses starting at position ii^{\star}:

  1. 5.

    The size of the true reward machine 𝒜\mathcal{A} is an upper bound for the size of 𝒜i\mathcal{A}_{i^{\star}} (i.e., |𝒜i||𝒜||\mathcal{A}_{i^{\star}}|\leq|\mathcal{A}|). This is due to the fact that 𝒜\mathcal{A} is consistent with every sample XiX_{i} (Property 1), every LLM-generated DFA in 𝒟i\mathscr{D}_{i}, iii\geq i^{\star}, is compatible with 𝒜\mathcal{A} (Property 4), and LARM-RM always computes minimal consistent reward machines.

  2. 6.

    We have |𝒜i||𝒜i+1||\mathcal{A}_{i}|\leq|\mathcal{A}_{i+1}| for all iii\geq i^{\star}. Towards a contradiction, assume that |𝒜i|>|𝒜i+1||\mathcal{A}_{i}|>|\mathcal{A}_{i+1}|. Since LARM-RM always computes consistent reward machines and XiXi+1X_{i}\subsetneq X_{i+1} if iii\geq i^{\star} (see Property 2), we know that 𝒜i+1\mathcal{A}_{i+1} is not only consistent with Xi+1X_{i+1} but also with XiX_{i} (by definition of consistency). Moreover, LARM-RM computes minimal consistent reward machines. Hence, since 𝒜i+1\mathcal{A}_{i+1} is consistent with XiX_{i} and |𝒜i+1|<|𝒜i||\mathcal{A}_{i+1}|<|\mathcal{A}_{i}|, the reward machine 𝒜i\mathcal{A}_{i} is not minimal, which is a contradiction.

  3. 7.

    We have 𝒜i𝒜j\mathcal{A}_{i}\neq\mathcal{A}_{j} for iii\geq i^{\star} and j{i,,i1}j\in\{i^{\star},\ldots,i-1\}—in other words, the reward machines generated during the run of LARM-RM after the ii^{\star}-th recomputation are pairwise semantically distinct. This is a consequence of the facts that (λj,ρj)(\lambda^{\prime}_{j},\rho_{j}) was a counterexample to 𝒜j\mathcal{A}_{j} (i.e., 𝒜j(λj)ρj\mathcal{A}_{j}(\lambda^{\prime}_{j})\neq\rho_{j}) and that LARM-RM always constructs reward machines consistent with the current sample (which implies 𝒜i(λj)=ρj\mathcal{A}_{i}(\lambda^{\prime}_{j})=\rho_{j} since (λj,ρj)Xi(\lambda^{\prime}_{j},\rho_{j})\in X_{i}).

Properties 5 and 6 now provide |𝒜||\mathcal{A}| as an upper bound on the size of any reward machine that LARM-RM constructs after the ii^{\star}-th recomputation. Since there are only finitely many reward machines of size at most |𝒜||\mathcal{A}|, Property 7 implies that there exists a jij^{\star}\geq i^{\star} after which no new reward machine is learned. Hence, it is left to show that 𝒜j=𝒜\mathcal{A}_{j^{\star}}=\mathcal{A} (i.e., 𝒜j(λ)=𝒜(λ)\mathcal{A}_{j^{\star}}(\lambda)=\mathcal{A}(\lambda) for all label sequences λ\lambda).

Towards a contradiction, let us assume that 𝒜j𝒜\mathcal{A}_{j^{\star}}\neq\mathcal{A}. Then, Lemma 6 guarantees the existence of a label sequence λ=l1lk\lambda=l_{1}\ldots l_{k} with

k\displaystyle k |𝒜j|+|𝒜|+1\displaystyle\leq|\mathcal{A}_{j^{\star}}|+|\mathcal{A}|+1
2|𝒜|+1\displaystyle\leq 2|\mathcal{A}|+1
2(|𝒜|+1)nmax\displaystyle\leq 2(|\mathcal{A}|+1)\cdot n_{\text{max}}

such that 𝒜j(λ)𝒜(λ)\mathcal{A}_{j^{\star}}(\lambda)\neq\mathcal{A}(\lambda).

Since we assume all label sequences to be attainable and have chosen 𝑒𝑝𝑖𝑠𝑜𝑑𝑒𝑙𝑒𝑛𝑔𝑡ℎ2(|𝒜|+1)nmax\mathit{episode_{length}}\geq 2(|\mathcal{A}|+1)\cdot n_{\text{max}}, Corollary 1 guarantees that LARM-RM almost surely explores this label sequence in the limit. Thus, the trace (λ,ρ)(\lambda,\rho), where ρ=𝒜(λ)\rho=\mathcal{A}(\lambda), is almost surely returned as a new counterexample, resulting in a new sample Xj+1X_{j^{\star}+1}. This, in turn, causes the construction of a new reward machine, which contradicts the assumption that no further reward machine is generated. Thus, the reward machine 𝒜j\mathcal{A}_{j^{\star}} is equivalent to the true reward machine 𝒜\mathcal{A}. ∎

Let us finally turn to the proof of Theorem 1 (i.e., that LARM-RM almost surely converges to the optimal policy in the limit). The key idea is to construct the product of the given MDP \mathcal{M} and the true reward machine 𝒜\mathcal{A}. In this new MDP \mathcal{M}^{\prime}, the reward function is in fact Markovian since the reward machine 𝒜\mathcal{A} encodes all necessary information in its states (and, hence, in the product \mathcal{M}^{\prime}). Thus, the fact that classical Q-learning almost surely converges to an optimal policy in the limit also guarantees the same for LARM-RM. We refer the reader to the extended version of Xu et al. [2020b] for a detailed proof.
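
For illustration only, the following sketch runs tabular Q-learning on pairs of an MDP state and a reward-machine state, which is the product construction referred to above. The callbacks env_reset, env_step, and label_of, as well as all hyperparameters, are assumptions of this sketch and not part of LARM-RM.

import random
from collections import defaultdict

def q_learning_on_product(env_reset, env_step, label_of, q0_rm, delta, sigma,
                          actions, episodes=1000, ep_len=50,
                          alpha=0.1, gamma=0.9, eps=0.1):
    """Q-learning over the product state space; the reward machine supplies the
    (now Markovian) reward and its next state."""
    Q = defaultdict(float)                        # Q[((s, q), a)]
    for _ in range(episodes):
        s, q = env_reset(), q0_rm
        for _ in range(ep_len):
            if random.random() < eps:             # epsilon-greedy exploration
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[((s, q), a_)])
            s2 = env_step(s, a)                   # sample the next MDP state
            l = label_of(s, a, s2)                # label of the transition
            r, q2 = sigma[(q, l)], delta[(q, l)]  # reward machine output and move
            best_next = max(Q[((s2, q2), a_)] for a_ in actions)
            Q[((s, q), a)] += alpha * (r + gamma * best_next - Q[((s, q), a)])
            s, q = s2, q2
    return Q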

Appendix E Unattainable Label Sequences

Let us now consider the case that not all label sequences can be explored by the agent (e.g., because certain label sequences are not possible in the MDP). In this situation, two complications arise:

  1. 1.

    It is no longer possible to uniquely learn the true reward machine 𝒜\mathcal{A}. In fact, any reward machine that agrees with 𝒜\mathcal{A} on the attainable traces becomes a valid solution.

  2. 2.

    The notion of compatibility needs to reflect the attainable traces. More precisely, the condition rk>0r_{k}>0 implies l1lkL(D)l_{1}\ldots l_{k}\in L(D) now has to hold only for attainable label sequences.

Both complications add a further layer of complexity, which makes the overall learning problem—both Q-learning and the learning of reward machines—harder. In particular, we have to adapt the episode length, and the size of the SAT encoding grows. In total, we obtain the following result.

Theorem 4.

Let \mathcal{M} be a labeled MDP in which not all label sequences are necessarily attainable and 𝒜\mathcal{A} the true reward machine encoding the rewards of \mathcal{M}. Moreover, let 𝒟={D1,,D}\mathscr{D}=\{D_{1},\ldots,D_{\ell}\} be a set of LLM-generated DFAs and nmax=max1i{|Di|}n_{\text{max}}=\max_{1\leq i\leq\ell}{\{|D_{i}|\}}. Then, LARM-RM with

𝑒𝑝𝑖𝑠𝑜𝑑𝑒𝑙𝑒𝑛𝑔𝑡ℎmax{2||(|𝒜|+1)nmax,||(|𝒜|+1)2}\mathit{episode_{length}}\geq\max{\bigl{\{}2|\mathcal{M}|\cdot(|\mathcal{A}|+1)\cdot n_{\text{max}},|\mathcal{M}|(|\mathcal{A}|+1)^{2}\bigr{\}}}

almost surely converges to an optimal policy in the limit. The size of the formula ΦnX,D\Phi_{n}^{X,D} grows linearly in |||\mathcal{M}|.

In order to prove Theorem 4, specifically the bound on the length of episodes, we first need to introduce the nondeterministic version of DFAs. Formally, a nondeterministic finite automaton (NFA) is a tuple 𝔄=(𝖰,𝗊𝖨,Σ,Δ,F)\mathfrak{A}=(\mathsf{Q},\mathsf{q_{I}},\Sigma,\Delta,F) where 𝖰\mathsf{Q}, 𝗊𝖨\mathsf{q_{I}}, Σ\Sigma, 𝖥\mathsf{F} are as in DFAs and Δ𝖰×Σ×𝖰\Delta\subset\mathsf{Q}\times\Sigma\times\mathsf{Q} is the transition relation. Similar to DFAs, a run of an NFA 𝔄\mathfrak{A} on a word u=a1aku=a_{1}\ldots a_{k} is a sequence 𝗊0,,𝗊k\mathsf{q}_{0},\ldots,\mathsf{q}_{k} such that 𝗊0=𝗊𝖨\mathsf{q}_{0}=\mathsf{q_{I}} and (𝗊i1,ai,𝗊i)Δ(\mathsf{q}_{i-1},a_{i},\mathsf{q}_{i})\in\Delta for each i{1,,k}i\in\{1,\ldots,k\}. In contrast to DFAs, however, NFAs permit multiple runs on the same input or even no run. Accepting runs and the language of NFAs are defined analogously to DFAs. Note that we use NFAs instead of DFAs because the former can be exponentially more succinct than the latter.

Similar to Lemma 5, one can translate an MDP \mathcal{M} into an NFA 𝔄\mathfrak{A}_{\mathcal{M}} with |||\mathcal{M}| states that accepts exactly the attainable traces of an MDP \mathcal{M}.

Lemma 8.

Given a labeled MDP \mathcal{M}, one can construct an NFA 𝔄\mathfrak{A}_{\mathcal{M}} with at most 2||2|\mathcal{M}| states that accepts exactly the attainable label sequences of \mathcal{M}.

This construction is similar to Remark 1 of the extended version of Xu et al. [2020b] and proceeds as follows.

Proof.

Given a labeled MDP =(S,sI,A,p,R,γ,𝒫,L)\mathcal{M}=(S,s_{I},A,p,R,\gamma,\mathcal{P},L), we construct an NFA 𝔄=(𝖰,𝗊𝖨,2𝒫,Δ,F)\mathfrak{A}_{\mathcal{M}}=(\mathsf{Q},\mathsf{q_{I}},2^{\mathcal{P}},\Delta,F) by

  • 𝖰=S\mathsf{Q}=S;

  • 𝗊𝖨=sI\mathsf{q_{I}}=s_{I};

  • (s,,s)Δ(s,\ell,s^{\prime})\in\Delta if and only if there exists an action aAa\in A with L(s,a,s)=L(s,a,s^{\prime})=\ell and p(s,a,s′)>0p(s,a,s^{\prime})>0; and

  • 𝖥=S\mathsf{F}=S.

A straightforward induction shows that λL(𝔄)\lambda\in L(\mathfrak{A}_{\mathcal{M}}) holds if and only if λ\lambda is an attainable label sequence of \mathcal{M}. ∎
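
For illustration, the construction of Lemma 8, together with the standard set-of-states simulation for deciding NFA membership, can be sketched as follows. The dictionaries p and label, both keyed by (s, a, s'), are assumed representations of the transition probabilities and the labeling function, not data structures from the paper.

def mdp_to_nfa(states, s_init, actions, p, label):
    """NFA whose states are the MDP states; s --l--> s2 whenever some action
    reaches s2 from s with positive probability under label l; all states accept."""
    delta = {(s, label[(s, a, s2)], s2)
             for s in states for a in actions for s2 in states
             if p.get((s, a, s2), 0.0) > 0}
    return set(states), s_init, delta, set(states)

def nfa_accepts(nfa, labels):
    """Set-of-states simulation: track all states reachable on the input."""
    states, s_init, delta, accepting = nfa
    current = {s_init}
    for l in labels:
        current = {s2 for (s, ll, s2) in delta if s in current and ll == l}
        if not current:
            return False
    return bool(current & accepting)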

We are now ready to prove Theorem 4.

Proof of Theorem 4.

We first modify Lemma 7 slightly. More precisely, we add a fifth step, in which we build the product of 𝔄\mathfrak{A} with the cylindrification of the NFA 𝔄\mathfrak{A}_{\mathcal{M}}. It is not hard to verify that this results in an NFA that accepts a trace (l1lk,r1rk)(l_{1}\ldots l_{k},r_{1}\ldots r_{k}) if and only if l1lkl_{1}\ldots l_{k} is attainable and not in the language L(D)L(D), 𝒜(l1lk)=r1rk\mathcal{A}(l_{1}\ldots l_{k})=r_{1}\ldots r_{k}, and rk>0r_{k}>0. This NFA has 2||(|𝒜|+1)|D|2|\mathcal{M}|(|\mathcal{A}|+1)\cdot|D| states. A pumping argument similar to the one in the proof of Lemma 7 can now be used to show the existence of a witness of length at most 2||(|𝒜|+1)|D|2|\mathcal{M}|(|\mathcal{A}|+1)\cdot|D|.

Analogously, we can modify Lemma 6. We build the input-synchronized product of the two DFAs 𝔄𝒜1\mathfrak{A}_{\mathcal{A}_{1}} and 𝔄𝒜2\mathfrak{A}_{\mathcal{A}_{2}} and the NFA 𝔄\mathfrak{A}_{\mathcal{M}}. This results in an NFA with (|𝒜1|+1)(|𝒜2|+1)||(|\mathcal{A}_{1}|+1)\cdot(|\mathcal{A}_{2}|+1)\cdot|\mathcal{M}| states. If 𝒜1\mathcal{A}_{1} and 𝒜2\mathcal{A}_{2} differ on an attainable label sequence, then we can find an input that leads in this product to a state in which 𝔄\mathfrak{A}_{\mathcal{M}} accepts (i.e., the label sequence is attainable) but exactly one of 𝔄𝒜1\mathfrak{A}_{\mathcal{A}_{1}} and 𝔄𝒜2\mathfrak{A}_{\mathcal{A}_{2}} accepts. Using a pumping argument as above, we can bound the length of such an input by (|𝒜1|+1)(|𝒜2|+1)||(|\mathcal{A}_{1}|+1)\cdot(|\mathcal{A}_{2}|+1)\cdot|\mathcal{M}|.

Moreover, we have to modify the formula ΦnX,D\Phi_{n}^{X,D} to account for the new situation. Given 𝔄=(𝖰𝖠,𝗊𝖨,𝖠,2𝒫,Δ𝖠,F𝖠)\mathfrak{A}_{\mathcal{M}}=(\mathsf{Q_{A_{\mathcal{M}}}},\mathsf{q_{I,A_{\mathcal{M}}}},2^{\mathcal{P}},\Delta_{\mathsf{A_{\mathcal{M}}}},F_{\mathsf{A_{\mathcal{M}}}}), we first introduce new auxiliary variables z𝗉,𝗉,𝗉′′Dz_{\mathsf{p},\mathsf{p^{\prime}},\mathsf{p^{\prime\prime}}}^{D} where 𝗉′′𝖰𝒜\mathsf{p^{\prime\prime}}\in\mathsf{Q_{\mathcal{A}_{\mathcal{M}}}}; we use these variables in place of the variables y𝗉,𝗉Dy_{\mathsf{p},\mathsf{p^{\prime}}}^{D}. Second, we replace Formulas (6), (7), and (8) with the following three formulas:

z𝗊𝖨,𝗊𝖨,𝖣,𝗊𝖨,𝖠D\displaystyle z_{\mathsf{q_{I}},\mathsf{q^{\prime}_{I,D},\mathsf{q_{I,A_{\mathcal{M}}}}}}^{D} (9)
𝗉,𝗊𝖰l2𝒫δD(𝗉,l)=𝗊(𝗉′′,l,𝗊′′)Δ𝖠(z𝗉,𝗉,𝗉′′Dd𝗉,l,𝗊)z𝗊,𝗊,𝗊′′D\displaystyle\bigwedge_{\mathsf{p,q}\in\mathsf{Q}}\bigwedge_{l\in 2^{\mathcal{P}}}\bigwedge_{\delta_{D}(\mathsf{p^{\prime}},l)=\mathsf{q^{\prime}}}\bigwedge_{(\mathsf{p^{\prime\prime}},l,\mathsf{q^{\prime\prime}})\in\Delta_{\mathsf{A_{\mathcal{M}}}}}(z_{\mathsf{p},\mathsf{p^{\prime}},\mathsf{p^{\prime\prime}}}^{D}\land d_{\mathsf{p},l,\mathsf{q}})\rightarrow z_{\mathsf{q},\mathsf{q^{\prime}},\mathsf{q^{\prime\prime}}}^{D} (10)
𝗉𝖰δD(𝗉,l)=𝗊𝗊𝖥D(𝗉′′,l,𝗊′′)Δ𝖠𝗊′′F𝖠z𝗉,𝗉,𝗉′′D¬rRXr>0o𝗉,l,r\displaystyle\bigwedge_{\mathsf{p}\in\mathsf{Q}}\bigwedge_{\begin{subarray}{c}\delta_{D}(\mathsf{p^{\prime}},l)=\mathsf{q^{\prime}}\\ \mathsf{q^{\prime}}\notin\mathsf{F}_{D}\end{subarray}}\bigwedge_{\begin{subarray}{c}(\mathsf{p^{\prime\prime}},l,\mathsf{q^{\prime\prime}})\in\Delta_{\mathsf{A_{\mathcal{M}}}}\\ \mathsf{q^{\prime\prime}}\in F_{\mathsf{A_{\mathcal{M}}}}\end{subarray}}z_{\mathsf{p},\mathsf{p^{\prime}},\mathsf{p^{\prime\prime}}}^{D}\rightarrow\lnot\bigvee_{\begin{subarray}{c}r\in R_{X}r>0\end{subarray}}o_{\mathsf{p},l,r} (11)

Clearly, the size of ΦnX,D\Phi_{n}^{X,D} grows linearly in |||\mathcal{M}| as compared to the case that all label sequences are attainable. Moreover, it is not hard to verify that ΦnX,D\Phi_{n}^{X,D} has indeed the desired meaning. More precisely, we obtain a result analogous to Lemma 1, but for the case that not all label sequences are attainable.

Once we have established that our learning algorithm learns consistent and compatible reward machines, the overall convergence of LARM-RM to an optimal policy and the bound on the length of episodes can then be proven analogously to Theorem 1. ∎