Using Large Language Models to Automate and Expedite Reinforcement Learning with Reward Machine
Abstract
We present LARL-RM (Large language model-generated Automaton for Reinforcement Learning with Reward Machine), an algorithm that encodes high-level knowledge into reinforcement learning in the form of an automaton in order to expedite learning. Our method uses Large Language Models (LLMs) to obtain high-level domain-specific knowledge through prompt engineering, instead of providing the reinforcement learning algorithm with this knowledge directly, which would require an expert to encode the automaton. We use chain-of-thought and few-shot methods for prompt engineering and demonstrate that our method works using these approaches. Additionally, LARL-RM allows for fully closed-loop reinforcement learning without the need for an expert to guide and supervise the learning, since LARL-RM can use the LLM directly to generate the required high-level knowledge for the task at hand. We also show that our algorithm is theoretically guaranteed to converge to an optimal policy. Finally, we demonstrate that LARL-RM speeds up convergence by implementing our method in two case studies.
1 Introduction
Large Language Models (LLMs) encode various types of information, including domain-specific knowledge, which enables them to be used as a source for extracting specific information about a domain. This knowledge can be used for planning and control of various systems, such as autonomous systems, by synthesizing control policies Seff et al. (2023). These policies essentially dictate the agent's actions in the environment at a high level.
In general, extracting domain-specific knowledge may not be a straightforward task, since prompting the LLM without a proper prompting technique may result in hallucinations Jang et al. (2023); even with proper techniques, LLMs are known to hallucinate. Several methods have been suggested for prompt engineering, ranging from zero-shot prompting Kojima et al. (2022) all the way to training the pre-trained language model on a domain-specific dataset (fine-tuning) Rafailov et al. (2023) to obtain more accurate responses. The latter approach is more involved and resource-intensive, since it requires a dataset related to the domain as well as enough computation power to train the pre-trained language model on that dataset Ding et al. (2023). Moreover, not all language models are available for fine-tuning due to their providers' policies. Hence, fine-tuning is a less desirable approach as long as other prompt engineering methods can produce accurate responses.
Our algorithm LARL-RM (large language model-generated automaton for reinforcement learning with reward machine) uses prompt engineering methods to extract the relevant knowledge from the LLM. This knowledge is then incorporated into the reinforcement learning (RL) algorithm in the form of high-level knowledge to expedite the RL and guide it to the optimal policy faster. We integrate the LLM-encoded high-level knowledge into the RL using deterministic finite automata (DFA). We make the following contributions: (a) LARL-RM allows for integration of high-level knowledge from a specific domain into the RL using LLM outputs directly. (b) Our proposed method is capable of adjusting the instructions to the LLM by closing the loop in learning and updating the prompt, based on the chain-of-thought method, to generate new automata when counterexamples are encountered during the policy update. (c) LARL-RM is capable of learning the reward machine and expediting the RL by providing the LLM-generated DFA to the RL algorithm. We also show that LARL-RM speeds up the RL by implementing it in two case studies.
Related work. Generative language models (GLMs) such as the GPT models have been shown to encode world knowledge and to be capable of generating instructions for a task Hendrycks et al. (2020). Researchers in West et al. (2021) have demonstrated that advances in GLMs allow the construction of task-relevant knowledge graphs. Generating automata from GLMs. Generating an automaton from a generative language model for a specific task is shown in Yang et al. (2022); however, the proposed method, called GLM2FSA, uses verbs and verb phrases and maps them onto a predefined set of verbs in order to convert the instructions given by the GLM into an automaton. This reduces the generality of the method when applied to other domains and still requires an expert to pre-define this set. GLMs are also used to extract task-relevant semantic knowledge and generate plans to complete a task in robotics Vemprala et al. (2023). Learning reward machines. Researchers in Neider et al. (2021) have proposed a method called AdvisoRL for RL in which the algorithm is capable of learning the reward machine; however, an expert is required to guide the learning process through the provision of so-called advice. In our method, this expert is replaced by an LLM.
2 Preliminaries
In this section, we define our notation and introduce the necessary background on reinforcement learning, reward machines, and deterministic finite automata (DFA).
Definition 1 (labeled Markov decision process).
A labeled Markov decision process is a tuple $\mathcal{M} = (S, s_I, A, p, R, \gamma, \mathcal{P}, L)$ where $S$ is a finite set of states, $s_I \in S$ is the agent's initial state, $A$ is a finite set of actions, $p \colon S \times A \times S \to [0, 1]$ is the probabilistic transition function, $R$ is the (possibly non-Markovian) reward function, $\gamma \in (0, 1)$ is the discount factor, $\mathcal{P}$ is a finite set of propositions, and $L \colon S \times A \times S \to 2^{\mathcal{P}}$ is a labeling function that determines the high-level events that the agent encounters in the environment.
A policy $\pi$ is a function mapping states in $S$ to a probability distribution over actions in $A$: at state $s_t$, the agent chooses action $a_t$ with probability $\pi(s_t, a_t)$, which leads to a new state $s_{t+1}$. The probability of reaching $s_{t+1}$ is given by $p(s_t, a_t, s_{t+1})$. The resulting sequence of states and actions of this stochastic process is a trajectory $s_0 a_0 s_1 a_1 \cdots a_{k-1} s_k$. Its corresponding label sequence is $\ell_1 \ell_2 \cdots \ell_k$, where $\ell_{i+1} = L(s_i, a_i, s_{i+1})$. A trace is the pair $(\ell_1 \ell_2 \cdots \ell_k, r_1 r_2 \cdots r_k)$ of labels and rewards, where the payoff of the agent is $\sum_{i=1}^{k} \gamma^{i-1} r_i$.
We exploit deterministic finite automata (DFA) to encode the high-level knowledge obtained from the LLM and pass it to reinforcement learning in order to speed up the process of obtaining the optimal policy $\pi^*$. A DFA reads sequences over an input alphabet and classifies them as accepted or rejected, which we can use to steer the learning process.
Definition 2 (Deterministic finite automata).
A DFA is a tuple $\mathcal{A} = (Q, q_I, \Sigma, \delta, F)$ where $Q$ is the finite set of states, $q_I \in Q$ is the initial state, $\Sigma$ is the input alphabet (here we consider $\Sigma = 2^{\mathcal{P}}$), $\delta \colon Q \times \Sigma \to Q$ is the transition function between the states, and $F \subseteq Q$ is the set of accepting states.
We use deterministic finite automata to model the LLM-generated advice. The main idea of the LLM-generated DFA is to assist the algorithm in learning the reward machine by providing information about which label sequences may result in a positive reward and which ones will definitely result in a non-positive reward Neider et al. (2021).
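To make Definition 2 concrete, the following is a minimal Python sketch of a DFA and its acceptance check; the states, labels, and transitions are illustrative placeholders (single symbols rather than subsets of the proposition set), not the automata used in the case studies.

```python
from dataclasses import dataclass

@dataclass
class DFA:
    states: set        # Q
    initial: str       # q_I
    transitions: dict  # delta: (state, label) -> state
    accepting: set     # F

    def run(self, labels):
        """Return the state sequence q_0 ... q_k induced by a label sequence."""
        q = self.initial
        sequence = [q]
        for ell in labels:
            q = self.transitions[(q, ell)]
            sequence.append(q)
        return sequence

    def accepts(self, labels):
        """A label sequence is in L(A) iff its run ends in an accepting state."""
        return self.run(labels)[-1] in self.accepting

# Toy DFA over two illustrative labels: 'g' (green light) and 'p' (pedestrian).
dfa = DFA(
    states={"q0", "q1"},
    initial="q0",
    transitions={("q0", "g"): "q1", ("q0", "p"): "q0",
                 ("q1", "g"): "q1", ("q1", "p"): "q0"},
    accepting={"q1"},
)
print(dfa.accepts(["p", "g"]))  # True: the run q0 -> q0 -> q1 ends in an accepting state
```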
Reward machines are a type of Mealy machine used to encode a non-Markovian reward function. They are automata that receive a label from the environment and respond with a reward as well as a transition to their next state. Their input alphabet is the set of subsets of the propositional variables in $\mathcal{P}$ (i.e., $2^{\mathcal{P}}$) and their output alphabet is a set of real numbers.
Definition 3 (Reward machine).
A reward machine is a tuple $\mathcal{R} = (V, v_I, 2^{\mathcal{P}}, \mathbb{R}, \delta, \sigma)$ where $V$ is a finite set of states, $v_I \in V$ is the initial state, the input alphabet is $2^{\mathcal{P}}$, the output alphabet is $\mathbb{R}$, $\delta \colon V \times 2^{\mathcal{P}} \to V$ is the transition function, and $\sigma \colon V \times 2^{\mathcal{P}} \to \mathbb{R}$ is the output function.
Applying the reward machine $\mathcal{R}$ to a sequence of labels $\ell_1 \ell_2 \cdots \ell_k$ creates a sequence of states $v_0 v_1 \cdots v_k$ where $v_0 = v_I$ and $v_{i+1} = \delta(v_i, \ell_{i+1})$ for all $0 \le i < k$. Each transition to the next state of the reward machine results in a reward $r_{i+1} = \sigma(v_i, \ell_{i+1})$. Hence, a label sequence $\ell_1 \ell_2 \cdots \ell_k$ corresponds to a reward sequence $\mathcal{R}(\ell_1 \cdots \ell_k) = r_1 r_2 \cdots r_k$ produced by the reward machine $\mathcal{R}$. Therefore, the reward function $R$ of an MDP is encoded by the reward machine $\mathcal{R}$ if, for every trajectory with label sequence $\ell_1 \cdots \ell_k$, the agent receives the reward sequence $\mathcal{R}(\ell_1 \cdots \ell_k)$.
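To illustrate how a reward machine maps a label sequence to a reward sequence, here is a small Python sketch of Definition 3; the states, labels, and reward values are illustrative only.

```python
class RewardMachine:
    def __init__(self, states, initial, delta, sigma):
        self.states = states      # V
        self.initial = initial    # v_I
        self.delta = delta        # transition function: (v, label) -> v'
        self.sigma = sigma        # output function: (v, label) -> reward

    def rewards(self, labels):
        """Map a label sequence l_1 ... l_k to the reward sequence r_1 ... r_k."""
        v, out = self.initial, []
        for ell in labels:
            out.append(self.sigma[(v, ell)])
            v = self.delta[(v, ell)]
        return out

# Toy reward machine: reward 1 is paid once when the event 'goal' is first observed.
rm = RewardMachine(
    states={"v0", "v1"},
    initial="v0",
    delta={("v0", "goal"): "v1", ("v0", "step"): "v0",
           ("v1", "goal"): "v1", ("v1", "step"): "v1"},
    sigma={("v0", "goal"): 1, ("v0", "step"): 0,
           ("v1", "goal"): 0, ("v1", "step"): 0},
)
print(rm.rewards(["step", "goal", "step"]))  # [0, 1, 0]
```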

Motivating Example. We consider an autonomous car on American roads with a given set of actions (see Figure 1). The agent must reach the first intersection and then make a right turn when the traffic light is green, provided there is no car on the left and no pedestrian on the right. The agent then reaches the next intersection, where it needs to go straight when the traffic light is green, provided there are no cars and no pedestrians. The corresponding DFA for this motivating example is shown in Figure 2.
3 Generating Domain Specific Knowledge Using LLM
Large Language Models (LLMs) encode rich world and domain-specific knowledge Yang et al. (2023), which allows for the extraction of relevant information through appropriate prompting; prompting has even been shown to surpass fine-tuning of LLMs in some cases Wei et al. (2022). Hence, we use prompt engineering to extract the relevant domain knowledge from the LLM and provide it in the form of an LLM-generated DFA Neider et al. (2021) to expedite reinforcement learning. An LLM-generated DFA can be thought of as a means to provide the process of learning the reward machine with information about which label sequences are promising in the sense that they may result in a reward.
Applying a DFA $\mathcal{A}$ to a sequence $\ell_1 \ell_2 \cdots \ell_k$ results in a sequence of states $q_0 q_1 \cdots q_k$ where $q_0 = q_I$ and $q_{i+1} = \delta(q_i, \ell_{i+1})$. We define $L(\mathcal{A}) = \{\ell_1 \cdots \ell_k \mid q_k \in F\}$, also commonly referred to as the formal language accepted by $\mathcal{A}$. A DFA $\mathcal{A}$ can serve as an LLM-generated DFA if the label sequences in $L(\mathcal{A})$ could result in a positive reward, while the label sequences outside $L(\mathcal{A})$ should not receive a positive reward; it is therefore a binary classifier that points out the promising explorations Neider et al. (2021). Our proposed algorithm uses the LLM-generated DFAs to narrow down the search space for learning the ground truth reward machine, hence expediting reinforcement learning; in our method, however, the LLM-generated DFA is generated automatically for a specific domain using the LLM. It is known that LLMs hallucinate; therefore, our proposed algorithm is capable of ignoring the LLM-generated DFA if it is incompatible with the underlying ground truth reward machine.
Definition 4 (LLM-generated DFA compatibility).
An LLM-generated DFA $\mathcal{A}$ is compatible with the ground truth reward machine $\mathcal{R}$ if for all label sequences $\ell_1 \cdots \ell_k$ with reward sequence $\mathcal{R}(\ell_1 \cdots \ell_k) = r_1 \cdots r_k$ it holds that $r_k > 0$ implies $\ell_1 \cdots \ell_k \in L(\mathcal{A})$.
We demonstrate an LLM-generated DFA for the motivating example in Figure 3.
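Definition 4 can be checked against a finite set of observed traces. The sketch below assumes the illustrative DFA class from Section 2: a trace whose final reward is positive but whose label sequence the DFA rejects witnesses incompatibility, and it is exactly this kind of counterexample that LARL-RM later uses to re-prompt the LLM.

```python
def incompatible_traces(dfa, traces):
    """Return traces witnessing that an LLM-generated DFA violates Definition 4.

    Each trace is a pair (labels, rewards) observed in the environment. The DFA
    is incompatible if some label sequence earned a positive final reward but is
    not accepted by the DFA; such traces are the counterexamples LARL-RM feeds
    back into the prompt.
    """
    witnesses = []
    for labels, rewards in traces:
        if rewards and rewards[-1] > 0 and not dfa.accepts(labels):
            witnesses.append((labels, rewards))
    return witnesses
```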
In our proposed method we generate the domain-specific knowledge using the GPT series Liu et al. (2023). There are several methods of prompting LLMs, such as the zero-shot method Kojima et al. (2022), where the prompt uses a step-by-step approach to obtain better results, or the few-shot method Brown et al. (2020), where the prompt is given similar examples encouraging the LLM to produce similar reasoning to reach the correct answer. We use a combination of these methods to encourage the GPT model to produce the correct information. First, we adopt a persona for the domain from which we plan to extract relevant information; in our example, the persona is an expert in American road traffic rules. Adopting a persona allows for more coherent and higher-quality responses from the language model. Figure 4 demonstrates the GPT response to a question about making a right turn at a traffic light while an expert persona is adopted.

Figure 4 demonstrates that we can obtain the knowledge for a specific domain using the appropriate prompting method. In general, LLMs are text-based, which means that their output is text and cannot directly be used in an RL algorithm; therefore, we require a method to convert this textual output into an automaton. The next step is to ask the LLM to convert the provided steps into relevant propositions so that they can be used for the generation of the automaton. We perform this by mapping the steps onto the set of propositions, as illustrated in Figure 5.
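A minimal sketch of this prompting step, assuming the OpenAI Python SDK; the persona text, task wording, and proposition names are illustrative, not the exact prompts shown in Figures 4 and 5.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PERSONA = "You are an expert in American road traffic rules."
TASK = ("List the steps for making a right turn at a traffic light, "
        "then map each step onto the propositions: green_light, "
        "car_on_left, pedestrian_on_right.")

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": PERSONA},       # adopt the domain persona
        {"role": "user", "content": TASK + " Think step by step."},
    ],
    temperature=0.0,   # low temperature for more deterministic answers
    top_p=1.0,
)
print(response.choices[0].message.content)
```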

We use the mapped steps (instructions) to generate the DFA, since actions and propositions are available; however, transitions and states are yet to be defined. We can use the labels for the transitions between states, and to obtain the states we circle back to the task and its instructions. Therefore, we use the task description and the mapped steps to obtain the transitions and states; Figure 6 shows the prompting used to create them.

We can construct a DFA from the states $Q$, the propositions $\mathcal{P}$, and the transition function $\delta$; therefore, we can now use this information to generate the DFA for the task and use it to expedite convergence to the optimal policy $\pi^*$.
Algorithm 1: Generating a DFA using the LLM
Input: prompt
Parameter: temperature, top-p (OpenAI LLM parameters)
Output: DFA
Algorithm 1 can be updated to not even require propositions and actions to obtain the DFA. Generating a DFA this way requires a different prompting technique, which combines the few-shot and chain-of-thought methods; we call it MixedR (mixed reasoning). We exploit this approach for tasks that are more involved, since obtaining the propositions and actions requires more information about the system. MixedR only requires the task description and example DFAs, which do not even need to be relevant to the task, as long as descriptions are provided for the example DFAs, even at a high level, since LLMs are capable of extracting complex patterns from textual data. We use MixedR to generate the DFA. Figure 7 shows the prompt template used for MixedR.

We can use the motivating example and the MixedR prompting technique to obtain the relevant DFA for a smaller instance of the problem. We can then provide the DFA obtained from MixedR as the LLM-generated DFA for the RL algorithm to use in order to learn the ground truth reward machine in a smaller search space.
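The sketch below shows one way a MixedR-style prompt could be assembled and its reply parsed into a DFA, reusing the OpenAI client and the illustrative DFA class from above; the template wording, the JSON output format, and the parsing logic are assumptions made for illustration, not the exact template of Figure 7.

```python
import json

MIXEDR_TEMPLATE = """You are an expert in American road traffic rules.
Here is an example DFA, given as JSON with keys "states", "initial",
"accepting", and "transitions" (a list of [state, label, next_state]):
{example_dfa}
Now, for the following task, reason step by step and then output only a
JSON object in the same format.
Task: {task}"""

def generate_dfa(client, task, example_dfa, temperature=0.0, top_p=1.0):
    prompt = MIXEDR_TEMPLATE.format(example_dfa=json.dumps(example_dfa), task=task)
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        top_p=top_p,
    )
    text = response.choices[0].message.content
    # Assume the final JSON object in the reply is the DFA; extract and parse it.
    spec = json.loads(text[text.find("{"): text.rfind("}") + 1])
    return DFA(states=set(spec["states"]),
               initial=spec["initial"],
               transitions={(s, l): t for s, l, t in spec["transitions"]},
               accepting=set(spec["accepting"]))
```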
4 Expediting Reinforcement Learning Using LLM-generated DFA
We use the LLM output as instructions to expedite the convergence of reinforcement learning to the optimal policy. We use the LLM to narrow down the search space for the RL algorithm, since in a typical reinforcement learning algorithm the search space is considerable, leading to cases where training requires significant time and resources. Hence, using pre-trained LLMs we can reduce this search space and converge to the optimal policy faster. One main issue to address is that LLMs may generate outputs that are not aligned with facts; therefore, it is necessary to close the loop in the RL algorithm so that, if a counterexample is encountered, the algorithm is capable of adjusting and updating the prompt to create an updated LLM-generated DFA that can be used by the RL algorithm. Figure 8 illustrates the LARL-RM algorithm.

Algorithm 2: LARL-RM
Input: prompt, episode length
Parameter: learning rate, discount factor, epsilon-greedy parameter, temperature, top-p, LLM query budget
Our proposed method LARL-RM uses the prompt provided by the user to run the RL algorithm for a specific task that requires domain knowledge; by using the LLM, the need for an expert is minimized, and the algorithm itself can update the prompt using counterexamples to update its DFA. LARL-RM first initializes an empty set for the traces, then uses the prompt to generate the relevant DFA for the RL (Lines 1-2). The algorithm then initializes the reward machine based on the LLM-generated DFA obtained from the LLM and also initializes QRM (q-learning for reward machines) before starting the episodes (Lines 3-4). LARL-RM stores the traces and DFAs in order to reinitialize the reward machine and q-values if counterexamples are encountered; QRM is then called to update the q-values, and if any counterexample is found, the corresponding trace is stored so that it can be used to update the reward machine and the DFA (Lines 5-10). If the trace encountered by the agent is not compatible with the ground truth reward machine, the corresponding DFA is removed from the LLM-generated DFA set, and the algorithm uses this incompatible DFA, the initial prompt, and the counterexample label sequence to obtain an updated prompt. LARL-RM also uses an LLM query budget to ensure that, if the responses are not compatible with the ground truth reward machine, the algorithm does not get stuck in a loop: once the budget is depleted, it starts to learn the ground truth reward machine without the LLM-generated DFA (Lines 13-16; this further highlights the importance of prompt engineering). Afterward, LARL-RM uses the stored DFA and trace sets to reinitialize the reward machine and q-values (Lines 18-19).
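The following Python-style sketch summarizes the control flow described above; the helper functions (DFA generation, reward machine learning, the QRM episode, prompt refinement) are hypothetical placeholders for the components discussed in Sections 4.1 and 4.2, so this is an outline of the loop rather than a faithful reproduction of Algorithm 2.

```python
def larl_rm(env, llm, prompt, episodes, episode_length, llm_budget):
    traces = []                                       # sample of observed traces
    dfas = [generate_dfa_from_prompt(llm, prompt)]    # placeholder: Algorithm 1 / MixedR
    rm = learn_reward_machine(traces, dfas)           # placeholder: SAT-based learner (Section 4.1)
    q = initialize_q_values(rm)                       # placeholder: QRM q-value tables

    for _ in range(episodes):
        labels, rewards = run_episode_with_qrm(env, rm, q, episode_length)  # placeholder: QRM episode
        if rm.rewards(labels) != rewards:             # counterexample: hypothesis disagrees with environment
            traces.append((labels, rewards))
            for dfa in [d for d in dfas if rewards[-1] > 0 and not d.accepts(labels)]:
                dfas.remove(dfa)                      # drop DFAs contradicted by the counterexample
                if llm_budget > 0:                    # re-prompt the LLM while budget remains
                    llm_budget -= 1
                    prompt = refine_prompt(prompt, labels)   # placeholder: Section 4.2
                    dfas.append(generate_dfa_from_prompt(llm, prompt))
            rm = learn_reward_machine(traces, dfas)   # reinitialize reward machine and q-values
            q = initialize_q_values(rm)
    return q
```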
4.1 Using LLM-generated DFA to Learn the Reward Machine
LARL-RM uses the counterexamples to create a minimal reward machine while being guided by the LLM-generated DFA. Our method uses an approach similar to that of Neider et al. (2021) in that it maintains a finite sample set of traces and a finite set of LLM-generated DFAs. We assume that each LLM-generated DFA is compatible with the sample, i.e., every label sequence in the sample whose final reward is positive is accepted by the DFA; if this criterion is not fulfilled, the DFA is ignored by LARL-RM.
The reward machine learning, which relies on the LLM-generated DFA, generates a minimal reward machine that is consistent with the traces and compatible with the DFAs. The requirement that the reward machine be minimal is important for the convergence of LARL-RM to the optimal policy. LARL-RM uses SAT-based automata learning, originally developed to verify parametric systems, since the learning task can be broken down into a series of satisfiability checks of propositional logic formulas. Essentially, we construct and solve a sequence of propositional formulas of increasing size Neider et al. (2021); Neider (2014).
We construct the formulas over a set of propositional variables using the Boolean connectives $\wedge$, $\vee$, $\neg$, and $\rightarrow$; an interpretation is a mapping from propositional variables to Boolean values, and an interpretation that makes a formula true is said to satisfy it. Using this approach we can obtain a minimal reward machine, similar to Neider et al. (2021). Using the LLM, we are capable of constructing an LLM-generated DFA in a format that is compatible with SAT-based automata learning methods. LARL-RM is also capable of adjusting the initial prompt so that, in case the generated DFA is not compatible with the ground truth reward machine, the prompt can be updated. The updated prompt uses the counterexample label sequence to call the LLM and obtain an updated DFA that is more likely to be compatible with the ground truth reward machine.
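A schematic of the SAT-driven search for a minimal reward machine using python-sat; the helpers encode (which would emit the clauses of the constraints described in Appendix B) and decode (which would read the transition and output functions off a satisfying assignment) are hypothetical placeholders, and only the outer loop and the solver calls are concrete.

```python
from pysat.formula import IDPool
from pysat.solvers import Glucose3

def learn_minimal_reward_machine(traces, dfas, max_states=20):
    """Increase the number of RM states n until the SAT encoding becomes satisfiable."""
    for n in range(1, max_states + 1):
        pool = IDPool()                               # maps variables like d_{v,l,v'} to solver integers
        clauses = encode(n, traces, dfas, pool)       # hypothetical: builds the constraints for n states
        with Glucose3(bootstrap_with=clauses) as solver:
            if solver.solve():
                model = solver.get_model()            # satisfying assignment
                return decode(n, model, pool)         # hypothetical: read off the transition/output functions
    raise RuntimeError("no consistent and compatible reward machine with at most max_states states")
```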
4.2 Refinement of Prompt and DFA
We use the prompt to generate the DFA for a specific domain. A transition of the DFA exists for a proposition only if the prompt induces it; hence, if the LLM-generated DFA produced by the prompt is incorrect, it cannot have the correct transitions, which leads to a counterexample trajectory. Therefore, we use the counterexample to update the prompt in order to generate an updated DFA. The updated DFA then has a chance of becoming compatible with the counterexample. This process continues, and each time the algorithm encounters a counterexample it uses it to update the prompt again.
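A small sketch of the prompt-refinement step; the wording of the appended feedback is an assumption, while the mechanism (append the counterexample label sequence and request a corrected DFA) is the one described above.

```python
def refine_prompt(prompt, counterexample_labels):
    """Augment the prompt with a counterexample so the LLM can repair its DFA."""
    feedback = (
        "The DFA you produced is inconsistent with the observed behavior: "
        f"the label sequence {counterexample_labels} received a positive reward "
        "but is not accepted by your DFA. Update the DFA so that this sequence "
        "is accepted, and output it in the same JSON format."
    )
    return prompt + "\n" + feedback
```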
5 Convergence to Optimal Policy
The learned reward machine will ultimately be equivalent to the ground truth reward machine under the condition that every label sequence is admissible by the underlying MDP, i.e., physically allowed. If this assumption does not hold, it is still ensured that the learned reward machine and the ground truth reward machine agree on every admissible label sequence.
Because we use an LLM to generate the DFA and LLMs are known to produce outputs that are not factual, we cannot make any assumptions about the quality of the LLM's output. For the purpose of proving guarantees of LARL-RM, we therefore treat the LLM as adversarial, that is, we consider the worst case.
Lemma 1.
Let $\mathcal{M}$ be a labeled MDP, $\mathcal{R}$ the ground truth reward machine encoding the rewards of $\mathcal{M}$, and $\mathcal{D}$ the set of all LLM-generated DFAs that are added during the run of LARL-RM. Then, with a sufficiently large episode length (depending on the sizes of $\mathcal{M}$, $\mathcal{R}$, and the DFAs in $\mathcal{D}$), LARL-RM almost surely learns a reward machine that is equivalent to $\mathcal{R}$.
Lemma 1 provides us with an upper bound on the length of episodes that needs to be explored. Additionally, the correctness of the LARL-RM algorithm follows from Lemma 1 and the correctness of the QRM algorithm Icarte et al. (2018). Using the QRM guarantee, we can now show convergence to the optimal policy.
Theorem 1.
Let the labeled MDP, the ground truth reward machine, the set of LLM-generated DFAs, and the episode length be as in Lemma 1. Then, LARL-RM will almost surely converge to an optimal policy.
6 Case Studies
We implement LARL-RM using the GPT series of LLMs; specifically, we focus on the GPT-3.5-Turbo model. We set up a chat with the model using the APIs provided by OpenAI. For our example, we consider LARL-RM applied to an autonomous car example similar to the one found in Xu et al. (2020a).
In our example, the autonomous car must navigate to reach a destination while avoiding pedestrians and obeying traffic laws. It is worth mentioning that our algorithm is capable of running even without any LLM-generated DFA, since in that case it can learn the ground truth reward machine unaided. We consider two case studies for our motivating example to demonstrate the capability of LARL-RM to expedite the RL. In both cases, we demonstrate the effect of an incompatible LLM-generated DFA (one that is not compatible with the ground truth reward machine) on the algorithm's convergence to the optimal policy.
6.1 Case Study 1
The agent in our traffic example has a fixed set of actions for moving through the environment. The layout of the environment is shown in Figure 1.
We use the LLM-generated DFA to guide the RL process. If the advice from the LLM is not compatible with the ground truth reward machine, it is ignored by the algorithm; the prompt is then updated and fed back to the LLM in order to obtain a new DFA that has a chance of becoming compatible with the counterexample. We show the ground truth reward machine for Case Study 1 in Figure 9.
The LLM-generated DFA produced by the prompt for this case study is compatible with the ground truth reward machine (Figure 9) and helps the algorithm converge to the optimal policy faster. The DFA generated using GPT is shown in Figure 10.
We use the DFA in Figure 10 to guide the learning process. The reward obtained by the LARL-RM algorithm shows that it reaches the optimal policy faster when the LLM-generated DFA is available than when it is not.

We average the results of independent runs using a rolling mean with a fixed window size. Figure 11 demonstrates that the algorithm converges to the optimal policy using the LLM-generated DFA. In this example, the prompt is once compatible and once incompatible with the ground truth reward machine; even when it is not compatible, LARL-RM can use the counterexample to update the prompt and generate a new DFA that is more likely to be compatible with the counterexample.
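For reference, the smoothing applied to the reward curves is a standard rolling mean over the per-step average of independent runs; a short sketch with pandas, where the window size stands in for the value used in the experiments.

```python
import numpy as np
import pandas as pd

def smooth_rewards(reward_runs, window):
    """Average rewards over independent runs, then apply a rolling mean."""
    mean_per_step = np.mean(np.asarray(reward_runs), axis=0)   # average over runs at each step
    return pd.Series(mean_per_step).rolling(window, min_periods=1).mean()
```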
6.2 Case Study 2
In Case Study 2 we show that if the suggested DFA is not compatible with the ground truth reward machine, then LARL-RM uses the counterexample to adjust the prompt and obtain a new DFA that is compatible with the counterexample. In this case study we modify the MDP environment so that multiple reward machines, as well as multiple DFAs, can be considered equivalent. This way, the output of the LLM may not necessarily match the ground truth reward machine; however, LARL-RM can still use this DFA as long as it is not incompatible with the ground truth reward machine but is equivalent to it.
In this environment, we extend the autonomous car example and provide it with more options to reach the destination. The ground truth reward machine does not prescribe a specific route but rather emphasizes the target destination; however, the LLM-generated DFA might favor a certain route over another, which is not incompatible with the ground truth reward machine but rather equivalent to it. We show the ground truth reward machine for Case Study 2 in Figure 12.
The LLM-generated DFA for this case study specifies a route which leads to the target, and it is equivalent to the ground truth reward machine, since this solution DFA can be considered a subset of the larger DFA which is compatible with the ground truth reward machine. Figure 13 shows the LLM-generated DFA for Case Study 2.
We run LARL-RM for Case Study 2 using a prompt which generates a DFA that is incompatible with the ground truth reward machine; LARL-RM takes the resulting counterexample and generates an updated DFA which is compatible with the ground truth reward machine. Figure 14 demonstrates the reward of LARL-RM using this DFA.

Figure 14 shows the reward obtained by LARL-RM for Case Study 2 using the prompt and the DFA it generates. The reward is averaged over independent runs with a rolling mean. For both the compatible and the incompatible LLM-generated DFA, LARL-RM converges to the optimal policy faster than when there is no LLM-generated DFA.
7 Conclusion
We proposed a novel algorithm, LARL-RM, that uses a prompt to obtain an LLM-generated DFA to expedite reinforcement learning. LARL-RM uses counterexamples to automatically generate a new prompt and consequently, a new DFA that is compatible with the counterexamples to close the loop in RL. We showed that LARL-RM is guaranteed to converge to an optimal policy using the LLM-generated DFA. We showed that RL can be expedited using the LLM-generated DFA and in case the output of the LLM is not compatible with the ground truth reward machine, LARL-RM is capable of adjusting the prompt to obtain a more accurate DFA. In future work, we plan to extend the proposed framework to RL for multi-agent systems.
Acknowledgments
This research is partially supported by the National Science Foundation under grant NSF CNS 2304863 and the Office of Naval Research under grant ONR N00014-23-1-2505.
References
- Brown et al. [2020] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- Ding et al. [2023] Ning Ding, Yujia Qin, Guang Yang, Fuchao Wei, Zonghan Yang, Yusheng Su, Shengding Hu, Yulin Chen, Chi-Min Chan, Weize Chen, et al. Parameter-efficient fine-tuning of large-scale pre-trained language models. Nature Machine Intelligence, 5(3):220–235, 2023.
- Hendrycks et al. [2020] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
- Icarte et al. [2018] Rodrigo Toro Icarte, Toryn Klassen, Richard Valenzano, and Sheila McIlraith. Using reward machines for high-level task specification and decomposition in reinforcement learning. In International Conference on Machine Learning, pages 2107–2116. PMLR, 2018.
- Jang et al. [2023] Joel Jang, Seonghyeon Ye, and Minjoon Seo. Can large language models truly understand prompts? a case study with negated prompts. In Transfer Learning for Natural Language Processing Workshop, pages 52–62. PMLR, 2023.
- Kojima et al. [2022] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213, 2022.
- Liu et al. [2023] Yiheng Liu, Tianle Han, Siyuan Ma, Jiayue Zhang, Yuanyuan Yang, Jiaming Tian, Hao He, Antong Li, Mengshen He, Zhengliang Liu, et al. Summary of chatgpt-related research and perspective towards the future of large language models. Meta-Radiology, page 100017, 2023.
- Neider et al. [2021] Daniel Neider, Jean-Raphael Gaglione, Ivan Gavran, Ufuk Topcu, Bo Wu, and Zhe Xu. Advice-guided reinforcement learning in a non-markovian environment. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 9073–9080, 2021.
- Neider [2014] Daniel Neider. Applications of automata learning in verification and synthesis. PhD thesis, Aachen, Techn. Hochsch., Diss., 2014, 2014.
- Rafailov et al. [2023] Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290, 2023.
- Seff et al. [2023] Ari Seff, Brian Cera, Dian Chen, Mason Ng, Aurick Zhou, Nigamaa Nayakanti, Khaled S Refaat, Rami Al-Rfou, and Benjamin Sapp. Motionlm: Multi-agent motion forecasting as language modeling. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8579–8590, 2023.
- Shallit [2008] Jeffrey O. Shallit. A Second Course in Formal Languages and Automata Theory. Cambridge University Press, 2008.
- Vemprala et al. [2023] Sai Vemprala, Rogerio Bonatti, Arthur Bucker, and Ashish Kapoor. Chatgpt for robotics: Design principles and model abilities. Microsoft Auton. Syst. Robot. Res, 2:20, 2023.
- Wei et al. [2022] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
- West et al. [2021] Peter West, Chandra Bhagavatula, Jack Hessel, Jena D Hwang, Liwei Jiang, Ronan Le Bras, Ximing Lu, Sean Welleck, and Yejin Choi. Symbolic knowledge distillation: from general language models to commonsense models. arXiv preprint arXiv:2110.07178, 2021.
- Xu et al. [2020a] Zhe Xu, Ivan Gavran, Yousef Ahmad, Rupak Majumdar, Daniel Neider, Ufuk Topcu, and Bo Wu. Joint inference of reward machines and policies for reinforcement learning. In Proceedings of the International Conference on Automated Planning and Scheduling, volume 30, pages 590–598, 2020.
- Xu et al. [2020b] Zhe Xu, Ivan Gavran, Yousef Ahmad, Rupak Majumdar, Daniel Neider, Ufuk Topcu, and Bo Wu. Joint inference of reward machines and policies for reinforcement learning. In ICAPS, pages 590–598. AAAI Press, 2020.
- Yang et al. [2022] Yunhao Yang, Jean-Raphaël Gaglione, Cyrus Neary, and Ufuk Topcu. Automaton-based representations of task knowledge from generative language models. arXiv preprint arXiv:2212.01944, 2022.
- Yang et al. [2023] Yunhao Yang, Neel P Bhatt, Tyler Ingebrand, William Ward, Steven Carr, Zhangyang Wang, and Ufuk Topcu. Fine-tuning language models using formal methods feedback. arXiv preprint arXiv:2310.18239, 2023.
Supplementary Materials
Appendix A Comparison of Closed-loop and Open-loop LARL-RM
Our proposed algorithm can be modified to perform in an open-loop or closed-loop manner with respect to the LLM-generated DFA: if a counterexample is found, LARL-RM can either keep the original LLM-generated DFA or update it. In the first option, it continues learning the reward machine without updating the LLM-generated DFA, which we refer to as open-loop. In the second option, it calls the LLM again and generates a new DFA based on the counterexample, which we refer to as closed-loop. We show our proposed algorithm's convergence to the optimal policy under both options. We demonstrate the open-loop configuration in Figure 15.

A.1 Case Study 1
We investigate the open-loop and closed-loop methods in Case Study 1. First, we compare the effect of an incompatible LLM-generated DFA with that of a compatible LLM-generated DFA in the open-loop setting. Figure 16 illustrates the convergence to the optimal policy for the open-loop configuration.

Figure 16 demonstrates that in the open-loop configuration of LARL-RM, if the LLM-generated DFA is compatible, the algorithm converges to the optimal policy faster than when there is no LLM-generated DFA. However, if the LLM-generated DFA is incompatible, it might take longer to converge to the optimal policy. We now consider the closed-loop configuration, in which an incompatible DFA is used to update the prompt. Figure 17 demonstrates the convergence of LARL-RM to the optimal policy when the closed-loop configuration is applied.

Figure 17 demonstrates that the closed-loop configuration is capable of updating the prompt based on the counterexample such that LARL-RM has a better chance of converging faster than when there is no LLM-generated DFA.
A.2 Case Study 2
We further investigate the effect of the open-loop and closed-loop configurations. The same open-loop configuration as illustrated in Figure 15 is applied here. As shown in Figure 18, the open-loop configuration converges to the optimal policy faster than the case with no LLM-generated DFA, provided the LLM-generated DFA is compatible with the ground truth reward machine.

If we apply the closed-loop configuration to Case Study 2, then, as expected, it converges to the optimal policy faster than when there is no LLM-generated DFA, even if the initial LLM-generated DFA is incompatible with the ground truth reward machine. Figure 19 demonstrates the closed-loop configuration.

As illustrated in Figure 19, if the closed-loop configuration is applied, then LARL-RM has a better chance of converging to the optimal policy faster than when there is no LLM-generated DFA.
Appendix B Theoretical Guarantee of the LARL-RM
Here we demonstrate the convergence of the proposed algorithm to the optimal policy.
Proof of Lemma 1.
We first show that the set of LLM-generated DFAs remains stationary after a finite amount of time (i.e., that no new LLM-generated DFA is added or removed), assuming that all trajectories are visited infinitely often. To this end, we observe the following: if an LLM-generated DFA added to the set is compatible, then it will remain in the set indefinitely. The reason is that there is no counterexample contradicting the LLM-generated DFA, meaning that the check in Line 12 is never triggered. On the other hand, if an LLM-generated DFA is not compatible, then the algorithm eventually detects a trajectory witnessing this fact. Once this happens, it removes the corresponding LLM-generated DFA from the set.
Observe now that the algorithm decrements the LLM query budget by one every time it adds an LLM-generated DFA to the set, and it only does so as long as the budget is positive. Thus, the total number of LLM-generated DFAs generated during the run of the algorithm is bounded by the budget. Consequently, the set no longer changes after a finite period, because the algorithm has either identified the true reward machine or all incompatible LLM-generated DFAs have been removed and the budget is exhausted.
Once the set becomes stationary, an argument analogous to that in the proof of Lemma 1 in Neider et al.'s work Neider et al. [2021] shows that the algorithm will eventually learn the true reward machine. Intuitively, as long as the current hypothesis is not equivalent to the true reward machine, there exists a "short" trajectory witnessing this fact. The episode length is chosen carefully such that such a witness will eventually be encountered and added to the sample. In the worst case, all trajectories will eventually be added to the sample, at which point the learning algorithm is guaranteed to learn the true reward machine. ∎
Since our proposed algorithm learns a minimal reward machine that is consistent with the sample $X$ and compatible with the LLM-generated DFAs, we use an automata learning method based on SAT solving, originally developed to verify parametric systems Neider [2014]. The underlying idea is to reduce the learning task to a series of satisfiability checks of propositional logic formulas. In other words, we build and solve a sequence of propositional formulas $\Phi_n$ for increasing $n$ such that the following properties hold Neider et al. [2021]:
- $\Phi_n$ is satisfiable if and only if there exists a reward machine with $n$ states that is consistent with the sample $X$ and compatible with every LLM-generated DFA; and
- if $\Phi_n$ is satisfiable, then a satisfying assignment contains enough information to construct a reward machine with $n$ states that is both consistent and compatible.
We can now build and solve $\Phi_n$ starting from $n = 1$ and increasing $n$ until the formula becomes satisfiable. Using this method, our proposed algorithm constructs a minimal reward machine that is consistent with $X$ and compatible with the LLM-generated DFAs.
To ease the notation in the remainder, we write $v \xrightarrow{\ell_1 \cdots \ell_k} v'$ to abbreviate a run of the reward machine on the input sequence $\ell_1 \cdots \ell_k$ that starts in $v$ and leads to $v'$. By definition, we have $v \xrightarrow{\varepsilon} v$ for the empty sequence $\varepsilon$ and every state $v$. We later also use this notation for DFAs.
To encode a reward machine with a fixed set of states $V$ and a fixed initial state $v_I$ in propositional logic, note that such a reward machine is uniquely determined by its transition function $\delta$ and its output function $\sigma$. To encode these two functions, we introduce two families of propositional variables: $d_{v, \ell, v'}$ for $v, v' \in V$ and $\ell \in 2^{\mathcal{P}}$, which is set to true if $\delta(v, \ell) = v'$, and $o_{v, \ell, r}$ for $v \in V$, $\ell \in 2^{\mathcal{P}}$, and each reward $r$ occurring in the sample, which is set to true if $\sigma(v, \ell) = r$. We apply the constraints in (1) and (2) to ensure that the variables $d_{v, \ell, v'}$ and $o_{v, \ell, r}$ encode deterministic functions.
$\bigwedge_{v \in V} \bigwedge_{\ell \in 2^{\mathcal{P}}} \Big[ \bigvee_{v' \in V} d_{v, \ell, v'} \Big] \wedge \Big[ \bigwedge_{v' \neq v''} \big( \neg d_{v, \ell, v'} \vee \neg d_{v, \ell, v''} \big) \Big] \quad (1)$

$\bigwedge_{v \in V} \bigwedge_{\ell \in 2^{\mathcal{P}}} \Big[ \bigvee_{r} o_{v, \ell, r} \Big] \wedge \Big[ \bigwedge_{r \neq r'} \big( \neg o_{v, \ell, r} \vee \neg o_{v, \ell, r'} \big) \Big] \quad (2)$
Constraints (1) and (2) ensure that the variables $d_{v, \ell, v'}$ and $o_{v, \ell, r}$ indeed encode deterministic transition and output functions. To enforce consistency with the sample $X$ in propositional logic, we introduce auxiliary variables $x_{\lambda, v}$ for every $v \in V$ and every prefix $\lambda$ of a label sequence in $X$; for a trace $(\ell_1 \cdots \ell_k, r_1 \cdots r_k)$, its prefixes are $\ell_1 \cdots \ell_i$ for $0 \le i \le k$ (the empty prefix $\varepsilon$ is always included), and we write $\mathit{Pref}(X)$ for the set of all prefixes of label sequences in $X$. The variable $x_{\lambda, v}$ is set to true if and only if the prospective reward machine reaches state $v$ after reading $\lambda$. To obtain the desired meaning, we add the following constraints:
$x_{\varepsilon, v_I} \quad (3)$

$\bigwedge_{\lambda \ell \in \mathit{Pref}(X)} \bigwedge_{v, v' \in V} \big( x_{\lambda, v} \wedge d_{v, \ell, v'} \big) \rightarrow x_{\lambda \ell, v'} \quad (4)$

$\bigwedge_{(\ell_1 \cdots \ell_k, r_1 \cdots r_k) \in X} \bigwedge_{i \in \{1, \ldots, k\}} \bigwedge_{v \in V} x_{\ell_1 \cdots \ell_{i-1}, v} \rightarrow o_{v, \ell_i, r_i} \quad (5)$
We denote the conjunction of constraints (3)-(5) by $\Phi_n^X$, which we later use to show the consistency of the learned reward machine with $X$. In order to ensure that the prospective reward machine and an LLM-generated DFA $\mathcal{A} = (Q, q_I, 2^{\mathcal{P}}, \delta_{\mathcal{A}}, F)$ are synchronized, we add auxiliary variables $y_{v, q}$ for $v \in V$ and $q \in Q$; the variable $y_{v, q}$ is set to true if there is a label sequence on which the prospective reward machine reaches $v$ and the DFA reaches $q$. To obtain this behavior, we add the following constraints:
$y_{v_I, q_I} \quad (6)$

$\bigwedge_{v, v' \in V} \bigwedge_{q \in Q} \bigwedge_{\ell \in 2^{\mathcal{P}}} \big( y_{v, q} \wedge d_{v, \ell, v'} \big) \rightarrow y_{v', \delta_{\mathcal{A}}(q, \ell)} \quad (7)$

$\bigwedge_{v, v' \in V} \bigwedge_{q \in Q} \bigwedge_{\ell \in 2^{\mathcal{P}} \colon \delta_{\mathcal{A}}(q, \ell) \notin F} \bigwedge_{r > 0} \big( y_{v, q} \wedge d_{v, \ell, v'} \big) \rightarrow \neg o_{v, \ell, r} \quad (8)$
We denote the conjunction of constraints (6)-(8) by $\Phi_n^{\mathcal{A}}$, which we later use to show the compatibility of the learned reward machine with the LLM-generated DFA $\mathcal{A}$. The overall formula $\Phi_n$ is the conjunction of constraints (1) and (2), the consistency constraints $\Phi_n^X$, and one copy of $\Phi_n^{\mathcal{A}}$ for each LLM-generated DFA $\mathcal{A}$. Theorem 2 states the correctness of this encoding.
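To make the encoding concrete, the sketch below turns constraint (1), which requires every state and label to have exactly one successor, into CNF clauses with python-sat; the variable bookkeeping via IDPool, the label set, and the number of states are illustrative choices.

```python
from itertools import combinations
from pysat.formula import IDPool

def transition_function_clauses(n_states, labels, pool):
    """CNF clauses forcing the variables d_{v,l,v'} to encode a deterministic transition function."""
    clauses = []
    for v in range(n_states):
        for label in labels:
            targets = [pool.id(("d", v, label, v2)) for v2 in range(n_states)]
            clauses.append(targets)                   # at least one successor state
            for a, b in combinations(targets, 2):
                clauses.append([-a, -b])              # at most one successor state
    return clauses

pool = IDPool()
clauses = transition_function_clauses(3, ["g", "p"], pool)  # 3 states, 2 illustrative labels
```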
Theorem 2.
Let $X$ be a sample and let $\mathcal{A}_1, \ldots, \mathcal{A}_m$ be a finite set of LLM-generated DFAs that are compatible with $X$. Then, the following holds:
1. If an interpretation satisfies $\Phi_n$, then the reward machine constructed from it is consistent with $X$ and compatible with each LLM-generated DFA $\mathcal{A}_i$.
2. If there exists a reward machine with $n$ states that is consistent with $X$ as well as compatible with each $\mathcal{A}_i$, then $\Phi_n$ is satisfiable.
Appendix C Learning Reward Machines
In this section, we prove the correctness of our SAT-based algorithm for learning reward machines. To this end, we show that the propositional formulas $\Phi_n^X$ and $\Phi_n^{\mathcal{A}}$ have the desired meaning. We begin with the formula $\Phi_n^X$, which is designed to enforce that the learned reward machine is consistent with the sample $X$.
Lemma 2.
Let the reward machine be constructed from a satisfying assignment of $\Phi_n$ as described above. Then, it is consistent with $X$ (i.e., it produces the reward sequence $r_1 \cdots r_k$ on every label sequence $\ell_1 \cdots \ell_k$ with $(\ell_1 \cdots \ell_k, r_1 \cdots r_k) \in X$).
Proof.
Let the reward machine be constructed from a satisfying assignment of $\Phi_n$ as above. To prove Lemma 2, we show the following, more general statement by induction over the length of prefixes: if $\lambda = \ell_1 \cdots \ell_i$ is a prefix of a trace $(\ell_1 \cdots \ell_k, r_1 \cdots r_k) \in X$, then
1. the state reached by the constructed reward machine after reading $\lambda$ is the unique state $v$ for which $x_{\lambda, v}$ holds; and
2. the rewards produced by the reward machine while reading $\lambda$ are $r_1 \cdots r_i$.
- Base case: Let $\lambda = \varepsilon$. By definition of runs, the trivial run is the only run of the reward machine on the empty word. Similarly, Formula (3) guarantees that the initial state is the unique state for which $x_{\varepsilon, v}$ holds. Both observations immediately prove Part 1. Moreover, Part 2 holds by definition of the semantics of reward machines.
- Induction step: Let $\lambda = \ell_1 \cdots \ell_{i+1}$, and consider the unique run of the reward machine on $\lambda$. By applying the induction hypothesis to the prefix $\ell_1 \cdots \ell_i$ together with Formulas (4) and (5), we obtain that both Part 1 and Part 2 hold for $\lambda$.
Thus, the constructed reward machine is consistent with $X$. ∎
Next, we show that the formula $\Phi_n^{\mathcal{A}}$ ensures that the learned reward machine is compatible with the LLM-generated DFA $\mathcal{A}$.
Lemma 3.
Let the reward machine be constructed from a satisfying assignment of $\Phi_n$ as described above. Then, it is compatible with the LLM-generated DFA $\mathcal{A}$ (i.e., for every nonempty label sequence, a positive final reward produced by the reward machine implies that the label sequence is in $L(\mathcal{A})$).
Proof.
Let the reward machine be constructed from a satisfying assignment of $\Phi_n$ as described above. We first show that if the reward machine reaches state $v$ and the DFA $\mathcal{A}$ reaches state $q$ on the same label sequence, then $y_{v, q}$ holds. The proof of this claim proceeds by induction over the length of label sequences, similar to Part 2 in the proof of Lemma 2.
- Base case: For the empty label sequence, the only runs of the reward machine and the DFA are the trivial ones ending in $v_I$ and $q_I$, respectively. Moreover, Formula (6) ensures that $y_{v_I, q_I}$ holds, which proves the claim.
- Induction step: Consider a label sequence $\ell_1 \cdots \ell_{i+1}$, and let the runs of the reward machine and the DFA on it end in states $v'$ and $q'$, reached from $v$ and $q$ after reading $\ell_1 \cdots \ell_i$. By the induction hypothesis, we know that $y_{v, q}$ holds. Moreover, the satisfying assignment sets $d_{v, \ell_{i+1}, v'}$ to true because this transition was used in the last step of the run of the reward machine. In this situation, Formula (7) ensures $y_{v', q'}$ with $q' = \delta_{\mathcal{A}}(q, \ell_{i+1})$ (since also $y_{v, q}$ holds), which proves the claim.
Let now be a nonempty label sequence (i.e., ). Moreover, let be the run of on and the run of on . Our induction shows that holds in this case.
Towards a contradiction, assume that , , and . In particular, this means . Since (which was used in the last step in the run of on ) and (due to the induction above), Formula (8) ensures that for all with . However, Formula (2) ensures that there is exactly one with . Thus, there has to exist an such that and . By construction of , this means that the last output of on reading must have been . However, our assumption was , which is a contradiction. Thus, is compatible with the LLM-generated DFA . ∎
We can now prove Theorem 2 (i.e., the correctness of our SAT-based learning algorithm for reward machines).
Proof of Theorem 2.
Part 1 follows from Lemmas 2 and 3. To prove Part 2, let there be a reward machine with $n$ states that is consistent with $X$ and compatible with each LLM-generated DFA. From this reward machine, we can derive a valuation for the variables $d_{v, \ell, v'}$ and $o_{v, \ell, r}$ in a straightforward way (e.g., setting $d_{v, \ell, v'}$ to true if and only if $\delta(v, \ell) = v'$). Moreover, we obtain a valuation for the variables $x_{\lambda, v}$ from the runs of the reward machine on the prefixes of traces in the sample $X$, and valuations for the variables $y_{v, q}$ from the synchronized runs of the reward machine and each LLM-generated DFA. Then, this valuation indeed satisfies $\Phi_n$. ∎
Appendix D Convergence to an Optimal Policy
In this section, we prove that LARM-RM almost surely converges to an optimal policy in the limit. We begin by defining attainable trajectories—trajectories that can possibly appear in the exploration of an agent.
Definition 5.
Let $\mathcal{M}$ be a labeled MDP and $m$ a natural number. A trajectory $s_0 a_0 s_1 \cdots a_{k-1} s_k$ is said to be $m$-attainable if $k \le m$ and $p(s_i, a_i, s_{i+1}) > 0$ for each $0 \le i < k$. Moreover, a trajectory is called attainable if there exists an $m \in \mathbb{N}$ such that it is $m$-attainable. Analogously, we call a label sequence $\ell_1 \cdots \ell_k$ ($m$-)attainable if there exists an ($m$-)attainable trajectory such that $\ell_{i+1} = L(s_i, a_i, s_{i+1})$ for each $0 \le i < k$.
An induction shows that LARM-RM almost surely explores every attainable trajectory in the limit (i.e., with probability 1 when the number of episodes goes to infinity). We refer the reader to the extended version of Xu et al. [2020b] (available on arXiv at https://arxiv.org/abs/1909.05912) for a detailed proof.
Lemma 4.
Given $m \in \mathbb{N}$, LARM-RM with episode length at least $m$ almost surely explores every $m$-attainable trajectory in the limit.
As an immediate consequence of Lemma 4, we obtain that LARM-RM almost surely explores every ($m$-)attainable label sequence in the limit as well.
Corollary 1.
Given $m \in \mathbb{N}$, LARM-RM with episode length at least $m$ almost surely explores every $m$-attainable label sequence in the limit.
In this paper, we assume that all possible label sequences are attainable. We relax this restriction later, as in Neider et al. [2021].
To prove Lemma 1, we show a series of intermediate results. We begin by reminding the reader of a well-known fact from automata theory, which can be found in most textbooks Shallit [2008].
Theorem 3.
Let $\mathcal{A}_1$ and $\mathcal{A}_2$ be two DFAs with $L(\mathcal{A}_1) \neq L(\mathcal{A}_2)$. Then, there exists a word $w$, whose length is bounded by the number of states of $\mathcal{A}_1$ and $\mathcal{A}_2$, such that $w \in L(\mathcal{A}_1)$ if and only if $w \notin L(\mathcal{A}_2)$.
We now observe that every reward machine can be translated into a DFA that is "equivalent" to the reward machine Xu et al. [2020b]. This DFA operates over the combined alphabet $2^{\mathcal{P}} \times \mathbb{R}$ and accepts a sequence $(\ell_1, r_1) \cdots (\ell_k, r_k)$ if and only if the reward machine outputs the reward sequence $r_1 \cdots r_k$ on reading the label sequence $\ell_1 \cdots \ell_k$.
Lemma 5 (Xu et al. [2020b]).
Given a reward machine $\mathcal{R}$ with state set $V$, one can construct a DFA $\mathcal{A}_{\mathcal{R}}$ with $|V| + 1$ states such that $L(\mathcal{A}_{\mathcal{R}}) = \{(\ell_1, r_1) \cdots (\ell_k, r_k) \mid \mathcal{R}(\ell_1 \cdots \ell_k) = r_1 \cdots r_k\}$.
Proof.
Let $\mathcal{R} = (V, v_I, 2^{\mathcal{P}}, \mathbb{R}, \delta, \sigma)$ be a reward machine. Then, we define a DFA $\mathcal{A}_{\mathcal{R}} = (Q', q_I', \Sigma', \delta', F')$ by
- $Q' = V \cup \{\bot\}$, where $\bot$ is a new sink state;
- $q_I' = v_I$;
- $\Sigma' = 2^{\mathcal{P}} \times \mathbb{R}$;
- $\delta'(v, (\ell, r)) = \delta(v, \ell)$ if $\sigma(v, \ell) = r$ and $\delta'(v, (\ell, r)) = \bot$ otherwise, with $\delta'(\bot, (\ell, r)) = \bot$; and
- $F' = V$.
In this definition, $\bot$ is a new sink state to which $\mathcal{A}_{\mathcal{R}}$ moves if its input does not correspond to a valid input-output pair produced by $\mathcal{R}$. A straightforward induction over the length of inputs to $\mathcal{A}_{\mathcal{R}}$ shows that it indeed accepts the desired language. In total, $\mathcal{A}_{\mathcal{R}}$ has $|V| + 1$ states. ∎
Next, we show that if two reward machines are semantically different, then we can bound the length of a trace witnessing this fact.
Lemma 6.
Let $\mathcal{R}_1$ and $\mathcal{R}_2$ be two reward machines. If $\mathcal{R}_1$ and $\mathcal{R}_2$ are not equivalent, then there exists a label sequence $\ell_1 \cdots \ell_k$, whose length is bounded by the sizes of $\mathcal{R}_1$ and $\mathcal{R}_2$, such that $\mathcal{R}_1(\ell_1 \cdots \ell_k) \neq \mathcal{R}_2(\ell_1 \cdots \ell_k)$.
Proof of Lemma 6. By Lemma 5, we can translate $\mathcal{R}_1$ and $\mathcal{R}_2$ into DFAs accepting exactly their valid input-output sequences. Since $\mathcal{R}_1$ and $\mathcal{R}_2$ are not equivalent, these DFAs accept different languages, and Theorem 3 yields a short word distinguishing them; the label components of this word form the desired label sequence. ∎
Similar to Lemma 6, we can show that if an LLM-generated DFA is not compatible with a reward machine, we can also bound the length of a label sequence witnessing this fact.
Lemma 7.
Let $\mathcal{R}$ be a reward machine and $\mathcal{A}$ an LLM-generated DFA. If $\mathcal{A}$ is not compatible with $\mathcal{R}$, then there exists a label sequence $\ell_1 \cdots \ell_k$ of bounded length with $\mathcal{R}(\ell_1 \cdots \ell_k) = r_1 \cdots r_k$ such that $r_k > 0$ and $\ell_1 \cdots \ell_k \notin L(\mathcal{A})$.
Proof.
Let be a reward machine and the finite set of rewards that can output. Moreover, let be an LLM-generated DFA. Our proof proceeds by constructing four DFAs and a subsequent analysis of the final DFA to derive the desired bound.
First, we construct the DFA according to Lemma 5. Recall that if and only if . Moreover, has states.
Second, we modify the DFA such that it only accepts sequences with . To this end, we augment the state space with an additional bit , which tracks whether the most recent reward was or greater than . More formally, we define a DFA by
-
•
;
-
•
;
-
•
where if and only if ; and
-
•
.
It is not hard to verify that indeed has the desired property. Moreover, by Lemma 5, has states.
Third, we apply “cylindrification” to the LLM-generated DFA , which works over the alphabet , to match the input alphabet of . Our goal is to construct a new DFA that accepts a sequence if and only if for every reward sequence (cylindrification can be thought of as the inverse of the classical projection operation). We achieve this by replacing each transition in with many transitions of the form where . Formally, we define the DFA by
-
•
;
-
•
;
-
•
for each ; and
-
•
.
It is not hard to verify that has indeed the desired property and its size is .
Fourth, we construct the simple product DFA of and . This DFA is given by where
-
•
;
-
•
;
-
•
; and
-
•
.
By construction of and , it is not hard to see that accepts a sequence if and only if with and —in other words, contains all sequences that witness that is not compatible with . Moreover, has states.
It is left to show that if is not compatible with , then we can find a witness with the desired length. To this end, it is sufficient to show that if , then there exists a sequence with . This fact can be established using a simple pumping argument. To this end, assume that with . Then, there exists a state such that the unique accepting run of on visits twice, say at the positions with . In this situation, however, the DFA also accepts the sequence , where we have removed the “loop” between the repeating visits of . Since this new sequence is shorter than the original sequence, we can repeatedly apply this argument until we arrive at a sequence with . By construction of , this means that , , and , which proves the claim. ∎
With these intermediate results at hand, we are now ready to prove Lemma 1.
Proof of Lemma 1.
Let be the sequence of samples and sets of LLM-generated DFAs that arise in the run of LARM-RM whenever a new counterexample is added to (in Lines 9 and 10 of Algorithm 2) or an LLM-generated DFA is removed from the set (Lines 12 and 13 of Algorithm 2). Moreover, let be the corresponding sequence of reward machines that are computed from . Note that constructing a new reward machine is always possible because LARM-RM makes sure that all LLM-generated DFAs in the current set are compatible with the traces in the sample .
We first observe three properties of these sequences:
-
1.
The true reward machine (i.e., the one that encodes the reward function ) is consistent with every sample that is generated during the run of LARM-RM. This is due to the fact that each counterexample is obtained from an actual exploration of the MDP and, hence, corresponds to the “ground truth”.
-
2.
The sequence grows monotonically (i.e., ) because LARM-RM always adds counterexamples to and never removes them (Lines 9 and 10). In fact, whenever a counterexample is added to to form , then (i.e., ). To see why this is the case, remember that LARM-RM always constructs hypotheses that are consistent with the current sample (and the current set of LLM-generated DFAs). Thus, the current reward machine is consistent with , but the counterexample was added because . Thus, cannot have been an element of .
- 3. The set of LLM-generated DFAs becomes stationary after a finite amount of time, as shown at the beginning of this proof.
Similar to Property 1, we now show that each LLM-generated DFA in the stationary set is compatible with the true reward machine. Towards a contradiction, assume that some LLM-generated DFA in this set is not compatible with the true reward machine. Then, Lemma 7 guarantees the existence of a label sequence of bounded length whose final reward is positive but which is not accepted by the DFA. Since we assume all label sequences to be attainable and have chosen the episode length large enough, Corollary 1 guarantees that LARM-RM almost surely explores this label sequence in the limit. Once this happens, LARM-RM removes the DFA from the set (Lines 12 and 13), which is a contradiction to the fact that the sequence of sets is stationary (Property 3). Hence, we obtain the following:
-
4.
Every LLM-generated DFA in , , is compatible with the true reward machine .
Next, we establish the three additional properties about the sub-sequence of hypotheses starting at position :
- 5. For every recomputation after the set of LLM-generated DFAs has become stationary, the true reward machine is consistent with the current sample (Property 1) and compatible with every DFA in the current set (Property 4); since LARM-RM constructs minimal reward machines, every hypothesis it constructs from this point on has at most as many states as the true reward machine.
-
6.
We have for all . Towards a contradiction, assume that . Since LARM-RM always computes consistent reward machines and if (see Property 2), we know that is not only consistent with but also with (by definition of consistency). Moreover, LARM-RM computes minimal consistent reward machines. Hence, since is consistent with and , the reward machine is not minimal, which is a contradiction.
-
7.
The reward machines generated during the run of LARM-RM after this point are pairwise semantically distinct. This is a consequence of the facts that each newly added trace was a counterexample to the previous hypothesis and that LARM-RM always constructs reward machines consistent with the current sample.
Properties 5 and 6 now provide an upper bound on the size of any reward machine that LARM-RM constructs after this point. Since there are only finitely many reward machines of bounded size, Property 7 implies that there exists a point after which no new reward machine is learned. Hence, it is left to show that the final reward machine is equivalent to the true reward machine (i.e., produces the same reward sequence for all label sequences).
Towards a contradiction, let us assume that . Then, Lemma 6 guarantees the existence of a label sequence with
such that .
Since we assume all label sequences to be attainable and have chosen , Corollary 1 guarantees that LARM-RM almost surely explores this label sequence in the limit. Thus, the trace , where , is almost surely returned as a new counterexample, resulting in a new sample . This, in turn, causes the construction of a new reward machine, which contradicts the assumption that no further reward machine is generated. Thus, the reward machine is equivalent to the true reward machine . ∎
Let us finally turn to the proof of Theorem 1 (i.e., that LARM-RM almost surely converges to the optimal policy in the limit). The key idea is to construct the product of the given MDP and the true reward machine . In this new MDP , the reward function is in fact Markovian since the reward machine encodes all necessary information in its states (and, hence, in the product ). Thus, the fact that classical Q-learning almost surely converges to an optimal policy in the limit also guarantees the same for LARM-RM. We refer the reader to the extended version of Xu et al. [2020b] for detailed proof.
Appendix E Unattainable Label Sequences
Let us now consider the case that not all labels can be explored by the agent (e.g., because certain label sequences are not possible in the MDP). In this situation, two complications arise:
-
1.
It is no longer possible to uniquely learn the true reward machine . In fact, any reward machine that agrees with on the attainable traces becomes a valid solution.
-
2.
The notion of compatibility needs to reflect the attainable traces. More precisely, the condition that a positive final reward implies acceptance by the DFA now has to hold only for attainable label sequences.
Both complications add a further layer of complexity, which makes the overall learning problem—both Q-learning and the learning of reward machines—harder. In particular, we have to adapt the episode length, and the size of the SAT encoding grows as well. In total, we obtain the following result.
Theorem 4.
Let $\mathcal{M}$ be a labeled MDP in which not necessarily all label sequences are attainable, and let the true reward machine encode the rewards of $\mathcal{M}$. Moreover, let a finite set of LLM-generated DFAs be given. Then, LARM-RM with a sufficiently large episode length
almost surely converges to an optimal policy in the limit. The size of the formula grows linearly in the size of $\mathcal{M}$.
In order to prove Theorem 4, specifically the bound on the length of episodes, we first need to introduce the nondeterministic version of DFAs. Formally, a nondeterministic finite automaton (NFA) is a tuple $\mathcal{N} = (Q, q_I, \Sigma, \Delta, F)$ where $Q$, $q_I$, $\Sigma$, and $F$ are as in DFAs and $\Delta \subseteq Q \times \Sigma \times Q$ is the transition relation. Similar to DFAs, a run of an NFA on a word $\ell_1 \cdots \ell_k$ is a sequence of states $q_0 q_1 \cdots q_k$ such that $q_0 = q_I$ and $(q_i, \ell_{i+1}, q_{i+1}) \in \Delta$ for each $0 \le i < k$. In contrast to DFAs, however, NFAs permit multiple runs on the same input or even no run at all. Accepting runs and the language of an NFA are defined analogously to DFAs. Note that we use NFAs instead of DFAs because the former can be exponentially more succinct than the latter.
Similar to Lemma 5, one can translate an MDP into an NFA that accepts exactly the attainable label sequences of the MDP.
Lemma 8.
Given a labeled MDP $\mathcal{M}$, one can construct an NFA, with at most as many states as $\mathcal{M}$, that accepts exactly the admissible label sequences of $\mathcal{M}$.
This construction is similar to Remark 1 of the extended version of Xu et al. [2020b] and proceeds as follows.
Proof.
Given a labeled MDP $\mathcal{M} = (S, s_I, A, p, R, \gamma, \mathcal{P}, L)$, we construct an NFA $\mathcal{N}_{\mathcal{M}} = (S, s_I, 2^{\mathcal{P}}, \Delta, S)$ by letting $(s, \ell, s') \in \Delta$ if and only if there exists an action $a \in A$ with $p(s, a, s') > 0$ and $L(s, a, s') = \ell$; every state of $\mathcal{N}_{\mathcal{M}}$ is accepting.
A straightforward induction shows that $\mathcal{N}_{\mathcal{M}}$ accepts a label sequence if and only if it is an attainable label sequence of $\mathcal{M}$. ∎
We are now ready to prove Theorem 4.
Proof of Theorem 4.
We first modify Lemma 7 slightly. More precisely, we add another fifth step, where we build the product of with the cylindrification of the DFA . It is not hard to verify that this results in an NFA that accepts a trace if and only if is admissible and not in the language , , and . This NFA has states. A similar pumping argument as in the proof of Lemma 7 can now be used to show the existence of a witness of length at most .
Analogously, we can modify Lemma 6. We build the input-synchronized product of the two DFAs and and the DFA . This results in an NFA with states. If on an attainable label sequence, then we can find a sequence such that leads in this product to a state where accepts (the label sequence is attainable), but exactly one of and accepts. Using a pumping argument as above, we can bound the length of such an input to at most .
Moreover, we have to modify the formula to account for the new situation. We first introduce new auxiliary variables $y_{v, q, s}$, where $v$ is a state of the prospective reward machine, $q$ a state of the LLM-generated DFA, and $s$ a state of the NFA $\mathcal{N}_{\mathcal{M}}$; we use these variables instead of the variables $y_{v, q}$. Second, we replace Formulas (6), (7), and (8) with the following three formulas:
$y_{v_I, q_I, s_I} \quad (9)$

$\bigwedge_{v, v' \in V} \bigwedge_{q \in Q} \bigwedge_{(s, \ell, s') \in \Delta} \big( y_{v, q, s} \wedge d_{v, \ell, v'} \big) \rightarrow y_{v', \delta_{\mathcal{A}}(q, \ell), s'} \quad (10)$

$\bigwedge_{v, v' \in V} \bigwedge_{q \in Q} \bigwedge_{(s, \ell, s') \in \Delta \colon \delta_{\mathcal{A}}(q, \ell) \notin F} \bigwedge_{r > 0} \big( y_{v, q, s} \wedge d_{v, \ell, v'} \big) \rightarrow \neg o_{v, \ell, r} \quad (11)$
Clearly, the size of the formula grows linearly in the size of $\mathcal{M}$ as compared to the case that all label sequences are attainable. Moreover, it is not hard to verify that the modified formula has indeed the desired meaning. More precisely, we obtain a result analogous to Lemma 1, but for the case that not all label sequences are attainable.
Once we have established that our learning algorithm learns consistent and compatible reward machines, the overall convergence of LARM-RM to an optimal policy and the bound on the length of episodes can then be proven analogously to Theorem 1. ∎