PlanCritic: Formal Planning with Human Feedback
††thanks: This project was funded by ONR Award No. N000142312840 and NSF Award No. 2021-67021-35329.
Abstract
Real-world planning problems are often too complex to be effectively tackled by a single unaided human. To alleviate this, some recent work has focused on developing collaborative planning systems that assist humans in complex domains, with bridging the gap between the system’s problem representation and the real world being a key consideration. Transferring the speed and correctness that formal planners provide to real-world planning problems is greatly complicated by the dynamic and online nature of such tasks. Formal specifications of task and environment dynamics frequently lack constraints on behaviors or goal conditions relevant to the way a human operator prefers a plan to be carried out. While adding constraints to the representation to increase its realism risks slowing down the planner, we posit that the same benefits can be realized without sacrificing speed by modeling this problem as an online preference learning task. As part of a broader cooperative planning system, we present a feedback-driven plan critic. This method uses reinforcement learning with human feedback in conjunction with a genetic algorithm to directly optimize a plan with respect to natural-language user preferences despite the non-differentiability of traditional planners. Directly optimizing the plan bridges the gap between research into more efficient planners and research into planning with language models by utilizing the convenience of natural language to guide the output of formal planners. We demonstrate the effectiveness of our plan critic at adhering to user preferences on a disaster recovery task, and observe improved performance compared to an LLM-only neurosymbolic approach.
Index Terms:
planning, PDDL, genetic algorithm, human-computer interaction, reinforcement learning with human feedback, preference learning
I Introduction
The ability of human planners, particularly non-experts, to effectively manage situations involving complex, multifaceted objectives is limited. In an increasingly interconnected world, the potential downstream impacts of failed planning only grow, a fact borne out by the worldwide supply chain shortages seen in the wake of COVID-19 [1].
Formally defining tasks, domains, and goals in a symbolic language such as the Planning Domain Definition Language (PDDL) allows plan generation to be automated from the provided descriptions. While the classical planners carrying out that generation excel at solving complex tasks, the strict format in which problems must be specified typically makes these systems impractical for the non-expert planners who could benefit from them most [2]. While large language models (LLMs) have demonstrated proficiency at translating natural language to symbolic language [3], the resulting specifications often suffer from syntactic mistakes or deviate from the user’s intent [4].
Recent research into utilizing LLMs to generate task descriptions in PDDL has focused primarily on the ability of an LLM to produce accurate PDDL from natural language; in this paper, we focus on developing an LLM-based system for collaborative human-AI planning. In such a setting, plan generation may involve multiple iterations of human feedback to an AI in order to effectively capture the human’s preferences for various plan constraints. Approaches focused purely on LLM-based PDDL generation also lack the ability to search the space of available plans beyond what may be arrived at downstream of the LLM’s generation; we address that limitation by using a genetic algorithm to efficiently explore the planning space. Our contributions are as follows:
1. We introduce PlanCritic, a neurosymbolic framework capable of assisting human planners in dynamic and complex environments by optimizing a plan’s state trajectory constraints with respect to their preferences.
2. We use reinforcement learning with human feedback and a genetic algorithm-based optimizer to search the space of possible plans beyond what can be derived from an LLM’s outputs.
3. We integrate PlanCritic into a broader cooperative planning system.
4. We demonstrate that PlanCritic outperforms LLM-only neurosymbolic approaches at replanning in response to changing user preferences in a time-constrained planning environment.
TABLE I: Example user preferences and their corresponding mid-level goals

| Preference | Mid-Level Goal |
|---|---|
| Make sure the scout asset only visits the endpoint once | Limit the scout asset (‘sct_ast_0‘) to visiting the endpoint (‘wpt_end‘) at most one time throughout the entire plan. |
| We need to clear the route from debris station 0 to the endpoint within 5 hours | Ensure that after time step 5, the route between ‘deb_stn_0‘ and ‘wpt_end‘ is always unblocked. |
| Don’t remove any underwater debris | Ensure that the underwater debris u_deb_ini_b_0 remains at wpt_ini at all times. Ensure that the underwater debris u_deb_b_0_end remains at wpt_b_0 at all times. |
II Background and Related Works
II-A Neurosymbolic Planning
Combining the flexibility of neural models with the speed and correctness of classical planners has been studied extensively. A number of early works focused on learning domains, with [5] using a deep Q-network to translate existing instruction manuals into domains one-shot and [6] going a step further to learn STRIPS domains online with the help of an agent generating informative plan traces for training. With the advent of LLMs and their effectiveness at translation, particularly from natural language into PDDL [3], focus shifted towards verification, with [7] using an LLM-based agent to test the self-consistency of generated domains and make necessary repairs. The linguistic proficiency brought by LLMs also allowed research into neurosymbolic planning architectures to branch out from domain learning. One approach is to generate a symbolic description of some aspect of the planning problem from natural language, as in [8] or [9], which is then provided to an existing symbolic planner to produce a symbolic plan. Alternatively, the LLM can be queried for a symbolic plan directly from a natural language description of the problem, which can then be symbolically verified in a post-hoc manner [10]. Other works took advantage of the coding proficiency of larger models, with [11] using GPT-4 few-shot to generate programs which in turn created valid plans from a domain. These approaches, however, are all limited in their ability to explore the space of available plans due to their reliance on the LLM for variation.
II-B Genetic Algorithms
Genetic algorithms (GAs) are frequently used to solve combinatorial optimization problems with large search spaces, drawing inspiration from the principles of natural selection and genetics [12]. By maintaining a population of potential solutions and stochastically improving them with respect to an objective function, these methods efficiently comb the search space without requiring that objective to be differentiable with respect to any of the candidates [13]. Genetic algorithms are also inherently parallelizable, enabling them to explore multiple regions of the search space simultaneously, increasing the likelihood that the global optimum is discovered [14]. Genetic algorithms have been successfully applied to tasks including production scheduling [15], route optimization [16], and resource allocation [17]. In robotic path planning, genetic algorithms help find collision-free paths that minimize travel time and energy consumption [18].
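For illustration, the minimal sketch below evolves bit strings against a black-box objective that only needs to be evaluable, not differentiable; the encoding, operators, and hyperparameters are generic placeholders rather than anything specific to PlanCritic.

```python
import random

def evolve(fitness, length=16, pop_size=40, generations=50, mutation_rate=0.05):
    """Minimal GA: truncation selection, one-point crossover, bit-flip mutation."""
    population = [[random.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]                  # keep the fitter half as parents
        children = []
        while len(children) < pop_size:
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, length)              # one-point crossover
            child = a[:cut] + b[cut:]
            child = [1 - g if random.random() < mutation_rate else g for g in child]
            children.append(child)
        population = children
    return max(population, key=fitness)

# The objective only needs to be evaluable, not differentiable.
best = evolve(fitness=sum)
print(best, sum(best))
```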
II-C Reinforcement Learning with Human Feedback
Reinforcement Learning with Human Feedback (RLHF) leverages human feedback to guide the learning process of a reinforcement learning agent, shaping the agent’s behavior in a way more aligned with human preferences than traditional reinforcement learning [19]. One prominent application of RLHF is in natural language processing tasks, where the goal is to align language models with human preferences and ethical guidelines. For instance, OpenAI’s GPT models have been fine-tuned using RLHF to ensure that their responses are not only relevant but also adhere to safety and ethical standards [20]. In the context of planning and decision-making, RLHF has been employed to refine the outputs of planning algorithms based on user feedback. For example, [21] demonstrated the use of RLHF in Atari games, where human feedback was used to train an agent to perform better than with traditional reward functions alone. Similarly, in robotic manipulation tasks, RLHF has been used to teach robots complex behaviors that are difficult to specify through conventional reward functions [22].
III Method
III-A Reinforcement Learning with Human Feedback
To ensure performance in highly dynamic environments, we focused on optimizing a plan with respect to user feedback as opposed to generating entire problem files. This enabled PlanCritic to rapidly adapt the way a plan is executed without the risk of derailing the ground-truth end goal. To achieve this, for any given problem we treat the classical planner as a model $f$ parameterized by a set of constraints $C$ specified in PDDL such that:

$\pi = f(C)$    (1)

where $\pi$ is the resulting set of steps constituting the plan. In order to optimize this model with respect to user feedback specified in natural language, we followed RLHF and trained a reward model to score plans generated by the planning model. Taking the set of user feedback to be $F$, we define the reward model $R$ as follows:

$r = R(\pi, F)$    (2)

We assume that the symbolic planner used is deterministic. Thus, each $C$ corresponds to only one $\pi$, enabling us to use the reward model to uniquely determine the adherence of a constraint set to the feedback provided by the user.
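The data flow implied by equations (1) and (2) can be sketched as follows; `planner` and `reward_model` are hypothetical stand-ins for the external symbolic planner and the learned reward model, and the fraction-satisfied scoring mirrors the fitness measure described in the next subsection.

```python
from typing import Callable, List

Plan = List[str]          # an ordered list of plan steps
Constraints = List[str]   # PDDL state-trajectory constraints
Feedback = List[str]      # natural-language constraints derived from user feedback (mid-level goals)

def score_constraint_set(constraints: Constraints,
                         feedback: Feedback,
                         planner: Callable[[Constraints], Plan],
                         reward_model: Callable[[Plan, str], bool]) -> float:
    """Eq. (1): pi = planner(C); Eq. (2): r = R(pi, F), here the fraction of feedback satisfied."""
    plan = planner(constraints)          # deterministic planner: one plan per constraint set
    if not plan:                         # the planner found no plan for these constraints
        return 0.0
    satisfied = sum(reward_model(plan, item) for item in feedback)
    return satisfied / len(feedback) if feedback else 0.0
```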
III-A1 Reward Model
We implemented the reward model as an LSTM-based classifier that determines whether a single plan adheres to a single planning constraint specified in natural language. It sequentially processes each step of the plan combined with the planning constraint in question. We do this for all of the current planning constraints and return the percentage adhered to as the measure of fitness.
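A minimal PyTorch sketch of such a classifier is given below, assuming the plan steps and the natural-language constraint have already been embedded into fixed-size vectors; the actual encoder, dimensions, and training setup are not detailed here, so those choices are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PlanConstraintClassifier(nn.Module):
    """Scores whether a plan adheres to one natural-language constraint.

    Each timestep's input is the embedding of a plan step concatenated with an
    embedding of the constraint, so the constraint conditions every step.
    """
    def __init__(self, step_dim: int, constraint_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(step_dim + constraint_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, step_emb: torch.Tensor, constraint_emb: torch.Tensor) -> torch.Tensor:
        # step_emb: (batch, num_steps, step_dim); constraint_emb: (batch, constraint_dim)
        expanded = constraint_emb.unsqueeze(1).expand(-1, step_emb.size(1), -1)
        _, (h_n, _) = self.lstm(torch.cat([step_emb, expanded], dim=-1))
        return torch.sigmoid(self.head(h_n[-1]))  # probability the plan satisfies the constraint
```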
To construct a dataset to train this model, we first generated large numbers of constraint sets for several PDDL domains and problems and converted them into plans with Optic. Then, for each plan we generated the same number of additional constraints and used VAL [23] to confirm that the plan did not satisfy them. This ensured that each plan had an equal number of positive and negative cases. Finally, we converted all constraints to natural language using an LLM to match the semantics that would be encountered in practice.
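The labeling step can be summarized by the sketch below; `run_optic`, `run_val`, and `to_natural_language` are hypothetical wrappers around the planner, the VAL validator, and the LLM-based translation, and their exact interfaces are assumptions made for illustration.

```python
import random

def build_training_examples(problem, constraint_pool, n_constraints,
                            run_optic, run_val, to_natural_language):
    """Produce balanced (plan, NL constraint, label) triples for one sampled constraint set.

    run_optic(problem, constraints) -> plan or None
    run_val(problem, plan, constraint) -> True iff the plan satisfies the constraint
    """
    positives = random.sample(constraint_pool, n_constraints)
    plan = run_optic(problem, positives)          # these constraints hold by construction
    if plan is None:
        return []
    examples = [(plan, to_natural_language(c), 1) for c in positives]
    # Draw an equal number of constraints that VAL confirms the plan violates.
    candidates = [c for c in constraint_pool if c not in positives]
    random.shuffle(candidates)
    negatives = [c for c in candidates if not run_val(problem, plan, c)][:n_constraints]
    examples += [(plan, to_natural_language(c), 0) for c in negatives]
    return examples
```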
III-B Genetic Algorithm
While RLHF approaches typically use a gradient-based optimizer to tune their initial model with respect to the score model, the use of an external symbolic planner to generate plans makes gradient-based optimization infeasible. To address this concern we used a genetic algorithm, treating different constraint sets $C$ for a given problem as individuals in the GA’s population. Since we used PDDL, we can rigidly define a constraint $c$ as:
$c = (n, o, p(a_1, \dots, a_k))$    (3)

where $n$ indicates whether the constraint is negated, $o$ is a PDDL3 state-trajectory operator (e.g., always, sometime, at-most-once), and $p(a_1, \dots, a_k)$ is a grounded predicate over the problem’s objects.

Defining constraints in this way gives three axes of change by which to mutate a constraint without entirely replacing it: negation, changing its state-trajectory operator $o$, and modifying one of the arguments of its predicate $p$. In addition to constraint-level mutations, we define adding and removing constraints as rare individual-level mutations to ensure more rapid exploration and prevent getting stuck in local minima. Algorithm 1 describes the mutation operation used by the GA.
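A simplified sketch of this mutation operation is shown below; the constraint representation, the abbreviated modality list, and the mutation probabilities are illustrative rather than the exact values used in Algorithm 1.

```python
import random

TRAJECTORY_OPS = ["always", "sometime", "at-most-once"]   # abbreviated set of PDDL3 modalities

def mutate(individual, objects, new_constraint, p_add=0.05, p_remove=0.05):
    """Mutate a list of constraints, each represented as [negated, operator, predicate, args]."""
    child = [[neg, op, pred, list(args)] for neg, op, pred, args in individual]
    roll = random.random()
    if roll < p_add:                                       # rare individual-level mutation: add
        child.append(new_constraint())
    elif roll < p_add + p_remove and len(child) > 1:       # rare individual-level mutation: remove
        child.pop(random.randrange(len(child)))
    else:                                                  # constraint-level mutation
        c = random.choice(child)
        axes = ["negate", "operator"] + (["argument"] if c[3] else [])
        axis = random.choice(axes)
        if axis == "negate":
            c[0] = not c[0]                                # flip the negation
        elif axis == "operator":
            c[1] = random.choice([op for op in TRAJECTORY_OPS if op != c[1]])
        else:                                              # swap one predicate argument
            c[3][random.randrange(len(c[3]))] = random.choice(objects)
    return child
```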
The crossover operation, described in Algorithm 2, generates a child by selecting a random crossover point and combining the constraints before that point from one parent with the constraints after it from the other parent.
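A minimal version of this single-point crossover over two parents’ constraint lists can be written as follows; the example constraint strings are purely illustrative.

```python
import random

def crossover(parent_a, parent_b):
    """Single-point crossover over two parents' constraint lists."""
    if not parent_a or not parent_b:
        return list(parent_a or parent_b)
    point = random.randrange(1, max(len(parent_a), len(parent_b), 2))
    return list(parent_a[:point]) + list(parent_b[point:])

# Illustrative constraint strings only; identifiers follow the naming in Table I.
child = crossover(["(always (not (at sct_ast_0 wpt_a)))", "(sometime (at sct_ast_0 wpt_end))"],
                  ["(at-most-once (at sct_ast_0 wpt_end))"])
print(child)
```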
To assess the fitness of the individuals in the population, we use the LSTM-based reward model described in the previous section.
III-C User Interaction
To both maximize the information available to the reward model and ensure that the genetic algorithm starts its optimization in the neighborhood of the actual solution, we implemented the system architecture described in Fig. 1 as an interface between the user and the GA-based planner. After the user gives their preferences, GPT-4 translates these preferences into symbolically grounded natural-language constraints we term mid-level goals; examples are given in Table I. These mid-level goals are then used by GPT-4 to generate an initial candidate constraint set, which is duplicated to produce the initial population for the GA, with the duplicates mutated to ensure diversity. To prevent issues caused by GPT-4 returning invalid PDDL constraints, we developed a parser which takes in the model’s output and corrects syntactical errors with the most likely valid replacement as determined by Gestalt pattern matching [24]. Once planning is complete, GPT-4 summarizes the result of planning and each of the plan steps and returns the outcome to the user.
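Gestalt pattern matching is the algorithm behind Python’s difflib, so a simplified version of this correction step can be sketched as below; the tokenization and the vocabulary passed in are illustrative, and the actual parser handles more cases.

```python
import difflib
import re

def repair_constraint(constraint: str, valid_operators, valid_symbols) -> str:
    """Replace unknown tokens in a generated PDDL constraint with their closest valid match."""
    vocabulary = list(valid_operators) + list(valid_symbols)
    repaired = []
    for token in re.findall(r"[()]|[^\s()]+", constraint):
        if token in ("(", ")") or token in vocabulary:
            repaired.append(token)
        else:
            match = difflib.get_close_matches(token, vocabulary, n=1, cutoff=0.6)
            repaired.append(match[0] if match else token)   # keep token if nothing is close enough
    return " ".join(repaired)

print(repair_constraint("(alway (at sct_ast0 wpt_end))",
                        valid_operators=["always", "sometime", "at-most-once", "at", "not"],
                        valid_symbols=["sct_ast_0", "wpt_end", "wpt_ini"]))
# -> "( always ( at sct_ast_0 wpt_end ) )"
```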
III-C1 Cooperative Planning System
PlanCritic is developed as a component of a broader planning architecture which tackles planning from plan creation through the end of execution. Our system fits into the beginning of that process, enabling users to adapt a plan created for a static goal to meet the shifting constraints of the planning environment. The interface developed to integrate PlanCritic with the user-facing system is shown in Fig. 2. Users enter preferences on the left; on the right, a step-by-step summary of the plan is shown alongside a visual representation of the plan execution and a summary of the planning outcome.
IV Experimental Setup

To demonstrate the effectiveness of our GA-based optimization approach, we compared the performance of the plan revision system with and without the GA, in the latter case planning exclusively from the constraints guessed by GPT-4. We used a waterway restoration domain in which the goal is to retrieve a ship from a location blocked by debris and tow it to a ship dock; the environment is illustrated in Fig. 3. We used Optic [25] as the symbolic planner to generate plans from problem descriptions and constraints.
To keep our results as aligned as possible with the downstream use of PlanCritic, we conducted a user study in which participants restated each planning objective from a predefined list as a preference in their own words and then allowed the system to attempt to satisfy it.
V Results
Table II shows the results of our user study on the six predefined planning objectives. Overall, the genetic algorithm had a higher success rate of 75%, compared to 53% for GPT-4 by itself, which aligned exactly with its validation accuracy during training. Table III shows a cross-comparison of the results of the GA and the LLM. We find that the genetic algorithm is especially useful at correcting the LLM when it gives an incorrect answer, returning a correct plan 88% of the time when the initial guess generated by the LLM is either wrong or fails to yield a plan. However, in 36% of cases where the LLM did give an initially correct answer, the final answer returned by the GA is incorrect. In three out of four such cases, the correct answer required multiple constraints; the GA had mutated one of the initially correct constraints into an incorrect one but still arrived at a partially correct answer (e.g., removing one of the two desired pieces of debris), which the reward model misclassified. We attribute this to the way constraints were selected for the reward model’s training set: because the constraints corresponding to positive and negative samples were selected randomly, similar constraints rarely appeared as both positive and negative examples for the same plan, making the model less precise at detecting "near misses".
TABLE II: User study results per planning objective

| Objective | GA right | GA wrong | No GA right | No GA wrong |
|---|---|---|---|---|
| all underwater debris is removed | 2 | 1 | 2 | 1 |
| waypoint b is made unrestricted | 3 | 0 | 2 | 1 |
| scout asset reaches end point before debris asset reaches initial point | 6 | 0 | 2 | 4 |
| no assets visit waypoint a | 0 | 3 | 2 | 1 |
| make sure step 6 happens before step 5 | 2 | 0 | 1 | 1 |
| all of the underwater debris is removed and none of the normal debris is removed | 2 | 1 | 2 | 1 |
TABLE III: Cross-comparison of GA and LLM-only outcomes

|  | GA correct | GA incorrect |
|---|---|---|
| LLM correct | 7 | 4 |
| LLM incorrect | 8 | 1 |
VI Future Work
We plan to conduct further studies testing different reward model architectures, as well as measuring how overall performance changes when a smaller LLM (e.g., GPT-3.5) is used to generate the initial candidate. Potential avenues for expanding the scope of PlanCritic include support for weighted preferences and extending its ability to explicitly re-plan when downstream plan execution fails due to a change in the environment.
VII Conclusion
We propose PlanCritic, a neurosymbolic architecture that optimizes PDDL plans with respect to user preferences in online and dynamic scenarios. We approach the problem through the lens of reinforcement learning with human feedback, treating a classical planner as a model parameterized by plan constraints and optimizing it with a genetic algorithm. We find that our approach is better than an LLM-only neurosymbolic architecture at generating plans whose state trajectories align with stated user preferences, and that it is extremely effective at catching the LLM’s mistakes. However, we also find that the reward model underpinning the genetic algorithm is prone to failure when a plan is a "near miss" to the mid-level goal, though we expect that this can be resolved by accounting for such cases during model training in future work.
References
- [1] An Thi Binh Duong, Tho Pham, Huy Truong Quang, Thinh Gia Hoang, Scott McDonald, Thu-Hang Hoang, and Hai Thanh Pham, “Ripple effect of disruptions on performance in supply chains: an empirical study,” Engineering, Construction and Architectural Management, vol. 31, no. 13, pp. 1–22, Jan. 2023, Publisher: Emerald Publishing Limited.
- [2] M. S. Boddy, “Imperfect Match: PDDL 2.1 and Real Applications,” Journal of Artificial Intelligence Research, vol. 20, pp. 133–137, Dec. 2003.
- [3] Yaqi Xie, Chen Yu, Tongyao Zhu, Jinbin Bai, Ze Gong, and Harold Soh, “Translating Natural Language to Planning Goals with Large-Language Models,” Feb. 2023, arXiv:2302.05128 [cs].
- [4] Alba Gragera and Alberto Pozanco, “Exploring the Limitations of using Large Language Models to Fix Planning Tasks,” .
- [5] S Miglani and N Yorke-Smith, “NLtoPDDL: One-Shot Learning of PDDL Models from Natural Language Process Manuals,” .
- [6] Leonardo Lamanna, Alessandro Saetti, Luciano Serafini, Alfonso Gerevini, and Paolo Traverso, “Online Learning of Action Models for PDDL Planning,” Aug. 2021, vol. 4, pp. 4112–4118, ISSN: 1045-0823.
- [7] Pavel Smirnov, Frank Joublin, Antonello Ceravola, and Michael Gienger, “Generating consistent PDDL domains with Large Language Models,” Apr. 2024, arXiv:2404.07751 [cs].
- [8] Gautier Dagan, Frank Keller, and Alex Lascarides, “Dynamic Planning with a LLM,” Aug. 2023, arXiv:2308.06391 [cs].
- [9] Bo Liu, Yuqian Jiang, Xiaohan Zhang, Qiang Liu, Shiqi Zhang, Joydeep Biswas, and Peter Stone, “LLM+P: Empowering Large Language Models with Optimal Planning Proficiency,” Sept. 2023, arXiv:2304.11477 [cs].
- [10] Alessio Capitanelli and Fulvio Mastrogiovanni, “A Framework for Neurosymbolic Robot Action Planning using Large Language Models,” Frontiers in Neurorobotics, vol. 18, pp. 1342786, June 2024, arXiv:2303.00438 [cs].
- [11] Tom Silver, Soham Dan, Kavitha Srinivas, Joshua B. Tenenbaum, Leslie Pack Kaelbling, and Michael Katz, “Generalized Planning in PDDL Domains with Pretrained Large Language Models,” Dec. 2023, arXiv:2305.11014 [cs].
- [12] David E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley Longman Publishing Co., Inc., USA, 1st edition, Sept. 1989.
- [13] Melanie Mitchell, An Introduction to Genetic Algorithms, The MIT Press.
- [14] E. Alba and M. Tomassini, “Parallelism and evolutionary algorithms,” IEEE Transactions on Evolutionary Computation, vol. 6, no. 5, pp. 443–462, Oct. 2002, Conference Name: IEEE Transactions on Evolutionary Computation.
- [15] Anas Neumann, Adnene Hajji, Monia Rekik, and Robert Pellerin, “Genetic algorithms for planning and scheduling engineer-to-order production: a systematic review,” International Journal of Production Research, vol. 62, no. 8, pp. 2888–2917, Apr. 2024, Publisher: Taylor & Francis.
- [16] Barrie Baker and M.A. Ayechew, “A genetic algorithm for the vehicle routing problem,” Computers & Operations Research, vol. 30, pp. 787–800, Apr. 2003.
- [17] Javier Alcaraz and Concepción Maroto, “A Robust Genetic Algorithm for Resource Allocation in Project Scheduling,” Annals OR, vol. 102, pp. 83–109, Feb. 2001.
- [18] Chaymaa Lamini, Said Benhlima, and Ali Elbekri, “Genetic algorithm based approach for autonomous mobile robot path planning,” Procedia Computer Science, vol. 127, pp. 180–189, 2018, PROCEEDINGS OF THE FIRST INTERNATIONAL CONFERENCE ON INTELLIGENT COMPUTING IN DATA SCIENCES, ICDS2017.
- [19] Timo Kaufmann, Paul Weng, Viktor Bengs, and Eyke Hüllermeier, “A survey of reinforcement learning from human feedback,” 2024.
- [20] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe, “Training language models to follow instructions with human feedback,” 2022.
- [21] Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei, “Deep reinforcement learning from human preferences,” 2023.
- [22] Ayano Hiranaka, Minjune Hwang, Sharon Lee, Chen Wang, Li Fei-Fei, Jiajun Wu, and Ruohan Zhang, “Primitive skill-based robot learning from human evaluative feedback,” 2023.
- [23] R. Howey, D. Long, and M. Fox, “Val: automatic plan validation, continuous effects and mixed initiative planning using pddl,” in 16th IEEE International Conference on Tools with Artificial Intelligence, 2004, pp. 294–301.
- [24] John W. Ratcliff, “Pattern Matching: The Gestalt Approach,” Dr. Dobb’s Journal, July 1988.
- [25] J. Benton, Amanda Coles, and Andrew Coles, “Temporal Planning with Preferences and Time-Dependent Continuous Costs,” in Proceedings of the International Conference on Automated Planning and Scheduling (ICAPS), 2012.