
Optimal Scene Graph Planning with Large Language Model Guidance

Zhirui Dai1   Arash Asgharivaskasi1   Thai Duong1   Shusen Lin1   Maria-Elizabeth Tzes2
George Pappas2   Nikolay Atanasov1
We acknowledge support from ARL DCIST CRA W911NF-17-2-0181.1The authors are with the Department of Electrical and Computer Engineering, University of California San Diego, La Jolla, CA 92093, USA, e-mails: {zhdai,shl090,tduong,aasghari,natanasov}@ucsd.edu.2The authors are with GRASP Lab, University of Pennsylvania, Philadelphia, PA 19104, USA, e-mails: {mtzes,pappasg}@seas.upenn.edu.
Abstract

Recent advances in metric, semantic, and topological mapping have equipped autonomous robots with semantic concept grounding capabilities to interpret natural language tasks. This work aims to leverage these new capabilities with an efficient task planning algorithm for hierarchical metric-semantic models. We consider a scene graph representation of the environment and utilize a large language model (LLM) to convert a natural language task into a linear temporal logic (LTL) automaton. Our main contribution is to enable optimal hierarchical LTL planning with LLM guidance over scene graphs. To achieve efficiency, we construct a hierarchical planning domain that captures the attributes and connectivity of the scene graph and the task automaton, and provide semantic guidance via an LLM heuristic function. To guarantee optimality, we design an LTL heuristic function that is provably consistent and supplements the potentially inadmissible LLM guidance in multi-heuristic planning. We demonstrate efficient planning of complex natural language tasks in scene graphs of virtualized real environments.

I INTRODUCTION

Advances in robot perception and computer vision have enabled metric-semantic mapping [1, 2, 3, 4, 5, 6, 7, 8, 9], offering rich information in support of robot autonomy. Beyond single-level maps, hierarchical models encode topological relations among local maps and semantic elements [10, 11]. A scene graph [11] is a prominent example that models buildings, floors, rooms, objects, and occupancy in a unified hierarchical representation. Scene graph construction can be done from streaming sensor data [12, 5, 13]. The metric, semantic, and topological elements of such models offer the building blocks for robots to execute semantic tasks [14]. The objective of this work is to approach this challenge by generalizing goal-directed motion planning in flat geometric maps to natural language task planning in scene graphs.

Connecting the concepts in a natural language task to the real-world objects they refer to is a challenging problem, known as symbol grounding [15]. Large language models (LLMs), such as GPT-3 [16], BERT [17], and LLaMA [18], offer a possible resolution with their ability to relate entities in an environment model to concepts in natural language. Chen et al. [19, 20] use LLMs for scene graph labeling, showing their capability of high-level understanding of indoor scenes. Shah et al. [21] use GPT-3 to parse text instructions to landmarks and the contrastive language image pre-training (CLIP) model [22] to infer a joint landmark-image distribution for visual navigation. Seminal papers [23, 24] in the early 2000s established formal logics and automata as powerful representations of robot tasks. In work closely related to ours, Chen et al. [25] show that natural language tasks in 2D maps encoded as sets of landmarks can be converted to signal temporal logic (STL) [26] via LLM re-prompting and automatic syntax correction, enabling the use of existing temporal logic planners [27, 28, 29, 30, 31]. Beyond temporal logics, other expressive robot task representations include the planning domain definition language [32], Petri nets [33, 34], and process algebra [35]. However, few of the existing works have considered task specification in hierarchical models. We use an LLM to translate a natural language task defined by the attributes of a scene graph to a linear temporal logic (LTL) formula. We describe the scene graph to the LLM via an attribute hierarchy and perform co-safe syntax checking to ensure generation of correct and finite-path verifiable LTL. Further, we develop an approach to obtain task execution guidance from the LLM, which is used to accelerate the downstream task planning algorithm.

Figure 1: Planning a natural language mission, $\mu$: "Reach the oven in the kitchen", in a scene graph $\mathcal{G}$ of the Gibson environment Benevolence [36] with object, room, and floor attributes. The terms "oven" and "kitchen" in $\mu$ belong to the object and room attributes of the scene graph, respectively. The scene graph $\mathcal{G}$ is described to the LLM using the connectivity of its attributes (attribute hierarchy), and the LLM is used to translate $\mu$ to an LTL formula $\phi_\mu$ and an associated automaton $\mathcal{M}_\phi$. We construct a hierarchical planning domain from the scene graph and use multi-resolution multi-heuristic planning [37] to plan the mission execution. In addition to mission translation, the LLM provides heuristic guidance to accelerate the planning, while an LTL heuristic is used to guarantee optimality.

Given a symbolically grounded task representation, the next challenge is to plan its execution in a hierarchical model efficiently. A key component for achieving efficiency in traditional goal-oriented planning in single-level environments is the use of guidance from a heuristic function. Heuristics play a critical role in accelerating both search-based [38, 39] and sampling-based [40, 41] planners. More complex tasks than goal reaching involve sequencing, branching, and recurrence, making heuristic guidance even more important for efficiency. Our work is inspired by Fu et al. [29] who develop an admissible heuristic for LTL planning in probabilistic landmark maps. We extend the approach by (a) generalizing to hierarchical scene graphs via multi-resolution planning, (b) designing a consistent LTL heuristic allowing acceleration over admissible-only heuristic planning, and (c) introducing an LLM heuristic allowing acceleration from LLM semantic guidance while retaining optimality guarantees via multi-heuristic planning. Our approach is enabled by the anytime multi-resolution multi-heuristic A* (AMRA*) [37], which combines the advantages of multi-resolution A* [42] and multi-heuristic A* [39]. Our key contribution is to define the nodes, edges, and costs of a hierarchical planning domain from a scene graph and to introduce guidance from a consistent LTL heuristic and a semantically informed LLM heuristic.

Related works consider object search and semantic goal navigation in unknown environments, represented as semantic occupancy maps [43], topological maps [44], or scene graphs [45, 46]. Shah et al. [43] develop an exploration approach that uses an LLM to score subgoal candidates and provide an exploration heuristic. Kostavelis et al. [44] perform place recognition using spatial and temporal proximity to obtain a topological map and encode the connectivity of its attributes in a navigation graph to enable Dijkstra planning to semantically specified goals. Amiri et al. [45] employ a scene graph generation network to construct a scene graph from images and a partially observable Markov decision process to obtain an object-search policy. Ravichandran et al. [46] embed partial scene graph observations in feature space using a graph neural network (GNN) and train value and policy GNNs via proximal policy optimization. In contrast with these works, we consider significantly more general missions but perform planning in a known scene graph.

In summary, this paper makes the following contributions.

  • We use an LLM to translate natural language to LTL tasks over the attributes of a scene graph.

  • We construct a hierarchical planning domain capturing the structure of the scene graph and LTL task.

  • We design new LTL and LLM heuristic functions for planning, and prove that the LTL heuristic is consistent.

  • We employ hierarchical multi-heuristic planning to guarantee efficiency (due to LLM semantic guidance) and optimality (due to LTL consistent guidance), despite potential inadmissibility of the LLM heuristic.

II PROBLEM STATEMENT

Consider an agent planning a navigation mission specified in terms of semantic concepts, such as objects and regions, in a known environment. We assume that the environment is represented as a scene graph [11].

Definition 1.

A scene graph is a graph $\mathcal{G}=(\mathcal{V},\mathcal{E},\{\mathcal{A}_k\}_{k=1}^{K})$ with node set $\mathcal{V}$, edge set $\mathcal{E}\subseteq\mathcal{V}\times\mathcal{V}$, and $K$ attribute sets $\mathcal{A}_k$ for $k=1,\ldots,K$. Each attribute $a\in\mathcal{A}_k$ is associated with a subset $\mathcal{V}_a$ of the nodes $\mathcal{V}$.

A scene graph provides a hierarchical semantic description of an environment, such as a building, in terms of floors, rooms, objects, and occupancy (see Fig. 1). For example, the graph nodes $\mathcal{V}$ may represent free space, while the objects may be encoded as an attribute set $\mathcal{A}_1$ such that an object $o\in\mathcal{A}_1$ is associated with a free region $\mathcal{V}_o\subset\mathcal{V}$ around it. Similarly, rooms may be represented as a set $\mathcal{A}_2$ such that a room $r\in\mathcal{A}_2$ is associated with a free-space region $\mathcal{V}_r\subset\mathcal{V}$.
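To make the representation concrete, here is a minimal Python sketch of Definition 1 (our own illustration; the names are not from the paper's implementation):

```python
from dataclasses import dataclass, field

@dataclass
class SceneGraph:
    """Scene graph of Definition 1: free-space nodes V, edges E, and
    per-level attribute sets A_k mapping each attribute a to V_a."""
    nodes: set                 # V: free-space nodes
    edges: set                 # E subset of V x V, as a set of node pairs
    attributes: dict = field(default_factory=dict)  # level k -> {a -> V_a}

    def nodes_with(self, level: int, attr) -> set:
        """Return V_a, the nodes associated with attribute `attr` at `level`."""
        return self.attributes.get(level, {}).get(attr, set())
```

For instance, `G.nodes_with(1, "oven")` would return the free-space region around the oven, and `G.nodes_with(2, "kitchen")` the free space of the kitchen.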

Given the initial agent location $s\in\mathcal{V}$ and a cost function $c:\mathcal{E}\mapsto\mathbb{R}_{>0}$, our objective is to plan a sequence of scene graph nodes (a path) that achieves a mission $\mu$ with minimal cost. We assume $\mu$ is provided in natural language using terms from the attribute sets $\mathcal{A}_k$ of the scene graph. An example scene graph mission is provided in Fig. 1.

To interpret a natural language mission, we define atomic propositions whose combinations can capture the mission requirements. An atomic proposition $p_a:\mathcal{V}\mapsto\{\mathsf{false},\mathsf{true}\}$ associated with attribute $a\in\mathcal{A}_k$ of the scene graph $\mathcal{G}$ evaluates true at node $s\in\mathcal{V}$ if $s$ belongs to the nodes $\mathcal{V}_a$ that satisfy attribute $a$. We denote this by $p_a(s): s\in\mathcal{V}_a$. The set of all atomic propositions at $s\in\mathcal{V}$ is denoted by:

$$\mathcal{AP}(s)\coloneqq\left\{p_a(s)\mid a\in\mathcal{A}_k,\ k=1,\ldots,K\right\}. \quad (1)$$

Intuitively, $p_a(s)$ being $\mathsf{true}$ means that the agent is at a node $s$ that satisfies attribute $a$, e.g., reaching an object in $\mathcal{A}_1$ or being in a room in $\mathcal{A}_2$. Avoiding an object or leaving a room can be specified via the negation of such propositions. To determine which atomic propositions are satisfied at a particular node, we define a labeling function.

Definition 2.

Consider a scene graph $\mathcal{G}$ with atomic propositions $\mathcal{AP}=\cup_{s\in\mathcal{V}}\mathcal{AP}(s)$. A labeling function $\ell:\mathcal{V}\mapsto 2^{\mathcal{AP}}$ maps a node $s\in\mathcal{V}$ to the set $\ell(s)\subseteq\mathcal{AP}$ of atomic propositions that evaluate $\mathsf{true}$ at $s$.

The labels along a path $s_{1:T}$ are obtained as $\ell(s_{1:T})=\ell(s_1)\ell(s_2)\ldots\ell(s_T)$ and are called a word. A word contains information about the objects, rooms, and floors that an agent visits along its path in a scene graph. We can decide whether the agent path satisfies a mission $\mu$ by interpreting its word. We denote a word $\ell(s_{1:T})$ that satisfies a mission $\mu$ by $\ell(s_{1:T})\models\mu$, and define the precise semantics of this notation in Sec. III. With this, we are ready to state the problem of natural language mission planning in scene graphs.
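Under the `SceneGraph` sketch above, the labeling function of Definition 2 and the word of a path can be written directly (atomic propositions are identified with the attribute names they test):

```python
def label(G: SceneGraph, s) -> frozenset:
    """l(s): the set of atomic propositions p_a that evaluate true at node s."""
    return frozenset(
        a for level in G.attributes.values()
        for a, V_a in level.items() if s in V_a
    )

def word(G: SceneGraph, path) -> list:
    """l(s_1) l(s_2) ... l(s_T): the word generated by a path s_{1:T}."""
    return [label(G, s) for s in path]
```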

Problem.

Given a scene graph $\mathcal{G}=(\mathcal{V},\mathcal{E},\{\mathcal{A}_k\}_{k=1}^{K})$, a natural language mission $\mu$ over the attributes of $\mathcal{G}$, a cost function $c:\mathcal{E}\mapsto\mathbb{R}_{>0}$, and an initial node $s_1\in\mathcal{V}$, plan a path $s_{1:T}$ that satisfies $\mu$ with minimal cost:

$$\min_{T\in\mathbb{N},\,s_{1:T}\in\mathcal{V}^{T}}\ \sum_{t=1}^{T-1} c(s_t,s_{t+1}) \quad \text{s.t.}\quad (s_t,s_{t+1})\in\mathcal{E},\ t=1,\ldots,T-1, \qquad \ell(s_{1:T})\models\mu. \quad (2)$$

III NATURAL LANGUAGE TO TEMPORAL LOGIC

We use an LLM to translate natural language missions to logic formulas over the scene graph propositions $\mathcal{AP}$. The key challenge is to describe the structure of a scene graph $\mathcal{G}$ to an LLM and ask it to translate a mission $\mu$ to a logic formula $\phi_\mu$. We focus on linear temporal logic (LTL) [47], with the syntax in Table I, due to its popularity and sufficient expressiveness to capture temporal ordering of propositions.

We require a syntactically co-safe formula [48] to allow checking its satisfaction over finite agent paths. A co-safe LTL formula can be satisfied by a word $\ell(s_{1:T})$ that consists of a finite prefix followed by a (potentially infinite) continuation that does not affect the formula's truth value. LTL formulas that contain only the $\mathbf{X}$ and $\mathbf{U}$ temporal operators when written in negation normal form ($\neg$ appears only in front of atomic propositions) are syntactically co-safe [48].
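The syntactic restriction can be checked mechanically. The sketch below (our own illustration, not the paper's checker) assumes the formula is already in negation normal form, represented as nested tuples, and accepts it only if the temporal operators are limited to $\mathbf{X}$, $\mathbf{U}$, and $\mathbf{F}$ (with $\mathbf{F}\phi\equiv\mathsf{true}\,\mathbf{U}\,\phi$):

```python
def is_syntactically_cosafe(ast) -> bool:
    """True if the NNF formula `ast` is syntactically co-safe [48]."""
    op, *args = ast                          # e.g. ("U", phi, psi) or ("AP", "p_oven")
    if op == "AP":
        return True
    if op == "NOT":                          # negation allowed only on atoms
        return args[0][0] == "AP"
    if op in ("AND", "OR", "X", "U", "F"):   # the only operators permitted
        return all(is_syntactically_cosafe(a) for a in args)
    return False                             # G, IMPLY, etc. must be rewritten away first
```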

To use an LLM for scene understanding, it is necessary to design a prompt that describes the scene graph $\mathcal{G}$ succinctly. Otherwise, the input might exceed the model token limit or confuse the model about the relationships among the sentences. To this end, we simplify the scene graph $\mathcal{G}$ into an attribute hierarchy $\bar{\mathcal{G}}$ that compactly represents scene entities in a YAML format. In our setup, the top level of $\bar{\mathcal{G}}$ contains floors $f\in\mathcal{A}_3$. The rooms $r_f$ on floor $f$ are defined as $\{r\in\mathcal{A}_2\mid\mathcal{V}_r\subseteq\mathcal{V}_f\}$ and nested as children of floor $f$. Additionally, each room $r$ stores connections to other rooms on the same floor. Similarly, the objects in room $r$, $\{o\in\mathcal{A}_1\mid\mathcal{V}_o\subseteq\mathcal{V}_r\}$, are stored as children of room $r$. Each entity in $\bar{\mathcal{G}}$ is tagged with a unique ID to differentiate rooms and objects with the same name. See Fig. 2a for an example attribute hierarchy. In our examples, we define attributes for floors, rooms, and objects, so the attribute hierarchy contains 3 levels, but this can be extended to more general attribute sets (e.g., a level for object affordances). The attribute hierarchy removes the dense node and edge descriptions in $\mathcal{G}$, which are redundant for mission translation, leading to a significant reduction in prompt size.

Given the natural language mission $\mu$ and the attribute hierarchy $\bar{\mathcal{G}}$, the first call to the LLM is responsible only for extracting unique IDs from the context of $\mu$, outputting $\mu_{\text{unique}}$. This step facilitates LTL translation by separating high-level scene understanding from accurate LTL generation. Specifically, it links contextually rich specifications, which can be difficult to parse, to unequivocal mission descriptions that leave no room for confusion when the LLM performs LTL translation (Fig. 2b). The list of entities involved in the mission is extracted from $\mu_{\text{unique}}$ using regular expressions and stored as $\mu_{\text{regex}}$. This allows us to inform the LLM about the essential parts of the mission $\mu_{\text{unique}}$ without providing $\bar{\mathcal{G}}$. The savings in prompt size are used to augment the prompt with natural language to LTL translation examples expressed in prefix notation. For instance, $\phi\wedge\varphi$ is expressed as $\wedge\,\phi\,\varphi$ in prefix format. This circumvents the issue of balancing parentheses in the formula $\phi_\mu$. Ultimately, the LTL translation prompt includes $\mu_{\text{unique}}$, $\mu_{\text{regex}}$, and the translation examples (Fig. 2c).
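As an illustration of the prefix format, the following sketch (again our own, with hypothetical token names) parses a prefix-notation token stream into the tuple-based syntax tree consumed by the co-safety check above:

```python
def parse_prefix(tokens):
    """Parse one formula from a prefix token list, e.g.
    ['&', 'F', 'p_2', 'F', 'p_3'] for (F p_2) AND (F p_3). Returns (ast, rest)."""
    head, rest = tokens[0], tokens[1:]
    if head in ("&", "|", "U"):              # binary operators
        lhs, rest = parse_prefix(rest)
        rhs, rest = parse_prefix(rest)
        return ({"&": "AND", "|": "OR", "U": "U"}[head], lhs, rhs), rest
    if head in ("!", "X", "F", "G"):         # unary operators
        sub, rest = parse_prefix(rest)
        return ({"!": "NOT"}.get(head, head), sub), rest
    return ("AP", head), rest                # anything else is an atomic proposition
```

Because every operator's arity is fixed, no parentheses are needed, and a truncated or malformed token stream fails the parse immediately, which is easy to report back to the LLM.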

TABLE I: Grammar for LTL formulas $\phi$ and $\varphi$. Syntax (Semantics):

$p_a$ (Atomic Proposition) | $\phi\vee\varphi$ (Or) | $\phi\,\mathbf{U}\,\varphi$ (Until)
$\neg\phi$ (Negation) | $\phi\Rightarrow\varphi$ (Imply) | $\mathbf{F}\phi$ (Eventually)
$\phi\wedge\varphi$ (And) | $\mathbf{X}\phi$ (Next) | $\mathbf{G}\phi$ (Always)

The translated LTL formula $\phi_\mu$ is verified for syntactic correctness and co-safety using an LTL syntax checker. Further LLM calls are made to correct $\phi_\mu$ if it does not pass the checks, up to a maximum number of allowed verification steps, after which human feedback is used to rephrase the natural language specification $\mu$ (Fig. 2d). Once the mission $\mu$ is successfully translated into a co-safe LTL formula $\phi_\mu$, we can evaluate whether an agent path $s_{1:T}$ and its corresponding word $\ell(s_{1:T})$ satisfy $\phi_\mu$ by constructing an automaton representation of $\phi_\mu$ (Fig. 2e).
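A hedged sketch of this translate-check-correct loop, reusing the parser and co-safety check above (`make_translation_prompt` and the `llm` callable are hypothetical stand-ins, not APIs from the paper):

```python
def translate_with_checks(mu_unique, mu_regex, llm, max_attempts=3):
    """Query the LLM for a prefix-notation LTL formula and re-prompt on failure."""
    prompt = make_translation_prompt(mu_unique, mu_regex)  # hypothetical helper
    for _ in range(max_attempts):
        formula = llm(prompt)
        try:
            ast, rest = parse_prefix(formula.split())
            if rest:
                issue = "the formula has trailing tokens"
            elif is_syntactically_cosafe(ast):
                return ast                                  # passed both checks
            else:
                issue = "the formula is not syntactically co-safe"
        except IndexError:
            issue = "the formula is not well-formed prefix notation"
        prompt = f"The formula `{formula}` is invalid: {issue}. Please correct it."
    raise ValueError("translation failed; rephrase the mission with human feedback")
```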

Definition 3.

A deterministic automaton over atomic propositions $\mathcal{AP}$ is a tuple $\mathcal{M}=(\mathcal{Q},2^{\mathcal{AP}},T,\mathcal{F},q_1)$, where $\mathcal{Q}$ is a set of states, $2^{\mathcal{AP}}$ is the power set of $\mathcal{AP}$, called the alphabet, $T:\mathcal{Q}\times 2^{\mathcal{AP}}\mapsto\mathcal{Q}$ is a transition function that specifies the next state $T(q,l)$ from state $q\in\mathcal{Q}$ under label $l\in 2^{\mathcal{AP}}$, $\mathcal{F}\subseteq\mathcal{Q}$ is a set of final states, and $q_1\in\mathcal{Q}$ is an initial state.

A co-safe LTL formula $\phi_\mu$ can be translated into a deterministic automaton $\mathcal{M}_{\phi_\mu}$ using model checking tools such as PRISM [49, 50] or Spot [51]. The automaton checks whether a word satisfies $\phi_\mu$. A word $\ell_{1:T}$ is accepted by $\mathcal{M}_{\phi_\mu}$, i.e., $\ell_{1:T}\models\phi_\mu$, if and only if the state $q_{T+1}$ obtained after the transitions $q_{t+1}=T(q_t,\ell_t)$ for $t=1,\ldots,T$ is contained in the final states $\mathcal{F}$. Hence, a path $s_{1:T}$ satisfies a co-safe LTL formula $\phi_\mu$ if and only if its word $\ell(s_{1:T})$ contains a prefix $\ell(s_{1:t})$ that is accepted by $\mathcal{M}_{\phi_\mu}$.
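A minimal acceptance check for Definition 3, under an assumed dictionary-based automaton encoding (our own, not Spot's API):

```python
from collections import namedtuple

# Deterministic automaton of Definition 3; T is a dict keyed by (state, label).
Automaton = namedtuple("Automaton", ["Q", "T", "final", "q1"])

def accepts(M: Automaton, word) -> bool:
    """True if some prefix of `word` drives M into a final state, i.e. the
    corresponding path satisfies the co-safe formula."""
    q = M.q1
    if q in M.final:
        return True
    for l in word:                 # labels are frozensets of atomic propositions
        q = M.T[(q, l)]            # deterministic transition q_{t+1} = T(q_t, l_t)
        if q in M.final:
            return True
    return False
```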

Figure 2: Natural language to LTL translation. (a) Attribute hierarchy $\bar{\mathcal{G}}$. The unique IDs and the room connections are shown in parentheses and inside red brackets, respectively. (b) Unique ID extraction from the natural language mission $\mu$. (c) LTL formula generation from the natural language specification. (d) Syntax and co-safety check over the generated LTL formula $\phi_\mu$. (e) Automaton construction.

IV OPTIMAL SCENE GRAPH PLANNING

We use the structure of the scene graph $\mathcal{G}$ together with guidance from the automaton $\mathcal{M}_{\phi_\mu}$ and the LLM's understanding of the mission semantics to achieve efficient and optimal planning.

IV-A AMRA* Planning

We perform mission planning using AMRA* [37]. The key challenge is to define a hierarchical planning domain and heuristic functions that describe the scene graph and mission. AMRA* requires node sets $\mathcal{X}_r$ and action sets $\mathcal{U}_r$ for the different planning resolution levels $r=1,2,\ldots$. Each level $(\mathcal{X}_r,\mathcal{U}_r)$ has an associated cost function $c_r:\mathcal{X}_r\times\mathcal{X}_r\mapsto\mathbb{R}_{>0}$. The algorithm defines an anchor level $(\mathcal{X}_0,\mathcal{U}_0)$ as $\mathcal{X}_0:=\cup_{r>0}\mathcal{X}_r$, $\mathcal{U}_0:=\cup_{r>0}\mathcal{U}_r$, and requires an initial state $x\in\mathcal{X}_0$ and a goal region $\mathcal{R}\subseteq\mathcal{X}_0$. AMRA* allows multiple heuristics for each level but requires the anchor-level heuristics to be consistent to guarantee optimality. A heuristic $h$ is consistent with respect to cost $c$ if $h(x)\leq c(x,x')+h(x')$ for every transition from $x$ to $x'$. We construct the levels $\{\mathcal{X}_r,\mathcal{U}_r,c_r\}$, initial state $x$, and goal region $\mathcal{R}$ required for running AMRA* from the scene graph $\mathcal{G}$ and the automaton $\mathcal{M}_{\phi_\mu}$ in Sec. IV-B. We define two heuristics, $h_{\textsc{LTL}}$, which is used in the anchor level and the other levels, and $h_{\textsc{LLM}}$, used only in the other levels, in Sec. IV-C and Sec. IV-D, respectively, and prove that $h_{\textsc{LTL}}$ is consistent.

IV-B Hierarchical Planning Domain Description

Given the scene graph $\mathcal{G}$ with initial node $s_1\in\mathcal{V}$ and automaton $\mathcal{M}_{\phi_\mu}$, we construct a hierarchical planning domain with $K$ levels corresponding to the scene graph attributes $\mathcal{A}_k$. Given an attribute set $\mathcal{A}_k$, we define $\mathcal{V}_k:=\cup_{a\in\mathcal{A}_k}\mathcal{V}_a$, where $\mathcal{V}_a$ is the node set associated with attribute $a\in\mathcal{A}_k$. Then, the node set corresponding to level $k$ of the planning domain is defined as $\mathcal{X}_k:=\mathcal{V}_k\times\mathcal{Q}$. We define the actions $\mathcal{U}_k$ as transitions from $x_i$ to $x_j$ in $\mathcal{X}_k$ with associated cost $c_k(x_i,x_j)$. A transition from $x_i=(s_i,q_i)$ to $x_j=(s_j,q_j)$ exists if the following conditions are satisfied:

  1. the transition is from $\mathcal{V}_a$ of attribute $a$ to the boundary $\partial\mathcal{V}_b$ of attribute $b$, i.e., $s_i\in\mathcal{V}_a$ and $s_j\in\partial\mathcal{V}_b$ for $a\neq b$ with $a,b\in\mathcal{A}_k$ and $\text{int}\,\mathcal{V}_a\cap\text{int}\,\mathcal{V}_b=\emptyset$;

  2. the automaton transitions are respected: $q_j=T(q_i,\ell(s_i))$, where $\ell(s_i)\in 2^{\mathcal{AP}}$ is the label at $s_i$;

  3. the transition is along the shortest path, $s_j\in\arg\min_s d(s_i,s)$, where $d:\mathcal{V}\times\mathcal{V}\rightarrow\mathbb{R}$ is the length of the shortest feasible path between two scene graph nodes.

The transition cost is defined as $c_k(x_i,x_j)=d(s_i,s_j)$. A minimal construction sketch is given below.
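Concretely, under the `SceneGraph` and `Automaton` encodings assumed earlier, one level might be built as follows (`boundary_of` is a hypothetical helper returning $\partial\mathcal{V}_b$, and condition 1's disjoint-interior test is omitted for brevity):

```python
import itertools

def build_level(G, M, k, d):
    """Actions U_k of level k: attribute-to-attribute shortcuts on X_k = V_k x Q.
    `d` is a shortest-feasible-path oracle on the scene graph."""
    V_k = set().union(*G.attributes[k].values())
    actions = []                                   # (x_i, x_j, cost) triples
    for s_i, q_i in itertools.product(V_k, M.Q):
        q_j = M.T[(q_i, label(G, s_i))]            # condition 2: respect the automaton
        for b, V_b in G.attributes[k].items():
            if s_i in V_b:
                continue                           # condition 1: move to a different attribute
            s_j = min(boundary_of(G, V_b), key=lambda s: d(s_i, s))  # condition 3
            actions.append(((s_i, q_i), (s_j, q_j), d(s_i, s_j)))    # c_k = d(s_i, s_j)
    return actions
```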

Figure 3: Four-level hierarchical planning domain for Benevolence.

Since $\mathcal{X}_k\subseteq\mathcal{V}\times\mathcal{Q}$, we can define the AMRA* anchor level as $\mathcal{X}_0=\mathcal{V}\times\mathcal{Q}$ with actions $\mathcal{U}_0$ derived from the scene graph edges $\mathcal{E}$, the automaton transitions, and $\cup_{k>0}\mathcal{U}_k$. The initial state and goal region are defined as $x=(s_1,q_1)$ and $\mathcal{R}=\mathcal{V}\times\mathcal{F}$.

The hierarchical planning domain is illustrated in Fig. 3. Four levels, occupancy ($\mathcal{V}$), objects ($\mathcal{A}_1$), rooms ($\mathcal{A}_2$), and floors ($\mathcal{A}_3$), are used in our experiments. For example, at the object level, the agent can take an action to move directly from the couch to the TV, with the transition cost computed as the shortest path in the occupancy level.

IV-C LTL Heuristic

To ensure optimal AMRA* planning, a consistent heuristic is required at the anchor level. Inspired by the admissible but inconsistent heuristic in [29], we design a consistent LTL heuristic function $h_{\textsc{LTL}}$ that approximates the scene graph distance to $\mathcal{R}=\mathcal{V}\times\mathcal{F}$ using information from the automaton $\mathcal{M}_{\phi_\mu}$ and the scene graph attributes. We require that the scene graph cost $c$ satisfies the triangle inequality, i.e., $c(s,s')\leq c(s,s'')+c(s'',s')$ for any $s,s',s''\in\mathcal{V}$. Then, we define the cost between two labels $l_1,l_2\in 2^{\mathcal{AP}}$ as

$$c_\ell(l_1,l_2)=\min_{s_1,s_2:\,\ell(s_1)=l_1,\,\ell(s_2)=l_2} c(s_1,s_2), \quad (3)$$

which is a lower bound on the transition cost from $l_1$ to $l_2$ in the metric space of the cost function $c$. Next, we define a lower bound on the transition cost from automaton state $q\in\mathcal{Q}$ with label $l\in 2^{\mathcal{AP}}$ to an accepting state $q_f\in\mathcal{F}$ as:

$$g(l,q)=\min_{l'\in 2^{\mathcal{AP}}} c_\ell(l,l')+g(l',T(q,l')). \quad (4)$$

The function $g:2^{\mathcal{AP}}\times\mathcal{Q}\mapsto\mathbb{R}_{\geq 0}$ can be pre-computed via Dijkstra's algorithm on the automaton $\mathcal{M}_{\phi_\mu}$, with the boundary condition $g(l,q)=0$ for all $q\in\mathcal{F}$. We also define a next-label function $\ell_n:2^{\mathcal{AP}}\times\mathcal{Q}\mapsto 2^{\mathcal{AP}}$ that tracks the least-cost label sequence returned by Dijkstra's algorithm.
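A sketch of the pre-computation, run as a backward Dijkstra from the accepting states over (label, state) pairs; `labels` is the set of labels occurring in the scene graph and `c_l` implements Eq. (3) (our own transcription, not the paper's code):

```python
import heapq, itertools

def precompute_g(M, labels, c_l):
    """Compute g(l, q) of Eq. (4) for all labels l and automaton states q."""
    g = {(l, q): float("inf") for l in labels for q in M.Q}
    tie = itertools.count()                 # tie-breaker so the heap never compares labels
    pq = []
    for q in M.final:                       # boundary condition: g(l, q) = 0 for q in F
        for l in labels:
            g[(l, q)] = 0.0
            heapq.heappush(pq, (0.0, next(tie), l, q))
    while pq:
        cost, _, l, q = heapq.heappop(pq)
        if cost > g[(l, q)]:
            continue                        # stale queue entry
        # Relax predecessors: from (l', q'), producing label l next costs
        # c_l(l', l) and moves the automaton to T(q', l), as in Eq. (4).
        for lp in labels:
            for qp in M.Q:
                if M.T[(qp, l)] == q and c_l(lp, l) + cost < g[(lp, qp)]:
                    g[(lp, qp)] = c_l(lp, l) + cost
                    heapq.heappush(pq, (g[(lp, qp)], next(tie), lp, qp))
    return g
```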

Proposition 1.

The heuristic function $h_{\textsc{LTL}}:\mathcal{V}\times\mathcal{Q}\rightarrow\mathbb{R}$ defined below is consistent:

$$h_{\textsc{LTL}}(s,q)=\min_{t\in\mathcal{V}}\big[c(s,t)+g(\ell(t),T(q,\ell(t)))\big],\qquad h_{\textsc{LTL}}(s,q)=0,\ \ \forall q\in\mathcal{F}. \quad (5)$$
Proof.

Consider a state $x=(s_x,q_x)$ with label $l_x=\ell(s_x)$. For any $y=(s_y,q_y)$ such that $s_y$ is reachable from $s_x$ in one step, there are two cases to handle. When $T(q_x,\ell(s_y))=q_y=q_x$:

$$\begin{aligned}
h_{\textsc{LTL}}(s_x,q_x) &\leq \min_{t\in\mathcal{V}}\big[c(s_x,s_y)+c(s_y,t)+g(\ell(t),T(q_x,\ell(t)))\big]\\
&= c(s_x,s_y)+\min_{t\in\mathcal{V}}\big[c(s_y,t)+g(\ell(t),T(q_y,\ell(t)))\big]\\
&= c(s_x,s_y)+h_{\textsc{LTL}}(s_y,q_y).
\end{aligned}$$

When $T(q_x,l_y)=q_y\neq q_x$, with $l_y=\ell(s_y)$, we have:

$$\begin{aligned}
&c(s_x,s_y)+h_{\textsc{LTL}}(s_y,q_y)\\
&= c(s_x,s_y)+\min_{t\in\mathcal{V}}\big[c(s_y,t)+g(\ell(t),T(q_y,\ell(t)))\big]\\
&\geq \min_{t\in\mathcal{V},\,\ell(t)=l_y} c(s_x,t)+\min_{t\in\mathcal{V}}\big[c(s_y,t)+g(\ell(t),T(q_y,\ell(t)))\big]\\
&\geq \min_{t\in\mathcal{V},\,\ell(t)=l_y} c(s_x,t)+\min_{l'\in 2^{\mathcal{AP}}}\big[c_\ell(l_y,l')+g(l',T(q_y,l'))\big]\\
&= \min_{t\in\mathcal{V},\,\ell(t)=l_y} c(s_x,t)+g(l_y,q_y)\\
&\geq \min_{t\in\mathcal{V}}\big[c(s_x,t)+g(\ell(t),T(q_x,\ell(t)))\big] = h_{\textsc{LTL}}(s_x,q_x). \qquad\blacksquare
\end{aligned}$$
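With $g$ pre-computed, evaluating the heuristic is a direct transcription of Eq. (5). The sketch below minimizes over all nodes for clarity; a practical implementation would cache or restrict the minimization:

```python
def h_ltl(s, q, G, M, g, c):
    """Consistent LTL heuristic of Eq. (5); c(s, t) is the scene-graph metric.
    Assumes g was produced by precompute_g over the labels occurring in G."""
    if q in M.final:
        return 0.0
    return min(c(s, t) + g[(label(G, t), M.T[(q, label(G, t))])] for t in G.nodes)
```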

IV-D LLM Heuristic

In this section, we seek to assist AMRA* by developing an LLM heuristic $h_{\textsc{LLM}}:\mathcal{V}\times\mathcal{Q}\rightarrow\mathbb{R}$ that captures the hierarchical semantic information of the scene graph. The LLM heuristic uses all attributes at a node $s\in\mathcal{V}$, the current automaton state $q\in\mathcal{Q}$, and the attribute hierarchy $\bar{\mathcal{G}}$, and returns attribute-based guidance that helps AMRA* search in the right direction for an optimal path. We design the prompt to ask for LLM attribute-based guidance with 4 components as follows:

  • an environment description from its attribute hierarchy $\bar{\mathcal{G}}$,

  • a list of motions $M=\{m_i(\cdot,\cdot)\}$, where $m_i(a_j,a_k)$, $a_j\in\mathcal{A}_j$, $a_k\in\mathcal{A}_k$, describes a movement on $\bar{\mathcal{G}}$ from attribute $a_j$ to attribute $a_k$ that the LLM uses to generate its guidance,

  • an example mission $\mu_{\text{unique}}$ and how to respond,

  • a description of the mission $\mu_{\text{unique}}$, the current attributes, the remaining task given the automaton state $q\in\mathcal{Q}$, and a request for guidance on how to finish the task.

The LLM returns a sequence of function calls $\{f_i(a_j,a_k)\}_{i=0}^{N}$, $f_i\in M$, $a_j\in\mathcal{A}_j$, $a_k\in\mathcal{A}_k$, in XML format, which eases response parsing [52]. Each function call returns a user-defined cost, e.g., the Euclidean distance between attributes: $f_i(a_j,a_k)=c(s_j,s_k)$, where $s_j,s_k$ are the centers of $\mathcal{V}_{a_j}$ and $\mathcal{V}_{a_k}$, respectively. The total cost of the LLM function calls is used as the LLM heuristic $h_{\textsc{LLM}}(s,q)=\sum_{i=0}^{N}f_i(a_j,a_k)$. Due to the LLM query delay and its limited query rate, the sequence of function calls suggested by the LLM is obtained and stored offline, then used to calculate the heuristic $h_{\textsc{LLM}}$ online in AMRA* based on the user-defined cost.
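A hedged sketch of the online evaluation, where `cached_calls` stores the offline LLM suggestions keyed by the attributes at $s$ and the automaton state; `current_attributes` and `centers` are hypothetical helpers, not structures from the paper:

```python
import math

def h_llm(s, q, cached_calls, centers):
    """h_LLM(s, q): total user-defined cost of the cached LLM function calls."""
    calls = cached_calls.get((current_attributes(s), q), [])  # [(a_j, a_k), ...]
    total = 0.0
    for a_j, a_k in calls:
        (x1, y1), (x2, y2) = centers[a_j], centers[a_k]  # centers of V_aj and V_ak
        total += math.hypot(x2 - x1, y2 - y1)            # Euclidean call cost
    return total
```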

Figure 4: ChatGPT prompt requesting a scene graph path.
Figure 5: The automaton graph $T$ for the mission "go to the bedroom 2, then visit the kitchen 3, reach the oven 11, and always avoid the TV 9" with initial node $q_1=4$ and accepting node 0.

Fig. 4 illustrates a sample prompt and response. The prompt first describes how the attributes are connected based on the attribute hierarchy $\bar{\mathcal{G}}$. Each attribute is mentioned with a unique ID to avoid confusing the LLM, as shown in the first paragraph of the prompt in Fig. 4. The second part of the prompt provides a list of possible functions used to guide the agent, such as $move(1,2)$ to move from room 1 to room 2, or $reach(1,3)$ to reach object 3 in room 1. The third component provides an example of a mission and how the LLM should respond. The last component describes the current attributes and the remaining mission, generated based on the current automaton state $q$, and requests that the LLM generate a high-level plan using the provided functions.

The automaton state $q$ represents how much of the mission has been achieved. Thus, to describe the remaining mission, we run Dijkstra's algorithm on the automaton to find the shortest path from $q$ to an accepting state in $\mathcal{F}$. We obtain the set of atomic propositions that evaluate true along the path, and concatenate their descriptions to describe the remaining mission in the prompt. For example, suppose the desired mission is to "go to the bedroom 2, then visit the kitchen 3, reach the oven 11, while always avoiding the TV 9". Let the atomic propositions be defined as $p_2$, $p_3$, $p_{11}$, and $p_9$, where the indices correspond to the IDs of the rooms or objects.

The task can be described in LTL as $\phi=\mathbf{F}(p_2\wedge\mathbf{F}(p_3\wedge\mathbf{F}p_{11}))\wedge\neg p_9$, whose automaton graph $T$, generated by Spot [51], is shown in Fig. 5 with initial state $q_1=4$ and final states $\mathcal{F}=\{0\}$. Suppose the agent is currently in room 1 and has already visited room 2, i.e., $q=2$ on $T$. The shortest path from $q$ to the accepting state 0 is marked by red arrows in Fig. 5. Along this path, $p_3$ and $p_{11}$ become $\mathsf{true}$, so the remaining mission is to "visit the kitchen 3 and reach the oven 11" (Fig. 4). The atomic proposition $p_9$ leads to a sink state 5 if it evaluates $\mathsf{true}$ and never appears in the remaining mission, leading to an optimistic LLM heuristic.
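Since the automaton edges have unit weight here, the shortest path can be found by breadth-first search. The sketch below (our own illustration; `prop_text` maps proposition IDs to phrases) extracts the remaining-mission description from the automaton:

```python
from collections import deque

def remaining_mission(M, q, prop_text):
    """Describe the remaining mission: BFS from q to an accepting state and
    concatenate descriptions of propositions that turn true along the way."""
    queue, prev = deque([q]), {q: None}
    goal = None
    while queue:
        u = queue.popleft()
        if u in M.final:
            goal = u
            break
        for (u2, l), v in M.T.items():        # transitions (state, label) -> state
            if u2 == u and v not in prev:
                prev[v] = (u, l)
                queue.append(v)
    if goal is None:
        return None                            # mission cannot be completed from q
    phrases, v = [], goal
    while prev[v] is not None:                 # walk back, collecting labels
        v, l = prev[v]
        phrases = [prop_text[p] for p in l if p in prop_text] + phrases
    return ", then ".join(phrases)
```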

TABLE II: Example mission descriptions for each scene.

Allensville: Clean all vases in the dining room. Eventually water the potted plants in the bathrooms one after another.
Benevolence: Visit the bathroom on floor 0 and avoid the sink, then go to the dining room and sit on a chair. Always avoid the living room and the staircase next to it.
Collierville: Clean all the corridors, except the one on floor 0.
Figure 6: Optimal path for the Benevolence mission shown in Table II.
Figure 7: Path cost vs planning time for different AMRA* variants.
Figure 8: Number of node expansions vs. planning iteration. Each sub-figure corresponds to one hierarchy level.
TABLE III: First feasible path computation time, relative cost between first and optimal path, and optimal path computation time, averaged over 5 initial conditions for mission 1 in Benevolence.

Setup | 1st iter. time (s) | 1st iter. $cost/cost_{opt}$ | opt. time (s)
ALL | 0.0007 | 1.3763 | 8.9062
OCC | 0.0244 | 1.0827 | 8.3144
OBJ | 3.7460 | 1.0387 | 24.936
ROOM | 3.6878 | 1.0306 | 13.1352
FLR | 3.8106 | 1.0369 | 13.3287
NO-LLM | 3.3260 | 1.0318 | 24.2516
A* | 1.1202 | 1.0997 | 11.7594
TABLE IV: First path computation time (s), relative cost between first and optimal path, and optimal path computation time (s), averaged over each scene's 5 missions and 5 initial positions per mission. Each cell reports first-path time / relative cost / optimal-path time.

Setup | Allensville (1-floor) | Benevolence (3-floor) | Collierville (3-floor)
ALL | 0.24 / 1.69 / 3.50 | 0.32 / 1.24 / 11.82 | 0.009 / 1.26 / 5.48
NO-LLM | 0.36 / 1.34 / 6.33 | 2.82 / 1.18 / 21.57 | 1.16 / 1.38 / 8.13
A* | 0.15 / 1.08 / 3.23 | 1.33 / 1.16 / 14.22 | 1.19 / 1.06 / 7.04

V EVALUATION

To evaluate our method, we use Allensville (1 floor), Benevolence (3 floors), and Collierville (3 floors) from the 3D Scene Graph dataset [11]. For each scene, we designed 5 missions (examples are shown in Table II). For each mission, we used 5 initial positions across different floors and rooms.

We use GPT-4 [53] to translate missions to LTL formulas, and Spot [51] to convert the LTL formulas to automata as described in Sec. III. Following Sec. IV-D, we use GPT-4 [53] to generate the LLM heuristic function $h_{\textsc{LLM}}$. Given a scene graph $\mathcal{G}$, the mission described by the automaton $\mathcal{M}_\phi$, the LTL heuristic $h_{\textsc{LTL}}$, and the LLM heuristic $h_{\textsc{LLM}}$, we construct the hierarchical planning domain and run AMRA*.

Fig. 6 shows a path in Benevolence for the mission in Table II. Starting from the empty room on floor 0, the agent goes to the bathroom entrance without approaching the sink, and then proceeds upstairs to floor 1, finally reaching a chair in the dining room without entering the living room.

Since planning domains with different hierarchy levels and different heuristics per level can be constructed, we compare different setups to investigate the benefit of our LLM heuristic. With all hierarchy levels having an LTL heuristic, we design 7 setups: occupancy level only without the LLM heuristic (A*), all levels available but without LLM heuristics (NO-LLM), all levels with LLM heuristics (ALL), and exactly one level with an LLM heuristic (OCC, OBJ, ROOM, FLR). Fig. 7 shows that the ALL setup returns a feasible path much faster than the others thanks to LLM guidance across all hierarchy levels, while it also approaches the optimal solution within a similar time span.

As an anytime algorithm, AMRA* starts searching with a large weight on the heuristics, then reuses the results with smaller heuristic weights. As the planning iterations increase, the path improves. When the heuristic weight decays to 1, we obtain an optimal path. To further investigate the benefits of the LLM heuristic, we compare the number of node expansions per planning iteration. As shown in Fig. 8, applying our LLM heuristic at any hierarchy level reduces node expansions significantly, which indicates that our LLM heuristic produces insightful guidance that accelerates AMRA*. An encouraging observation is that the more hierarchy levels the LLM heuristic is used in, the more efficient the algorithm becomes. This encourages future research to exploit scene semantic information further to accelerate planning.

AMRA* allows a robot to start executing the first feasible path while path optimization proceeds in the background. Tables III and IV show the time required to compute the first path, the path cost relative to the optimal path, and the time required to find an optimal path. The ALL configuration outperforms the other setups in the speed of finding both the first path and the optimal path as the scene becomes more complex.

VI CONCLUSION

We demonstrated that an LLM can provide symbolic grounding, LTL translation, and semantic guidance for natural language missions in scene graphs. This information allowed us to construct a hierarchical planning domain and achieve efficient planning with LLM heuristic guidance. We ensured optimality via multi-heuristic planning, including a consistent LTL heuristic. Our experiments show that LLM guidance is useful at all levels of the hierarchy for accelerating feasible path generation.

References

  • [1] J. McCormac, A. Handa, A. Davison, and S. Leutenegger, “Semanticfusion: Dense 3d semantic mapping with convolutional neural networks,” in IEEE International Conference on Robotics and Automation (ICRA), 2017, pp. 4628–4635.
  • [2] S. Bowman, N. Atanasov, K. Daniilidis, and G. Pappas, “Probabilistic Data Association for Semantic SLAM,” in IEEE International Conference on Robotics and Automation (ICRA), 2017.
  • [3] J. Dong, X. Fei, and S. Soatto, “Visual-inertial-semantic scene representation for 3d object detection,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 3567–3577.
  • [4] M. Shan, Q. Feng, and N. Atanasov, “OrcVIO: Object residual constrained Visual-Inertial Odometry,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020.
  • [5] A. Rosinol, A. Violette, M. Abate, N. Hughes, Y. Chang, J. Shi, A. Gupta, and L. Carlone, “Kimera: From SLAM to spatial perception with 3D dynamic scene graphs,” The International Journal of Robotics Research (IJRR), vol. 40, no. 12-14, pp. 1510–1546, 2021.
  • [6] E. Zobeidi, A. Koppel, and N. Atanasov, “Dense Incremental Metric-Semantic Mapping for Multi-Agent Systems via Sparse Gaussian Process Regression,” IEEE Transactions on Robotics (T-RO), vol. 38, no. 5, pp. 3133–3153, 2022.
  • [7] A. Asgharivaskasi and N. Atanasov, “Semantic OcTree Mapping and Shannon Mutual Information Computation for Robot Exploration,” IEEE Transactions on Robotics (T-RO), vol. 39, no. 3, pp. 1910–1928, 2023.
  • [8] J. Wilson, Y. Fu, A. Zhang, J. Song, A. Capodieci, P. Jayakumar, K. Barton, and M. Ghaffari, “Convolutional Bayesian Kernel Inference for 3D Semantic Mapping,” in IEEE International Conference on Robotics and Automation (ICRA), 2023, pp. 8364–8370.
  • [9] S. Zhi, M. Bloesch, S. Leutenegger, and A. J. Davison, “SceneCode: Monocular Dense Semantic Reconstruction Using Learned Encoded Scene Representations,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • [10] C. Estrada, J. Neira, and J. Tardos, “Hierarchical SLAM: real-time accurate mapping of large environments,” IEEE Transactions on Robotics, vol. 21, no. 4, pp. 588–596, 2005.
  • [11] I. Armeni, Z.-Y. He, J. Gwak, A. R. Zamir, M. Fischer, J. Malik, and S. Savarese, “3D Scene Graph: A Structure for Unified Semantics, 3D Space, and Camera,” in IEEE International Conference on Computer Vision (ICCV), 2019, pp. 5664–5673.
  • [12] A. Rosinol, A. Gupta, M. Abate, J. Shi, and L. Carlone, “3D Dynamic Scene Graphs: Actionable Spatial Perception with Places, Objects, and Humans,” in Robotics: Science and Systems (RSS), 2020.
  • [13] N. Hughes, Y. Chang, and L. Carlone, “Hydra: A real-time spatial perception system for 3D scene graph construction and optimization,” in Robotics: Science and Systems (RSS), 2022.
  • [14] J. Crespo, J. C. Castillo, O. M. Mozos, and R. Barber, “Semantic information for robot navigation: A survey,” Applied Sciences, vol. 10, no. 2, 2020.
  • [15] S. Tellex, T. Kollar, S. Dickerson, M. R. Walter, A. G. Banerjee, S. Teller, and N. Roy, “Approaching the symbol grounding problem with probabilistic graphical models,” AI Magazine, vol. 32, no. 4, pp. 64–76, 2011.
  • [16] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, “Language models are few-shot learners,” Advances in neural information processing systems (NeurIPS), vol. 33, pp. 1877–1901, 2020.
  • [17] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint:1810.04805, 2018.
  • [18] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample, “LLaMA: Open and Efficient Foundation Language Models,” arXiv preprint:2302.13971, 2023.
  • [19] W. Chen, S. Hu, R. Talak, and L. Carlone, “Extracting zero-shot common sense from large language models for robot 3D scene understanding,” arXiv preprint:2206.04585, 2022.
  • [20] ——, “Leveraging large language models for robot 3D scene understanding,” arXiv preprint arXiv:2209.05629, 2022.
  • [21] D. Shah, B. Osiński, B. Ichter, and S. Levine, “LM-Nav: Robotic navigation with large pre-trained models of language, vision, and action,” in Conference on Robot Learning (CoRL), 2022, pp. 492–504.
  • [22] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning (ICML), 2021, pp. 8748–8763.
  • [23] C. Belta, A. Bicchi, M. Egerstedt, E. Frazzoli, E. Klavins, and G. J. Pappas, “Symbolic planning and control of robot motion [grand challenges of robotics],” IEEE Robotics and Automation Magazine, vol. 14, no. 1, pp. 61–70, 2007.
  • [24] H. Kress-Gazit, G. E. Fainekos, and G. J. Pappas, “Temporal-Logic-Based Reactive Mission and Motion Planning,” IEEE Transactions on Robotics, vol. 25, no. 6, pp. 1370–1381, 2009.
  • [25] Y. Chen, J. Arkin, Y. Zhang, N. Roy, and C. Fan, “AutoTAMP: Autoregressive task and motion planning with LLMs as translators and checkers,” arXiv preprint:2306.06531, 2023.
  • [26] O. Maler and D. Nickovic, “Monitoring temporal properties of continuous signals,” in Formal Techniques, Modelling and Analysis of Timed and Fault-Tolerant Systems, 2004, pp. 152–166.
  • [27] T. Wongpiromsarn, U. Topcu, N. Ozay, H. Xu, and R. M. Murray, “TuLiP: a software toolbox for receding horizon temporal logic planning,” in International Conference on Hybrid Systems: Computation and Control (HSCC), 2011, pp. 313–314.
  • [28] C. I. Vasile and C. Belta, “Sampling-based temporal logic path planning,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2013, pp. 4817–4822.
  • [29] J. Fu, N. Atanasov, U. Topcu, and G. J. Pappas, “Optimal temporal logic planning in probabilistic semantic maps,” in IEEE International Conference on Robotics and Automation (ICRA), 2016, pp. 3690–3697.
  • [30] P. Purohit and I. Saha, “DT*: Temporal logic path planning in a dynamic environment,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2021, pp. 3627–3634.
  • [31] Y. Kantaros and M. M. Zavlanos, “STyLuS*: A temporal logic optimal control synthesis algorithm for large-scale multi-robot systems,” The International Journal of Robotics Research (IJRR), vol. 39, no. 7, pp. 812–836, 2020.
  • [32] M. Fox and D. Long, “PDDL2.1: An extension to PDDL for expressing temporal planning domains,” Journal of Artificial Intelligence Research, vol. 20, pp. 61–124, 2003.
  • [33] J. Kim, A. Desrochers, and A. Sanderson, “Task planning and project management using petri nets,” in IEEE International Symposium on Assembly and Task Planning, 1995, pp. 265–271.
  • [34] P. Lv, G. Luo, Z. Ma, S. Li, and X. Yin, “Optimal multi-robot path planning for cyclic tasks using petri nets,” Control Engineering Practice, vol. 138, p. 105600, 2023.
  • [35] S. Karaman, S. Rasmussen, D. Kingston, and E. Frazzoli, “Specification and planning of uav missions: a process algebra approach,” in American Control Conference (ACC), 2009, pp. 1442–1447.
  • [36] F. Xia, A. R. Zamir, Z.-Y. He, A. Sax, J. Malik, and S. Savarese, “Gibson Env: Real-World Perception for Embodied Agents,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [37] D. M. Saxena, T. Kusnur, and M. Likhachev, “AMRA*: Anytime multi-resolution multi-heuristic A*,” in IEEE International Conference on Robotics and Automation (ICRA), 2022, pp. 3371–3377.
  • [38] P. E. Hart, N. J. Nilsson, and B. Raphael, “A formal basis for the heuristic determination of minimum cost paths,” IEEE Transactions on Systems Science and Cybernetics, vol. 4, no. 2, pp. 100–107, 1968.
  • [39] S. Aine, S. Swaminathan, V. Narayanan, V. Hwang, and M. Likhachev, “Multi-heuristic A*,” The International Journal of Robotics Research (IJRR), vol. 35, no. 1-3, pp. 224–243, 2016.
  • [40] J. D. Gammell, S. S. Srinivasa, and T. D. Barfoot, “Informed RRT*: Optimal sampling-based path planning focused via direct sampling of an admissible ellipsoidal heuristic,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2014, pp. 2997–3004.
  • [41] J. D. Gammell, T. D. Barfoot, and S. S. Srinivasa, “Batch Informed Trees (BIT*): Informed asymptotically optimal anytime search,” The International Journal of Robotics Research (IJRR), vol. 39, no. 5, pp. 543–567, 2020.
  • [42] W. Du, F. Islam, and M. Likhachev, “Multi-resolution A*,” in International Symposium on Combinatorial Search, vol. 11, no. 1, 2020, pp. 29–37.
  • [43] D. Shah, M. Equi, B. Osiński, F. Xia, B. Ichter, and S. Levine, “Navigation with large language models: Semantic guesswork as a heuristic for planning,” in Conference on Robot Learning (CoRL), 2023.
  • [44] I. Kostavelis, K. Charalampous, A. Gasteratos, and J. K. Tsotsos, “Robot navigation via spatial and temporal coherent semantic maps,” Engineering Applications of Artificial Intelligence, vol. 48, pp. 173–187, 2016.
  • [45] S. Amiri, K. Chandan, and S. Zhang, “Reasoning with scene graphs for robot planning under partial observability,” IEEE Robotics and Automation Letters (RAL), vol. 7, no. 2, pp. 5560–5567, 2022.
  • [46] Z. Ravichandran, L. Peng, N. Hughes, J. D. Griffith, and L. Carlone, “Hierarchical representations and explicit memory: Learning effective navigation policies on 3d scene graphs using graph neural networks,” in IEEE International Conference on Robotics and Automation (ICRA), 2022, pp. 9272–9279.
  • [47] A. Pnueli, “The temporal semantics of concurrent programs,” Theoretical computer science, vol. 13, no. 1, pp. 45–60, 1981.
  • [48] O. Kupferman and M. Y. Vardi, “Model checking of safety properties,” Formal methods in system design, vol. 19, pp. 291–314, 2001.
  • [49] M. Kwiatkowska, G. Norman, and D. Parker, “Prism: Probabilistic symbolic model checker,” in Computer Performance Evaluation: Modelling Techniques and Tools, 2002.
  • [50] J. Klein, C. Baier, P. Chrszon, M. Daum, C. Dubslaff, S. Klüppelholz, S. Märcker, and D. Müller, “Advances in probabilistic model checking with PRISM: variable reordering, quantiles and weak deterministic büchi automata,” International Journal on Software Tools for Technology Transfer, vol. 20, no. 2, pp. 179–194, 2018.
  • [51] A. Duret-Lutz, E. Renault, M. Colange, F. Renkin, A. G. Aisse, P. Schlehuber-Caissier, T. Medioni, A. Martin, J. Dubois, C. Gillard, and H. Lauko, “From Spot 2.0 to Spot 2.10: What’s new?” in International Conference on Computer Aided Verification (CAV), 2022, pp. 174–187.
  • [52] S. Vemprala, R. Bonatti, A. Bucker, and A. Kapoor, “ChatGPT for robotics: Design principles and model abilities,” Microsoft, Tech. Rep. MSR-TR-2023-8, February 2023. [Online]. Available: https://www.microsoft.com/en-us/research/publication/chatgpt-for-robotics-design-principles-and-model-abilities/
  • [53] OpenAI, “GPT-4 Technical Report,” arXiv preprint:2303.08774, 2023.