
SayNav: Grounding Large Language Models
for Dynamic Planning to Navigation in New Environments

Abhinav Rajvanshi1, Karan Sikka1, Xiao Lin1, Bhoram Lee1, Han-Pang Chiu1, Alvaro Velasquez2
Abstract

Semantic reasoning and dynamic planning capabilities are crucial for an autonomous agent to perform complex navigation tasks in unknown environments. Succeeding in these tasks requires a large amount of common-sense knowledge that humans possess. We present SayNav, a new approach that leverages human knowledge from Large Language Models (LLMs) for efficient generalization to complex navigation tasks in unknown large-scale environments. SayNav uses a novel grounding mechanism that incrementally builds a 3D scene graph of the explored environment and feeds it to LLMs, for generating feasible and contextually appropriate high-level plans for navigation. The LLM-generated plan is then executed by a pre-trained low-level planner that treats each planned step as a short-distance point-goal navigation sub-task. SayNav dynamically generates step-by-step instructions during navigation and continuously refines future steps based on newly perceived information. We evaluate SayNav on the multi-object navigation (MultiON) task, which requires the agent to utilize a massive amount of human knowledge to efficiently search for multiple different objects in an unknown environment. We also introduce a benchmark dataset for the MultiON task using the ProcTHOR framework, which provides large photo-realistic indoor environments with a variety of objects. SayNav achieves state-of-the-art results and even outperforms an oracle-based baseline with strong ground-truth assumptions by more than 8% in terms of success rate, highlighting its ability to generate dynamic plans for successfully locating objects in large-scale new environments. The code, benchmark dataset and demonstration videos are accessible at https://www.sri.com/ics/computer-vision/saynav.

Figure 1: A SayNav example: The robot uses an LLM-based planner to efficiently find one target object (a laptop) in a new house.

1 Introduction

Finding multiple target objects in a novel environment is a relatively easy task for a human but a daunting task for an autonomous agent. Given such a task, humans are able to leverage common-sense priors like room layouts and plausible object placement to infer likely locations of objects. For example, there are higher chances of finding a pillow on the bed in the bedroom and a spoon on the dining table or in the kitchen. Humans are also capable of dynamically planning and adjusting their search strategies and actions based on new visual observations during exploration in a new environment. For example, a human would search for a spoon first instead of a pillow if entering a kitchen.

Such reasoning and dynamic planning capabilities are essential for an autonomous agent to accomplish complex navigation tasks in novel settings, such as searching for and locating specific objects in new houses. However, current learning-based methods, with the most popular being deep reinforcement learning (DRL) (Anderson et al. 2018; Chaplot et al. 2020; Khandelwal et al. 2022), require massive amounts of training for the agent to achieve reasonable performance even on simpler navigation tasks, such as finding a single object (object-goal navigation) or reaching a single target point (point-goal navigation) (Anderson et al. 2018). Moreover, significant computational resources are needed to replicate the human ability to generalize to new environments. Such computational demands impede the development of autonomous agents that can efficiently conduct complex tasks in unknown places.

In this paper, we propose SayNav – a new approach to leverage common-sense knowledge from Large Language Models (LLMs) for efficient generalization to complicated navigation tasks in unknown large-scale environments. Recently, agents equipped with LLM-based planners have shown remarkable capabilities to conduct complex manipulation tasks with only a few training samples (Ahn et al. 2022; Song et al. 2022). SayNav follows this trend of utilizing LLMs in developing generalist planning agents specifically for navigation tasks. To fully demonstrate and validate SayNav’s capabilities, we choose a complex navigation task, multi-object navigation (MultiON). For this task, the agent needs to efficiently explore a new 3D environment to locate multiple different objects given the names of these objects. This task requires a large amount of prior knowledge and dynamic planning capabilities (similar to humans) for success.

The MultiON task has emerged recently as a generalization of the object-goal navigation task. It was introduced by (Wani et al. 2020) as a task of “navigation to an ordered sequence of objects”, and most subsequent works on MultiON follow the same definition. (Gireesh et al. 2023) dropped the constraint of an ordered sequence and defined the goal as localizing a certain number of objects in any order. This is also how we define our MultiON task, as it poses much larger planning challenges and task complexities than the previous definition. Following the trend in MultiON work, we use three different objects as the targets for each episode in our experiments. Note that SayNav is capable of searching for any number of objects in the environment.

The key innovation of SayNav is to incrementally build and expand a 3D scene graph of the new environment using perceived information during exploration. SayNav then grounds LLMs with this graph to produce feasible and contextually appropriate knowledge, which the agent uses for navigation. This new grounding mechanism ensures that LLMs adhere to the physical constraints of the new environment, including the spatial layouts and geometric relationships among perceived entities. 3D scene graphs (Armeni et al. 2019; Kim et al. 2019; Rosinol et al. 2021; Hughes et al. 2022; Wald et al. 2020; Wu et al. 2021) have recently emerged as powerful high-level representations of 3D large-scale environments to support real-time decisions in robotics. A 3D scene graph is a layered graph that represents spatial concepts (nodes) at multiple levels of abstraction (such as objects, places, rooms, and buildings) along with their relations (edges). We utilize this 3D scene graph representation to ground LLMs in the current environment, and it is used to continuously build and refine the search plan for the agent during navigation.

Specifically, SayNav utilizes LLMs to generate step-by-step instructions on the fly for locating target objects during navigation. To ensure feasible and effective planning in a dynamic manner, SayNav continuously extracts and converts a subgraph (from the current 3D scene graph) into a textual prompt to be fed to the LLMs. This extracted subgraph includes spatial concepts in the local region centered around the current position of the agent. The LLM then plans the next steps based on this subgraph, such as inferring likely locations for the target object and prioritizing them. This plan also includes conditional statements and fallback options for when any of the steps fails to achieve the goal. For example, if the agent is not able to find the laptop on the desk, it will move on to the next likely location (the bed) in the bedroom.

SayNav also leverages LLMs to augment and refine the scene graph during navigation, such as annotating the room type based on currently perceived objects. This improves the hierarchical organization of semantic information in the scene graph, which supports better planning. SayNav computes the feasibility of completing the current goal based on the room type, which helps to better optimize the plan. For example, it can skip the restroom when looking for a spoon, but come back later if needed.

SayNav only requires a few examples via in-context learning (Brown et al. 2020; Ouyang et al. 2022) to configure LLMs to conduct high-level dynamic planning for complicated MultiON tasks in new environments. The LLM-generated plan is then executed by a pre-trained low-level planner that treats each planned step as a short-distance point-goal navigation sub-task (such as moving to a perceived object). This decomposition reduces the planning complexity of the navigation task, because the sub-tasks planned by LLMs are simple enough for low-level planners to execute successfully.

Figure 1 illustrates an example of SayNav utilizing LLMs to efficiently explore a new environment and locate one (a laptop) of the three target objects. The agent first looks around (i.e., observes to build the scene graph) and identifies what type of room it starts from. After checking potential locations of the target objects in the room, the agent does not find any target. It then decides to go through the door to move to another room. The agent continuously expands the scene graph during exploration and realizes that the neighboring room is a living room. There, it finds one target on the table and continues searching for the other two objects.

The main contributions are summarized as follows.

  1. We present, to the best of our knowledge, the first LLM-based high-level planner specifically designed for navigation tasks in large-scale unknown photo-realistic environments. The proposed LLM planner incrementally generates step-by-step instructions in a dynamic manner during navigation. The instructions generated by the LLMs during navigation are consistent and non-redundant.

  2. We propose a novel mechanism to ground LLMs for navigation in new large-scale environments. SayNav incrementally builds and expands a 3D scene graph during exploration. Next-step plans are generated by LLMs using text prompts based on a selected portion (subgraph) of the scene graph. Parts of the scene graph are also continuously refined and updated by LLMs.

  3. We introduce a benchmark dataset for the MultiON task across different houses, for our evaluation and for future use by researchers.

2 Related Work

In this section, we provide a brief review on related works in visual navigation, high-level planning with LLMs for autonomous agents and MultiON.

Visual Navigation in New Environments is a fundamental capability for many applications for autonomous agents. Recent learning-based approaches with DRL methods have shown great potential to outperform classical approaches based on SLAM (simultaneous localization and mapping) and path planning techniques, on different visual navigation tasks (Mishkin et al. 2019). These navigation tasks include point-goal navigation (Wijmans et al. 2019), image-goal navigation (Zhu et al. 2017), and object-goal navigation (Chaplot et al. 2020).

However, these methods generally require at least hundreds of millions of iterations (Wijmans et al. 2019) to train agents that generalize to new environments. This entails a high cost in terms of both data collection and computation. In addition, it hinders the development of autonomous agents that can conduct more complex navigation tasks, such as multi-object navigation and cordon-and-search, which require the ability to exploit common-sense knowledge and plan dynamically in novel environments.

Leveraging common-sense knowledge from LLMs allows us to avoid the high training cost of previous learning-based methods. By effectively grounding LLMs (such as ChatGPT) via text prompting, our approach enables efficient high-level planning for visual navigation in unknown environments. To better demonstrate and evaluate our proposed method, we use multi-object navigation, which is more complex than previous navigation tasks such as object-goal navigation. The MultiON task demands the kind of common-sense knowledge that humans possess to efficiently search for multiple different objects in large-scale unknown environments.

High-Level Planning with LLMs has become an emergent trend in the robotics field. LLMs, by virtue of both training on internet-scale data and instruction tuning, have demonstrated excellent capabilities to perform zero-/few-shot learning on unseen tasks (Zhao et al. 2023; Brown et al. 2020). Recent instruction-tuned models such as ChatGPT have further shown strong capabilities to follow natural instructions expressed as prompts (Chung et al. 2022; Peng et al. 2023).

Recent works in autonomy have used LLMs and demonstrated significant progress (Ahn et al. 2022; Song et al. 2022; Huang et al. 2022; Liu et al. 2023; Driess et al. 2023; Brown et al. 2020; Ouyang et al. 2022) in incorporating human knowledge, which enables efficient training of autonomous agents for tasks such as mobile manipulation. These works reduce the learning complexity by using a two-level planning architecture. For each assigned task, they utilize LLMs to generate a high-level step-by-step plan. Each planned step, formulated as a much simpler sub-task, can be executed by an oracle (ground truth) or a pre-trained low-level planner that maps one step into a sequence of primitive actions. Agents with these LLM-based planners are able to perform a new task with only a few training examples via in-context learning (Brown et al. 2020; Ouyang et al. 2022).

However, these LLM-based planners have two major limitations for visual navigation tasks in new large-scale environments. First, the grounding mechanisms in these methods (Ahn et al. 2022; Song et al. 2022; Huang et al. 2022; Liu et al. 2023) are designed for small-scale environments. For example, works such as (Song et al. 2022; Singh et al. 2023) have focused on AI2-THOR-based environments that consist of only a single room. Moreover, these methods rely only on the detection of specific objects. They do not consider the room layout and the topological arrangement of perceived entities inside the room, which are important for grounding LLMs in the physical environment for visual navigation tasks. Therefore, knowledge extracted from LLMs using these methods might not be contextually appropriate for an agent navigating in large-scale settings, such as multi-room houses.

Second, some of these LLM-based planners typically generate a multi-step long-horizon plan at the beginning of the assigned task, which is not feasible for navigating in unknown environments. They also lack the capability to change the plan during task execution. In contrast, an effective search plan for navigation in new places must be incrementally generated and updated during exploration. Future actions are decided based on currently perceived scenes, together with the memory of previously-visited regions.

SayPlan (Rana et al. 2023) addresses the first issue by using a pre-built ground-truth 3D scene graph of a known large-scale place, to ground LLMs for high-level task planning. However, the planning complexity for SayPlan is simplified due to the availability of the entire ground-truth scene graph prior to task execution. In other words, SayPlan cannot be used for task planning in unknown environments.

Our approach, SayNav, is designed to leverage LLMs specifically for visual navigation in unknown large-scale environments. We propose a new grounding mechanism that incrementally builds a 3D scene graph of the explored environment as inputs to LLMs, for generating the high-level plans. SayNav also dynamically generates step-by-step instructions during navigation. It continuously refines future steps based on newly perceived information via LLMs.

The only work we found that leverages LLMs specifically for navigation tasks in unknown environments is L3MVN (Yu et al. 2023). It uses LLMs to find the nearby semantic frontier based on detected objects, for expanding the exploration area to eventually find the target object; for example, moving to the (sofa, table) region, which is more likely to contain a TV. In other words, it utilizes LLMs to hint at the next exploration direction. It does not use LLMs as a full high-level planner that generates step-by-step instructions. In contrast, our SayNav uses the 3D scene graph to ground LLMs as a high-level planner. Our LLM-based planner generates the instructions in a dynamic manner and considers its prior planned steps to generate better future plans.

The MultiON task has attracted the attention of researchers in the last few years. As already mentioned, most works (Chen et al. 2022; Zeng et al. 2023; Marza et al. 2022, 2023) have considered a pre-defined sequence of objects to be localized, which greatly simplifies the task. Moreover, they employ pure DRL-based approaches and hence suffer from the issues discussed before. (Gireesh et al. 2023) is the only work to consider MultiON without a given sequence of objects, but it again makes use of a DRL-based approach.

3 SayNav

We now describe SayNav’s framework as well as the multi-object navigation task.

3.1 Task Definition

We choose the Multi-Object Navigation task to validate SayNav. The goal of this task is to navigate the agent in a large-scale unknown environment in order to find an instance of each of three predefined object categories (such as "laptop", "tomato", and "bread"). The agent is initialized at a random location in the environment and receives the goal object categories $(o_i, o_j, o_k)$ as input. At each time step $t$ during navigation, the agent receives environment observations $e_t$ and takes control actions $a_t$. The observations include RGBD images, semantic segmentation maps, and the agent's pose (location and orientation). The action space includes five control commands: turn-left, turn-right, move-forward, stop, and look-around. Both turn-left and turn-right rotate the agent by 90 degrees. The move-forward action moves the agent by 25 centimeters. The task execution is successful if the agent locates (by detection) all three objects within a given time period.
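To make the episode structure concrete, the sketch below walks through the MultiON loop described above. The environment wrapper and its method names (reset, step, detected_targets, agent_policy) are hypothetical placeholders for illustration and do not correspond to the actual AI2-THOR/ProcTHOR API.

```python
from dataclasses import dataclass
import numpy as np

ACTIONS = ["turn-left", "turn-right", "move-forward", "stop", "look-around"]

@dataclass
class Observation:
    rgbd: np.ndarray          # H x W x 4 RGB-D image
    segmentation: np.ndarray  # H x W semantic segmentation map
    pose: tuple               # (x, y, z, yaw): agent location and orientation

def run_episode(env, targets, max_steps=1000):
    """Return True if all target categories are detected within the step budget."""
    unfound = set(targets)                       # e.g. {"laptop", "tomato", "bread"}
    obs = env.reset()                            # agent spawns at a random location
    for _ in range(max_steps):
        action = env.agent_policy(obs, unfound)  # any policy (SayNav, baseline, ...)
        obs = env.step(action)                   # turn-left/right rotate 90 degrees,
                                                 # move-forward advances 0.25 m
        unfound -= env.detected_targets(obs)     # "locating" an object = detecting it
        if not unfound:
            return True                          # success: all targets found in time
    return False                                 # failure: step budget exhausted
```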

Figure 2: The overview of our SayNav framework.

Note that the MultiON task poses much larger planning challenges and task complexities than previous navigation tasks, which either look for a single object (Chaplot et al. 2020) or reach a single target point (Wijmans et al. 2019). For example, the agent needs to dynamically set up the search plan based on the prioritized order among three different objects. This plan can also change during exploration in a new house with unknown layouts. As shown in Figure 1, the agent first realizes that it is in the bedroom and then decides to prioritize places (such as the table) to locate the laptop inside this room. On the other hand, if the agent had started in the kitchen, it would have been more efficient to search for the fork and spoon first. Therefore, this new task requires extensive semantic reasoning and dynamic planning capabilities, such as humans possess, for an autonomous agent to explore large-scale unknown environments.

3.2 Overview

SayNav’s framework is illustrated in Figure 2. The corresponding pseudo-code is in Algorithm 1. It includes three modules: (1) Incremental Scene Graph Generation, (2) High-Level LLM-based Dynamic Planner, and (3) Low-Level Planner. The Incremental Scene Graph Generation module accumulates observations received by the agent to build and expand a scene graph, which encodes semantic entities (such as objects and furniture) from the areas the agent has explored. The High-Level LLM-based Dynamic Planner continuously converts relevant information from the scene graph into text prompts to a pre-trained LLM, for dynamically generating short-term high-level plans. Each LLM-planned step is executed by the Low-Level Planner to generate a series of control commands for execution.

Input: start location of robot start_location, house ID house_id, target objects target_objects

unfound_objects ← target_objects
spawn_robot(start_location)
SceneGraph ← create_scene_graph(house_id)
while len(unfound_objects) > 0 do
    objs_found, observations ← look_around()
    update_unfound_objects(objs_found)
    room_type ← identify_room_type(observations)
    SceneGraph.update(room_type, observations)
    plan_needed ← is_feasible(room_type)
    if plan_needed then
        subgraph ← SceneGraph.extract_subgraph(room_type)
        plan ← query_llm_for_plan(subgraph, unfound_objects)
        for action in plan do
            if action.type = 'navigate' then
                navigate_to(action.target_location)
            else if action.type = 'look' then
                objs_found, observations ← look_around()
                SceneGraph.update(observations)
                update_unfound_objects(objs_found)
            end if
        end for
    end if
    if len(unfound_objects) > 0 then
        if SceneGraph.all_doors_explored() then
            return 'Task Failed'
        end if
        door ← find_next_unexplored_door()
        navigate_to(door)
    end if
end while
return 'Success'
Algorithm 1: SayNav

3.3 Incremental Scene Graph Generation

This module continuously builds and expands a 3D scene graph of the environment being explored. A 3D scene graph is a layered graph that represents spatial concepts (nodes) at multiple levels of abstraction along with their relations (edges). This representation has recently emerged as a powerful high-level abstraction for 3D large-scale environments in robotics. Here we define four levels in the 3D scene graph: small objects, large objects, rooms, and house. Each object node is associated with its 3D coordinate, and each room node is associated with its bounds. Every door is treated as an edge between two rooms, which also has an associated 3D coordinate. All other edges capture the topological relationships among semantic concepts across different levels. Figure 3 shows one example of our scene graph. Mathematically, our scene graph can be represented as a set of four kinds of triplets:

$\{(s_h, \textrm{`near'}, l_i),\; (l_i, \textrm{`in'}, r_j),\; (r_j, d_{jk}, r_k),\; (r_j, \textrm{`in'}, H)\}$

where $s_h$ is a small object, $l_i$ a large object, $r_j, r_k$ rooms ($j \neq k$), $d_{jk}$ the door between $r_j$ and $r_k$, and $H$ the house (root node).

Large objects act as landmarks that are visible from a far distance. The LLM uses these large objects to reason about whether a small object can be found near them. For example, it may be logical to walk towards a dining table to find a knife. The decision between small and large objects is made based on an object's dimensions and the LLM's understanding of the object class.
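The following sketch shows one way the four-level scene graph and its triplets could be stored. The class and method names are illustrative and are not SayNav's actual implementation.

```python
class SceneGraph:
    """Four-level scene graph stored as the triplets defined above."""

    def __init__(self, house_id):
        self.root = house_id          # house node H
        self.triplets = set()         # {(subject, relation, object)}
        self.coords = {}              # object/door node -> 3D coordinate
        self.room_bounds = {}         # room node -> spatial bounds
        self.visited_rooms = set()    # memory of rooms already investigated

    def add_room(self, room, bounds):
        self.room_bounds[room] = bounds
        self.triplets.add((room, "in", self.root))        # (r_j, 'in', H)

    def add_large_object(self, obj, room, coord):
        self.coords[obj] = coord
        self.triplets.add((obj, "in", room))              # (l_i, 'in', r_j)

    def add_small_object(self, obj, large_obj, coord):
        self.coords[obj] = coord
        self.triplets.add((obj, "near", large_obj))       # (s_h, 'near', l_i)

    def add_door(self, door, room_a, room_b, coord):
        self.coords[door] = coord
        self.triplets.add((room_a, door, room_b))         # (r_j, d_jk, r_k)

    def extract_subgraph(self, room):
        """Triplets anchored in the given room; used to build the LLM prompt."""
        in_room = {s for (s, rel, o) in self.triplets if rel == "in" and o == room}
        return {(s, rel, o) for (s, rel, o) in self.triplets
                if s == room or o == room or s in in_room or o in in_room}
```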

The scene graph is built using environmental observations received by the agent during exploration. The depth of each segmented object can be obtained based on RGBD images and semantic segmentation images. The 3D coordinate of each perceived object can then be estimated by combining its depth information at multiple timestamps and the corresponding agent’s poses.
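A minimal sketch of this back-projection step is shown below, assuming a pinhole camera model and a simple yaw-only pose convention; in practice, estimates from multiple timestamps are fused, and the exact conventions here are assumptions rather than SayNav's implementation.

```python
import numpy as np

def object_world_coordinate(depth, seg_mask, fov_deg, cam_pos, cam_yaw_deg):
    """Estimate an object's 3D coordinate from one RGB-D frame.

    depth: HxW metric depth image; seg_mask: HxW boolean mask of one object;
    cam_pos: camera position in the world frame; cam_yaw_deg: camera heading.
    """
    h, w = depth.shape
    f = (w / 2.0) / np.tan(np.deg2rad(fov_deg) / 2.0)   # focal length in pixels
    vs, us = np.nonzero(seg_mask)                       # pixel coordinates of object
    z = depth[vs, us]                                   # depth along the optical axis
    x_cam = (us - w / 2.0) * z / f                      # right in the camera frame
    y_cam = (h / 2.0 - vs) * z / f                      # up in the camera frame
    pts_cam = np.stack([x_cam, y_cam, z], axis=1)
    yaw = np.deg2rad(cam_yaw_deg)                       # rotate into the world frame
    rot = np.array([[ np.cos(yaw), 0.0, np.sin(yaw)],
                    [ 0.0,         1.0, 0.0        ],
                    [-np.sin(yaw), 0.0, np.cos(yaw)]])
    pts_world = pts_cam @ rot.T + np.asarray(cam_pos)
    # Estimates from several frames/poses can then be fused (e.g. averaged).
    return pts_world.mean(axis=0)                       # centroid as the coordinate
```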

We also utilize LLMs to augment and refine high-level abstractions of the scene graph. For example, we use LLMs to annotate and identify the spatial entity (room type) at the room level of the graph based on its connected objects at lower levels. For instance, a room is probably a bedroom if it includes a bed. The bounds of a room are calculated based on the detection of walls and floor of the room.
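A small sketch of this room-type annotation step is given below. The prompt wording is a simplified placeholder (the actual prompt is shown in Figure 5), and query_llm stands in for whichever chat-completion client is used.

```python
def identify_room_type(room_node, scene_graph, query_llm):
    """Ask the LLM to annotate a room node based on its connected objects."""
    objects = [s for (s, rel, o) in scene_graph.triplets
               if rel == "in" and o == room_node]
    prompt = (
        "A room contains the following objects: " + ", ".join(objects) + ". "
        "Which room type is this most likely to be "
        "(e.g. bedroom, kitchen, living room, bathroom)? Answer with one word."
    )
    return query_llm(prompt).strip().lower()   # e.g. "bedroom" if a bed was seen
```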

SayNav also uses the 3D scene graph to support memory for future planning. For example, it automatically annotates the room nodes that have been investigated. Therefore, it will not generate repeated plans when the agent revisits the same room during exploration.

Figure 3: An example of our scene graph.

3.4 High-Level LLM-based Dynamic Planner

Similar to previous works in LLM-based planning, SayNav utilizes a two-level planning architecture to reduce the learning complexity of the assigned task. However, instead of generating a complete high-level plan for the entire task at the beginning, SayNav utilizes LLMs to regularly generate short-term plans, based on current observations and the memory of previously-visited regions. This high-level planner can be set up using only a few training examples via typical in-context learning procedures (Brown et al. 2020; Ouyang et al. 2022) (as shown in Figure 4).

Our high-level LLM-based dynamic planner extracts a subgraph from the full 3D scene graph and converts it into text prompts, which are fed to an LLM. The extracted subgraph includes spatial concepts in the local region centered around the current position of the agent. We implement the LLM prompts similarly to (Singh et al. 2023), constructing programming-language-style prompts based on the text labels in the extracted subgraph. Once the prompts are received, the LLM planner outputs short-term step-by-step instructions as pseudo-code. The generated plan provides an efficient search strategy within the currently perceived area based on human knowledge, prioritizing locations to visit inside the room according to the likelihood of target objects being discovered there. For example, the LLM may provide a plan to first check the desk and then the bed to find the laptop in the bedroom. Figure 4 shows the prompt structure used to generate the plan. We provide two in-context examples inside the prompt to constrain the LLM-generated plans. For instance, we constrain each step to generate a navigate or look function call with arguments and a high-level comment. Note that we also make the LLM output a descriptive comment associated with each step in the generated plan, to minimize the prevalent issue of hallucinations in LLMs.
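The sketch below illustrates how such a programming-language-style prompt could be assembled from the extracted subgraph. The prompt layout and helper names are illustrative and are not SayNav's exact prompts (see Figure 4 for those).

```python
def build_planning_prompt(room, room_type, subgraph, unfound_objects, examples):
    """Turn a room-centred subgraph into a ProgPrompt-style planning prompt."""
    # Large objects placed "in" this room, and the small objects seen "near" them.
    large_objects = [s for (s, rel, o) in subgraph if rel == "in" and o == room]
    near = {l: [s for (s, r, o) in subgraph if r == "near" and o == l]
            for l in large_objects}

    scene = [f"# current room: {room_type}"]
    for l in large_objects:
        seen = ", ".join(near[l]) or "nothing detected yet"
        scene.append(f"# {l}: objects near it -> {seen}")

    # The LLM completes the function body with navigate()/look() calls, one
    # commented step per line, constrained by the two in-context examples.
    header = f"def find_objects({', '.join(unfound_objects)}):"
    return "\n".join(examples + scene + [header])
```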

Figure 4: Prompt used to create the search plan for a particular room

The LLM-based planner also extends and updates the plan when the previous plan fails or the task goal (finding three objects) is not achieved after executing the previous short-term plan. It attempts to update the plan by (i) generating a plan based on new information received from the environment, (ii) putting the low-level planner into an exploratory mode (e.g., trying to go to the next room), or (iii) refining the scene graph by instructing the low-level planner to randomly move around and collect more observations.

Before generating the high-level plan for a local region (room), the LLM also reasons about the feasibility of locating the target object(s) in the detected room type. It may decide to skip searching a specific room and come back later if the feasibility is low. In this case, it marks the corresponding room node in the scene graph to come back to later. Figure 5 shows the structure of the prompts used to identify the room type and determine the feasibility of locating a target object in a room.

Figure 5: Prompts used to compute the feasibility of finding an object in a room-type and to identify the room-type.

3.5 Low-Level Planner

The Low-Level Planner converts each LLM-planned step into a series of control commands for execution. To integrate the two planners, SayNav formulates each LLM-planned step as a short-distance point-goal navigation (PointNav) sub-task for the low-level planner. The target point for each sub-task, such as moving from the current position to the table in the current room, is given by the 3D coordinate of the object (e.g., the table) described in each planned step.
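A minimal sketch of this hand-off is shown below, assuming the illustrative scene-graph structure from Section 3.3; the function and attribute names are placeholders rather than the actual interface.

```python
import numpy as np

def execute_planned_step(step_target, scene_graph, pointnav_policy, env,
                         max_steps=100):
    """Run one LLM-planned step as a short-distance PointNav sub-task."""
    goal_xyz = np.asarray(scene_graph.coords[step_target])  # e.g. the "table" node
    obs = env.current_observation()
    hidden = None
    for _ in range(max_steps):
        # The goal is expressed relative to the agent, as in standard PointNav.
        rel_goal = goal_xyz - np.asarray(obs.pose[:3])
        action, hidden = pointnav_policy(obs.rgbd, rel_goal, hidden)
        if action == "stop":
            break
        obs = env.step(action)
    return obs
```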

SayNav’s low-level planner takes the RGBD images (resolution $320\times240$) and the agent’s pose (location and orientation) as inputs, and it outputs move_forward, turn_left and turn_right actions to control the robot following standard PointNav settings. Note that large-scale DRL approaches typically take $10^8$ to $10^9$ simulation steps during training to solve PointNav tasks in simulation environments (Wijmans et al. 2019; Weihs et al. 2020), which poses serious training requirements and computation demands. In SayNav, however, the two-level planning architecture simplifies the job of the low-level planner. The low-level planner mostly outputs control commands for short-range movements. The LLM-based high-level planner is also robust to failures in the low-level planner, by making regular plan updates. In this way, the training load required on the low-level planner can be greatly reduced.

Encouraged by the success of imitation learning (IL) on navigation tasks under resource constraints (Ramrakhya et al. 2022, 2023; Shah et al. 2023), we investigate a sample-efficient IL-based method to train the low-level planner for the agent in SayNav. This low-level planner is trained from scratch (without pretraining) on only 2800 episodes or $7\times10^5$ simulation steps. Specifically, the low-level planner is trained using the DAgger algorithm (Ross et al. 2011) to follow a shortest-path oracle as the expert. Despite the fact that the shortest-path oracle lacks the exploration behavior required to solve the PointNav task in complex environments (e.g. multiple rooms), we find that it helps the agent to learn short-distance navigation skills very quickly, without human-in-the-loop.

We first implement a shortest-path oracle using the A* algorithm on the grid of reachable positions in the house, and then train a PointNav agent using the DAgger (Ross et al. 2011) algorithm with the A* oracle as the expert. We perform DAgger dataset aggregation over two rounds. In the first round, 2,000 episodes were collected using a random agent. In the second round, 800 additional episodes were collected using an agent trained on episodes from the first round. The aggregated dataset contains a total of 2,800 episodes or $7\times10^5$ simulator steps for IL. For each episode, we choose a scene from the ProcThor-10k train split, randomly sample a start location, and sample a random object from the scene as the goal location. The robot then performs the task by taking the expert's action with probability $p=0.2$ and the agent's action with probability $p=0.8$. The robot's observations and the expert's actions are stored for behavior cloning.
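The data-collection loop described above could look roughly like the following sketch; the environment and expert interfaces are hypothetical placeholders.

```python
import random

def collect_dagger_episode(env, agent, astar_expert, expert_prob=0.2,
                           max_steps=300):
    """One data-collection episode: store the expert label at every visited state."""
    data = []                                   # (observation, expert_action) pairs
    obs = env.reset()                           # random scene, start and goal object
    for _ in range(max_steps):
        expert_action = astar_expert(env.state())     # shortest-path oracle label
        data.append((obs, expert_action))
        if random.random() < expert_prob:
            action = expert_action              # follow the expert (p = 0.2)
        else:
            action = agent.act(obs)             # follow the current agent (p = 0.8);
                                                # a random agent in the first round
        obs, done = env.step(action)
        if done:
            break
    return data
```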

For behavior cloning, the objective is to minimize the cross-entropy loss between the predicted action and the expert action at every step. For the agent architecture, following (Wijmans et al. 2019), we use the standard architecture shown in Figure 6. As mentioned, our agent receives an RGBD image and the agent's pose with respect to the goal location as input. A GroupNorm (Wu et al. 2018) ResNet18 (He et al. 2016) encodes the input RGBD image, and a 2-layer, 512-hidden-size gated recurrent unit (GRU) (Cho et al. 2014) combines history and sensor inputs to predict the next action. We use the Adamax optimizer with a learning rate of $10^{-4}$ and weight decay of $10^{-4}$. We optimize the objective function with a batch size of 16 episodes for 50 epochs.
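A sketch of this policy network in PyTorch is given below, assuming torchvision's ResNet18 with GroupNorm substituted for BatchNorm; the pose-encoding size and the number of actions are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class PointNavPolicy(nn.Module):
    def __init__(self, num_actions=4, hidden_size=512):
        super().__init__()
        # ResNet18 with GroupNorm instead of BatchNorm; 4 input channels (RGB-D).
        self.encoder = resnet18(
            num_classes=hidden_size,
            norm_layer=lambda c: nn.GroupNorm(num_groups=32, num_channels=c),
        )
        self.encoder.conv1 = nn.Conv2d(4, 64, kernel_size=7, stride=2,
                                       padding=3, bias=False)
        self.pose_fc = nn.Linear(3, 32)          # goal expressed relative to agent
        self.gru = nn.GRU(hidden_size + 32, hidden_size, num_layers=2,
                          batch_first=True)
        self.action_head = nn.Linear(hidden_size, num_actions)

    def forward(self, rgbd, rel_goal, hidden=None):
        # rgbd: (B, T, 4, H, W); rel_goal: (B, T, 3)
        b, t = rgbd.shape[:2]
        feat = self.encoder(rgbd.flatten(0, 1)).view(b, t, -1)
        x = torch.cat([feat, self.pose_fc(rel_goal)], dim=-1)
        out, hidden = self.gru(x, hidden)
        return self.action_head(out), hidden     # logits trained with cross-entropy
```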

We evaluate the PointNav performance of the agent on the publicly available AI2Thor ObjectNav dataset (https://github.com/allenai/object-nav-eval), ProcThor-10k-val split, using the ground truth location of the object as the goal for PointNav evaluation. The evaluation split contains 1550 episodes. When multiple instances of the target object are available, we arbitrarily select the first instance as the PointNav goal. Our low-level planner achieves 84.5% SR (success radius 1.5 m, max 300 steps) and 0.782 SPL. On the subset of episodes where the starting position and the goal position are in the same room, performance increases to 98.5% SR and 0.930 SPL.

Figure 6: Low-level PointNav architecture.

4 Experimental Results

4.1 Multi-Object Navigation Dataset

Most prior Embodied AI simulators, such as AI2-THOR (Kolve et al. 2017) or Habitat (Szot et al. 2021), are either based on environments with single rooms, lack natural placement of objects within the environment, or lack the ability to interact with objects. For our experiments, we opted for the recently introduced ProcTHOR framework (Deitke et al. 2022), which is built on top of the AI2-THOR simulator. ProcTHOR is capable of procedurally generating full floor plans of a house given a room specification (e.g., a house with 2 bedrooms, 1 kitchen and 1 bathroom). It also populates each floor plan with 108 object types, with realistic, physically plausible, and natural placements. The scenes in ProcTHOR are interactive, which allows changing the state, lighting and texture of objects, posing a bigger challenge for perception and a broader scope for future work. We build a benchmark dataset of 132 episodes using 132 different houses with 3-10 rooms each and select 3 objects for each house for the MultiON task. Each episode in the dataset is described using the following properties (an illustrative episode record is sketched after the list):
1. data_type: Either 'val' or 'test', based on which set of data from ProcThor-10K was used to construct the episode.
2. house_idx: The index of the specific house in the ProcThor-10K dataset.
3. num_rooms: Number of rooms in the house.
4. num_targets: Number of targets in the house (currently we have limited the dataset to 3 targets).
5. targets: List of unique targets in the house along with their ground truth locations.
6. start_position: Start position randomly sampled from all reachable positions in the house.
7. start_heading: Start heading randomly sampled from 0, 90, 180 and 270 degrees.
8. shortest_path_targets_order: The order of targets that results in the shortest path using the A* planner, evaluated by running the A* planner on all possible target orders.
9. shortest_path_length: Length of the shortest path computed via the A* planner along shortest_path_targets_order.
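The dictionary below illustrates what a single episode record with these properties might look like; all values are made up for illustration and do not correspond to an actual episode in the released benchmark.

```python
# Hypothetical episode record; field values are invented for illustration only.
example_episode = {
    "data_type": "val",                       # ProcThor-10K split used
    "house_idx": 123,                         # index into ProcThor-10K
    "num_rooms": 5,
    "num_targets": 3,
    "targets": [
        {"name": "laptop",      "position": {"x": 3.1, "y": 0.9, "z": 4.2}},
        {"name": "tomato",      "position": {"x": 7.5, "y": 1.0, "z": 1.3}},
        {"name": "alarm_clock", "position": {"x": 0.8, "y": 0.6, "z": 6.7}},
    ],
    "start_position": {"x": 2.0, "y": 0.9, "z": 0.5},
    "start_heading": 90,                      # one of 0, 90, 180, 270 degrees
    "shortest_path_targets_order": ["laptop", "alarm_clock", "tomato"],
    "shortest_path_length": 21.4,             # meters, via the A* planner
}
```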

4.2 Metrics

We report two standard metrics that are used for evaluating navigation tasks: Success Rate (SR) and Success weighted by Path Length (SPL) (Anderson et al. 2018). SR measures the percentage of cases where the agent is able to find all three objects successfully, while SPL weights success by the ratio of the shortest path to the actual path taken. We use the minimum shortest-path length from the starting point over all permutations of the target objects. In addition to these two metrics, we measure the similarity between the object ordering obtained by the agent and the ground-truth ordering. The ground-truth object ordering indicates how a perfect agent would have explored the space, by first identifying objects that are highly probable in the current room/scene graph and then exploring other rooms. We use the Kendall distance metric (Lapata 2006), which computes the distance between two rankings based on the number of disagreeing pairs. We use Kendall's Tau, which converts this distance into a correlation coefficient, and report it over the successful episodes (those in which all three targets are located).
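The three metrics could be computed roughly as in the following sketch (SciPy assumed available for Kendall's Tau); the episode field names are illustrative, not the benchmark's actual schema.

```python
from scipy.stats import kendalltau

def success_rate(episodes):
    return sum(ep["success"] for ep in episodes) / len(episodes)

def spl(episodes):
    # SPL_i = S_i * l_i / max(p_i, l_i): shortest path length over the longer of
    # the actual path taken and the shortest path; zero for failed episodes.
    vals = [ep["success"] * ep["shortest_path_length"]
            / max(ep["agent_path_length"], ep["shortest_path_length"])
            for ep in episodes]
    return sum(vals) / len(episodes)

def mean_kendall_tau(episodes):
    # Compare the order in which the agent found the targets against the
    # ground-truth (shortest-path) order, on successful episodes only.
    taus = []
    for ep in (e for e in episodes if e["success"]):
        gt_rank = {obj: i for i, obj in enumerate(ep["shortest_path_targets_order"])}
        agent_rank = [gt_rank[obj] for obj in ep["agent_targets_order"]]
        tau, _ = kendalltau(agent_rank, sorted(agent_rank))
        taus.append(tau)
    return sum(taus) / len(taus)
```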

4.3 Implementation Details

We use the default robot with a head-mounted RGBD camera in the AI2-Thor simulator. The camera has $320\times240$ resolution with a 90° field-of-view (FoV). The details of the robot observations and actions are described in Section 3.1.

We conduct experiments with two different LLMs: gpt-3.5-turbo and gpt-4. For training the low-level planner, we use the IL method described in Section 3.5. It achieves 84.5% SR (success radius 1.5 m, max 300 steps) and 0.782 SPL in unseen ProcThor-10k-val scenes with random start and goal locations. For short-range movements within a single room, performance increases to 98.5% SR and 0.930 SPL.

Method             Scene Graph   LL Planner   SR (%)   SPL    Kendall Tau
Baseline           -             PNav         56.06    0.49   -
SayNav (gpt-3.5)   GT            OrNav        95.35    0.43   0.70
SayNav (gpt-3.5)   GT            PNav         80.62    0.32   0.72
SayNav (gpt-3.5)   VO            OrNav        71.32    0.48   0.56
SayNav (gpt-3.5)   VO            PNav         60.32    0.34   0.62
SayNav (gpt-4)     GT            OrNav        93.93    0.46   0.76
SayNav (gpt-4)     GT            PNav         84.09    0.36   0.78
SayNav (gpt-4)     VO            OrNav        69.80    0.47   0.76
SayNav (gpt-4)     VO            PNav         64.34    0.33   0.78
Table 1: Results of SayNav on the multi-object navigation task. The Baseline uses a PNav agent to navigate along the shortest route among targets based on ground-truth positions; GT and VO build the scene graph from ground-truth object/room locations and from visual observations provided by the simulator, respectively; OrNav and PNav use an oracle and the IL-trained low-level (LL) planner, respectively, for navigating between the points assigned by the high-level planner.

Note that SayNav consists of three modules: an incremental scene graph generator, an LLM-based planner, and a low-level planner. The major goal of our experiments is to fully validate and verify the LLM-based planning capabilities in SayNav for MultiON. Therefore, for each of the other two modules, we implemented an alternative option which uses ground-truth information to avoid any error within that module. This allows us to conduct an ablation study to determine the impact of each module on the overall performance.

First, we allow the scene graph to be generated using ground truth (GT) instead of visual observations (VO). We described the generation of the scene graph using VO in Section 3.3. The GT option directly uses the ground-truth information of surrounding objects, including 3D coordinates and geometric relationships among objects, to incrementally build the scene graph during exploration. This option avoids any association and computation ambiguity arising from processing visual observations, such as computing the 3D coordinate of each object based on the RGBD image and its segmentation.

Second, we use an oracle planner (OrNav) as the low-level planner instead of our efficiently-trained IL agent (PNav). We described the implementation of PNav in Section 3.5. For OrNav, we use an A* planner which has access to the map of the environment. Given a target location, it plans the shortest path from the agent's current location to the target.

4.4 Baseline

Most existing LLM-based approaches for robot navigation operate in known and/or small-scale environments. However, there do exist RL/IL-based approaches (e.g., (Gireesh et al. 2023)) and traditional SLAM-based approaches that can work in our problem setting. Many previous works (e.g., (Savva et al. 2019)) have shown that RL/IL-based point-goal navigation (PointNav) outperforms SLAM-based approaches in unknown environments by a large margin (80% vs 62% in the cited example). Due to the lack of open-source datasets/code from existing works, we chose to implement the strongest possible baseline method, which employs an IL-based PointNav policy and could potentially reflect the upper bound of the performance of a learning-based agent using the same amount of training data as ours. The baseline agent uses two pieces of privileged information that SayNav does not have access to. First, it has access to the optimal order of target objects, which achieves the shortest path. This simplifies the MultiON task to a series of ObjectNav tasks. Second, as a PointNav policy typically performs better than an ObjectNav policy, we also provide the baseline agent access to the ground-truth coordinate of each target object, which then simplifies the task to a series of PointNav tasks. In other words, we implement a PointNav agent that navigates through the ground-truth points of the objects in the optimal order.

4.5 Quantitative Results

Table 1 shows the results of the baseline along with different implementation choices in SayNav. Note that for the baseline method, even after using ground-truth object locations in the optimal order, SR is only 56.06%, which indicates the difficulty of the MultiON task. This is because the PointNav policy loses performance in large-scale environments: although the baseline has access to the locations of the goal objects, finding each object still requires it to plan a long-range path across multiple rooms. In comparison, SayNav, without using any ground truth, achieves a higher SR (60.32% and 64.34% with gpt-3.5-turbo and gpt-4, respectively). This improvement highlights the superiority of SayNav in navigating large-scale unknown environments.

Figure 7: Visualization of an episode with SayNav (OrNav + GT) for the multi-object navigation task.

SR: With SayNav, we observe that the best performance is achieved when using the scene graph generated from ground-truth object/room locations from the simulator (GT) along with OrNav. Note that the SR of 95.35% with gpt-3.5 and 93.93% with gpt-4 in this setting reflects the strength of our LLM-based high-level planner, which is the only imperfect module in this experiment. When we replace GT with VO, we observe a loss in performance. We found that the drop in SR can be attributed to various challenges encountered in any perception-based algorithm. Inaccurate estimation of objects' 3D positions due to partial observations can lead to failures in detecting targets and in navigation. In addition, we remove objects smaller than 20 pixels in the semantic segmentation observations for more practical behavior; therefore, very small objects can also be missed while building the scene graph from visual observations. We also observed a specific challenge in VO associated with estimating the location of glass doors: in the depth map, the depth values of objects visible behind the glass door correspond to the depth of the actual physical door, which causes the estimation of the door's location to fail. A similar trend can be found in the results with GT & PNav and with VO & PNav. When replacing OrNav with PNav, we also observe a drop in performance. This is expected, as PNav does not access any ground-truth information, in contrast to OrNav.

However, even with all these challenges, SayNav outperforms the oracle-based baseline and succeeds in MultiON tasks. We believe this is due to its LLM-based dynamic planning capabilities, together with the grounding mechanism based on incremental scene graph generation. It leverages common-sense knowledge, as humans do, to efficiently search for multiple different objects in an unknown environment. It also refines or corrects the plan in case of any failure in a planned step.

SPL: Looking at the SPL metric, we see a drop for SayNav compared to the baseline. Note that SPL reflects the length of the path taken by the agent compared to the shortest possible path. For example, SPL=1 for an episode would mean that the agent, starting from its initial position, goes straight to the targets along the shortest path in the optimal order with zero exploration, which is practically impossible in an unknown environment. Since the baseline has access to the ground-truth object locations in the optimal order, it is expected to achieve a higher SPL. From the results, we also observe that the low-level planner has the major impact on SPL: the system achieves higher SPL with OrNav than with PNav.

Kendall-Tau: The Kendall-Tau ($\tau$) metric measures the similarity between the order in which the agent locates the objects and the optimal ordering based on the ground truth. It shows the importance of the knowledge provided by the LLMs for finding the optimal plan. We observe that the ordering of the objects is not affected much by the low-level planner. This is reasonable, since the ordering should mostly depend on the plans generated by the high-level planner. As expected, the score drops when we replace GT with VO, since the LLM uses the scene graph to generate the plan. The considerable improvement in the score with gpt-4 (vs. gpt-3.5-turbo) shows that a better LLM makes better use of common-sense priors and yields a more optimal ordering. Also note that the Kendall Tau metric does not apply to the baseline, since it already has access to the optimal order of targets.

4.6 Qualitative Results

We show an example of a typical episode in Figure 7 where the agent is asked to locate an alarm-clock, a laptop, and a cellphone in an unknown house. The agent happens to start in the kitchen (determined by LLM based on perceived objects). The planner reasons that it is unlikely to find either of the objects there, so it decides to go to another room through a door. Then, it comes to a living-room where it is able to locate the laptop and cellphone. The third object still remains unfound, so it again decides to go to another room via a door. Eventually, it locates the alarm-clock in the final room. The complete demo-video for this example can be found at our project link.

4.7 Memory via LLM

As mentioned in Section 3.3, SayNav uses the 3D scene graph to support memory for future planning. For instance, it automatically annotates the room nodes that have already been explored and will not plan for a room again if the agent happens to revisit it. This implementation of memory works well for our chosen task. However, we also wanted to explore the possibility of making the LLM track its own plans. Hence, we also implemented SayNav's memory via the LLM by using the Conversational Chains module of the LangChain framework (https://python.langchain.com/docs/modules/chains). Here we use two separate instances of identical LLM models: one to generate the high-level plans and another to track the generated plans. The LLM tracking the generated plans uses the Room Tracking Prompt in Figure 8 and is equipped with a conversational memory. The LLM responsible for generating the plans receives a detailed description of the surrounding environment, while the other LLM instance only receives the minimal information necessary to do the tracking. A key advantage of this framework is that it avoids hitting the maximum token limit of the LLM by not relying on the entire conversational history (as is usually done). We believe that LLM-based tracking might generalize better to other tasks, since it eliminates the need for a module that tracks plan history, which would require different implementations for different tasks (we will test this in future work).
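A rough sketch of this two-LLM setup using LangChain's (circa-2023) ConversationChain API is shown below; class and argument names may differ across LangChain versions, and the prompt text is a simplified placeholder for the Room Tracking Prompt in Figure 8.

```python
from langchain.chains import ConversationChain
from langchain.chat_models import ChatOpenAI
from langchain.memory import ConversationBufferMemory

planner_llm = ChatOpenAI(model_name="gpt-4", temperature=0)   # generates the plans
tracker = ConversationChain(                                  # tracks visited rooms
    llm=ChatOpenAI(model_name="gpt-4", temperature=0),
    memory=ConversationBufferMemory(),                        # conversational memory
)

def already_planned(room_type, room_id):
    # The tracker only receives minimal room information, not the full scene
    # description, which keeps its context well under the token limit.
    answer = tracker.predict(
        input=f"The agent is now in {room_type} (id {room_id}). "
              f"Have we already generated a search plan for this room? "
              f"Answer yes or no, then record this room as visited."
    )
    return answer.strip().lower().startswith("yes")
```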

Figure 8: Prompt used for room tracking via LLM.
Method             Scene Graph   LL Planner   SR (%)   SPL    Kendall Tau
SayNav (gpt-3.5)   GT            OrNav        77.86    0.37   0.72
SayNav (gpt-3.5)   GT            PNav         61.83    0.24   0.76
SayNav (gpt-3.5)   VO            OrNav        58.73    0.40   0.72
SayNav (gpt-3.5)   VO            PNav         46.77    0.29   0.82
SayNav (gpt-4)     GT            OrNav        93.93    0.43   0.69
SayNav (gpt-4)     GT            PNav         86.36    0.35   0.77
SayNav (gpt-4)     VO            OrNav        72.09    0.44   0.73
SayNav (gpt-4)     VO            PNav         61.60    0.35   0.77
Table 2: Results of SayNav on the MultiON task using LLM memory.

Table 2 shows the performance of SayNav using LLM memory via gpt-3.5-turbo and gpt-4. Note that we use the same model for both instances of the LLM in our experiments. We observe a substantial drop in the SR and SPL metrics when using LLM memory with gpt-3.5-turbo, as compared to the results reported in Table 1. However, with gpt-4, LLM memory achieves results similar to those in Table 1. This shows that it is feasible to hand over the tracking task to the stronger LLM models.

4.8 Real-World Demonstration

We also ported SayNav to a real robot and tested it under various settings, to showcase the efficient generalization of SayNav to real environments. The demonstration video for a cafeteria environment is available at our project link.

Figure 9: SayNav on a real robot

5 Limitations and Discussion

Figure 10: A failure example: The left picture shows the RGB image from the camera mounted on the agent, and the right picture shows the top-down view of the house. Due to the geometry of the room, the agent is unable to observe that the door is open and hence cannot navigate through the door (marked with a yellow rectangle).

This section discusses the limitations of our work and some ideas to improve SayNav in the future. As mentioned before, our scene graph generation faces the various challenges encountered by any perception-based algorithm. In addition to the glass door issue described earlier, Figure 10 shows a failure case caused by another issue with visual observations. Note that, in our experiments, the agent is not equipped with arms to open or close doors; therefore, it can only go through open doors to move to other rooms. In this episode, from most positions in the room, the agent cannot observe that the door is open (the door connects to the other room that contains the target object). The robot repeatedly tries to go towards the center of the current room and refine the scene graph. However, it is still not able to identify the "open" status of the door and therefore fails to achieve the goal.

A better mechanism to verify attributes (open/closed) associated with an object node (door) in the scene graph can help to alleviate this case. For example, the agent can move closer to the door, verify visual observations from all possible angles, and compare the depth information of the door to that of the wall (closed doors should have nearly identical depths to the connected wall). In the future, we would like to develop verification and feedback mechanisms to validate the plans generated by LLMs for improving SayNav's performance. We also plan to explore smaller LLMs (instead of GPT-3.5 and GPT-4), which can run locally on a real robot.
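A minimal sketch of such a depth-based check is given below, assuming segmentation masks for the door and its adjacent wall are available; the threshold is an assumption, not a tuned value.

```python
import numpy as np

def door_appears_open(depth, door_mask, wall_mask, margin=0.5):
    """depth: HxW metric depth; masks: boolean HxW for the door and adjacent wall."""
    if door_mask.sum() == 0 or wall_mask.sum() == 0:
        return False                        # not enough evidence; keep exploring
    door_depth = np.median(depth[door_mask])
    wall_depth = np.median(depth[wall_mask])
    # Depth seen through an open doorway extends well beyond the wall plane,
    # while a closed door sits roughly in the same plane as the wall.
    return door_depth > wall_depth + margin
```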

6 Conclusion

We present SayNav, a new approach for efficient generalization to complex navigation tasks in unknown large-scale environments. SayNav incrementally builds a 3D scene graph of the explored environment and feeds it to LLMs, for generating dynamic and contextually appropriate high-level plans for navigation. We evaluate SayNav on the MultiON task, which requires the agent to efficiently search for multiple different objects in an unknown environment. Our results demonstrate that SayNav even outperforms a strong oracle-based PointNav baseline (64.34% vs 56.06%) in successfully locating objects in large-scale new environments.

7 Acknowledgement

This material is, in part, based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR001123C0089. Any opinions, findings, conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of DARPA. We would like to thank Tixiao Shan for experiments and demonstrations in the real world. We also thank Rakesh “Teddy” Kumar, Ajay Divakaran, and Supun Samarasekera for their valuable feedback on this material.

References

  • Ahn et al. (2022) Ahn, M.; et al. 2022. Do as i can, not as i say: Grounding language in robotic affordances. In arXiv:2204.01691.
  • Anderson et al. (2018) Anderson, P.; et al. 2018. On evaluation of embodied navigation agents. arXiv:1807.06757.
  • Armeni et al. (2019) Armeni, I.; et al. 2019. 3d scene graph: A structure for unified semantics, 3d space, and camera. In CVPR.
  • Brown et al. (2020) Brown, T.; et al. 2020. Language models are few-shot learners. NeurIPS, 33: 1877–1901.
  • Chaplot et al. (2020) Chaplot, D.; et al. 2020. Object goal navigation using goal-oriented semantic exploration. NeurIPS, 33: 4247–4258.
  • Chen et al. (2022) Chen, P.; et al. 2022. Learning Active Camera for Multi-Object Navigation. In NeurIPS.
  • Cho et al. (2014) Cho, K.; et al. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In arXiv:1406.1078.
  • Chung et al. (2022) Chung, H. W.; et al. 2022. Scaling instruction-finetuned language models. arXiv:2210.11416.
  • Deitke et al. (2022) Deitke, M.; et al. 2022. ProcTHOR: Large-Scale Embodied AI Using Procedural Generation. NeurIPS, 35: 5982–5994.
  • Driess et al. (2023) Driess, D.; et al. 2023. Palm-e: An embodied multimodal language model. In arXiv:2303.03378.
  • Gireesh et al. (2023) Gireesh, N.; et al. 2023. Sequence-Agnostic Multi-Object Navigation. arXiv:2305.06178.
  • He et al. (2016) He, K.; et al. 2016. Deep residual learning for image recognition. In CVPR.
  • Huang et al. (2022) Huang, W.; et al. 2022. Inner monologue: Embodied reasoning through planning with language models. In arXiv:2207.05608.
  • Hughes et al. (2022) Hughes, N.; et al. 2022. Hydra: A real-time spatial perception engine for 3d scene graph construction and optimization. In Robotics: Science and Systems.
  • Khandelwal et al. (2022) Khandelwal, A.; et al. 2022. Simple but effective: Clip embeddings for embodied ai. In CVPR, 14829–14838.
  • Kim et al. (2019) Kim, U.; et al. 2019. 3-D scene graph: A sparse and semantic representation of physical environments for intelligent agents. IEEE Transactions on Cybernetics, 50(12): 4921–4933.
  • Kolve et al. (2017) Kolve, E.; et al. 2017. Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:1712.05474.
  • Lapata (2006) Lapata, M. 2006. Automatic evaluation of information ordering: Kendall’s tau. Computational Linguistics, 32(4): 471–484.
  • Liu et al. (2023) Liu, B.; et al. 2023. LLM+P: Empowering large language models with optimal planning proficiency. In arXiv:2304.11477.
  • Marza et al. (2022) Marza, P.; et al. 2022. Teaching Agents how to Map: Spatial Reasoning for Multi-Object Navigation. In IROS.
  • Marza et al. (2023) Marza, P.; et al. 2023. Multi-Object Navigation with dynamically learned neural implicit representations. In ICCV.
  • Mishkin et al. (2019) Mishkin, D.; et al. 2019. Benchmarking classic and learned navigation in complex 3d environments. In arXiv:1901.10915.
  • Ouyang et al. (2022) Ouyang, L.; et al. 2022. Training language models to follow instructions with human feedback. In arXiv:2203.02155.
  • Peng et al. (2023) Peng, B.; et al. 2023. Instruction tuning with gpt-4. arXiv:2304.03277.
  • Ramrakhya et al. (2022) Ramrakhya, R.; et al. 2022. Habitat-web: Learning embodied object-search strategies from human demonstrations at scale. In CVPR, 5173–5183.
  • Ramrakhya et al. (2023) Ramrakhya, R.; et al. 2023. Pirlnav: Pretraining with imitation and rl finetuning for objectnav. In CVPR, 17896–17906.
  • Rana et al. (2023) Rana, K.; et al. 2023. SayPlan: Grounding Large Language Models using 3D Scene Graphs for Scalable Task Planning. In arXiv:2307.06135.
  • Rosinol et al. (2021) Rosinol, A.; et al. 2021. Kimera: From slam to spatial perception with 3d dynamic scene graphs. The International Journal of Robotics Research, 40(12–14): 1510–1546.
  • Ross et al. (2011) Ross, S.; et al. 2011. A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS. JMLR Workshop and Conference Proceedings.
  • Savva et al. (2019) Savva, M.; et al. 2019. Habitat: A Platform for Embodied AI Research. In ICCV.
  • Shah et al. (2023) Shah, D.; et al. 2023. ViNT: A Foundation Model for Visual Navigation. arXiv:2306.14846.
  • Singh et al. (2023) Singh, I.; et al. 2023. Progprompt: Generating situated robot task plans using large language models. In ICRA.
  • Song et al. (2022) Song, C. H.; et al. 2022. Llm-planner: Few-shot grounded planning for embodied agents with large language models. In arXiv:2212.04088.
  • Szot et al. (2021) Szot, A.; et al. 2021. Habitat 2.0: Training home assistants to rearrange their habitat. NeurIPS, 34: 251–266.
  • Wald et al. (2020) Wald, J.; et al. 2020. Learning 3D semantic scene graphs from 3D indoor reconstructions. In CVPR.
  • Wani et al. (2020) Wani, S.; et al. 2020. MultiON: Benchmarking Semantic Map Memory using Multi-Object Navigation. In NeurIPS.
  • Weihs et al. (2020) Weihs, L.; et al. 2020. Allenact: A framework for embodied ai research. arXiv preprint arXiv:2008.12760.
  • Wijmans et al. (2019) Wijmans, E.; et al. 2019. Dd-ppo: Learning near-perfect pointgoal navigators from 2.5 billion frames. In arXiv:1911.00357.
  • Wu et al. (2021) Wu, S.; et al. 2021. SceneGraphFusion: Incremental 3D scene graph prediction from RGB-D sequences. In CVPR.
  • Wu et al. (2018) Wu, Y.; et al. 2018. Group normalization. In ECCV, 3–19.
  • Yu et al. (2023) Yu, B.; et al. 2023. Leveraging Large Language Models for Visual Target Navigation. In arXiv:2304.05501.
  • Zeng et al. (2023) Zeng, H.; et al. 2023. Multi-Object Navigation Using Potential Target Position Policy Function. IEEE Transactions on Image Processing, 32: 2608 – 2619.
  • Zhao et al. (2023) Zhao, W. X.; et al. 2023. A survey of large language models. arXiv:2303.18223.
  • Zhu et al. (2017) Zhu, Y.; et al. 2017. Target-driven visual navigation in indoor scenes using deep reinforcement learning. In ICRA.