CAMON: Cooperative Agents for Multi-Object Navigation with LLM-based Conversations
Abstract
Visual navigation tasks are critical for household service robots. As these tasks become increasingly complex, effective communication and collaboration among multiple robots become imperative for successful completion. In recent years, large language models (LLMs) have exhibited remarkable comprehension and planning abilities in the context of embodied agents. However, their application in household scenarios, specifically to multiple agents that collaborate through communication to complete complex navigation tasks, remains unexplored. This paper therefore proposes a framework for decentralized multi-agent navigation that leverages LLM-enabled communication and collaboration. By designing a communication-triggered dynamic leadership organization, we achieve faster team consensus with fewer communication instances, leading to better navigation effectiveness and collaborative exploration efficiency. With the proposed communication scheme, our framework promises to be conflict-free and robust in multi-object navigation tasks, even as the team size grows.
I Introduction

In recent years, household visual navigation utilizing large language models (LLMs) has advanced rapidly. Previous methods [41, 12, 27, 4, 8] have leveraged LLMs as scene-understanding tools and planners, yielding promising application results. However, these approaches are constrained to single-agent navigation and do not offer viable solutions for effective communication and collaborative planning among multiple agents [34]. When tasked with searching for and locating various objects within a household environment, a single robot faces significant challenges, leading to low efficiency and high failure rates. In multi-object navigation scenarios [30, 22, 14], it is essential for multiple robots to collaborate to accomplish these tasks effectively.
Successfully completing multi-agent tasks requires a team to possess three key abilities: (1) extracting useful information from observations, i.e., determining the content of communication; (2) a conflict-free communication mechanism, i.e., identifying with whom to communicate; and (3) a global planning capability, i.e., planning after communication. To achieve these abilities, we design a novel framework tailored to multi-agent navigation, illustrated in Figure 1. The method accomplishes cooperative multi-target tasks through structured scene descriptions and an ordered communication mechanism.
To logically organize and summarize observations, we focus on the layout patterns of indoor scenes, where the placement of objects in household environments is often related to the properties of the room [28]. For example, a room with a bed is typically a bedroom, where pillows, televisions, and similar items are commonly found, while toilets and microwaves are not. We therefore encode such object-room relationships in the navigation representation, dividing the observed scene into individual rooms and generating a description of each room for subsequent communication-based task division. For instance, upon entering a room and identifying it as a living area from its layout, a robot should promptly and exclusively search for all potential targets within that space; to keep the team efficient, the other robots should avoid entering rooms that have already been identified as living areas.
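As a toy illustration of this object-room intuition (not the paper's actual representation; the room categories and object sets below are assumptions), such a prior might be expressed as:

```python
# Toy object-room prior illustrating why room identity guides task division.
# Room categories and object sets are illustrative assumptions.
ROOM_OBJECT_PRIOR = {
    "bedroom":     {"bed", "pillow", "tv"},
    "living_room": {"sofa", "tv", "potted plant"},
    "bathroom":    {"toilet", "sink"},
    "kitchen":     {"microwave", "sink", "refrigerator"},
}

def likely_rooms(target: str) -> list[str]:
    """Return the room types in which a target object is plausibly found."""
    return [room for room, objs in ROOM_OBJECT_PRIOR.items() if target in objs]

# likely_rooms("toilet") -> ["bathroom"]: once a robot identifies a bathroom,
# it should claim "toilet" there, and teammates can skip that room.
```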
In multi-agent embodied tasks, the organizational structure of the team is crucial [7, 15]. Prior research [10] indicates that leadership-based communication patterns expedite consensus achievement, whereas dynamic leadership allocation further enhances team coordination and effectiveness [15]. To establish an efficient and stable communication system, we design a comm-triggered ("comm" is short for "communication" throughout this paper) dynamic leadership model as an advanced form of decentralized communication. Our contributions can be summarized as follows (this work is in progress):
• We have designed a comprehensive framework for multi-agent navigation tasks utilizing LLMs, encompassing modules for perception, communication, and cooperative planning.
• Our proposed multi-agent communication mechanism facilitates adaptive task division and planning for complex navigation tasks.
• We have developed a dynamic leadership mechanism, activated by agents' communication requests, that facilitates the distribution of the information exchange workload in distributed systems.
II Related Work
II-A Visual Object Navigation
Target object navigation requires robots to swiftly locate and approach a target object in an unfamiliar environment. Recent research in this field has primarily branched into two mainstream approaches: end-to-end network model-based frameworks [3, 20, 23, 39, 6, 13, 4] and modular map-based frameworks [24, 5, 9, 41, 35, 27, 32, 28, 19]. End-to-end methods exhibit good transferability but are comparatively weak in navigation efficiency and task success rate [5]. Conversely, the modular approach, guided by hierarchical maps, requires meticulously designed modules but enables highly effective navigation. With the increasing complexity of object-finding tasks, multi-object navigation (MultiON) tasks [30] and methods [25, 22, 36] have emerged as advanced versions of single-target object navigation. However, existing MultiON methods predominantly address pre-sequenced MultiON, where the robot receives a predefined sequence for exploring the target object classes. To demonstrate the flexibility of the task and the adaptive planning capability of the proposed framework, we adopt the Sequence Agnostic MultiON (SAM) [14] task for evaluation. In this setting, the robot neither receives nor is required to follow a global order for locating and navigating to instances of the target object classes. Instead, the robot explores probable locations of the target objects and dynamically adapts its exploration based on observations.
II-B LLM-Based Cooperative Embodied Agents
Recent work [21, 38, 37] has demonstrated the feasibility of feeding observations in linguistic form into large language models for communication and decision-making in Multi-Agent Systems (MAS). Most of this work is structured hierarchically to ensure the proper functioning of the MAS. The mainstream LLM-based multi-agent planning frameworks fall into two major branches: centralized [40, 2, 34, 10] and decentralized [10, 18, 38, 33, 29].
In the centralized organization, the LLMs comprehend the observations, history, and task progress of multiple agents and collaboratively allocate tasks to each robot group [40] or individual [2, 34]. Specifically, Yu et al. [34] implement a centralized multi-agent navigation framework, extracting frontier and semantic information from the map and utilizing LLMs to allocate exploratory areas to each robot. Such frameworks achieve good coordination and planning performance in small-scale groups. However, as the team size scales up, the communication and information-processing burden of centralized leadership increases, posing challenges to reasonable and timely planning [10].
In decentralized systems, each robot acts as an independent entity with self-autonomy, exchanging historical observations through human-like verbal communication and making adaptive decisions [33]. In particular, Zhang et al. [38] provide a systematic template for decentralized communication and collaboration. This method categorizes each agent’s execution in the MAS into five modules: observation, belief, communication, reasoning, and planning, with the LLMs facilitating inter-agent communication and reasoning.
II-C Multi-Agent Organizational Structures
Recent studies have delved into the impact of organizational structures among multiple agents on task division, planning conflicts, and communication costs in embodied tasks. Through experiments, Chen et al. [10] demonstrate that a hierarchical organizational structure with leadership significantly outperforms plain decentralized and centralized structures in terms of effectiveness. Chen et al. [7] also analyze communication patterns among multiple agents and validate, on a simple task in which multiple agents move to a common point, that a leader-organized framework achieves faster convergence. Guo et al. [15] further investigate the organizational forms of leaders and conclude that dynamic leaders are the most effective in multi-agent collaboration. However, the scheme proposed in [15] assigns leaders based on the importance of the tasks performed by each robot, which does not work well when tasks are equally important. Here, we propose a comm-triggered dynamic leader strategy that performs well in MultiON tasks.
III Problem Setup
We consider multiple collaborative robots participating in MultiON tasks, where they jointly explore and approach target objects in indoor environments. At the beginning of each episode, two or more robots are randomly placed in an unfamiliar environment and tasked with searching for a common set of target objects. The robots have neither been trained to find these objects nor been given prior information about similar scenes. They must coordinate their exploration and locate the targets as quickly as possible using general knowledge and common sense. Through communication, the robots can collaborate to understand the scene, divide tasks, and navigate to the target objects' locations efficiently.
When a robot correctly recognizes and approaches a target object, the sub-task is considered successful. Conversely, if a robot navigates to a wrong object, the sub-task is considered a failure. Regardless of success or failure, the robot continues searching for the remaining objects. Additionally, the episode ends when all robots reach the maximum step limit, to avoid endless movement.
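A minimal sketch of this episode bookkeeping is given below; the success radius and step limit are assumed placeholder values, not numbers from the paper.

```python
# Minimal sketch of the MultiON episode bookkeeping described above.
# SUCCESS_RADIUS and MAX_STEPS are assumed placeholders, not values from the paper.
from dataclasses import dataclass, field

SUCCESS_RADIUS = 1.0   # assumed distance (m) at which a target counts as "approached"
MAX_STEPS = 500        # assumed per-robot step limit

@dataclass
class Episode:
    remaining_targets: set                        # target object classes not yet found
    steps: dict = field(default_factory=dict)     # robot_id -> steps taken

    def report_stop(self, robot_id: str, object_class: str, distance: float) -> bool:
        """Called when a robot stops at what it believes is a target instance."""
        if object_class in self.remaining_targets and distance <= SUCCESS_RADIUS:
            self.remaining_targets.discard(object_class)   # sub-task success
            return True
        return False                                       # wrong object: sub-task failure

    def finished(self) -> bool:
        all_capped = bool(self.steps) and all(s >= MAX_STEPS for s in self.steps.values())
        return not self.remaining_targets or all_capped
```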

IV CAMON Approach
The key idea of CAMON is to provide a comprehensive framework for completing multi-agent navigation tasks through LLM-based communication and planning. As shown in Figure 2, the central technical approach is as follows. (1) In the perception module, which understands and describes observed scenes (IV-A), each robot maintains a local map (IV-A1), divides the map into rooms, extracts the most relevant historical image frames for each room, uses a large multimodal model (GPT-4o) to interpret these frames, and generates language descriptions of the observed scenes (IV-A2). (2) In the comm module, which handles communication and decision-making (IV-B), we employ a sequential and dynamic leader-member communication structure, in which the leader adjusts each agent's decision proposals (IV-B2) and ensures team coordination (IV-B3). Through the decisions made during communication, a member is assigned responsibility for finding specific objects and receives its next target room from the leader. (3) At the path-planning level (IV-C), we plan a sequence of waypoints on the map from the current location to the target room and generate discrete actions.
IV-A Perception Module
IV-A1 Map Construction
To record historical semantic information, each robot constructs and updates a local semantic map in real time using its poses and RGB-D images. The map records the occupancy of obstacle areas, accessible areas, and semantic information. Inspired by the methodology in [42, 17, 32], we extract waypoints and a topological map from the accessible-area channels of the map to optimize point-to-point movement.
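A minimal sketch of such a multi-channel local map and waypoint extraction is shown below; the grid size, channel layout, and skeleton-based shortcut are assumptions, and the paper's exact Voronoi-based construction may differ.

```python
# Sketch of a multi-channel local map and waypoint extraction from the
# accessible-area channel. Grid size, channel layout, and the skeleton-based
# shortcut are assumptions; the actual construction may differ.
import numpy as np
from skimage.morphology import skeletonize

H, W, N_CLASSES = 480, 480, 20                          # assumed resolution and class count

occupancy  = np.zeros((H, W), dtype=bool)               # obstacle cells
free_space = np.zeros((H, W), dtype=bool)               # accessible cells
semantics  = np.zeros((N_CLASSES, H, W), dtype=bool)    # per-class object masks

def extract_waypoints(free_mask: np.ndarray, stride: int = 25):
    """Thin the free-space mask to a 1-pixel skeleton and subsample waypoints."""
    skeleton = skeletonize(free_mask)
    ys, xs = np.nonzero(skeleton)
    return list(zip(xs[::stride].tolist(), ys[::stride].tolist()))
```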
IV-A2 Room Description
Following the principle of 3D room segmentation [31, 16], the observed rooms are segmented to obtain a mask for each room on the map. To describe each room, we use a large multimodal model (LMM) to read a holistic image of the entire room and generate the corresponding description. To capture such a comprehensive image, we exploit the robot's manner of moving between nodes on the topological map: the robot rotates in place to collect image frames each time it passes a waypoint. For each segmented room, we select, from the recorded frames, the image that best captures the view of the room. We then use GPT-4o [1] to generate descriptions of the scenes in these images.
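A sketch of this per-room description step is given below; the frame-selection heuristic (nearest waypoint to the room centroid) and the query_lmm wrapper are assumptions standing in for the authors' implementation and the GPT-4o call.

```python
# Sketch: pick one representative frame per room and ask an LMM to describe it.
# The centroid-based frame selection and the query_lmm wrapper are assumptions.
import numpy as np

def best_frame_for_room(room_mask: np.ndarray, frames):
    """frames: list of (waypoint_xy, image). Use the frame captured at the
    waypoint closest to the room centroid as the 'holistic' view of the room."""
    ys, xs = np.nonzero(room_mask)
    centroid = np.array([xs.mean(), ys.mean()])
    _, image = min(frames, key=lambda f: np.linalg.norm(np.asarray(f[0]) - centroid))
    return image

def describe_room(room_mask: np.ndarray, frames, query_lmm) -> str:
    image = best_frame_for_room(room_mask, frames)
    prompt = ("Describe this room in one short paragraph: state its likely type "
              "(e.g., bedroom, kitchen) and list the notable objects visible.")
    return query_lmm(image, prompt)   # e.g., a thin wrapper around a GPT-4o call
```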
IV-B Planning Module
During navigation, each embodied agent carefully observes the current room and updates its semantic map each time it enters a new room. When target objects that remain unfound are detected, the agent sequentially navigates to their locations. Once the entire room has been explored and no remaining target objects are present in it, the robot decides which room to explore next and which objects to take responsibility for by communicating with the current leader through the communication module.
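This in-room decision logic can be summarized by the following sketch, rendered as a pure function over assumed inputs rather than the authors' code:

```python
# Sketch of the per-room decision policy: approach detected targets first,
# otherwise ask the leader once the room is exhausted. Inputs are assumptions.
def next_decision(detected_objects, remaining_targets, room_fully_explored):
    in_room_targets = [o for o in detected_objects if o in remaining_targets]
    if in_room_targets:
        return ("navigate", in_room_targets)   # sequentially approach targets found here
    if room_fully_explored:
        return ("ask_leader", None)            # trigger the communication module (IV-B)
    return ("keep_exploring", None)            # continue scanning the current room
```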
IV-B1 Communication and Leadership Appointment
In team collaboration, the presence of a leader greatly affects communication efficiency and task completion. An orderly organization must answer two questions: who the leader is and what role the leader plays. In this framework, we adopt a comm-triggered dynamic leadership mechanism to answer the first question while addressing the imbalanced information flow and poor robustness of fixed leadership. As an episode starts, one robot collects preliminary room descriptions from all robots as the global information and becomes the temporary leader. When another robot later requests assistance, it sends a request to the current leader, who provides suggestions and conveys the global information to the requester. The requester then updates its own portion of the global information and inherits the leadership. In the subsequent process, robots continuously send communication requests, so the global information and the leadership are passed sequentially among the robots. We answer the second question by endowing the dynamic leader with temporary authority to command any robot and with access to the latest global information. This collaborative approach shares the communication load among robots, and even if a robot crashes (including the temporary leader), the remaining robots can maintain system stability by asking the previous leader.
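A minimal sketch of this comm-triggered leadership handoff is shown below; the class and field names are illustrative placeholders, not the authors' implementation.

```python
# Sketch of comm-triggered dynamic leadership: the global information and the
# leader role travel with each answered request. Names are placeholders.
class GlobalInfo:
    def __init__(self, room_descriptions):
        self.rooms = dict(room_descriptions)   # robot_id -> room descriptions
        self.progress = {}                     # robot_id -> found / locked objects

class LeadershipToken:
    """Held by exactly one robot at a time; passed to whoever asks for help."""
    def __init__(self, holder_id, global_info):
        self.holder_id = holder_id
        self.global_info = global_info

def handle_request(token, requester_id, requester_update, coordinate_fn):
    """Current leader answers a request, then hands leadership to the requester."""
    reply = coordinate_fn(token.global_info, requester_id)      # IV-B3 coordination
    token.global_info.rooms[requester_id] = requester_update    # requester refreshes its part
    token.holder_id = requester_id                               # leadership is inherited
    return reply, token
```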
IV-B2 Agent Asks for Help
Due to the complexity of multi-agent task planning, the LLM needs clear historical observations and team conditions to enhance its understanding and planning performance. Before sending a communication request to the current leader, a robot first makes an initial decision using the LLM based on its recorded historical information. This decision focuses on the benefit of the robot itself and is subsequently conveyed to the leader, who determines whether it would cause conflicts with other robots. When the robot receives the leader's response, containing the target room and the objects locked to it so that others do not search for them, it moves to the target room. We use a prompt template $\mathcal{T}$ to concatenate the agent id $i$, task progress $p_i$, recorded states $s_i$, shared goals $g$, conversation history $h_i$, and optional actions $A_i$. The LLM then generates the initial proposal $P_i$, containing the locked objects $o_i$, the action $a_i$ (i.e., the target room), and the thoughts $t_i$. The process can be formulated as:

$P_i = \{o_i, a_i, t_i\} = \mathrm{LLM}\big(\mathcal{T}(i, p_i, s_i, g, h_i, A_i)\big)$   (1)
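A sketch of assembling this proposal prompt and parsing the LLM output follows; the template wording, the JSON output format, and the call_llm wrapper are assumptions rather than the authors' prompts.

```python
# Sketch of Formula (1): build the proposal prompt from the agent's records and
# parse the LLM reply. Template wording, JSON format, and call_llm are assumptions.
import json

PROPOSAL_TEMPLATE = """You are robot {agent_id} in a multi-robot object search team.
Task progress: {progress}
Your recorded room descriptions and states: {states}
Shared target objects: {goals}
Recent conversation: {history}
Candidate target rooms: {actions}
Reply in JSON with keys "locked_objects", "target_room", and "thoughts"."""

def initial_proposal(agent_id, progress, states, goals, history, actions, call_llm):
    prompt = PROPOSAL_TEMPLATE.format(
        agent_id=agent_id, progress=progress, states=states,
        goals=goals, history=history, actions=actions)
    reply = call_llm(prompt)          # e.g., one chat-completion call
    return json.loads(reply)          # {"locked_objects": ..., "target_room": ..., "thoughts": ...}
```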
IV-B3 Leader’s Coordination of Teamwork
When receiving a request from a team member, the leader evaluates the initial proposal from that agent and avoids team conflicts based on the currently recorded global states of all agents and the global task progress. We again employ the LLM to handle this process, as demonstrated in Formula (2):

$R = \mathrm{LLM}\big(\mathcal{T}_L(P_i, \mathcal{P}, \mathcal{S})\big)$   (2)

where $\mathcal{T}_L$ represents the leader's prompt template and $P_i$ denotes the initial proposal from the $i$-th agent. The global task progress and global states, currently managed by the leader, are indicated by $\mathcal{P}$ and $\mathcal{S}$, respectively. The coordination result generated by the LLM is $R = \{r_1, \dots, r_N\}$, where $r_i$ is the response to the $i$-th agent that initiated the request, including the assigned action $a_i$, which supports or opposes the original proposal from the team member, as well as the thoughts $t_i$. The remaining responses $r_{j \neq i}$ contain decisions on whether to interrupt the $j$-th agent, as well as the actions assigned to it (if interrupted).
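A corresponding sketch of the leader-side coordination call in Formula (2) is given below; again, the template wording, output format, and call_llm wrapper are assumptions.

```python
# Sketch of Formula (2): the leader checks a member's proposal against the
# global states and progress. Template wording and output format are assumptions.
import json

LEADER_TEMPLATE = """You are the temporary leader of a multi-robot object search team.
Global task progress: {progress}
Global states of all robots: {states}
Robot {agent_id} proposes: {proposal}
For robot {agent_id}, reply whether to support or oppose the proposal and assign a
target room; for every other robot, say whether to interrupt it and, if so, which
room to assign. Reply in JSON keyed by robot id with "action" and "thoughts"."""

def coordinate(agent_id, proposal, global_progress, global_states, call_llm):
    prompt = LEADER_TEMPLATE.format(
        agent_id=agent_id, proposal=proposal,
        progress=global_progress, states=global_states)
    return json.loads(call_llm(prompt))   # {robot_id: {"action": ..., "thoughts": ...}, ...}
```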
IV-C Motion Planning
Given the robot's current position and the target room, the robot selects the Voronoi point closest to its current position within the target room. It then uses Dijkstra's algorithm [11] to plan the next waypoint along the topological edges as a mid-term goal. The Fast Marching Method [26] is then employed to plan, in real time, the shortest point sequence and the next discrete action from the current position to the mid-term target point. Whenever the robot passes through a topological waypoint, it rotates once to collect surrounding images.
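A minimal sketch of the waypoint-level step (Dijkstra over the topological graph) is shown below; the graph representation is an assumption, and the grid-level Fast Marching step is omitted.

```python
# Sketch of mid-term goal selection: Dijkstra over the topological waypoint
# graph from the current waypoint toward the target room's nearest waypoint.
# The graph format is an assumption; the FMM grid planner is omitted here.
import heapq

def dijkstra_next_waypoint(graph, start, goal):
    """graph: {node: [(neighbor, edge_cost), ...]}. Returns the first waypoint
    on the shortest path from start to goal (or None if unreachable)."""
    dist, prev = {start: 0.0}, {}
    heap = [(0.0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == goal:
            break
        if d > dist.get(u, float("inf")):
            continue                                  # stale heap entry
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    if goal != start and goal not in prev:
        return None                                   # target room unreachable
    node = goal
    while prev.get(node, start) != start:
        node = prev[node]                             # walk back to the first hop
    return node
```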
V Conclusion
In this work, we propose a fully decentralized, LLM-based multi-agent collaborative navigation framework. The agents communicate effectively, divide tasks efficiently, and collaborate through a dynamic leadership organization. Our method achieves task division and team planning consensus with minimal communication overhead, resulting in strong navigation performance and robustness in the multi-agent system. We believe our approach holds promising applications for future team collaboration of household mobile agents.
Limitations & Future Work. Several limitations remain in the current approach. First, while our mapping perception module can effectively map all observed objects, it may struggle with dynamic objects, such as humans or pets, in household scenarios, which could adversely affect the robot's mapping and room segmentation. Additionally, our current framework is restricted to single-floor navigation and does not account for cross-floor cooperative navigation. However, we believe these limitations are not central to the issues addressed in this article and can be mitigated by incorporating additional strategy modules. Our work demonstrates promising application prospects and offers feasible ideas for future multi-robot systems. Future research will focus on coordinating movement and manipulation in multi-robot systems, aiming to complete an integrated framework for communication, navigation, and manipulation. These important topics are left for future exploration.
References
- [1] GPT-4o. https://openai.com/index/hello-gpt-4o/.
- Agashe et al. [2023] Saaket Agashe, Yue Fan, and Xin Eric Wang. Evaluating multi-agent coordination abilities in large language models. arXiv preprint arXiv:2310.03903, 2023.
- Anderson et al. [2018] Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton Van Den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3674–3683, 2018.
- Cai et al. [2023] Wenzhe Cai, Siyuan Huang, Guangran Cheng, Yuxing Long, Peng Gao, Changyin Sun, and Hao Dong. Bridging zero-shot object navigation and foundation models through pixel-guided navigation skill. arXiv preprint arXiv:2309.10309, 2023.
- Chaplot et al. [2020] Devendra Singh Chaplot, Dhiraj Prakashchand Gandhi, Abhinav Gupta, and Russ R Salakhutdinov. Object goal navigation using goal-oriented semantic exploration. Advances in Neural Information Processing Systems, 33:4247–4258, 2020.
- Chen et al. [2023a] Hongyi Chen, Ruinian Xu, Shuo Cheng, Patricio A Vela, and Danfei Xu. Zero-shot object searching using large-scale object relationship prior. arXiv preprint arXiv:2303.06228, 2023a.
- Chen et al. [2023b] Huaben Chen, Wenkang Ji, Lufeng Xu, and Shiyu Zhao. Multi-agent consensus seeking via large language models. arXiv preprint arXiv:2310.20151, 2023b.
- Chen et al. [2024] Jiaqi Chen, Bingqian Lin, Ran Xu, Zhenhua Chai, Xiaodan Liang, and Kwan-Yee K Wong. Mapgpt: Map-guided prompting for unified vision-and-language navigation. arXiv preprint arXiv:2401.07314, 2024.
- Chen et al. [2023c] Junting Chen, Guohao Li, Suryansh Kumar, Bernard Ghanem, and Fisher Yu. How to not train your dragon: Training-free embodied object goal navigation with semantic frontiers. In Robotics: Science and Systems (RSS), 2023c.
- Chen et al. [2023d] Yongchao Chen, Jacob Arkin, Yang Zhang, Nicholas Roy, and Chuchu Fan. Scalable multi-robot collaboration with large language models: Centralized or decentralized systems? arXiv preprint arXiv:2309.15943, 2023d.
- Dijkstra [1959] E. W. Dijkstra. A note on two problems in connexion with graphs. Numerische Mathematik, 1:269–271, 1959.
- Dorbala et al. [2024] Vishnu Sashank Dorbala, James F. Mullen, and Dinesh Manocha. Can an embodied agent find your “cat-shaped mug”? llm-based zero-shot object navigation. IEEE Robotics and Automation Letters, 9(5):4083–4090, 2024. doi: 10.1109/LRA.2023.3346800.
- Gadre et al. [2023] Samir Yitzhak Gadre, Mitchell Wortsman, Gabriel Ilharco, Ludwig Schmidt, and Shuran Song. Cows on pasture: Baselines and benchmarks for language-driven zero-shot object navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23171–23181, 2023.
- Gireesh et al. [2023] Nandiraju Gireesh, Ayush Agrawal, Ahana Datta, Snehasis Banerjee, Mohan Sridharan, Brojeshwar Bhowmick, and Madhava Krishna. Sequence-agnostic multi-object navigation. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 9573–9579. IEEE, 2023.
- Guo et al. [2024] Xudong Guo, Kaixuan Huang, Jiale Liu, Wenhui Fan, Natalia Vélez, Qingyun Wu, Huazheng Wang, Thomas L Griffiths, and Mengdi Wang. Embodied llm agents learn to cooperate in organized teams. arXiv preprint arXiv:2403.12482, 2024.
- Hughes et al. [2022] Nathan Hughes, Yun Chang, and Luca Carlone. Hydra: A real-time spatial perception system for 3d scene graph construction and optimization. arXiv preprint arXiv:2201.13360, 2022.
- Kwon et al. [2023] Obin Kwon, Jeongho Park, and Songhwai Oh. Renderable neural radiance map for visual navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9099–9108, 2023.
- Liu et al. [2023] Jijia Liu, Chao Yu, Jiaxuan Gao, Yuqing Xie, Qingmin Liao, Yi Wu, and Yu Wang. Llm-powered hierarchical language agent for real-time human-ai coordination. arXiv preprint arXiv:2312.15224, 2023.
- Ma et al. [2024] Ji Ma, Hongming Dai, Yao Mu, Pengying Wu, Hao Wang, Xiaowei Chi, Yang Fei, Shanghang Zhang, and Chang Liu. Doze: A dataset for open-vocabulary zero-shot object navigation in dynamic environments. arXiv preprint arXiv:2402.19007, 2024.
- Majumdar et al. [2022] Arjun Majumdar, Gunjan Aggarwal, Bhavika Suresh Devnani, Judy Hoffman, and Dhruv Batra. ZSON: Zero-shot object-goal navigation using multimodal goal embeddings. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=VY1dqOF2RjC.
- Mandi et al. [2023] Zhao Mandi, Shreeya Jain, and Shuran Song. Roco: Dialectic multi-robot collaboration with large language models. arXiv preprint arXiv:2307.04738, 2023.
- Marza et al. [2023] Pierre Marza, Laetitia Matignon, Olivier Simonin, and Christian Wolf. Multi-object navigation with dynamically learned neural implicit representations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11004–11015, 2023.
- Park et al. [2023] Jeongeun Park, Taerim Yoon, Jejoon Hong, Youngjae Yu, Matthew Pan, and Sungjoon Choi. Zero-shot active visual search (zavis): Intelligent object search for robotic assistants. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 2004–2010, 2023. doi: 10.1109/ICRA48891.2023.10161345.
- Ramakrishnan et al. [2022] Santhosh Kumar Ramakrishnan, Devendra Singh Chaplot, Ziad Al-Halah, Jitendra Malik, and Kristen Grauman. Poni: Potential functions for objectgoal navigation with interaction-free learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18890–18900, 2022.
- Sadek et al. [2023] Assem Sadek, Guillaume Bono, Boris Chidlovskii, Atilla Baskurt, and Christian Wolf. Multi-object navigation in real environments using hybrid policies. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 4085–4091, 2023. doi: 10.1109/ICRA48891.2023.10161030.
- Sethian [1996] James A Sethian. A fast marching level set method for monotonically advancing fronts. Proceedings of the National Academy of Sciences, 93(4):1591–1595, 1996.
- Shah et al. [2023] Dhruv Shah, Michael Equi, Blazej Osinski, Fei Xia, Brian Ichter, and Sergey Levine. Navigation with large language models: Semantic guesswork as a heuristic for planning. In 7th Annual Conference on Robot Learning, 2023. URL https://openreview.net/forum?id=PsV65r0itpo.
- Sun et al. [2024] Leyuan Sun, Asako Kanezaki, Guillaume Caron, and Yusuke Yoshiyasu. Leveraging large language model-based room-object relationships knowledge for enhancing multimodal-input object goal navigation. arXiv preprint arXiv:2403.14163, 2024.
- Wang et al. [2024] Jun Wang, Guocheng He, and Yiannis Kantaros. Safe task planning for language-instructed multi-robot systems using conformal prediction. arXiv preprint arXiv:2402.15368, 2024.
- Wani et al. [2020] Saim Wani, Shivansh Patel, Unnat Jain, Angel X. Chang, and Manolis Savva. Multion: Benchmarking semantic map memory using multi-object navigation. In NeurIPS, 2020.
- Werby et al. [2024] Abdelrhman Werby, Chenguang Huang, Martin Büchner, Abhinav Valada, and Wolfram Burgard. Hierarchical open-vocabulary 3d scene graphs for language-grounded robot navigation. In First Workshop on Vision-Language Models for Navigation and Manipulation at ICRA 2024, 2024.
- Wu et al. [2024] Pengying Wu, Yao Mu, Bingxian Wu, Yi Hou, Ji Ma, Shanghang Zhang, and Chang Liu. Voronav: Voronoi-based zero-shot object navigation with large language model. arXiv preprint arXiv:2401.02695, 2024.
- Ying et al. [2024] Lance Ying, Kunal Jha, Shivam Aarya, Joshua B Tenenbaum, Antonio Torralba, and Tianmin Shu. Goma: Proactive embodied cooperative communication via goal-oriented mental alignment. arXiv preprint arXiv:2403.11075, 2024.
- Yu et al. [2023a] Bangguo Yu, Hamidreza Kasaei, and Ming Cao. Co-navgpt: Multi-robot cooperative visual semantic navigation using large language models. arXiv preprint arXiv:2310.07937, 2023a.
- Yu et al. [2023b] Bangguo Yu, Hamidreza Kasaei, and Ming Cao. L3mvn: Leveraging large language models for visual target navigation. arXiv preprint arXiv:2304.05501, 2023b.
- Zeng et al. [2023] Haitao Zeng, Xinhang Song, and Shuqiang Jiang. Multi-object navigation using potential target position policy function. IEEE Transactions on Image Processing, 32:2608–2619, 2023. doi: 10.1109/TIP.2023.3263110.
- Zhang et al. [2023a] Bin Zhang, Hangyu Mao, Jingqing Ruan, Ying Wen, Yang Li, Shao Zhang, Zhiwei Xu, Dapeng Li, Ziyue Li, Rui Zhao, et al. Controlling large language model-based agents for large-scale decision-making: An actor-critic approach. arXiv preprint arXiv:2311.13884, 2023a.
- Zhang et al. [2023b] Hongxin Zhang, Weihua Du, Jiaming Shan, Qinhong Zhou, Yilun Du, Joshua B Tenenbaum, Tianmin Shu, and Chuang Gan. Building cooperative embodied agents modularly with large language models. arXiv preprint arXiv:2307.02485, 2023b.
- Zhao et al. [2023] Qianfan Zhao, Lu Zhang, Bin He, Hong Qiao, and Zhiyong Liu. Zero-shot object goal visual navigation. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 2025–2031, 2023. doi: 10.1109/ICRA48891.2023.10161289.
- Zhao et al. [2024] Zhonghan Zhao, Kewei Chen, Dongxu Guo, Wenhao Chai, Tian Ye, Yanting Zhang, and Gaoang Wang. Hierarchical auto-organizing system for open-ended multi-agent navigation. arXiv preprint arXiv:2403.08282, 2024.
- Zhou et al. [2023] Kaiwen Zhou, Kaizhi Zheng, Connor Pryor, Yilin Shen, Hongxia Jin, Lise Getoor, and Xin Eric Wang. Esc: Exploration with soft commonsense constraints for zero-shot object navigation. arXiv preprint arXiv:2301.13166, 2023.
- Zuo et al. [2020] Xinkai Zuo, Fan Yang, Yifan Liang, Zhou Gang, Fei Su, Haihong Zhu, and Lin Li. An improved autonomous exploration framework for indoor mobile robotics using reduced approximated generalized voronoi graphs. ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 1:351–359, 2020.