
ReVoLT: Relational Reasoning and Voronoi Local Graph Planning
for Target-driven Navigation

Junjia Liu¹,³, Jianfei Guo²,³, Zehui Meng³, Jingtao Xue³
¹Department of Mechanical and Automation Engineering, The Chinese University of Hong Kong
²School of Automation Science and Engineering, Xi’an Jiaotong University
³Application Innovate Laboratory (2012 Laboratories), Huawei Technologies Co., Ltd., Beijing, 100038, China
[email protected], [email protected], {mengzehui, xuejingtao}@huawei.com
Abstract

Embodied AI is an inevitable trend that emphasizes the interaction between intelligent entities and the real world, with broad applications in robotics, especially target-driven navigation. This task requires the robot to find an object of a certain category efficiently in an unknown domestic environment. Recent works focus on exploiting layout relationships with graph neural networks (GNNs). However, most of them obtain robot actions directly from observations in an end-to-end manner via an incomplete relation graph, which is neither interpretable nor reliable. We decouple this task and propose ReVoLT, a hierarchical framework: (a) an object detection visual front-end, (b) a high-level reasoner (infers semantic sub-goals), (c) an intermediate-level planner (computes geometrical positions), and (d) a low-level controller (executes actions). ReVoLT operates with a multi-layer semantic-spatial topological graph. The reasoner uses multiform structured relations as priors, which are obtained from combinatorial relation extraction networks composed of unsupervised GraphSAGE, GCN, and GraphRNN-based Region Rollout. The reasoner runs Upper Confidence Bound for Trees (UCT) to infer semantic sub-goals, accounting for the trade-off between exploitation (depth-first searching) and exploration (regretting). The lightweight intermediate-level planner generates instantaneous spatial sub-goal locations via an online constructed Voronoi local graph. Simulation experiments demonstrate that our framework achieves better performance in target-driven navigation tasks and generalizes well, with an 80% improvement over the existing state-of-the-art method. The code and result video will be released at https://ventusff.github.io/ReVoLT-website/.

Index Terms:
Relational reasoning, combinatorial relation graph neural networks, UCT bandit, online Voronoi local graph

I Introduction

Finding objects efficiently in complex houses is a prerequisite for domestic service robots. Robots need to reason and make dynamic decisions while interacting with the real-world environment. Embodied AI, proposed by Matej Hoffmann and Rolf Pfeifer [1], suggests that to truly understand how the human brain works, a brain should be embedded in a physical body that explores and interacts with the real world. Among all the work practicing Embodied AI in recent years, target-driven navigation (TDN) is one of the most feasible and essential tasks; it combines techniques from both machine learning and robotics and is widely applicable to scenarios such as domestic service robots. It typically requires the robot to find a target object of a certain category in an unknown scene, demanding both high efficiency and a high success rate. Hence, the key problems of the TDN task are generalizing across unknown domains and exploring efficiently.

The traditional Simultaneous Localization and Mapping (SLAM) pipeline has already handled TDN to some extent [2], but numerous problems remain in its major modules. First, it remains troublesome for SLAM-based methods to acquire and maintain a lifelong updating semantic map, which demands accurate sensors and semantic information. Second, SLAM-based methods are inherently less adaptive to posterior information, so they do not generalize well in complicated environments, especially indoor scenes. Last but not least, SLAM-based methods are not specially designed for searching objects in unknown environments, which requires keeping a balance between exploitation (depth-first searching) and exploration (regretting).

Recently, learning-based methods have emerged and shown powerful capabilities for solving complicated tasks. However, these methods generally have problems with interpretability and generalization, especially in the TDN task, which requires robots to operate in unseen domains. We argue that it is more natural and effective to introduce priors [3][4] into the learning model instead of training from scratch, considering how humans teach babies. Introducing priors enables algorithms to achieve higher data efficiency, better model interpretability, and better generalization. In indoor TDN tasks, one of the most useful priors is the relationship among objects and rooms of different categories. Some recent works reason about the target direction using object relationships as priors in single-room environments [5, 6, 7]. However, common domestic scenes are composed of multiple rooms, so more prior information such as room connections, object-in-room membership, and other implicitly structured relationships could be exploited, which is typically ignored in these works.


Figure 1: The main hierarchical framework of the ReVoLT method, which contains a high-level reasoner (infers semantic sub-goals), an intermediate-level planner (computes spatial sub-goal locations), and a low-level controller (computes actions). The combinatorial relation extraction module provides priors on the exploration value of observed objects and regions through embedding similarity. In particular, the Region Rollout model provides Monte Carlo simulations for UCT in a conditional GraphRNN (c-GraphRNN) manner.

In this paper, we propose a hierarchical navigation framework, Relational Reasoning and Voronoi Local graph planning (ReVoLT), which comprises a combinatorial graph neural network for multiform domestic relation extraction, a UCT-based reasoning and exploration scheme, and an online Voronoi local graph for the semantic-spatial transition. The TDN task is concisely decomposed, allowing separate, specialized designs for different modules instead of operating in a mixed-up end-to-end manner. The detailed contributions are as follows:

  • To extract multiform structural relations for reasoning, we propose combining unsupervised GraphSAGE [8], self-supervised GCN, and c-GraphRNN methods for learning object embedding, region embedding, and region rollout, respectively.

  • Based on the relation priors, the high-level reasoner (semantic reasoning) is abstracted as a bandit problem and adopts UCT to balance exploitation (depth-first searching) and exploration (regretting).

  • Voronoi local graphs are constructed online for converting semantic sub-goals to spatial locations.

  • Test results show that the proposed framework is superior to state-of-the-art methods, achieving a higher success rate and success weighted by path length (SPL) with good generalization.

II Related Works

Graphs are useful for representing relationships among non-Euclidean data and have recently become popular for representing the physical structure of redundant robots to learn coordination between their multiple degrees of freedom [9], learning traffic signal co-regulation across multi-junction networks [10], and representing the state changes of deformable and soft objects. It is also reasonable and natural to use graphs for relational reasoning to solve TDN problems. They have the advantage of replacing the explicit metric map of SLAM-based methods, inferring the approximate position of the target object based on observed objects. Most of these methods use GNNs to learn object-object proximity relationships but ignore the relationship between regions/rooms, which limits their task scenarios to a single room (using the AI2-THOR dataset [11] for training in simulation). For example, Yang et al. [5] propose to use a Graph Convolutional Network (GCN) to incorporate prior knowledge about object relationships into a Deep Reinforcement Learning (DRL) framework as part of a joint embedding. Their priors are obtained from large-scale scene understanding datasets and updated according to the current observation.

For navigation tasks in houses with multiple rooms, it is necessary to first reach the room that may contain the target object (e.g., refrigerator-kitchen), then search for the target among object cliques. Therefore, the learning of prior knowledge should consider more relationships, including room-to-room connection and object-in-room membership. Wu et al. [12] propose a memory structure based on a Bayesian graph model. It uses a probabilistic relation graph to obtain a prior over house layouts from the training set and estimates its posterior in the test set. However, this work does not combine object-level reasoning to achieve a complete TDN task. Chaplot et al. [13] build a topological representation with associated semantic features and learn a prior semantic score function to evaluate the probability of potential nodes in a graph with various directions. However, they require target images, which is impractical in domestic scenarios, while our method only uses target labels. They subsequently extended the Active Neural SLAM system [2] to learn semantic priors with a semantically-aware long-term policy for the label-target navigation task [14] and won the CVPR 2020 Habitat ObjectNav Challenge (https://aihabitat.org/challenge/2020/) [15]. It is worth mentioning that they also point out that end-to-end learning-based methods suffer from large sample complexity and poor generalization, as they memorize object locations and appearance in training environments [14], which prompts us to consider a hierarchical framework with a topological graph. Table I lists only TDN methods with label targets and relational reasoning.

TABLE I: Performance of existing TDN methods with various experiment settings

Method             Room Scale   Dataset        SR(%)   SPL(%)
Scene-prior [5]    Single       AI2-THOR       35.4    10.9
SAVN [16]          Single       AI2-THOR       35.7    9.3
MJOLNIR [7]        Single       AI2-THOR       65.3    21.1
BRM [12]           Multiple     House3D        -       -
SemExp [14]†       Multiple     Matterport3D   36.0    14.4

† SemExp won first place in the CVPR 2020 Habitat ObjectNav Challenge.



Figure 2: Combinatorial relation extraction module. (a) Object embeddings are obtained via unsupervised weighted-GraphSAGE; (b) region embeddings are obtained by passing a sub-graph with object embeddings through GCN layers; (c) according to the house structure of region connectivity, a GraphRNN-based model is used to learn the structure distribution and generate possible features of future regions node by node.

III ReVoLT Reasoning & Planning with a Hierarchical Framework

This task needs to be re-examined from the perspective of bionics. Imagine a human facing such a task upon entering an unknown house. He will not feel confused, thanks to the prior knowledge about domestic scenes he has. It is natural to first roughly determine the type of the current room based on the categories of multiple observed objects (e.g., a bedroom). According to the object-in-room membership, the exploration value $\mathbf{V}(t|cur\_room)$ of the target object $t$ in the current room can be obtained. At the same time, some potential but unexplored passages (e.g., a door or hallway) can be identified as ghost nodes, as in [13]. The structural relationship of the house layout and room connections can help us predict the categories and values $\mathbf{V}(t|next\_room)$ of the next rooms connected by ghost nodes.

Beyond these priors, dynamic decisions should also be made within a specific task, rather than just applying experience mechanically. A reasoning procedure that combines intelligent exploration and exploitation is one of the winning strategies. Thus, our approach focuses on solving the following two problems:

  • How to obtain a more effective prior conditional exploration value in a structured form?

  • How to make efficient decisions between multiple feasible paths based on exploration values?

The remainder of this section is organized as follows. In subsections III-A, III-B, and III-C, we present a combinatorial relation extraction module (Fig. 2) using GNNs, which learns three different relationships in a unified paradigm. A UCT-based online reasoner is described in subsection III-D. In subsection III-E, we consider coarse spatial information and build an intermediate-level planner through online Voronoi construction. Finally, the whole ReVoLT hierarchical framework is summarized in subsection III-F (Fig. 1).

III-A Object Embedding learning

As illustrated in Fig. 2 (a), the object-to-object relationship consists not only of pair-wise semantic similarity, but also of distances and the number of hops between object pairs. We first extract an object-level graph $\mathcal{G}_{o}(\mathcal{V}_{o},\mathcal{E}_{o})$ from object positions $pos$ and categories $\mathcal{C}_{o}$ in the Matterport3D dataset. Objects in the same room are fully connected. As for object pairs in different rooms, only those closest to a common door have a connecting edge. This helps the robot infer objects that are strongly related to the target using only object-level embeddings.
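
To make this construction concrete, the following sketch builds such an object-level graph from already-parsed dataset annotations. The record layout, the helper name `build_object_graph`, and the inverse-distance edge weight are illustrative assumptions, not the authors' actual preprocessing code.

```python
# Minimal sketch of the object-level graph G_o described above.
# Assumes objects come as {id, category, pos, room} records and that a door
# position is known for every pair of connected rooms; the inverse-distance
# edge weight is an assumption made for illustration.
import itertools
import numpy as np
import networkx as nx

def build_object_graph(objects, doors):
    """objects: list of dicts {id, category, pos (np.ndarray), room};
    doors: dict {(room_a, room_b): door_pos} for connected room pairs."""
    g = nx.Graph()
    for o in objects:
        g.add_node(o["id"], category=o["category"], pos=o["pos"], room=o["room"])
    # Objects in the same room are fully connected; weight decays with distance.
    for a, b in itertools.combinations(objects, 2):
        if a["room"] == b["room"]:
            d = np.linalg.norm(a["pos"] - b["pos"])
            g.add_edge(a["id"], b["id"], weight=1.0 / (1.0 + d))
    # Across rooms, connect only the object pair closest to the common door.
    for (room_a, room_b), door_pos in doors.items():
        in_a = [o for o in objects if o["room"] == room_a]
        in_b = [o for o in objects if o["room"] == room_b]
        if not in_a or not in_b:
            continue
        na = min(in_a, key=lambda o: np.linalg.norm(o["pos"] - door_pos))
        nb = min(in_b, key=lambda o: np.linalg.norm(o["pos"] - door_pos))
        d = np.linalg.norm(na["pos"] - nb["pos"])
        g.add_edge(na["id"], nb["id"], weight=1.0 / (1.0 + d))
    return g
```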

GraphSAGE [8] is a popular model in the node embedding field. We adopt it to obtain an embedding of each object category that fuses semantics and proximity relationships with other categories. Our node embedding procedure uses GloVe [17] as the initial node semantic features $\{\mathbf{x}_{v},\forall v\in\mathcal{V}_{o}\}$ and employs an unsupervised form of GraphSAGE with a loss that penalizes the embedding similarity between two objects far apart and rewards that of two adjacent ones. Different from the original GraphSAGE, edge features $\{\omega_{e:u\rightarrow v},\forall e\in\mathcal{E}_{o}\}$ are also taken into account in the aggregation and loss calculations. For each search depth $k$ with weight matrices $\mathbf{W}^{k},\forall k\in\{1,\dots,K\}$, we employ an edge-weighted mean aggregator which simply takes the element-wise mean of the vectors in $\{\mathbf{h}^{k-1}_{u},\forall u\in\mathcal{N}(v)\}$ to aggregate information from node neighbors:

$$\mathbf{h}_{v}^{0}\leftarrow \mathbf{x}_{v},\ \forall v\in\mathcal{V}_{o}; \qquad \mathbf{h}_{v}^{k}\leftarrow \sigma\left(\mathbf{W}^{k}\cdot\text{mean}\left(\{\mathbf{h}_{v}^{k-1}\}\cup\{\omega_{u\to v}\cdot\mathbf{h}_{u}^{k-1},\ \forall u\in\mathcal{N}(v)\}\right)\right) \qquad (1)$$

Then an edge-weighted loss function is applied to the outputs $\{\mathbf{z}_{v},\forall v\in\mathcal{V}_{o}\}$ to tune the weight matrices $\mathbf{W}^{k}$:

$$\mathcal{L}_{\mathcal{G}_{o}}\left(\mathbf{z}_{v}\right)= -\log\left(\sigma\left(\omega_{u\to v}\mathbf{z}_{v}^{\top}\mathbf{z}_{u}\right)\right) - Q\cdot\mathbb{E}_{u_{n}\sim P_{n}(v)}\log\left(\sigma\left(-\omega_{u\to v}\mathbf{z}_{v}^{\top}\mathbf{z}_{u_{n}}\right)\right) \qquad (2)$$

where $P_{n}$ is a negative sampling distribution, $Q$ defines the number of negative samples, and $\sigma$ is the sigmoid function.

Since object embeddings of the same category $\{\mathbf{z}_{c},\forall c\in\mathcal{C}_{o}\}$ should have a consistent representation, another mean aggregation is performed over embeddings of the same category between the final GraphSAGE aggregation and the loss function. This overwrites the original value with the final embedding of each category: $\mathbf{z}_{c}\leftarrow\text{mean}(\mathbf{h}_{v}^{K})$ if $\mathcal{C}_{o}(v)=c$.
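
The aggregation in Eq. (1) and the loss in Eq. (2) can be sketched in a few lines of PyTorch. This is only an illustrative implementation under assumed tensor layouts (per-node neighbor lists and edge-weight lists); the paper's actual training code and choice of nonlinearity may differ.

```python
# Sketch of the edge-weighted mean aggregator (Eq. 1) and the unsupervised
# edge-weighted loss (Eq. 2); shapes and helper names are assumptions.
import torch
import torch.nn.functional as F

def aggregate(h, neighbors, edge_w, W_k):
    """h: [N, D] node features at depth k-1; neighbors[v]: LongTensor of v's
    neighbor indices; edge_w[v]: weights aligned with neighbors[v]; W_k: [D, D]."""
    out = []
    for v in range(h.size(0)):
        weighted = edge_w[v].unsqueeze(1) * h[neighbors[v]]               # omega_{u->v} * h_u
        mixed = torch.cat([h[v].unsqueeze(0), weighted], dim=0).mean(0)   # mean({h_v} U {...})
        out.append(torch.sigmoid(W_k @ mixed))                            # sigma(W^k . mean)
    return torch.stack(out)

def unsup_loss(z_v, z_u, w_uv, z_neg, Q=5):
    """Eq. 2: pull weighted-adjacent embeddings together, push Q negatives apart."""
    pos = -F.logsigmoid(w_uv * (z_v * z_u).sum())
    neg = -F.logsigmoid(-w_uv * (z_neg @ z_v)).mean()   # expectation approximated by the mean
    return pos + Q * neg
```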

III-B Region Embedding learning

Apart from the pairwise relationships between objects, the many-to-one relationship between an object and a room or region is also indispensable for inferring the possibility that the target object exists in a certain room or among multiple observed objects. Besides, to evaluate similarity, relationships at different levels should share a unified paradigm yielding representations with consistent metrics. Therefore, for region-level sub-graphs, we opt for the same embedding representation procedure. This part is shown in Fig. 2 (b).

Region embedding is carried out in a self-supervised form. We take the sub-graph $\mathcal{G}_{r}(\mathcal{V}_{r},\mathcal{E}_{r})$ as input, with the embeddings of objects in the same region $\{\mathbf{z}_{c},\forall c\in\mathcal{C}_{o}\}$ as nodes and weighted spatial distances as edges. The batch composed of these sub-graphs is passed into a GCN [18], and the corresponding region embeddings $\{\mathbf{r}_{v},\forall v\in\mathcal{V}_{r}\}$ are obtained. As in the previous procedure, a mean aggregation is performed over region embeddings with the same label to obtain a uniform vector representation $\{\mathbf{r}_{l},\forall l\in\mathcal{L}_{r}\}$. Since there is no need for multi-hop aggregation at the region level, a simple GCN layer is applied rather than GraphSAGE.

To enable membership calculation between a region embedding $\mathbf{r}_{l}$ and an object embedding $\mathbf{z}_{c}$, and to distinguish regions with different labels, we use a combined loss comprising two parts: a classification loss on the embedding label and a membership loss for object-in-region relations:

$$\mathcal{L}_{\mathcal{G}_{r}}\left(\mathbf{r}_{v}\right)= -\log\left(\sigma\left(\mathbf{r}_{v}^{\top}\mathbf{z}_{u}\right)\right) - Q\cdot\mathbb{E}_{u_{n}\sim P_{n}(v)}\log\left(\sigma\left(-\mathbf{r}_{v}^{\top}\mathbf{z}_{u_{n}}\right)\right) - \dfrac{1}{n}\sum_{i=1}^{n}l_{v}\log(\hat{l}(\mathbf{r}_{v})) \qquad (3)$$

where $P_{n}(v)$ represents objects not in region $v$, and $\hat{l}(\cdot)$ is a multi-layer perceptron (MLP) network.
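
A possible realization of the combined loss in Eq. (3) is sketched below, with a membership term against in-region and negatively sampled objects and a label classification term from the MLP head $\hat{l}(\cdot)$. Layer sizes are assumptions; the GCN encoder that produces $\mathbf{r}_{v}$ is omitted.

```python
# Sketch of the combined region-embedding loss (Eq. 3); dimensions are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionLoss(nn.Module):
    def __init__(self, emb_dim=16, n_labels=27, Q=5):
        super().__init__()
        # The MLP classifier l_hat(.) that predicts the region label from r_v.
        self.l_hat = nn.Sequential(nn.Linear(emb_dim, 64), nn.ReLU(),
                                   nn.Linear(64, n_labels))
        self.Q = Q

    def forward(self, r_v, z_in, z_neg, label):
        """r_v: [D] region embedding; z_in: [M, D] embeddings of objects inside
        the region; z_neg: [Q, D] sampled outside objects; label: LongTensor scalar."""
        member = -F.logsigmoid(z_in @ r_v).mean()           # objects inside: high similarity
        non_member = -F.logsigmoid(-(z_neg @ r_v)).mean()   # objects outside: low similarity
        cls = F.cross_entropy(self.l_hat(r_v).unsqueeze(0), label.view(1))
        return member + self.Q * non_member + cls
```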

III-C Region Rollout learning

As the third and most important part of relation extraction, structural relationship reasoning plays a crucial role in identifying the correct navigation direction and shortening the exploration period. To achieve this, the joint probability $p(\mathcal{G}_{h})$ of houses needs to be learned so that a probable house layout memory $\mathcal{G}_{h}\sim p(\mathcal{G}_{h}|\mathcal{G}_{sub})$ can be conceived conditioned on the observed regions $\mathcal{G}_{sub}$. However, its sample space cannot be easily characterized. Thus, house graphs are modeled as sequences following the idea of GraphRNN [19], with some concepts redefined to make it more suitable for conditional reasoning with embeddings. This part is shown in Fig. 2 (c).

$$S^{\pi}=f_{s}(\mathcal{G}_{h},\pi)=\left(A_{1}^{\pi},\ldots,A_{n}^{\pi}\right) \qquad (4)$$

where $\pi$ represents the node order, and each element $A_{i}^{\pi}\in\{0,1\}^{(i-1)\times(i-1)}, i\in\{1,\ldots,n\}$, is an adjacency matrix describing the edges between node $\pi(v_{i})$ and the previous nodes $\pi(v_{j}), j\in\{1,\dots,i-1\}$, already in the graph.

Since each $A_{i}^{\pi}$ has variable dimensions, we first pad them up to the maximum dimension $n$ and then repeat the 2D matrix 16 times to form a 3D matrix of dimension $n\times n\times 16$ as an edge mask, where 16 is the embedding length. Therefore, a featured graph can be expressed as the element-wise product of the region embedding matrix $X^{\pi}$ under the corresponding order and the sequence matrix $\{S^{\pi}\}^{3D}$:

$$p(\mathcal{G})=\prod_{i=1}^{n+1}p\left(\mathbf{x}_{i}^{\pi}\mid(\{S_{1}^{\pi}\}^{3D},\ldots,\{S_{i-1}^{\pi}\}^{3D})\odot X_{i-1}^{\pi}\right) \qquad (5)$$

where $X_{i-1}^{\pi}$ is the embedding matrix of dimension $(i-1)\times(i-1)\times 16$ containing the region embeddings before region $\pi(v_{i})$, and $\mathbf{x}_{i}^{\pi}$ is the embedding of $\pi(v_{i})$.

Passing $\{S^{\pi}\}^{3D}\odot X^{\pi}$ as a sequence into a GRU or LSTM, we can learn the structural distribution of houses. This allows us to predict the next region embedding and label conditioned on the observed subgraph. The loss function of the Region Rollout network is the cross-entropy between the label of the generated embedding and the real label:

$$\mathcal{L}_{\mathcal{G}_{h}}(\mathbf{x}^{\pi}_{i})=-\dfrac{1}{n}\sum_{i=1}^{n}l_{i}\,\text{log-softmax}\left[(\mathbf{x}^{\pi}_{i})^{\top}\mathbf{r}_{j}\right],\ \forall j\in\mathcal{L}_{r} \qquad (6)$$
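
The rollout network itself can be sketched as a GRU over the masked embedding sequence of Eq. (5), scored with the label cross-entropy of Eq. (6). Tensor dimensions, the hidden size, and the padding scheme are assumptions made for illustration.

```python
# Sketch of the c-GraphRNN region rollout (Eqs. 4-6); sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionRollout(nn.Module):
    def __init__(self, emb_dim=16, max_nodes=12, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(max_nodes * emb_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, emb_dim)

    def forward(self, adj_seq, emb_seq):
        """adj_seq: [B, T, max_nodes] 0/1 connectivity of each new region to the
        previous ones (padded); emb_seq: [B, T, max_nodes, emb_dim] embeddings of
        previous regions (padded). Returns predicted next-region embeddings [B, T, emb_dim]."""
        masked = adj_seq.unsqueeze(-1) * emb_seq   # edge mask {S^pi}^3D applied to X^pi
        x = masked.flatten(2)                      # flatten to a per-step input vector
        h, _ = self.rnn(x)
        return self.head(h)

def rollout_loss(pred, prototypes, labels):
    """pred: [N, emb_dim] generated embeddings; prototypes: [L, emb_dim] label
    embeddings r_j; labels: [N] ground-truth region label ids."""
    logits = pred @ prototypes.t()                 # similarity to each label prototype
    return F.cross_entropy(logits, labels)         # Eq. 6 as a cross-entropy
```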

In conclusion, with the combination of the unsupervised edge-weighted GraphSAGE object embedding learning (III-A), the self-supervised GCN region embedding learning (III-B), and the c-GraphRNN conditional region rollout (III-C), we can now extract multiform structural relationships. Meanwhile, embedding is used as a unified paradigm for representation, and the similarity between the embeddings of objects or regions (either observed or predicted) and the target object embedding is used as a prior to guide exploration in an unknown domain.

III-D Reasoning and Exploring as a Bandit Problem


Figure 3: In a specific task, a multi-layer topological graph is constructed based on the visual front-end, and a tree with the birthplace as the root node is abstracted from the graph. A clique refers to a collection of adjacent objects or a bunch of non-semantic obstacles, and a vertex refers to an observed navigable location. Each gray ghost node connects two vertices and only stores their relative position to assist localization; it is not used as a navigation sub-goal. Black ghost nodes refer to unknown areas and promote exploration.

Priors alone cannot lead to success. Inspired by [13], a posterior topological representation is also constructed in each specific task to combine experience with practice. Specifically, we build a multi-layer posterior topological graph covering the object, clique, and vertex levels. Cliques divide rooms into small clustered regions and reduce the burden of the visual front-end. Each vertex governs the three nearest cliques. The Object Embedding network provides the object node features, and the Region Embedding network generates the features of both cliques and vertices from their attached objects. The Region Rollout network evaluates ghost nodes. However, in reality there are always situations contrary to experience; in other words, robots must be able to balance exploration and exploitation online.
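
As an illustration only, the multi-layer graph can be organized with node records along the following lines; field names mirror the description above and are not the authors' exact implementation.

```python
# Illustrative node records for the multi-layer topological graph; names are assumptions.
from dataclasses import dataclass, field
from typing import List, Optional
import numpy as np

@dataclass
class ObjectNode:
    category: str
    embedding: np.ndarray              # from the Object Embedding network

@dataclass
class CliqueNode:
    objects: List[ObjectNode]          # adjacent objects / non-semantic obstacle cluster
    embedding: np.ndarray              # from the Region Embedding network
    visits: int = 0

@dataclass
class GhostNode:
    relative_pose: np.ndarray          # gray ghost: relative pose between two vertices
    unexplored: bool = False           # black ghost: unknown area promoting exploration

@dataclass
class VertexNode:
    position: np.ndarray               # observed navigable location (relative coordinates)
    cliques: List[CliqueNode] = field(default_factory=list)  # the three nearest cliques
    ghosts: List[GhostNode] = field(default_factory=list)
    room_label: Optional[str] = None   # inferred from the attached cliques and objects
    visits: int = 0
```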

We adopt the Upper Confidence Bound for Trees (UCT) method [20] to set an online bonus. The simulation procedure of UCT is supported by the Region Rollout network; thus, the robot not only obtains a bonus from the visit count, but also estimates the future exploration value (inductive bias) $\omega_{i}$ of the selected path. This effectively prevents the robot from being trapped in a useless area. The combined effect of the inductive bias $\omega$ and the bonus discourages repetitive searching near negative (non-success) sub-goals and drives the robot to return to parent nodes for back-tracking, which we term Revolt Reasoning. The word Revolt summarizes the characteristics of our method vividly: it allows robots to regret at nodes with low exploration value, discarding them and returning to previous paths. To avoid the robot wandering between two goals, a navigation loss term $\mathcal{L}_{dis}$ is introduced to penalize node distances. Hence, we can finally obtain the exploration value $\mathbf{V}$ of node $i$ as:

$$\mathbf{V}(t|i)=\dfrac{\Sigma^{m}_{i\rightarrow j}\,\omega_{j}}{m}+c_{1}\sqrt{\dfrac{\ln N_{i}}{n_{i}}}-c_{2}\mathcal{L}_{dis} \qquad (7)$$

where the factors $c_{1}$ and $c_{2}$ are set to 1 and 0.5, respectively. $j$ refers to one of node $i$'s descendants in the tree, and $m$ is their total number. $N_{i}$ is the total number of arrivals at node $i$ and its descendants, while $n_{i}$ only counts arrivals at node $i$.
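
For clarity, Eq. (7) can be computed as below; the per-node bookkeeping interface is an assumption, and the small guards against empty or zero counts are added only to keep the sketch runnable.

```python
# Sketch of the node exploration value V(t|i) in Eq. 7.
import math

def exploration_value(omega_descendants, N_i, n_i, dist_loss, c1=1.0, c2=0.5):
    """omega_descendants: inductive-bias values omega_j of node i's descendants;
    N_i: arrivals at node i and its descendants; n_i: arrivals at node i itself;
    dist_loss: the navigation distance penalty L_dis."""
    m = max(len(omega_descendants), 1)
    exploit = sum(omega_descendants) / m                          # prior-driven term
    bonus = c1 * math.sqrt(math.log(max(N_i, 1)) / max(n_i, 1))   # UCT visit bonus
    return exploit + bonus - c2 * dist_loss

# The reasoner then selects the child with the highest value, e.g.:
# best = max(children, key=lambda c: exploration_value(c.omegas, c.N, c.n, c.dist))
```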

III-E Online constructed Voronoi local graph


Figure 4: Combining depth information with the robot's pose over a short period yields a simple 3D reconstruction. A Voronoi local graph can then be constructed through DBSCAN clustering after projecting the 3D map onto a 2D obstacle scatter plot.

The reasoner only gives a semantic node id in the graph as a sub-goal. If the low-level controller used it directly as a navigation goal, this would inevitably lead to over-coupling and make navigation success harder. Consider the hierarchical human central nervous system composed of the brain, cerebellum, brain-stem, and spinal cord [21]: if the high-level reasoner is compared to the brain, then the skeletal muscle is the low-level motor controller. The brain does not directly transmit motion instructions to the skeletal muscles; instead, instructions pass through the brain-stem, spinal cord, and other lower-level centers for conversion [22]. Moreover, the brain does not support high-speed, low-latency information exchange while controlling a motion [23]. Therefore, it is necessary to use an RGB-D camera and an odometer to construct a local Voronoi graph, offering approximate relative coordinates of the sub-goal within a reachable range as input to the low-level controller. The Voronoi graph records the relationship between the robot and obstacles and provides an available path. Since the TDN task is map-less, we construct the local Voronoi graph online within a fixed number of steps.

Conditioned on the depth information, the camera parameters (intrinsic and extrinsic), and the odometer readings, obstacles in depth images can easily be converted into coordinates in a world coordinate system derived from the robot's birth pose. Projecting this partially reconstructed 3D map onto a 2D plane along the vertical axis produces a scatter diagram of obstacles. We can then construct a Voronoi diagram online by segmenting navigable paths and explorable cliques with multiple related objects. Different from traditional methods [24], we first use DBSCAN [25, 26], a density-based clustering algorithm, to cluster the scattered points of adjacent obstacles into convex hulls and filter out noise points, and then construct a Delaunay triangulation from the centers of the scattered points in each convex hull, thereby generating the Voronoi diagram.
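
A minimal version of this pipeline, assuming the obstacle scatter is already expressed in the robot's world frame, could look as follows; the DBSCAN parameters and the minimum-point guard are illustrative, and scipy's Voronoi (the dual of the Delaunay triangulation) is used in place of an explicit Delaunay step.

```python
# Sketch of the online Voronoi local graph: cluster obstacle points with DBSCAN,
# drop noise, and build a Voronoi diagram over the cluster centers so that its
# finite edges serve as candidate navigable paths between obstacle cliques.
import numpy as np
from sklearn.cluster import DBSCAN
from scipy.spatial import Voronoi

def voronoi_local_graph(obstacle_points, eps=0.3, min_samples=8):
    """obstacle_points: [N, 2] obstacle scatter in the robot's world frame."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(obstacle_points)
    centers = np.array([obstacle_points[labels == c].mean(axis=0)
                        for c in sorted(set(labels)) if c != -1])   # one clique per cluster
    if len(centers) < 4:                       # too few clusters for a useful diagram
        return centers, np.empty((0, 2)), []
    vor = Voronoi(centers)
    vertices = vor.vertices                    # candidate navigable vertex positions
    edges = [r for r in vor.ridge_vertices if -1 not in r]  # finite Voronoi edges (paths)
    return centers, vertices, edges
```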

III-F Hierarchical reasoning and planning for navigation


Figure 5: The semantic sub-goal is converted into relative coordinates by the Voronoi-based intermediate-level planner.

In this section, we summarize how the proposed reasoner and planner cooperate to complete navigation tasks. The curves in Fig. 5 show the correspondence of concepts between the topological graph in the reasoner and the Voronoi diagram in the planner. An aggregation of obstacles is regarded as a clique, each of which attaches and records all objects in its convex hull and evaluates its inductive bias value according to the object-in-room membership via the Region Embedding network. The position of a vertex is generated by the Voronoi diagram. The multiple cliques and their subordinate objects surrounding a vertex jointly determine its general room label, which is used for the inductive bias evaluation. Relative directions and distances between two adjacent vertex nodes are stored in gray ghost nodes. Since the robot exploits relative coordinates and directions, it effectively avoids the influence of odometer and depth camera errors and is thus insensitive to cumulative error. Besides, thanks to the Voronoi local diagram, only short-period scatter data need to be saved, and there is no need to consider the loop-closure matching problem as in SLAM.

With the construction of the Voronoi diagram and its transformation into a hierarchical topology, we can conduct reasoning at the vertex/clique level and the object level, searching for the best vertex position and the most likely clique based on the exploration value. After selecting a clique, the robot navigates towards it and explores it more explicitly via object-level reasoning. Besides, the Voronoi diagram provides evidence for choosing the next best view of a clique. By changing perspectives, the robot can find the target object in a clique more efficiently.

IV Experiments

IV-A Experiment Setup

We use the Habitat simulator [27] with the Matterport3D [28] environment as our experiment platform. Habitat is a 3D simulator with configurable agents, multiple sensors, and generic 3D dataset handling. The Matterport3D dataset contains 90 houses with 40 categories of objects and 31 region labels. It also provides detailed object and region segmentation information. Here we focus on the 21 categories of target objects required by the task: chair, table, picture, cabinet, cushion, sofa, bed, chest of drawers, plant, sink, toilet, stool, towel, tv monitor, shower, bathtub, counter, fireplace, gym equipment, seating, and clothes, and we ignore some meaningless room labels, such as outdoor, no label, other room, and empty room. We use YOLOv4 [29] as our object detection module, fine-tuned on objects in the Matterport3D dataset. Because the aim of the low-level controller is the same as that of the PointNav task [30], we adapt a pre-trained state-of-the-art PointNav method, occupancy anticipation [31], as our controller.

During a specific TDN task, the robot is spawned at a random location in a certain house and is required to find an object of a given category as quickly as possible. The task is evaluated with three commonly used indicators: Success Rate (SR), Success weighted by Path Length (SPL), and Distance to Success (DTS). SR represents the fraction of episodes in which the target was found and is defined as $\frac{1}{N}\sum^{N}_{i=1}Su_{i}$, where $N$ is the total number of episodes and $Su_{i}$ is a binary value representing the success or failure of the $i$-th episode. SPL accounts for both success and the optimal path length and is defined as $\frac{1}{N}\sum_{i=1}^{N}S_{i}\frac{L_{i}}{\max\left(P_{i},L_{i}\right)}$, where $L_{i}$ is the shortest path length provided by the simulator and $P_{i}$ is the path length of the robot in episode $i$. DTS is the distance of the agent from the success threshold boundary when the episode ends. The boundary is set to $1\,m$ and the maximum episode length is 500 steps, the same as in [14].
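
For reference, SR and SPL can be computed directly from per-episode logs as in the short sketch below; the input lists are assumptions about how the episode statistics are stored.

```python
# Sketch of the SR and SPL metrics defined above.
def success_rate(successes):
    """successes: list of 0/1 per-episode success flags Su_i."""
    return sum(successes) / len(successes)

def spl(successes, shortest, executed):
    """shortest: optimal lengths L_i from the simulator; executed: robot path lengths P_i."""
    return sum(s * l / max(p, l)
               for s, l, p in zip(successes, shortest, executed)) / len(successes)

# Example: spl([1, 0, 1], [5.0, 7.0, 4.0], [6.5, 30.0, 4.0]) ≈ 0.59
```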

Furthermore, our navigation task has two modes: independent (ReVoLT-i) and continuous (ReVoLT-c). Independent mode is the traditional one: the environment is reset after each episode and the robot clears its memory. The continuous mode allows the robot to keep the topological graph if it resets in the same house; it is used for evaluating the robot's capability of keeping and updating the environment memory.

IV-B Baselines

Random: At each step, the agent randomly samples an action from the action space with a uniform distribution.

RGBD + DD-PPO: This baseline is provided by the ObjectNav Challenge 2020 [27]. RGB-D observations are passed directly to an end-to-end DD-PPO policy, which outputs an action.

Active Neural SLAM: This baseline uses an exploration policy trained to maximize coverage [2], followed by a heuristic-based local policy.

SemExp: Since [14] has not open-sourced their code, we directly use the results reported in their paper as the state-of-the-art reference.

IV-C Results

IV-C1 Results of combinatorial relation embeddings

The Object Embedding network achieves a classification accuracy of 91%. The Region Embedding network achieves a membership accuracy of 78% and a classification accuracy of 75%. The Region Rollout network reaches a prediction accuracy of 45% on the test set, which is acceptable since room-to-room relationships are inherently weak.

IV-C2 Results of the whole TDN task


Figure 6: Top-down maps of four successful tasks using ReVoLT-i. The blue squares are the starting positions, the blue curves are the robot trajectories, and arrows represent the robot's current positions. Targets are highlighted with green boxes, and pink areas denote the success threshold boundary. The trajectory color is a gradient from dark to light; a brighter end indicates a longer path.
TABLE II: Performance Comparison

Method                SR(%)   SPL     DTS (m)
Random                0       0       10.3298
RGBD + DD-PPO         6.2     0.021   9.3162
Active Neural SLAM    32.1    0.119   7.056
SemExp¹               36.0    0.144   6.733
ReVoLT-i small*       66.7    0.265   0.9762
ReVoLT-i*             62.5    0.102   1.0511
ReVoLT-c*             85.7    0.070   0.0253

¹ Winner of the AI Habitat 2020 challenge.
* These three rows refer to the small mode with only 6 target categories as in SemExp, the independent mode (-i), and the continuous mode (-c) of ReVoLT.

The results of the baseline methods and ReVoLT are shown in Table II. Both of our models significantly outperform the current state-of-the-art. ReVoLT-i small achieves an approximately 80% increase in SR and nearly twice the SPL of SemExp. This confirms our hypothesis that separating prior learning and the control policy in a hierarchical framework is a wiser approach than directly learning a semantically-aware policy. Besides, the standard ReVoLT-i with 19 target categories still achieves a higher SR and SPL. By applying the continuous mode, the robot retains a memory of the same house, which allows it to find previously observed targets with a higher SR.

V Ablation Study


Figure 7: Ablation experiments on each of the three parts of the exploration value function, illustrated with top-down maps.

The success of ReVoLT is attributed to the relationship priors provided by the combinatorial GNNs, the online bonus from UCT, and the distance penalty. Therefore, we set up three extra experiments with the same Voronoi-based planner and low-level controller to reveal their respective impacts. Moreover, the results of the continuous mode are also presented below. The performance of all variants is listed in Table III.

TABLE III: Performance of Ablation Experiments

Method                  SR(%)   SPL     DTS (m)
ReVoLT-i                62.5    0.102   1.0511
ReVoLT-c                85.7    0.070   0.0253
ReVoLT w/o priors       25.0    0.003   1.4129
ReVoLT w/o bonus        60.0    0.034   0.8139
ReVoLT w/o distance     54.5    0.030   1.2689

ReVoLT w/o relationship priors. Without priors, sub-goals are generated according to the distance of the observed cliques. Comparing Fig. 7 (a) with Fig. 6, we find that the lack of semantic relationships profoundly affects the robot's path decisions, making it uninterested in the region containing the target even when it is nearby. Besides, the lack of region classification and region rollout makes the robot unable to use the observed semantic information to reason about relationships, resulting in longer paths.
ReVoLT w/o UCT bonus. The bonus is replaced with a fixed threshold: if the robot reaches the same clique or vertex node more than twice, that node will no longer be selected as a sub-goal. The corresponding top-down maps are illustrated in Fig. 7 (b). Without the UCT bonus, the robot gets stuck in an unpromising local region until the threshold is reached.
ReVoLT w/o distance penalty. In Fig. 7 (c), using only priors and bonuses can also complete tasks, but the paths are longer due to vacillating decisions.
ReVoLT with continuous mode. The left figure of Fig. 7 (d) is the same as the one in Fig. 6. When searching for the second target in this house, once the robot associates current observations with its memory, it can find the target with a higher success rate. However, this also makes the robot more focused on exploitation and reduces exploration, which may cause it to ignore closer targets and lead to a lower SPL.

To sum up, relationship priors are essential for robots to understand environment semantics and are the major factor affecting SR. The UCT bonus and distance penalty contribute to the improvement of SPL. ReVoLT-c maintains a long-term scene memory and achieves outstanding performance.

VI Conclusion

We propose ReVoLT, a hierarchical reasoning target-driven navigation framework that combines combinatorial graph relation extraction and online UCT decision-making operating on a multi-layer topological graph. ReVoLT shows better performance in exploiting prior relationships, and its bandit-based reasoning is more reasonable and efficient. To bridge the gap between existing point-goal controllers and our reasoner, we adopt the Voronoi local graph for the semantic-spatial transition. However, some significant challenges remain in this field. Our future directions lie in using representation learning techniques to introduce richer object information such as shape, color, and size, using scene graph detection to introduce richer semantic relation information such as furniture layout, and tackling richer tasks such as object instance navigation.

References

  • [1] M. Hoffmann and R. Pfeifer, “The implications of embodiment for behavior and cognition: animal and robotic case studies,” arXiv preprint arXiv:1202.0440, 2012.
  • [2] D. S. Chaplot, D. Gandhi, S. Gupta, A. Gupta, and R. Salakhutdinov, “Learning to explore using active neural slam,” in International Conference on Learning Representations, 2019.
  • [3] K. Chatzilygeroudis, V. Vassiliades, F. Stulp, S. Calinon, and J.-B. Mouret, “A survey on policy search algorithms for learning robot controllers in a handful of trials,” IEEE Transactions on Robotics, vol. 36, no. 2, pp. 328–347, 2019.
  • [4] J. Liu, J. Shou, Z. Fu, H. Zhou, R. Xie, J. Zhang, J. Fei, and Y. Zhao, “Efficient reinforcement learning control for continuum robots based on inexplicit prior knowledge,” arXiv preprint arXiv:2002.11573, 2020.
  • [5] W. Yang, X. Wang, A. Farhadi, A. Gupta, and R. Mottaghi, “Visual semantic navigation using scene priors,” arXiv preprint arXiv:1810.06543, 2018.
  • [6] H. Du, X. Yu, and L. Zheng, “Learning object relation graph and tentative policy for visual navigation,” in European Conference on Computer Vision, pp. 19–34, Springer, 2020.
  • [7] Y. Qiu, A. Pal, and H. I. Christensen, “Learning hierarchical relationships for object-goal navigation,” 2020.
  • [8] W. L. Hamilton, R. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” in Advances in Neural Information Processing Systems (NeurIPS), 2017.
  • [9] J. Liu, Y. Chen, Z. Dong, S. Wang, S. Calinon, M. Li, and F. Chen, “Robot cooking with stir-fry: Bimanual non-prehensile manipulation of semi-fluid objects,” IEEE Robotics and Automation Letters, vol. 7, no. 2, pp. 5159–5166, 2022.
  • [10] J. Liu, H. Zhang, Z. Fu, and Y. Wang, “Learning scalable multi-agent coordination by spatial differentiation for traffic signal control,” Engineering Applications of Artificial Intelligence, vol. 100, p. 104165, 2021.
  • [11] E. Kolve, R. Mottaghi, W. Han, E. VanderBilt, L. Weihs, A. Herrasti, D. Gordon, Y. Zhu, A. Gupta, and A. Farhadi, “Ai2-thor: An interactive 3d environment for visual ai,” arXiv preprint arXiv:1712.05474, 2017.
  • [12] Y. Wu, Y. Wu, A. Tamar, S. Russell, G. Gkioxari, and Y. Tian, “Bayesian relational memory for semantic visual navigation,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 2769–2779, 2019.
  • [13] D. S. Chaplot, R. Salakhutdinov, A. Gupta, and S. Gupta, “Neural topological slam for visual navigation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12875–12884, 2020.
  • [14] D. S. Chaplot, D. P. Gandhi, A. Gupta, and R. R. Salakhutdinov, “Object goal navigation using goal-oriented semantic exploration,” Advances in Neural Information Processing Systems (NeurIPS), vol. 33, 2020.
  • [15] D. Batra, A. Gokaslan, A. Kembhavi, O. Maksymets, R. Mottaghi, M. Savva, A. Toshev, and E. Wijmans, “ObjectNav Revisited: On Evaluation of Embodied Agents Navigating to Objects,” in arXiv:2006.13171, 2020.
  • [16] M. Wortsman, K. Ehsani, M. Rastegari, A. Farhadi, and R. Mottaghi, “Learning to learn how to learn: Self-adaptive visual navigation using meta-learning,” 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6743–6752, 2019.
  • [17] J. Pennington, R. Socher, and C. D. Manning, “Glove: Global vectors for word representation,” in Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543, 2014.
  • [18] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in International Conference on Learning Representations (ICLR), 2017.
  • [19] J. You, R. Ying, X. Ren, W. Hamilton, and J. Leskovec, “Graphrnn: Generating realistic graphs with deep auto-regressive models,” in International Conference on Machine Learning, pp. 5708–5717, 2018.
  • [20] P.-A. Coquelin and R. Munos, “Bandit algorithms for tree search,” in Proceedings of the Twenty-Third Conference on Uncertainty in Artificial Intelligence, pp. 67–74, 2007.
  • [21] D. Purves, R. Cabeza, S. A. Huettel, K. S. LaBar, M. L. Platt, M. G. Woldorff, and E. M. Brannon, Cognitive neuroscience. Sunderland: Sinauer Associates, Inc, 2008.
  • [22] E. Bizzi, M. C. Tresch, P. Saltiel, and A. d’Avella, “New perspectives on spinal motor systems,” Nature Reviews Neuroscience, vol. 1, no. 2, pp. 101–108, 2000.
  • [23] D. A. Rosenbaum, Human motor control. Academic press, 2009.
  • [24] R. Mahkovic and T. Slivnik, “Generalized local voronoi diagram of visible region,” in Proceedings. 1998 IEEE International Conference on Robotics and Automation (Cat. No. 98CH36146), vol. 1, pp. 349–355, IEEE, 1998.
  • [25] K. Khan, S. U. Rehman, K. Aziz, S. Fong, and S. Sarasvady, “Dbscan: Past, present and future,” in The fifth international conference on the applications of digital information and web technologies (ICADIWT 2014), pp. 232–238, IEEE, 2014.
  • [26] E. Schubert, J. Sander, M. Ester, H. P. Kriegel, and X. Xu, “Dbscan revisited, revisited: why and how you should (still) use dbscan,” ACM Transactions on Database Systems (TODS), vol. 42, no. 3, pp. 1–21, 2017.
  • [27] M. Savva, A. Kadian, O. Maksymets, Y. Zhao, E. Wijmans, B. Jain, J. Straub, J. Liu, V. Koltun, J. Malik, D. Parikh, and D. Batra, “Habitat: A platform for embodied AI research,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
  • [28] A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Niessner, M. Savva, S. Song, A. Zeng, and Y. Zhang, “Matterport3D: Learning from RGB-D data in indoor environments,” International Conference on 3D Vision (3DV), 2017.
  • [29] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, “Yolov4: Optimal speed and accuracy of object detection,” arXiv preprint arXiv:2004.10934, 2020.
  • [30] A. Kadian, J. Truong, A. Gokaslan, A. Clegg, E. Wijmans, S. Lee, M. Savva, S. Chernova, and D. Batra, “Sim2real predictivity: Does evaluation in simulation predict real-world performance?,” 2019.
  • [31] S. K. Ramakrishnan, Z. Al-Halah, and K. Grauman, “Occupancy anticipation for efficient exploration and navigation,” in European Conference on Computer Vision, pp. 400–418, Springer, 2020.