
ReVoLT: Relational Reasoning and Voronoi Local Graph Planning
for Target-driven Navigation

Junjia Liu¹,³, Jianfei Guo²,³, Zehui Meng³, Jingtao Xue³
¹Department of Mechanical and Automation Engineering, The Chinese University of Hong Kong
²School of Automation Science and Engineering, Xi’an Jiaotong University
³Application Innovate Laboratory (2012 Laboratories), Huawei Technologies Co., Ltd., Beijing, 100038, China
[email protected], [email protected], {mengzehui, xuejingtao}@huawei.com
Abstract

Embodied AI is an inevitable trend that emphasizes the interaction between intelligent entities and the real world, with broad applications in robotics, especially target-driven navigation. This task requires the robot to find an object of a certain category efficiently in an unknown domestic environment. Recent works focus on exploiting layout relationships with graph neural networks (GNNs). However, most of them obtain robot actions directly from observations in an end-to-end manner via an incomplete relation graph, which is neither interpretable nor reliable. We decouple this task and propose ReVoLT, a hierarchical framework: (a) an object detection visual front-end, (b) a high-level reasoner (infers semantic sub-goals), (c) an intermediate-level planner (computes geometrical positions), and (d) a low-level controller (executes actions). ReVoLT operates with a multi-layer semantic-spatial topological graph. The reasoner uses multiform structured relations as priors, which are obtained from combinatorial relation extraction networks composed of unsupervised GraphSAGE, GCN, and GraphRNN-based Region Rollout. The reasoner runs Upper Confidence Bound for Trees (UCT) to infer semantic sub-goals, accounting for the trade-off between exploitation (depth-first searching) and exploration (regretting). The lightweight intermediate-level planner generates instantaneous spatial sub-goal locations via an online constructed Voronoi local graph. Simulation experiments demonstrate that our framework achieves better performance in target-driven navigation tasks and generalizes well, with an 80% improvement over the existing state-of-the-art method. The code and result video will be released at https://ventusff.github.io/ReVoLT-website/.

Index Terms:
Relational reasoning, combinatorial relation graph neural networks, UCT bandit, online Voronoi local graph

I Introduction

Finding objects efficiently in complex houses is a prerequisite for domestic service robots. Robots need to reason and make dynamic decisions while interacting with the real-world environment. Embodied AI, proposed by Matej Hoffmann and Rolf Pfeifer [1], suggests that to truly understand how the human brain works, a brain should be embedded in a physical body that explores and interacts with the real world. Among all the work practicing Embodied AI in recent years, target-driven navigation (TDN) is one of the most feasible and essential tasks; it combines techniques from both machine learning and robotics and is widely applicable to scenarios such as domestic service robots. It typically requires the robot to find a target object of a certain category in an unknown scene, demanding both high efficiency and a high success rate. Hence, the key problems of the TDN task are generalizing across unknown domains and exploring efficiently.

The traditional Simultaneous Localization and Mapping (SLAM) pipeline has already handled TDN to some extent [2], but numerous problems remain in its major modules. First, it remains troublesome for SLAM-based methods to acquire and maintain a lifelong updating semantic map, which demands accurate sensors and semantic information. Second, SLAM-based methods are inherently less adaptive to posterior information, so they do not generalize well in complicated environments, especially indoor scenes. Last but not least, SLAM-based methods are not specially designed for searching objects in unknown environments, which requires keeping a balance between exploitation (depth-first searching) and exploration (regretting).

Recently, learning-based methods have emerged and shown powerful capabilities for solving complicated tasks. However, these methods generally have problems with interpretability and generalization, especially in the TDN task, which requires robots to operate in unseen domains. We argue that it is more natural and effective to introduce priors [3][4] into the learning model instead of training from scratch, considering how humans teach babies. Introducing priors enables algorithms to achieve higher data efficiency, better model interpretability, and better generalization. In indoor TDN tasks, one of the most useful priors is the relationship among objects and rooms of different categories. Some recent works reason about the target direction using object relationships as priors in single-room environments [5, 6, 7]. However, common domestic scenes are composed of multiple rooms, so more prior information such as room connections, object-in-room membership, and other implicitly structured relationships could be exploited, which is typically ignored in these works.


Figure 1: The main hierarchical framework of the ReVoLT method, which contains a high-level reasoner (infers semantic sub-goals), an intermediate-level planner (computes spatial sub-goal locations), and a low-level controller (computes actions). The combinatorial relation extraction module provides priors on the exploration value of observed objects and regions through embedding similarity. In particular, the Region Rollout model provides Monte Carlo simulations for UCT in a conditional GraphRNN (c-GraphRNN) manner.

In this paper, we propose a hierarchical navigation framework, Relational Reasoning and Voronoi Local graph planning (ReVoLT), which comprises a combinatorial graph neural network for multiform domestic relation extraction, a UCT-based reasoning and exploration scheme, and an online Voronoi local graph for the semantic-spatial transition. The TDN task is concisely decomposed, allowing separate, specialized designs for different modules instead of operating in a mixed-up end-to-end manner. The detailed contributions are as follows:

  • To extract multiform structural relations for reasoning, we propose combining unsupervised GraphSAGE [8], self-supervised GCN, and c-GraphRNN methods for learning object embedding, region embedding, and region rollout, respectively.

  • Based on the relation priors, the high-level reasoner (semantic reasoning) is abstracted as a bandit problem and adopts UCT to balance exploitation (depth-first searching) and exploration (regretting).

  • Voronoi local graphs are constructed online for converting semantic sub-goals to spatial locations.

  • Test results show that the proposed framework is superior to state-of-the-art methods, achieving a higher success rate and success weighted by path length (SPL) with good generalization.

II Related Works

Graphs are useful for representing relationships among non-Euclidean data and have recently become popular for representing the physical structure of redundant robots to learn coordination between their multiple degrees of freedom [9], learning traffic signal co-regulation across multi-junction networks [10], and representing the state changes of deformable and soft objects. It is also reasonable and natural to use graphs for relational reasoning to solve TDN problems. They have the advantage of replacing the explicit metric map of SLAM-based methods, inferring the approximate position of the target object based on observed objects. Most of these methods use GNNs to learn object-object proximity relationships but ignore the relationship between regions/rooms, which limits their task scenarios to a single room (using the AI2-THOR dataset [11] for training in simulation). For example, Yang et al. [5] propose to use a Graph Convolutional Network (GCN) to incorporate prior knowledge about object relationships into a Deep Reinforcement Learning (DRL) framework as part of a joint embedding. Their priors are obtained from large-scale scene understanding datasets and updated according to the current observation.

For navigation tasks in houses with multiple rooms, it is necessary to first reach the room that may contain the target object (e.g., refrigerator-kitchen), then search for the target among object cliques. Therefore, the learning of prior knowledge should consider more relationships, including room-to-room connection and object-in-room membership. Wu et al. [12] propose a memory structure based on a Bayesian graph model. It uses a probabilistic relation graph to obtain a prior over house layouts from the training set and estimates its posterior in the test set. However, this work does not combine object-level reasoning to achieve a complete TDN task. Chaplot et al. [13] build a topological representation with associated semantic features and learn a prior semantic score function to evaluate the probability of potential nodes in a graph with various directions. However, they require target images, which is impractical in domestic scenarios, while our method only uses target labels. They subsequently extended the Active Neural SLAM system [2] to learn semantic priors with a semantically-aware long-term policy for the label-target navigation task [14] and won the CVPR 2020 Habitat ObjectNav Challenge (https://aihabitat.org/challenge/2020/) [15]. It is worth mentioning that they also point out that end-to-end learning-based methods suffer from large sample complexity and poor generalization, as they memorize object locations and appearance in training environments [14], which prompts us to consider a hierarchical framework with a topological graph. Table I lists only TDN methods with label targets and relational reasoning.

TABLE I: Performance of existing TDN methods with various experiment settings

Method             Room Scale   Dataset        SR(%)   SPL(%)
Scene-prior [5]    Single       AI2-THOR       35.4    10.9
SAVN [16]          Single       AI2-THOR       35.7    9.3
MJOLNIR [7]        Single       AI2-THOR       65.3    21.1
BRM [12]           Multiple     House3D        -       -
SemExp [14]†       Multiple     Matterport3D   36.0    14.4

† SemExp won first place in the CVPR 2020 Habitat ObjectNav Challenge.



Figure 2: Combinatorial relation extraction module. (a) Object embeddings are obtained via unsupervised weighted-GraphSAGE; (b) region embeddings are obtained by passing a sub-graph with object embeddings through GCN layers; (c) according to the house structure of region connectivity, a GraphRNN-based model is used to learn the structure distribution and generate possible features of future regions node by node.

III ReVoLT Reasoning & Planning with a Hierarchical Framework

This task needs to be re-examined from the perspective of bionics. Imagine a human facing such a task upon entering an unknown house. He will not feel confused, thanks to the prior knowledge about domestic scenes he has. It is natural to first roughly determine the type of the current room based on the categories of multiple observed objects (e.g., a bedroom). According to the object-in-room membership, the exploration value $\mathbf{V}(t|cur\_room)$ of the target object $t$ in the current room can be obtained. At the same time, some potential but unexplored passages (e.g., a door or hallway) can be identified as ghost nodes, as in [13]. The structural relationship of the house layout and room connections can help us predict the categories and values $\mathbf{V}(t|next\_room)$ of the next rooms connected by ghost nodes.

Beyond these priors, dynamic decisions should also be made within a specific task, rather than just applying experience mechanically. A reasoning procedure that combines intelligent exploration and exploitation is one of the winning strategies. Thus, our approach focuses on solving the following two problems:

  • How to obtain a more effective prior conditional exploration value in a structured form?

  • How to make efficient decisions between multiple feasible paths based on exploration values?

The remainder of this section is organized as follows. In subsections III-A, III-B, and III-C, we present a combinatorial relation extraction module (Fig. 2) using GNNs, which learns three different relationships in a unified paradigm. A UCT-based online reasoner is described in subsection III-D. In subsection III-E, we consider coarse spatial information and build an intermediate-level planner through online Voronoi construction. Finally, the whole ReVoLT hierarchical framework is summarized in subsection III-F (Fig. 1).

III-A Object Embedding learning

As illustrated in Fig. 2 (a), the object-to-object relationship consists not only of pair-wise semantic similarity, but also of distances and the number of hops between object pairs. We first extract an object-level graph $\mathcal{G}_{o}(\mathcal{V}_{o},\mathcal{E}_{o})$ from object positions $pos$ and categories $\mathcal{C}_{o}$ in the Matterport3D dataset. Objects in the same room are fully connected. As for object pairs in different rooms, only those closest to a common door have a connecting edge. This helps the robot infer objects that are strongly related to the target using only object-level embeddings.
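
To make this construction concrete, the following sketch builds such an object-level graph from already-parsed dataset annotations. The record layout, the helper name `build_object_graph`, and the inverse-distance edge weight are illustrative assumptions, not the authors' actual preprocessing code.

```python
# Minimal sketch of the object-level graph G_o described above.
# Assumes objects come as {id, category, pos, room} records and that a door
# position is known for every pair of connected rooms; the inverse-distance
# edge weight is an assumption made for illustration.
import itertools
import numpy as np
import networkx as nx

def build_object_graph(objects, doors):
    """objects: list of dicts {id, category, pos (np.ndarray), room};
    doors: dict {(room_a, room_b): door_pos} for connected room pairs."""
    g = nx.Graph()
    for o in objects:
        g.add_node(o["id"], category=o["category"], pos=o["pos"], room=o["room"])
    # Objects in the same room are fully connected; weight decays with distance.
    for a, b in itertools.combinations(objects, 2):
        if a["room"] == b["room"]:
            d = np.linalg.norm(a["pos"] - b["pos"])
            g.add_edge(a["id"], b["id"], weight=1.0 / (1.0 + d))
    # Across rooms, connect only the object pair closest to the common door.
    for (room_a, room_b), door_pos in doors.items():
        in_a = [o for o in objects if o["room"] == room_a]
        in_b = [o for o in objects if o["room"] == room_b]
        if not in_a or not in_b:
            continue
        na = min(in_a, key=lambda o: np.linalg.norm(o["pos"] - door_pos))
        nb = min(in_b, key=lambda o: np.linalg.norm(o["pos"] - door_pos))
        d = np.linalg.norm(na["pos"] - nb["pos"])
        g.add_edge(na["id"], nb["id"], weight=1.0 / (1.0 + d))
    return g
```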

GraphSAGE [8] is a popular model in the node embedding field. We adopt it to obtain an embedding of each object category that fuses semantics and proximity relationships with other categories. Our node embedding procedure uses GloVe [17] as the initial node semantic features $\{\mathbf{x}_{v},\forall v\in\mathcal{V}_{o}\}$ and employs an unsupervised form of GraphSAGE with a loss that penalizes the embedding similarity between two objects far apart and rewards that of two adjacent ones. Different from the original GraphSAGE, edge features $\{\omega_{e:u\rightarrow v},\forall e\in\mathcal{E}_{o}\}$ are also taken into account in the aggregation and loss calculations. For each search depth $k$ with weight matrices $\mathbf{W}^{k},\forall k\in\{1,\dots,K\}$, we employ an edge-weighted mean aggregator which simply takes the element-wise mean of the vectors in $\{\mathbf{h}^{k-1}_{u},\forall u\in\mathcal{N}(v)\}$ to aggregate information from node neighbors:

$$\mathbf{h}_{v}^{0}\leftarrow \mathbf{x}_{v},\ \forall v\in\mathcal{V}_{o}; \qquad \mathbf{h}_{v}^{k}\leftarrow \sigma\left(\mathbf{W}^{k}\cdot\text{mean}\left(\{\mathbf{h}_{v}^{k-1}\}\cup\{\omega_{u\to v}\cdot\mathbf{h}_{u}^{k-1},\ \forall u\in\mathcal{N}(v)\}\right)\right) \qquad (1)$$

Then an edge-weighted loss function is applied to the outputs $\{\mathbf{z}_{v},\forall v\in\mathcal{V}_{o}\}$ to tune the weight matrices $\mathbf{W}^{k}$:

$$\mathcal{L}_{\mathcal{G}_{o}}\left(\mathbf{z}_{v}\right)= -\log\left(\sigma\left(\omega_{u\to v}\mathbf{z}_{v}^{\top}\mathbf{z}_{u}\right)\right) - Q\cdot\mathbb{E}_{u_{n}\sim P_{n}(v)}\log\left(\sigma\left(-\omega_{u\to v}\mathbf{z}_{v}^{\top}\mathbf{z}_{u_{n}}\right)\right) \qquad (2)$$

where $P_{n}$ is a negative sampling distribution, $Q$ defines the number of negative samples, and $\sigma$ is the sigmoid function.

Since object embeddings of the same category $\{\mathbf{z}_{c},\forall c\in\mathcal{C}_{o}\}$ should have a consistent representation, another mean aggregation is performed over embeddings of the same category between the final GraphSAGE aggregation and the loss function. This overwrites the original value with the final embedding of each category: $\mathbf{z}_{c}\leftarrow\text{mean}(\mathbf{h}_{v}^{K})$ if $\mathcal{C}_{o}(v)=c$.
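
The aggregation in Eq. (1) and the loss in Eq. (2) can be sketched in a few lines of PyTorch. This is only an illustrative implementation under assumed tensor layouts (per-node neighbor lists and edge-weight lists); the paper's actual training code and choice of nonlinearity may differ.

```python
# Sketch of the edge-weighted mean aggregator (Eq. 1) and the unsupervised
# edge-weighted loss (Eq. 2); shapes and helper names are assumptions.
import torch
import torch.nn.functional as F

def aggregate(h, neighbors, edge_w, W_k):
    """h: [N, D] node features at depth k-1; neighbors[v]: LongTensor of v's
    neighbor indices; edge_w[v]: weights aligned with neighbors[v]; W_k: [D, D]."""
    out = []
    for v in range(h.size(0)):
        weighted = edge_w[v].unsqueeze(1) * h[neighbors[v]]               # omega_{u->v} * h_u
        mixed = torch.cat([h[v].unsqueeze(0), weighted], dim=0).mean(0)   # mean({h_v} U {...})
        out.append(torch.sigmoid(W_k @ mixed))                            # sigma(W^k . mean)
    return torch.stack(out)

def unsup_loss(z_v, z_u, w_uv, z_neg, Q=5):
    """Eq. 2: pull weighted-adjacent embeddings together, push Q negatives apart."""
    pos = -F.logsigmoid(w_uv * (z_v * z_u).sum())
    neg = -F.logsigmoid(-w_uv * (z_neg @ z_v)).mean()   # expectation approximated by the mean
    return pos + Q * neg
```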

III-B Region Embedding learning

Apart from the pairwise relationships between objects, the many-to-one relationship between an object and a room or region is also indispensable for inferring the possibility that the target object exists in a certain room or among multiple observed objects. Besides, to evaluate similarity, relationships at different levels should share a unified paradigm yielding representations with consistent metrics. Therefore, for region-level sub-graphs, we opt for the same embedding representation procedure. This part is shown in Fig. 2 (b).

Region embedding is carried out in a self-supervised form. We take the sub-graph $\mathcal{G}_{r}(\mathcal{V}_{r},\mathcal{E}_{r})$ as input, with the embeddings of objects in the same region $\{\mathbf{z}_{c},\forall c\in\mathcal{C}_{o}\}$ as nodes and weighted spatial distances as edges. The batch composed of these sub-graphs is passed into a GCN [18], and the corresponding region embeddings $\{\mathbf{r}_{v},\forall v\in\mathcal{V}_{r}\}$ are obtained. As in the previous procedure, a mean aggregation is performed over region embeddings with the same label to obtain a uniform vector representation $\{\mathbf{r}_{l},\forall l\in\mathcal{L}_{r}\}$. Since there is no need for multi-hop aggregation at the region level, a simple GCN layer is applied rather than GraphSAGE.

To enable membership calculation between a region embedding $\mathbf{r}_{l}$ and an object embedding $\mathbf{z}_{c}$, and to distinguish regions with different labels, we use a combined loss comprising two parts: a classification loss on the embedding label and a membership loss for object-in-region relations:

$$\mathcal{L}_{\mathcal{G}_{r}}\left(\mathbf{r}_{v}\right)= -\log\left(\sigma\left(\mathbf{r}_{v}^{\top}\mathbf{z}_{u}\right)\right) - Q\cdot\mathbb{E}_{u_{n}\sim P_{n}(v)}\log\left(\sigma\left(-\mathbf{r}_{v}^{\top}\mathbf{z}_{u_{n}}\right)\right) - \dfrac{1}{n}\sum_{i=1}^{n}l_{v}\log(\hat{l}(\mathbf{r}_{v})) \qquad (3)$$

where $P_{n}(v)$ represents objects not in region $v$, and $\hat{l}(\cdot)$ is a multi-layer perceptron (MLP) network.
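
A possible realization of the combined loss in Eq. (3) is sketched below, with a membership term against in-region and negatively sampled objects and a label classification term from the MLP head $\hat{l}(\cdot)$. Layer sizes are assumptions; the GCN encoder that produces $\mathbf{r}_{v}$ is omitted.

```python
# Sketch of the combined region-embedding loss (Eq. 3); dimensions are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionLoss(nn.Module):
    def __init__(self, emb_dim=16, n_labels=27, Q=5):
        super().__init__()
        # The MLP classifier l_hat(.) that predicts the region label from r_v.
        self.l_hat = nn.Sequential(nn.Linear(emb_dim, 64), nn.ReLU(),
                                   nn.Linear(64, n_labels))
        self.Q = Q

    def forward(self, r_v, z_in, z_neg, label):
        """r_v: [D] region embedding; z_in: [M, D] embeddings of objects inside
        the region; z_neg: [Q, D] sampled outside objects; label: LongTensor scalar."""
        member = -F.logsigmoid(z_in @ r_v).mean()           # objects inside: high similarity
        non_member = -F.logsigmoid(-(z_neg @ r_v)).mean()   # objects outside: low similarity
        cls = F.cross_entropy(self.l_hat(r_v).unsqueeze(0), label.view(1))
        return member + self.Q * non_member + cls
```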

III-C Region Rollout learning

As the third and most important part of relation extraction, structural relationship reasoning plays a crucial role in identifying the correct navigation direction and shortening the exploration period. To achieve this, the joint probability $p(\mathcal{G}_{h})$ of houses needs to be learned so that a probable house layout memory $\mathcal{G}_{h}\sim p(\mathcal{G}_{h}|\mathcal{G}_{sub})$ can be conceived conditioned on the observed regions $\mathcal{G}_{sub}$. However, its sample space cannot be easily characterized. Thus, house graphs are modeled as sequences following the idea of GraphRNN [19], with some concepts redefined to make it more suitable for conditional reasoning with embeddings. This part is shown in Fig. 2 (c).

$$S^{\pi}=f_{s}(\mathcal{G}_{h},\pi)=\left(A_{1}^{\pi},\ldots,A_{n}^{\pi}\right) \qquad (4)$$

where $\pi$ represents the node order, and each element $A_{i}^{\pi}\in\{0,1\}^{(i-1)\times(i-1)}, i\in\{1,\ldots,n\}$, is an adjacency matrix describing the edges between node $\pi(v_{i})$ and the previous nodes $\pi(v_{j}), j\in\{1,\dots,i-1\}$, already in the graph.

Since each $A_{i}^{\pi}$ has variable dimensions, we first pad them up to the maximum dimension $n$ and then repeat the 2D matrix 16 times to form a 3D matrix of dimension $n\times n\times 16$ as an edge mask, where 16 is the embedding length. Therefore, a featured graph can be expressed as the element-wise product of the region embedding matrix $X^{\pi}$ under the corresponding order and the sequence matrix $\{S^{\pi}\}^{3D}$:

$$p(\mathcal{G})=\prod_{i=1}^{n+1}p\left(\mathbf{x}_{i}^{\pi}\mid(\{S_{1}^{\pi}\}^{3D},\ldots,\{S_{i-1}^{\pi}\}^{3D})\odot X_{i-1}^{\pi}\right) \qquad (5)$$

where $X_{i-1}^{\pi}$ is the embedding matrix of dimension $(i-1)\times(i-1)\times 16$ containing the region embeddings before region $\pi(v_{i})$, and $\mathbf{x}_{i}^{\pi}$ is the embedding of $\pi(v_{i})$.

Passing $\{S^{\pi}\}^{3D}\odot X^{\pi}$ as a sequence into a GRU or LSTM, we can learn the structural distribution of houses. This allows us to predict the next region embedding and label conditioned on the observed subgraph. The loss function of the Region Rollout network is the cross-entropy between the label of the generated embedding and the real label:

$$\mathcal{L}_{\mathcal{G}_{h}}(\mathbf{x}^{\pi}_{i})=-\dfrac{1}{n}\sum_{i=1}^{n}l_{i}\,\text{log-softmax}\left[(\mathbf{x}^{\pi}_{i})^{\top}\mathbf{r}_{j}\right],\ \forall j\in\mathcal{L}_{r} \qquad (6)$$
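
The rollout network itself can be sketched as a GRU over the masked embedding sequence of Eq. (5), scored with the label cross-entropy of Eq. (6). Tensor dimensions, the hidden size, and the padding scheme are assumptions made for illustration.

```python
# Sketch of the c-GraphRNN region rollout (Eqs. 4-6); sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionRollout(nn.Module):
    def __init__(self, emb_dim=16, max_nodes=12, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(max_nodes * emb_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, emb_dim)

    def forward(self, adj_seq, emb_seq):
        """adj_seq: [B, T, max_nodes] 0/1 connectivity of each new region to the
        previous ones (padded); emb_seq: [B, T, max_nodes, emb_dim] embeddings of
        previous regions (padded). Returns predicted next-region embeddings [B, T, emb_dim]."""
        masked = adj_seq.unsqueeze(-1) * emb_seq   # edge mask {S^pi}^3D applied to X^pi
        x = masked.flatten(2)                      # flatten to a per-step input vector
        h, _ = self.rnn(x)
        return self.head(h)

def rollout_loss(pred, prototypes, labels):
    """pred: [N, emb_dim] generated embeddings; prototypes: [L, emb_dim] label
    embeddings r_j; labels: [N] ground-truth region label ids."""
    logits = pred @ prototypes.t()                 # similarity to each label prototype
    return F.cross_entropy(logits, labels)         # Eq. 6 as a cross-entropy
```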

In conclusion, with the combination of the unsupervised edge-weighted GraphSAGE object embedding learning (III-A), the self-supervised GCN region embedding learning (III-B), and the c-GraphRNN conditional region rollout (III-C), we can now extract multiform structural relationships. Meanwhile, embedding is used as a unified paradigm for representation, and the similarity between the embeddings of objects or regions (either observed or predicted) and the target object embedding is used as a prior to guide exploration in an unknown domain.

III-D Reasoning and Exploring as a Bandit Problem


Figure 3: In a specific task, a multi-layer topological graph is constructed based on the visual front-end, and a tree with the birthplace as the root node is abstracted from the graph. A clique refers to a collection of adjacent objects or a bunch of non-semantic obstacles, and a vertex refers to an observed navigable location. Each gray ghost node connects two vertices and only stores their relative position to assist localization; it is not used as a navigation sub-goal. Black ghost nodes refer to unknown areas and promote exploration.

Priors alone cannot lead to success. Inspired by [13], a posterior topological representation is also constructed in each specific task to combine experience with practice. Specifically, we build a multi-layer posterior topological graph covering the object, clique, and vertex levels. Cliques divide rooms into small clustered regions and reduce the burden of the visual front-end. Each vertex governs the three nearest cliques. The Object Embedding network provides the object node features, and the Region Embedding network generates the features of both cliques and vertices from their attached objects. The Region Rollout network evaluates ghost nodes. However, in reality there are always situations contrary to experience; in other words, robots must be able to balance exploration and exploitation online.
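
As an illustration only, the multi-layer graph can be organized with node records along the following lines; field names mirror the description above and are not the authors' exact implementation.

```python
# Illustrative node records for the multi-layer topological graph; names are assumptions.
from dataclasses import dataclass, field
from typing import List, Optional
import numpy as np

@dataclass
class ObjectNode:
    category: str
    embedding: np.ndarray              # from the Object Embedding network

@dataclass
class CliqueNode:
    objects: List[ObjectNode]          # adjacent objects / non-semantic obstacle cluster
    embedding: np.ndarray              # from the Region Embedding network
    visits: int = 0

@dataclass
class GhostNode:
    relative_pose: np.ndarray          # gray ghost: relative pose between two vertices
    unexplored: bool = False           # black ghost: unknown area promoting exploration

@dataclass
class VertexNode:
    position: np.ndarray               # observed navigable location (relative coordinates)
    cliques: List[CliqueNode] = field(default_factory=list)  # the three nearest cliques
    ghosts: List[GhostNode] = field(default_factory=list)
    room_label: Optional[str] = None   # inferred from the attached cliques and objects
    visits: int = 0
```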

We adopt the Upper Confidence Bound for Trees (UCT) method [20] to set an online bonus. The simulation procedure of UCT is supported by the Region Rollout network; thus, the robot not only obtains a bonus from the visit count, but also estimates the future exploration value (inductive bias) $\omega_{i}$ of the selected path. This effectively prevents the robot from being trapped in a useless area. The combined effect of the inductive bias $\omega$ and the bonus discourages repetitive searching near negative (non-success) sub-goals and drives the robot to return to parent nodes for back-tracking, which we term Revolt Reasoning. The word Revolt summarizes the characteristics of our method vividly: it allows robots to regret at nodes with low exploration value, discarding them and returning to previous paths. To avoid the robot wandering between two goals, a navigation loss term $\mathcal{L}_{dis}$ is introduced to penalize node distances. Hence, we can finally obtain the exploration value $\mathbf{V}$ of node $i$ as:

$$\mathbf{V}(t|i)=\dfrac{\Sigma^{m}_{i\rightarrow j}\,\omega_{j}}{m}+c_{1}\sqrt{\dfrac{\ln N_{i}}{n_{i}}}-c_{2}\mathcal{L}_{dis} \qquad (7)$$

where the factors $c_{1}$ and $c_{2}$ are set to 1 and 0.5, respectively. $j$ refers to one of node $i$'s descendants in the tree, and $m$ is their total number. $N_{i}$ is the total number of arrivals at node $i$ and its descendants, while $n_{i}$ only counts arrivals at node $i$.
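
For clarity, Eq. (7) can be computed as below; the per-node bookkeeping interface is an assumption, and the small guards against empty or zero counts are added only to keep the sketch runnable.

```python
# Sketch of the node exploration value V(t|i) in Eq. 7.
import math

def exploration_value(omega_descendants, N_i, n_i, dist_loss, c1=1.0, c2=0.5):
    """omega_descendants: inductive-bias values omega_j of node i's descendants;
    N_i: arrivals at node i and its descendants; n_i: arrivals at node i itself;
    dist_loss: the navigation distance penalty L_dis."""
    m = max(len(omega_descendants), 1)
    exploit = sum(omega_descendants) / m                          # prior-driven term
    bonus = c1 * math.sqrt(math.log(max(N_i, 1)) / max(n_i, 1))   # UCT visit bonus
    return exploit + bonus - c2 * dist_loss

# The reasoner then selects the child with the highest value, e.g.:
# best = max(children, key=lambda c: exploration_value(c.omegas, c.N, c.n, c.dist))
```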

III-E Online constructed Voronoi local graph


Figure 4: Combining depth information with the robot's pose over a short period yields a simple 3D reconstruction. A Voronoi local graph can then be constructed through DBSCAN clustering after projecting the 3D map onto a 2D obstacle scatter plot.

The reasoner only gives a semantic node id in the graph as a sub-goal. If the low-level controller used it directly as a navigation goal, this would inevitably lead to over-coupling and make navigation success harder. Consider the hierarchical human central nervous system composed of the brain, cerebellum, brain-stem, and spinal cord [21]: if the high-level reasoner is compared to the brain, then the skeletal muscle is the low-level motor controller. The brain does not directly transmit motion instructions to the skeletal muscles; instead, instructions pass through the brain-stem, spinal cord, and other lower-level centers for conversion [22]. Moreover, the brain does not support high-speed, low-latency information exchange while controlling a motion [23]. Therefore, it is necessary to use an RGB-D camera and an odometer to construct a local Voronoi graph, offering approximate relative coordinates of the sub-goal within a reachable range as input to the low-level controller. The Voronoi graph records the relationship between the robot and obstacles and provides an available path. Since the TDN task is map-less, we construct the local Voronoi graph online within a fixed number of steps.

Conditioned on the depth information, the camera parameters (intrinsic and extrinsic), and the odometer readings, obstacles in depth images can easily be converted into coordinates in a world coordinate system derived from the robot's birth pose. Projecting this partially reconstructed 3D map onto a 2D plane along the vertical axis produces a scatter diagram of obstacles. We can then construct a Voronoi diagram online by segmenting navigable paths and explorable cliques with multiple related objects. Different from traditional methods [24], we first use DBSCAN [25, 26], a density-based clustering algorithm, to cluster the scattered points of adjacent obstacles into convex hulls and filter out noise points, and then construct a Delaunay triangulation from the centers of the scattered points in each convex hull, thereby generating the Voronoi diagram.
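
A minimal version of this pipeline, assuming the obstacle scatter is already expressed in the robot's world frame, could look as follows; the DBSCAN parameters and the minimum-point guard are illustrative, and scipy's Voronoi (the dual of the Delaunay triangulation) is used in place of an explicit Delaunay step.

```python
# Sketch of the online Voronoi local graph: cluster obstacle points with DBSCAN,
# drop noise, and build a Voronoi diagram over the cluster centers so that its
# finite edges serve as candidate navigable paths between obstacle cliques.
import numpy as np
from sklearn.cluster import DBSCAN
from scipy.spatial import Voronoi

def voronoi_local_graph(obstacle_points, eps=0.3, min_samples=8):
    """obstacle_points: [N, 2] obstacle scatter in the robot's world frame."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(obstacle_points)
    centers = np.array([obstacle_points[labels == c].mean(axis=0)
                        for c in sorted(set(labels)) if c != -1])   # one clique per cluster
    if len(centers) < 4:                       # too few clusters for a useful diagram
        return centers, np.empty((0, 2)), []
    vor = Voronoi(centers)
    vertices = vor.vertices                    # candidate navigable vertex positions
    edges = [r for r in vor.ridge_vertices if -1 not in r]  # finite Voronoi edges (paths)
    return centers, vertices, edges
```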

III-F Hierarchical reasoning and planning for navigation


Figure 5: The semantic sub-goal is converted into relative coordinates by the Voronoi-based intermediate-level planner.

In this section, we summarize how the proposed reasoner and planner cooperate to complete navigation tasks. The curves in Fig. 5 show the correspondence of concepts between the topological graph in the reasoner and the Voronoi diagram in the planner. An aggregation of obstacles is regarded as a clique, each of which attaches and records all objects in its convex hull and evaluates its inductive bias value according to the object-in-room membership via the Region Embedding network. The position of a vertex is generated by the Voronoi diagram. The multiple cliques and their subordinate objects surrounding a vertex jointly determine its general room label, which is used for the inductive bias evaluation. Relative directions and distances between two adjacent vertex nodes are stored in gray ghost nodes. Since the robot exploits relative coordinates and directions, it effectively avoids the influence of odometer and depth camera errors and is thus insensitive to cumulative error. Besides, thanks to the Voronoi local diagram, only short-period scatter data need to be saved, and there is no need to consider the loop-closure matching problem as in SLAM.

With the construction of the Voronoi diagram and its transformation into a hierarchical topology, we can conduct reasoning at the vertex/clique level and the object level, searching for the best vertex position and the most likely clique based on the exploration value. After selecting a clique, the robot navigates towards it and explores it more explicitly via object-level reasoning. Besides, the Voronoi diagram provides evidence for choosing the next best view of a clique. By changing perspectives, the robot can find the target object in a clique more efficiently.

IV Experiments

IV-A Experiment Setup

We use the Habitat simulator [27] with the Matterport3D [28] environment as our experiment platform. Habitat is a 3D simulator with configurable agents, multiple sensors, and generic 3D dataset handling. The Matterport3D dataset contains 90 houses with 40 categories of objects and 31 region labels. It also provides detailed object and region segmentation information. Here we focus on the 21 categories of target objects required by the task: chair, table, picture, cabinet, cushion, sofa, bed, chest of drawers, plant, sink, toilet, stool, towel, tv monitor, shower, bathtub, counter, fireplace, gym equipment, seating, and clothes, and we ignore some meaningless room labels, such as outdoor, no label, other room, and empty room. We use YOLOv4 [29] as our object detection module, fine-tuned on objects in the Matterport3D dataset. Because the aim of the low-level controller is the same as that of the PointNav task [30], we adapt a pre-trained state-of-the-art PointNav method, occupancy anticipation [31], as our controller.

During a specific TDN task, the robot is spawned at a random location in a certain house and is required to find an object of a given category as quickly as possible. The task is evaluated with three commonly used indicators: Success Rate (SR), Success weighted by Path Length (SPL), and Distance to Success (DTS). SR represents the fraction of episodes in which the target was found and is defined as $\frac{1}{N}\sum^{N}_{i=1}Su_{i}$, where $N$ is the total number of episodes and $Su_{i}$ is a binary value representing the success or failure of the $i$-th episode. SPL accounts for both success and the optimal path length and is defined as $\frac{1}{N}\sum_{i=1}^{N}S_{i}\frac{L_{i}}{\max\left(P_{i},L_{i}\right)}$, where $L_{i}$ is the shortest path length provided by the simulator and $P_{i}$ is the path length of the robot in episode $i$. DTS is the distance of the agent from the success threshold boundary when the episode ends. The boundary is set to $1\,m$ and the maximum episode length is 500 steps, the same as in [14].
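
For reference, SR and SPL can be computed directly from per-episode logs as in the short sketch below; the input lists are assumptions about how the episode statistics are stored.

```python
# Sketch of the SR and SPL metrics defined above.
def success_rate(successes):
    """successes: list of 0/1 per-episode success flags Su_i."""
    return sum(successes) / len(successes)

def spl(successes, shortest, executed):
    """shortest: optimal lengths L_i from the simulator; executed: robot path lengths P_i."""
    return sum(s * l / max(p, l)
               for s, l, p in zip(successes, shortest, executed)) / len(successes)

# Example: spl([1, 0, 1], [5.0, 7.0, 4.0], [6.5, 30.0, 4.0]) ≈ 0.59
```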

Furthermore, our navigation task has two modes: independent (ReVoLT-i) and continuous (ReVoLT-c). Independent mode is the traditional one: the environment is reset after each episode and the robot clears its memory. The continuous mode allows the robot to keep the topological graph if it resets in the same house; it is used for evaluating the robot's capability of keeping and updating the environment memory.

IV-B Baselines

Random: At each step, the agent randomly samples an action from the action space with a uniform distribution.

RGBD + DD-PPO: This baseline is provided by the ObjectNav Challenge 2020 [27]. RGB-D observations are passed directly to an end-to-end DD-PPO policy, which outputs an action.

Active Neural SLAM: This baseline uses an exploration policy trained to maximize coverage [2], followed by a heuristic-based local policy.

SemExp: Since [14] has not open-sourced their code, we directly use the results reported in their paper as the state-of-the-art reference.

IV-C Results

IV-C1 Results of combinatorial relation embeddings

The Object Embedding network achieves a classification accuracy of 91%. The Region Embedding network achieves a membership accuracy of 78% and a classification accuracy of 75%. The Region Rollout network reaches a prediction accuracy of 45% on the test set, which is acceptable since room-to-room relationships are inherently weak.

IV-C2 Results of the whole TDN task


Figure 6: Top-down maps of four successful tasks using ReVoLT-i. The blue squares are the starting positions, the blue curves are the robot trajectories, and arrows represent the robot's current positions. Targets are highlighted with green boxes, and pink areas denote the success threshold boundary. The trajectory color is a gradient from dark to light; a brighter end indicates a longer path.
TABLE II: Performance Comparison

Method                SR(%)   SPL     DTS (m)
Random                0       0       10.3298
RGBD + DD-PPO         6.2     0.021   9.3162
Active Neural SLAM    32.1    0.119   7.056
SemExp¹               36.0    0.144   6.733
ReVoLT-i small*       66.7    0.265   0.9762
ReVoLT-i*             62.5    0.102   1.0511
ReVoLT-c*             85.7    0.070   0.0253

¹ Winner of the AI Habitat 2020 challenge.
* These three rows refer to the small mode with only 6 target categories as in SemExp, the independent mode (-i), and the continuous mode (-c) of ReVoLT.

The results of the baseline methods and ReVoLT are shown in Table II. Both of our models significantly outperform the current state-of-the-art. ReVoLT-i small achieves an approximately 80% increase in SR and nearly twice the SPL of SemExp. This confirms our hypothesis that separating prior learning and the control policy in a hierarchical framework is a wiser approach than directly learning a semantically-aware policy. Besides, the standard ReVoLT-i with 19 target categories still achieves a higher SR and SPL. By applying the continuous mode, the robot retains a memory of the same house, which allows it to find previously observed targets with a higher SR.

V Ablation Study


Figure 7: Ablation experiments on each of the three parts of the exploration value function, illustrated with top-down maps.

The success of ReVoLT is attributed to the relationship priors provided by the combinatorial GNNs, the online bonus from UCT, and the distance penalty. Therefore, we set up three extra experiments with the same Voronoi-based planner and low-level controller to reveal their respective impacts. Moreover, the results of the continuous mode are also presented below. The performance of all variants is listed in Table III.

TABLE III: Performance of Ablation Experiments

Method                  SR(%)   SPL     DTS (m)
ReVoLT-i                62.5    0.102   1.0511
ReVoLT-c                85.7    0.070   0.0253
ReVoLT w/o priors       25.0    0.003   1.4129
ReVoLT w/o bonus        60.0    0.034   0.8139
ReVoLT w/o distance     54.5    0.030   1.2689

ReVoLT w/o relationship priors. Without priors, sub-goals are generated according to the distance of the observed cliques. Comparing Fig. 7 (a) with Fig. 6, we find that the lack of semantic relationships profoundly affects the robot's path decisions, making it uninterested in the region containing the target even when it is nearby. Besides, the lack of region classification and region rollout makes the robot unable to use the observed semantic information to reason about relationships, resulting in longer paths.
ReVoLT w/o UCT bonus. The bonus is replaced with a fixed threshold: if the robot reaches the same clique or vertex node more than twice, that node will no longer be selected as a sub-goal. The corresponding top-down maps are illustrated in Fig. 7 (b). Without the UCT bonus, the robot gets stuck in an unpromising local region until the threshold is reached.
ReVoLT w/o distance penalty. In Fig. 7 (c), using only priors and bonuses can also complete tasks, but the paths are longer due to vacillating decisions.
ReVoLT with continuous mode. The left figure of Fig. 7 (d) is the same as the one in Fig. 6. When searching for the second target in this house, once the robot associates current observations with its memory, it can find the target with a higher success rate. However, this also makes the robot more focused on exploitation and reduces exploration, which may cause it to ignore closer targets and lead to a lower SPL.

To sum up, relationship priors are essential for robots to understand environment semantics and are the major factor affecting SR. The UCT bonus and distance penalty contribute to the improvement of SPL. ReVoLT-c maintains a long-term scene memory and achieves outstanding performance.

VI Conclusion

We propose ReVoLT, a hierarchical reasoning target-driven navigation framework that combines combinatorial graph relation extraction and online UCT decision-making operating on a multi-layer topological graph. ReVoLT shows better performance in exploiting prior relationships, and its bandit-based reasoning is more reasonable and efficient. To bridge the gap between existing point-goal controllers and our reasoner, we adopt the Voronoi local graph for the semantic-spatial transition. However, some significant challenges remain in this field. Our future directions lie in using representation learning techniques to introduce richer object information such as shape, color, and size, using scene graph detection to introduce richer semantic relation information such as furniture layout, and tackling richer tasks such as object instance navigation.

References

  • [1] M. Hoffmann and R. Pfeifer, “The implications of embodiment for behavior and cognition: animal and robotic case studies,” arXiv preprint arXiv:1202.0440, 2012.
  • [2] D. S. Chaplot, D. Gandhi, S. Gupta, A. Gupta, and R. Salakhutdinov, “Learning to explore using active neural slam,” in International Conference on Learning Representations, 2019.
  • [3] K. Chatzilygeroudis, V. Vassiliades, F. Stulp, S. Calinon, and J.-B. Mouret, “A survey on policy search algorithms for learning robot controllers in a handful of trials,” IEEE Transactions on Robotics, vol. 36, no. 2, pp. 328–347, 2019.
  • [4] J. Liu, J. Shou, Z. Fu, H. Zhou, R. Xie, J. Zhang, J. Fei, and Y. Zhao, “Efficient reinforcement learning control for continuum robots based on inexplicit prior knowledge,” arXiv preprint arXiv:2002.11573, 2020.
  • [5] W. Yang, X. Wang, A. Farhadi, A. Gupta, and R. Mottaghi, “Visual semantic navigation using scene priors,” arXiv preprint arXiv:1810.06543, 2018.
  • [6] H. Du, X. Yu, and L. Zheng, “Learning object relation graph and tentative policy for visual navigation,” in European Conference on Computer Vision, pp. 19–34, Springer, 2020.
  • [7] Y. Qiu, A. Pal, and H. I. Christensen, “Learning hierarchical relationships for object-goal navigation,” 2020.
  • [8] W. L. Hamilton, R. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” in Advances in Neural Information Processing Systems (NeurIPS), 2017.
  • [9] J. Liu, Y. Chen, Z. Dong, S. Wang, S. Calinon, M. Li, and F. Chen, “Robot cooking with stir-fry: Bimanual non-prehensile manipulation of semi-fluid objects,” IEEE Robotics and Automation Letters, vol. 7, no. 2, pp. 5159–5166, 2022.
  • [10] J. Liu, H. Zhang, Z. Fu, and Y. Wang, “Learning scalable multi-agent coordination by spatial differentiation for traffic signal control,” Engineering Applications of Artificial Intelligence, vol. 100, p. 104165, 2021.
  • [11] E. Kolve, R. Mottaghi, W. Han, E. VanderBilt, L. Weihs, A. Herrasti, D. Gordon, Y. Zhu, A. Gupta, and A. Farhadi, “Ai2-thor: An interactive 3d environment for visual ai,” arXiv preprint arXiv:1712.05474, 2017.
  • [12] Y. Wu, Y. Wu, A. Tamar, S. Russell, G. Gkioxari, and Y. Tian, “Bayesian relational memory for semantic visual navigation,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 2769–2779, 2019.
  • [13] D. S. Chaplot, R. Salakhutdinov, A. Gupta, and S. Gupta, “Neural topological slam for visual navigation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12875–12884, 2020.
  • [14] D. S. Chaplot, D. P. Gandhi, A. Gupta, and R. R. Salakhutdinov, “Object goal navigation using goal-oriented semantic exploration,” Advances in Neural Information Processing Systems (NeurIPS), vol. 33, 2020.
  • [15] D. Batra, A. Gokaslan, A. Kembhavi, O. Maksymets, R. Mottaghi, M. Savva, A. Toshev, and E. Wijmans, “ObjectNav Revisited: On Evaluation of Embodied Agents Navigating to Objects,” in arXiv:2006.13171, 2020.
  • [16] M. Wortsman, K. Ehsani, M. Rastegari, A. Farhadi, and R. Mottaghi, “Learning to learn how to learn: Self-adaptive visual navigation using meta-learning,” 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6743–6752, 2019.
  • [17] J. Pennington, R. Socher, and C. D. Manning, “Glove: Global vectors for word representation,” in Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543, 2014.
  • [18] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in International Conference on Learning Representations (ICLR), 2017.
  • [19] J. You, R. Ying, X. Ren, W. Hamilton, and J. Leskovec, “Graphrnn: Generating realistic graphs with deep auto-regressive models,” in International Conference on Machine Learning, pp. 5708–5717, 2018.
  • [20] P.-A. Coquelin and R. Munos, “Bandit algorithms for tree search,” in Proceedings of the Twenty-Third Conference on Uncertainty in Artificial Intelligence, pp. 67–74, 2007.
  • [21] D. Purves, R. Cabeza, S. A. Huettel, K. S. LaBar, M. L. Platt, M. G. Woldorff, and E. M. Brannon, Cognitive neuroscience. Sunderland: Sinauer Associates, Inc, 2008.
  • [22] E. Bizzi, M. C. Tresch, P. Saltiel, and A. d’Avella, “New perspectives on spinal motor systems,” Nature Reviews Neuroscience, vol. 1, no. 2, pp. 101–108, 2000.
  • [23] D. A. Rosenbaum, Human motor control. Academic press, 2009.
  • [24] R. Mahkovic and T. Slivnik, “Generalized local voronoi diagram of visible region,” in Proceedings. 1998 IEEE International Conference on Robotics and Automation (Cat. No. 98CH36146), vol. 1, pp. 349–355, IEEE, 1998.
  • [25] K. Khan, S. U. Rehman, K. Aziz, S. Fong, and S. Sarasvady, “Dbscan: Past, present and future,” in The fifth international conference on the applications of digital information and web technologies (ICADIWT 2014), pp. 232–238, IEEE, 2014.
  • [26] E. Schubert, J. Sander, M. Ester, H. P. Kriegel, and X. Xu, “Dbscan revisited, revisited: why and how you should (still) use dbscan,” ACM Transactions on Database Systems (TODS), vol. 42, no. 3, pp. 1–21, 2017.
  • [27] M. Savva, A. Kadian, O. Maksymets, Y. Zhao, E. Wijmans, B. Jain, J. Straub, J. Liu, V. Koltun, J. Malik, D. Parikh, and D. Batra, “Habitat: A platform for embodied AI research,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
  • [28] A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Niessner, M. Savva, S. Song, A. Zeng, and Y. Zhang, “Matterport3D: Learning from RGB-D data in indoor environments,” International Conference on 3D Vision (3DV), 2017.
  • [29] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, “Yolov4: Optimal speed and accuracy of object detection,” arXiv preprint arXiv:2004.10934, 2020.
  • [30] A. Kadian, J. Truong, A. Gokaslan, A. Clegg, E. Wijmans, S. Lee, M. Savva, S. Chernova, and D. Batra, “Sim2real predictivity: Does evaluation in simulation predict real-world performance?,” 2019.
  • [31] S. K. Ramakrishnan, Z. Al-Halah, and K. Grauman, “Occupancy anticipation for efficient exploration and navigation,” in European Conference on Computer Vision, pp. 400–418, Springer, 2020.