
Scene-Intuitive Agent for Remote Embodied Visual Grounding

Xiangru Lin1  Guanbin Li2     Yizhou Yu1,3
1The University of Hong Kong    2Sun Yat-sen University    3Deepwise AI Lab
[email protected]
Corresponding author is Guanbin Li. This work was partially supported by National Key Research and Development Program of China (No.2020YFC2003902). This work was supported in part by the Guangdong Basic and Applied Basic Research Foundation under Grant No.2020B1515020048, in part by the National Natural Science Foundation of China under Grant No.61976250 and No.U1811463. This work was also sponsored by CCF-Tencent Open Research Fund.
   [email protected]    [email protected]
Abstract

Humans learn from life events to form intuitions towards the understanding of visual environments and languages. Suppose you are given the high-level instruction “Go to the bathroom in the master bedroom and replace the blue towel on the left wall”; what would you do to carry out the task? Intuitively, we comprehend the semantics of the instruction to form an overview of where a bathroom is and what a blue towel is in mind; then, we navigate to the target location by consistently matching the bathroom appearance in mind with the current scene. In this paper, we present an agent that mimics such human behaviors. Specifically, we focus on the Remote Embodied Visual Referring Expression in Real Indoor Environments (REVERIE) task, where an agent is asked to correctly localize a remote target object specified by a concise high-level natural language instruction, and we propose a two-stage training pipeline. In the first stage, we pre-train the agent with two cross-modal alignment sub-tasks, namely the Scene Grounding task and the Object Grounding task. The agent learns where to stop in the Scene Grounding task and what to attend to in the Object Grounding task. Then, to generate action sequences, we propose a memory-augmented attentive action decoder to smoothly fuse the pre-trained vision and language representations with the agent's past memory experiences. Without bells and whistles, experimental results show that our method significantly outperforms previous state-of-the-art (SOTA) methods, demonstrating its effectiveness.

1 Introduction

Vision and Language tasks, such as Vision-and-Language Navigation (VLN) [2], Visual Question Answering (VQA) [3, 4] and Referring Expression Comprehension (REF) [21, 49, 50], have been extensively studied in the wave of deep neural networks. In particular, VLN [2, 5] is a challenging task that combines both natural language understanding and visual navigation, and recent works have shown promising progress. They mainly focus on designing agents capable of grounding fine-grained natural language instructions, where detailed information is provided, to find where to stop, for example, “Leave the bedroom and take a left. Take a left down the hallway and walk straight into the bathroom at the end of the hall. Stop in front of the sink” [12, 28, 44, 43, 40, 22]. However, a practical issue is that fine-grained natural language instructions are not always available in real life, and human-machine interactions are mostly based on high-level instructions such as “Go to the bathroom at the end of the hallway”. In other words, designing an agent that can interpret high-level natural language and infer the probable target location using knowledge of the environment is of more practical use.

In this paper, we focus on the REVERIE task [36], which is an example of the above-mentioned high-level instruction tasks. Here, we briefly introduce the setting. Given a high-level instruction that refers to a remote target object at a target location within a building, a robot agent spawns at a starting location in the same building and tries to navigate closer to the object. The output of the task is a bounding box encompassing the target object, and success is evaluated based on explicit object grounding at the correct target location. A straightforward solution is to integrate a SOTA navigation model with a SOTA object grounding model. This strategy has proven to be inefficient in [36]; instead, the authors proposed an interactive module that enables the navigation model to work together with the object grounding model. Although the performance is improved, we observe that such a method has a key weakness: it is unreasonable to interpret high-level instructions by directly borrowing a fine-grained instruction navigation model built on a simple trainable language attention mechanism, because the perception of high-level instructions primarily depends on commonsense knowledge priors as well as past experiences in memory. Therefore, the overall design is not in line with human intuitions in high-level instruction navigation.

Refer to caption
Figure 1: The overview of the two pre-training tasks, the Scene Grounding task and the Object Grounding task. The Scene Grounding task empowers the agent to reason where the target location is, and the Object Grounding task teaches it what to attend to.

Designing an agent to solve a problem like the REVERIE task is still underexplored, and there is no systematic way to design such an agent yet. Then, how does human wisdom solve this task? Human beings have instincts for understanding surrounding visual environments and languages. Intuitively, given a high-level instruction, we would first extract high-level what and where information and then form an overview of the appearance of the target location in mind based on commonsense knowledge. During navigation, we would consistently match the current scene and the objects in it to the instruction semantics and decide where to navigate next. Following these intuitions, we approach this problem from a new perspective and present an agent that imitates such human behaviors. Concretely, we define our problem as designing an agent that is able to solve the where and what problems in the REVERIE task, and we propose a two-stage training pipeline. In the first stage, we design two pre-training tasks mimicking the aforementioned human intuitions. In the second stage, we train the agent with a memory-augmented attentive action decoder, further increasing the agent's navigation capability under high-level instructions.

Pre-training Stage. As shown in Fig. 1, we introduce a new subtask called the Scene Grounding task, which trains the agent to recognize which viewpoint in a set of viewpoints is best aligned with the high-level instruction, and another subtask called the Object Grounding task, which helps the agent identify the object that best matches the instruction among a set of candidate objects located at a target viewpoint. Experimental results show that the Scene Grounding model recognizes the target viewpoint with high accuracy and the Object Grounding model outperforms the previous best model used in [51, 36] by more than 10%.

Action Decoding Stage. In this stage, with the pre-trained models serving as scene and language encoders, we propose a memory-augmented attentive action decoder that leverages a scene memory structure as the agent's internal past state memory. This design is based on the fact that the computation of the action at a specific time step may depend on any information provided in the past. Experimental results indicate that the proposed structure is effective and achieves new state-of-the-art performance.

To sum up, this paper has the following contributions:

  • We propose a new framework that borrows human intuitions for designing an agent capable of understanding high-level instructions, and that closely integrates navigation and visual grounding in both training and inference. Specifically, the visual grounding models are pre-trained and serve as vision and language encoders for training the navigation action decoder in the training phase. In inference, the action is predicted by considering logits from both the visual grounding models and the navigation decoder.

  • We introduce two novel pre-training tasks, the Scene Grounding task and the Object Grounding task, and a new memory-augmented attentive action decoder in our framework. The pre-training tasks help the agent learn where to stop and what to attend to, and the action decoder effectively exploits past observations to fuse the visual and textual modalities.

  • Without bells and whistles, our method outperforms all previous methods, achieving new state-of-the-art performance on both seen and unseen environments on the REVERIE task.

2 Related Work

Vision-and-Language Navigation and REVERIE. In VLN, an agent is required to navigate to a goal location in a 3D simulator based on fine-grained instructions. [2] proposed the Matterport3D Simulator and designed the Room-to-Room task, and many methods have since been proposed to solve it [12, 44, 43, 40, 22]. On the other hand, the recently proposed REVERIE task [36] differs from traditional VLN in that it requires an agent to navigate and localize the target object simultaneously under the guidance of a high-level instruction. The model proposed there trains the navigation model with an interactive module that works together with the object grounding model [51], in the hope that the model can learn to understand high-level instructions in a data-driven manner. However, our motivation is essentially different in that we inject commonsense knowledge priors and past memory experiences into the action policy, taking into consideration how humans deal with such high-level instruction navigation problems. Specifically, we introduce two pre-training tasks and a memory-based action policy to make the agent scene-intuitive. Moreover, our pre-training tasks differ from the ones proposed in [12, 52, 29] in that their motivation rests on the fact that the ground-truth navigation path is actually hidden in the fine-grained instruction, which is not the case in high-level instruction navigation.

Memory-based policy for navigation tasks. Various memory models have been extensively studied for navigation agents, including unstructured memory [17, 31, 46, 18, 30, 9], addressable memory [32, 33], topological memory [37], and metric grid-based maps [14, 1]. Unstructured memory representations, such as LSTM memory, have been used extensively in both 2D and 3D environments. However, the issue with RNN-based memory is that it does not support context-dependent state feature storage or retrieval and does not have long-term memory [1, 20, 11]. To address these limitations, more advanced memory structures, such as addressable, topological, and metric-based memories, have been proposed. In this paper, we adopt a simple addressable memory structure. The aim of using such a simple design is 1) to intentionally make it lightweight, thus reducing computational overhead, since computational cost matters in REVERIE and our pipeline already contains heavy models; and 2) to improve the performance of the overall pipeline rather than to design a more advanced memory superior to others. Besides, in VLN, constructing a metric map memory requires fine-grained language instructions as guidance, which are not available in our task, and building a topological memory requires pre-exploration of the environment, a technique that would certainly help our agent but is beyond the scope of this paper.

Vision-and-Language BERT based referring expression comprehension. Recent years have witnessed a resurgence of active research in transferable image-text representation learning. BERT-based models [10, 39, 38, 27, 6, 26] have achieved superior performance on multiple vision-and-language tasks by transferring models pre-trained on large sets of aligned image-text pairs to downstream tasks. In BERT-based VLN, the agents most related to ours are [15] and [29]. [15] treats VLN as a vision-and-language alignment task and utilizes a pre-trained vision-and-language BERT model to predict action sequences, while [29] formulates VLN as an instruction and path alignment task and adopts a pre-trained vision-and-language BERT model to find the candidate path that best matches the given instruction. However, our work differs in that we propose a generalized pipeline that mimics human intuitions to solve the high-level instruction navigation task, in which the vision-and-language BERT model is a building block that can be replaced by other vision-language alignment modules. Experimental results show that the main performance gain comes from our proposed pipeline.

Refer to caption
Figure 2: The overall pipeline of our method. The green part of the figure denotes the memory module, where the current viewpoint feature $\boldsymbol{V}_{t}$ and the previous action feature $\boldsymbol{a}_{t-1}$ are embedded and stored in the memory. Transformer blocks are used to generate $\boldsymbol{s}_{t}^{a}$. The red rectangles represent two pre-trained models, namely the Scene Grounding model and the Object Grounding model. $ViLEncoder$ consists of ViLBERT and a BiLSTM, and $ViLPointer$ is ViLBERT trained on the viewpoint-based object grounding task. At each time step $t$, the agent perceives the instruction together with viewpoint features and object features. Action prediction is made by the Action Select part, where an attentive structure is applied. The final action is generated by considering the scene grounding score $g_{sg}$, the object grounding score $g_{og}$ and the action logit $\boldsymbol{l}_{t}$. The dashed dot lines are used only for illustration purposes.

3 Method

In the REVERIE task, an agent placed at a starting location navigates to the target location to localize an object specified by a high-level instruction. To carry out this difficult task, we propose a novel pipeline that contains a scene grounding model, an object grounding model, and a memory-based action decoder. We make two design choices: first, to better grasp the semantics of high-level instructions, we choose the ViLBERT model as our basic building block serving as the vision-and-language encoder; second, since the scene grounding task and the object grounding task are essentially different, we do not share the basic building blocks between them. In general, we decompose our method into two stages, as shown in Fig. 2, namely the pre-training stage and the action decoding stage. In the following sections, we first introduce the pre-training tasks; then we illustrate the memory-based attentive action decoder and finally the loss functions used to train the agent.

3.1 ViLBERT introduction

In this section, we briefly introduce the input and output arguments of a ViLBERT model [26], as shown in Fig. 3. ViLBERT is a BERT-based model that consists of two input streams, a vision encoding stream and a language encoding stream, followed by a cross-modal alignment Transformer block. The inputs to the ViLBERT model are a sequence of words and a sequence of visual features, and the outputs are the corresponding encoded word sequence features as well as visual sequence features. We use ViLBERT as our base model (basic building block) for the Scene Grounding task and the Object Grounding task. In the Scene Grounding task, a panoramic viewpoint image is discretized into 36 view images, and the inputs are the sequence of words in the instruction and 36 mean-pooled features extracted from the 36 view images by a ResNet-152 CNN pre-trained on ImageNet [24]. In the Object Grounding task, the inputs are the sequence of words in the instruction and the features of all annotated bounding boxes extracted by Mask R-CNN [16] at a target viewpoint.

3.2 Overview of the proposed method

Settings. To formalize the task, we denote a given high-level instruction as $L=\{l_{k}\}_{k=1}^{N_{l}}$, where $N_{l}$ is the number of words in the instruction $L$, and a set of viewpoints as $\nu=\{V_{k}\}_{k=1}^{N_{v}}$, where $N_{v}$ is the number of viewpoints in the environment. At each time step $t$, the agent observes a panoramic view $V_{t}$, a few navigable views $O_{t}$ and a set of annotated bounding boxes $B_{t}$. The panoramic view is discretized into 36 single views by perspective projection, each of which is a $640\times 480$ image with the field of view set to 60 degrees, and is denoted by $V_{t}=\{v_{t,i}\}_{i=1}^{36}$. $O_{t}=\{v_{t,i}\}_{i=1}^{N_{o}}\subseteq V_{t}$, where $N_{o}$ is the maximum number of navigable directions at viewpoint $V_{t}$. Each $v_{t,i}$ is represented as $\boldsymbol{v}_{t,i}=ResNet(v_{t,i})$. Thus, $\boldsymbol{V}_{t}=\{\boldsymbol{v}_{t,i}\}_{i=1}^{36}$. Besides, the set of annotated bounding boxes at viewpoint $V_{t}$ is denoted by $B_{t}=\{b_{t,i}\}_{i=1}^{N_{b}}$, where $N_{b}$ is the number of bounding boxes. Mask R-CNN [16] is used to extract bounding box features $\boldsymbol{B}_{t}=\{\boldsymbol{b}_{t,i}\}_{i=1}^{N_{b}}$, where $\boldsymbol{b}_{t,i}=MRCNN(b_{t,i})$.

Stage 1(a): Scene Grounding Task. We formulate this task as finding the viewpoint that best matches a high-level instruction $L$ within a set of candidate viewpoints $\nu_{s}=\{V_{k}\,|\,V_{k}\in\nu\}\subseteq\nu$. Concretely, we define a mapping function $g_{sg}(\cdot,\cdot)$ that maps $(L,V_{k})$ to a matching score. The formula is defined as follows,

V_{k}^{\star}=\mathop{\arg\max}_{V_{k}\in\nu_{s}}g_{sg}(L,ResNet(V_{k}))   (1)

Stage 1(b): Object Grounding Task. The goal of this task is to identify the best matching object among a set of candidate objects located at a target viewpoint. We denote the target viewpoint by $V_{T}$ and its corresponding set of annotated bounding boxes by $B_{T}$. We define another compatibility matching function $g_{og}(\cdot,\cdot)$ that produces a matching score for each object given a high-level instruction $L$. Thus, the problem is defined as follows,

b_{T,i}^{\star}=\mathop{\arg\max}_{b_{T,i}\in B_{T}}g_{og}(L,MRCNN(b_{T,i}))   (2)
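
Both grounding stages share the same select-by-score structure. The following is a minimal Python sketch, not the authors' implementation; score_fn stands in for either $g_{sg}$ or $g_{og}$, and all names are illustrative:

```python
def ground_best(candidates, features, instruction, score_fn):
    """Sketch of Eqs. (1)-(2): pick the candidate (viewpoint or bounding box)
    whose feature best matches the instruction under a learned score function."""
    scores = [score_fn(instruction, feat) for feat in features]
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best], scores[best]
```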

Stage 2: Memory-augmented action decoder. To mitigate the memory problem discussed in the previous section, a scene memory structure $\boldsymbol{M}_{t}$ is implemented to store the embedded observation and previous action at each time step $t$. The memory is updated by,

\tilde{\boldsymbol{v}}_{t}=softmax(\boldsymbol{V}_{t}(\boldsymbol{W}_{1}\boldsymbol{h}_{t-1}))^{T}\boldsymbol{V}_{t}   (3)
\boldsymbol{s}_{t}=FC([\boldsymbol{a}_{t-1},\tilde{\boldsymbol{v}}_{t}])
\boldsymbol{M}_{t}=Update(\boldsymbol{M}_{t-1},\boldsymbol{s}_{t})

where $\boldsymbol{s}_{t}$ is the current state representation; $\tilde{\boldsymbol{v}}_{t}$ is the attentive visual feature; $\boldsymbol{h}_{t-1}$ and $\boldsymbol{a}_{t-1}$ are the hidden state and action embedding of the previous time step, respectively; $\boldsymbol{W}_{1}\in\mathbb{R}^{2048\times D_{h}}$ is a trainable parameter; and $FC$ stands for a fully connected layer. The $Update$ operation appends $\boldsymbol{s}_{t}$ to $\boldsymbol{M}_{t}$. $\boldsymbol{V}_{t}\in\mathbb{R}^{36\times 2048}$, $\tilde{\boldsymbol{v}}_{t}\in\mathbb{R}^{1\times 2048}$, $\boldsymbol{a}_{t-1}\in\mathbb{R}^{1\times 3200}$, $\boldsymbol{s}_{t}\in\mathbb{R}^{1\times D_{h}}$, $\boldsymbol{h}_{t}\in\mathbb{R}^{D_{h}\times 1}$, $\boldsymbol{M}_{t}\in\mathbb{R}^{t\times D_{h}}$.
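
A minimal PyTorch sketch of the memory update in Eq. (3); the tensor sizes follow the text, while the list-based memory and module layout are our own simplifying assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SceneMemory(nn.Module):
    """Sketch of Eq. (3): attend over the 36 view features with the previous
    hidden state, fuse with the previous action embedding, and append the
    resulting state to an append-only memory."""

    def __init__(self, d_vis=2048, d_act=3200, d_h=512):
        super().__init__()
        self.W1 = nn.Linear(d_h, d_vis, bias=False)  # W_1 in Eq. (3)
        self.fc = nn.Linear(d_act + d_vis, d_h)      # FC([a_{t-1}, v~_t])
        self.memory = []                             # M_t grows by one row per step

    def forward(self, V_t, h_prev, a_prev):
        # V_t: (36, d_vis), h_prev: (d_h,), a_prev: (d_act,)
        attn = F.softmax(V_t @ self.W1(h_prev), dim=0)   # attention over the 36 views
        v_tilde = attn @ V_t                             # attentive visual feature
        s_t = self.fc(torch.cat([a_prev, v_tilde]))      # current state s_t
        self.memory.append(s_t)                          # Update(M_{t-1}, s_t)
        return s_t, torch.stack(self.memory)             # s_t and M_t of shape (t, d_h)
```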

Refer to caption
Figure 3: The pipeline of the Scene Grounding Task. We formulate this task as a 5-way multiple choice problem. Each $(L,ResNet(V_{k}))$ pair is sent to the ViLBERT model separately to generate an alignment score $sc_{k}$. The panorama viewpoint image here denotes the 36 discretized view images at a viewpoint. We mark the beginning of the image sequence with a special token $IMG$ and the language with $CLS$.

3.3 Scene Grounding Task

The goal of this task is to help the agent infer where the target location is. Given a high-level instruction, “Bring me the jeans that are hanging up in the closet to the right”, humans first locate the where information, the keyword closet, by capturing the semantics of the instruction according to the language context and commonsense knowledge, and then form an overview of the appearance of the closet in mind; then, they navigate to the target location by consistently matching the closet appearance in mind with the current scene. In fact, humans have gradually formed intuitions towards the understanding of scenes, instructions and tasks in life. For language instructions in relatively simple life scenes that do not involve complex reasoning, they usually merge the above two processes directly for perception and understanding. We call this process context-driven scene perception. In this section, we propose the Scene Grounding task to imitate such human behavior.

Based on this observation, we believe that a model that can evaluate the alignment between an instruction and a viewpoint is able to localize the target viewpoint. Therefore, to implement this idea, we create a dataset from the REVERIE training set and fine-tune a ViLBERT model on it. Specifically, we adopt a 5-way multiple choice setting. We omit subscripts for simplicity. Given an instruction $L$, we sample 5 viewpoints $\{V_{1}^{+},V_{2}^{-},V_{3}^{-},V_{4}^{-},V_{5}^{-}\}$, out of which only one is aligned to the instruction (in other words, positive). In detail, we choose the ending viewpoint of the ground-truth training path as $V_{1}^{+}$, the second-to-last viewpoint along the ground-truth path as $V_{2}^{-}$, which is a hard negative sample, randomly sample $V_{3}^{-},V_{4}^{-}$ from the remaining viewpoints along the path, and take $V_{5}^{-}$ from another path. Then, we run the ViLBERT model on each $(L,V_{k})$ pair. As shown in Fig. 3, the output tokens $CLS$ and $IMG$ encode the instruction representation $\boldsymbol{h}_{CLS}$ and the viewpoint representation $\boldsymbol{h}_{IMG}$, respectively. We define the matching scores as $\boldsymbol{Sc}$ and train the model with a cross entropy loss $\mathcal{L}_{sr}$.

\boldsymbol{Sc}=\{sc_{1},sc_{2},sc_{3},sc_{4},sc_{5}\}   (4)
sc_{k}=g_{sg}(L,ResNet(V_{k}))=\boldsymbol{W}_{2}(\boldsymbol{h}_{CLS}^{k}\odot\boldsymbol{h}_{IMG}^{k})
\mathcal{L}_{sr}=CELoss(softmax(\boldsymbol{Sc}),\mathbb{I}(V_{1}^{+}))

where $\boldsymbol{W}_{2}\in\mathbb{R}^{1\times 1024}$ is a trainable parameter and $\mathbb{I}(\cdot)$ is the indicator function. $\boldsymbol{h}_{CLS}^{k}\in\mathbb{R}^{1024\times 1}$ and $\boldsymbol{h}_{IMG}^{k}\in\mathbb{R}^{1024\times 1}$ are the encoded language and visual representations of the language and vision encoding streams from our pre-trained ViLBERT model for the $k$th $(L,V_{k})$ pair, respectively.
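
A minimal sketch of the 5-way matching objective in Eq. (4), assuming the ViLBERT encoder has already produced $\boldsymbol{h}_{CLS}$ and $\boldsymbol{h}_{IMG}$ for each candidate pair; the module name and shapes are illustrative rather than the exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SceneGroundingHead(nn.Module):
    """Sketch of Eq. (4): score each (instruction, viewpoint) pair by an
    element-wise product of the CLS and IMG representations, then apply a
    cross entropy loss over the 5 candidates."""

    def __init__(self, d=1024):
        super().__init__()
        self.W2 = nn.Linear(d, 1, bias=False)  # W_2 in Eq. (4)

    def forward(self, h_cls, h_img, positive_index=0):
        # h_cls, h_img: (5, d), one row per candidate viewpoint
        scores = self.W2(h_cls * h_img).squeeze(-1)          # Sc, shape (5,)
        target = torch.tensor([positive_index])
        loss = F.cross_entropy(scores.unsqueeze(0), target)  # L_sr
        return scores, loss
```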

3.4 Object Grounding Task

The aim of this task is to help the agent learn what to attend to. For each ground-truth target viewpoint $V_{T}$, we formulate the task as finding the best bounding box $b_{T,i}^{\star}$ in the bounding box set $B_{T}$ given the $(L,B_{T})$ pair. A straightforward way to implement this idea is to construct a single-image grounding task, where each training sample consists of the instruction $L$ and the subset of bounding boxes in $B_{T}$ that belong to view $v_{T,i}$. However, according to our experiments, this strategy produces only moderate performance, since objects in 3D space can span multiple views in the corresponding projected 2D image space, and the cross-image object relationships within each viewpoint are not well captured by the model. Therefore, we propose a two-stage training strategy, namely single-image grounding followed by viewpoint-based object grounding. In single-image grounding, we fine-tune the ViLBERT model from [27, 26] on the aforementioned single-image grounding dataset, where each training sample is an $(L,B_{v_{T,i}})$ pair (all annotated bounding boxes in $v_{T,i}$ are collected) with $B_{v_{T,i}}\subset B_{T}$; then, we further fine-tune the trained model on a new viewpoint-based object grounding dataset. Concretely, each training sample in the viewpoint-based dataset is an $(L,B_{T})$ pair (all annotated bounding boxes in $V_{T}$ are collected), and the corresponding label is a vector of 0s and 1s, where 1 indicates that the IoU of a bounding box with the target bounding box is higher than 0.5. In inference, we represent an object score as the average of the scores of all bounding boxes that share the same object id at the viewpoint where the agent stops.
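
The viewpoint-based labels described above can be sketched as follows; this is a simplified construction assuming all boxes are expressed in a common coordinate frame, whereas in practice boxes spanning several projected views require extra bookkeeping:

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def viewpoint_labels(candidate_boxes, target_box, thresh=0.5):
    """Label vector over all boxes at a viewpoint: 1 if the IoU with the
    ground-truth target box exceeds 0.5, else 0 (Sec. 3.4)."""
    return [1 if iou(b, target_box) > thresh else 0 for b in candidate_boxes]
```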

3.5 Action Decoder

With the pre-trained grounding models, the action decoder adopts an Encoder-Decoder structure to produce action predictions. Specifically, the Scene Grounding model is accompanied by a BiLSTM network to construct a vision and language grounding encoder $ViLEncoder$, and the Object Grounding model serves as an object-level grounding encoder $ViLPointer$. The inputs to the action decoder are $L$, $\boldsymbol{B}_{t}$ and $\boldsymbol{V}_{t}$, and it outputs the predicted action distribution $\boldsymbol{l}_{t}$.

First. At each time step $t$, to perceive the current scene and instruction, we obtain $\tilde{\boldsymbol{x}}_{t}$ by grounding $L$ with $\boldsymbol{V}_{t}$ through $ViLEncoder$ and then selecting the fused language sequence as output. The formula is defined as follows,

\boldsymbol{X}_{t}=ViLEncoder(L,\boldsymbol{V}_{t})=BiLSTM(ViLBERT(L,\boldsymbol{V}_{t}))   (5)
\tilde{\boldsymbol{x}}_{t}=softmax(\boldsymbol{X}_{t}(\boldsymbol{W}_{3}\boldsymbol{h}_{t-1}))^{T}\boldsymbol{X}_{t}

where $\boldsymbol{W}_{3}\in\mathbb{R}^{1024\times D_{h}}$ is a trainable parameter and $\boldsymbol{X}_{t}$ is the encoded language feature taking the current scene $\boldsymbol{V}_{t}$ into consideration. $\boldsymbol{X}_{t}\in\mathbb{R}^{N_{l}\times 1024}$, $\tilde{\boldsymbol{x}}_{t}\in\mathbb{R}^{1\times 1024}$, $\boldsymbol{h}_{t-1}\in\mathbb{R}^{D_{h}\times 1}$.
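
A sketch of the attention in Eq. (5); the ViLEncoder itself is abstracted away and only the grounding step over the encoded instruction tokens $\boldsymbol{X}_{t}$ is shown (dimensions follow the text, everything else is illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InstructionGrounding(nn.Module):
    """Sketch of Eq. (5): attend over the encoded instruction tokens X_t with
    the previous decoder hidden state to obtain the grounded text feature x~_t."""

    def __init__(self, d_x=1024, d_h=512):
        super().__init__()
        self.W3 = nn.Linear(d_h, d_x, bias=False)  # W_3 in Eq. (5)

    def forward(self, X_t, h_prev):
        # X_t: (N_l, d_x), h_prev: (d_h,)
        attn = F.softmax(X_t @ self.W3(h_prev), dim=0)  # attention over N_l tokens
        return attn @ X_t                               # x~_t, shape (d_x,)
```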

Second. To decide which navigable direction to go next, we perform object-level referring expression comprehension, which helps the agent infer whether a navigable view $v_{t,i}$ contains a possible target object. In particular, the set of bounding boxes in view $v_{t,i}$ is denoted by $\hat{B}_{t,i}=\{b_{t,k}\,|\,b_{t,k}\in B_{t},Inside(b_{t,k},v_{t,i})=1\}$, where the $Inside(\cdot,\cdot)$ function decides whether $b_{t,k}$ is inside view $v_{t,i}$. $ViLPointer$ is ViLBERT pre-trained on the Object Grounding task, and we select the fused bounding box features as the output. Then,

\boldsymbol{F}_{t,i}=ViLPointer(L,MRCNN(\hat{B}_{t,i}))   (6)
\tilde{\boldsymbol{v}}_{t,i}=g_{top\text{-}k}(\boldsymbol{F}_{t,i})

where $\boldsymbol{F}_{t,i}$ is the set of aligned bounding box features at view $v_{t,i}$, and $g_{top\text{-}k}(\cdot)$ selects the top-$k$ aligned bounding boxes and averages the corresponding aligned bounding box features from $\boldsymbol{F}_{t,i}$ to produce the view comprehension $\tilde{\boldsymbol{v}}_{t,i}\in\mathbb{R}^{1\times 1024}$.
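
A small sketch of $g_{top\text{-}k}$ in Eq. (6), assuming each bounding box feature in $\boldsymbol{F}_{t,i}$ comes with a scalar alignment score from $ViLPointer$; the value of $k$ and the way scores are extracted are assumptions for illustration:

```python
import torch

def top_k_view_comprehension(box_feats, box_scores, k=3):
    """Sketch of Eq. (6): pick the k best-aligned boxes in a navigable view and
    average their features to obtain the view comprehension v~_{t,i}."""
    # box_feats: (N_box, 1024), box_scores: (N_box,)
    k = min(k, box_scores.numel())
    top_idx = box_scores.topk(k).indices
    return box_feats[top_idx].mean(dim=0)  # (1024,)
```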

Third. We define the representation of each navigable view as $\boldsymbol{v}_{t,i}^{\prime}$:

\boldsymbol{v}_{t,i}^{\prime}=[\boldsymbol{v}_{t,i},(\cos\theta_{t,i},\sin\theta_{t,i},\cos\phi_{t,i},\sin\phi_{t,i}),\tilde{\boldsymbol{v}}_{t,i}]   (7)

where the agent's current orientation $(\theta_{t,i},\phi_{t,i})$ represents the heading and elevation angles and is tiled 32 times following [12]. $(\cos\theta_{t,i},\sin\theta_{t,i},\cos\phi_{t,i},\sin\phi_{t,i})\in\mathbb{R}^{1\times 128}$ and $\boldsymbol{v}^{\prime}_{t,i}\in\mathbb{R}^{1\times 3200}$. The set of navigable view representations is denoted as $\boldsymbol{O}^{\prime}_{t}=\{\boldsymbol{v}^{\prime}_{t,i}\}_{i=1}^{N_{o}}$. The grounded navigable visual representation $\tilde{\boldsymbol{o}}^{\prime}_{t}$ is computed as follows:

\tilde{\boldsymbol{o}}^{\prime}_{t}=softmax(g(\boldsymbol{O}^{\prime}_{t})(\boldsymbol{W}_{4}\boldsymbol{h}_{t-1}))^{T}g(\boldsymbol{O}^{\prime}_{t})   (8)

where $\boldsymbol{W}_{4}\in\mathbb{R}^{1024\times D_{h}}$ is a trainable parameter and $g(\cdot)$ is a stack of fully connected layers with ReLU nonlinearities. $\tilde{\boldsymbol{o}}^{\prime}_{t}\in\mathbb{R}^{1\times 1024}$, $\boldsymbol{O}^{\prime}_{t}\in\mathbb{R}^{N_{o}\times 3200}$.
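
Eqs. (7) and (8) can be sketched together as below; $g(\cdot)$ is shown as a single FC + ReLU layer, and the layer count and hidden sizes are assumptions rather than the reported configuration:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class NavigableViewGrounding(nn.Module):
    """Sketch of Eqs. (7)-(8): build each navigable view representation from
    its visual feature, tiled orientation encoding and object-level
    comprehension, then attend over the views with the previous hidden state."""

    def __init__(self, d_view=3200, d_g=1024, d_h=512):
        super().__init__()
        self.g = nn.Sequential(nn.Linear(d_view, d_g), nn.ReLU())  # g(.) in Eq. (8)
        self.W4 = nn.Linear(d_h, d_g, bias=False)                  # W_4 in Eq. (8)

    @staticmethod
    def view_repr(v, theta, phi, v_obj, tile=32):
        # Eq. (7): [v_{t,i}; tiled (cos/sin of heading and elevation); v~_{t,i}]
        orient = torch.tensor([math.cos(theta), math.sin(theta),
                               math.cos(phi), math.sin(phi)]).repeat(tile)
        return torch.cat([v, orient, v_obj])  # 2048 + 128 + 1024 = 3200 dims

    def forward(self, O_t, h_prev):
        # O_t: (N_o, d_view), h_prev: (d_h,)
        g_O = self.g(O_t)                               # (N_o, d_g)
        attn = F.softmax(g_O @ self.W4(h_prev), dim=0)  # attention over navigable views
        return attn @ g_O                               # o~'_t, shape (d_g,)
```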

Fourth. The new context hidden state $\boldsymbol{h}_{t}$ is updated by an LSTM layer taking as input the grounded text $\tilde{\boldsymbol{x}}_{t}$, the navigable view features $\tilde{\boldsymbol{o}}^{\prime}_{t}$, and the current state representation feature $\boldsymbol{s}_{t}^{a}$.

(\boldsymbol{h}_{t},\boldsymbol{c}_{t})=LSTM([\tilde{\boldsymbol{x}}_{t},\tilde{\boldsymbol{o}}^{\prime}_{t},\boldsymbol{s}_{t}^{a}],(\boldsymbol{h}_{t-1},\boldsymbol{c}_{t-1}))   (9)

where $\boldsymbol{s}_{t}^{a}$ is the memory-augmented current state representation and is defined as,

\boldsymbol{M}_{t}^{a}=[Transformer(\boldsymbol{M}_{t},\boldsymbol{M}_{t})]_{\times N_{mem}}   (10)
\boldsymbol{s}_{t}^{a}=[Transformer(\boldsymbol{s}_{t},\boldsymbol{M}_{t}^{a})]_{\times N_{state}}

where $N_{mem}$ and $N_{state}$ are the numbers of memory transformer blocks and state transformer blocks, respectively. $\boldsymbol{s}_{t}^{a}\in\mathbb{R}^{1\times D_{h}}$, $\boldsymbol{M}_{t}^{a}\in\mathbb{R}^{t\times D_{h}}$. $Transformer$ is the standard Transformer block from [42].
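
Eq. (10) can be sketched with standard PyTorch Transformer layers, where the memory blocks perform self-attention over $\boldsymbol{M}_{t}$ and the state blocks let $\boldsymbol{s}_{t}$ attend to the refined memory; the number of heads, feed-forward sizes and the use of encoder/decoder layers here are our assumptions:

```python
import torch
import torch.nn as nn

class MemoryAugmentedState(nn.Module):
    """Sketch of Eq. (10): N_mem self-attention blocks over the memory M_t,
    followed by N_state blocks in which the current state s_t attends to the
    refined memory M_t^a."""

    def __init__(self, d_h=512, n_mem=3, n_state=3, n_heads=8):
        super().__init__()
        self.mem_blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_h, n_heads, batch_first=True)
            for _ in range(n_mem))
        self.state_blocks = nn.ModuleList(
            nn.TransformerDecoderLayer(d_h, n_heads, batch_first=True)
            for _ in range(n_state))

    def forward(self, s_t, M_t):
        # s_t: (1, d_h), M_t: (t, d_h); a batch dimension of 1 is added
        M = M_t.unsqueeze(0)                 # (1, t, d_h)
        for blk in self.mem_blocks:          # M_t^a = [Transformer(M_t, M_t)] x N_mem
            M = blk(M)
        s = s_t.unsqueeze(0)                 # (1, 1, d_h)
        for blk in self.state_blocks:        # s_t^a = [Transformer(s_t, M_t^a)] x N_state
            s = blk(s, M)
        return s.squeeze(0)                  # s_t^a, shape (1, d_h)
```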

Finally. The action logit $\boldsymbol{l}_{t}$ is computed in an attentive manner.

l_{t,i}=g(\boldsymbol{O}^{\prime}_{t,i})(\boldsymbol{W}_{5}[\boldsymbol{h}_{t},\tilde{\boldsymbol{x}}_{t}])   (11)

where $\boldsymbol{W}_{5}\in\mathbb{R}^{1024\times(1024+D_{h})}$ is a trainable parameter and $\boldsymbol{l}_{t}\in\mathbb{R}^{N_{o}\times 1}$. In the training stage, $a_{t}=Categorical(\boldsymbol{l}_{t})$ is sampled from a categorical policy, and in the inference stage, it is selected by $a_{t}=\arg\max(\boldsymbol{l}_{t})$. The action embedding is then selected as $\boldsymbol{a}_{t}=\boldsymbol{O}^{\prime}_{t}[a_{t}]$.
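
A sketch of the action logit and selection in Eq. (11): sampling from a categorical policy during training and taking the argmax at inference (dimensions follow the text; the module layout is illustrative):

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class ActionSelector(nn.Module):
    """Sketch of Eq. (11): score every navigable view against the fused
    context [h_t, x~_t], then sample (training) or take the argmax (inference)."""

    def __init__(self, d_view=3200, d_g=1024, d_x=1024, d_h=512):
        super().__init__()
        self.g = nn.Sequential(nn.Linear(d_view, d_g), nn.ReLU())  # g(.) in Eq. (11)
        self.W5 = nn.Linear(d_h + d_x, d_g, bias=False)            # W_5 in Eq. (11)

    def forward(self, O_t, h_t, x_tilde, training=True):
        # O_t: (N_o, d_view), h_t: (d_h,), x_tilde: (d_x,)
        logits = self.g(O_t) @ self.W5(torch.cat([h_t, x_tilde]))  # l_t, shape (N_o,)
        if training:
            a_t = Categorical(logits=logits).sample()  # categorical policy
        else:
            a_t = logits.argmax()
        return logits, a_t, O_t[a_t]  # action embedding a_t = O'_t[a_t]
```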

3.6 Inference

We propose to use a combined logit $\sum_{\tau=0}^{t}\boldsymbol{l}_{\tau}+g_{og}^{\tau}+g_{sg}^{\tau}$, which sums the action logits, object grounding logits and scene grounding logits, to perform navigation, where $g_{og}^{\tau}$ and $g_{sg}^{\tau}$ denote the object grounding score and the scene grounding score at time step $\tau$, respectively. Experimental results indicate that this strategy shortens the search trajectories while maintaining a good success rate. The final output bounding box is obtained by running $ViLPointer$ at the viewpoint where the agent predicts to stop.

3.7 Loss Functions

To train the agent, we use a mixture of Imitation Learning (IL) and Reinforcement Learning (RL). Specifically, in IL, at each time step we let the agent imitate the teacher action using a cross entropy loss $\mathcal{L}_{ce}$ and a mean squared error loss $\mathcal{L}_{pm}$ for the progress monitor [28]. In RL, we follow the idea of [40] and let the agent learn from rewards: if the agent stops within 3 meters of the target viewpoint, a positive reward $+3$ is assigned at the final step; otherwise a negative reward $-3$ is given.

\mathcal{L}_{final}=\alpha\mathcal{L}_{ce}+\beta\mathcal{L}_{pm}+\gamma\mathcal{L}_{RL}   (12)
\mathcal{L}_{ce}=-\sum_{t=1}^{T}y_{t}^{\star}\log(l_{t,\star})
\mathcal{L}_{pm}=\sum_{t=1}^{T}(y_{t}^{pm}-p_{t}^{pm})^{2}

where $y_{t}^{\star}$ is the teacher action at step $t$; $y_{t}^{pm}\in[0,1]$ is the normalized shortest distance from the current viewpoint to the target viewpoint; $p_{t}^{pm}$ is the predicted progress; and $\alpha$, $\beta$ and $\gamma$ are all set to 1.
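
A sketch of the mixed objective in Eq. (12); the RL term is abstracted as a plain REINFORCE-style surrogate without reward shaping or baselines, and all variable names are illustrative rather than the authors' exact implementation:

```python
import torch
import torch.nn.functional as F

def mixed_loss(step_logits, teacher_actions, progress_preds, progress_targets,
               log_probs, rewards, alpha=1.0, beta=1.0, gamma=1.0):
    """Sketch of Eq. (12): imitation (cross entropy) loss, progress-monitor
    regression loss, and a simple policy-gradient RL term."""
    # L_ce: cross entropy between predicted logits and teacher actions
    l_ce = F.cross_entropy(step_logits, teacher_actions)
    # L_pm: squared error between predicted and target progress over all steps
    l_pm = ((progress_preds - progress_targets) ** 2).sum()
    # L_RL: REINFORCE surrogate, -log pi(a_t) * reward
    l_rl = -(log_probs * rewards).sum()
    return alpha * l_ce + beta * l_pm + gamma * l_rl
```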

4 Experiments

In the REVERIE dataset, the training set contains 59 scenes and 10466 instructions over 2353 objects; the val seen split consists of 53 scenes and 1371 instructions over 428 objects, and the val unseen split includes 10 scenes and 3573 instructions over 525 objects. The test set contains 16 scenes and 6292 instructions over 834 objects. In this section, we conduct an extensive evaluation and analysis of the effectiveness of our proposed components.

Table 1: Ablation study experiments performed to verify the effectiveness of the proposed method. In each ablation study block, the best performing result is marked in bold.
Experiments ID Methods Val Seen Val Unseen
Methods columns: Encoder ($L_{enc}$, $Bert_{enc}$, $ViLRaw_{enc}$, $ViL_{enc}$), Pointer ($MN_{ptr}$, $ViL_{ptr}$), Policy ($C_{pol}$, $MA_{pol}$); a \surd marks the component used. For each of Val Seen and Val Unseen, the metric columns are Succ.\uparrow, OSucc.\uparrow, SPL\uparrow, Length\downarrow, RGS\uparrow, RG SPL\uparrow.
Component Effectiveness 1 \surd \surd \surd 50.53 55.17 45.50 16.35 31.97 29.66 14.40 28.20 7.19 45.28 7.84 4.67
2 \surd \surd \surd 54.18 58.68 48.99 12.46 33.87 21.23 18.66 29.51 10.44 32.95 11.13 6.32
3 \surd \surd \surd 33.73 39.14 30.72 14.56 23.82 21.94 15.22 31.64 8.44 42.62 8.89 4.84
4 \surd \surd \surd 39.00 43.85 35.00 13.71 28.95 25.98 13.80 31.33 8.21 37.31 9.17 5.54
5 \surd \surd \surd 37.32 43.08 31.71 18.29 24.88 21.70 19.06 44.39 7.10 79.88 11.08 4.17
6 \surd \surd \surd 56.36 60.93 52.24 13.21 36.33 33.92 21.61 31.98 12.21 36.05 13.21 7.31
7 \surd \surd \surd 54.25 56.08 50.49 13.56 39.56 37.16 26.98 37.86 13.70 42.50 17.32 8.71
8 \surd \surd \surd 59.52 64.23 55.30 14.00 43.57 40.42 28.17 40.41 14.77 43.12 19.60 10.27
Memory Blocks ($N_{mem}$, $N_{state}$) 9 (1, 1) 55.24 58.61 52.29 12.42 40.90 38.76 28.97 39.56 13.28 44.10 20.51 9.19
10 (3, 3) 61.91 65.85 57.08 13.61 45.96 42.65 31.53 44.67 16.28 41.53 22.41 11.56
11 (5, 5) 60.01 63.38 54.99 17.44 44.69 41.10 25.84 38.20 13.09 44.00 18.23 9.19
12 (7, 7) 57.27 62.26 52.78 13.96 42.66 39.38 23.66 35.61 11.67 45.73 16.79 8.43
13 (9, 9) 57.06 60.15 53.35 14.16 42.38 39.67 28.15 39.45 14.92 41.53 19.54 10.13
Logit Fusion 14 $\boldsymbol{l}_{t}$ 60.92 65.78 56.14 15.28 45.61 42.19 32.35 49.08 14.74 60.89 22.35 10.54
15 $\boldsymbol{l}_{t}+g_{sg}$ 61.49 65.78 56.72 13.67 45.47 42.31 31.20 47.80 15.90 45.82 21.68 11.08
16 $\boldsymbol{l}_{t}+g_{og}$ 61.14 65.77 55.21 16.82 44.48 40.04 32.12 46.54 15.73 52.14 21.98 11.02
17 $\boldsymbol{l}_{t}+g_{sg}+g_{og}$ 61.91 65.85 57.08 13.61 45.96 42.65 31.53 44.67 16.28 41.53 22.41 11.56
Table 2: Comparison with state-of-the-art methods on the REVERIE task. The best performing result is marked in bold.
Methods Val Seen Val Unseen Test (Unseen)
For each of Val Seen, Val Unseen and Test (Unseen), the metric columns are Succ.\uparrow, OSucc.\uparrow, SPL\uparrow, Length\downarrow, RGS\uparrow, RG SPL\uparrow.
RCM [43] + MattNet 23.33 29.44 21.82 10.70 16.23 15.36 9.29 14.23 6.97 11.98 4.89 3.89 7.84 11.68 6.67 10.60 3.67 3.14
SelfMonitor [28] + MattNet 41.25 43.29 39.61 7.54 30.07 28.98 8.15 11.28 6.44 9.07 4.54 3.61 5.80 8.39 4.53 9.23 3.10 2.39
FAST-short [22] + MattNet 45.12 49.68 40.18 13.22 31.41 28.11 10.08 20.48 6.17 29.70 6.24 3.97 14.18 23.36 8.74 30.69 7.07 4.52
REVERIE [36] 50.53 55.17 45.50 16.35 31.97 29.66 14.40 28.20 7.19 45.28 7.84 4.67 19.88 30.63 11.61 39.05 11.28 6.08
Human - - - - - - - - - - - - 81.51 86.83 53.66 21.18 77.84 51.44
Ours 61.91 65.85 57.08 13.61 45.96 42.65 31.53 44.67 16.28 41.53 22.41 11.56 30.8 44.56 14.85 48.61 19.02 9.20

4.1 Evaluation Metrics

Following  [36], we evaluate the performance of the model based on REVERIE Success Rate (RGS) and REVERIE Success Rate weighted by Path Length (RG SPL). We also report the performance of Navigation Success Rate, Navigation Oracle Success Rate, Navigation Success Rate weighted by Path Length (SPL), and Navigation Length. Please refer to the supplementary document for more details.

4.2 Ablation Study

In this section, we aim to answer the following questions: (a) Does the performance gain mainly come from the BERT-based structure? (b) How effective is each of the proposed components? (c) Does the number of memory blocks matter? (d) Why do we need logit fusion? For simplicity, we define the following experiment settings: (1) our proposed $ViLEncoder$ is $ViL_{enc}$; (2) $ViLRaw_{enc}$ is $ViLEncoder$ not pre-trained on the Scene Grounding task but pre-trained on the Conceptual Captions dataset [35] as well as the 12 tasks specified in [27]; (3) $BERT_{enc}$ is a BERT language encoder pre-trained on the BookCorpus [53] and English Wikipedia datasets; (4) our proposed $ViLPointer$ is $ViL_{ptr}$; (5) the previous SOTA MattNet pointer is $MN_{ptr}$; (6) our action policy is $MA_{pol}$; (7) the previous action policy is $C_{pol}$; (8) the previous simple language encoder is $L_{enc}$, composed of a trainable embedding layer with a bi-directional LSTM layer.

Performance Gain. To answer question (a), we perform experiments 1, 2, 3 and 6, as shown in Table 1. All agents are trained under $\mathcal{L}_{ce}$ and $\mathcal{L}_{pm}$ with $\alpha$ and $\beta$ both set to 0.5. It is clear that the agent's overall performance is incrementally improved by changing the language encoder from the simple $L_{enc}$ to our proposed $ViL_{enc}$, which supports our analysis that the previous language encoder does not capture the semantics of high-level instructions well. The results of experiments 3 and 6 clearly suggest that the BERT-based structure is not the root cause of our performance gain, and our proposed Scene Grounding task significantly increases the RG SPL metric to 33.9% on Val Seen and 7.31% on Val Unseen, even higher than the strong baseline in experiment 2.

Component Effectiveness. To answer question (b), based on the statistics in Table 1, we train six models in experiments 3 to 8 and ablate the proposed components one by one to demonstrate their effectiveness. For a fair comparison, we follow the settings of [36]. All agents are trained under $\mathcal{L}_{ce}$ and $\mathcal{L}_{pm}$ with $\alpha$ and $\beta$ both set to 0.5. We start from the baseline experiment 3 and replace each component with our proposed ones. Specifically, comparing experiments 3 and 6, the proposed $ViLEncoder$ improves RG SPL (and SPL) by a large margin, 11.98% (and 21.52%) higher on Val Seen and 2.47% (and 3.77%) higher on Val Unseen than the baseline, respectively, which proves that the Scene Grounding task is effective; comparing experiments 3 and 4, our pointer $ViLPointer$ outperforms the MattNet counterpart by shortening the search trajectory while maintaining a high RG SPL (and SPL), which demonstrates the effectiveness of the Object Grounding task; comparing experiments 3 and 5, the results show that the overall search trajectory of our action policy is longer than that of the baseline while it achieves higher RGS and Navigation Success Rate, which indicates that the memory structure in our policy guides the agent to the correct target location at the cost of longer trajectories; comparing experiments 7 and 8, we demonstrate that by integrating all our proposed methods, our agent improves over the previous SOTA in terms of RG SPL by 10.76% on Val Seen and 5.6% on Val Unseen.

Memory Blocks. To answer question (c), we train five models with different $N_{mem}$ and $N_{state}$ values. In these experiments, we train the agents with $\mathcal{L}_{ce}$, $\mathcal{L}_{pm}$ and $\mathcal{L}_{RL}$, and $\alpha$, $\beta$ and $\gamma$ set to 1.0. In general, according to experiments 9 to 13 in Table 1, all pairs of $(N_{mem},N_{state})$ exhibit superior performance compared to the previous SOTA method in experiment 1 and the strong BERT baseline model in experiment 2. Moreover, the best performance among these five models is achieved by setting $(N_{mem},N_{state})$ to $(3,3)$, which suggests that using small values of $(N_{mem},N_{state})$ limits the agent's memorization ability, while using larger values enables the agent to achieve good performance on Val Unseen while maintaining good performance on Val Seen.

Table 3: Pointer Task: REVERIE Success Rate at the ground truth target viewpoint; Encoder Task: given ground truth path, the success rate of identifying the target viewpoint among a set of candidate viewpoints along the path.
Tasks Methods Val Seen Val Unseen
Pointer MattNet [51] 68.45 56.63
CM-Erase [25] 65.21 54.02
ViLPointer-image-based 65.72 55.53
ViLPointer-vp-based 73.26 67.45
Encoder ViLEncoder 85.67 66.43

Logit Fusion. To answer question (d), we report two accuracies to verify the effectiveness of $g_{og}$ and $g_{sg}$. In the Encoder Task of Table 3, given the ground-truth path, our proposed ViLBERT-based encoder achieves competitive performance on both Val Seen and Val Unseen, demonstrating the strong ability of $g_{sg}$ to identify a target viewpoint. In the Pointer Task of Table 3, the performance of the viewpoint-based $ViLPointer$ is significantly higher than previous image-based pointers because it is able to capture cross-image object relationships, suggesting that $g_{og}$ has the ability to find the target location if the target object exists. According to experiments 14 to 17, where the agents are trained with $\mathcal{L}_{ce}$, $\mathcal{L}_{pm}$ and $\mathcal{L}_{RL}$ and $\alpha$, $\beta$ and $\gamma$ set to 1.0, summing $\boldsymbol{l}_{\tau}$, $g_{og}^{\tau}$ and $g_{sg}^{\tau}$ shortens the search trajectory while maintaining a high RGS (Navigation Success Rate) and RG SPL (SPL). The motivation behind the summing strategy is to use a model ensemble to reduce bias when searching for target locations, considering that the agent has no prior knowledge of the surrounding environment and the guidance from the high-level instruction is weak.

Refer to caption
Figure 4: Percentage of successful Navigation and RGS cases under different lengths of ground-truth paths on the Val Seen and Val Unseen splits for the previous state-of-the-art method, the BERT baseline in experiment 2, and our method.

4.3 Compared to previous state-of-the-art results

We first show what kinds of cases our method improves compared to the previous SOTA and our BERT-based strong baseline in experiment 2. Specifically, we divide the shortest distances of all ground-truth paths into three groups, namely short paths (5 to 9 meters, with 462 sample paths on Val Seen and 1400 sample paths on Val Unseen), middle paths (9 to 14 meters, with 703 sample paths on Val Seen and 1869 sample paths on Val Unseen), and long paths (14 to 18 meters, with 247 sample paths on Val Seen and 250 sample paths on Val Unseen). Then, we count the cases in which the agent successfully navigates to the target location and the cases in which the agent successfully navigates and localizes the target object for the three groups. In Fig. 4, we report the corresponding percentages of successful cases. It is clear that our proposed method improves all kinds of sample paths by a clear margin.

Then, we compare our final model with previous SOTA models in Table 2. As clearly shown in Table 2, our model outperforms all previous models by a large margin. Specifically, in terms of SPL, our agent improves over the previous SOTA by 11.58% on Val Seen, 9.09% on Val Unseen and 3.24% on Test, respectively; for RG SPL, our agent improves over the previous SOTA by 12.99% on Val Seen, 6.89% on Val Unseen and 3.12% on Test. The overall improvements indicate that our proposed scene-intuitive agent not only navigates better but also localizes target objects more accurately.

5 Conclusion

In this paper, we present a scene-intuitive agent capable of understanding high-level instructions for the REVERIE task. Different from previous works, we propose two pre-training tasks, the Scene Grounding task and the Object Grounding task, to help the agent learn where to navigate and what object to localize simultaneously. Moreover, the agent is trained with a memory-augmented action decoder that fuses the grounded textual and visual representations with the memory-augmented current state representation to generate action sequences. We extensively verify the effectiveness of our proposed components, and experimental results demonstrate that our method outperforms previous methods significantly. Nevertheless, how to bridge the performance gap between seen and unseen environments and how to shorten the navigation length efficiently remain open problems for further investigation.

References

  • [1] Peter Anderson, Ayush Shrivastava, Devi Parikh, Dhruv Batra, and Stefan Lee. Chasing ghosts: Instruction following as bayesian state tracking. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
  • [2] Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton van den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [3] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In International Conference on Computer Vision (ICCV), 2015.
  • [4] Qingxing Cao, Xiaodan Liang, Bailing Li, Guanbin Li, and Liang Lin. Visual question reasoning on general dependency tree. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7249–7257, 2018.
  • [5] Howard Chen, Alane Suhr, Dipendra Misra, Noah Snavely, and Yoav Artzi. Touchdown: Natural language navigation and spatial reasoning in visual street environments. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • [6] Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. Uniter: Universal image-text representation learning. In The European Conference on Computer Vision (ECCV), 2020.
  • [7] Elizabeth R. Chrastil. Neural evidence supports a novel framework for spatial navigation. Psychonomic Bulletin & Review, 20:208–227, 2013.
  • [8] A. Coutrot, R. Silva, E. Manley, Will de Cothi, and H. Spiers. Global determinants of navigation ability. Current Biology, 28:2861–2866.e4, 2018.
  • [9] Abhishek Das, Samyak Datta, Georgia Gkioxari, Stefan Lee, D. Parikh, and Dhruv Batra. Embodied question answering. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2018.
  • [10] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • [11] Kuan Fang, Alexander Toshev, Li Fei-Fei, and Silvio Savarese. Scene memory transformer for embodied agents in long-horizon tasks. 2019.
  • [12] Daniel Fried, Ronghang Hu, Volkan Cirik, Anna Rohrbach, Jacob Andreas, Louis-Philippe Morency, Taylor Berg-Kirkpatrick, Kate Saenko, Dan Klein, and Trevor Darrell. Speaker-follower models for vision-and-language navigation. In Neural Information Processing Systems (NeurIPS), 2018.
  • [13] S. Gopal, R. Klatzky, and T. Smith. Navigator: A psychologically based model of environmental learning through navigation. Journal of Environmental Psychology, 9:309–331, 1989.
  • [14] Saurabh Gupta, Varun Tolani, J. Davidson, S. Levine, R. Sukthankar, and J. Malik. Cognitive mapping and planning for visual navigation. International Journal of Computer Vision, 128:1311–1330, 2019.
  • [15] Weituo Hao, Chunyuan Li, Xiujun Li, Lawrence Carin, and Jianfeng Gao. Towards learning a generic agent for vision-and-language navigation via pre-training. Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  • [16] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 2980–2988. IEEE, 2017.
  • [17] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • [18] Max Jaderberg, V. Mnih, W. Czarnecki, T. Schaul, Joel Z. Leibo, D. Silver, and K. Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. In ICLR, 2017.
  • [19] Simon Jetzschke, M. Ernst, J. Fröhlich, and N. Boeddeker. Finding home: Landmark ambiguity in human navigation. Frontiers in Behavioral Neuroscience, 11, 2017.
  • [20] Jingyu Zhao, Feiqing Huang, Jia Lv, Yanjie Duan, Zhen Qin, Guodong Li, and Guangjian Tian. Do RNN and LSTM have long memory? In International Conference on Machine Learning, 2020.
  • [21] Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara L. Berg. Referit game: Referring to objects in photographs of natural scenes. In EMNLP, 2014.
  • [22] Liyiming Ke, Xiujun Li, Yonatan Bisk, Ari Holtzman, Zhe Gan, Jingjing Liu, Jianfeng Gao, Yejin Choi, and Siddhartha Srinivasa. Tactical rewind: Self-correction via backtracking in vision-and-language navigation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • [23] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.
  • [24] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • [25] Xihui Liu, Zihao Wang, Jing Shao, Xiaogang Wang, and Hongsheng Li. Improving referring expression grounding with cross-modal attention-guided erasing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1950–1959, 2019.
  • [26] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems, pages 13–23, 2019.
  • [27] Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, and Stefan Lee. 12-in-1: Multi-task vision and language representation learning. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  • [28] Chih-Yao Ma, Jiasen Lu, Zuxuan Wu, Ghassan AlRegib, Zsolt Kira, Richard Socher, and Caiming Xiong. Self-monitoring navigation agent via auxiliary progress estimation. In Proceedings of the International Conference on Learning Representations (ICLR), 2019.
  • [29] Arjun Majumdar, Ayush Shrivastava, Stefan Lee, Peter Anderson, Devi Parikh, and Dhruv Batra. Improving vision-and-language navigation with image-text pairs from the web. 2020.
  • [30] P. Mirowski, Razvan Pascanu, F. Viola, Hubert Soyer, Andy Ballard, Andrea Banino, Misha Denil, R. Goroshin, L. Sifre, K. Kavukcuoglu, D. Kumaran, and Raia Hadsell. Learning to navigate in complex environments. In ICLR, 2017.
  • [31] Piotr Mirowski, Razvan Pascanu, Fabio Viola, Hubert Soyer, Andrew J. Ballard, Andrea Banino, Misha Denil, Ross Goroshin, Laurent Sifre, Koray Kavukcuoglu, Dharshan Kumaran, and Raia Hadsell. Learning to navigate in complex environments. In Proceedings of the International Conference on Learning Representations (ICLR), 2017.
  • [32] Junhyuk Oh, Valliappa Chockalingam, Satinder Singh, and H. Lee. Control of memory, active perception, and action in minecraft. In ICML, 2016.
  • [33] Emilio Parisotto and R. Salakhutdinov. Neural map: Structured memory for deep reinforcement learning. In ICLR, 2018.
  • [34] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. 2019.
  • [35] Sharma Piyush, Ding Nan, Goodman Sebastian, and Soricut Radu. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 2018.
  • [36] Yuankai Qi, Qi Wu, Peter Anderson, Xin Wang, William Yang Wang, Chunhua Shen, and Anton van den Hengel. Reverie: Remote embodied visual referring expression in real indoor environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  • [37] Nikolay Savinov, A. Dosovitskiy, and V. Koltun. Semi-parametric topological memory for navigation. In ICLR, 2018.
  • [38] Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. Vl-bert: Pre-training of generic visual-linguistic representations. In International Conference on Learning Representations, 2020.
  • [39] Hao Tan and Mohit Bansal. Lxmert: Learning cross-modality encoder representations from transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, 2019.
  • [40] Hao Tan, Licheng Yu, and Mohit Bansal. Learning to navigate unseen environments:back translation with environmental dropout. In Proceedings of The North American Chapter of the Association for Computational Linguistics (NAACL), 2019.
  • [41] I. V. D. van der Ham, M. H. G. Claessen, A. Evers, and Milan N. A. van der Kuil. Large-scale assessment of human navigation ability across the lifespan. Scientific Reports, 10, 2020.
  • [42] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008, 2017.
  • [43] Xin Wang, Qiuyuan Huang, Asli Celikyilmaz, Jianfeng Gao, Dinghan Shen, Yuan-Fang Wang, William Yang Wang, and Lei Zhang. Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6629–6638, 2019.
  • [44] Xin Wang, Wenhan Xiong, Hongmin Wang, and William Yang Wang. Look before you leap: Bridging model-free and model-based reinforcement learning for planned-ahead vision-and-language navigation. In The European Conference on Computer Vision (ECCV), September 2018.
  • [45] J. Wiener, Simon J. Büchner, and C. Hölscher. Taxonomy of human wayfinding tasks: A knowledge-based approach. Spatial Cognition & Computation, 9:152 – 165, 2009.
  • [46] Daan Wierstra, A. Förster, Jan Peters, and J. Schmidhuber. Solving deep memory pomdps with recurrent policy gradients. In ICANN, 2007.
  • [47] T. Wolbers and M. Hegarty. What determines our navigational abilities? Trends in Cognitive Sciences, 14:138–146, 2010.
  • [48] Yi Wu, Yuxin Wu, Georgia Gkioxari, and Yuandong Tian. Building generalizable agents with a realistic and rich 3d environment. arXiv preprint arXiv:1801.02209, 2018.
  • [49] Sibei Yang, Guanbin Li, and Yizhou Yu. Graph-structured referring expression reasoning in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9952–9961, 2020.
  • [50] Sibei Yang, Guanbin Li, and Yizhou Yu. Relationship-embedded representation learning for grounding referring expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
  • [51] Licheng Yu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, and Tamara L Berg. Mattnet: Modular attention network for referring expression comprehension. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [52] Fengda Zhu, Yi Zhu, Xiaojun Chang, and Xiaodan Liang. Vision-language navigation with self-supervised auxiliary reasoning tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
  • [53] Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), December 2015.

6 More Related Work

Behavioral Research on Human Navigation. Behavioral research on human navigation has a long history and remains an active area of study [41, 19, 47, 45, 7, 8, 13]. It is still not well understood how the human brain carries out the learning processes that allow us to navigate familiar or unfamiliar environments. Nevertheless, according to [41, 47, 45, 7], humans employ a range of cognitive processes when navigating: we identify representative landmark cues, memorize the goal location, and work out the shortest route to that goal. A substantial body of research supports these dissociable cognitive aspects. The human intuitions for remote embodied navigation that we refer to in this paper are a set of commonsense rules and heuristics drawn from observations of human life experience, sharing a similar motivation with [14]. Our work also demonstrates that drawing on such observations is a promising direction for high-level VLN.

7 Implementation Details

In this section, we present the implementation details of the pre-training stage and the action decoding stage. For the pre-training stage, we first describe the sampled datasets for the Scene Grounding task and the Object Grounding task, and then the ViLBERT model used in pre-training. We then detail the action decoder architecture and training parameters.

7.1 Pre-training Stage Details

Scene Grounding Task. The Scene Grounding training dataset consists of 10,312 samples, each containing an instruction and four viewpoints, of which exactly one is positive. The sampling strategy is described in the main paper. We evaluate the effectiveness of this task by asking the trained model to identify the true target viewpoint given the ground-truth path, and report accuracy on the REVERIE Val Seen (1,423 paths) and Val Unseen (3,521 paths) splits.
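For concreteness, the following is a minimal sketch of how the Scene Grounding objective can be posed as a four-way classification over candidate viewpoints. The score_pair callable, which stands in for the ViLBERT alignment score, and the exact feature shapes are illustrative assumptions rather than our released implementation.

```python
import torch
import torch.nn.functional as F

def scene_grounding_loss(score_pair, instr_tokens, candidate_feats, positive_idx):
    """Four-way classification over candidate viewpoints (sketch).

    score_pair: callable returning a scalar alignment score for one
                (instruction, viewpoint) pair -- a stand-in for ViLBERT.
    candidate_feats: list of four viewpoint feature tensors, one positive.
    positive_idx: index of the ground-truth target viewpoint.
    """
    # Score each (instruction, viewpoint) pair independently.
    logits = torch.stack([score_pair(instr_tokens, v) for v in candidate_feats])
    target = torch.tensor([positive_idx])
    # Cross-entropy over the four candidates pushes the positive viewpoint
    # to score higher than the sampled negatives.
    return F.cross_entropy(logits.unsqueeze(0), target)
```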

Object Grounding Task. The image-based object grounding dataset contains 67,432 training samples, and the viewpoint-based object grounding dataset contains 4,356 training samples. The sampling strategy is presented in the main paper. Following [36], we evaluate this model at the ground-truth target viewpoint and report the object grounding accuracy.
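As a rough illustration, object grounding accuracy at the ground-truth viewpoint can be computed as below; score_regions is a hypothetical stand-in for the ViLBERT-based pointer that scores every candidate object region against the instruction, and the sample format is assumed for the sketch.

```python
import torch

def object_grounding_accuracy(score_regions, samples):
    """samples: iterable of (instr_tokens, region_feats, gt_region_idx)
    evaluated at the ground-truth target viewpoint (sketch)."""
    correct, total = 0, 0
    for instr_tokens, region_feats, gt_idx in samples:
        # Score all candidate object regions against the instruction.
        scores = score_regions(instr_tokens, region_feats)  # shape: (num_regions,)
        if int(torch.argmax(scores)) == gt_idx:
            correct += 1
        total += 1
    return correct / max(total, 1)
```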

Model Details. The ViLBERT model used in the Scene Grounding and Object Grounding tasks consists of a language stream, a vision stream, and a block of cross-modal alignment layers. The language stream follows the BERT_{BASE} architecture [10], with 12 transformer layers, each with a hidden state size of 768 and 12 attention heads. The vision stream and the cross-modal alignment block each use 6 transformer layers with a hidden state size of 1024 and 8 attention heads. Following [26, 27], the language stream is initialized with BERT weights pre-trained on the BookCorpus [53] and English Wikipedia datasets. The ViLBERT model is then pre-trained on the Conceptual Captions dataset [35] as well as the 12 tasks specified in [27], and finally fine-tuned on our Scene Grounding task and Object Grounding task respectively. For the Scene Grounding task, the model is trained with the Adam optimizer with a learning rate of 4e-5 and a batch size of 32 for 10 epochs. For the Object Grounding task, the model is first trained on the image-based object grounding dataset with the Adam optimizer with a learning rate of 4e-5 and a batch size of 128 for 20 epochs, and then further fine-tuned on the viewpoint-based object grounding dataset with a learning rate of 1e-5 and a batch size of 128 for 10 epochs. We use a linear-decay learning rate schedule with warm-up to train the aforementioned models. All models are trained on NVIDIA GeForce 2080Ti GPUs with 11GB memory using PyTorch [34].
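The optimizer and learning-rate schedule described above can be reproduced roughly as follows. The warm-up fraction and the model placeholder are assumptions for illustration and are not values taken from the paper.

```python
import torch

def build_optimizer_and_schedule(model, lr=4e-5, total_steps=10000, warmup_frac=0.1):
    """Adam with linear warm-up followed by linear decay (sketch;
    warm-up fraction and total step count assumed)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    warmup_steps = int(total_steps * warmup_frac)

    def lr_lambda(step):
        if step < warmup_steps:
            # Linear warm-up from 0 to the base learning rate.
            return step / max(warmup_steps, 1)
        # Linear decay from the base learning rate down to 0.
        remaining = total_steps - warmup_steps
        return max(0.0, (total_steps - step) / max(remaining, 1))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```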

7.2 Action Decoding Stage Details

The ViLEncoder is composed of a ViLBERT model pre-trained on the Scene Grounding task and a bi-directional LSTM layer, whose hidden size D_h is set to 512. The ViLPointer is a ViLBERT model pre-trained on the Object Grounding task. The N_{mem} and N_{state} used in the memory blocks are both set to 3 according to our ablation study. We follow the same RL setting as [40], which sets the discount factor to 0.9 and adopts reward shaping [48]. We train the agent with the Adam optimizer [23] with a learning rate of 1e-4, a weight decay of 5e-4, a batch size of 64, and a maximum decoding action length of 40. We clip the global gradient norm at 40. We train the agent for 13,000 iterations and report the final performance. All experiments are conducted on NVIDIA GeForce 2080Ti GPUs with 11GB memory using PyTorch [34].
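A minimal sketch of one action-decoder optimization step with these hyper-parameters is shown below. The agent and compute_loss names are placeholders; only the optimizer settings, the gradient-norm clipping value, the maximum decoding length, and the discount factor mirror the configuration above.

```python
import torch

GAMMA = 0.9           # RL discount factor
MAX_GRAD_NORM = 40.0  # global gradient-norm clip value
MAX_ACTION_LEN = 40   # maximum decoded action sequence length

def make_optimizer(agent):
    # Adam with the reported learning rate and weight decay.
    return torch.optim.Adam(agent.parameters(), lr=1e-4, weight_decay=5e-4)

def train_step(agent, optimizer, batch, compute_loss):
    """One optimization step for the action decoder (sketch)."""
    loss = compute_loss(agent, batch, gamma=GAMMA, max_len=MAX_ACTION_LEN)
    optimizer.zero_grad()
    loss.backward()
    # Clip the global gradient norm before the parameter update.
    torch.nn.utils.clip_grad_norm_(agent.parameters(), MAX_GRAD_NORM)
    optimizer.step()
    return loss.item()
```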

8 Evaluation Metrics Details

In this section, we describe the evaluation metrics in detail. Following [36], we evaluate the model on REVERIE Success Rate (RGS) and REVERIE Success Rate weighted by Path Length (RG SPL). In addition, we report the performance of our method on the following metrics from the REVERIE dataset; a small computational sketch of these metrics follows the list. It is worth noting that the target object is only observable within 3 meters of the target viewpoint.

  • Navigation Success Rate is the percentage of tasks in which the target object is observable from the agent’s final location.

  • Navigation Oracle Success Rate is the percentage of tasks in which the target object can be observed from at least one viewpoint along the agent’s trajectory.

  • Navigation Success Rate weighted by Path Length (SPL) is the navigation success rate weighted by the ratio of the shortest-path length to the length of the actual trajectory.

  • Navigation Length is the trajectory length in meters.

  • REVERIE Success Rate (RGS) is the percentage of tasks in which the predicted bounding box has an IoU ≥ 0.5 with the ground-truth box.

  • REVERIE Success Rate weighted by Path Length (RG SPL) is the REVERIE success rate weighted in the same way by the trajectory length.
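To make the weighting explicit, the sketch below computes the navigation success rate, SPL, and RGS from per-episode records. The field names and the IoU helper are illustrative assumptions; the SPL weighting follows the standard definition of success weighted by the ratio of shortest-path length to actual path length.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / max(area_a + area_b - inter, 1e-8)

def summarize(episodes):
    """episodes: list of dicts with keys 'nav_success' (bool), 'path_len',
    'shortest_len', 'pred_box', 'gt_box' (pred_box may be None)."""
    n = len(episodes)
    # Navigation success rate: fraction of episodes that end near the target.
    sr = sum(e['nav_success'] for e in episodes) / n
    # SPL: success weighted by shortest-path length over actual path length.
    spl = sum(
        e['nav_success'] * e['shortest_len'] / max(e['path_len'], e['shortest_len'])
        for e in episodes) / n
    # RGS: fraction of episodes whose predicted box overlaps the ground truth.
    rgs = sum(
        e['pred_box'] is not None and iou(e['pred_box'], e['gt_box']) >= 0.5
        for e in episodes) / n
    return {'SR': sr, 'SPL': spl, 'RGS': rgs}
```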

9 Qualitative Examples

In this section, we show a number of qualitative examples of how our proposed agent performs in both the Val Seen environments (Fig. 5 to Fig. 8) and the Val Unseen environments (Fig. 9 to Fig. 12). We also visualize five representative failure cases illustrating the typical mistakes our agent makes, to better understand how the agent works.

Figure 5: A successful navigation and localization example on the Val Seen split.
Figure 6: A successful navigation and localization example on the Val Seen split.
Figure 7: A successful navigation and localization example on the Val Seen split.
Figure 8: A successful navigation and localization example on the Val Seen split.
Figure 9: A successful navigation and localization example on the Val Unseen split.
Figure 10: A successful navigation and localization example on the Val Unseen split.
Figure 11: A successful navigation and localization example on the Val Unseen split.
Figure 12: A successful navigation and localization example on the Val Unseen split.
Figure 13: A failed navigation example. The agent successfully navigates to the second level but fails to enter the correct bedroom and stops at a wrong viewpoint.
Figure 14: A failed localization example. The agent successfully navigates to the target viewpoint but fails to localize the target object because the ViLPointer module judges the white chair to be closer to the office door than the black chair, which is understandable since it is genuinely hard to decide which one is closer.
Figure 15: A failed localization example. The agent successfully navigates to the target viewpoint but fails to localize the target object because of the ambiguous meaning of “the plate” in the high-level instruction.
Figure 16: A failed localization example. The agent successfully navigates to the target viewpoint but fails to localize the target object because of the ambiguous meaning of “the bottle on the shelf” in the high-level instruction.
Figure 17: A failed localization example. The agent successfully navigates to the target viewpoint but fails to localize the target object because it fails to capture the relative position of similar objects (“the black and white striped pillow”) in the scene.