Mind the Gap: Improving Success Rate of Vision-and-Language Navigation by Revisiting Oracle Success Routes
Abstract.
Vision-and-Language Navigation (VLN) aims to navigate to the target location by following a given instruction. Unlike existing methods, which focus on predicting a more accurate action at each navigation step, in this paper we make the first attempt to tackle a long-ignored problem in VLN: narrowing the gap between Success Rate (SR) and Oracle Success Rate (OSR). We observe a consistently large gap (up to 9%) for four state-of-the-art VLN methods across two benchmark datasets: R2R and REVERIE. The high OSR indicates that the agent passes the target location, while the low SR suggests that it ultimately fails to stop there. Instead of predicting actions directly, we propose to mine the target location from a trajectory produced by off-the-shelf VLN models. Specifically, we design a multi-module transformer-based model for learning a compact, discriminative trajectory viewpoint representation, which is used to predict the confidence of each viewpoint being the target location described in the instruction. The proposed method is evaluated on three widely adopted datasets, R2R, REVERIE, and NDH, and shows promising results, demonstrating the potential for further research.
1. Introduction

Vision-and-Language Navigation (VLN) has received increasing attention in computer vision, natural language processing, and robotics communities due to its great potential for real-world applications such as domestic assistants. The VLN task requires an agent to navigate to a target location in a 3D simulated environment based on visual observation and a given natural language instruction. A variety of VLN tasks have been proposed, including indoor room-to-room navigation according to detailed instructions (e.g., R2R (Anderson et al., 2018) and RxR (Ku et al., 2020)) or dialogue-based instructions (e.g., NDH (Thomason et al., 2019)); remote object navigation according to concise instructions (REVERIE (Qi et al., 2020b)) or detailed instructions (SOON (Zhu et al., 2021)). Currently, there are both discrete (Anderson et al., 2018) and continuous (Krantz et al., 2020; Wang et al., 2022a) simulators. Our work is based on the discrete simulator, which provides predefined navigation graphs, thus encouraging researchers to focus on textual-visual alignment, excluding the impact from tasks like topology mapping and obstacle avoidance.
Most existing methods formulate VLN as a sequential text-image matching problem. To be specific, positioned at a node of a navigation graph, the agent traverses the environment by selecting the neighbouring node (represented by images) that has the maximal similarity to the instruction. To improve the performance of visual-textual matching, many approaches have been developed. Data augmentation is explored to overcome the scarcity of training data by generating pseudo trajectory-instruction pairs (Fried et al., 2018; Liang et al., 2022a), collecting extra data (Guhur et al., 2021; Chen et al., 2022b), and editing environments from different houses (Li et al., 2022; Liu et al., 2021). In (An et al., 2021; Cheng et al., 2022; Hong et al., 2020; Qi et al., 2020a), instructions are decomposed into landmark, action, and scene components via the attention mechanism to enable fine-grained cross-modality matching. Vision-language pretraining has been employed in previous works such as (Hao et al., 2020; Hong et al., 2021; Guhur et al., 2021; Qi et al., 2021; Chen et al., 2021; Qiao et al., 2022, 2023; Chen et al., 2022c). Most recently, memory and maps for navigation history and planning have been utilized to further enhance textual-visual matching in the works of (Chen et al., 2021, 2022c; Lin et al., 2022a; Qiao et al., 2023; Gao et al., 2023; Zhao et al., 2022; An et al., 2022; Wang et al., 2023b).
Although these methods have boosted VLN performance, one problem remains open and has long been ignored: the performance gap between Success Rate (SR, the agent stops at a graph node within 3 meters of the target location) and Oracle Success Rate (OSR, the agent passes by or stops at a graph node within 3 meters of the destination). Rather than being specific to one method, this gap exists across most state-of-the-art methods and datasets. Figure 1 presents the SR and OSR statistics of four cutting-edge methods on two popular VLN tasks, R2R and REVERIE. It shows that the performance gap ranges from 2.9% to 4.4% on the REVERIE task and consistently remains around 7-9% for all four methods on the R2R task. This significant gap suggests that existing VLN models may successfully find the correct trajectory (because it passes the target location) yet fail to stop at the target spot in the end. Such a large gap indicates great potential to advance VLN performance, which motivates us to mine the passed target nodes from the given trajectories.
In this paper, we study how to find the target node in a given trajectory. Interestingly, we find this task can be formulated as a video grounding (Gao et al., 2017; Zhou et al., 2018; Krishna et al., 2017; Zhang et al., 2020; Tang et al., 2021) task, i.e., localising the target frame (location) in a given video (trajectory) based on a given textual description (instruction). However, there are still two significant differences: (a) each node in a trajectory is represented by a panorama (usually consisting of 36 discrete observation views, i.e., 3 elevation levels and 12 views at each elevation), and the panoramas at two consecutive nodes present severe instant visual transitions (i.e., the sequence is full of cutaways); (b) there may be only a few images (usually fewer than three) that match the textual description of the destination, which is far fewer than in video grounding.
To overcome these challenges, we design a transformer-based model consisting of three modules: Cross-Modality Elevation Transformer, Spatial-Temporal Transformer, and Target Selection Transformer. The Cross-Modality Elevation Transformer learns to fuse visual perceptions across three elevations, which helps reduce redundant information. The Spatial-Temporal Transformer is used to exchange information across different time steps and heading views at each step. The Target Selection Transformer takes advantage of learnable queries to summarize visual information at each time step and generate the representation for confidence prediction of being the target location. In summary, our contributions are three-fold:
- To the best of our knowledge, this is the first work to study the gap between SR and OSR on VLN tasks.
- We design a multi-module transformer-based model to minimize the SR-OSR gap.
- Extensive experiments on three popular VLN benchmarks, R2R, REVERIE, and NDH, show the effectiveness of the proposed method, indicating the potential for further research.

2. Related Work
Vision-and-Language Navigation. The R2R (Anderson et al., 2018), REVERIE (Qi et al., 2020b), and NDH (Thomason et al., 2019) datasets are three well-known benchmarks in the field of VLN, each presenting distinctive challenges for navigation agents. R2R offers low-level natural language instructions and photo-realistic environments for navigation. REVERIE presents the task of remote object localization using concise high-level instructions. NDH focuses on dialog-based navigation. A number of existing works have focused on visual-textual matching to tackle the VLN tasks. Ma et al. (Ma et al., 2019a) propose a visual-textual co-grounding module and a progress monitor to estimate progress towards the goal. Qi et al. (Qi et al., 2020a) use an object- and action-aware model to match the decomposed instruction with candidate visual features. Hong et al. (Hong et al., 2020) model inter- and intra-relationships among the scene, objects, and directions via a language and visual entity relation graph. Recent works in VLN have leveraged transformer-based architectures to encode vision and language information. PRESS (Li et al., 2019) adopts BERT (Devlin et al., 2019), while (Hong et al., 2021) reuses the built-in [CLS] token to maintain history information recurrently. To improve generalization ability, data augmentation has been applied in (Fried et al., 2018; Tan et al., 2019; Liu et al., 2021; Li et al., 2022; Wang et al., 2023a; Li and Bansal, 2023b). Furthermore, adversarial learning was utilized to mine hard training samples in (Fu et al., 2020), while (Guhur et al., 2021; Chen et al., 2022b) leverage additional training data to improve performance. ADAPT (Lin et al., 2022b) proposes modality-aligned action prompts to effectively guide agents in completing complex navigation tasks. CCC (Wang et al., 2022b) explores the intrinsic correlation between instruction following and instruction generation by jointly learning both tasks. CITL (Liang et al., 2022b) employs contrastive learning to learn the alignment between trajectory and instruction. SIG (Li and Bansal, 2023a) generates potential future views to benefit the agent during navigation. History memories are specially maintained in (Chen et al., 2021, 2022c; Qiao et al., 2023; Wang et al., 2021; Zhou et al., 2023) to capture long-range dependencies across past observations and actions. All the above methods have advanced the progress of VLN. However, they all suffer from a performance gap between the success rate and the oracle success rate, indicating that they pass by the target location but fail to stop at it. In this paper, we design a multi-module transformer-based model to mitigate this issue and hope to provide hints for future work.
Video Grounding. Visual grounding involves localizing an object referred to by a natural-language expression and has been researched extensively in both the image domain (Qiao et al., 2021; Yu et al., 2018; Deng et al., 2022, 2021) and the video domain. For video grounding, two tasks are distinguished: Spatial-Temporal Video Grounding (STVG), which localizes both the frames and the bounding boxes of an object specified by a language description (Yamaguchi et al., 2017; Su et al., 2021; Yang et al., 2022), and Temporal Video Grounding (TVG), which only localizes frames (Chen et al., 2018; Rodriguez et al., 2020). Our formulation of VLN, localizing the image representing the navigation destination in an image sequence of the navigation trajectory, can be cast as part of the TVG task. Early TVG approaches employ a proposal-based architecture (Gao et al., 2017; Chen et al., 2018; Yuan et al., 2019a; Xu et al., 2019; Zhang et al., 2019), and several proposal-free methods (Yuan et al., 2019b; Rodriguez et al., 2020; Mun et al., 2020) have been proposed to reduce the computational cost of proposal feature extraction. Techniques such as attention mechanisms (Rodriguez et al., 2020; Yuan et al., 2019b; Mun et al., 2020), reinforcement learning (Hahn et al., 2019; Wang et al., 2019a), and language parsing (Mun et al., 2020) have been explored to improve video-description matching. However, directly applying these methods to our VLN task does not perform well due to the difference in visual representation: consecutive frames with smooth transitions vs. sparse image sequences full of cutaways. We therefore design a new grounding method specifically for VLN tasks.
Methods | R2R Val Seen | | | | | R2R Val Unseen | | | | | R2R Test Unseen | | | | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
 | TL | NE | SPL | SR | OSR | TL | NE | SPL | SR | OSR | TL | NE | SPL | SR | OSR |
Human | - | - | - | - | - | - | - | - | - | - | 11.85 | 1.61 | 76 | 86 | 90 |
Seq2Seq (Anderson et al., 2018) | 11.33 | 6.01 | - | 39 | 53 | 8.39 | 7.81 | - | 21 | 28 | 8.13 | 7.85 | - | 20 | 27 |
SF (Fried et al., 2018) | - | 3.36 | - | 66 | 74 | - | 6.62 | - | 36 | 45 | 14.82 | 6.62 | 28 | 35 | 44 |
RCM (Wang et al., 2019b) | 10.65 | 3.53 | - | 67 | 75 | 11.46 | 6.09 | - | 43 | 50 | 11.97 | 6.12 | 38 | 43 | 50 |
Regretful (Ma et al., 2019b) | - | 3.23 | 63 | 69 | - | - | 5.32 | 41 | 50 | - | 13.69 | 5.69 | 40 | 48 | 56 |
FAST-short (Ke et al., 2019) | - | - | - | - | - | 21.17 | 4.97 | 43 | 56 | - | 22.08 | 5.14 | 41 | 54 | 64 |
EnvDrop (Tan et al., 2019) | 11.00 | 3.99 | 59 | 62 | - | 10.70 | 5.22 | 48 | 52 | - | 11.66 | 5.23 | 47 | 51 | 59 |
OAAM (Qi et al., 2020a) | 10.20 | - | 62 | 65 | 73 | 9.95 | - | 50 | 54 | 61 | 10.40 | 5.30 | 50 | 53 | 61 |
EntityGraph (Hong et al., 2020) | 10.13 | 3.47 | 65 | 67 | - | 9.99 | 4.73 | 53 | 57 | - | 10.29 | 4.75 | 52 | 55 | 61 |
NvEM (An et al., 2021) | 11.09 | 3.44 | 65 | 69 | - | 11.83 | 4.27 | 55 | 60 | - | 12.98 | 4.37 | 54 | 58 | 66 |
ActiveVLN(Wang et al., 2020) | 19.70 | 3.20 | 52 | 70 | - | 20.60 | 4.36 | 40 | 58 | - | 21.60 | 4.33 | 41 | 60 | - |
PRESS (Li et al., 2019) | 10.57 | 4.39 | 55 | 58 | - | 10.36 | 5.28 | 45 | 49 | - | 10.77 | 5.49 | 45 | 49 | - |
PREVALENT (Hao et al., 2020) | 10.32 | 3.67 | 65 | 69 | - | 10.19 | 4.71 | 53 | 58 | - | 10.51 | 5.30 | 51 | 54 | 61 |
VLN↻BERT (Hong et al., 2021) | 11.13 | 2.90 | 68 | 72 | 79 | 12.01 | 3.93 | 57 | 63 | 69 | 12.35 | 4.09 | 57 | 63 | 70 |
AirBERT (Guhur et al., 2021) | 11.09 | 2.68 | 70 | 75 | - | 11.78 | 4.01 | 56 | 62 | - | 12.41 | 4.13 | 57 | 62 | - |
SEvol (Chen et al., 2022a) | 11.97 | 3.56 | 63 | 67 | - | 12.26 | 3.99 | 57 | 62 | - | 13.40 | 4.13 | 57 | 62 | - |
ADAPT (Lin et al., 2022b) | 10.97 | 2.54 | 72 | 76 | - | 12.21 | 3.77 | 58 | 64 | - | 12.99 | 3.79 | 59 | 65 | - |
HAMT (Chen et al., 2021) | 11.15 | 2.51 | 72 | 76 | 82 | 11.46 | 2.29 | 61 | 66 | 73 | 12.27 | 3.93 | 60 | 65 | 72 |
HOP (Qiao et al., 2022) | 11.26 | 2.72 | 70 | 75 | 80 | 12.27 | 3.80 | 57 | 64 | 71 | 12.68 | 3.83 | 59 | 64 | 71 |
Ours | 11.72 | 2.72 | 70 | 77(2) | 80 | 13.05 | 3.68 | 57 | 67(3) | 71 | 13.24 | 3.78 | 58 | 66(2) | 71 |
Ours♠ | 9.95 | 2.72 | 73 | 77(2) | 79 | 9.69 | 3.68 | 62 | 67(3) | 70 | 10.23 | 3.78 | 61 | 66(2) | 69 |
SIG_HAMT (Li and Bansal, 2023a) | 11.68 | 2.80 | 70 | 73 | 79 | 11.96 | 3.37 | 62 | 68 | 75 | 12.83 | 3.81 | 60 | 65 | 72 |
Ours | 12.24 | 2.80 | 69 | 76(3) | 79 | 12.62 | 3.36 | 62 | 70(2) | 75 | 13.42 | 3.73 | 59 | 66(1) | 72 |
Ours♠ | 10.08 | 2.80 | 73 | 76(3) | 78 | 9.65 | 3.36 | 66 | 70(2) | 73 | 10.09 | 3.73 | 62 | 66(1) | 70 |
DUET (Chen et al., 2022c) | 12.32 | 2.28 | 73 | 79 | 86 | 13.94 | 3.31 | 60 | 72 | 81 | 14.73 | 3.65 | 59 | 69 | 76 |
Ours | 12.83 | 2.33 | 72 | 82(3) | 86 | 14.65 | 3.27 | 60 | 75(3) | 81 | 15.33 | 3.62 | 58 | 71(2) | 76 |
Ours♠ | 11.24 | 2.33 | 77 | 82(2) | 85 | 11.95 | 3.27 | 65 | 75(3) | 79 | 12.98 | 3.62 | 61 | 71(2) | 75 |
SIG_DUET (Li and Bansal, 2023a) | 13.96 | 2.73 | 67 | 75 | 83 | 14.31 | 3.13 | 62 | 72 | 81 | 15.36 | 3.37 | 60 | 72 | 80 |
Ours | 14.68 | 2.73 | 66 | 79(4) | 83 | 15.09 | 3.10 | 61 | 76(4) | 81 | 16.02 | 3.31 | 59 | 74(2) | 80 |
Ours♠ | 11.81 | 2.73 | 72 | 79(4) | 81 | 11.90 | 3.10 | 67 | 76(4) | 79 | 12.88 | 3.31 | 64 | 74(2) | 78 |
Methods | REVERIE Val Seen | | | | | REVERIE Val Unseen | | | | | REVERIE Test Unseen | | | | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
 | SPL | SR | OSR | RGS | RGSPL | SPL | SR | OSR | RGS | RGSPL | SPL | SR | OSR | RGS | RGSPL |
Human | - | - | - | - | - | - | - | - | - | - | 53.66 | 81.51 | 86.83 | 77.84 | 51.44 |
RCM (Wang et al., 2019b) | 21.82 | 23.33 | 29.44 | 16.23 | 15.36 | 6.97 | 9.29 | 14.23 | 4.89 | 3.89 | 6.67 | 7.84 | 11.68 | 3.67 | 3.14 |
SMNA (Ma et al., 2019a) | 39.61 | 41.25 | 43.29 | 30.07 | 28.98 | 6.44 | 8.15 | 11.28 | 4.54 | 3.61 | 4.53 | 5.80 | 8.39 | 3.10 | 2.39 |
FAST-Short (Ke et al., 2019) | 40.18 | 45.12 | 49.68 | 31.41 | 28.11 | 6.17 | 10.08 | 20.48 | 6.24 | 3.97 | 8.74 | 14.18 | 23.36 | 7.07 | 4.52 |
MATTN (Qi et al., 2020b) | 45.50 | 50.53 | 55.17 | 31.97 | 29.66 | 7.19 | 14.40 | 28.20 | 7.84 | 4.67 | 11.61 | 19.88 | 30.63 | 11.28 | 6.08 |
ORIST (Qi et al., 2021) | 42.21 | 45.19 | 49.12 | 29.87 | 27.77 | 15.14 | 16.84 | 25.02 | 8.52 | 7.58 | 18.97 | 22.19 | 29.20 | 10.68 | 9.28 |
VLN↻BERT (Hong et al., 2021) | 47.96 | 51.79 | 53.90 | 38.23 | 35.61 | 24.90 | 30.67 | 35.02 | 18.77 | 15.27 | 23.99 | 29.61 | 32.91 | 16.50 | 13.51 |
AirBERT (Guhur et al., 2021) | 42.34 | 47.01 | 48.98 | 32.75 | 30.01 | 21.88 | 27.89 | 34.51 | 18.23 | 14.18 | 23.61 | 30.28 | 34.20 | 16.83 | 13.28 |
HOP (Qiao et al., 2022) | 47.19 | 53.76 | 54.88 | 38.65 | 33.85 | 26.11 | 31.78 | 36.24 | 18.86 | 15.73 | 24.34 | 30.17 | 33.06 | 17.69 | 14.34 |
Ours | 46.43 | 54.11(0.35) | 54.88 | 38.65 | 33.56 | 25.79 | 32.83(1.05) | 36.24 | 18.86 | 15.71 | 24.28 | 31.29(1.12) | 33.06 | 17.69 | 14.12 |
Ours♠ | 50.61 | 54.11(0.35) | 27.64 | 38.65 | 26.23 | 54.63 | 32.83(1.05) | 36.09 | 18.86 | 32.98 | 35.12 | 31.29(1.12) | 36.09 | 17.69 | 14.51 |
HAMT (Chen et al., 2021) | 40.19 | 43.29 | 47.65 | 27.20 | 25.18 | 30.16 | 32.95 | 36.84 | 18.92 | 17.28 | 26.67 | 30.40 | 33.41 | 14.88 | 13.08 |
Ours | 39.73 | 44.60(1.31) | 47.65 | 27.20 | 24.90 | 29.56 | 34.17(1.22) | 36.84 | 18.92 | 16.93 | 26.61 | 31.19(0.79) | 33.41 | 14.88 | 12.82 |
Ours♠ | 42.37 | 44.60(1.31) | 47.40 | 27.20 | 25.37 | 31.58 | 34.17(1.22) | 36.68 | 18.92 | 17.56 | 27.92 | 31.19(0.79) | 33.28 | 14.88 | 13.47 |
DUET (Chen et al., 2022c) | 63.94 | 71.75 | 73.86 | 57.41 | 51.14 | 33.73 | 46.98 | 51.07 | 32.15 | 23.03 | 36.06 | 52.51 | 56.91 | 31.88 | 22.06 |
Ours | 63.21 | 72.48(0.73) | 73.86 | 57.41 | 50.92 | 33.57 | 48.31(1.33) | 51.07 | 32.15 | 22.84 | 35.30 | 54.08(1.57) | 56.91 | 31.88 | 21.84 |
Ours♠ | 66.80 | 72.48(0.73) | 73.46 | 57.41 | 53.18 | 34.96 | 48.31(1.33) | 50.93 | 32.15 | 23.33 | 38.19 | 54.08(1.57) | 56.56 | 31.88 | 22.18 |
HM3D (Chen et al., 2022b) | 55.70 | 65.00 | 66.76 | 48.42 | 41.67 | 40.84 | 55.89 | 62.14 | 36.58 | 26.75 | 38.88 | 55.17 | 62.30 | 32.23 | 22.68 |
Ours | 55.60 | 65.50(0.50) | 66.76 | 48.42 | 41.48 | 40.61 | 58.22(2.33) | 62.14 | 36.58 | 26.52 | 38.72 | 56.78(1.61) | 62.30 | 32.23 | 22.53 |
Ours♠ | 58.22 | 65.50(0.50) | 66.54 | 48.42 | 43.69 | 43.04 | 58.22(2.33) | 61.73 | 36.58 | 26.98 | 41.76 | 56.78(1.61) | 61.82 | 32.23 | 23.03 |
3. Method
Problem Formulation. In this work, we propose to minimize the SR-OSR gap of a given VLN model by formulating it as a trajectory grounding task. Given a natural language instruction consisting of $L$ words and a navigation trajectory consisting of $T$ viewpoints, trajectory grounding aims to find the viewpoint that matches the destination specified in the instruction, so as to reduce the gap between SR and OSR. Each viewpoint has a panoramic observation represented by 36 discrete images (3 elevations × 12 headings), as in previous VLN works (Anderson et al., 2018; Qi et al., 2020b). The trajectories are harvested from pre-trained VLN methods. Note that the destination may not appear in a given trajectory; in this case, our model fails as well. Since our aim in this work is only to reduce the gap between SR and OSR, this is acceptable.
Method Overview. Figure 2 illustrates the main architecture of the proposed model, which contains three transformer-based modules: Cross-Modality Elevation Transformer, Spatial-Temporal Transformer, and Target Selection Transformer. The model takes as input an instruction-trajectory pair and utilizes a text encoder and a vision encoder to extract their own representations. Then, the representations are fed into the Cross-Modality Elevation Transformer that fuses information across elevations. Next, the Spatial-Temporal Transformer module exchanges information across discrete views and across temporal navigation steps. Finally, the Target Selection Transformer utilizes learnable queries to summarize the information of each viewpoint, and its outputs are used to predict the confidence of each viewpoint of being the destination. Below we give more details about each module.
3.1. Text and Vision Encoder
Text Encoder. Given the natural language instruction, we utilize the linguistic embedding model BERT (Devlin et al., 2019) to extract features. Here, we use the BERT embedding of the first token (i.e., the [CLS] token) as the feature of the instruction, denoted as $c$. This embedding has been shown to represent comprehensive information of the whole instruction (Qiao et al., 2022; Hao et al., 2020; Chen et al., 2022c), and using it alone also significantly reduces computation and resource costs.
Vision Encoder. A trajectory is represented by a sequence of panoramic observations, one at each viewpoint. Each panorama consists of 36 discrete images (3 elevations and 12 headings at each elevation), one for each observing view. As in (Chen et al., 2021, 2022c), we use ViT-B/16 (Dosovitskiy et al., 2021) pre-trained on ImageNet (Russakovsky et al., 2015) to encode each image, obtaining its feature $v_{t,i}$ and thus $V_t = \{v_{t,i}\}_{i=1}^{36}$ for the viewpoint $t$. Then, we add a navigation step encoding and a spatial sinusoidal positional encoding, as in (Vaswani et al., 2017; Carion et al., 2020; Yang et al., 2022), onto each view representation:
$\tilde{v}_{t,i} = v_{t,i} + e^{\mathrm{step}}_{t} + e^{\mathrm{pos}}_{i}$ (1)
The trajectory-instruction pair representations are fed into the Cross-Modality Elevation Transformer.
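To make the tensor shapes concrete, the following is a minimal PyTorch sketch of the encodings in Eq. (1). The shapes, the 64-step maximum trajectory length, and the helper names (`sinusoidal_encoding`, `step_embed`) are illustrative assumptions rather than the authors' implementation; in practice the view features come from the frozen ViT-B/16 and the instruction feature from the text encoder described above.

```python
import torch
import torch.nn as nn

# Illustrative shapes: T viewpoints, 36 views per panorama (3 elevations x 12 headings),
# d-dimensional features. Values and names are assumptions, not the released code.
T, N_VIEWS, D = 8, 36, 768

view_feats = torch.randn(T, N_VIEWS, D)   # ViT-B/16 features of the 36 views at each viewpoint
instr_feat = torch.randn(D)               # [CLS] embedding of the instruction (used in Sec. 3.2)

step_embed = nn.Embedding(64, D)          # learnable navigation-step encoding (64 = assumed max steps)

def sinusoidal_encoding(n_pos: int, d: int) -> torch.Tensor:
    """Fixed sinusoidal positional encoding over the 36 discrete views."""
    pos = torch.arange(n_pos, dtype=torch.float32).unsqueeze(1)
    idx = torch.arange(d, dtype=torch.float32).unsqueeze(0)
    angle = pos / torch.pow(10000.0, (2 * (idx // 2)) / d)
    enc = torch.zeros(n_pos, d)
    enc[:, 0::2] = torch.sin(angle[:, 0::2])
    enc[:, 1::2] = torch.cos(angle[:, 1::2])
    return enc

# Eq. (1): add step and spatial positional encodings to every view feature.
enc_feats = view_feats + step_embed(torch.arange(T)).unsqueeze(1) + sinusoidal_encoding(N_VIEWS, D).unsqueeze(0)
print(enc_feats.shape)  # torch.Size([8, 36, 768])
```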
3.2. Cross-Modality Elevation Transformer
This module is designed to fuse information across elevations with awareness of the instruction description. To this end, we first fuse vision-language information at the observing-view level, because the described destination corresponds to one of these views. Specifically, the feature of each observing view is fused with the instruction embedding separately:
$\hat{v}_{t,i} = \sigma\big([\,\tilde{v}_{t,i}\,;\,c\,]\,W_{1}\big)\,W_{2}$ (2)
where $W_{1}$ and $W_{2}$ are learnable parameter matrices, $\sigma$ denotes the activation function, and $[\,\cdot\,;\,\cdot\,]$ denotes concatenation.
Then, we further fuse information across elevations. In each observing heading direction, there are three elevation views, as shown in Figure 2. Images of these elevations present significant overlap and thus redundant information. To address this problem, we leverage a transformer module to fuse information based on the attention mechanism. Another benefit of this module is that the number of vision tokens sharply decreases to one third of the original. Specifically, the observations at heading angle $j$ of viewpoint $t$ can be denoted as $E_{t,j}$, where $j \in \{1, \dots, 12\}$ and $E_{t,j}$ stacks the three elevation views $\hat{v}$ at that heading. Then, we can reformulate the viewpoint representation in terms of $E_{t,j}$: $\hat{V}_t = \{E_{t,1}, \dots, E_{t,12}\}$. Next, we feed it to a stack of transformer layers with self-attention on each $E_{t,j}$ to exchange information across elevations:
$\bar{E}_{t,j} = \mathrm{softmax}\!\left(\frac{(E_{t,j}W_{q})(E_{t,j}W_{k})^{\top}}{\sqrt{d}}\right)E_{t,j}W_{v}$ (3)
where $W_{q}$, $W_{k}$, and $W_{v}$ are learnable parameter matrices, and $d$ is the embedding dimension of $E_{t,j}$. Finally, we further aggregate information across elevations via average pooling over the elevation dimension, $h_{t,j} = \mathrm{AvgPool}(\bar{E}_{t,j})$. In this way, the instruction-informed representation at each trajectory viewpoint is converted to $H_t = \{h_{t,1}, \dots, h_{t,12}\}$.
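A minimal sketch of how the Cross-Modality Elevation Transformer could be realized in PyTorch, assuming the 36 views are ordered as 3 elevations × 12 headings and that the instruction fusion is a small MLP over concatenated features. The module layout and hyper-parameters are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class ElevationFusion(nn.Module):
    """Sketch: instruction-conditioned view fusion (roughly Eq. 2), self-attention over the
    3 elevations of each heading (roughly Eq. 3), then average pooling over elevations."""

    def __init__(self, d: int = 768, n_layers: int = 2, n_heads: int = 8):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, d))
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=n_heads, dim_feedforward=d, batch_first=True)
        self.elev_attn = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, views: torch.Tensor, instr: torch.Tensor) -> torch.Tensor:
        # views: (T, 36, d), ordered as 3 elevations x 12 headings; instr: (d,)
        T, _, d = views.shape
        fused = self.fuse(torch.cat([views, instr.expand(T, 36, d)], dim=-1))
        fused = fused.view(T, 3, 12, d).permute(0, 2, 1, 3).reshape(T * 12, 3, d)  # group the 3 elevations per heading
        fused = self.elev_attn(fused)                                              # exchange information across elevations
        return fused.mean(dim=1).view(T, 12, d)                                    # pool: 36 -> 12 tokens per viewpoint

h = ElevationFusion()(torch.randn(8, 36, 768), torch.randn(768))
print(h.shape)  # torch.Size([8, 12, 768])
```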
3.3. Spatial-Temporal Transformer
To make each visual token aware of the existence of other tokens in both spatial and temporal dimensions, we first perform spatial attention within each viewpoint representation across 12 headings:
$H_{t}' = \mathrm{softmax}\!\left(\frac{(H_{t}W_{q}')(H_{t}W_{k}')^{\top}}{\sqrt{d}}\right)H_{t}W_{v}'$ (4)
where $W_{q}'$, $W_{k}'$, and $W_{v}'$ are learnable parameters. In this way, we obtain the updated representation for the whole trajectory, $H' = \{H_{1}', \dots, H_{T}'\}$.
Then, we expand the spatial attention from one viewpoint to all the viewpoints, attending to all observation representations:
$G = \mathrm{softmax}\!\left(\frac{(H'W_{q}'')(H'W_{k}'')^{\top}}{\sqrt{d}}\right)H'W_{v}''$ (5)
where $W_{q}''$, $W_{k}''$, and $W_{v}''$ are learnable parameters and $H'$ is flattened over all $T \times 12$ tokens of the trajectory. This helps the model mine long-range dependencies in the whole trajectory.
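The two attention stages above can be sketched as follows, again as an assumed PyTorch layout (standard `nn.TransformerEncoder` blocks with the 2-layer / 8-head setting from Sec. 4.2), not the authors' exact code.

```python
import torch
import torch.nn as nn

class SpatialTemporalTransformer(nn.Module):
    """Sketch: spatial attention among the 12 headings of each viewpoint (cf. Eq. 4),
    then attention over all T x 12 tokens of the trajectory (cf. Eq. 5)."""

    def __init__(self, d: int = 768, n_heads: int = 8, n_layers: int = 2):
        super().__init__()
        def make_encoder():
            layer = nn.TransformerEncoderLayer(d_model=d, nhead=n_heads, dim_feedforward=d, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=n_layers)
        self.spatial, self.temporal = make_encoder(), make_encoder()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (T, 12, d) heading tokens per viewpoint after elevation fusion
        T, H, d = x.shape
        x = self.spatial(x)                        # attention within each viewpoint (batch of T viewpoints)
        x = self.temporal(x.reshape(1, T * H, d))  # attention across the whole trajectory
        return x.reshape(T, H, d)

out = SpatialTemporalTransformer()(torch.randn(8, 12, 768))
print(out.shape)  # torch.Size([8, 12, 768])
```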
3.4. Target Selection Transformer
After the cross-modality elevation and spatial-temporal transformers, the information of each viewpoint has been fully exchanged. However, the representation dimension for a trajectory is still very high ($T \times 12 \times d$), which leads to a large computational cost for end-to-end training of the whole model. We first use a viewpoint-query self-attention layer to make the learnable input queries attend to each other. To extract discriminative information, we then leverage each query to adaptively summarize the information within its viewpoint representation via cross attention:
$g_{t} = \mathrm{softmax}\!\left(\frac{(q_{t}W_{q})(G_{t}W_{k})^{\top}}{\sqrt{d}}\right)G_{t}W_{v}$ (6)
where $q_{t}$ is the learnable query of viewpoint $t$, $G_{t}$ denotes its 12 tokens output by the Spatial-Temporal Transformer, $W_{q}$, $W_{k}$, and $W_{v}$ are learnable parameters, and the dimension of the resulting $g_{t}$ is compressed to $1 \times d$. The final representation of the whole trajectory becomes $\{g_{1}, \dots, g_{T}\} \in \mathbb{R}^{T \times d}$. Next, we utilize a two-layer MLP followed by a sigmoid function to predict the probability of each viewpoint being the described target location:
$p_{t} = \mathrm{sigmoid}\big(\mathrm{MLP}(g_{t})\big)$ (7)
Considering that there might be more than one positive viewpoint in a trajectory, we hope our model can discover all of them, so we treat this as a step-wise binary classification problem for each viewpoint. During inference, the final prediction is determined by selecting the viewpoint with the maximum probability, which yields a high-confidence prediction.
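Below is a hedged PyTorch sketch of the Target Selection Transformer and the inference rule, assuming one learnable query per navigation step and standard multi-head attention; the exact query construction in the paper may differ.

```python
import torch
import torch.nn as nn

class TargetSelection(nn.Module):
    """Sketch: learnable viewpoint queries attend to each other, each query then summarizes
    its viewpoint's 12 tokens via cross-attention (cf. Eq. 6), and a 2-layer MLP + sigmoid
    scores each viewpoint as the destination (cf. Eq. 7)."""

    def __init__(self, d: int = 768, n_heads: int = 8, max_steps: int = 64):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(max_steps, d) * 0.02)   # one query per possible step
        self.query_self_attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.head = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (T, 12, d) trajectory tokens; returns per-viewpoint destination probabilities of shape (T,)
        T = x.size(0)
        q = self.queries[:T].unsqueeze(0)                 # (1, T, d)
        q, _ = self.query_self_attn(q, q, q)              # queries attend to each other
        q = q.squeeze(0).unsqueeze(1)                     # (T, 1, d): one query per viewpoint
        g, _ = self.cross_attn(q, x, x)                   # summarize 12 tokens into a single vector
        return torch.sigmoid(self.head(g.squeeze(1))).squeeze(-1)

probs = TargetSelection()(torch.randn(8, 12, 768))
pred_viewpoint = probs.argmax().item()   # inference: pick the most confident viewpoint as the destination
```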
3.5. Loss Function
We employ two loss functions for training: Focal Loss (Lin et al., 2017) and Dice Loss (Milletari et al., 2016). Focal Loss is a balanced version of cross-entropy loss that puts more emphasis on hard-to-classify examples. The focal loss function is defined as follows:
$\mathcal{L}_{\mathrm{focal}} = -\alpha\,(1 - p)^{\gamma}\,\log(p)$ (8)
where $p$ is the predicted probability of the ground-truth target viewpoint, $\alpha$ is a weighting factor that depends on the class frequency, and $\gamma$ is a hyper-parameter that controls how strongly easy examples are down-weighted.
Dice Loss is a similarity-based loss function that measures the overlap between the prediction and ground truth labels:
$\mathcal{L}_{\mathrm{dice}} = 1 - \frac{2\sum_{t} p_{t}\,y_{t}}{\sum_{t} p_{t} + \sum_{t} y_{t}}$ (9)
where $y_{t}$ is the binary ground-truth label of viewpoint $t$ (1 for positive viewpoints and 0 for negative viewpoints).
The final loss is a weighted sum of Focal Loss and Dice Loss:
$\mathcal{L} = \lambda_{1}\,\mathcal{L}_{\mathrm{focal}} + \lambda_{2}\,\mathcal{L}_{\mathrm{dice}}$ (10)
where $\lambda_{1}$ and $\lambda_{2}$ are trade-off weights between the Focal Loss and the Dice Loss.
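A small sketch of the training objective, under the assumption that α = 0.25 and γ = 2 (the common Focal-loss defaults) and that λ1 / λ2 follow the best combination in Table 7 (1.0 / 0.1); the paper's exact values are not stated in the text.

```python
import torch

def focal_loss(p: torch.Tensor, y: torch.Tensor, alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Binary focal loss over per-viewpoint probabilities p and 0/1 labels y (cf. Eq. 8)."""
    pt = torch.where(y > 0, p, 1 - p)                                  # probability of the true class
    at = torch.where(y > 0, torch.full_like(p, alpha), torch.full_like(p, 1 - alpha))
    return (-at * (1 - pt).pow(gamma) * torch.log(pt.clamp_min(1e-8))).mean()

def dice_loss(p: torch.Tensor, y: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Soft Dice loss measuring overlap between predictions and labels (cf. Eq. 9)."""
    return 1 - (2 * (p * y).sum() + eps) / (p.sum() + y.sum() + eps)

def total_loss(p: torch.Tensor, y: torch.Tensor, lam_focal: float = 1.0, lam_dice: float = 0.1) -> torch.Tensor:
    return lam_focal * focal_loss(p, y) + lam_dice * dice_loss(p, y)   # cf. Eq. (10)

p = torch.tensor([0.1, 0.8, 0.3, 0.9])   # predicted per-viewpoint probabilities
y = torch.tensor([0.0, 1.0, 0.0, 1.0])   # two positive viewpoints in this toy trajectory
print(total_loss(p, y))
```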
Methods | NDH Val Seen | | | NDH Val Unseen | | |
---|---|---|---|---|---|---|
GP | SR | OSR | GP | SR | OSR | |
PREVALENT (Hao et al., 2020) | - | - | - | 3.15 | - | - |
HOP (Qiao et al., 2022) | - | - | - | 5.13 | - | - |
HAMT (Chen et al., 2021) | 6.90 | 20.68 | 31.15 | 5.09 | 16.87 | 27.89 |
+ Ours | 7.28(0.38) | 25.19(4.51) | 31.15 | 5.52(0.43) | 19.52(2.65) | 27.89 |
SIG_HAMT (Li and Bansal, 2023a) | 8.13 | 23.56 | 33.77 | 5.60 | 15.33 | 28.45 |
+ Ours | 8.51(0.38) | 27.33(3.77) | 33.77 | 5.89(0.29) | 17.24(1.91) | 28.45 |
4. Experiments
We evaluate our proposed method on three downstream tasks: R2R (Anderson et al., 2018), REVERIE (Qi et al., 2020b) and NDH (Thomason et al., 2019) based on the Matterport3D (Chang et al., 2017) simulator. These tasks evaluate the agent from different perspectives and have different characteristics.
R2R benchmark. The Room-to-Room (R2R) dataset (Anderson et al., 2018) is a VLN task that involves navigating through photo-realistic indoor environments by following low-level natural language instructions, such as “Turn left and walk across the hallway. Turn left again and walk across this hallway …”. This task consists of 90 houses with 10,567 viewpoints and 7,189 shortest-path trajectories with 21,567 manually annotated instructions.
REVERIE benchmark. REVERIE dataset (Qi et al., 2020b) is a VLN task that focuses on localizing a remote target object based on high-level human instructions, such as "Bring me the bench from the foyer". The target object is not visible at the starting location. The agent has to navigate to an appropriate location without detailed guidance.
NDH benchmark. NDH dataset (Thomason et al., 2019) is a VLN task that evaluates an agent’s ability to arrive at goal regions based on multi-turn question-answering dialogs. The task poses challenges due to the ambiguity and under-specification of starting instructions, as well as the long lengths of both instructions and paths.
Model | # | Component | | | R2R Val Unseen | | |
---|---|---|---|---|---|---|---|
 | | E Trans. | S-T Trans. | T Trans. | SR | OSR | Gap |
HOP (Qiao et al., 2022) | 0 | | | | 63.52 | 71.48 | 7.96 |
Ours | 1 | ✓ | ✓ | ✓ | 66.92 | 71.48 | 4.56 |
 | 2 | ✗ | ✓ | ✓ | 65.79 | 71.48 | 5.69 |
 | 3 | ✓ | ✗ | ✓ | 64.82 | 71.48 | 6.66 |
 | 4 | ✓ | ✓ | ✗ | 65.33 | 71.48 | 6.15 |
4.1. Evaluation Metrics
We adopt widely used metrics for each of the three VLN tasks. For the R2R task, Trajectory Length (TL) measures the average distance navigated by the agent. Navigation Error (NE) measures the mean deviation, in meters, between the agent's stop location and the target location. Success Rate (SR) is computed as the ratio of successfully completed tasks, where the agent stops within 3 meters of the target location, while Oracle Success Rate (OSR) considers a task successful if at least one of its trajectory viewpoints is within 3 meters of the target location. Success Rate weighted by Path Length (SPL) evaluates accuracy and efficiency simultaneously, considering both the success rate and the length of the navigation path. For the REVERIE (Qi et al., 2020b) task, the same metrics as R2R are used to evaluate the navigation sub-task. REVERIE introduces additional metrics to evaluate object grounding performance: Remote Grounding Success Rate (RGS) measures the proportion of tasks that successfully locate the target object, and RGS weighted by Path Length (RGSPL) considers both grounding accuracy and the navigation path length. For the NDH (Thomason et al., 2019) task, the primary evaluation metric is Goal Progress (GP), which quantifies the distance (in meters) that the agent has progressed towards the target location. We also report the SR metric of the R2R task as a complementary measurement.
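For reference, the SR, OSR, and SPL definitions above can be expressed as a short, self-contained sketch (per-episode computation; the function and argument names are ours, not from any benchmark toolkit):

```python
def navigation_metrics(traj_dists, gt_path_len, traj_len, threshold=3.0):
    """traj_dists: geodesic distance (m) from each visited viewpoint to the target,
    with the last entry being the stop viewpoint; gt_path_len: shortest-path length (m)
    to the target; traj_len: length (m) of the executed trajectory."""
    success = traj_dists[-1] <= threshold                             # SR: stop within 3 m of the target
    oracle_success = min(traj_dists) <= threshold                     # OSR: any visited viewpoint within 3 m
    spl = float(success) * gt_path_len / max(gt_path_len, traj_len)   # success weighted by path efficiency
    return {"SR": float(success), "OSR": float(oracle_success), "SPL": spl}

# This episode passes within 3 m of the target (oracle success) but stops 4 m away (failure):
print(navigation_metrics(traj_dists=[9.2, 5.4, 2.1, 4.0], gt_path_len=8.0, traj_len=11.5))
```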
4.2. Implementation Details
We conduct all experiments on a single NVIDIA RTX 3090 GPU. The batch size is set to 48. The image features are extracted by ViT-B/16 (Dosovitskiy et al., 2021) pretrained on ImageNet (Russakovsky et al., 2015), and we adopt RoBERTa (Liu et al., 2019) as the text encoder. For each VLN benchmark, we train our model using both data generated from the original data and the HM3D-AutoVLN data (Chen et al., 2022b). The Cross-Modality Elevation Transformer, Spatial-Temporal Transformer, and Target Selection Transformer in our model use 2, 2, and 2 transformer layers, respectively. Our transformers have 8 heads, and the feed-forward layers have a hidden dimension of 768. We freeze the vision encoder and use separate initial learning rates for the language encoder and for the rest of the network. The text encoder uses a linear learning-rate schedule with warm-up, while the rest of the network uses a constant learning rate. We apply the AdamW (Loshchilov and Hutter, 2019) optimizer with weight decay and use a dropout probability of 0.1 in the transformer layers and 0.5 in the prediction head. We also apply dropout with a probability of 0.4 followed by a linear layer on the image features. Lastly, we use an exponential moving average with a decay rate of 0.9998. For the loss functions, we set the hyper-parameters $\alpha$, $\gamma$, $\lambda_{1}$, and $\lambda_{2}$ of the Focal and Dice losses; the effect of $\lambda_{1}$ and $\lambda_{2}$ is ablated in Table 7. We train our networks for 40,000 iterations, and the final model is selected based on the best performance on the validation unseen split.
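The optimizer and learning-rate schedule described above can be set up roughly as follows; the concrete learning rates, weight decay, and warm-up length are placeholders, since the source does not state them.

```python
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def build_optimizer(text_params, rest_params, lr_text=1e-5, lr_rest=1e-4,
                    weight_decay=0.01, warmup_steps=1000, total_steps=40000):
    """AdamW with two parameter groups: the text encoder follows a linear schedule with
    warm-up, while the rest of the network keeps a constant learning rate (Sec. 4.2)."""
    optimizer = AdamW([{"params": list(text_params), "lr": lr_text},
                       {"params": list(rest_params), "lr": lr_rest}],
                      weight_decay=weight_decay)

    def warmup_then_linear(step: int) -> float:
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

    # one lr_lambda per parameter group: scheduled text encoder, constant everything else
    scheduler = LambdaLR(optimizer, lr_lambda=[warmup_then_linear, lambda _: 1.0])
    return optimizer, scheduler
```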
Training Data Preparation. The ground-truth trajectories are not suitable for training our model because the target viewpoint is always the last one in each trajectory, which might lead our model to learn this bias instead of the ability to match visual representations to textual ones. To overcome this problem, we construct new training data based on the original trajectories. Specifically, for each original trajectory, we first find all the viewpoints that are located within 3 meters of the target location. These viewpoints are viewed as positive because, if an agent stops at any of them, the navigation is regarded as successful. Then, we sample a small number of viewpoints (e.g., one or two) from this positive set as the positive viewpoints of a newly constructed trajectory. The reason we sample multiple positive viewpoints is that a VLN method may pass by the target location several times, for example, both before and after reaching it. Next, we connect the original starting location with one of the sampled positive viewpoints using the shortest path. Then, from the other sampled viewpoint (if there is any), we expand another sub-path within 6 meters. By connecting these two sub-paths through the sampled positive viewpoint(s), we obtain a new path. In this way, we construct 396,866 trajectories for R2R, 331,533 trajectories for REVERIE, and 376,871 trajectories for NDH.
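The construction of training trajectories can be sketched as below, assuming the navigation graph is available as a `networkx` graph whose edges carry metric distances in a 'weight' attribute; the sampling details (how many positives are drawn, how the 6-meter extension is chosen) are our assumptions, not the authors' exact procedure.

```python
import random
import networkx as nx

def build_training_trajectory(graph: nx.Graph, start: str, target: str,
                              radius: float = 3.0, max_extension: float = 6.0):
    """Return (trajectory, positive_viewpoints) for one constructed training sample."""
    # 1) every viewpoint within 3 m of the target counts as a positive label
    dists = nx.single_source_dijkstra_path_length(graph, target, weight="weight")
    positives = [v for v, d in dists.items() if d <= radius]
    # 2) sample a couple of positives the new trajectory must pass through
    chosen = random.sample(positives, k=min(2, len(positives)))
    # 3) connect the original start to the first sampled positive via the shortest path
    path = nx.shortest_path(graph, start, chosen[0], weight="weight")
    # 4) continue to the second positive (if any) and extend a short sub-path within 6 m
    if len(chosen) > 1:
        path += nx.shortest_path(graph, chosen[0], chosen[1], weight="weight")[1:]
    tail_candidates = nx.single_source_dijkstra_path_length(graph, path[-1], cutoff=max_extension, weight="weight")
    tail = random.choice(list(tail_candidates))
    path += nx.shortest_path(graph, path[-1], tail, weight="weight")[1:]
    return path, set(chosen)
```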
4.3. Comparison to State-of-the-Art Methods
Results on R2R. We assess the efficacy of the proposed method on top of four state-of-the-art methods: HOP (Qiao et al., 2022), SIG_HAMT (Li and Bansal, 2023a), DUET (Chen et al., 2022c), and SIG_DUET (Li and Bansal, 2023a). The results are presented in Table 1. They show that our method brings significant performance improvements for these four top-performing methods. Specifically, on validation unseen, our method improves the success rate with an absolute increase of 4% for SIG_DUET, 3% for HOP and DUET, and 2% for SIG_HAMT. On the test split, our method yields a 2% absolute improvement for 3 out of the 4 top-performing methods, boosting the success rate to 74%. In terms of SPL, we report two types of results: one computed on the path that adds the shortest return path from the original stop location to the viewpoint predicted by our method, and the other computed on the path cropped from the starting location to the viewpoint predicted by our method (indicated by the superscript ♠). In the return setting, the SPL is slightly decreased (-1%) due to the increase in trajectory length caused by the return path. By contrast, in the crop setting, our method brings a 2%-5% absolute SPL improvement. All these results show the effectiveness of the proposed method.
Results on REVERIE. In Table 2, we present results on the REVERIE dataset with comparisons to several state-of-the-art methods, including RCM (Wang et al., 2019b), SMNA (Ma et al., 2019a), FAST-Short (Ke et al., 2019), MATTN (Qi et al., 2020b), ORIST (Qi et al., 2021), VLN↻BERT (Hong et al., 2021), and AirBERT (Guhur et al., 2021). REVERIE evaluates both navigation and grounding performance. As our method is designed to improve navigation ability, the navigation success rate is the main metric for reference. As shown in Table 2, our method boosts the navigation success of all four top-performing baselines, with improvements ranging from 0.35% to 2.33% across the three splits. As our method only modifies pre-obtained navigation trajectories, it cannot change the object grounding results, and thus there is no performance difference in terms of the RGS metric.
Model | Training Loss | R2R Val Unseen | | |
---|---|---|---|---|
 | | SR | OSR | Gap |
1 | BCE | 63.62 | 71.48 | 7.86 |
2 | Focal | 65.05 | 71.48 | 6.43 |
3 | BCE + Dice | 65.43 | 71.48 | 6.05 |
4 | Focal + Dice | 66.92 | 71.48 | 4.56 |
Model | Training Loss HP | | R2R Val Unseen | | |
---|---|---|---|---|---|
 | λ1 (Focal) | λ2 (Dice) | SR | OSR | Gap |
1 | 1.0 | 1.0 | 65.42 | 71.48 | 6.06 |
2 | 1.0 | 0.5 | 65.53 | 71.48 | 5.95 |
3 | 1.0 | 0.2 | 66.31 | 71.48 | 5.17 |
4 | 1.0 | 0.1 | 66.92 | 71.48 | 4.56 |
Results on NDH. The results of our baselines and our method are presented in Table 3. By applying our method to HAMT (Chen et al., 2021) and SIG_HAMT (Li and Bansal, 2023a), we achieve 2.65% and 1.91% improvements in success rate on the NDH validation unseen split, respectively. Larger improvements are observed on the validation seen split: 4.51% and 3.77%. Regarding the GP metric, our method also improves the baselines, reaching a new state-of-the-art performance of 5.89 (up from 5.60).

4.4. Ablation Study
In this section, we investigate the effectiveness of the three transformer-based modules of our method, additional training data, loss functions, and hyperparameters of loss functions.
The Effectiveness of Different Components. We conduct ablation experiments over the main components of our method, the cross-modality elevation transformer (E Trans.), the spatial-temporal transformer (S-T Trans.), and the target selection transformer (T Trans.), on the R2R validation unseen split. The results are presented in Table 4. For ease of comparison, we also present the baseline result in #0. In #1, we present the results when applying all three modules on top of the baseline. In the remaining rows, we report the results of removing each module in turn. When all three modules are applied, the best results are achieved, and the gap between SR and OSR is reduced from 7.96 to 4.56 (#0 vs. #1). When removing each module in turn, we notice that the cross-modality elevation transformer leads to the largest performance drop (#1 vs. #2), followed by the target selection transformer (#1 vs. #4), and then the spatial-temporal transformer (#1 vs. #3).
The Effectiveness of Additional Training Data. In addition to the original training data of each VLN task, we also use the data automatically generated in (Chen et al., 2022b) from 900 unlabelled raw HM3D environments. To verify its effectiveness, we conduct three ablation studies, and the results are shown in Table 5. Specifically, using the additional data results in a 2.17% increase in Success Rate compared to using the original training data alone (#1 vs. #3). We also evaluate the zero-shot performance of the model trained only on the additional data (#2), which achieves a Success Rate of 45.52% on the R2R validation unseen split. Although these results suggest that the additional HM3D-AutoVLN data provides relevant information for the visual navigation task (#2 vs. #3), it is worth noting that the model trained solely on the additional data achieves significantly lower performance than the model trained on only the original data.
Training Losses. To train our model effectively, we test three loss functions: binary cross-entropy (BCE) loss, Focal loss (Lin et al., 2017), and Dice loss (Milletari et al., 2016). Here we evaluate their individual efficacy. The results are shown in Table 6. Using Focal loss together with Dice loss yields the best success rate, followed by BCE+Dice, Focal, and finally the BCE loss. Focal loss achieving better results than BCE loss indicates that some training samples are more important than others. The introduction of Dice loss boosts performance over both BCE and Focal loss, which indicates that considering multiple predictions jointly is better than handling them separately.
Impact of Hyperparameters. We conduct another ablation study to investigate the impact of the training-loss hyperparameters (HPs) on our model. The study varies the values of the two loss weights, λ1 and λ2. Specifically, we experiment with four different combinations, as shown in Table 7, ranging from both weights being equal to one being given much more weight than the other. The results show that varying these hyper-parameters has a minor impact on the performance of our model.
4.5. Qualitative Results
In Figure 3, we visualize the predicted trajectories before and after applying our method on top of three methods: HOP (Qiao et al., 2022), HAMT (Chen et al., 2021), and DUET (Chen et al., 2022c). As shown in Figure 3(a), HOP initially passes the target location, which is a common problem in trajectory planning for VLN agents. Our method is able to find the correct viewpoints in the original trajectory and thus corrects wrong navigations, as indicated by the dashed lines in Figure 3. Similar phenomena can also be observed for HAMT (Chen et al., 2021) (Figure 3(b)) and DUET (Chen et al., 2022c) (Figure 3(c)). This shows the effectiveness of our method in correcting trajectories that deviate from the desired path and enabling agents to reach their target locations with improved precision.
5. Conclusion
In this paper, we address a long-ignored problem in Vision-and-Language Navigation, narrowing the gap between Success Rate (SR) and Oracle Success Rate (OSR), by formulating it as a trajectory grounding task. We propose to discover the target location from a trajectory given by off-the-shelf VLN models, rather than predicting actions directly as existing methods do. To achieve this, we design a novel multi-module transformer-based model to learn a compact, discriminative trajectory viewpoint representation, which is then used to predict the confidence of each viewpoint being the target location. Our approach is evaluated on three widely adopted datasets, R2R, REVERIE, and NDH (with additional results on RxR in the appendix), and extensive evaluations demonstrate its effectiveness. Our method points out a new, previously overlooked direction for solving VLN tasks and presents promising results for enhancing the performance of existing navigation agents.
References
- An et al. (2021) Dong An, Yuankai Qi, Yan Huang, Qi Wu, Liang Wang, and Tieniu Tan. 2021. Neighbor-view Enhanced Model for Vision and Language Navigation. In ACM Int. Conf. Multimedia.
- An et al. (2022) Dong An, Yuankai Qi, Yangguang Li, Yan Huang, Liang Wang, Tieniu Tan, and Jing Shao. 2022. BEVBert: Topo-Metric Map Pre-training for Language-guided Navigation. arXiv preprint arXiv:2212.04385 (2022).
- Anderson et al. (2018) Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian D. Reid, Stephen Gould, and Anton van den Hengel. 2018. Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments. In IEEE Conf. Comput. Vis. Pattern Recog. 3674–3683.
- Carion et al. (2020) Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. 2020. End-to-end object detection with transformers. In Eur. Conf. Comput. Vis. 213–229.
- Chang et al. (2017) Angel X. Chang, Angela Dai, Thomas A. Funkhouser, Maciej Halber, Matthias Nießner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. 2017. Matterport3D: Learning from RGB-D Data in Indoor Environments. In International Conference on 3D Vision. 667–676.
- Chen et al. (2018) Jingyuan Chen, Xinpeng Chen, Lin Ma, Zequn Jie, and Tat-Seng Chua. 2018. Temporally grounding natural sentence in video. In EMNLP. 162–171.
- Chen et al. (2022a) Jinyu Chen, Chen Gao, Erli Meng, Qiong Zhang, and Si Liu. 2022a. Reinforced Structured State-Evolution for Vision-Language Navigation. In IEEE Conf. Comput. Vis. Pattern Recog. 15450–15459.
- Chen et al. (2021) Shizhe Chen, Pierre-Louis Guhur, Cordelia Schmid, and Ivan Laptev. 2021. History aware multimodal transformer for vision-and-language navigation. Adv. Neural Inform. Process. Syst. 34 (2021), 5834–5847.
- Chen et al. (2022b) Shizhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, and Ivan Laptev. 2022b. Learning from unlabeled 3d environments for vision-and-language navigation. In Eur. Conf. Comput. Vis. 638–655.
- Chen et al. (2022c) Shizhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, and Ivan Laptev. 2022c. Think global, act local: Dual-scale graph transformer for vision-and-language navigation. In IEEE Conf. Comput. Vis. Pattern Recog. 16537–16547.
- Cheng et al. (2022) Wenhao Cheng, Xingping Dong, Salman H. Khan, and Jianbing Shen. 2022. Learning Disentanglement with Decoupled Labels for Vision-Language Navigation. In Eur. Conf. Comput. Vis. 309–329.
- Deng et al. (2022) Chaorui Deng, Qi Wu, Qingyao Wu, Fuyuan Hu, Fan Lyu, and Mingkui Tan. 2022. Visual Grounding Via Accumulated Attention. IEEE Trans. Pattern Anal. Mach. Intell. 44, 3 (2022), 1670–1684. https://doi.org/10.1109/TPAMI.2020.3023438
- Deng et al. (2021) Jiajun Deng, Zhengyuan Yang, Tianlang Chen, Wengang Zhou, and Houqiang Li. 2021. Transvg: End-to-end visual grounding with transformers. In IEEE Conf. Comput. Vis. Pattern Recog. 1769–1779.
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT. 4171–4186.
- Dosovitskiy et al. (2021) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Int. Conf. Learn. Represent.
- Fried et al. (2018) Daniel Fried, Ronghang Hu, Volkan Cirik, Anna Rohrbach, Jacob Andreas, Louis-Philippe Morency, Taylor Berg-Kirkpatrick, Kate Saenko, Dan Klein, and Trevor Darrell. 2018. Speaker-Follower Models for Vision-and-Language Navigation. In Adv. Neural Inform. Process. Syst. 3318–3329.
- Fu et al. (2020) Tsu-Jui Fu, Xin Eric Wang, Matthew F Peterson, Scott T Grafton, Miguel P Eckstein, and William Yang Wang. 2020. Counterfactual Vision-and-Language Navigation via Adversarial Path Sampling. In Eur. Conf. Comput. Vis. 71–86.
- Gao et al. (2023) Chen Gao, Xingyu Peng, Mi Yan, He Wang, Lirong Yang, Haibing Ren, Hongsheng Li, and Si Liu. 2023. Adaptive Zone-Aware Hierarchical Planner for Vision-Language Navigation. In IEEE Conf. Comput. Vis. Pattern Recog. 14911–14920.
- Gao et al. (2017) Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. 2017. Tall: Temporal activity localization via language query. In IEEE Conf. Comput. Vis. Pattern Recog. 5267–5275.
- Guhur et al. (2021) Pierre-Louis Guhur, Makarand Tapaswi, Shizhe Chen, Ivan Laptev, and Cordelia Schmid. 2021. Airbert: In-Domain Pretraining for Vision-and-Language Navigation. In Int. Conf. Comput. Vis. 1634–1643.
- Hahn et al. (2019) Meera Hahn, Asim Kadav, James M Rehg, and Hans Peter Graf. 2019. Tripping through time: Efficient localization of activities in videos. arXiv preprint arXiv:1904.09936 (2019).
- Hao et al. (2020) Weituo Hao, Chunyuan Li, Xiujun Li, Lawrence Carin, and Jianfeng Gao. 2020. Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-Training. In IEEE Conf. Comput. Vis. Pattern Recog. 13134–13143.
- Hong et al. (2020) Yicong Hong, Cristian Rodriguez Opazo, Yuankai Qi, Qi Wu, and Stephen Gould. 2020. Language and Visual Entity Relationship Graph for Agent Navigation. In Adv. Neural Inform. Process. Syst.
- Hong et al. (2021) Yicong Hong, Qi Wu, Yuankai Qi, Cristian Rodriguez Opazo, and Stephen Gould. 2021. VLN↻BERT: A Recurrent Vision-and-Language BERT for Navigation. In IEEE Conf. Comput. Vis. Pattern Recog.
- Ilharco et al. (2019) Gabriel Ilharco, Vihan Jain, Alexander Ku, Eugene Ie, and Jason Baldridge. 2019. General Evaluation for Instruction Conditioned Navigation using Dynamic Time Warping. In Adv. Neural Inform. Process. Syst.
- Ke et al. (2019) Liyiming Ke, Xiujun Li, Yonatan Bisk, Ari Holtzman, Zhe Gan, Jingjing Liu, Jianfeng Gao, Yejin Choi, and Siddhartha S. Srinivasa. 2019. Tactical Rewind: Self-Correction via Backtracking in Vision-And-Language Navigation. In IEEE Conf. Comput. Vis. Pattern Recog. 6741–6749.
- Krantz et al. (2020) Jacob Krantz, Erik Wijmans, Arjun Majumdar, Dhruv Batra, and Stefan Lee. 2020. Beyond the Nav-Graph: Vision-and-Language Navigation in Continuous Environments. In Eur. Conf. Comput. Vis., Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm (Eds.), Vol. 12373. 104–120.
- Krishna et al. (2017) Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. 2017. Dense-captioning events in videos. In Int. Conf. Comput. Vis. 706–715.
- Ku et al. (2020) Alexander Ku, Peter Anderson, Roma Patel, Eugene Ie, and Jason Baldridge. 2020. Room-Across-Room: Multilingual Vision-and-Language Navigation with Dense Spatiotemporal Grounding. In EMNLP. 4392–4412.
- Li and Bansal (2023a) Jialu Li and Mohit Bansal. 2023a. Improving Vision-and-Language Navigation by Generating Future-View Image Semantics. In IEEE Conf. Comput. Vis. Pattern Recog. 10803–10812.
- Li and Bansal (2023b) Jialu Li and Mohit Bansal. 2023b. PanoGen: Text-Conditioned Panoramic Environment Generation for Vision-and-Language Navigation. arXiv preprint arXiv:2305.19195 (2023).
- Li et al. (2022) Jialu Li, Hao Tan, and Mohit Bansal. 2022. Envedit: Environment editing for vision-and-language navigation. In IEEE Conf. Comput. Vis. Pattern Recog. 15407–15417.
- Li et al. (2019) Xiujun Li, Chunyuan Li, Qiaolin Xia, Yonatan Bisk, Asli Çelikyilmaz, Jianfeng Gao, Noah A. Smith, and Yejin Choi. 2019. Robust Navigation with Language Pretraining and Stochastic Sampling. In EMNLP. 1494–1499.
- Liang et al. (2022a) Xiwen Liang, Fengda Zhu, Lingling Li, Hang Xu, and Xiaodan Liang. 2022a. Visual-Language Navigation Pretraining via Prompt-based Environmental Self-exploration. In ACL. 4837–4851.
- Liang et al. (2022b) Xiwen Liang, Fengda Zhu, Yi Zhu, Bingqian Lin, Bing Wang, and Xiaodan Liang. 2022b. Contrastive instruction-trajectory learning for vision-language navigation. In AAAI. 1592–1600.
- Lin et al. (2022b) Bingqian Lin, Yi Zhu, Zicong Chen, Xiwen Liang, Jianzhuang Liu, and Xiaodan Liang. 2022b. ADAPT: Vision-Language Navigation with Modality-Aligned Action Prompts. In IEEE Conf. Comput. Vis. Pattern Recog. 15396–15406.
- Lin et al. (2022a) Chuang Lin, Yi Jiang, Jianfei Cai, Lizhen Qu, Gholamreza Haffari, and Zehuan Yuan. 2022a. Multimodal Transformer with Variable-Length Memory for Vision-and-Language Navigation. In Eur. Conf. Comput. Vis. 380–397.
- Lin et al. (2017) Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. 2017. Focal loss for dense object detection. In IEEE Conf. Comput. Vis. Pattern Recog. 2980–2988.
- Liu et al. (2021) Chong Liu, Fengda Zhu, Xiaojun Chang, Xiaodan Liang, Zongyuan Ge, and Yi-Dong Shen. 2021. Vision-language navigation with random environmental mixup. In IEEE Conf. Comput. Vis. Pattern Recog. 1644–1654.
- Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
- Loshchilov and Hutter (2019) Ilya Loshchilov and Frank Hutter. 2019. Decoupled Weight Decay Regularization. In Int. Conf. Learn. Represent.
- Ma et al. (2019a) Chih-Yao Ma, Jiasen Lu, Zuxuan Wu, Ghassan AlRegib, Zsolt Kira, Richard Socher, and Caiming Xiong. 2019a. Self-Monitoring Navigation Agent via Auxiliary Progress Estimation. In Int. Conf. Learn. Represent.
- Ma et al. (2019b) Chih-Yao Ma, Zuxuan Wu, Ghassan AlRegib, Caiming Xiong, and Zsolt Kira. 2019b. The Regretful Agent: Heuristic-Aided Navigation Through Progress Estimation. In IEEE Conf. Comput. Vis. Pattern Recog. 6732–6740.
- Milletari et al. (2016) Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. 2016. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In International Conference on 3D Vision. 565–571.
- Mun et al. (2020) Jonghwan Mun, Minsu Cho, and Bohyung Han. 2020. Local-global video-text interactions for temporal grounding. In IEEE Conf. Comput. Vis. Pattern Recog. 10810–10819.
- Qi et al. (2021) Yuankai Qi, Zizheng Pan, Yicong Hong, Ming-Hsuan Yang, Anton van den Hengel, and Qi Wu. 2021. The Road to Know-Where: An Object-and-Room Informed Sequential BERT for Indoor Vision-Language Navigation. In Int. Conf. Comput. Vis. 1655–1664.
- Qi et al. (2020a) Yuankai Qi, Zizheng Pan, Shengping Zhang, Anton van den Hengel, and Qi Wu. 2020a. Object-and-Action Aware Model for Visual Language Navigation. In Eur. Conf. Comput. Vis. 303–317.
- Qi et al. (2020b) Yuankai Qi, Qi Wu, Peter Anderson, Xin Wang, William Yang Wang, Chunhua Shen, and Anton van den Hengel. 2020b. REVERIE: Remote Embodied Visual Referring Expression in Real Indoor Environments. In IEEE Conf. Comput. Vis. Pattern Recog. 9979–9988.
- Qiao et al. (2021) Yanyuan Qiao, Chaorui Deng, and Qi Wu. 2021. Referring Expression Comprehension: A Survey of Methods and Datasets. IEEE Trans. Multimedia 23 (2021), 4426–4440. https://doi.org/10.1109/TMM.2020.3042066
- Qiao et al. (2022) Yanyuan Qiao, Yuankai Qi, Yicong Hong, Zheng Yu, Peng Wang, and Qi Wu. 2022. HOP: history-and-order aware pre-training for vision-and-language navigation. In IEEE Conf. Comput. Vis. Pattern Recog. 15418–15427.
- Qiao et al. (2023) Yanyuan Qiao, Yuankai Qi, Yicong Hong, Zheng Yu, Peng Wang, and Qi Wu. 2023. HOP+: History-enhanced and Order-aware Pre-training for Vision-and-Language Navigation. IEEE Trans. Pattern Anal. Mach. Intell. (2023).
- Rodriguez et al. (2020) Cristian Rodriguez, Edison Marrese-Taylor, Fatemeh Sadat Saleh, Hongdong Li, and Stephen Gould. 2020. Proposal-free temporal moment localization of a natural-language query in video using guided attention. In WACV. 2464–2473.
- Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Li Fei-Fei. 2015. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis. 115, 3 (2015), 211–252.
- Su et al. (2021) Rui Su, Qian Yu, and Dong Xu. 2021. Stvgbert: A visual-linguistic transformer based framework for spatio-temporal video grounding. In IEEE Conf. Comput. Vis. Pattern Recog. 1533–1542.
- Tan et al. (2019) Hao Tan, Licheng Yu, and Mohit Bansal. 2019. Learning to Navigate Unseen Environments: Back Translation with Environmental Dropout. In NAACL-HLT. 2610–2621.
- Tang et al. (2021) Zongheng Tang, Yue Liao, Si Liu, Guanbin Li, Xiaojie Jin, Hongxu Jiang, Qian Yu, and Dong Xu. 2021. Human-centric spatio-temporal video grounding with visual transformers. IEEE Trans. Circuit Syst. Video Technol. (2021).
- Thomason et al. (2019) Jesse Thomason, Michael Murray, Maya Cakmak, and Luke Zettlemoyer. 2019. Vision-and-Dialog Navigation. In CoRL. 394–406.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Adv. Neural Inform. Process. Syst. 5998–6008.
- Wang et al. (2022a) Hanqing Wang, Wei Liang, Luc V Gool, and Wenguan Wang. 2022a. Towards versatile embodied navigation. Adv. Neural Inform. Process. Syst. 35 (2022), 36858–36874.
- Wang et al. (2022b) Hanqing Wang, Wei Liang, Jianbing Shen, Luc Van Gool, and Wenguan Wang. 2022b. Counterfactual cycle-consistent learning for instruction following and generation in vision-language navigation. In IEEE Conf. Comput. Vis. Pattern Recog. 15471–15481.
- Wang et al. (2021) Hanqing Wang, Wenguan Wang, Wei Liang, Caiming Xiong, and Jianbing Shen. 2021. Structured scene memory for vision-language navigation. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition. 8455–8464.
- Wang et al. (2020) Hanqing Wang, Wenguan Wang, Tianmin Shu, Wei Liang, and Jianbing Shen. 2020. Active Visual Information Gathering for Vision-Language Navigation. In Eur. Conf. Comput. Vis.
- Wang et al. (2019a) Weining Wang, Yan Huang, and Liang Wang. 2019a. Language-driven temporal activity localization: A semantic matching reinforcement learning model. In IEEE Conf. Comput. Vis. Pattern Recog. 334–343.
- Wang et al. (2019b) Xin Wang, Qiuyuan Huang, Asli Çelikyilmaz, Jianfeng Gao, Dinghan Shen, Yuan-Fang Wang, William Yang Wang, and Lei Zhang. 2019b. Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation. In IEEE Conf. Comput. Vis. Pattern Recog. 6629–6638.
- Wang et al. (2023b) Xiaohan Wang, Wenguan Wang, Jiayi Shao, and Yi Yang. 2023b. LANA: A Language-Capable Navigator for Instruction Following and Generation. In IEEE Conf. Comput. Vis. Pattern Recog. 19048–19058.
- Wang et al. (2023a) Zun Wang, Jialu Li, Yicong Hong, Yi Wang, Qi Wu, Mohit Bansal, Stephen Gould, Hao Tan, and Yu Qiao. 2023a. Scaling Data Generation in Vision-and-Language Navigation. arXiv preprint arXiv:2307.15644 (2023).
- Xu et al. (2019) Huijuan Xu, Kun He, Bryan A Plummer, Leonid Sigal, Stan Sclaroff, and Kate Saenko. 2019. Multilevel language and vision integration for text-to-clip retrieval. In AAAI. 9062–9069.
- Yamaguchi et al. (2017) Masataka Yamaguchi, Kuniaki Saito, Yoshitaka Ushiku, and Tatsuya Harada. 2017. Spatio-temporal person retrieval via natural language queries. In Int. Conf. Comput. Vis.
- Yang et al. (2022) Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, and Cordelia Schmid. 2022. TubeDETR: Spatio-Temporal Video Grounding With Transformers. In IEEE Conf. Comput. Vis. Pattern Recog. 16442–16453.
- Yu et al. (2018) Licheng Yu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, and Tamara L. Berg. 2018. MAttNet: Modular Attention Network for Referring Expression Comprehension. In IEEE Conf. Comput. Vis. Pattern Recog. 1307–1315.
- Yuan et al. (2019a) Yitian Yuan, Lin Ma, Jingwen Wang, Wei Liu, and Wenwu Zhu. 2019a. Semantic conditioned dynamic modulation for temporal sentence grounding in videos. Adv. Neural Inform. Process. Syst. 32 (2019).
- Yuan et al. (2019b) Yitian Yuan, Tao Mei, and Wenwu Zhu. 2019b. To find where you talk: Temporal sentence localization in video with attention based location regression. In AAAI. 9159–9166.
- Zhang et al. (2019) Da Zhang, Xiyang Dai, Xin Wang, Yuan-Fang Wang, and Larry S Davis. 2019. Man: Moment alignment network for natural language moment retrieval via iterative graph adjustment. In IEEE Conf. Comput. Vis. Pattern Recog. 1247–1257.
- Zhang et al. (2020) Zhu Zhang, Zhou Zhao, Yang Zhao, Qi Wang, Huasheng Liu, and Lianli Gao. 2020. Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences. In IEEE Conf. Comput. Vis. Pattern Recog.
- Zhao et al. (2022) Yusheng Zhao, Jinyu Chen, Chen Gao, Wenguan Wang, Lirong Yang, Haibing Ren, Huaxia Xia, and Si Liu. 2022. Target-driven structured transformer planner for vision-language navigation. In ACM Int. Conf. Multimedia. 4194–4203.
- Zhou et al. (2023) Gengze Zhou, Yicong Hong, and Qi Wu. 2023. NavGPT: Explicit Reasoning in Vision-and-Language Navigation with Large Language Models. arXiv preprint arXiv:2305.16986 (2023).
- Zhou et al. (2018) Luowei Zhou, Chenliang Xu, and Jason Corso. 2018. Towards automatic learning of procedures from web instructional videos. In AAAI.
- Zhu et al. (2021) Fengda Zhu, Xiwen Liang, Yi Zhu, Qizhi Yu, Xiaojun Chang, and Xiaodan Liang. 2021. SOON: Scenario Oriented Object Navigation With Graph-Based Exploration. In IEEE Conf. Comput. Vis. Pattern Recog. 12689–12699.
Appendix
Appendix A Additional Results
RxR benchmark. The Room-Across-Room (RxR) (Ku et al., 2020) dataset is a large multilingual VLN dataset that comprises a vast collection of elaborate instructions and trajectories, including instructions in three different languages: English, Hindi, and Telugu. The dataset emphasizes the role of language in VLN by addressing biases in paths and describing more visible entities than R2R (Anderson et al., 2018). We construct 427,347 trajectories for RxR using the same method mentioned in the main paper.
Evaluation Metrics. For the RxR (Ku et al., 2020), Normalized Dynamic Time Warping (nDTW) (Ilharco et al., 2019) and Success Rate weighted by Dynamic Time Warping (sDTW) are introduced in addition to the SR and SPL metrics mentioned in R2R (Anderson et al., 2018), to assess path fidelity.
Results on RxR. The results are presented in Table 8. By applying our method to HAMT (Chen et al., 2021), we achieve a 3.2% improvement in success rate on the RxR validation unseen split. A larger improvement is observed on the validation seen split: 3.8%. This demonstrates that our method can improve the navigation success of the VLN agent.
Methods | RxR Val Seen | | | | | RxR Val Unseen | | | | |
---|---|---|---|---|---|---|---|---|---|---|
 | SR | OSR | SPL | nDTW | sDTW | SR | OSR | SPL | nDTW | sDTW |
Baseline (Ku et al., 2020) | 25.2 | - | - | 42.2 | 20.7 | 22.8 | - | - | 38.9 | 18.2 |
HAMT (Chen et al., 2021) | 59.4 | 66.5 | 58.9 | 65.3 | 50.9 | 56.6 | 64.4 | 56.0 | 63.1 | 48.3 |
Ours | 63.2(3.8) | 66.5 | 58.1 | 66.9 | 52.7 | 59.8(3.2) | 64.4 | 55.2 | 63.9 | 49.2 |
Ours♠ | 63.2(3.8) | 66.1 | 60.2 | 67.2 | 53.0 | 59.8(3.2) | 63.5 | 56.7 | 64.4 | 49.6 |

Appendix B Additional Ablations
 | R2R Val Unseen | HOP (Qiao et al., 2022) | SIG_HAMT (Li and Bansal, 2023a) | DUET (Chen et al., 2022c) | SIG_DUET (Li and Bansal, 2023a) |
---|---|---|---|---|---|
0 | Keeping SR | 98.12% (62.32) | 98.56% (67.09) | 98.75% (70.63) | 98.35% (71.22) |
1 | Correcting Gap | 67.08% (4.60) | 58.52% (3.36) | 51.67% (3.96) | 62.15% (4.68) |
2 | Final SR | 66.92 (62.32+4.6) | 70.46 (67.09+3.36) | 74.58(70.63+3.96) | 75.9 (71.22+4.68) |
Accuracy in identifying the correct destination. The accuracy of our model in identifying the correct destination, on top of each baseline on the R2R Val Unseen split, is presented in Table 9. "Keeping SR" denotes the percentage of episodes that remain successful after our correction. "Correcting Gap" denotes the percentage of episodes that previously failed but become successful after our correction.
Methods | Input Text | R2R Val Unseen | | |
---|---|---|---|---|
 | | SR | OSR | Gap |
HOP (Qiao et al., 2022) | - | 63.52 | 71.48 | 7.96 |
+ Ours (1) | Target Instruction | 66.39 | 71.48 | 5.09 |
+ Ours (2) | Whole Instruction | 66.92 | 71.48 | 4.56 |
Can Text [CLS] Encode Destination Information? To investigate whether the text [CLS] token (Section 3.1, Text Encoder, in the main paper) can encode the destination of an instruction after fine-tuning, we conduct an ablation study, presented in Table 10. Specifically, we compare the model performance when feeding the full instruction versus the sub-instruction that only contains information related to the target location. The results show that the performances are very close, with a small gap of 0.6%, suggesting that the [CLS] token can learn the destination information of an instruction.
Appendix C Constructed Trajectories
We provide examples of constructed new trajectories based on ground truth trajectory. Figure 4 (a) illustrates an original ground truth trajectory, where the target viewpoint is represented by a pentagram. Figure 4 (b) and (c) present examples of new trajectories that are constructed based on the original trajectory.