
Cross-modal Map Learning for Vision and Language Navigation

Georgios Georgakis, Karl Schmeckpeper, Karan Wanchoo, Soham Dan,
Eleni Miltsakaki, Dan Roth, Kostas Daniilidis
University of Pennsylvania
{ggeorgak,karls,kwanchoo,sohamdan,elenimi,danroth,kostas}@seas.upenn.edu
Project webpage: https://ggeorgak11.github.io/CM2-project/
Abstract

We consider the problem of Vision-and-Language Navigation (VLN). The majority of current methods for VLN are trained end-to-end using either unstructured memory such as LSTM, or cross-modal attention over the egocentric observations of the agent. In contrast to other works, our key insight is that the association between language and vision is stronger when it occurs in explicit spatial representations. In this work, we propose a cross-modal map learning model for vision-and-language navigation that first learns to predict the top-down semantics on an egocentric map for both observed and unobserved regions, and then predicts a path towards the goal as a set of waypoints. In both cases, the prediction is informed by the language through cross-modal attention mechanisms. We experimentally test the basic hypothesis that language-driven navigation can be solved given a map, and then show competitive results on the full VLN-CE benchmark.

1 Introduction

For mobile robots to be able to operate together with humans, they must be able to execute tasks that are defined not in the form of machine-readable scripts but rather in the form of human instructions. A very basic but challenging task is going from A to B. While robots have been quite successful in executing this task using metric representations, it has been more challenging for robots to execute semantic tasks like “go to the kitchen sink” or follow instructions that describe a path and associate actions with natural language, defined as the Vision-and-Language Navigation (VLN) task [4, 32, 33]. In VLN, the robot is given instructions and has to reach a goal making use of images of the environment that it can acquire along the way.

The dominant approach for VLN tasks has been end-to-end pipelines from images and instructions to actions [23, 17, 31, 32]. While attractive due to their simplicity, these pipelines are expected to implicitly learn all navigation components, such as mapping, planning, and control, and thus often require considerable amounts of training data. This approach to designing navigation systems is in direct contrast to research on human spatial navigation, which has shown that humans and other species build map-like representations of the environment to accomplish way-finding [41, 52]. However, multiple findings have shown that the ability to build cognitive maps and acquire spatial knowledge deteriorates when humans exclusively follow ready-made driving or walking directions to a goal [6]. On the other hand, studies have shown that humans build better spatial representations when presented with landmark-based navigation instructions rather than full paths [55]; such spatial representations enable the recall of landmarks on an egocentric map weeks after the experiment. While this does not prove that humans build a map during wayfinding when following semantic instructions, it is a strong indication that they can anchor landmarks and other semantics to a map that they easily recall. Research on learning mapping and planning in computer vision and robotics [22] has likewise shown that semantic maps naturally emerge when navigation systems are trained end-to-end.

Refer to caption
Figure 1: We approach the task of vision-and-language navigation as a two-stage procedure which learns to semantically and spatially ground the instruction on egocentric maps.

We propose Cross-modal Map Learning (CM2), a novel navigation system for the VLN task in continuous environments that learns a language-informed representation for both map and trajectory prediction by applying cross-modal attention twice, hence CM2. Our method decomposes the problem into the two subtasks of semantic and spatial grounding, as illustrated in Figure 1. First, we use a cross-modal attention network to semantically ground the instruction through an egocentric map prediction task that learns to hallucinate information outside the field-of-view of the agent. This is followed by another cross-modal attention network that is responsible for spatially grounding the instruction by learning to predict the path on the egocentric map. Our analysis shows that through these two subtasks, the attended representations learn to focus on instruction-relevant objects and locations on the map.

The main difference between our method and existing image-language attentional mechanisms that generate actions is that in our approach, the robot builds a cognitive map that encodes the environmental priors and follows instructions based on this map. The motivation to use this representation is based on our finding that when the robot is given a local ground-truth (“correct”) map of the environment, it outperforms all approaches on the VLN task by a large margin. This map is still local, more like a crop of the blueprint than a global map of the environment, but it can still hallucinate what is behind the walls so that the map can be better aligned with the language instruction. This differentiates us from approaches such as [12] that first build a topological map of the environment by exploring the whole scene and then execute the task with access to a global map. We further argue that by learning the layout priors through cross-modal attention, we can leverage the spatial and semantic descriptions from natural language and decrease the uncertainty over the hallucinated areas. As opposed to recent work [31] that outputs a single waypoint, we learn to predict the whole trajectory, while our waypoints are determined by the alignment between language and egocentric maps rather than the distance to the goal.

In summary, our contributions are as follows:

  • A novel system for the VLN task that learns maps as an explicit intermediate representation.

  • Semantic grounding of language to those maps by applying cross-modal attention when learning to predict semantic maps.

  • Spatial grounding of instructions when learning to predict paths by applying cross-modal attention on semantic maps and language.

  • An analysis over the learned representation that demonstrates the effectiveness of using egocentric maps for the VLN task.

  • Competitive results on the VLN-CE [32] dataset against current state-of-the-art methods.

2 Related Work

Vision-and-Language Navigation. The problem of instruction following for navigation has drawn significant attention in a wide range of domains. These include Google Street View panoramas [11], simulated environments for quadcopters [5], multilingual settings [33], interactive vision-dialogue setups [60], real-world scenes [3], and realistic simulations of indoor scenes [4]. Most relevant to our work is the literature on the Vision-and-Language Navigation (VLN) task, initially defined in [4] on navigation graphs (R2R) in the Matterport3D [8] dataset and later converted to continuous environments in [32] (VLN-CE). Arguably, the biggest challenges in VLN are grounding the natural language to the visual input while keeping track of which part of the instruction has been completed. To address these issues, many methods rely on unstructured memory such as LSTM for visual-textual alignment [37, 28, 17, 14], or have dedicated progress-monitor modules [38, 37]. Other approaches formulate instruction following as a Bayesian tracking problem [2], or learn to decompose and execute the instructions in short steps [59]. Another line of work [26, 44, 42, 23, 12, 31, 46, 21, 39] makes use of attention mechanisms and adapts powerful language models such as BERT [15] and transformer networks [53] to the VLN task. For instance, Chen et al. [12] learn the association between instructions and nodes on a prebuilt topological map of the environment, while Krantz et al. [31] learn to predict waypoints from panoramic images and investigate the prediction in different action spaces. In contrast to all these works, our method learns to associate the language and egocentric observations at the semantic level with 2D spatial representations, followed by path prediction.

Cross-modal attention. The transformer architecture [53] has been extremely successful in language [15], speech [16], vision [30], and multimodal applications [27]. A key feature of the transformer architecture is the attention mechanism. Cross-modal transformers have been widely used for vision-language tasks such as visual question answering and beyond, including joint video and language understanding [50]. Additionally, there have been investigations into whether multimodal transformers learn interpretable relations between the two modalities by analyzing the cross-modal attention heads, as studied in VisualBERT [34] and in a cross-modal self-attention network for referring image segmentation [56]. Prior works have trained cross-modal transformers in two ways: 1) a single-stream design, where the multimodal inputs (for example, word embeddings and image regions) are fed into a single transformer architecture; examples are UNITER [13], VL-BERT [49], and VisualBERT [34]. 2) a multi-stream design, where the individual modalities are encoded separately via self-attention and a cross-modal representation is then learned by the transformer; examples are LXMERT [51], ViLBERT [36], and [58]. In this work, we adapt the multi-stream design for vision-and-language navigation using egocentric maps. We also investigate the cross-modal attention heads and decoder representation of the transformer for interpretable patterns.

Map Prediction in Navigation. Modular approaches using different types of spatial representations have been successful in multiple navigation tasks, whether they focused on occupancy [22, 45, 10, 29, 18] or semantic map prediction [9, 19, 40, 35, 7, 20]. For example, Gupta et al. [22] learn a differentiable mapper for predicting top-down egocentric maps that are trained end-to-end with a differentiable planner, while Cartillier et al. [7] learn to build top-down allocentric maps from egocentric RGB-D observations. Several recent works go beyond traditional mapping and learn to predict information outside the field-of-view of the agent [45, 19, 40, 35]. The work of [45] learns to hallucinate occupancy layouts in indoor environments, while [19] extends the prediction to semantic classes and uses information gain objectives to increase the performance of the predictor. Our approach expands upon this last set of methods by presenting a language-informed model that attempts to hallucinate missing information using cues from both language and currently observed regions.

3 Approach

Refer to caption
Figure 2: We propose an approach to predict egocentric semantic maps and paths described by natural language instructions. At the core of our method are two cross-modal attention modules that learn language-informed representations to facilitate both the hallucination of semantics over unobserved areas as well as the prediction of a set of waypoints that the agent needs to follow to reach the goal. The components colored in blue refer to the map prediction part (Sec. 3.4) of our model, the ones in orange correspond to the path prediction (Sec. 3.3), and the yellow boxes are the losses.

3.1 Problem setup

We address instruction-following navigation in indoor environments, where natural language instructions implicitly describe a specific path and goal location in the environment that an agent needs to follow. In particular, we consider the setup described in Vision-and-Language Navigation in Continuous Environments (VLN-CE) [32], which was adapted from the Room-to-Room (R2R) [4] dataset from pre-specified navigation graphs to continuous 3D environments. VLN-CE uses the Habitat [48] simulator on the Matterport3D [8] scenes, offers more realistic settings, and is much more challenging [32] than the original R2R. During a VLN-CE navigation episode, the agent has access to egocentric RGB-D observations at a resolution of $256\times 256$ with a horizontal field-of-view of $90^{\circ}$. In contrast to other recent methods [31, 12], we assume the agent observes a frame with a limited field-of-view at each time-step (not panoramas). The action space is defined over a discrete set of actions consisting of MOVE_FORWARD by $0.25m$, TURN_LEFT and TURN_RIGHT by $15^{\circ}$, and STOP, without actuation noise. Recently, the work of [31] demonstrated higher performance when continuous-space actions are considered; however, we keep the action set discrete to remain consistent with prior work on VLN-CE.

3.2 Overview of our approach

We propose a method for Vision-and-Language Navigation involving path prediction over predicted semantic egocentric 2D maps. Our argument for this approach is threefold. First, an egocentric map offers a natural representation for grounding spatial and semantic concepts from natural language instructions. Second, a VLN method should take advantage of the knowledge over semantic and spatial layouts as they offer a strong prior over possible trajectories. Third, the language instruction provides a semantic description of a trajectory through the environment, which could be leveraged to improve map predictions.

Given the instruction, our method learns to predict the entire path defined as a set of waypoints on an egocentric local map at every step of the episode (Sec. 3.3). The agent then localizes itself on the current predicted path and chooses the following waypoint on the path as a short-term goal. This goal is then passed to an off-the-shelf local policy (DD-PPO [54]) which predicts the next navigation action. We assume that we have access to ground-truth pose as provided by the simulator to facilitate DD-PPO. We note that estimating the pose from noisy sensor readings is out of the scope of this work, and point to visual odometry methods [57] that can adapt DD-PPO agents to such a setting.

To obtain the egocentric map we define a language-informed two-stage semantic map predictor that learns to hallucinate the semantics in the unobserved areas (Sec. 3.4). An overview of our method is shown in Figure 2. In the following two paragraphs we briefly describe the common input encoding procedures between different components of our method.

Instruction Encoding.

We use a pretrained Bidirectional Encoder Representations from Transformers (BERT) [15] model, which is a multi-layer transformer [53], to extract a feature vector for each word in the instruction. The overall feature representation for the instruction $X^{\prime}\in\mathbb{R}^{M\times d^{\prime}}$ is passed through a fully-connected layer to obtain the final representation $X\in\mathbb{R}^{M\times d}$, where $M$ is the number of words in the instruction, $d^{\prime}=768$ is the default feature dimension of BERT, and $d=128$ is the feature dimension we use throughout our method. During training we only finetune the last layer of BERT.
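As a concrete illustration, a minimal PyTorch sketch of this encoding is shown below. It assumes the HuggingFace transformers interface; the module name, the projection layer, and the choice to freeze all but the last BERT encoder layer are our own assumptions about how the fine-tuning described above could be realized.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class InstructionEncoder(nn.Module):
    """Encodes an instruction into per-word features X of shape (M, d). Sketch only."""
    def __init__(self, d: int = 128):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")   # d' = 768
        # Freeze everything except the last encoder layer (assumed fine-tuning scheme).
        for name, p in self.bert.named_parameters():
            p.requires_grad = "encoder.layer.11" in name
        self.proj = nn.Linear(self.bert.config.hidden_size, d)        # d' -> d

    def forward(self, token_ids, attention_mask):
        out = self.bert(input_ids=token_ids, attention_mask=attention_mask)
        x_prime = out.last_hidden_state                                # (B, M, 768)
        return self.proj(x_prime)                                      # (B, M, 128)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer("Walk past the couch and stop at the kitchen sink.",
                   return_tensors="pt")
X = InstructionEncoder()(tokens.input_ids, tokens.attention_mask)      # (1, M, 128)
```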

Egocentric Map Encoding.

Our network encodes an input egocentric semantic map $s\in\mathbb{R}^{h^{\prime}\times w^{\prime}\times c}$ with a truncated ResNet18 [24] as $Y=Enc\left(s\right)$, where $h^{\prime}$, $w^{\prime}$, $c$ are the height, width, and number of semantic classes, respectively. The ResNet18 initially produces a feature representation $Y^{\prime}\in\mathbb{R}^{h\times w\times d}$ ($h=\frac{h^{\prime}}{16}$, $w=\frac{w^{\prime}}{16}$), which is then reshaped to $Y\in\mathbb{R}^{N\times d}$ ($N=h\times w$). One of these modules encodes the ground-projected RGB-D observations for the map predictor (Sec. 3.4) and a separate module is used to encode the predicted semantic map for the path predictor (Sec. 3.3).
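A minimal sketch of such an encoder is given below, assuming PyTorch and torchvision. Truncating ResNet18 after its third stage (overall stride 16) and projecting its 256 channels to $d$ with a 1×1 convolution are our assumptions consistent with the stated downsampling factor.

```python
import torch
import torch.nn as nn
import torchvision

class MapEncoder(nn.Module):
    """Truncated ResNet18 encoder for egocentric maps: (B, c, h', w') -> (B, N, d)."""
    def __init__(self, c: int = 27, d: int = 128):
        super().__init__()
        resnet = torchvision.models.resnet18()
        # Accept c-channel semantic maps instead of 3-channel RGB.
        resnet.conv1 = nn.Conv2d(c, 64, kernel_size=7, stride=2, padding=3, bias=False)
        # Keep stages up to layer3 -> total stride of 16 (h = h'/16, w = w'/16).
        self.backbone = nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu,
                                      resnet.maxpool, resnet.layer1,
                                      resnet.layer2, resnet.layer3)
        self.proj = nn.Conv2d(256, d, kernel_size=1)    # 256 ResNet channels -> d

    def forward(self, s):
        y = self.proj(self.backbone(s))                 # (B, d, h, w)
        return y.flatten(2).transpose(1, 2)             # (B, N, d), N = h * w

enc = MapEncoder()
Y = enc(torch.zeros(1, 27, 192, 192))                   # -> (1, 144, 128), since 192/16 = 12
```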

3.3 Cross-modal attention for path prediction

The cross-modal attention for path prediction module takes as input the instruction representation $X$ and the egocentric map encoding $Y$ and formulates the path prediction problem as a waypoint localization task. In order to learn a grounded representation of the natural language instruction on the egocentric map, we define a cross-modal attention module following the architecture of the self-attention transformer model [53]. While it is common to concatenate the representations of the two modalities and then use self-attention (as in VisualBERT [34]), we follow the example of LXMERT [51] and treat the two modalities separately: the egocentric map serves as the query and the instruction as the key and value. The idea is that during an episode the language instruction remains the same while the egocentric map changes at each time-step and is used to query the model for the path.

Specifically, given the egocentric map feature representation $Y_{t}^{s}=Enc\left(s_{t}\right)$ at time $t$ during a VLN episode, and the instruction features $X$, we use the scaled dot-product attention:

$Q=Y_{t}^{s}W_{q},\quad K=XW_{k},\quad V=XW_{v}$   (1)
$H_{t}^{s}=Softmax\left(\frac{QK^{T}}{\sqrt{d}}\right)V$   (2)

where $W_{q}$, $W_{k}$, and $W_{v}\in\mathbb{R}^{d\times d}$ are learned parameter matrices, and $H_{t}^{s}\in\mathbb{R}^{N\times d}$ is the attended representation over the egocentric semantic map regions. In practice, this architecture [53] first applies self-attention to each modality, followed by the cross-modal attention.
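For concreteness, a single-head PyTorch sketch of Eqs. (1)-(2) follows; the per-modality self-attention mentioned above, multi-head splitting, and any normalization layers are omitted.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Scaled dot-product attention where map features query the instruction."""
    def __init__(self, d: int = 128):
        super().__init__()
        self.Wq = nn.Linear(d, d, bias=False)
        self.Wk = nn.Linear(d, d, bias=False)
        self.Wv = nn.Linear(d, d, bias=False)
        self.d = d

    def forward(self, Y, X):
        # Y: (B, N, d) egocentric map features, X: (B, M, d) instruction features
        Q, K, V = self.Wq(Y), self.Wk(X), self.Wv(X)
        attn = torch.softmax(Q @ K.transpose(1, 2) / self.d ** 0.5, dim=-1)  # (B, N, M)
        return attn @ V                                                      # (B, N, d)
```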

We define the path as a set of 2D waypoints $\{p_{t}^{i}\}_{i=1}^{k}$ situated on the egocentric map. The first and last waypoints always represent the starting position and the final goal position, respectively. During training we sample the remaining waypoints from the ground-truth path associated with the instruction. These are used to construct ground-truth heatmaps $\mathcal{P}_{t}\in\mathbb{R}^{k\times u\times v}$ based on a 2D Gaussian centered at each waypoint with $\sigma=1$, where $u$, $v$ are the height and width of the heatmap. We predict the entire path at every time-step given the entire instruction. This can cause ambiguity in the placement of waypoints relative to the agent's current pose, since the agent has no knowledge of how much of the path has been covered at a given time-step. In other words, if the agent is half-way through the path then the model should learn to predict both backward and forward waypoints along the path, as opposed to predicting only forward waypoints at the beginning of the episode. We mitigate this issue in two ways. First, the path prediction is conditioned on the starting position heatmap $\mathcal{P}_{t}^{0}$ relative to the current agent's pose. Second, we add an auxiliary loss that trains the model to predict, for each waypoint, a probability $\hat{\xi}_{t}^{i}$ that it has already been traversed. We empirically found this auxiliary loss to help the learning process.
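As an illustration, ground-truth heatmaps of this form could be constructed as in the sketch below; the coordinate convention and the peak normalization are our assumptions.

```python
import torch

def waypoint_heatmaps(waypoints, u=24, v=24, sigma=1.0):
    """Build ground-truth heatmaps P_t of shape (k, u, v) from k waypoint locations.
    Each channel is a 2D Gaussian (sigma = 1) centered at its waypoint."""
    ys = torch.arange(u).view(u, 1).float()
    xs = torch.arange(v).view(1, v).float()
    maps = []
    for (wy, wx) in waypoints:                       # waypoint in heatmap coordinates
        g = torch.exp(-((ys - wy) ** 2 + (xs - wx) ** 2) / (2 * sigma ** 2))
        maps.append(g / g.max())                     # peak normalized to 1 (assumption)
    return torch.stack(maps)                         # (k, u, v)

P_t = waypoint_heatmaps([(12.0, 12.0), (10.0, 15.0), (6.0, 18.0)])
```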

The waypoint predictor model is defined as an encoder-decoder UNet [47] $f$, that takes as inputs the instruction-attended representation of the egocentric map regions $H_{t}^{s}$ and the starting position $\mathcal{P}_{t}^{0}$:

$\hat{\mathcal{P}}_{t},\hat{\xi}_{t}=f\left(H_{t}^{s},\mathcal{P}_{t}^{0}\right).$   (3)

We train the waypoint prediction with the following loss:

$L_{wp}=\sum_{i=1}^{k}b^{i}_{t}||\hat{\mathcal{P}}_{t}^{i}-\mathcal{P}_{t}^{i}||^{2}_{2}-\lambda_{\xi}\xi_{t}^{i}\log\hat{\xi}_{t}^{i}$   (4)

where $b^{i}_{t}$ is a binary indicator of whether waypoint $i$ is visible on the egocentric map at time $t$, and $\lambda_{\xi}$ weighs the auxiliary loss.
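A sketch of Eq. (4) for a single time-step, under our own tensor-shape conventions, is shown below; batching and the exact handling of the auxiliary term may differ in the actual implementation.

```python
import torch

def waypoint_loss(P_hat, P, xi_hat, xi, b, lambda_xi=1.0, eps=1e-8):
    """Sketch of Eq. (4). Shapes follow our own conventions: P_hat, P are (k, u, v)
    predicted / ground-truth heatmaps, while xi_hat, xi, b are (k,) tensors holding
    the traversal probability, traversal label, and visibility mask per waypoint."""
    heatmap_term = (b * ((P_hat - P) ** 2).sum(dim=(1, 2))).sum()
    aux_term = -(xi * torch.log(xi_hat + eps)).sum()   # auxiliary 'already traversed' term
    return heatmap_term + lambda_xi * aux_term
```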

3.4 Cross-modal attention for map prediction

We design a language-informed semantic map predictor for obtaining the egocentric semantic map $s_{t}$ from RGB-D observations. Given the often limited field-of-view of embodied agents, we are interested in hallucinating the semantic information in regions that the agent cannot directly observe. While different versions of this procedure have been attempted in the past [45, 19, 18], our key contribution is to learn the layout priors by leveraging the spatial and semantic descriptions from the instructions.

The map prediction is defined as a semantic segmentation task over the top-down egocentric map. Our model first takes as input the depth observation, which is ground-projected to an egocentric grid $o_{t}\in\mathbb{R}^{h^{\prime}\times w^{\prime}\times 3}$ containing the classes occupied, free, and void. For the ground-projection we first unproject the depth to a 3D point cloud using the camera intrinsic parameters and then map each 3D point to an $h^{\prime}\times w^{\prime}$ grid following the procedure described in [25]. Note that $o_{t}$ is an incomplete representation of the occupancy map around the agent, where all areas outside the field-of-view are considered unknown.
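A simplified NumPy sketch of this ground-projection step is shown below. The camera height, the obstacle height band, and placing the agent at the map center facing forward are illustrative assumptions rather than values taken from the paper or from [25].

```python
import numpy as np

def ground_project(depth, K, grid=192, cell=0.05, cam_height=1.25, obst_band=(0.2, 1.5)):
    """Ground-project a depth image into an egocentric grid with classes
    void (0) / free (1) / occupied (2). K = (fx, fy, cx, cy) camera intrinsics."""
    fx, fy, cx, cy = K
    H, W = depth.shape
    us, vs = np.meshgrid(np.arange(W), np.arange(H))       # pixel coordinates
    z = depth                                              # forward distance
    x = (us - cx) * z / fx                                 # right
    height = cam_height - (vs - cy) * z / fy               # height above the floor
    # Agent sits at the map center facing "up" (increasing forward distance).
    col = (grid // 2 + x / cell).astype(int)
    row = (grid // 2 - z / cell).astype(int)
    valid = (z > 0) & (col >= 0) & (col < grid) & (row >= 0) & (row < grid)
    o = np.zeros((grid, grid), dtype=np.int64)             # everything starts as void
    floor = valid & (height <= obst_band[0])
    obstacle = valid & (height > obst_band[0]) & (height < obst_band[1])
    o[row[floor], col[floor]] = 1                          # free
    o[row[obstacle], col[obstacle]] = 2                    # occupied (wins over free)
    return o
```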

We define a cross-modal attention module similar to the one in Sec. 3.3, where the feature representation $Y_{t}^{o}=Enc\left(o_{t}\right)$ serves as the query, while the instruction features $X$ are used as key and value. Following Eq. 1 and 2 (where $Y_{t}^{s}$ is replaced by $Y_{t}^{o}$) we obtain the attended representation $H_{t}^{o}$ over the incomplete egocentric map $o_{t}$. The prediction model includes two encoder-decoder UNet [47] models $g^{o}$, $g^{s}$ stacked together:

$\hat{o}_{t}=g^{o}\left(o_{t},H_{t}^{o}\right),\qquad\hat{s}_{t}=g^{s}\left(\hat{o}_{t},H_{t}^{o},\hat{\chi}_{t}\right)$   (5)

where $\hat{\chi}_{t}\in\mathbb{R}^{h^{\prime}\times w^{\prime}\times c}$ is a ground-projected semantic segmentation of the RGB frame. Note that $H_{t}^{o}$ is concatenated at the bottlenecks of both the $g^{o}$ and $g^{s}$ models. The model is trained with a pixel-wise cross-entropy loss on the occupancy and the semantic classes:

$L_{m}=-\sum_{q\in(s,o)}\sum_{k}\sum_{c}q_{k,c}\log\hat{q}_{k,c}$   (6)

where $k$ iterates over the pixels in the map and $q_{k,c}$ is the ground-truth label for pixel $k$. The ground-truth semantic maps are created from the available 3D semantic information in Matterport3D. The network that produces $\hat{\chi}$ is another UNet which is pre-trained separately from the rest of the model.
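For reference, a minimal sketch of the loss in Eq. (6), assuming logits from the two prediction heads and integer label maps, is:

```python
import torch.nn.functional as F

def map_loss(o_logits, s_logits, o_gt, s_gt):
    """Pixel-wise cross-entropy of Eq. (6), summed over the occupancy (3 classes) and
    semantic (c classes) heads. Shapes are our assumptions: o_logits (B, 3, h', w'),
    s_logits (B, c, h', w'), and o_gt / s_gt (B, h', w') integer label maps."""
    return (F.cross_entropy(o_logits, o_gt, reduction="sum")
            + F.cross_entropy(s_logits, s_gt, reduction="sum"))
```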

Overall learning objective. During training we add up all the losses from the path and map prediction modules:

$L=\lambda_{wp}L_{wp}+\lambda_{m}L_{m}$   (7)

where the $\lambda$s denote the corresponding loss weights. We then perform a single backward pass through the entire model.

Refer to caption
Figure 3: Navigation example using our method CM2 on a scene from val-unseen. The top row shows the RGB observations of the agent, while bottom shows the path prediction on the egocentric maps (the agent is in the middle looking upwards shown as the green circle). The red waypoints represent our path prediction at the particular time-step. Observe that the goal, shown as an orange star, is neither visible nor within the egocentric map at the beginning of the episode. The ground-truth map and path are depicted in the bottom left corner.

3.5 Controller

The method described so far outputs the path as a set of 2D waypoints $\{p_{t}^{i}\}_{i=1}^{k}$ on an egocentric map from an RGB-D observation. In order to follow this path towards the goal, at each time-step we designate a waypoint as a short-term goal, following:

$\zeta=1+\operatorname*{arg\,min}_{i}\Delta(\hat{p}_{t}^{i},\varrho_{t})$   (8)

where $\Delta$ is the Euclidean distance, $\hat{p}_{t}^{i}$ corresponds to the mode of the predicted waypoint heatmap $\hat{\mathcal{P}}_{t}^{i}$, and $\varrho_{t}$ is the agent's pose at time $t$. This effectively determines the predicted waypoint closest to the agent and selects the next one in the sequence as the short-term goal $p_{t}^{\zeta}$. In order to reach the short-term goal, we use the off-the-shelf deep reinforcement learning model DD-PPO [54] that is trained for the PointNav [1] task. DD-PPO receives the current depth observation and $p_{t}^{\zeta}$ and outputs the next navigation action for the agent. Finally, at any time during the episode the agent may decide on the STOP action when it is within a certain radius $\tau$ (m) of the final goal (last predicted waypoint) and the confidence of the goal in the predicted heatmap is above a threshold $\gamma$.
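A minimal sketch of this short-term goal selection (Eq. 8), assuming waypoint heatmaps and the agent pose expressed in map cells, is shown below.

```python
import numpy as np

def select_short_term_goal(heatmaps, agent_cell):
    """Find the predicted waypoint (heatmap mode) closest to the agent and return the
    next one in the sequence as the short-term goal, together with the final-goal
    confidence used by the STOP rule. Clamping at the last waypoint is our assumption
    for the end of the path."""
    k, u, v = heatmaps.shape
    modes = np.array([np.unravel_index(h.argmax(), (u, v)) for h in heatmaps])  # (k, 2)
    dists = np.linalg.norm(modes - np.asarray(agent_cell), axis=1)
    zeta = min(int(dists.argmin()) + 1, k - 1)      # next waypoint, clamped at the goal
    return modes[zeta], float(heatmaps[-1].max())
```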

4 Experiments

We conduct our experiments on the VLN-CE [32] dataset, which offers 16,844 path-instruction pairs over 90 visually realistic scenes in the Matterport3D [8] dataset. We follow the typical evaluation scenario and report results on scenes which were observed (val-seen) and not observed (val-unseen) during training. An episode is considered successful if the STOP decision is taken within $3m$ of the goal position, and the agent has a fixed budget of 500 steps to complete an episode. As mentioned before, the agent has access to egocentric RGB-D observations with a horizontal field-of-view of $90^{\circ}$. We perform three sets of experiments. First, we compare against other methods on the VLN-CE dataset including the held-out test set of the VLN-CE challenge (Sec. 4.1), followed by an ablation study (Sec. 4.2). Finally, we provide visual examples of the learned representation (Sec. 4.3). We use two main variations of our method. CM2 refers to our full pipeline that predicts both the egocentric map and path from RGB-D inputs, while CM2-GT refers to using the ground-truth egocentric map as input, effectively only performing path prediction. All egocentric maps used are local $192\times 192$ grids with each pixel corresponding to $5cm\times 5cm$. The map covers a square 9.6 meters on a side, leaving most of the scene unobserved. We provide code, trained models and instructions to reproduce our results: https://github.com/ggeorgak11/CM2. Implementation details along with additional experimental results are included in the appendix.

Method | Val-Seen: TL↓ NE↓ OS↑ SR↑ SPL↑ | Val-Unseen: TL↓ NE↓ OS↑ SR↑ SPL↑
Seq2Seq+PM+DA+Aug [32] 9.37 7.02 46.0 33.0 31.0 | 9.32 7.77 37.0 25.0 22.0
AG-CMTP* [12] - 6.60 56.2 35.9 30.5 | - 7.9 39.2 23.1 19.1
R2R-CMTP* [12] - 7.10 45.4 36.1 31.2 | - 7.9 38.0 26.4 22.7
CMA+PM+DA+Aug [32] 9.26 7.12 46.0 37.0 35.0 | 8.64 7.37 40.0 32.0 30.0
WPN-DD* [31] 9.11 6.57 44.0 35.0 32.0 | 8.23 7.48 35.0 28.0 26.0
LAW [46] 9.34 6.35 49.0 40.0 37.0 | 8.89 6.83 44.0 35.0 31.0
CM2 (Ours) 12.05 6.10 50.7 42.9 34.8 | 11.54 7.02 41.5 34.3 27.6
WPN-CC* [31] 10.29 6.05 51.0 40.0 35.0 | 10.62 6.62 43.0 36.0 30.0
HPN-C* [31] 8.71 5.17 53.0 47.0 45.0 | 7.71 6.02 42.0 38.0 36.0
CM2-GT (Ours) 12.60 4.81 58.3 52.8 41.8 | 10.68 6.23 41.3 37.0 30.6
Table 1: Evaluation on the VLN-CE dataset. All methods marked with * use panoramic images. CM2-GT is the same as CM2, but uses ground-truth local maps rather than predicting them. HPN-C and WPN-CC use a more expressive action space than the rest of the methods. AG-CMTP and R2R-CMTP allow the agent to explore each scene before the experiment begins. Our method is the most successful on val-seen while it is competitive on val-unseen.

4.1 VLN-CE Evaluation

Here we evaluate the performance of our method on the continuous vision-and-language navigation task against current state-of-the-art methods. The metrics reported are the following: Trajectory length TL (m), navigation error from goal NE (m), oracle success rate OS (%), success rate SR (%), and success weighted by path length SPL (%). More details on these metrics can be found in [1, 4].

We compare our method against the following works:

Krantz et al. [32]: Two baselines are used from here. First, Seq2Seq+PM+DA+Aug is a simple sequence-to-sequence baseline that uses a recurrent policy to predict the action directly from the visual observations. Second, CMA+PM+DA+Aug utilizes cross-modal attention between instruction and RGB-D observations. Both methods use off-the-shelf techniques for Progress Monitor (PM), DAgger (DA), and synthetic data augmentation.

Chen et al. [12]: This work uses cross-modal attention between the instruction and a topological map to compute a global navigation plan. To construct the topological map, the authors assume the agent can explore the environment before the execution of the navigation episode. Each node in the topological map corresponds to a panoramic image. We compare against AG-CMTP and R2R-CMTP, which use the method's generated map and the maps from the Room-to-Room (R2R) [4] dataset, respectively.

Raychaudhuri et al. [46]: This method (LAW) updates the training setup of CMA+PM+DA+Aug [32] by adjusting the supervision to use the nearest waypoint on the path rather than the goal location.

Krantz et al. [31]: We compare against the Waypoint Prediction Network (WPN) and the Heading Prediction Network (HPN) which are end-to-end models that predict relative waypoints directly from natural language instructions and panoramic RGB-D inputs. The models differ with respect to the waypoint prediction space. WPN-CC considers continuous values for distance and direction, WPN-DD considers discrete values, and HPN-C uses a constant value for distance and continuous for direction. Our method is analogous to WPN-DD, since our waypoint prediction is on the discrete 2D space of maps. Investigating more expressive waypoint prediction spaces is out of the scope of our work.

Quantitative results are shown in Table 1 and a navigation example can be seen in Figure 3. On val-seen our method CM2 outperforms all other baselines except WPN-CC and HPN-C (which use more expressive waypoint prediction spaces) on navigation error and success rate, while it is competitive on SPL. In particular, we show better results than WPN-DD, which uses panoramic images ($4\times$ larger field-of-view) and was trained on 200M steps of experience ($285\times$ more data) [31]. This is a characteristic of end-to-end methods that need to learn all navigation components such as mapping, planning, and control in a single network and thus require large amounts of data. In contrast, aligning language to egocentric maps proves to be much more sample efficient, as our model was trained with only 0.7M training samples. Regarding our comparison to [12], AG-CMTP performs better only on oracle success rate, while our CM2 method has a noticeably higher success rate. However, this baseline has a prior scene exploration phase, which is not counted in the task step limit, that acquires knowledge of scene topology to use during the navigation episode. In comparison, our CM2-GT, which also has knowledge over the map, performs better on all metrics. We are also competitive against CMA+PM+DA+Aug and LAW, which use cross-modal attention mechanisms between the instruction and the RGB-D frames. The latter also employs a more sophisticated reward function that forces the agent to stay on the path and trains on an augmented dataset with over ten times as many trajectories. We outperform both in success rate on val-seen and we have almost the same performance as LAW on val-unseen. Finally, when the input to our method is the ground-truth egocentric semantic map (CM2-GT) we observe a significant increase in success rate on val-seen. Although the map is local and the goal location is usually not visible, this performance gain further justifies our choice of using cross-modal attention on egocentric maps.

Team Name TL NE OS SR SPL
CWP-VLNBERT* 13.3 5.9 51 42 36
CWP-CMA* 11.9 6.3 49 38 33
WaypointTeam* 8.0 6.6 37 32 30
CM2 13.9 7.7 39 31 24
TJA* 10.4 8.1 42 29 27
VIRL_Team 8.9 7.9 36 28 25
Table 2: Results on the VLN-CE challenge leaderboard. Methods marked with * use either panoramic images and/or a non-standard action space.

VLN-CE Leaderboard

We submitted our CM2 on the held-out test-unseen set containing 3.4K episodes in unseen environments used for the VLN-CE challenge. Table 2 shows the leaderboard as accessed on Mar 8th, 2022. Our method is leading in terms of OS, SR, and NE among those that use standard observations (no panoramas) and action spaces (discrete), and is 4th overall on OS, SR, and NE.

IoU (%) F1 (%) PCW (%)
CM2-w/o-MapAttn 21.2 33.2 71.1
CM2 28.3 42.2 76.5
Table 3: Effect of map attention on map and waypoint prediction.
Val-Seen TL NE OS SR SPL
CM2-GT, $\tau=1.5$ 10.18 5.01 53.6 49.5 45.1
CM2-GT, $\tau=1.0$ 11.48 4.94 56.4 51.9 43.8
CM2-GT, $\tau=0.5$ 12.60 4.81 58.3 52.8 41.8
CM2-GT-384, $\tau=0.5$ 12.89 4.52 66.4 58.4 46.7
Table 4: Effect of map size and stop distance threshold on VLN.

4.2 Ablation Study

In this experiment we provide an analysis over our model and aim to answer the following questions:

How important is the cross-modal map attention? The cross-modal map attention, shown in Figure 2, is the attention module that learns the semantic grounding and influences the semantic map prediction. We are interested in quantifying its contribution towards the map and path prediction and define the baseline CM2-w/o-MapAttn that does not include the cross-modal map attention module and therefore is not aware of the language instruction. We compare against our method CM2 on the popular semantic segmentation metrics of Intersection over Union (IoU) and F1 score, and on the Percentage of Correct Waypoints (PCW) that evaluates the quality of the path prediction. PCW counts a predicted waypoint as correct if it is within $1.92m$ (on the $192\times 192$ maps) of the ground-truth waypoint. Results are reported in Table 3. CM2 has higher performance on IoU, F1, and PCW by $7.1\%$, $9.0\%$, and $5.4\%$ respectively. These results show that the cross-modal map attention extracts useful information from language that improves the prediction of the semantic map and the path. Examples of map predictions are shown in Figure 4.
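For reference, PCW itself is straightforward to compute; the sketch below assumes waypoints expressed in map cells and converts the $1.92m$ threshold at $5cm$ per cell.

```python
import numpy as np

def pcw(pred_wps, gt_wps, thresh_m=1.92, cell=0.05):
    """Percentage of Correct Waypoints: a prediction is correct if it lies within
    thresh_m of its ground-truth counterpart. Waypoints are (row, col) map cells;
    the meters-to-cells conversion is our assumption."""
    d = np.linalg.norm(np.asarray(pred_wps) - np.asarray(gt_wps), axis=1)
    return 100.0 * float(np.mean(d <= thresh_m / cell))
```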

Refer to caption
Figure 4: Semantic map predictions with and without cross-modal map attention.

What is the effect of the stop decision threshold? We vary the stop decision distance threshold $\tau$ (m) used by the controller and observe the performance on the VLN-CE metrics in Table 4. This experiment is carried out on val-seen using CM2-GT. When $\tau=1.5$, success rate drops by $3.3\%$ because the agent chooses to stop more aggressively and is thus more likely to choose STOP outside the goal radius. On the other hand, SPL gains $3.3\%$ since stopping earlier reduces the path length. This result signifies a trade-off between success rate (SR) and SPL, with the value of $\tau$ acting as a knob to adjust the agent's behavior.

What is the effect of egocentric map size? All experimental evaluation of our work (CM2, CM2-GT) uses $192\times 192$ egocentric maps. Given that each cell in the map corresponds to $5cm\times 5cm$, this translates to a distance of 4.8m from the center of the map (where the agent is situated) to each side. With the mean Euclidean distance between the start position and the goal being around 8m across val-seen and val-unseen episodes, this means that for the majority of episodes the goal is not located within the egocentric map at the beginning. In order to see how much this affects performance, we retrain our path predictor (CM2-GT-384) with maps of size $384\times 384$ (9.6m between the agent and the sides of the map) and compare to our original method in Table 4. Doubling the map size increases SR by $5.6\%$, OS by $8.1\%$, and SPL by $4.9\%$, demonstrating that the larger maps have a significant impact on navigation performance.

Refer to caption
Figure 5: Top: Visualization of attention decoder output $H_{t}^{s}$ which tends to focus on areas corresponding to goal locations. The agent's location is denoted with a green circle and the goal with an orange star. Bottom: Cross-modal attention between map and specific word tokens.

4.3 Validation of Semantic and Spatial Grounding

Finally, we provide evidence that the learned representations can be semantically and spatially grounded on the egocentric maps. Specifically, we visualize (Figure 5) two feature representations from the cross-modal path attention module: 1) The attention decoder output $H_{t}^{s}\in\mathbb{R}^{N\times d}$, which we max-pool over the feature dimension $d$ and reshape $N$ back to its encoded map dimensions of $h\times w$ to get a spatial heatmap. This representation is shown to focus around goal locations and along paths. 2) The cross-modal attention ($Softmax\left(\frac{QK^{T}}{\sqrt{d}}\right)$) between the map regions and the words in the instruction, with dimensionality $N\times M$, from which we can visualize the attention heatmap for a specific word token over the map. This demonstrates that the cross-modal attention learns to associate instruction tokens with semantic objects on the map.
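For reference, producing these visualizations requires only a pooling or slicing step followed by a reshape onto the $h\times w$ encoder grid; the helper functions below are a minimal sketch with our own names, assuming PyTorch tensors.

```python
import torch

def decoder_heatmap(H_s: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """Max-pool the attended representation H_s (N, d) over the feature dimension
    and reshape N = h*w back onto the map grid."""
    return H_s.max(dim=-1).values.view(h, w)

def word_heatmap(attn: torch.Tensor, word_idx: int, h: int, w: int) -> torch.Tensor:
    """Slice the (N, M) cross-modal attention matrix at one word token and reshape
    it onto the map grid."""
    return attn[:, word_idx].view(h, w)
```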

5 Conclusion

We presented a new method for the Vision-and-Language Navigation task that solves the problem by first predicting the egocentric semantic map and then estimating the trajectory, defined by the instruction, on the 2D map. This is facilitated by two cross-modal attention modules that learn to semantically and spatially ground the natural language on the egocentric map. We showcased the effectiveness of our method with competitive results on the VLN-CE dataset and demonstrated that grounding the language on maps allows for good VLN performance with a fraction of the data that end-to-end methods require. Furthermore, we qualitatively showed that our method learns meaningful intermediate representations.

Acknowledgements.

Research was sponsored by the Army Research Office and was accomplished under Grant Number W911NF-20-1-0080, as well as by the ARL DCIST CRA W911NF-17-2-0181, NSF TRIPODS 1934960, and NSF CPS 2038873 grants.

References

  • [1] Peter Anderson, Angel Chang, Devendra Singh Chaplot, Alexey Dosovitskiy, Saurabh Gupta, Vladlen Koltun, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Manolis Savva, et al. On evaluation of embodied navigation agents. arXiv preprint arXiv:1807.06757, 2018.
  • [2] Peter Anderson, Ayush Shrivastava, Devi Parikh, Dhruv Batra, and Stefan Lee. Chasing ghosts: instruction following as bayesian state tracking. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pages 371–381, 2019.
  • [3] Peter Anderson, Ayush Shrivastava, Joanne Truong, Arjun Majumdar, Devi Parikh, Dhruv Batra, and Stefan Lee. Sim-to-real transfer for vision-and-language navigation. In Conference on Robot Learning, pages 671–681. PMLR, 2021.
  • [4] Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton Van Den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3674–3683, 2018.
  • [5] Valts Blukis, Dipendra Misra, Ross A Knepper, and Yoav Artzi. Mapping navigation instructions to continuous control actions with position-visitation prediction. In Conference on Robot Learning, pages 505–518. PMLR, 2018.
  • [6] Annina Brügger, Kai-Florian Richter, and Sara Irina Fabrikant. How does navigation system behavior influence human behavior? Cognitive research: principles and implications, 4(1):1–22, 2019.
  • [7] Vincent Cartillier, Zhile Ren, Neha Jain, Stefan Lee, Irfan Essa, and Dhruv Batra. Semantic mapnet: Building allocentric semantic maps and representations from egocentric views. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 964–972, 2021.
  • [8] Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments. International Conference on 3D Vision (3DV), 2017.
  • [9] Devendra Singh Chaplot, Dhiraj Gandhi, Abhinav Gupta, and Ruslan Salakhutdinov. Object goal navigation using goal-oriented semantic exploration. Advances in Neural Information Processing Systems 33, 2020.
  • [10] Devendra Singh Chaplot, Dhiraj Gandhi, Saurabh Gupta, Abhinav Gupta, and Ruslan Salakhutdinov. Learning to explore using active neural slam. International Conference on Learning Representations, 2020.
  • [11] Howard Chen, Alane Suhr, Dipendra Misra, Noah Snavely, and Yoav Artzi. Touchdown: Natural language navigation and spatial reasoning in visual street environments. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12538–12547, 2019.
  • [12] Kevin Chen, Junshen K. Chen, Jo Chuang, Marynel Vazquez, and Silvio Savarese. Topological planning with transformers for vision-and-language navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11276–11286, June 2021.
  • [13] Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. Uniter: Universal image-text representation learning. In European conference on computer vision, pages 104–120. Springer, 2020.
  • [14] Zhiwei Deng, Karthik Narasimhan, and Olga Russakovsky. Evolving graphical planner: Contextual global planning for vision-and-language navigation. Advances in Neural Information Processing Systems, 33:20660–20672, 2020.
  • [15] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.
  • [16] Linhao Dong, Shuang Xu, and Bo Xu. Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5884–5888. IEEE, 2018.
  • [17] Daniel Fried, Ronghang Hu, Volkan Cirik, Anna Rohrbach, Jacob Andreas, Louis-Philippe Morency, Taylor Berg-Kirkpatrick, Kate Saenko, Dan Klein, and Trevor Darrell. Speaker-follower models for vision-and-language navigation. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 3318–3329, 2018.
  • [18] Georgios Georgakis, Bernadette Bucher, Anton Arapin, Karl Schmeckpeper, Nikolai Matni, and Kostas Daniilidis. Uncertainty-driven planner for exploration and navigation. International Conference in Robotics and Automation (ICRA), 2022.
  • [19] Georgios Georgakis, Bernadette Bucher, Karl Schmeckpeper, Siddharth Singh, and Kostas Daniilidis. Learning to map for active semantic goal navigation. International Conference on Learning Representations (ICLR), 2022.
  • [20] Georgios Georgakis, Yimeng Li, and Jana Kosecka. Simultaneous mapping and target driven navigation. arXiv preprint arXiv:1911.07980, 2019.
  • [21] Pierre-Louis Guhur, Makarand Tapaswi, Shizhe Chen, Ivan Laptev, and Cordelia Schmid. Airbert: In-domain pretraining for vision-and-language navigation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1634–1643, 2021.
  • [22] Saurabh Gupta, James Davidson, Sergey Levine, Rahul Sukthankar, and Jitendra Malik. Cognitive mapping and planning for visual navigation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2616–2625, 2017.
  • [23] Weituo Hao, Chunyuan Li, Xiujun Li, Lawrence Carin, and Jianfeng Gao. Towards learning a generic agent for vision-and-language navigation via pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13137–13146, 2020.
  • [24] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [25] Joao F Henriques and Andrea Vedaldi. Mapnet: An allocentric spatial memory for mapping environments. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8476–8484, 2018.
  • [26] Yicong Hong, Qi Wu, Yuankai Qi, Cristian Rodriguez-Opazo, and Stephen Gould. Vln bert: A recurrent vision-and-language bert for navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1643–1653, 2021.
  • [27] Ronghang Hu and Amanpreet Singh. Unit: Multimodal multitask learning with a unified transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1439–1449, 2021.
  • [28] Haoshuo Huang, Vihan Jain, Harsh Mehta, Jason Baldridge, and Eugene Ie. Multi-modal discriminative model for vision-and-language navigation. In Proceedings of the Combined Workshop on Spatial Language Understanding (SpLU) and Grounded Communication for Robotics (RoboNLP), pages 40–49, 2019.
  • [29] Kapil Katyal, Katie Popek, Chris Paxton, Phil Burlina, and Gregory D. Hager. Uncertainty-aware occupancy map prediction using generative networks for robot navigation. In 2019 International Conference on Robotics and Automation (ICRA), pages 5453–5459, 2019.
  • [30] Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, and Mubarak Shah. Transformers in vision: A survey. ACM Computing Surveys (CSUR), 2021.
  • [31] Jacob Krantz, Aaron Gokaslan, Dhruv Batra, Stefan Lee, and Oleksandr Maksymets. Waypoint models for instruction-guided navigation in continuous environments. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15162–15171, 2021.
  • [32] Jacob Krantz, Erik Wijmans, Arjun Majumdar, Dhruv Batra, and Stefan Lee. Beyond the nav-graph: Vision-and-language navigation in continuous environments. In European Conference on Computer Vision (ECCV), 2020.
  • [33] Alexander Ku, Peter Anderson, Roma Patel, Eugene Ie, and Jason Baldridge. Room-Across-Room: Multilingual vision-and-language navigation with dense spatiotemporal grounding. In Conference on Empirical Methods for Natural Language Processing (EMNLP), 2020.
  • [34] Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. What does bert with vision look at? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5265–5275, 2020.
  • [35] Yiqing Liang, Boyuan Chen, and Shuran Song. SSCNav: Confidence-aware semantic scene completion for visual semantic navigation. International Conference on Robotics and Automation (ICRA), 2021.
  • [36] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pages 13–23, 2019.
  • [37] Chih-Yao Ma, Jiasen Lu, Zuxuan Wu, Ghassan AlRegib, Zsolt Kira, Richard Socher, and Caiming Xiong. Self-monitoring navigation agent via auxiliary progress estimation. International Conference on Learning Representations (ICLR), 2019.
  • [38] Chih-Yao Ma, Zuxuan Wu, Ghassan AlRegib, Caiming Xiong, and Zsolt Kira. The regretful agent: Heuristic-aided navigation through progress estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6732–6740, 2019.
  • [39] Arjun Majumdar, Ayush Shrivastava, Stefan Lee, Peter Anderson, Devi Parikh, and Dhruv Batra. Improving vision-and-language navigation with image-text pairs from the web. In European Conference on Computer Vision, pages 259–274. Springer, 2020.
  • [40] Medhini Narasimhan, Erik Wijmans, Xinlei Chen, Trevor Darrell, Dhruv Batra, Devi Parikh, and Amanpreet Singh. Seeing the un-scene: Learning amodal semantic maps for room navigation. European Conference on Computer Vision. Springer, Cham, 2020.
  • [41] J. O’Keefe and L. Nadel. The hippocampus as a cognitive map. Oxford University Press, 1998.
  • [42] Alexander Pashevich, Cordelia Schmid, and Chen Sun. Episodic transformer for vision-and-language navigation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15942–15952, 2021.
  • [43] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.
  • [44] Yuankai Qi, Zizheng Pan, Yicong Hong, Ming-Hsuan Yang, Anton van den Hengel, and Qi Wu. The road to know-where: An object-and-room informed sequential bert for indoor vision-language navigation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1655–1664, 2021.
  • [45] Santhosh K Ramakrishnan, Ziad Al-Halah, and Kristen Grauman. Occupancy anticipation for efficient exploration and navigation. European Conference on Computer Vision, pages 400–418, 2020.
  • [46] Sonia Raychaudhuri, Saim Wani, Shivansh Patel, Unnat Jain, and Angel Chang. Language-aligned waypoint (law) supervision for vision-and-language navigation in continuous environments. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 4018–4028, 2021.
  • [47] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
  • [48] Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, et al. Habitat: A platform for embodied ai research. In Proceedings of the IEEE International Conference on Computer Vision, pages 9339–9347, 2019.
  • [49] Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. Vl-bert: Pre-training of generic visual-linguistic representations. In International Conference on Learning Representations (ICLR), 2019.
  • [50] Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. Videobert: A joint model for video and language representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7464–7473, 2019.
  • [51] Hao Tan and Mohit Bansal. Lxmert: Learning cross-modality encoder representations from transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5100–5111, 2019.
  • [52] E.C. Tolman. Cognitive maps in rats and men. Psychological Review, 55:189–208, 1948.
  • [53] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
  • [54] Erik Wijmans, Abhishek Kadian, Ari Morcos, Stefan Lee, Irfan Essa, Devi Parikh, Manolis Savva, and Dhruv Batra. Dd-ppo: Learning near-perfect pointgoal navigators from 2.5 billion frames. International Conference on Learning Representations (ICLR), 2019.
  • [55] Anna Wunderlich and Klaus Gramann. Landmark-based navigation instructions improve incidental spatial knowledge acquisition in real-world environments. biorxiv:10.1101/789529, 2020.
  • [56] Linwei Ye, Mrigank Rochan, Zhi Liu, and Yang Wang. Cross-modal self-attention network for referring image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10502–10511, 2019.
  • [57] Xiaoming Zhao, Harsh Agrawal, Dhruv Batra, and Alexander Schwing. The surprising effectiveness of visual odometry techniques for embodied pointgoal navigation. International Conference on Computer Vision (ICCV), 2021.
  • [58] Chen Zheng, Quan Guo, and Parisa Kordjamshidi. Cross-modality relevance for reasoning on language and vision. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7642–7651, 2020.
  • [59] Wang Zhu, Hexiang Hu, Jiacheng Chen, Zhiwei Deng, Vihan Jain, Eugene Ie, and Fei Sha. Babywalk: Going farther in vision-and-language navigation by taking baby steps. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2539–2556, 2020.
  • [60] Yi Zhu, Yue Weng, Fengda Zhu, Xiaodan Liang, Qixiang Ye, Yutong Lu, and Jianbin Jiao. Self-motivated communication agent for real-world vision-dialog navigation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1594–1603, 2021.

Appendix A Appendix

Here we provide the following additional material:

  1. Discussion on societal impact and limitations.

  2. Implementation details.

  3. Analysis of path prediction learning with regard to the auxiliary loss and the start position input.

  4. Analytical results over semantic map prediction to assess the contribution of the cross-modal map attention.

  5. Additional results on the effect of the stop decision threshold.

  6. Additional qualitative navigation results and visualizations of the learned attention representations.

A.1 Societal Impact and Limitations

Potential negative societal impact.

Our current method is trained on scenes from Matterport3D which contains scans of homes from North America and Europe. Since we do not model out-of-distribution scenarios, deploying our method in safety critical situations such as rescue operations or hospitals could have negative outcomes. Furthermore, house layouts strongly correlate with regions of the world and with socio-economic factors, making it likely that agents using our algorithm will underperform when deployed in other parts of the world or in poor or minority houses which are frequently underrepresented in datasets.

Limitations.

While our approach achieves results comparable with the state of the art, we acknowledge that there is much room for improvement. We would like to point out three limitations of our method. First, since we predict the path from the semantic map, we are not utilizing information from the instructions that describes object attributes such as color (e.g., “brown table”, “red table”). This can be important in situations where we need to distinguish between two instances of the same category. Second, we depend on the pretrained BERT representation, after fine-tuning its final layer, to provide all relevant information about the instruction. We do not use any explicit language representation, which could allow for better decomposition of instructions. Third, our method is limited by the size of the local egocentric map. We cannot spatially ground information to locations outside the local map, and while increasing the size of the local map can significantly improve performance, it is also computationally expensive.

TL NE OS SR SPL
CM2-GT, w/o $\mathcal{P}_{t}^{0}$, $\lambda_{\xi}=0$ 9.37 6.80 32.9 29.3 22.2
CM2-GT, w/o $\mathcal{P}_{t}^{0}$ 10.62 6.18 38.4 34.3 26.5
CM2-GT, $\lambda_{\xi}=0$ 12.61 5.04 54.3 49.1 39.0
CM2-GT 12.60 4.81 58.3 52.8 41.8
Table 5: Analysis of our path prediction strategy demonstrating the contributions of $\mathcal{P}_{t}^{0}$ and the auxiliary loss, using navigation metrics on the val-seen set.
Refer to caption
Figure 6: Per-class semantic map predictions with and without cross-modal map attention. Performance gains are more noticeable for object categories over floor and wall.
Refer to caption
Figure 7: Per-waypoint path prediction results with and without cross-modal map attention. Waypoint 9 corresponds to the goal, while waypoint 0 is used as input to our method.

A.2 Implementation details

Our method is implemented in PyTorch [43]. The UNet [47] models used in our method have four encoder and four decoder convolutional blocks with skip connections. The entire model is trained with the Adam optimizer and a learning rate of 0.0002. During training all $\lambda$s are equal to 1. The training data for both the map and waypoint prediction were sampled from the ground-truth paths provided in the VLN-CE train split. We used around 700K examples to train CM2 and around 500K to train CM2-GT. The semantic segmentation network that produces $\hat{\chi}$ is another UNet which we pre-trained separately from the rest of the model on RGB observations from the Matterport3D scenes. The egocentric map and waypoint heatmap dimensions are $h^{\prime}=w^{\prime}=192$ and $u=v=24$ respectively. Each pixel in the egocentric map corresponds to physical dimensions of $5cm\times 5cm$. We use $k=10$ waypoints and $c=27$ semantic classes from the original 40 categories of Matterport3D. For the controller we set the stop distance threshold $\tau=0.5$ and goal confidence threshold $\gamma=0.6$. Our method does not use any recurrence or an implicit state representation, so the map and path predictions are temporally independent. However, during a navigation episode we maintain a global occupancy map using the ground-projected depth $o_{t}$, which is registered using Bayesian updates. The input to the model is an egocentric crop from this global map, so the agent is aware of previously observed occupancy.
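As an illustration of the last two steps, a minimal NumPy sketch of the occupancy registration and the egocentric crop is given below; the log-odds increments and the padding behavior are our assumptions, and the rotation of the crop to the agent's heading is omitted.

```python
import numpy as np

def update_global_occupancy(log_odds, free_mask, occ_mask, l_free=-0.4, l_occ=0.9):
    """Register one ground-projected frame into the global occupancy map via
    log-odds (Bayesian) updates. The increment values are illustrative assumptions,
    not parameters reported in the paper."""
    log_odds[free_mask] += l_free
    log_odds[occ_mask] += l_occ
    return np.clip(log_odds, -10.0, 10.0)

def egocentric_crop(global_map, agent_cell, crop=192):
    """Crop a (crop x crop) egocentric window centered on the agent's grid cell.
    Padding keeps the crop valid near map borders."""
    half = crop // 2
    padded = np.pad(global_map, half, mode="constant")
    r, c = agent_cell[0] + half, agent_cell[1] + half
    return padded[r - half:r + half, c - half:c + half]
```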

A.3 Analysis of path prediction learning

We investigate the contribution of certain choices we made to mitigate the ambiguity over waypoint placements during path prediction learning, as discussed in section 3.3 of the main paper. In particular, we train the following variants of our CM2-GT model: 1) without using the starting position heatmap $\mathcal{P}_{t}^{0}$ as input, 2) without the auxiliary loss for predicting whether a waypoint has been traversed ($\lambda_{\xi}=0$), and 3) without both $\mathcal{P}_{t}^{0}$ and the auxiliary loss. The variants are evaluated against our proposed approach on val-seen using the navigation metrics from section 4.1 of the main paper (Table 5). We observe that without the auxiliary loss success rate drops by $3.7\%$, while not using the starting position further decreases success rate by $18.5\%$. The worst performance by far is recorded when both are removed. The results justify our choices and suggest the importance of anchoring the prediction of the entire path to a starting location in the egocentric map, complemented by an auxiliary objective that forces the model to predict its current position on the path.

Method | Val-Seen: TL NE OS SR SPL | Val-Unseen: TL NE OS SR SPL
CM2, $\tau=1.5$ 9.54 6.06 42.4 38.8 34.6 | 9.07 7.01 35.2 31.3 27.7
CM2, $\tau=1.0$ 10.72 5.88 49.2 42.6 35.9 | 10.04 7.09 39.0 33.3 27.9
CM2, $\tau=0.5$ 12.05 6.10 50.7 42.9 34.8 | 11.53 7.02 41.5 34.3 27.6
Table 6: Additional results on the effect of stop distance threshold on VLN.

A.4 Analytical results for cross-modal map attention

In section 4.2 of the main paper we investigated the importance of the cross-modal map attention component by comparing our approach to the baseline CM2-w/o-MapAttn, which is unaware of the language instruction during map prediction. Here, we show additional per-class and per-waypoint results over F1 score (Figure 6) and PCW (Figure 7) respectively. First, in Figure 6 we observe that the model trained with the cross-modal map attention (CM2) performs better on all semantic categories compared to the baseline. Furthermore, the performance gain is more pronounced for object categories (e.g., toilet $12.4\%$, sink $12.6\%$) as opposed to semantic classes referring to the structure of the scene (e.g., floor $5.6\%$, wall $5.1\%$). This reinforces our initial hypothesis that the attention component is able to pick up semantic cues from the instruction and improve the map prediction. Additionally, in Figure 7 we demonstrate path prediction results over individual waypoints (1-9). Waypoint 0 is omitted since it is used as input to our method, while waypoint 9 corresponds to the goal location. As expected, waypoints earlier in the path have higher PCW. However, an interesting observation is that the gain in performance increases for waypoints closer to the goal rather than at the beginning of the path, demonstrating that improved map prediction is crucial for predicting waypoints far from the starting position.

For additional qualitative comparisons of semantic map predictions between the baseline and our approach see Figure 11.

A.5 Additional results on effect of stop distance threshold.

We repeat the experiment presented in section 4.2 of the main paper regarding the effect of the stop distance threshold on the VLN task using our CM2 (no GT map) agent on both val-seen and val-unseen splits. In Table 6 we observe a similar trend as that shown in Table 4 of the main paper. Success rate is higher when $\tau$ is low, because the agent takes the stop action more cautiously, while trajectory length is best when $\tau$ is high.

A.6 Additional visualizations

Finally, we share additional visualizations of navigation episodes (Figure 10) and more examples of spatial and semantic grounding of the learned representations. Figure 8 shows the attention decoder output $H_{t}^{s}$ and Figure 9 presents more examples of the cross-modal attention. See section 4.3 of the main paper for more details.

Refer to caption
Figure 8: Visualization of attention decoder output $H_{t}^{s}$ that focuses on areas around goal locations and along paths. The agent's location is denoted with a green circle and the goal with an orange star.
Refer to caption
Figure 9: Visualization of the cross-modal attention representation between map and specific word tokens. The representation tends to focus on semantic areas of the map that correspond to the object referred to by the token. Note that in the example on the 4th row the representation focuses on the area where stairs are located, even though we do not use a specific semantic label for stairs in the map.
Refer to caption
Figure 10: Navigation examples using our method CM2 on val-seen (first from top) and val-unseen (last three). The top row of each example shows the RGB observations of the agent, while bottom shows the path prediction on the egocentric maps (the agent is in the middle looking upwards shown as the green circle). The red waypoints represent our path prediction at the particular time-step. Observe that the goal, shown as an orange star, is neither visible nor within the egocentric map at the beginning of the episodes. The ground-truth map and path are depicted in the bottom left corner.
Refer to caption
Figure 11: Semantic map predictions with and without cross-modal map attention.