
A Priority Map for Vision-and-Language Navigation with Trajectory Plans and Feature-Location Cues

Jason Armitage
University of Zurich
Switzerland
   Leonardo Impett
University of Cambridge
UK
   Rico Sennrich
University of Zurich
Switzerland
Abstract

In a busy city street, a pedestrian surrounded by distractions can pick out a single sign if it is relevant to their route. Artificial agents in outdoor Vision-and-Language Navigation (VLN) are likewise confronted with detecting supervisory signal on environment features and locations in their inputs. To boost the prominence of relevant features in transformer-based systems without costly preprocessing and pretraining, we take inspiration from priority maps - a mechanism described in neuropsychological studies. We implement a novel priority map module and pretrain on auxiliary tasks using low-sample datasets with high-level representations of routes and environment-related references to urban features. A hierarchical process of trajectory planning - with subsequent parameterised visual boost filtering on visual inputs and prediction of corresponding textual spans - addresses the core challenge of cross-modal alignment and feature-level localisation. The priority map module is integrated into a feature-location framework that doubles the task completion rates of standalone transformers and attains state-of-the-art performance for transformer-based systems on the Touchdown benchmark for VLN. Code and data are referenced in Appendix C.

1   Introduction

Navigation in the world depends on attending to relevant cues at the right time. A road user in an urban environment is presented with billboards, moving traffic, and other people - but at an intersection will pinpoint a single light to check if it contains the colour red (Gottlieb et al., 2020; Shinoda et al., 2001). An artificial agent navigating a virtual environment of an outdoor location is also presented with a stream of linguistic and visual cues. Action selections that move the agent closer to a final destination depend on the prioritisation of references that are relevant to the point in the trajectory. In the first example, human attention is guided to specific objects by visibility and the present objective of crossing the road. At a neurophysiological level, this process is mediated by a priority map - a neural mechanism that guides attention by matching low-level signals on salient objects with high-level signals on task goals. Prioritisation in humans is enhanced by combining multimodal signals and integration between linguistic and visual information (Ptak, 2012; Cavicchio et al., 2014). The ability to prioritise improves as experience of situations and knowledge of environments increases (Zelinsky and Bisley, 2015; Tatler et al., 2011).

We introduce a priority map module for Vision-and-Language Navigation (PM-VLN) that is pretrained to guide a transformer-based architecture to prioritise relevant information for action selections in navigation. In contrast to pretraining on large-scale datasets with generic image-text pairs (Su et al., 2020), the PM-VLN module learns from small sets of samples representing trajectory plans and urban features. Our proposal is founded on the observation that inputs for VLN concentrate location deictic terms and references to objects with high visual salience. Prominent features in the environment pervade human-generated navigation instructions. Road network types (“intersection”), architectural features (“awning”), and transportation (“cars”) all appear with high frequency in linguistic descriptions of the visual appearance of urban locations. Learning to combine information in the two modalities relies on synchronising temporal sequences of varying lengths. We utilise references to entities as a signal for a process of cross-modal prioritisation that addresses this requirement.

Our module learns over both modalities to prioritise timely information and assist both generic vision-and-language and custom VLN transformer-based architectures to complete routes (Li et al., 2019; Zhu et al., 2021). Transformers have contributed to recent proposals to conduct VLN, Visual Question Answering, and other multimodal tasks - but are associated with three challenges: 1) Standard architectures lack mechanisms that address the challenge of temporal synchronisation over linguistic and visual inputs. Pretrained transformers perform well in tasks on image-text pairs but are challenged when learning over sequences without explicit alignments between modalities (Lin and Wang, 2020). 2) Performance is dependent on pretraining with large sets of image-text pairs and a consequent requirement for access to enterprise-scale computational resources (Majumdar et al., 2020; Suglia et al., 2021). 3) Visual learning relies on external models and pipelines - notably for object detection (Li et al., 2020; Le et al., 2022). The efficacy of object detection for VLN is low in cases where training data only refer to a small subset of object types observed in navigation environments.

We address these challenges with a hierarchical process of trajectory planning with feature-level localisation and low-sample pretraining on in-domain data. We use discriminative training on two auxiliary tasks that adapt parameters of the PM-VLN for the specific challenges presented by navigating routes in outdoor environments. High-level planning for routes is enabled by pretraining for trajectory estimation on simple path traces ahead of a second task comprising multi-objective cross-modal matching and location estimation on urban landmarks. Data in the final evaluation task represent locations and trajectories in large US cities and present an option to leverage real-world resources in pretraining. Our approach builds on this opportunity by sourcing text, images, coordinates, and path traces from the open web and the Google Directions API where additional samples may be secured at low cost in comparison to human generation of instructions.

This research presents four contributions to enhance transformer-based systems on outdoor VLN tasks:

  • Priority map module Our novel PM-VLN module conducts a hierarchical process of high-level alignment of textual spans with visual perspectives and feature-level operations to enhance and localise inputs during navigation (see Figure 3).

  • Trajectory planning We propose a new method for aligning temporal sequences in VLN comprising trajectory estimation on path traces and subsequent predictions for the distribution of linguistic descriptions over routes.

  • Two in-domain datasets and training strategy We introduce a set of path traces for routes in two urban locations (TR-NY-PIT-central) and a dataset consisting of textual summaries, images, and World Geodetic System (WGS) coordinates for landmarks in 10 US cities (MC-10). These resources enable discriminative training of specific components of the PM-VLN on trajectory estimation and multi-objective loss for a new task that pairs location estimation with cross-modal sentence prediction.

  • Feature-location framework We design and build a framework (see Figure 2) to combine the outputs from the PM-VLN module and cross-modal embeddings from a transformer-based encoder. The framework incorporates components for performing self-attention, combining embeddings, and predicting actions with maxout activation.

2   Background

In this section we define the Touchdown task and highlight a preceding challenge of aligning and localising over linguistic and visual inputs addressed in our research. A summary of the notation used below and in subsequent sections is presented in Appendix A.

Touchdown   Navigation in the Touchdown benchmark $\phi_{VLN}$ is measured as the completion of $N$ predefined trajectories by an agent in an environment representing an area of central Manhattan. The environment is represented as an undirected graph composed of nodes $O$ located at WGS latitude/longitude points. At each step $t$ of the sequence $\{1,\dots,T\}$ that constitutes a trajectory, the agent selects an edge $\xi_{t}$ to a corresponding node. The agent's selection is based on linguistic and visual inputs. A textual instruction $\uptau$ composed of a varying number of tokens describes the overall trajectory. We use $\varsigma$ to denote a span of tokens from $\uptau$ that corresponds to the agent's location in the trajectory. Depending on the approach, $\varsigma$ can be the complete instruction or a selected sequence. The visual representation of a node in the environment is a panorama drawn from a sequence $Route$ of undetermined length. The agent receives a specific perspective $\psi$ of a panorama determined by the heading angle $\angle$ between $(o_{1}, o_{2})$. Success in completing a route is defined as predicting a path that ends at the node designated as the goal - or one directly adjacent to it.

In a supervised learning paradigm (see a) in Figure 1), an embedding $e_{\eta}$ is learned from inputs $\varsigma_{t}$ and $\psi_{t}$. The agent's next action is a classification over $e_{\eta}$, where the action $\alpha_{t}$ is drawn from the set $\mathrm{A}=\{Forward, Left, Right, Stop\}$. Predictions $\alpha_{t}=Forward$ and $\alpha_{t}\in\{Left, Right\}$ result respectively in a congruent or a new $\angle$ at edge $\xi_{t+1}$. A route in progress is terminated by a prediction $\alpha_{t}=Stop$.
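As a concrete illustration of this classification step, the sketch below (assuming a PyTorch implementation, which is not specified here) maps a cross-modal embedding $e_{\eta}$ to logits over the four actions; the class and module names are illustrative only, not the benchmark implementation.

```python
import torch
import torch.nn as nn

ACTIONS = ["Forward", "Left", "Right", "Stop"]

class ActionClassifier(nn.Module):
    def __init__(self, hidden_size=256):
        super().__init__()
        self.head = nn.Linear(hidden_size, len(ACTIONS))

    def forward(self, e_eta):
        # e_eta: cross-modal embedding learned from (span, perspective) at step t
        return self.head(e_eta)  # logits over {Forward, Left, Right, Stop}

logits = ActionClassifier()(torch.randn(1, 256))
alpha_t = ACTIONS[logits.argmax(dim=-1).item()]  # the agent's action at step t
```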

Align and Localise   We highlight in Figure 1 a preceding challenge in learning cross-modal embeddings. As in real-world navigation, an agent is required to align and match cues in instructions with its surrounding environment. A strategy in human navigation is to use entities or landmarks to perform this alignment (Cavicchio et al., 2014). In the Touchdown benchmark, a relationship between sequences $\uptau$ and $Route$ is assumed from the task generation process outlined in Chen et al. (2019) - but the precise alignment is not known. We define the challenge as one of aligning temporal sequences $\uptau=\{\varsigma_{1},\varsigma_{2},\dots,\varsigma_{n}\}$ and $Route=\{\psi_{1},\psi_{2},\dots,\psi_{n}\}$ with the aim of generating a set of cross-modal embeddings $E_{\eta}$ where referenced entities correspond. At a high level, this challenge can be addressed by an algorithm $q$ that maximises the probability $P$ of detecting signal $S$ in a set of inputs. This algorithm is defined as

$$q(\theta)((X_{t})) = q(\theta)(X_{t}) \rightarrow \max\left[\int_{\mathcal{X}} p(X_{t}\,|\,\theta)\, s(X_{t})\right] \qquad (1)$$

where $\theta$ is a parameter $\theta\in\Theta_{1}$ and $\mathcal{X}$ is the data space.


Figure 1: Outline of VLN as a supervised classification task a). Linguistic and visual inputs both refer to entities indicated in red. We address a challenge to align and localise over unsynchronised inputs b) by focusing on entities represented in both modalities.
Figure 2: Prior work on transformer-based systems for VLN follows the above pipeline from inputs to the main model, concluding with a) a classifier to predict actions. We propose a feature-location framework (FLPM) to enhance the performance of a main model as in b). Here path traces are an additional input to assist the PM-VLN in aligning linguistic and visual sequences. Submodule $g_{CFns}$ combines embeddings from the main model $U_{\eta}$ and the PM-VLN $\widetilde{E}_{\eta}$ ahead of action prediction with maxout activation.

In the Touchdown benchmark, linguistic and visual inputs are of the form $0\leq|\uptau|\leq n$ and $0\leq|Route|\leq n$ where $len(\uptau)\neq len(Route)$. The task then is to maximise the probability of detecting signal in the form of corresponding entities over the sequences $\uptau$ and $Route$, which in turn is the product of probabilities over pairings $\varsigma_{t}$ and $\psi_{t}$ presented at each step:

$$\underset{subject\,to}{g(X_{t})\rightarrow \max}\;\; P[\uptau, Route] = \prod p_{x_{\varsigma} x_{\psi}} \qquad (2)$$

3   Method

We address the challenge of aligning and localising over sequences with a computational implementation of cross-modal prioritisation. Diagnostics on VLN systems have called into question the ability of agents to perform cross-modal alignment (Zhu et al., 2022). Transformers underperform on problems with temporal inputs where supervision on image-text alignments is lacking (Chen et al., 2020). This is demonstrated in the case of Touchdown, where transformer-based systems complete less than a quarter of routes. Our own observation of lower performance when increasing the depth of transformer architectures motivates moving beyond stacking blocks to an approach that complements self-attention.

Our PM-VLN module modulates transformer-based encoder embeddings in the main task $\phi_{VLN}$ using a hierarchical process of operations and leveraging prior learning on auxiliary tasks $(\phi_{1},\phi_{2})$ (see Figure 3). In order to prioritise relevant information, a training strategy for PM-VLN components is designed where training data contain samples that correspond to the urban grid type and environment features in the main task. The datasets required for pretraining contain fewer samples than those used by other transformer-based VLN frameworks (Zhu et al., 2021; Majumdar et al., 2020) and target only specific layers of the PM-VLN module. The pretrained module is integrated into a novel feature-location framework FLPM shown in Figure 2. Subsequent components in the FLPM combine cross-modal embeddings from the PM-VLN and a main transformer model ahead of predicting an action.


Figure 3: A Priority Map module performs a hierarchical process of high-level trajectory planning and feature-level localisation. Submodules inside the white box are learned together and a helper function generates a trajectory plan to predict spans from step $t_{1}$.

3.1   Feature-location Framework with a Priority Map Module

Prior work on VLN agents has demonstrated that navigation decisions rely on environment features and location-related references (Zhu et al., 2021). In the definition of $\phi_{VLN}$ above, we consider this information as the supervisory signal contained in both sets of inputs $(x_{\varsigma}, x_{\psi})_{t}$. As illustrated in Figure 2, our PM-VLN module is introduced into a framework FLPM. This framework takes outputs from a transformer-based main model $Enc_{Trans}$ together with path traces ahead of cross-modal prioritisation and classification with maxout activation $Clas_{max\,x_{i}}$. Inputs for $Enc_{Trans}$ comprise cross-modal embeddings $\bar{e}_{\eta}$ proposed by Zhu et al. (2021) and a concatenation of perspectives up to the current step $\psi_{cat}$

$$Clas_{\underset{i}{max\,x_{i}}}[y_{j}\,|\,z'] = \bar{d}\,\Big(\textit{PM-VLN}\big(\{g(x_{i}),(tr_{t},\imath_{t}',\psi_{t})\}_{i=1}^{n}\big) + Enc_{Trans}\big(\{g(x_{i}),(\bar{e}_{\eta},\psi_{cat})\}_{i=1}^{n}\big)\Big) \qquad (3)$$

where $tr_{t}$ is a path trace, $z'$ is the concatenation of the outputs of the two encoders, and $\bar{d}$ is a dropout operation.
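A minimal sketch of the combination in Equation 3, assuming PyTorch: the PM-VLN and $Enc_{Trans}$ outputs are combined (here by summation, following the + in Equation 3), dropout is applied, and the result is passed to the classifier. Module and variable names are placeholders rather than the released implementation.

```python
import torch
import torch.nn as nn

class FLPMCombine(nn.Module):
    """Combine the PM-VLN and Enc_Trans outputs and apply dropout, as in Equation 3."""
    def __init__(self, p_drop=0.1):
        super().__init__()
        self.dropout = nn.Dropout(p_drop)

    def forward(self, pm_vln_out, enc_trans_out):
        return self.dropout(pm_vln_out + enc_trans_out)  # z' passed on to Clas_max

z_prime = FLPMCombine()(torch.randn(1, 256), torch.randn(1, 256))
```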

3.1.1   Priority Map Module

Priority maps are described in the neuropsychological literature as a mechanism that modulates sensory processing on cues from the environment. Salience deriving from the physical aspects of objects in low-level processing is mediated by high-level signals for the relevance of cues to task goals (Fecteau and Munoz, 2006; Itti and Koch, 2000; Zelinsky and Bisley, 2015). Cortical regions that form the location of these mechanisms are associated with the combined processing of feature- and location-based information (Bisley and Mirpour, 2019; Hayden and Gallant, 2009). Prioritisation of items in map tasks with language instructions indicates an integration between linguistic and visual information and subsequent increases in salience attributed to landmarks (Cavicchio et al., 2014).

Our priority map module (PM-VLN) uses a series of simple operations to approximate the prioritisation process observed in human navigation. These operations avoid dependence on initial tasks such as object detection. Alignment of linguistic and visual inputs is enabled by trajectory estimation on simple path traces forming high-level representations of routes and subsequent generation of trajectory plans. Localisation consists of parameterised visual boost filtering on the current environment perspective $\psi_{t}$ and cross-modal alignment of this view with spans selected in the preceding alignment step (see Algorithm 1). This hierarchical process complements self-attention by accounting for the lack of a mechanism in transformers to learn over unaligned temporal sequences. A theoretical basis for cross-modal prioritisation is presented below.

Algorithm 1 Priority Map Module
  Input: Datasets $\mathcal{D}_{\phi_{1}}$, $\mathcal{D}_{\phi_{2}}$, and $\mathcal{D}_{\phi_{VLN}}$ with inputs $(x_{l}, x_{v})$ for tasks $\Phi$. Initial parameters in all layers $\Theta_{j}^{l}\sim Normal(\mu_{j},\sigma_{j})$.
  Output: $(e_{l}, e_{v}^{\prime})$
  while not converged do
     for $x_{tr_{i}}$ in $\phi_{1}$ do
        $\Theta_{g_{PMTP}}^{\prime}\leftarrow g_{\phi_{1}}(X_{i},\Theta)$
     end for
  end while
  while not converged do
     for $(x_{l_{i}},x_{v_{i}})$ in $\phi_{2}$ do
        $\Theta_{g_{PMF}}^{\prime}\leftarrow g_{\phi_{2}}(X_{i},\Theta)$
     end for
  end while
  while not converged do
     Sample $x_{tr_{t}}$ from $D^{Train}$
     $x_{tp_{t}}\leftarrow g_{PMTP}(x_{tr_{t}})$
     Sample $(\imath_{t}^{\prime},\psi_{t})$ from $D^{Train}$
     $e_{v}\leftarrow g_{USM}(\psi_{t})$
     $e_{v}^{\prime}\leftarrow g_{VBF}(e_{v})$
     $e_{l}\leftarrow g_{PrL}(g_{Cat}(\imath_{t}^{\prime},e_{v}^{\prime}))$
  end while
  return $(e_{l},e_{v}^{\prime})$

High-level trajectory estimation   Alignment over linguistic and visual sequences is formulated as a task of predicting a set of spans from the instruction that correspond to the current step. This process starts with a submodule $g_{PMTP}$ that estimates a count $cnt$ of steps from a high-level view on the route (see Figure 4). Path traces - denoted as $tr_{T}$ - are visual representations of trajectories generated from the coordinates of nodes. At $t_{0}$ in $tr_{T}$, initial spans in the instruction are assumed to align with the first visual perspective. From step $t_{1}$, a submodule containing a pretrained ConvNeXt Tiny model (Liu et al., 2022) updates an estimate of the step count $cnt_{tr_{T}}$. A trajectory plan $tp_{t}$ is a Gaussian distribution of spans in $\uptau$ within the interval $[x_{left}, x_{right}]$. At each step, samples from this distribution serve as a prediction for relevant spans. The final output $\imath_{t}'$ is the predicted span $\imath_{t}$ combined with $\imath_{t-1}$.
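The following sketch illustrates the trajectory-plan idea: given an estimated step count, instruction tokens are weighted by a Gaussian centred on the current step and the highest-weighted tokens form the predicted span, which is then combined with the span from the previous step. The interval bounds and width are illustrative assumptions, not the exact parameterisation of $g_{PMTP}$.

```python
import numpy as np

def trajectory_plan(tokens, est_step_count, step, sigma=1.0):
    """Predict the instruction span for `step`, given an estimated step count."""
    n = len(tokens)
    centre = (step + 0.5) / est_step_count * n        # map the step onto token positions
    positions = np.arange(n)
    width = sigma * n / est_step_count                # spread of the Gaussian per step
    weights = np.exp(-((positions - centre) ** 2) / (2 * width ** 2))
    keep = weights > 0.5 * weights.max()              # tokens inside the predicted interval
    return [tok for tok, k in zip(tokens, keep) if k]

tokens = "turn left at the light and stop by the awning".split()
span_t = trajectory_plan(tokens, est_step_count=3, step=1)   # span for the current step
span_t_prime = trajectory_plan(tokens, 3, 0) + span_t        # combined with the previous span
```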


Figure 4: Submodule $g_{PMTP}$ estimates a step count ($cnt_{tr}$) on a path trace. A trajectory plan ($tp$) is a Gaussian distribution ($Normal$) over the instruction and predicts a span $\imath_{t}$ for every step. This is concatenated with the span predicted for the previous step.

Feature-level localisation   Predicted spans are passed with $\psi_{t}$ to a submodule $g_{PMF}$ that is pretrained on cross-modal matching in $\phi_{2}$ (see Figure 5). Feature-level operations commence with visual boost filtering. Let $Conv_{VBF}$ be a convolutional layer with a kernel $\kappa$ and weights $W$ that receives as input $\psi_{t}$. In the first operation $g_{USM}$, a Laplacian of Gaussian kernel $\kappa_{LoG}$ is applied to $\psi_{t}$. The second operation $g_{VBF}$ consists of subtracting the output $e_{v}$ from the original tensor $\psi_{t}$:

$$g_{VBF}(e_{v}) = (\lambda - 1)(e_{v}) - g_{USM}(\psi_{t}) \qquad (4)$$

where $\lambda$ is a learned parameter for the degree of sharpening.

A combination of $g_{USM}$ and $g_{VBF}$ is equivalent to adaptive sharpening of details in an image with a Laplacian residual (Bayer, 1986). Here operations are applied directly to $e_{v}$ and adjusted at each update of $W_{Conv_{VBF}}$ with a parameterised control $\beta\lambda$. In the simple and efficient implementation from Carranza-Rojas et al. (2019), $\sigma$ in the distribution $LoG(\mu_{j},\sigma_{j})$ is fixed and the level of boosting is reduced to a single learned term

$$\Delta z(x_{1},x_{2}) = \beta\lambda\left(\sum_{j}\left(A^{\prime}_{\kappa_{i_{j}}} - A_{W_{\kappa_{i_{j}}}}\right)_{z}\right) \qquad (5)$$

where $A_{W}$ is a matrix of parameters and $A^{\prime}$ is the identity.
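A hedged sketch of the visual boost filtering described in Equations 4-5, assuming PyTorch: a fixed Laplacian of Gaussian kernel provides $g_{USM}$ and a single learned scalar $\lambda$ controls the boost, following the single-term formulation of Carranza-Rojas et al. (2019). The kernel values and residual form are standard unsharp-masking choices rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualBoostFilter(nn.Module):
    def __init__(self):
        super().__init__()
        log_kernel = torch.tensor([[0., 1., 0.],
                                   [1., -4., 1.],
                                   [0., 1., 0.]])      # discrete LoG kernel, fixed sigma
        self.register_buffer("kernel", log_kernel.view(1, 1, 3, 3))
        self.lam = nn.Parameter(torch.tensor(1.5))     # single learned boost term (lambda)

    def forward(self, psi):
        # psi: perspective tensor of shape (batch, channels, H, W)
        c = psi.shape[1]
        e_v = F.conv2d(psi, self.kernel.repeat(c, 1, 1, 1), padding=1, groups=c)  # g_USM
        return psi + (self.lam - 1.0) * e_v            # boosted perspective e_v' (g_VBF)

boosted = VisualBoostFilter()(torch.randn(1, 3, 64, 64))
```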


Figure 5: Submodule $g_{PMF}$ commences feature-level operations by boosting visual features in the perspective. The next operation ($Cat$) is a concatenation of the output from $g_{VBF}$ and the linguistic output $\imath_{t}'$ from the alignment process above. A precise prediction for the relevant span $e_{l}$ is returned by $g_{PrL}$.

Selection of a localised span $e_{l}$ proceeds with a learned cross-modal embedding $e_{\eta}'$ composed of $e_{v}'$ and the linguistic output $\imath_{t}'$ from the preceding alignment operation. A binary prediction over this linguistic pair is performed on the output hidden state of a single-layer LSTM, which receives $e_{\eta}'$ as its input sequence. Function $g_{PrL}$ returns a precise localisation of relevant spans w.r.t. prominent features in the perspective:

$$g_{PrL}(e_{l}) = g_{Cat}(\imath_{t}', e_{v}') \triangleq \begin{cases} 0, & \text{if } \langle w, x\rangle + b < 0\\ 1, & \text{otherwise}\end{cases} \qquad (6)$$
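A minimal sketch of the localisation step $g_{PrL}$, assuming PyTorch: the concatenated cross-modal embedding is passed through a single-layer LSTM and a binary prediction is made on its final hidden state, as in Equation 6. Dimensions are illustrative.

```python
import torch
import torch.nn as nn

class SpanLocaliser(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.lstm = nn.LSTM(input_size=dim, hidden_size=dim, num_layers=1, batch_first=True)
        self.head = nn.Linear(dim, 1)                  # <w, x> + b in Equation 6

    def forward(self, e_eta):
        # e_eta: concatenation of the span embedding and boosted visual embedding,
        # shape (batch, seq_len, dim)
        _, (h, _) = self.lstm(e_eta)
        return (self.head(h[-1]) >= 0).long()          # 0/1 choice of the relevant span

choice = SpanLocaliser()(torch.randn(1, 10, 256))
```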

Pretraining Strategy   A data-efficient pretraining strategy for the PM-VLN module consists of pretraining submodules of the PM-VLN on auxiliary tasks $(\phi_{1},\phi_{2})$. We denote the two datasets for these tasks as $(\mathcal{D}_{\phi_{1}}, \mathcal{D}_{\phi_{2}})$ and a training partition as $\mathcal{D}^{Train}$ (see Appendix B for details). In $\phi_{1}$, the $g_{PMTP}$ submodule is pretrained on TR-NY-PIT-central - a new set of path traces. Path traces in $D^{Train}_{\phi_{1}}$ are generated from 17,000 routes in central Pittsburgh with a class label for the step count in the route. The distribution of step counts in $D^{Train}_{\phi_{1}}$ is 50 samples for routes with $\leq$7 steps and 300 samples for routes with $>$7 steps (see Appendix B). During training, samples from $D^{Train}_{\phi_{1}}$ are presented in standard orientation for 20 epochs and rotated 180° ahead of a second round of training. This rotation policy is preferred following empirical evaluation using standalone versions of the $g_{PMTP}$ submodule receiving two alternate preparations of $D^{Train}_{\phi_{1}}$ with random and 180° rotations. Training is formulated as multiclass classification with cross-entropy loss on a set of $M=66$ classes

$$g_{\phi_{1}}(x_{tr},\Theta) = B_{0} + \underset{i}{\arg\max}\sum^{M}_{j=1} B_{i}(x_{tr}, W_{j}) \qquad (7)$$

where a class is the step count, $B$ is the bias, and $i$ is the sample in the dataset.
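A sketch of the $\phi_{1}$ objective under the assumption that the ConvNeXt Tiny backbone comes from torchvision: path-trace images are classified into one of $M=66$ step-count classes with cross-entropy loss. Data loading and the rotation policy are omitted, and the stand-in tensors are placeholders.

```python
import torch
import torch.nn as nn
from torchvision.models import convnext_tiny

M = 66                                         # step-count classes in TR-NY-PIT-central
backbone = convnext_tiny(weights=None)         # the paper starts from a pretrained model
backbone.classifier[2] = nn.Linear(backbone.classifier[2].in_features, M)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(backbone.parameters(), lr=2.5e-3)

path_traces = torch.randn(8, 3, 224, 224)      # stand-in batch of path-trace images
step_counts = torch.randint(0, M, (8,))        # class label = number of steps in the route

optimizer.zero_grad()
loss = criterion(backbone(path_traces), step_counts)
loss.backward()
optimizer.step()
```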

Pretraining on $\phi_{2}$ for the feature-level localisation submodule $g_{PMF}$ is conducted with the component integrated in the framework FLPM and the new MC-10 dataset. Samples in $D^{Train}_{\phi_{2}}$ consist of 8,100 landmarks in 10 US cities. To demonstrate the utility of open source tools in designing systems for outdoor VLN, the generation process leverages free and accessible resources that enable targeted querying. Entity IDs for landmarks sourced from the Wikidata Knowledge Graph are the basis for downloading textual summaries and images from the MediaWiki and WikiMedia APIs. Additional details on MC-10 are available in Appendix B. The aim in generating the MC-10 dataset is to optimise $\Theta_{g_{PMF}}$ such that features relating to $Y_{\phi_{VLN}}$ are detected in inputs $X_{\phi_{VLN}}$. A multi-objective loss for $\phi_{2}$ consists of cross-modal matching over the paired samples $(x_{l}, x_{v})$ - and a second objective comprising a prediction on the geolocation of the entity. In the first objective, $g_{PMF}$ conducts a binary classification between the true $x_{l}$ matching $x_{v}$ and a second textual input selected at random from entities in the mini-batch. A limit of 540 tokens is set for all textual inputs and the classification in $g_{PMF}$ is performed on the first sentence for each entity. Parameters $\Theta_{g_{PMF}}$ are saved and used subsequently for feature-level localisation in $\phi_{VLN}$.
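The multi-objective loss for $\phi_{2}$ could be sketched as below, with the geolocation objective framed here as coordinate regression; the exact form of that term and the loss weights are assumptions rather than details given in this section.

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()   # objective 1: does this summary describe this landmark image?
mse = nn.MSELoss()             # objective 2: prediction of the landmark's WGS coordinates

def phi2_loss(match_logit, match_label, coord_pred, coord_true, alpha=1.0, beta=1.0):
    # alpha/beta weights are hypothetical; the paper does not state them here
    return alpha * bce(match_logit, match_label) + beta * mse(coord_pred, coord_true)

loss = phi2_loss(torch.randn(4, 1),
                 torch.randint(0, 2, (4, 1)).float(),
                 torch.randn(4, 2),    # predicted (lat, lon)
                 torch.randn(4, 2))    # true (lat, lon)
```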

3.1.2   Cross-modal Attention and Action Prediction on Combined Outputs

Resuming operations subsequent to the PM-VLN, outputs $e_{v_{t}}'$ from $Conv_{VBF}$ are passed together with $e_{l_{t}}$ to a VisualBERT embedding layer. Embeddings for both modalities are then processed by 4 transformer encoder layers with a hidden size of 256, and self-attention $\bigoplus$ is applied to learn alignments between the pairs

$$\widetilde{e}_{\eta} = \bigoplus(e_{l} \Longleftrightarrow e_{v}') = Soft\left(\sum^{\mathcal{E}}_{k=1} \mathcal{M}_{k}\,\text{\L}(\mathcal{E}_{k}, \widetilde{\mathcal{E}}_{k})\right) \qquad (8)$$

where $Soft$ is the softmax function, $k$ is the number of elements in the inputs, $\mathcal{M}_{k=1}$ is a masked element over the cross-modal inputs, Ł is the loss, $\mathcal{E}_{k}$ is an element in the input modality, and $\widetilde{\mathcal{E}}_{k}$ is the predicted element. Cross-modal embeddings resulting from this attention operation are processed by concatenating over layer outputs $g(\widetilde{e}_{\eta}') = (\widetilde{e}_{\mathcal{L}}^{1}, \widetilde{e}_{\mathcal{L}}^{2}, \widetilde{e}_{\mathcal{L}}^{3}, \widetilde{e}_{\mathcal{L}}^{4})$.

Architectural and embedding selections for our frameworks aim to enable comparison with benchmark systems on $\phi_{VLN}$. The $Enc_{Trans}$ in the best performing framework uses a standard VisualBERT encoder with a hidden size of 256 and 4 layers and attention heads. As noted above, inputs for $Enc_{Trans}$ align with those used in prior work (Zhu et al., 2021).

A submodule $g_{CFns}$ combines $U_{\eta}$ from $\mathcal{L}^{4}$ of the $Enc_{Trans}$ and outputs from the cross-modal attention operation $g(\widetilde{E}_{\eta}')$ ahead of applying dropout. Predictions for navigation actions are the outputs of a classifier block consisting of linear layers with maxout activation. Maxout activation in a block composed of linear operations takes $\max z_{ij}$, where the $z_{ij}$ are the products $x_{ij}W_{n*}$ over $k$ layers. In contrast to ReLU, the activation function is learned and prevents the unit saturation associated with performing dropout (Goodfellow et al., 2013). We compare a standard classifier to one with $\max x_{i}$ in Table 2. Improvements with $\max x_{i}$ are consistent with a requirement to offset variance when training with the high number of layers in the full FLPM framework.
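A minimal sketch of a maxout classification block of the kind described above (Goodfellow et al., 2013), assuming PyTorch: $k$ parallel linear maps are applied and the element-wise maximum is taken before the action layer. Sizes and the dropout rate are illustrative.

```python
import torch
import torch.nn as nn

class MaxoutClassifier(nn.Module):
    def __init__(self, in_dim=512, hidden=256, k=2, n_actions=4, p_drop=0.1):
        super().__init__()
        self.pieces = nn.ModuleList([nn.Linear(in_dim, hidden) for _ in range(k)])
        self.dropout = nn.Dropout(p_drop)
        self.out = nn.Linear(hidden, n_actions)

    def forward(self, z):
        # z: combination of main-model and PM-VLN embeddings from g_CFns
        h = torch.stack([lin(z) for lin in self.pieces], dim=0).max(dim=0).values
        return self.out(self.dropout(h))               # logits over navigation actions

logits = MaxoutClassifier()(torch.randn(1, 512))
```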

3.2   Theoretical Basis

This section provides a theoretical basis for a hierarchical process of cross-modal prioritisation that optimises attention over linguistic and visual inputs. In this section we use $q$ to denote this process for convenience. During the main task $\phi_{VLN}$, $q$ aligns elements in temporal sequences $\uptau$ and $Route$ and localises spans and visual features w.r.t. a subset of all entities $Ent$ in the routes:

$$q_{PM} = \lVert x_{l} - x_{v}\rVert \underset{subject\,to}{\rightarrow} \max\; P_{D_{Ent}}[\uptau, Route] \leq R \qquad (9)$$

Inputs in $\phi_{VLN}$ consist of a linguistic sequence $\uptau$ and a visual sequence $Route$ for each trajectory $j$ in a set of trajectories. As a result of the process followed by Chen et al. (2019) to create the Touchdown task, these inputs conform to the following definition.

Definition 1 (Sequences refer to corresponding entities). At each step in $j$, $|x_{l}|$ and $|x_{v}|$ are finite subsequences drawn from $\uptau_{j}$ and $Route_{j}$ that refer to corresponding entities appearing in the trajectory $ent_{j}\subset Ent$.

In order to simplify the notation, these subsequences are denoted in this section as $x_{l}$ and $x_{v}$. Touchdown differs from other outdoor navigation tasks (Hermann et al., 2020) in excluding supervision on the alignment over cross-modal sequences. Furthermore $len(\uptau_{j})\neq len(Route_{j})$ and there are varying counts of subsequences and entities in trajectories. In an approach to $\phi_{VLN}$ formulated as supervised classification, an agent's action at each step $\alpha_{t}\equiv$ classification $c_{t}\in\{0,1\}$, where $c$ is based on corresponding $ent_{t}$ in the pair $(x_{l}, x_{v})_{t}$. The likelihood that $c_{t}$ is the correct action depends in turn on detecting signal $S$ in the form of $ent_{t}$ from noise in the inputs. The objective of $q$ then is to maximise $P_{S}$ for each point in the data space.

The process defining $q$ is composed of multiple operations to perform two functions of high-level alignment $g_{Align}$ and localisation $g_{Loc}$. At the current stage $stg$, function $g_{Align}$ selects one set of spans $\varphi_{stg}\in(\varphi_{1},\varphi_{2},\dots,\varphi_{n})$

where $stg$ $\begin{cases} Start, & \text{if } t=0\\ End, & \text{if } t=-1\\ \forall\ stg_{other},\ n\in N\in\sum_{n=1}^{n_{1}} > t_{-1}, & \text{otherwise.}\end{cases}$

This is followed by the function $g_{Loc}$, which predicts one of $\varsigma_{scnt_{0}} \lor \varsigma_{scnt_{0-1}}$ as the span $\varsigma$ relevant to the current trajectory step $scnt$

where $scnt$ $\begin{cases} scnt_{0}, & \text{if } (\uptau,\psi_{t})=0\\ scnt_{0-1}, & \text{otherwise.}\end{cases}$

We start by describing the learning process when the agent in $\phi_{VLN}$ is a transformer-based architecture $Enc+Clas$ excluding an equivalent to $q$ (e.g. VisualBERT in Table 1 of the main report). $Enc+Clas$ is composed of two core subprocesses: cross-modal attention to generate representations $q(\bigoplus(L\Longleftrightarrow\widetilde{V}))$ and a subsequent classification $Clas(\widetilde{e}_{\eta}')$.

Definition 2 (Objective in $Enc+Clas$). The objective $Obj_{1}(\theta)$ for algorithm $q(\bigoplus(L\Longleftrightarrow\widetilde{V}))$, where $L$ and $V$ are each sequences of samples $\{x_{1},x_{2},\dots,x_{n}\}$, is the correspondence between samples $x_{l}$ and $x_{v}$ presented at step $t$ in $\sum^{n}_{i=1} t_{i} = t_{1}+t_{2}+\dots+t_{n}$.

It is observed that in the learning process for $Enc+Clas$, any subprocesses to align and localise finite sequences $x_{l}$ and $x_{v}$ w.r.t. $ent_{j}$ are performed as implicit elements in the process of optimising $Obj_{1}(\theta)$. In contrast, the basis for the hierarchical learning process enabled by our framework FLPM - which incorporates $q_{PM}$ with explicit functions for these steps - is given in Theorem 1.

Theorem 1. Assuming $x_{l}$ and $x_{v}$ conform to Definition 1 and that $\forall\ x\in L\ \exists\ x\in V$, an onto function $g_{Map}=mx+b,\ m\neq 0$ exists such that:

$$g_{Map}(x_{l},x_{v}) \rightarrow \max\left[ent^{(x_{l},x_{v})}_{j}\in Ent\right] \qquad (10)$$

In this case, additional functions $g_{Align}$ and $g_{Loc}$ - when implemented in order - maximise $g_{Map}$:

$$\max P_{D_{ent_{j}}} = \max g_{Map}(x_{l},x_{v}) \underset{subject\,to}{\rightarrow} \left(\overrightarrow{g_{Align}, g_{Loc}, g_{Map}}\right)\ \forall\ ent^{(x_{l},x_{v})}_{j}\in L_{j}\cap V_{j} \qquad (11)$$

Remark 1. Let $P(\max g_{Map})$ in Theorem 1 be the probability of optimising $g_{Map}$ such that the number of pairs $N^{(x_{l},x_{v})}$ corresponding to $ent_{j}\in L_{j}\cap V_{j}$ is maximised. It is noted that $N^{(x_{l},x_{v})}$ is determined by all possible outcomes in the set of cases $\{(x_{l},x_{v})\Leftrightarrow ent_{j},\ (x_{l},x_{v})\nLeftrightarrow ent_{j},\ x_{l}\nLeftrightarrow x_{v}\}$. As the sequences of instances $i$ in $x_{l}$, $x_{v}$ and $ent_{j}$ are forward-only, it is also noted that $N^{(x_{l},x_{v})}_{t+1} < N^{(x_{l},x_{v})}_{t}$ if $ent_{i}\not\in x_{l_{i}}$, $ent_{i}\not\in x_{v_{i}}$, or $ent^{x_{l}}_{i}\neq ent^{x_{v}}_{i}$. By definition, $N^{(x_{l},x_{v})}_{t+1} > N^{(x_{l},x_{v})}_{t}$ if $P(ent_{i}=x_{l_{i}}=x_{v_{i}})$ - where the latter probability is s.t. processes performed within finite computational time $CT(n)$ - which implies that $P(\max g_{Map})\,|\,P(ent_{i}=x_{l_{i}}=x_{v_{i}})$.

Remark 2. Following on from Remark 1, $CT(n^{P(ent_{i}=x_{l_{i}}=x_{v_{i}})})$ when $q$ contains $g_{t}$, with function $g_{t}(\max(N^{(x_{l},x_{v})}) \Rightarrow ent_{j}\in L_{j}\cap V_{j})$ where $g_{t}\in G$, $<\ CT(n^{P(ent_{i}=x_{l_{i}}=x_{v_{i}})})$ when $q$ does not contain $g_{t}$ $<\ CT(n^{P(ent_{i}=x_{l_{i}}=x_{v_{i}})})$ when $q$ contains $g_{t}$, with function $g_{t}(\max(N^{(x_{l},x_{v})}) \nRightarrow ent_{j}\in L_{j}\cap V_{j})$.

Discussion   In experiments, we expect from Remark 1 that results on $\phi_{VLN}$ for architectures such as $Enc+Clas$ - which exclude operations equivalent to those undertaken by the onto function $g_{Map}$ - will be lower than the results for a framework FLPM over a finite number of epochs. We observe this in Table 1 when comparing the performance of the respective standalone and + FLPM versions of the VisualBERT and VLN Transformer systems. Poor results for variants (a) and (h) in Tables 2 and 3 in comparison to FLPM + VisualBERT(4l) also support the expectation set by Remark 2 that performance will be highly impacted in an architecture where operations in $g_{Map}$ increase the number of misalignments.

Proof of Theorem 1   We use below $a*$ for a generic transformer-based system that predicts $\alpha$ on $(L, V)$, $\nabla x$ for gradients, and $\Theta^{a*}$ to denote $\Theta^{Enc+Clas}\ \nu\ \Theta^{Enc+q}$. Let sequence $x_{l}=[ent_{1},ent_{2},\dots,ent_{n_{1}}]$ and sequence $x_{v}=[ent_{1},ent_{2},\dots,ent_{n_{2}}]$, where $n_{1}$ and $n_{2}$ are unknown. We note that at any point during learning, $P_{S}(x_{l},x_{v})$ is spread unevenly over $ent_{j}$ in relation to $\Theta^{a*}\approx\mathcal{X}$.

Propositions   We start with the case that $\exists\ ent_{j}: ent^{(x_{l})}$ and $ent^{(x_{v})}$. Here $CT(n^{Ent\in L\cap V})$ for $\Theta^{a*+g_{t}} < CT(n^{Ent\in L\cap V})$ for $\Theta^{a*}$, where $g_{t}$ accounts for $\Delta(Len_{1}, Len_{2})$. We next consider the case where $\nexists\ ent_{j}: ent^{(x_{l})}\ \nu\ ent^{(x_{v})}$. Where $\nexists\ g_{Loc}$ then $P_{S}^{(x_{l},x_{v})} < \exists\ g_{Loc}\ P_{S}^{(x_{l},x_{v})}$. We conclude with the case where $\exists\ Ent: x_{l}\ \nu\ x_{v}$. In $P_{S}^{A*}$, $ent^{(x_{l})} \bigoplus ent^{(x_{v})}$ when $ent^{(x_{l})} \neq ent^{(x_{v})}$.

As $(Ent_{L}, Ent_{V}) \Rightarrow Ent$, $\Theta^{a*} \approx \max(N^{(x_{l},x_{v})}) \in \mathcal{X}$. $P_{S}^{(x_{l},x_{v})}$ where $ent_{i} = x_{l_{i}} = x_{v_{i}} > ent_{i} \in \Theta^{a*} \approx \max(N^{(x_{l},x_{v})})$. Furthermore $P\ \exists\ ent\in Ent \approx (ent_{i}) > \nexists\ ent \not\approx ent_{i}$. Therefore the slope $\nabla x$ increases and $CT(n^{Ent\in L\cap V})$ for $\Theta^{a*+q} < CT(n^{Ent\in L\cap V})$.

Inputs | System | Dev TC↑ | Dev SPD↓ | Dev SED↑ | Test TC↑ | Test SPD↓ | Test SED↑
Inputs (L, V) (non-transformer based) | GA a | 12.1 | 20.2 | 11.7 | 10.7 | 19.9 | 10.4
Inputs (L, V) (non-transformer based) | RCONCAT a | 11.9 | 20.1 | 11.5 | 11.0 | 20.4 | 10.5
Inputs (L, V) (non-transformer based) | ARC+L2STOP* c | 19.5 | 17.1 | 19.0 | 16.7 | 18.8 | 16.3
Inputs (L, V) (transformer based) | VisualBERT(8l) | 10.4 | 21.3 | 10.0 | 9.9 | 21.7 | 9.5
Inputs (L, V) (transformer based) | VisualBERT(4l) | 14.3 | 17.7 | 13.7 | 11.8 | 18.3 | 11.5
Inputs (L, V) (transformer based) | VLN Transformer(4l) b | 12.2 | 18.9 | 12.0 | 12.8 | 20.4 | 11.8
Inputs (L, V) (transformer based) | VLN Transformer(8l) b | 13.2 | 19.8 | 12.7 | 13.1 | 21.1 | 12.3
Inputs (L, V) (transformer based) | VLN Transformer(8l) + M50 + style* b | 15.0 | 20.3 | 14.7 | 16.2 | 20.8 | 15.7
Inputs (L, V) + JD / HT** (non-transformer based) | ORAR (ResNet pre-final)* d | 26.0 | 15.0 | - | 25.3 | 16.2 | -
Inputs (L, V) + JD / HT** (non-transformer based) | ORAR (ResNet 4th-to-last)* d | 29.9 | 11.1 | - | 29.1 | 11.7 | -
Inputs (L, V) + Path Traces (transformer based) | VLN Transformer(8l) | 11.2 | 23.4 | 10.7 | 11.5 | 23.9 | 10.8
Inputs (L, V) + Path Traces (transformer based) | VisualBERT(4l) | 16.2 | 18.7 | 15.7 | 15.0 | 20.1 | 14.5
Inputs (L, V) + Path Traces (transformer based) | FLPM(4l) + VLN Transformer(8l) | 29.9 | 23.4 | 26.8 | 28.2 | 23.8 | 25.6
Inputs (L, V) + Path Traces (transformer based) | FLPM(4l) + VisualBERT(4l) | 33.0 | 23.6 | 29.5 | 33.4 | 23.8 | 29.7
  • Frameworks from a Chen et al. (2019), b Zhu et al. (2021), c Xiang et al. (2020), and d Schumann and Riezler (2022).

  • * Results reported by the authors.

  • ** Systems receive two types of features - Junction Type and Heading Delta - as inputs.

Table 1: Performance on the Touchdown benchmark ranked by TC on the test partition. Systems are grouped by input types during VLN and the use of transformer blocks in architectures. Contributions of the FLPM framework and path traces to improved performance are demonstrated with results for systems built on two baseline transformer-based architectures - VisualBERT and VLN Transformer. These baselines are also assessed in two sizes to test the benefits of adding transformer blocks.

4   Experiments

Our starting point in evaluating the PM-VLN module and FLPM is performance in relation to benchmark systems (see Table 1). Ablations are conducted by removing individual operations (see Table 2) and the role of training data is assessed (see Table 3). To minimise computational cost, we implement frameworks with low numbers of layers and attention heads in transformer models.

4.1   Experiment Settings

Metrics   We align with Chen et al. (2019) in reporting task completion (TC), shortest-path distance (SPD), and success weighted edit distance (SED) for $\phi_{VLN}$. All metrics are derived using the Touchdown navigation graph. TC is a binary measure of success $\{0,1\}$ in ending a route with a prediction $c_{t-1}^{o}=y_{t-1}^{o}$ or $c_{t-1}^{o}=y_{t-1}^{o-1}$, and SPD is calculated as the mean distance between $c_{t-1}^{o}$ and $y_{t-1}^{o}$. SED is the Levenshtein distance between the predicted path and the defined route path and is only applied when TC = 1.
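A hedged sketch of TC and SPD on a navigation graph, assuming networkx and graph-hop distances; the benchmark's SED (a Levenshtein distance over paths) is omitted for brevity.

```python
import networkx as nx

def task_completion(graph, predicted_end, goal):
    # success if the route ends at the goal node or a directly adjacent node
    return int(predicted_end == goal or predicted_end in graph.neighbors(goal))

def shortest_path_distance(graph, predicted_end, goal):
    # graph distance between the predicted final node and the goal
    return nx.shortest_path_length(graph, predicted_end, goal)

g = nx.path_graph(5)                        # toy navigation graph with nodes 0..4
assert task_completion(g, 3, goal=4) == 1   # one node from the goal counts as success
assert shortest_path_distance(g, 1, goal=4) == 3
```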

Hyperparameter Settings   Frameworks are trained for 80 epochs with a batch size of 30. Scores are reported for the epoch with the highest SPD on $\mathcal{D}_{\phi_{VLN}}^{Dev}$. Pretraining for the PM-VLN module is conducted for 10 epochs with batch sizes of 60 for $\phi_{1}$ and 30 for $\phi_{2}$. Frameworks are optimised using AdamW with a learning rate of $2.5\times10^{-3}$ (Loshchilov and Hutter, 2017).

4.2   Touchdown

Experiment Design: Chen et al. (2019) define two separate tasks in the Touchdown benchmark: VLN and spatial description resolution. This research aligns with other studies (Zhu et al., 2021, 2022) in conducting evaluation on the navigation component as a standalone task.

Dataset and Data Preprocessing: Frameworks are evaluated on full partitions of Touchdown with $D^{Train}=6{,}525$, $D^{Dev}=1{,}391$, and $D^{Test}=1{,}409$ routes. Trajectory lengths vary, with $D^{Train}=34.2$, $D^{Dev}=34.1$, and $D^{Test}=34.4$ mean steps per route. Junction Type and Heading Delta are additional inputs generated from the environment graph and low-level visual features (Schumann and Riezler, 2022). M-50 + style is a subset of the StreetLearn dataset with $D^{Train}=30{,}968$ routes of 50 nodes or less and multimodal style transfer applied to instructions (Zhu et al., 2021).

Embeddings: All architectures evaluated in this research receive the same base cross-modal embeddings $x_{\eta}$ proposed by Zhu et al. (2021), which are learned by combining the outputs of a pretrained BERT-base encoder with 12 encoder layers and attention heads. At each step, a fully connected layer is used for textual embeddings $\varsigma_{t}$ and a 3-layer CNN returns the perspective $\psi_{t}$. FLPM frameworks also receive an embedding of the path trace $tr_{t}$ at step $t$. As this constitutes additional signal on the route, we evaluate a VisualBERT model (4l) that also receives $tr_{t}$, which in this case is combined with $\psi_{t}$ ahead of inclusion in $x_{\eta_{t}}$.

Results: In Table 1 the first block of frameworks consists of architectures composed primarily of convolutional and recurrent layers. VLN Transformer is a framework proposed by Zhu et al. (2021) for the Touchdown benchmark and consists of a transformer-based cross-modal encoder with 8 encoder layers and 8 attention heads. VLN Transformer + M50 + style is a version of this framework pretrained on the dataset described above. To our knowledge, this was the transformer-based framework with the highest TC on Touchdown preceding our work. ORAR (ResNet 4th-to-last) (Schumann and Riezler, 2022) is from work published shortly before the completion of this research and attains the highest TC in prior work using two additional feature types. Standalone VisualBERT models are evaluated in two versions with 4 and 8 layers and attention heads. A stronger performance by the smaller version indicates that adding self-attention layers is unlikely to improve VLN predictions. This is further supported by the closely matched results for the VLN Transformer(4l) and VLN Transformer(8l). FLPM frameworks incorporate the PM-VLN module pretrained on auxiliary tasks $(\phi_{1},\phi_{2})$ - and one of VisualBERT(4l) or VLN Transformer(8l) as the main model. Performance on TC for both of these transformer models doubles when integrated into the framework. A comparison with results for standalone VisualBERT and VLN Transformer systems that receive path traces supports the use of specific architectural components that can exploit this additional input type. Weaker SPD results for systems run with the FLPM framework reflect a higher number of routes where a stop action was predicted prior to route completion. Although not a focus for the current research, this shortcoming in VLN benchmarks has been addressed in other work (Xiang et al., 2020; Blukis et al., 2018).

4.3   Assessment of Specific Operations

Ablations are conducted on the framework with the highest TC i.e. FLPM + VisualBERT(4l). The tests do not provide a direct measure of operations as subsequent computations in forward and backward passes by retained components are not accounted for. Results indicate that initial alignment is critical to cross-modal prioritisation and support the use of in-domain data during pretraining.

System | Dev TC↑ | Dev SPD↓ | Dev SED↑
FLPM + VisualBERT(4l) | 33.0 | 23.6 | 29.5
PM-VLN: $g_{PMTP}$ (a) | 7.1 | 26.8 | 6.8
PM-VLN: $g_{PMF}$ minus $g_{VBF}$ (b) | 27.9 | 25.7 | 24.9
PM-VLN: $g_{PMF}$ minus $\imath_{t-1}$ (c) | 29.8 | 21.8 | 27.2
FLPM: $g_{Attn}$ with $g_{Cat}$ (d) | 18.8 | 30.5 | 16.4
FLPM: $g_{Clas_{max\,x_{i}}}$ with $g_{Clas}$ (e) | 31.7 | 21.9 | 28.2
Table 2: Ablations on core operations in the PM-VLN (variants (a)-(c)) and the FLPM framework (variants (d) and (e)).

Ablation 1: PM-VLN   Prioritisation in the PM-VLN module constitutes a sequential chain of operations. Table 2 reports results for variants of the framework where the PM-VLN excludes individual operations. Starting with $g_{PMTP}$, trajectory estimation is replaced with a fixed count of 34 steps for each route $tr_{t}$ (see variant (a)). This deprives the PM-VLN of a method to take account of the current route when synchronising $\uptau$ and sequences of visual inputs. All subsequent operations are impacted and the variant reports low scores for all metrics. Two experiments are then conducted on $g_{PMF}$. In variant (b), visual boost filtering is disabled and feature-level localisation relies on a base $\psi_{t}$. A variant excluding linguistic components from $g_{PMF}$ is then implemented by specifying $\imath_{t}$ as the default input from $\uptau_{t}$ (see variant (c)). In practice, span selection in this case is based on trajectory estimation only.

Ablation 2: FLPM   Ablations conclude with variants of FLPM where core functions are excluded from other submodules in the framework. Results for variant (d) demonstrate the impact of replacing the operation defined in Equation 3 with a simple concatenation of outputs from the PM-VLN $e_{l}$ and $e_{v}'$. A final experiment compares methods for generating action predictions: in variant (e), $g_{Clas_{max\,x_{i}}}$ is replaced by the standard implementation for classification in VisualBERT. Classification with dropout and a single linear layer underperforms our proposal by 1.3 points on TC.

4.4   Assessment of Training Strategy

A final set of experiments is conducted to measure the impact of training data for auxiliary tasks $(\phi_{1},\phi_{2})$.

Training Strategy 1: Exploiting Street Pattern in Trajectory Estimation   We conduct tests on alternate samples to examine the impact of route types in $D^{Train}_{\phi_{1}}$. The module for FLPM frameworks in Table 1 is trained on path traces drawn from an area in central Pittsburgh (see SupMat:Sec.3) with a rectangular street pattern that aligns with the urban grid type (Lynch, 1981) found in the location of routes in Touchdown. Table 3 presents results for modules trained on routes selected at random outside of this area. In variants (f) and (g), versions V2 and V3 of $D^{Train}_{\phi_{1}}$ each consist of 17,000 samples drawn at random from the remainder of a total set of 70,000 routes. Routes that conform to curvilinear grid types are observable in outer areas of Pittsburgh. Lower TC for these variants prompts consideration of street patterns when generating path traces. A variant (h) where the $g_{PMTP}$ submodule receives no pretraining underlines - along with variant (a) in Table 2 - the importance of the initial alignment step to our proposed method of cross-modal prioritisation.

Training Strategy 2: In-domain Data and Feature-Level Localisation   We conclude by examining the use of in-domain data when pretraining the $g_{PMF}$ submodule ahead of feature-level localisation operations in the PM-VLN. In Table 3, versions of FLPM are evaluated subsequent to pretraining with varying sized subsets of the Conceptual Captions dataset (Sharma et al., 2018). This resource of general image-text pairs is selected as it has been proposed for pretraining VLN systems (see below). Samples are selected at random and grouped into two training partitions equivalent in number to 100% (variant (i)) and 150% of $D^{Train}_{\phi_{2}}$ (variant (j)). In place of the multi-objective loss applied to the MC-10 dataset, $\theta_{g_{PMF}}$ are optimised on a single goal of cross-modal matching. Variant (k) assesses FLPM when no pretraining for the $g_{PMF}$ submodule is undertaken. Lower results for variants (i), (j), and (k) support pretraining on small sets of in-domain data as an alternative to optimising VLN systems on large-scale datasets of general samples.

System | Dev TC↑ | Dev SPD↓ | Dev SED↑
FLPM + VisualBERT(4l) | 33.0 | 23.6 | 29.5
Pretraining for $g_{PMTP}$: $g_{PMTP}+D^{Train}_{\phi_{1}}$ V2 (f) | 11.9 | 20.1 | 11.5
Pretraining for $g_{PMTP}$: $g_{PMTP}+D^{Train}_{\phi_{1}}$ V3 (g) | 13.6 | 20.5 | 13.1
Pretraining for $g_{PMTP}$: no pretraining (h) | 4.7 | 27.6 | 1.9
Pretraining for $g_{PMF}$: $g_{PMF}+D^{Train}_{\phi_{2}}$ V2 (i) | 19.8 | 23.2 | 17.2
Pretraining for $g_{PMF}$: $g_{PMF}+D^{Train}_{\phi_{2}}$ V3 (j) | 23.9 | 20.8 | 20.3
Pretraining for $g_{PMF}$: no pretraining (k) | 6.3 | 25.1 | 4.6
Table 3: Assessment of the pretraining strategy for individual PM-VLN submodules $g_{PMTP}$ (variants (f) to (h)) and $g_{PMF}$ (variants (i) to (k)) using alternative datasets for auxiliary tasks. Variants are also run with no pretraining of $g_{PMTP}$ and $g_{PMF}$.

5   Related Work

This research aims to extend cross-disciplinary links between machine learning and computational cognitive neuroscience in the study of prioritisation in attention. This section starts with a summary of literature in these two disciplines that uses computational methods to explore this subject. Our training strategy is positioned in the context of prior work on pretraining frameworks for VLN. The section concludes with work related to the alignment and feature-level operations performed by our PM-VLN module.

Computational Implementations of Prioritisation in Attention Denil et al. (2012) proposed a model that generates saliency maps where feature selection is dependent on high-level signals in the task. The full system was evaluated on computer vision tasks where the aim is to track targets in video. A priority map computation was implemented in object detection models by Wei et al. (2016) to compare functions in these systems to those observed in human visual attention. Anantrasirichai et al. (2017) used a Support Vector Machine classifier to model visual attention in human participants traversing four terrains. Priority maps were then generated to study the interaction of prioritised features and a high-level goal of maintaining smooth locomotion. A priority map component was incorporated into a CNN-based model of primate attention mechanisms by Zelinsky and Adeli (2019) to prioritise locations containing classes of interest when performing visual search. Studies on spatial attention in human participants have explored priority map mechanisms that process inputs consisting of auditory stimuli and combined linguistic and visual information (Golob et al., 2017; Cavicchio et al., 2014). To our knowledge, our work is the first to extend neuropsychological work on prioritisation over multiple modalities to a computational implementation of a cross-modal priority map for machine learning tasks.

Pretraining for VLN Tasks Two forms of data samples - in-domain and generic - are used in pretraining prior to conducting VLN tasks. In-domain data samples have been sourced from image-caption pairs from online rental listings (Guhur et al., 2021) and other VLN tasks (Zhu et al., 2021). In-domain samples have also been generated by augmenting or reusing in-task data (Fried et al., 2018; Huang et al., 2019; Hao et al., 2020; He et al., 2021; Schumann and Riezler, 2022). Generic samples from large-scale datasets designed for other Vision-Language tasks have been sourced to improve generalisation in transformer-based VLN agents. Majumdar et al. (2020) conduct large-scale pretraining with 3.3M image-text pairs from Conceptual Captions Sharma et al. (2018) and Qi et al. (2021) initialise a framework with weights trained on four out-of-domain datasets. Our training strategy relies on datasets with a few thousand samples derived from resources where additional samples are available at low cost.

Methods for Aligning and Localising Features in Linguistic and Visual Sequences   Alignment in multimodal tasks is often posited as an implicit subprocess in an attention-based component of a transformer (Tsai et al., 2019; Zhu et al., 2021). Huang et al. (2019) identified explicit cross-modal alignment as an auxiliary task that improves agent performance in VLN. Alignment in this case is measured as a similarity score on inputs from the main task. In contrast, our PM-VLN module conducts a hierarchical process of trajectory planning and learned localisation to pair inputs. A similarity measure was the basis for an alignment step in the Vision-Language Pretraining framework proposed by Li et al. (2021). A fundamental point of difference with our work is that this framework - along with related methods (Jia et al., 2021) - is trained on a distinct class of tasks where the visual input is a single image as opposed to a temporal sequence. Several VLN frameworks containing components that perform feature localisation on visual inputs have been pretrained on object detection (Majumdar et al., 2020; Suglia et al., 2021; Hu et al., 2019). In contrast, we include visual boost filtering in $g_{PMF}$ to prioritise visual features. Our method of localising spans using a concatenation of the enhanced visual input and cross-modal embeddings is unique to this research.

6   Conclusion

We take inspiration from a mechanism described in neurophysiological research with the introduction of a priority map module that combines temporal sequence alignment enabled by high-level trajectory estimation and feature-level localisation. Two new resources comprised of in-domain samples and a tailored training strategy are proposed to enable data-efficient pretraining of the PM-VLN module ahead of the main VLN task. A novel framework enables action prediction with maxout activation on a combination of the outputs from the PM-VLN module and a transformer-based encoder. Evaluations demonstrate that our module, framework, and pretraining strategy double the performance of standalone transformers in outdoor VLN.

7   Acknowledgments

This publication is supported by the Digital Visual Studies program at the University of Zurich and funded by the Max Planck Society. RS has received funding by the Swiss National Science Foundation (project MUTAMUR; no. 176727). The authors would like to thank Howard Chen, Piotr Mirowski, and Wanrong Zhu for assistance and feedback on questions related to the Touchdown task, data, and framework evaluations.

References

  • Anantrasirichai et al. (2017) N. Anantrasirichai, K. A. Daniels, J. F. Burn, I. D. Gilchrist, and D. R. Bull. Fixation prediction and visual priority maps for biped locomotion. IEEE Transactions on Cybernetics, 48(8):2294–2306, 2017.
  • Armitage et al. (2020) J. Armitage, E. Kacupaj, G. Tahmasebzadeh, M. Maleshkova, R. Ewerth, and J. Lehmann. Mlm: A benchmark dataset for multitask learning with multiple languages and modalities. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pages 2967–2974, 2020.
  • Bayer (1986) B. Bayer. A method for the digital enhancement of unsharp, grainy photographic images. Advances in Computer Vision and Image Processing, 2:Chapter–2, 1986.
  • Bisley and Mirpour (2019) J. W. Bisley and K. Mirpour. The neural instantiation of a priority map. Current Opinion in Psychology, 29:108–112, 2019.
  • Blukis et al. (2018) V. Blukis, D. Misra, R. A. Knepper, and Y. Artzi. Mapping navigation instructions to continuous control actions with position-visitation prediction. In Conference on Robot Learning, pages 505–518. PMLR, 2018.
  • Carranza-Rojas et al. (2019) J. Carranza-Rojas, S. Calderon-Ramirez, A. Mora-Fallas, M. Granados-Menani, and J. Torrents-Barrena. Unsharp masking layer: injecting prior knowledge in convolutional networks for image classification. In International Conference on Artificial Neural Networks, pages 3–16. Springer, 2019.
  • Cavicchio et al. (2014) F. Cavicchio, D. Melcher, and M. Poesio. The effect of linguistic and visual salience in visual world studies. Frontiers in Psychology, 5:176, 2014.
  • Chen et al. (2019) H. Chen, A. Suhr, D. Misra, N. Snavely, and Y. Artzi. Touchdown: Natural language navigation and spatial reasoning in visual street environments. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12538–12547, 2019.
  • Chen et al. (2020) S. Chen, Y. Zhao, Q. Jin, and Q. Wu. Fine-grained video-text retrieval with hierarchical graph reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10638–10647, 2020.
  • Denil et al. (2012) M. Denil, L. Bazzani, H. Larochelle, and N. de Freitas. Learning where to attend with deep architectures for image tracking. Neural Computation, 24(8):2151–2184, 2012.
  • Fecteau and Munoz (2006) J. H. Fecteau and D. P. Munoz. Salience, relevance, and firing: a priority map for target selection. Trends in Cognitive Sciences, 10(8):382–390, 2006.
  • Fried et al. (2018) D. Fried, R. Hu, V. Cirik, A. Rohrbach, J. Andreas, L.-P. Morency, T. Berg-Kirkpatrick, K. Saenko, D. Klein, and T. Darrell. Speaker-follower models for vision-and-language navigation. Advances in Neural Information Processing Systems, 31, 2018.
  • Golob et al. (2017) E. J. Golob, K. B. Venable, J. Scheuerman, and M. T. Anderson. Computational modeling of auditory spatial attention. In CogSci, 2017.
  • Goodfellow et al. (2013) I. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. In International Conference on Machine Learning, pages 1319–1327. PMLR, 2013.
  • Gottlieb et al. (2020) J. Gottlieb, M. Cohanpour, Y. Li, N. Singletary, and E. Zabeh. Curiosity, information demand and attentional priority. Current Opinion in Behavioral Sciences, 35:83–91, 2020.
  • Guhur et al. (2021) P.-L. Guhur, M. Tapaswi, S. Chen, I. Laptev, and C. Schmid. Airbert: In-domain pretraining for vision-and-language navigation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1634–1643, 2021.
  • Hao et al. (2020) W. Hao, C. Li, X. Li, L. Carin, and J. Gao. Towards learning a generic agent for vision-and-language navigation via pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13137–13146, 2020.
  • Hayden and Gallant (2009) B. Y. Hayden and J. L. Gallant. Combined effects of spatial and feature-based attention on responses of v4 neurons. Vision Research, 49(10):1182–1187, 2009.
  • He et al. (2021) K. He, Y. Huang, Q. Wu, J. Yang, D. An, S. Sima, and L. Wang. Landmark-rxr: Solving vision-and-language navigation with fine-grained alignment supervision. Advances in Neural Information Processing Systems, 34, 2021.
  • Hermann et al. (2020) K. M. Hermann, M. Malinowski, P. Mirowski, A. Banki-Horvath, K. Anderson, and R. Hadsell. Learning to follow directions in street view. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 11773–11781, 2020.
  • Hu et al. (2019) R. Hu, D. Fried, A. Rohrbach, D. Klein, T. Darrell, and K. Saenko. Are you looking? grounding to multiple modalities in vision-and-language navigation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6551–6557, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1655. URL https://aclanthology.org/P19-1655.
  • Huang et al. (2019) H. Huang, V. Jain, H. Mehta, A. Ku, G. Magalhaes, J. Baldridge, and E. Ie. Transferable representation learning in vision-and-language navigation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7404–7413, 2019.
  • Itti and Koch (2000) L. Itti and C. Koch. A saliency-based search mechanism for overt and covert shifts of visual attention. Vision Research, 40(10-12):1489–1506, 2000.
  • Jia et al. (2021) C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. Le, Y.-H. Sung, Z. Li, and T. Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pages 4904–4916. PMLR, 2021.
  • Le et al. (2022) T. Le, K. Pho, T. Bui, H. T. Nguyen, and M. Le Nguyen. Object-less vision-language model on visual question classification for blind people. In ICAART (3), pages 180–187, 2022.
  • Li et al. (2021) J. Li, R. Selvaraju, A. Gotmare, S. Joty, C. Xiong, and S. C. H. Hoi. Align before fuse: Vision and language representation learning with momentum distillation. Advances in Neural Information Processing Systems, 34, 2021.
  • Li et al. (2019) L. H. Li, M. Yatskar, D. Yin, C.-J. Hsieh, and K.-W. Chang. Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557, 2019.
  • Li et al. (2020) X. Li, X. Yin, C. Li, P. Zhang, X. Hu, L. Zhang, L. Wang, H. Hu, L. Dong, F. Wei, et al. Oscar: Object-semantics aligned pre-training for vision-language tasks. In European Conference on Computer Vision, pages 121–137. Springer, 2020.
  • Lin and Wang (2020) Y.-B. Lin and Y.-C. F. Wang. Audiovisual transformer with instance attention for audio-visual event localization. In Proceedings of the Asian Conference on Computer Vision, 2020.
  • Liu et al. (2022) Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, and S. Xie. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  • Loshchilov and Hutter (2017) I. Loshchilov and F. Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  • Lynch (1981) K. Lynch. A theory of good city form. MIT Press, Cambridge, MA, 1981.
  • Majumdar et al. (2020) A. Majumdar, A. Shrivastava, S. Lee, P. Anderson, D. Parikh, and D. Batra. Improving vision-and-language navigation with image-text pairs from the web. In European Conference on Computer Vision, pages 259–274. Springer, 2020.
  • Mirowski et al. (2018) P. Mirowski, M. Grimes, M. Malinowski, K. M. Hermann, K. Anderson, D. Teplyashin, K. Simonyan, A. Zisserman, R. Hadsell, et al. Learning to navigate in cities without a map. Advances in Neural Information Processing Systems, 31:2419–2430, 2018.
  • Ptak (2012) R. Ptak. The frontoparietal attention network of the human brain: action, saliency, and a priority map of the environment. The Neuroscientist, 18(5):502–515, 2012.
  • Qi et al. (2021) Y. Qi, Z. Pan, Y. Hong, M.-H. Yang, A. van den Hengel, and Q. Wu. The road to know-where: An object-and-room informed sequential bert for indoor vision-language navigation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1655–1664, 2021.
  • Schumann and Riezler (2022) R. Schumann and S. Riezler. Analyzing generalization of vision and language navigation to unseen outdoor areas. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7519–7532, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.518. URL https://aclanthology.org/2022.acl-long.518.
  • Sharma et al. (2018) P. Sharma, N. Ding, S. Goodman, and R. Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018.
  • Shinoda et al. (2001) H. Shinoda, M. M. Hayhoe, and A. Shrivastava. What controls attention in natural environments? Vision Research, 41(25-26):3535–3545, 2001.
  • Su et al. (2020) W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei, and J. Dai. VL-BERT: pre-training of generic visual-linguistic representations. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020.
  • Suglia et al. (2021) A. Suglia, Q. Gao, J. Thomason, G. Thattai, and G. Sukhatme. Embodied bert: A transformer model for embodied, language-guided visual task completion. arXiv preprint arXiv:2108.04927, 2021.
  • Tatler et al. (2011) B. W. Tatler, M. M. Hayhoe, M. F. Land, and D. H. Ballard. Eye guidance in natural vision: Reinterpreting salience. Journal of Vision, 11(5):5–5, 2011.
  • Tsai et al. (2019) Y.-H. H. Tsai, S. Bai, P. P. Liang, J. Z. Kolter, L.-P. Morency, and R. Salakhutdinov. Multimodal transformer for unaligned multimodal language sequences. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6558–6569, Florence, Italy, 2019. Association for Computational Linguistics.
  • Wei et al. (2016) Z. Wei, H. Adeli, M. H. Nguyen, G. Zelinsky, and D. Samaras. Learned region sparsity and diversity also predicts visual attention. Advances in Neural Information Processing Systems, 29, 2016.
  • Xiang et al. (2020) J. Xiang, X. Wang, and W. Y. Wang. Learning to stop: A simple yet effective approach to urban vision-language navigation. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 699–707, Online, Nov. 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.62. URL https://aclanthology.org/2020.findings-emnlp.62.
  • Zelinsky and Adeli (2019) G. J. Zelinsky and H. Adeli. Learning to attend in a brain-inspired deep neural network. Journal of Vision, 19(10):282d–282d, 2019.
  • Zelinsky and Bisley (2015) G. J. Zelinsky and J. W. Bisley. The what, where, and why of priority maps and their interactions with visual working memory. Annals of the New York Academy of Sciences, 1339(1):154, 2015.
  • Zhu et al. (2021) W. Zhu, X. Wang, T.-J. Fu, A. Yan, P. Narayana, K. Sone, S. Basu, and W. Y. Wang. Multimodal text style transfer for outdoor vision-and-language navigation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1207–1221, 2021.
  • Zhu et al. (2022) W. Zhu, Y. Qi, P. Narayana, K. Sone, S. Basu, X. Wang, Q. Wu, M. Eckstein, and W. Y. Wang. Diagnosing vision-and-language navigation: What really matters. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5981–5993, Seattle, United States, July 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.naacl-main.438. URL https://aclanthology.org/2022.naacl-main.438.

Appendices

Appendix A Notation

Notations used in multiple sections of this paper are defined here for fast reference. Auxiliary tasks $(\phi_{1},\phi_{2})$ and the main VLN task $\phi_{VLN}$ constitute the set of tasks $\Phi$. Inputs and embeddings are specified as $l$ (linguistic), $v$ (visual), and $\eta$ (multimodal). A complete textual instruction is denoted as $\uptau$, $\varsigma$ is a span, and $\psi$ is a perspective. Linguistic and visual inputs for the PM-VLN are denoted as $(\imath_{t}^{\prime},\psi_{t})$ and embeddings processed in prioritisation operations are $(e_{l},e_{v})_{t}$. In contrast, $U$ denotes a set of embeddings from the main model, which are derived from inputs $(\bar{e}_{\eta},\psi_{cat})$. The notations $\Delta$ and $\bigoplus$ denote visual boost filtering and self-attention operations respectively. Table 4 provides a reference for standard notation appearing throughout this paper. Other notations are defined in the sections where they are used.

Notation                      Usage in this paper
$A$                           Matrix
$AA$                          Identity matrix
$B, b$                        Bias
$\mathcal{D}$                 Dataset
$\{Train, Dev, Test\}$        Dataset partitions
$\exists$                     Exists
$\forall$                     For every (e.g. member in a set)
$g$                           Function
$H$                           Hypothesis
$\mathcal{L}$                 Layer of a model
$len$                         Length
$\mu$                         Mean
$n$                           Number of samples
$\nu$                         Or
$P$                           Probability
$q$                           Algorithm
$S$                           Signal detected
$\sigma$                      Standard deviation
$\Theta$                      Set of parameters
$W, w$                        Set of weights
$|x|$                         Sequence
$\triangleq$                  Equal by definition
Table 4: Reference List for Standard Notation.

Appendix B Datasets

B.1   Generation and Partition Sizes

The MC-10 dataset consists of visual, textual, and geospatial data for landmarks in 10 US cities. We generate the dataset with a modified version of the process outlined by Armitage et al. (2020). Two base entity IDs - Q2221906 ("geographic location") and Q83620 ("thoroughfare") - form the basis of queries to extract entities at a distance of $\leq 2$ hops in the Wikidata knowledge graph (https://query.wikidata.org/). Constituent cities are incorporated places with populations exceeding 1 million, ranked by population density, based on data for April 1, 2020 from the US Census Bureau (https://www.census.gov/programs-surveys/decennial-census/data/datasets.html). Images and coordinates are sourced from Wikimedia, and text summaries are extracted with the MediaWiki API. Geographical cells are generated using the S2 Geometry Library (https://code.google.com/archive/p/s2-geometry-library/) with $n$ entities per cell in the range $[1,5]$. Statistics for MC-10 are presented by partition in Table 5. As noted above, only a portion of the textual inputs are used in pretraining and experiments.

Table 5: Statistics for the MC-10 dataset by partition.
                                   Train     Development
Number of entities                 8,100     955
Mean length per text summary       727       745
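As a concrete illustration of the extraction step described above, the sketch below issues a simplified Wikidata query anchored on one of the base entity IDs. The property path used to approximate the two-hop constraint and the use of the SPARQLWrapper client are our own assumptions and do not reproduce the full generation pipeline.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Hypothetical simplification: fetch entities that are instances of the base class
# Q83620 ("thoroughfare") or of a subclass reachable in one P279 hop, i.e. within
# two hops of the base ID in the knowledge graph.
QUERY = """
SELECT DISTINCT ?entity ?entityLabel ?coord WHERE {
  ?entity wdt:P31/wdt:P279? wd:Q83620 .   # instance of Q83620 or a direct subclass
  ?entity wdt:P625 ?coord .               # coordinate location
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100
"""


def run_query(query: str):
    client = SPARQLWrapper("https://query.wikidata.org/sparql")
    client.setQuery(query)
    client.setReturnFormat(JSON)
    return client.query().convert()["results"]["bindings"]


if __name__ == "__main__":
    for row in run_query(QUERY):
        print(row["entityLabel"]["value"], row["coord"]["value"])
```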

TR-NY-PIT-central is a set of image files depicting path traces for trajectory plan estimation in two urban areas. Trajectories in central Manhattan are generated from routes in the Touchdown instructions (Chen et al., 2019). Links $E$ connecting nodes $O$ in the Pittsburgh partition of StreetLearn (Mirowski et al., 2018) are the basis for routes where at least one node is positioned in the bounding box delimited by the WGS84 coordinates (40° 27' 38.82", -80° 1' 47.85") and (40° 26' 7.31", -79° 59' 12.86"). Labels are defined by the step count $cnt$ in the route. Total trajectories sum to 9,325 in central Manhattan and 17,750 in Pittsburgh. In the latter location, routes are generated for all nodes, with 50 samples randomly selected where $cnt \leq 7$ and 200 samples where $cnt > 7$. The decision to generate a lower number of samples for shorter routes was informed by initial tests with the ConvNeXt Tiny model (Liu et al., 2022). We opt for a maximum $cnt$ of 66 steps to align with the longest route in the training partition of Touchdown. The resulting training partition for Pittsburgh consists of 17,000 samples and is the resource used to pretrain $g_{PMTP}$ in the PM-VLN module.
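To illustrate how path traces of this kind might be rendered, the sketch below plots a route's node coordinates and saves the result with a step-count label. The figure size, colours, and file naming are illustrative assumptions rather than the released generation script.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt


def save_path_trace(route, out_path):
    """Render a route (list of (lat, lon) nodes) as a path-trace image.

    The label is the step count, i.e. the number of links traversed
    (len(route) - 1), mirroring the cnt labels described above.
    """
    lats = [p[0] for p in route]
    lons = [p[1] for p in route]
    fig, ax = plt.subplots(figsize=(2.24, 2.24), dpi=100)   # assumed 224x224 px target
    ax.plot(lons, lats, linewidth=2, color="black")
    ax.scatter(lons[0], lats[0], color="green", s=20)        # start node
    ax.scatter(lons[-1], lats[-1], color="red", s=20)        # end node
    ax.axis("off")
    fig.savefig(out_path, bbox_inches="tight", pad_inches=0)
    plt.close(fig)
    return len(route) - 1   # cnt label


if __name__ == "__main__":
    demo_route = [(40.4445, -80.0020), (40.4452, -80.0011), (40.4460, -80.0003)]
    cnt = save_path_trace(demo_route, f"trace_cnt{len(demo_route) - 1}.png")
    print("saved trace with cnt =", cnt)
```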

B.2   Samples from Datasets

In auxiliary task $\phi_{2}$, the $g_{PMF}$ submodule of the PM-VLN is trained on visual, textual, and geodetic position data types. Path traces from TR-NY-PIT-central are used in $\phi_{1}$ to pretrain the $g_{PMTP}$ submodule on trajectory estimation. Samples of entities in MC-10 and path traces in TR-NY-PIT-central are presented in Figures 6 and 7.

Figure 6: Samples from the MC-10 dataset.
Figure 7: Samples from the TR-NY-PIT-central dataset with path traces representing routes in central Pittsburgh.

Appendix C Code and Data

Source code for the project and instructions to run the framework are released and maintained in a public GitHub repository under the MIT license (https://github.com/JasonArmitage-res/PM-VLN). Code for the environment, navigation, and training adheres to the codebases released by Zhu et al. (2021) and Chen et al. (2019), with the aim of enabling comparisons with benchmarks introduced in prior work on Touchdown. Full versions of the MC-10 and TR-NY-PIT-central datasets are published on Zenodo under a Creative Commons public license (https://zenodo.org/record/6891965#.YtwoS3ZBxD8).