
UnitedVLN: Generalizable Gaussian Splatting for Continuous Vision-Language Navigation

Guangzhao Dai1, Jian Zhao2*, Yuantao Chen3, Yusen Qin4, Hao Zhao4, Guosen Xie1, Yazhou Yao1,
Xiangbo Shu1*, Xuelong Li2*
1Nanjing University of Science and Technology  2Northwest Polytechnical University
3The Chinese University of Hong Kong, Shenzhen  4Tsinghua University
Abstract

Vision-and-Language Navigation (VLN), where an agent follows instructions to reach a target destination, has recently seen significant advancements. In contrast to navigation in discrete environments with predefined trajectories, VLN in Continuous Environments (VLN-CE) presents greater challenges, as the agent is free to navigate any unobstructed location and is more vulnerable to visual occlusions or blind spots. Recent approaches have attempted to address this by imagining future environments, either through predicted future visual images or semantic features, rather than relying solely on current observations. However, these RGB-based and feature-based methods lack either the high-level semantic detail or the intuitive appearance-level information crucial for effective navigation. To overcome these limitations, we introduce a novel, generalizable 3DGS-based pre-training paradigm, called UnitedVLN, which enables agents to better explore future environments by unitedly predicting high-fidelity 360° visual images and semantic features. UnitedVLN employs two key schemes: search-then-query sampling and separate-then-united rendering, which facilitate efficient exploitation of neural primitives, helping to integrate both appearance and semantic information for more robust navigation. Extensive experiments demonstrate that UnitedVLN outperforms state-of-the-art methods on existing VLN-CE benchmarks.

“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”

Albert Einstein

Figure 1: Main insights of UnitedVLN for VLN-CE: Unlike existing state-of-the-art methods that explore only either predicted images or features in future environments, our UnitedVLN fully integrates navigation cues from both appearance and semantic information. By leveraging two complementary rendering strategies—(1) appearance-level rendering (e.g., distinct colors) and (2) semantic-level rendering (e.g., appearance-similar elements like doors)—UnitedVLN enhances the agent’s ability to interpret instructions accurately and navigate complex spaces. The visualizations show how UnitedVLN’s combined approach results in more accurate route choices, reducing errors caused by occlusions or ambiguities that challenge purely RGB- or feature-based methods.

1 Introduction

Vision-and-Language Navigation (VLN) [26, 8, 7, 4] requires an agent to understand and follow natural language instructions to reach a target destination. This task has recently garnered significant attention in embodied AI [38, 27]. Unlike traditional VLN, where the agent navigates a predefined environment with fixed pathways, Continuous Environment VLN (VLN-CE) [3, 51] presents a more complex challenge. In VLN-CE, the agent is free to move to any unobstructed location, taking low-level actions such as moving forward 0.25 meters or turning 15 degrees. Consequently, the agent faces a higher risk of getting stuck or engaging in unproductive navigation behaviors, such as repeatedly hitting obstacles or oscillating in place, due to visual occlusions or blind spots in future environments. To address these challenges, recent methods [29, 51, 46, 24] have focused on anticipating future environments rather than relying only on current observations. In terms of synthesizing future observations at unvisited waypoints, these approaches, which predict either future visual images [46, 24] or semantic features [29, 51], fall into two main exploration paradigms: RGB-based and feature-based exploration. However, both paradigms often fail to integrate the intuitive appearance-level information with the semantic context needed for robust navigation in complex environments.

RGB-based Exploration: exploring future visual images. A straightforward approach to exploring future environments is to predict images of future scenes, as images contain rich, appearance-level information (e.g., color, texture, and lighting) that is crucial for holistic scene understanding. Building on this idea, some methods [46, 24] synthesize multiple future navigation trajectories by training an image generator [20] to predict panoramic images for path planning, demonstrating promising results. However, these methods risk lacking complex, high-level semantic detail. As illustrated in the bottom panel of Figure 1, the agent struggles to distinguish the high-level semantic differences between objects like a “door” and a “bedroom”, which may appear visually similar in the context of instructions.

Feature-based Exploration: exploring future semantic features. Rather than generating panoramic images, recent work [51] leverages a pre-trained Neural Radiance Field (NeRF) [34] model to predict refined future semantic features through ray-to-pixel volume rendering. However, relying on rendered features alone deprives the agent of intuitive appearance information (e.g., color, texture, or lighting), which helps it build stable knowledge about environments, such as general room-layout rules and geometric structure [46]; this gap can lead to more severe navigation errors. As shown in the top panel of Figure 1, the agent fails to accurately ground the color “black” in the phrase “black couch” when it encounters two differently colored couches, even though they are easily distinguishable at the visual appearance level.

In fact, human perception of an unknown environment is generally understood as a combination of appearance-level intuitive information and high-level semantic understanding, as suggested by studies in cognitive science [12, 21, 22]. Based on this insight, an agent designed to simulate human-like perception could likewise interpret instructions and navigate unseen environments. Recently, 3D Gaussian Splatting (3DGS) [23] has emerged in the computer graphics community for scene reconstruction, utilizing 3D Gaussians to speed up image rendering through a tile-based rasterizer. This motivates us to explore 3DGS to equip the VLN-CE agent with the capability to anticipate holistic future scenes while also learning semantic details from the generated future observations. Furthermore, although NeRF is slower than 3DGS in image rendering, it facilitates accurate semantic understanding thanks to its ray-to-pixel volume rendering, which provides a precise one-to-one correspondence between sampled rays and pixels [32]. This motivates us to explore an effective rendering strategy that integrates the semantic understanding of volume rendering with Gaussian rendering.

Based on the above analysis, we propose a generalizable 3DGS-based paradigm, named UnitedVLN, which simultaneously renders both visual images and semantic features at higher quality (360° views) from sparse neural points, enabling the agent to more effectively explore future environments in VLN-CE. UnitedVLN primarily consists of two key components. First, we exploit a Search-Then-Query (STQ) sampling scheme for efficient neural point selection: for any neural point in the feature/point cloud, the scheme searches its neighborhood and queries its K-nearest neighbors. Second, to enhance navigation robustness, we introduce a Separate-Then-United (STU) rendering scheme, a hybrid strategy that integrates efficient volume rendering with 3DGS to render high-level semantic features as well as visual images carrying appearance-level information. Our main contributions are highlighted as follows:

  • Unified VLN-CE Pre-training Paradigm: We propose UnitedVLN, a generalizable 3DGS-based pre-training framework. It simultaneously renders both high-fidelity 360° visual images and semantic features from sparse neural primitives, enabling the agent to effectively explore future environments in VLN-CE.

  • Search-Then-Query Sampling Scheme: We present a Search-Then-Query (STQ) sampling scheme for efficient selection of neural primitives. For each 3D position, the scheme searches for neighboring points and queries their K-nearest neighbors, improving model efficiency and resource utilization.

  • Separate-Then-United Rendering Scheme: We present a Separate-Then-United (STU) rendering scheme that integrates volume rendering in NeRF, for predicting high-level semantic features, with splatting in 3DGS, for rendering visual images with appearance-level information, thereby enhancing the model’s robustness in complex scenes.

Figure 2: Overall framework of UnitedVLN. UnitedVLN obtains full, higher-fidelity 360° visual observations, i.e., visual images and semantic features, through three stages: Initialization, Querying, and Rendering. In Initialization, it encodes the already-observed environments, i.e., visited and current observations, into a point cloud and a feature cloud. In Querying, it adopts a Search-Then-Query (STQ) sampling scheme for efficient neural point sampling: for any neural point in the feature/point cloud, it searches the point's neighborhood and queries its K-nearest points. The sampled neural points in the feature/point cloud are then fed into MLPs to regress neural radiance, volume density, and image/feature Gaussians, respectively. In Rendering, given the neural radiance and image/feature Gaussians from the previous stage, it adopts a Separate-Then-United (STU) rendering scheme to render semantic features with high-level information via NeRF, and a visual image (fused with the 3DGS-rendered feature map) with appearance-level information via 3DGS. Finally, the NeRF-rendered features and the 3DGS-rendered image are integrated to represent the semantic information of future environments in advance.

2 Related Work

Vision-and-Language Navigation (VLN).

VLN [4, 38, 27, 26] has recently achieved significant advances and has introduced several proxy tasks, e.g., step-by-step instructions [4, 27], dialog-based navigation [43], and object-based navigation [38, 54]. Among them, VLN in Continuous Environments (VLN-CE) [4, 13, 42, 49, 16] requires an agent to follow instructions and move freely to the target destination in a continuous environment. As in general VLN over discrete environments with predefined pathways, many previous VLN-CE methods focus only on the visited environment [36, 7, 14, 6, 19], neglecting exploration of the future environment and thus yielding poor navigation performance. Some recent works [51, 46, 24] therefore attempt to explore the future environment in advance, e.g., via future RGB images or features, instead of relying only on current observations. However, these RGB-based or feature-based methods rely on single, limited future observations [41, 30, 45, 11], lacking either intuitive appearance-level information or complex high-level semantic information. Different from them, we propose a 3DGS-based paradigm named UnitedVLN that obtains full, higher-fidelity 360° visual observations (both visual images and semantic features) for VLN-CE.

3D Scene Reconstruction.

Recently, neural scene representations such as Neural Radiance Fields (NeRF) [34] and 3D Gaussian Splatting (3DGS) [23] have been introduced [34, 35, 23], advancing progress in 3D scene reconstruction; more details can be found in [53]. These representations have attracted significant attention in the embodied AI community [10, 28], where several tasks have been extended with NeRF. However, NeRF typically requires densely sampling points along each ray and repeatedly querying global MLPs, which greatly increases rendering time and makes generalization to unseen scenes more challenging. More recently, 3D Gaussian Splatting (3DGS) [23] utilizes 3D Gaussians with learnable parameters to speed up image rendering through a tile-based rasterizer, which motivates us to leverage 3DGS to accelerate image rendering. To this end, we design a 3DGS-based VLN-CE model that fully explores future environments and equips the agent with a united understanding of appearance-level intuitive information and high-level semantic information.

3 Method

Task Setup.

UnitedVLN focuses on the VLN-CE task [26, 27], where the agent is required to follow natural-language instructions to reach a target location in a continuous environment. The action space in VLN-CE consists of a set of low-level actions, i.e., turn left 15 degrees, turn right 15 degrees, or move forward 0.25 meters. Following the standard panoramic VLN-CE setting [25, 17, 3], at each time step $t$ the agent receives a 360° panoramic observation consisting of 12 RGB images $\boldsymbol{\mathcal{I}}_{t}=\{\boldsymbol{r}_{t,i}\}_{i=1}^{12}$ and 12 depth images $\boldsymbol{\mathcal{D}}_{t}=\{\boldsymbol{d}_{t,i}\}_{i=1}^{12}$ surrounding its current location (i.e., 12 views separated by 30° each). For each episode, the agent also receives a language instruction $\boldsymbol{\mathcal{W}}$; it must understand $\boldsymbol{\mathcal{W}}$, exploit the panoramic observations at each step, and move to the target destination.

Overview of UnitedVLN.

The framework of the proposed UnitedVLN is shown in Figure 2. It proceeds through three stages, i.e., Initialization, Querying, and Rendering. In Initialization, it encodes the already-observed environments (i.e., visited and current observations) into a point cloud and a feature cloud (§ 3.1). In Querying, it efficiently regresses feature radiance and volume density in NeRF, and image/feature Gaussians in 3DGS, with the assistance of the proposed Search-Then-Query (STQ) sampling scheme (§ 3.2). In Rendering, it renders high-level semantic features via volume rendering in NeRF and appearance-level visual images via splatting in 3DGS, through the Separate-Then-United (STU) rendering scheme (§ 3.3). Finally, the NeRF-rendered features and the 3DGS-rendered representations are aggregated to obtain the future environment representation for navigation.

3.1 Neural Point Initialization

During navigation, the agent gradually stores the visual observations of each step online by projecting them, including the current node and visited nodes, into a point cloud $\mathcal{B}$ and a feature cloud $\mathcal{M}$. Among them, $\mathcal{M}$ is utilized to render high-level semantic features, while $\mathcal{B}$ uses denser points with color to render intuitive appearance-level information. Meanwhile, at each step, we also use a pre-trained waypoint predictor [17] to predict navigable candidate nodes, following the practice of prior VLN-CE works [2, 46, 3]. Note that $\mathcal{B}$ and $\mathcal{M}$ are constructed in a similar way and differ only in the projected content, i.e., images ($\mathcal{B}$) versus feature maps ($\mathcal{M}$). For readability, we omit the construction of $\mathcal{M}$ here; please see the supplementary material for details.

Point Cloud $\mathcal{B}$ stores the holistic appearance of observed environments (e.g., the current node and visited nodes) and consists of pixel-level point positions and colors, as shown in Figure 2. Specifically, at each time step $t$, we first use the 12 RGB images $\boldsymbol{\mathcal{I}}_{t}=\{\boldsymbol{r}_{t,i}\}_{i=1}^{12}$ to obtain pixel colors $\{\boldsymbol{c}_{t,i}\in\mathbb{R}^{H\times W\times 3}\}_{i=1}^{12}$, where $H\times W\times 3$ denotes the RGB image resolution. For brevity, we collapse the subscripts $(i,h,w)$ into a single index $l$, where $l$ ranges from 1 to $L$ and $L=12\cdot H\cdot W$. Then, we use the depth images $\boldsymbol{\mathcal{D}}_{t}=\{\boldsymbol{d}_{t,i}\}_{i=1}^{12}$ to obtain the point positions. Given the camera extrinsics $[\mathbf{R},\mathbf{T}]$ and intrinsics $\mathbf{K}$, each pixel of the $i$-th view image $\boldsymbol{r}_{t,i}$ is mapped to its 3D world position $\boldsymbol{P}_{t,l}=[p_{x},p_{y},p_{z}]$ using the depth image $\boldsymbol{d}_{t,i}$, as

\boldsymbol{P}_{t,l}=\left[\boldsymbol{d}_{t,l}\,\mathbf{R}^{-1}\mathbf{K}^{-1}\,[h\ w\ 1]^{T}-\mathbf{T}\right]^{T}. \qquad (1)

Based on this, at each step, we progressively insert the point colors and positions into the point cloud $\mathcal{B}$, as

\mathcal{B}_{t}=\mathcal{B}_{t-1}\cup\{[\boldsymbol{P}_{t,j},\boldsymbol{C}_{t,j}]\}_{j=1}^{J}. \qquad (2)
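As a concrete illustration of Eq. (1)-(2), the following is a minimal NumPy sketch (not the authors' implementation) of unprojecting one depth view into world-space points and accumulating the colored points into $\mathcal{B}$. The helper names `unproject_view` and `update_point_cloud` are hypothetical, and the pixel coordinates follow the $[h\ w\ 1]^{T}$ ordering of Eq. (1).

```python
import numpy as np

def unproject_view(depth, rgb, K, R, T):
    """depth: (H, W); rgb: (H, W, 3); K: (3, 3) intrinsics; R: (3, 3), T: (3,) extrinsics."""
    H, W = depth.shape
    hh, ww = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    pix = np.stack([hh, ww, np.ones_like(hh)], axis=-1).reshape(-1, 3).T   # [h w 1]^T, shape (3, HW)
    cam = np.linalg.inv(K) @ pix * depth.reshape(1, -1)                    # back-project, scale by depth
    world = np.linalg.inv(R) @ cam - T.reshape(3, 1)                       # Eq. (1): d R^{-1} K^{-1} [h w 1]^T - T
    return world.T, rgb.reshape(-1, 3)                                     # positions P_{t,l}, colors C_{t,l}

def update_point_cloud(B, views):
    """B: list of (positions, colors); views: iterable of (depth, rgb, K, R, T) for the 12 views."""
    for depth, rgb, K, R, T in views:
        B.append(unproject_view(depth, rgb, K, R, T))                      # Eq. (2): B_t = B_{t-1} ∪ {[P_t, C_t]}
    return B
```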

Similarly to Eq. 1, we project feature maps extracted by a pre-trained CLIP-ViT-B [39] to obtain the feature cloud $\mathcal{M}$, as

\mathcal{M}_{t}=\mathcal{M}_{t-1}\cup\{[\boldsymbol{Q}_{t,u},\boldsymbol{H}_{t,u},\boldsymbol{\theta}_{t,u},\boldsymbol{s}_{t,u}]\}_{u=1}^{U}. \qquad (3)

Here, $\boldsymbol{Q}_{t,u}$ and $\boldsymbol{H}_{t,u}$ denote the 3D positions and grid embeddings of the feature map, respectively, and $\boldsymbol{\theta}_{t,u}$ and $\boldsymbol{s}_{t,u}$ denote the feature point directions and feature scales, respectively.

3.2 Search-Then-Query Sampling

Sampling in 3DGS.

To obtain more representative points, we filter low-information or noisy points in the source point cloud $\mathcal{B}$ in two steps, i.e., point search and point query. In the point search, we construct a KD-Tree [15] to represent point occupancy via coarse grids. Given a threshold $\epsilon$, we detect the K nearest points through the distance matrix $\boldsymbol{D}_{\text{kn}}$:

\boldsymbol{D}_{\text{kn}}=\text{D}_{\text{oc}}^{\text{tree}}\left(\left\{\boldsymbol{p}_{i}\in\boldsymbol{P}\;\middle|\;d_{\text{oc}}(\boldsymbol{p}_{i})<\epsilon^{2}\right\},\text{K}\right), \qquad (4)

where $d_{\text{oc}}(\boldsymbol{p}_{i})$ denotes the squared distance from $\boldsymbol{p}_{i}$ to its nearest neighboring grid point, and $\text{D}_{\text{oc}}^{\text{tree}}$ denotes the K-nearest search over the occupancy tree. For the points retained by Eq. 4, we calculate the total distance (i.e., density) to their neighboring points. Then, we query the points at local maxima of the density distribution, which carry the most representative and dense structural information. In this way, the dense points $\boldsymbol{P}$ are reduced to sparse new points $\boldsymbol{P}^{\prime}$:

\boldsymbol{P}^{\prime}=\left\{\boldsymbol{q}_{i}\;\middle|\;i\in\Gamma\left(\frac{1}{\sum_{j}\boldsymbol{D}_{ij}}\right)\cup\Lambda\left(\frac{1}{\sum_{j}\boldsymbol{D}_{ij}}\right)\right\}, \qquad (5)

where $\boldsymbol{D}_{ij}$ denotes the distance between the $i$-th query point and its $j$-th neighbor, and $\Gamma$ and $\Lambda$ denote the density and peak selection functions, respectively.
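The sketch below illustrates one possible reading of the STQ sampling in Eq. (4)-(5), using a SciPy KD-tree in place of the occupancy tree $\text{D}_{\text{oc}}^{\text{tree}}$ and approximating the density/peak selectors $\Gamma$ and $\Lambda$ by keeping points whose inverse summed K-nearest distance (a density proxy) is a local maximum among their own neighbors. It is illustrative only and not the released code.

```python
import numpy as np
from scipy.spatial import cKDTree

def stq_sample(points, K=16, eps=1.0):
    tree = cKDTree(points)
    # Point search: squared distance to the nearest other point must be below eps^2 (Eq. 4).
    d1, _ = tree.query(points, k=2)                      # k=2: the first hit is the point itself
    valid = (d1[:, 1] ** 2) < eps ** 2
    # Point query: density proxy = 1 / (sum of K-nearest distances) as in Eq. (5).
    dK, nK = tree.query(points, k=K + 1)
    density = 1.0 / (dK[:, 1:].sum(axis=1) + 1e-8)
    # Keep valid points whose density is a local maximum among their K neighbors.
    is_peak = density >= density[nK[:, 1:]].max(axis=1)
    keep = valid & is_peak
    return points[keep]                                  # sparse new points P'

sparse_pts = stq_sample(np.random.rand(10000, 3).astype(np.float32))
```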

Based on $\boldsymbol{P}^{\prime}$ and their colors $\boldsymbol{C}^{\prime}$, we use a multi-input, single-output UNet-like architecture [47] to encode points at different scales and obtain neural descriptors $\boldsymbol{F}_{p}^{\prime}$. We then regress them into several Gaussian properties: rotation $\boldsymbol{R}\in\mathbb{R}^{4}$, scale factor $\boldsymbol{S}\in\mathbb{R}^{3}$, and opacity $\alpha\in[0,1]$, as

\boldsymbol{R},\boldsymbol{S},\alpha=\texttt{N}(\mathcal{H}_{\text{r}}(\boldsymbol{F}_{p}^{\prime})),\ \texttt{E}(\mathcal{H}_{\text{s}}(\boldsymbol{F}_{p}^{\prime})),\ \texttt{S}(\mathcal{H}_{\text{o}}(\boldsymbol{F}_{p}^{\prime})). \qquad (6)

Here, $\mathcal{H}_{\text{r}}$, $\mathcal{H}_{\text{s}}$, and $\mathcal{H}_{\text{o}}$ denote three different MLPs that predict the corresponding Gaussian properties, and N, E, and S denote the normalization operation, exponential function, and sigmoid function, respectively. We also use the neural descriptors $\boldsymbol{F}_{p}^{\prime}$ in place of colors to obtain feature Gaussians, following the point-rendering practice of [47]. Thus, the image Gaussians $\mathcal{G}_{\triangle}$ and feature Gaussians $\mathcal{G}_{\triangledown}$ are formulated as

\mathcal{G}_{\triangle},\ \mathcal{G}_{\triangledown}=\{\boldsymbol{R},\boldsymbol{S},\alpha,\boldsymbol{P}^{\prime},\boldsymbol{C}^{\prime}\},\ \{\boldsymbol{R},\boldsymbol{S},\alpha,\boldsymbol{P}^{\prime},\boldsymbol{F}_{p}^{\prime}\}. \qquad (7)
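A minimal PyTorch sketch of the property regression in Eq. (6)-(7) is given below: three MLP heads map the descriptors $\boldsymbol{F}_{p}^{\prime}$ to rotation, scale, and opacity with normalization, exponential, and sigmoid activations, as stated above. The layer widths are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianHeads(nn.Module):
    def __init__(self, d_desc=64):
        super().__init__()
        self.h_r = nn.Sequential(nn.Linear(d_desc, 64), nn.ReLU(), nn.Linear(64, 4))  # H_r
        self.h_s = nn.Sequential(nn.Linear(d_desc, 64), nn.ReLU(), nn.Linear(64, 3))  # H_s
        self.h_o = nn.Sequential(nn.Linear(d_desc, 64), nn.ReLU(), nn.Linear(64, 1))  # H_o

    def forward(self, f_p):                       # f_p: (N, d_desc) neural descriptors F_p'
        R = F.normalize(self.h_r(f_p), dim=-1)    # N(.): unit-norm quaternion rotation
        S = torch.exp(self.h_s(f_p))              # E(.): positive scale factors
        alpha = torch.sigmoid(self.h_o(f_p))      # S(.): opacity in [0, 1]
        return R, S, alpha

heads = GaussianHeads()
R, S, alpha = heads(torch.randn(4096, 64))
# Image Gaussians pair (R, S, alpha) with (P', C'); feature Gaussians reuse F_p' in place of colors.
```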

Sampling in NeRF.

For each point $q$ along the ray, we use a KD-Tree [15] to search the K-nearest features $\{\boldsymbol{h}_{k}\}_{k=1}^{K}$ within a radius $R$ in $\mathcal{M}$. Based on $\{\boldsymbol{h}_{k}\}_{k=1}^{K}$, for the sampled point $\boldsymbol{p}_{q}$ we use an MLP $\varphi_{1}$ to aggregate a new feature vector $\boldsymbol{f}_{q}^{\prime}$ that represents the local content of point $q$, as

\boldsymbol{f}_{q,h}=\varphi_{1}(\boldsymbol{f}_{q},\ \boldsymbol{p}_{q}-\boldsymbol{h}_{k}), \qquad (8)

\boldsymbol{f}_{q}^{\prime}=\sum_{k}^{K}s_{k}\frac{w_{k}}{\sum w_{k}}\boldsymbol{f}_{q,h},\quad w_{k}=\frac{1}{\|\boldsymbol{h}_{k}-\boldsymbol{p}_{q}\|}. \qquad (9)

Here, $w_{k}$ denotes the inverse distance, which makes closer neural points contribute more to the sampled point, $s_{k}$ denotes the feature scale (cf. Eq. 17), $\boldsymbol{f}_{q}$ denotes the feature embedding of $\boldsymbol{p}_{q}$, and $\boldsymbol{p}_{q}-\boldsymbol{h}_{k}$ denotes the relative position of $\boldsymbol{p}_{q}$ with respect to $\boldsymbol{h}_{k}$. Then, through two MLPs $\varphi_{2}$ and $\varphi_{3}$, we regress the view-dependent feature radiance $r$ and volume density $\sigma$ given the view direction $\theta_{k}$ and feature scale $s_{k}$ of $\boldsymbol{h}_{k}$:

r=\varphi_{2}(\boldsymbol{f}_{q}^{\prime},\ \theta_{k}), \qquad (10)

\sigma=\sum_{i}\varphi_{3}(\boldsymbol{f}_{q,h})\,\gamma_{i}\frac{w_{i}}{\sum_{i}w_{i}},\quad w_{i}=\frac{1}{\|\boldsymbol{h}_{k}-\boldsymbol{p}_{q}\|}. \qquad (11)
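Below is a minimal PyTorch sketch of the inverse-distance aggregation in Eq. (8)-(9). Since the sampled ray point carries no stored feature of its own, the sketch reads $\boldsymbol{f}_{q}$ in Eq. (8) as the gathered neighbor features around $\boldsymbol{p}_{q}$ (an interpretive assumption), and $\varphi_{1}$ is a stand-in two-layer MLP.

```python
import torch
import torch.nn as nn

class KNNAggregator(nn.Module):
    def __init__(self, d_feat=512):
        super().__init__()
        # Stand-in for φ1: fuses a neighbor feature with its relative offset to p_q.
        self.phi1 = nn.Sequential(nn.Linear(d_feat + 3, d_feat), nn.ReLU(),
                                  nn.Linear(d_feat, d_feat))

    def forward(self, p_q, h_pos, h_feat, s_k):
        # p_q: (3,) sampled ray point; h_pos: (K, 3); h_feat: (K, d); s_k: (K,) feature scales
        rel = p_q[None] - h_pos                                     # p_q - h_k
        f_qh = self.phi1(torch.cat([h_feat, rel], dim=-1))          # Eq. (8)
        w = 1.0 / (torch.norm(h_pos - p_q[None], dim=-1) + 1e-8)    # w_k = 1 / ||h_k - p_q||
        w = s_k * (w / w.sum())                                     # scale-modulated, normalized weights
        return (w[:, None] * f_qh).sum(dim=0)                       # Eq. (9): aggregated f_q'

agg = KNNAggregator(d_feat=512)
f_q_prime = agg(torch.zeros(3), torch.randn(16, 3), torch.randn(16, 512), torch.ones(16))
```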

3.3 Separate-Then-United Rendering

As shown in Figure 2, we render future observations, i.e., a feature map in the NeRF branch, and an image and a feature map in the 3DGS branch. Specifically, leveraging the differentiable rasterizer, we first render the image/feature Gaussians (cf. Eq. 19) into the image $\boldsymbol{I}_{g}$ and feature map $\boldsymbol{F}_{g}$ as

\boldsymbol{I}_{g},\ \boldsymbol{F}_{g}=\text{Raster}\{\mathcal{G}_{\triangle}\},\ \text{Raster}\{\mathcal{G}_{\triangledown}\}. \qquad (12)

Note that $\boldsymbol{I}_{g}$ maintains good quality when points are sufficient. However, when points are sparsely distributed over space, the representation ability of the corresponding Gaussians is constrained, especially in complicated local areas, resulting in low-quality visual representations [47]. In contrast, $\boldsymbol{F}_{g}$ contains rich context describing local geometric structures at various scales. In other words, $\boldsymbol{F}_{g}$ can be treated as a complement to $\boldsymbol{I}_{g}$. Therefore, we use multi-head cross-attention (CA) to inject context from $\boldsymbol{F}_{g}$ into $\boldsymbol{I}_{g}$ and promote the representation. The aggregated representation $\boldsymbol{f}_{g}^{rf}$ is formulated as

\boldsymbol{f}_{g}^{rf}=\text{CA}(\varphi_{\text{g1}}(\boldsymbol{I}_{g}),\ \varphi_{\text{g2}}(\boldsymbol{F}_{g}),\ \varphi_{\text{g2}}(\boldsymbol{F}_{g})), \qquad (13)

where $\varphi_{\text{g1}}$ and $\varphi_{\text{g2}}$ denote two CLIP-ViT-B [39] visual encoders for encoding the image and the feature map, respectively.
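The following is a minimal PyTorch sketch of Eq. (13), with simple linear projections standing in for the CLIP-ViT encoders $\varphi_{\text{g1}}$ and $\varphi_{\text{g2}}$; the rendered image tokens act as the query and the rendered feature tokens as key/value. Token counts and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

d = 512
proj_img = nn.Linear(768, d)     # stands in for φ_g1 (visual encoder applied to I_g)
proj_feat = nn.Linear(768, d)    # stands in for φ_g2 (visual encoder applied to F_g)
cross_attn = nn.MultiheadAttention(embed_dim=d, num_heads=8, batch_first=True)

tokens_img = proj_img(torch.randn(1, 197, 768))     # encoded rendered-image tokens (query)
tokens_feat = proj_feat(torch.randn(1, 197, 768))   # encoded rendered-feature tokens (key/value)
f_g_rf, _ = cross_attn(query=tokens_img, key=tokens_feat, value=tokens_feat)   # Eq. (13)
```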

In the NeRF branch, with the feature radiance $r$ and volume density $\sigma$ of the sampled points along each ray, we render the future feature $\boldsymbol{F}_{f}$ using volume rendering [34]. Similarly, we use a Transformer encoder $\varphi_{\text{f}}$ [44] to extract the feature embedding $\boldsymbol{f}_{n}^{f}$ of $\boldsymbol{F}_{f}$ in NeRF, as

\boldsymbol{f}_{n}^{f}=\varphi_{\text{f}}(\boldsymbol{F}_{f}). \qquad (14)

To improve navigation robustness, we aggregate the future feature embeddings $\boldsymbol{f}_{g}^{rf}$ (cf. Eq. 13) and $\boldsymbol{f}_{n}^{f}$ (cf. Eq. 14) into a view representation of the future environment. We then aggregate all 12 future-view embeddings into $\boldsymbol{F}^{\text{future}}$ via average pooling and project them onto a future node. Finally, we use a feed-forward network (FFN) to predict navigation scores for the candidate nodes $\boldsymbol{F}^{\text{candidate}}$ and the future nodes $\boldsymbol{F}^{\text{future}}$ in the topological map, following previous methods [3, 51]. Note that the scores of visited nodes are masked to avoid unnecessary repeated visits. Based on these navigation goal scores, we select the navigation path with the maximum score, as

S^{\text{path}}=\texttt{Max}\big([\texttt{FFN}(\boldsymbol{F}^{\text{candidate}}),\ \texttt{FFN}(\boldsymbol{F}^{\text{future}})]\big). \qquad (15)
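One plausible reading of the scoring step in Eq. (15) is sketched below: a shared FFN scores the candidate and future nodes, visited nodes are masked out, and the highest-scoring node is selected. The FFN width and the masking convention are assumptions.

```python
import torch
import torch.nn as nn

ffn = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 1))

def select_goal(cand_emb, future_emb, visited_mask):
    # cand_emb: (Nc, 512) candidate-node embeddings; future_emb: (Nf, 512) future-node embeddings;
    # visited_mask: (Nc,) bool, True for already-visited candidate nodes.
    scores = torch.cat([ffn(cand_emb), ffn(future_emb)], dim=0).squeeze(-1)
    visited = torch.cat([visited_mask, torch.zeros(future_emb.size(0), dtype=torch.bool)])
    scores = scores.masked_fill(visited, float("-inf"))    # visited nodes cannot be re-selected
    return scores.argmax()                                 # Eq. (15): node with the maximum score

goal = select_goal(torch.randn(5, 512), torch.randn(5, 512), torch.zeros(5, dtype=torch.bool))
```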

3.4 Objective Function

UnitedVLN has two objectives, one per stage of VLN-CE: the first targets better rendering quality of images and features (cf. Eq. 12) in the pre-training stage, and the second targets better navigation performance (cf. Eq. 15) in the training stage. Please see the supplementary material for details of the objective losses.

4 Experiment

  Methods | Val Seen (NE↓ OSR↑ SR↑ SPL↑) | Val Unseen (NE↓ OSR↑ SR↑ SPL↑) | Test Unseen (NE↓ OSR↑ SR↑ SPL↑)
CM2 [14] | 6.10 51 43 35 | 7.02 42 34 28 | 7.70 39 31 24
WS-MGMap [6] | 5.65 52 47 43 | 6.28 48 39 34 | 7.11 45 35 28
Sim-2-Sim [25] | 4.67 61 52 44 | 6.07 52 43 36 | 6.17 52 44 37
ERG [48] | 5.04 61 46 42 | 6.20 48 39 35 | - - - -
CWP-CMA [17] | 5.20 61 51 45 | 6.20 52 41 36 | 6.30 49 38 33
CWP-RecBERT [17] | 5.02 59 50 44 | 5.74 53 44 39 | 5.89 51 42 36
GridMM [50] | 4.21 69 59 51 | 5.11 61 49 41 | 5.64 56 46 39
Reborn [1] | 4.34 67 59 56 | 5.40 57 50 46 | 5.55 57 49 45
Ego2-Map [18] | - - - - | 4.94 - 52 46 | 5.54 56 47 41
Dreamwalker [46] | 4.09 66 59 48 | 5.53 59 49 44 | 5.48 57 49 44
Energy [31] | 3.90 73 68 59 | 4.69 65 58 50 | 5.08 64 56 48
BEVBert [2] | - - - - | 4.57 67 59 50 | 4.70 67 59 50
ETPNav [3] | 3.95 72 66 59 | 4.71 65 57 49 | 5.12 63 55 48
HNR [51] | 3.67 76 69 61 | 4.42 67 61 51 | 4.81 67 58 50
UnitedVLN (Ours) | 3.30 78 70 62 | 4.29 70 62 51 | 4.69 68 59 49
Table 1: Evaluation on the R2R-CE dataset.
  Methods | Val Seen (NE↓ SR↑ SPL↑ NDTW↑ SDTW↑) | Val Unseen (NE↓ SR↑ SPL↑ NDTW↑ SDTW↑)
CWP-CMA [17] | - - - - - | 8.76 26.6 22.2 47.0 -
CWP-RecBERT [17] | - - - - - | 8.98 27.1 22.7 46.7 -
Reborn [1] | 5.69 52.4 45.5 66.3 44.5 | 5.98 48.6 42.1 63.4 41.8
ETPNav [3] | 5.03 61.5 50.8 66.4 51.3 | 5.64 54.8 44.9 61.9 45.3
HNR [51] | 4.85 63.7 53.2 68.8 52.8 | 5.51 56.4 46.7 63.6 47.2
UnitedVLN (Ours) | 4.71 64.9 53.8 69.9 53.5 | 5.49 57.7 47.1 64.0 47.8
Table 2: Evaluation on the RxR-CE dataset.

4.1 Datasets and Evaluation Metrics

Datasets.

To improve the rendered quality of images and features, we first pre-train the proposed 3DGS-based UnitedVLN on the large-scale indoor HM-3D dataset. Following the practice of prior VLN-CE works [17, 46, 3], we evaluate our UnitedVLN on two public VLN-CE benchmarks, i.e., R2R-CE [26] and RxR-CE [27]. Please see the supplementary material for more details about the datasets.

Evaluation Metrics.

Following standard protocols in previous methods [51, 3, 46], we use several VLN-CE metrics to evaluate the navigation performance of our UnitedVLN: Navigation Error (NE), Success Rate (SR), Oracle Success Rate (OSR), Success weighted by Path Length (SPL), Normalized Dynamic Time Warping (nDTW), and Success weighted by normalized Dynamic Time Warping (SDTW).

  Methods NE↓ OSR↑ SR↑ SPL↑
A1 (Base) 4.73 64.9 57.6 46.5
A2 (Base + STQ + NeRF Rendering) 4.45 67.8 61.4 49.9
A3 (Base + STQ + 3DGS Rendering) 4.37 68.4 61.9 50.6
A4 (Base + STU) 4.31 69.3 62.0 50.4
A5 (Base + STQ + STU) 4.29 70.0 62.4 51.1
Table 3: Ablation study of each component of UnitedVLN model.

4.2 Implementation Details

Our UnitedVLN adopts a pre-training-then-fine-tuning paradigm: it first pre-trains a generalizable 3DGS to regress future observations and then applies it, in inference mode, to the agent for VLN-CE. Please see the supplementary material for the pre-training and training settings.

4.3 Comparison to State-of-the-Art Methods

Tables 1 and 2 show the performance of UnitedVLN compared with state-of-the-art methods on the R2R-CE and RxR-CE benchmarks, respectively. Overall, UnitedVLN achieves SOTA results on the majority of metrics, proving its effectiveness from diverse perspectives. As shown in Table 1, on the R2R-CE dataset, our method outperforms the SOTA method (i.e., HNR [51]) by +1% on SR and +3% on OSR for the val unseen split, and by +1% on SR with a 0.12 reduction in NE for the test unseen split. Meanwhile, as illustrated in Table 2, the proposed method also improves the metrics on the RxR-CE dataset.

Specifically, compared with Dreamwalker [46], which shares part of UnitedVLN's idea by predicting future visual images (Table 1), our UnitedVLN model achieves performance gains of about 11% on SR across all splits. UnitedVLN supplements the future environment with high-level semantic information, which is superior to Dreamwalker's reliance on visual images alone. Compared with HNR, which relies on future environment features but lacks intuitive appearance-level information, we still obtain higher SR and OSR on the val unseen split. This demonstrates that UnitedVLN can effectively improve navigation performance by uniting future appearance-level information (visual images) and high-level semantics (features).

Figure 3: Visualization example of navigation strategy on the val unseen split of the R2R-CE dataset. (a) denotes the navigation strategy of the baseline model. (b) denotes the RGB-united-Feature exploration strategy of our UnitedVLN.

4.4 Ablation Study

We conduct extensive ablation experiments to validate the key designs of UnitedVLN. Results are reported on the R2R-CE val unseen split, which features more complex environments and higher difficulty; the best results are highlighted.

  Methods NE↓ OSR↑ SR↑ SPL↑
B1 (ETPNav) 4.71 64.8 57.2 49.2
B2 (UnitedVLN$_{\text{ETPNav}}$) 4.25 67.6 58.9 49.8
B3 (HNR) 4.42 67.4 60.7 51.3
B4 (UnitedVLN$_{\text{HNR}}$) 4.29 70.1 62.4 51.1
Table 4: Generalization analysis of UnitedVLN among different VLN-CE models.

Effect on each component of UnitedVLN.

In this part, we analyze the effect of each component of UnitedVLN. As illustrated in Table 3, A1 (Base) denotes the performance of the baseline model. Compared with A1, A2 improves when equipped with future environmental features via NeRF rendering. Compared with A1, A3 also gains navigation performance when equipped with 3DGS rendering (Eq. 13), which validates the effectiveness of learning appearance-level information (visual images). Compared with A5, the performance of A4 (Base + STU) decreases when the point sampling of STQ is removed, which shows that STQ effectively enhances navigation by performing efficient neural point sampling. From the results of A5, combining features from NeRF rendering and 3DGS rendering (Section 3.3) with the efficient sampling of STQ further improves results and achieves the best performance, e.g., +5.1% on OSR. To sum up, all components of UnitedVLN jointly improve VLN-CE performance.

  Methods Pre-train stage Train stage
C1 (HNR) 2.21s (0.452Hz) 2.11s (0.473Hz)
C2 (UnitedVLN) 0.036s (27.78Hz) 0.032s (31.25Hz)
Table 5: Runtime analysis measured on one NVIDIA RTX 3090.
  Extractors NE↓ OSR↑ SR↑ SPL↑
D1 (ViT-B/16-ImageNet [9]) 4.37 69.8 61.5 50.6
D2 (ViT-B/16-CLIP [39]) 4.29 70.0 62.4 51.1
Table 6: Ablation study of the different extractors of 3DGS branch.

Effect on numbers of K-nearest on point sampling.

Figure 5 shows the effect of point sampling with different numbers of K-nearest features on SR accuracy. We investigate K $\in\{1,8,16,32\}$; performance stabilizes from K = 8 and peaks at K = 16. We therefore set K = 16 in NeRF/3DGS sampling. When K is set smaller or larger than 16, navigation accuracy decreases slightly. Nevertheless, a moderately larger or smaller value is acceptable, since contextual information is aggregated by K-nearest sampling.

Figure 4: Visualization example of RGB reconstruction for candidate locations using the UnitedVLN model. “GT” and “Pred” denote ground-truth images and rendered images by our pre-training method, respectively.

Effect on generalizability to other VLN-CE model.

Table 4 shows how our proposed 3DGS-based paradigm generalizes to two recent, representative VLN-CE models, i.e., ETPNav [3] and HNR [51], when training the agent on the VLN-CE task. Compared with B1, our B2 (UnitedVLN$_{\text{ETPNav}}$), which augments the future environment with visual images and semantic features, achieves significant performance gains on all metrics. Similarly, navigation performance also improves when generalizing our paradigm to the HNR model. This shows that the proposed 3DGS-based paradigm generalizes to other VLN-CE models and brings them performance gains.

  # $\mathcal{L}_{1}^{r}$ $\mathcal{L}_{2}^{r}$ $\mathcal{L}_{ssim}^{r}$ $\mathcal{L}_{2}^{f}$ NE↓ OSR↑ SR↑ SPL↑
E1: 4.68 67.9 60.4 47.9
E2: 4.59 67.7 60.5 47.6
E3: 4.47 68.2 60.7 48.1
E4: 4.33 68.7 61.6 50.6
E5: 4.29 70.0 62.4 51.1
Table 7: Ablation study of different losses in the UnitedVLN model.
Figure 5: Ablation study of numbers of K in NeRF and 3DGS.

Effect on speed of image rendering.

Table 5 compares the image rendering speed of our UnitedVLN with HNR (the SOTA model) in the pre-training and training stages. For a fair comparison, we fix the same experimental setting as HNR. As shown in Table 5, UnitedVLN renders about 63× faster than HNR in both stages: our 0.036s (27.78Hz) vs. HNR's 2.21s (0.452Hz) in the pre-training stage, and our 0.032s (31.25Hz) vs. HNR's 2.11s (0.473Hz) in the training stage. This shows that UnitedVLN achieves better visual representations with much faster rendering.

Effect of different feature extractors on rendering.

Table 6 compares the performance of UnitedVLN with different feature extractors used to encode the rendered image and feature map (cf. Eq. 12) into feature embeddings. Here, D1 (ViT-B/16-ImageNet) and D2 (ViT-B/16-CLIP) denote extractors pre-trained on different data, i.e., ImageNet [9] and CLIP's image-text corpus [39]. As shown in Table 6, D2 achieves better performance than D1. The gain of D2 may come from CLIP encoding richer semantics through large-scale image-text matching, whereas ImageNet pre-training covers less diverse visual concepts. Thus, we use ViT-B/16-CLIP as the feature extractor, enhancing the semantics of the navigation representation.

Effect of multi-loss on rendering.

Table 7 shows the performance of the proposed UnitedVLN pre-trained with different losses, i.e., $\mathcal{L}_{1}^{r}$, $\mathcal{L}_{2}^{r}$, $\mathcal{L}_{ssim}^{r}$, and $\mathcal{L}_{2}^{f}$. Among them, $\mathcal{L}_{1}^{r}$, $\mathcal{L}_{2}^{r}$, and $\mathcal{L}_{ssim}^{r}$ are adopted to optimize the rendered RGB images for color and geometry, while $\mathcal{L}_{2}^{f}$ is for feature similarity. Here, E1 - E5 denote compositions of different loss functions; E5 achieves the best performance because it jointly optimizes RGB images for better color and geometric structure and features for semantics.

4.5 Qualitative Analysis

Visualization Example of rendering.

To validate the effect of pre-training UnitedVLN on rendered image quality, we visualize several 360° panoramic observations surrounding the current location (i.e., 12 view images with 30° separation each), comparing each rendered view against the corresponding ground-truth image. As shown in Figure 4, the rendered images not only reconstruct the colors and geometry of the real images but also capture fine material details (e.g., reflections on a smooth wooden floor). This demonstrates the effect of pre-training UnitedVLN for producing high-quality images.

Visualization Example of Navigation.

To validate the effect of UnitedVLN on navigation, we compare the navigation strategies of the baseline model (a revised ETPNav) and our UnitedVLN. We also report the navigation score of each node for clarity. As shown in Figure 3, the baseline model makes navigation errors because it relies only on limited observations at the current waypoint, whereas our UnitedVLN makes correct decisions thanks to full future exploration that aggregates intuitive appearance and complex semantic information. This demonstrates the benefit of RGB-united-feature future representations for improving VLN-CE performance.

5 Conclusion and Discussion

Conclusion.

We introduce UnitedVLN, a generalizable 3DGS-based pre-training paradigm for improving Continuous Vision-and-Language Navigation (VLN-CE). It pursues full future environment representations by simultaneously rendering high-quality 360° visual images and semantic features from sparse neural points. UnitedVLN relies on two insightful schemes, i.e., the Search-Then-Query (STQ) sampling scheme and the Separate-Then-United (STU) rendering scheme. To improve model efficiency, STQ searches for each point only within its neighborhood and queries its K-nearest points. To improve model robustness, STU aggregates appearance via splatting in 3DGS and semantic information via volume rendering in NeRF for robust navigation. Extensive experiments on two VLN-CE benchmarks demonstrate that UnitedVLN significantly outperforms state-of-the-art models.

Discussion.

Some recent work (e.g., HNR [51]) shares the same spirit of rendering future environments, but there are distinct differences: 1) different paradigms (3DGS vs. NeRF); 2) different future explorations (RGB-united-feature vs. feature only), where continuously predicting images in a form humans can understand makes the agent's behavior at each step more interpretable; 3) different scalability (efficient-and-fast rendering vs. inefficient-and-slow rendering). The proposed UnitedVLN framework is a new paradigm that focuses on full future environment representation rather than relying only on single, limited features or images. Both STQ and STU are plug-and-play modules for efficient sampling and rendering, which matches our intention of contributing practical modules to embodied AI. Finally, we hope this paradigm encourages further investigation of the idea of “dreaming future and doing now”.

References

  • An et al. [2022] Dong An, Zun Wang, Yangguang Li, Yi Wang, Yicong Hong, Yan Huang, Liang Wang, and Jing Shao. 1st place solutions for rxr-habitat vision-and-language navigation competition. arXiv preprint arXiv:2206.11610, 2022.
  • An et al. [2023a] Dong An, Yuankai Qi, Yangguang Li, Yan Huang, Liang Wang, Tieniu Tan, and Jing Shao. Bevbert: Multimodal map pre-training for language-guided navigation. In ICCV, pages 2737–2748, 2023a.
  • An et al. [2023b] Dong An, Hanqing Wang, Wenguan Wang, Zun Wang, Yan Huang, Keji He, and Liang Wang. Etpnav: Evolving topological planning for vision-language navigation in continuous environments. arXiv preprint arXiv:2304.03047, 2023b.
  • Anderson et al. [2018] Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton Van Den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In CVPR, pages 3674–3683, 2018.
  • Chang et al. [2017] Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments. In 3DV, pages 667–676, 2017.
  • Chen et al. [2022a] Peihao Chen, Dongyu Ji, Kunyang Lin, Runhao Zeng, Thomas H Li, Mingkui Tan, and Chuang Gan. Weakly-supervised multi-granularity map learning for vision-and-language navigation. In Advances in Neural Information Processing Systems, pages 38149–38161, 2022a.
  • Chen et al. [2021] Shizhe Chen, Pierre-Louis Guhur, Cordelia Schmid, and Ivan Laptev. History aware multimodal transformer for vision-and-language navigation. In Advances in Neural Information Processing Systems, pages 5834–5847, 2021.
  • Chen et al. [2022b] Shizhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, and Ivan Laptev. Think global, act local: Dual-scale graph transformer for vision-and-language navigation. In CVPR, pages 16537–16547, 2022b.
  • Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009.
  • DeVries et al. [2021] Terrance DeVries, Miguel Angel Bautista, Nitish Srivastava, Graham W Taylor, and Joshua M Susskind. Unconstrained scene generation with locally conditioned radiance fields. In ICCV, pages 14304–14313, 2021.
  • Feng et al. [2022] Weixi Feng, Tsu-Jui Fu, Yujie Lu, and William Yang Wang. Uln: Towards underspecified vision-and-language navigation. arXiv preprint arXiv:2210.10020, 2022.
  • Forrester [1971] Jay W Forrester. Counterintuitive behavior of social systems. Theory and Decision, 2(2):109–140, 1971.
  • Fried et al. [2018] Daniel Fried, Ronghang Hu, Volkan Cirik, Anna Rohrbach, Jacob Andreas, Louis-Philippe Morency, Taylor Berg-Kirkpatrick, Kate Saenko, Dan Klein, and Trevor Darrell. Speaker-follower models for vision-and-language navigation. In Advances in Neural Information Processing Systems, 2018.
  • Georgakis et al. [2022] Georgios Georgakis, Karl Schmeckpeper, Karan Wanchoo, Soham Dan, Eleni Miltsakaki, Dan Roth, and Kostas Daniilidis. Cross-modal map learning for vision and language navigation. In CVPR, 2022.
  • Grandits et al. [2021] Thomas Grandits, Alexander Effland, Thomas Pock, Rolf Krause, Gernot Plank, and Simone Pezzuto. GEASI: Geodesic-based earliest activation sites identification in cardiac models. International Journal for Numerical Methods in Biomedical Engineering., 37(8):e3505, 2021.
  • Hong et al. [2021] Yicong Hong, Qi Wu, Yuankai Qi, Cristian Rodriguez-Opazo, and Stephen Gould. Vln bert: A recurrent vision-and-language bert for navigation. In CVPR, pages 1643–1653, 2021.
  • Hong et al. [2022] Yicong Hong, Zun Wang, Qi Wu, and Stephen Gould. Bridging the gap between learning in discrete and continuous environments for vision-and-language navigation. In CVPR, 2022.
  • Hong et al. [2023] Yicong Hong, Yang Zhou, Ruiyi Zhang, Franck Dernoncourt, Trung Bui, Stephen Gould, and Hao Tan. Learning navigational visual representations with semantic map supervision. In ICCV, pages 3055–3067, 2023.
  • Huang et al. [2023] Chenguang Huang, Oier Mees, Andy Zeng, and Wolfram Burgard. Visual language maps for robot navigation. In ICRA, London, UK, 2023.
  • Isola et al. [2017] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, pages 5967–5976, 2017.
  • Johnson-Laird [1983] Philip Nicholas Johnson-Laird. Mental models: Towards a cognitive science of language, inference, and consciousness. Harvard University Press, 1983.
  • Johnson-Laird [2010] Philip N Johnson-Laird. Mental models and human reasoning. Proceedings of the National Academy of Sciences, 2010.
  • Kerbl et al. [2023] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. TOG, 42(4):1–14, 2023.
  • Koh et al. [2021] Jing Yu Koh, Honglak Lee, Yinfei Yang, Jason Baldridge, and Peter Anderson. Pathdreamer: A world model for indoor navigation. In ICCV, pages 14738–14748, 2021.
  • Krantz and Lee [2022] Jacob Krantz and Stefan Lee. Sim-2-sim transfer for vision-and-language navigation in continuous environments. In ECCV, 2022.
  • Krantz et al. [2020] Jacob Krantz, Erik Wijmans, Arjun Majumdar, Dhruv Batra, and Stefan Lee. Beyond the nav-graph: Vision-and-language navigation in continuous environments. In ECCV, 2020.
  • Ku et al. [2020] Alexander Ku, Peter Anderson, Roma Patel, Eugene Ie, and Jason Baldridge. Room-across-room: Multilingual vision-and-language navigation with dense spatiotemporal grounding. In EMNLP, pages 4392–4412, 2020.
  • Kwon et al. [2023] Obin Kwon, Jeongho Park, and Songhwai Oh. Renderable neural radiance map for visual navigation. In CVPR, pages 9099–9108, 2023.
  • Li and Bansal [2023a] Jialu Li and Mohit Bansal. Improving vision-and-language navigation by generating future-view image semantics. In CVPR, pages 10803–10812, 2023a.
  • Li and Bansal [2023b] Jialu Li and Mohit Bansal. Improving vision-and-language navigation by generating future-view image semantics. In CVPR, pages 10803–10812, 2023b.
  • Liu et al. [2024a] Rui Liu, Wenguan Wang, and Yi Yang. Vision-language navigation with energy-based policy. Advances in Neural Information Processing Systems, 37:108208–108230, 2024a.
  • Liu et al. [2024b] Tianqi Liu, Guangcong Wang, Shoukang Hu, Liao Shen, Xinyi Ye, Yuhang Zang, Zhiguo Cao, Wei Li, and Ziwei Liu. Mvsgaussian: Fast generalizable gaussian splatting reconstruction from multi-view stereo. In ECCV, 2024b.
  • Liu et al. [2019] Zhijian Liu, Haotian Tang, Yujun Lin, and Song Han. Point-voxel cnn for efficient 3d deep learning. Advances in Neural Information Processing Systems, 32, 2019.
  • Mildenhall et al. [2021] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. CACM, 65:99–106, 2021.
  • Park et al. [2021] Keunhong Park, Utkarsh Sinha, Jonathan T Barron, Sofien Bouaziz, Dan B Goldman, Steven M Seitz, and Ricardo Martin-Brualla. Nerfies: Deformable neural radiance fields. In CVPR, pages 5865–5874, 2021.
  • Pashevich et al. [2021] Alexander Pashevich, Cordelia Schmid, and Chen Sun. Episodic transformer for vision-and-language navigation. In ICCV, 2021.
  • Qi et al. [2017] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Advances in neural information processing systems, 30, 2017.
  • Qi et al. [2020] Yuankai Qi, Qi Wu, Peter Anderson, Xin Wang, William Yang Wang, Chunhua Shen, and Anton van den Hengel. Reverie: Remote embodied visual referring expression in real indoor environments. In CVPR, pages 9982–9991, 2020.
  • Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763, 2021.
  • Ramakrishnan et al. [2021] Santhosh K Ramakrishnan, Aaron Gokaslan, Erik Wijmans, Oleksandr Maksymets, Alex Clegg, John Turner, Eric Undersander, Wojciech Galuba, Andrew Westbury, Angel X Chang, et al. Habitat-matterport 3d dataset (hm3d): 1000 large-scale 3d environments for embodied ai. arXiv preprint arXiv:2109.08238, 2021.
  • Ramesh et al. [2021] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In CoRL, pages 8821–8831, 2021.
  • Tan et al. [2019] Hao Tan, Licheng Yu, and Mohit Bansal. Learning to navigate unseen environments: Back translation with environmental dropout. In NAACL, pages 2610–2621, 2019.
  • Thomason et al. [2020] Jesse Thomason, Michael Murray, Maya Cakmak, and Luke Zettlemoyer. Vision-and-dialog navigation. In PMLR, 2020.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 33:5998–6008, 2017.
  • Wang et al. [2020] Hanqing Wang, Wenguan Wang, Tianmin Shu, Wei Liang, and Jianbing Shen. Active visual information gathering for vision-language navigation. In ECCV, pages 307–322, 2020.
  • Wang et al. [2023a] Hanqing Wang, Wei Liang, Luc Van Gool, and Wenguan Wang. Dreamwalker: Mental planning for continuous vision-language navigation. In ICCV, pages 10873–10883, 2023a.
  • Wang et al. [2024a] Jiaxu Wang, Ziyi Zhang, Junhao He, and Renjing Xu. Pfgs: High fidelity point cloud rendering via feature splatting. arXiv preprint arXiv:2407.03857, 2024a.
  • Wang et al. [2023b] Ting Wang, Zongkai Wu, Feiyu Yao, and Donglin Wang. Graph based environment representation for vision-and-language navigation in continuous environments. arXiv preprint arXiv:2301.04352, 2023b.
  • Wang et al. [2019] Xin Wang, Qiuyuan Huang, Asli Celikyilmaz, Jianfeng Gao, Dinghan Shen, Yuan-Fang Wang, William Yang Wang, and Lei Zhang. Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation. In CVPR, pages 6629–6638, 2019.
  • Wang et al. [2023c] Zihan Wang, Xiangyang Li, Jiahao Yang, Yeqi Liu, and Shuqiang Jiang. Gridmm: Grid memory map for vision-and-language navigation. In ICCV, pages 15625–15636, 2023c.
  • Wang et al. [2024b] Zihan Wang, Xiangyang Li, Jiahao Yang, Yeqi Liu, Junjie Hu, Ming Jiang, and Shuqiang Jiang. Lookahead exploration with neural radiance representation for continuous vision-language navigation. In CVPR, pages 13753–13762, 2024b.
  • Wang et al. [2024c] Zihan Wang, Xiangyang Li, Jiahao Yang, Yeqi Liu, and Shuqiang Jiang. Sim-to-real transfer via 3d feature fields for vision-and-language navigation. In Conference on Robot Learning (CoRL), 2024c.
  • Xia and Xue [2023] Weihao Xia and Jing-Hao Xue. A survey on deep generative 3d-aware image synthesis. In ACM Comput. Surv., pages 1–34, 2023.
  • Zhu et al. [2021] Fengda Zhu, Xiwen Liang, Yi Zhu, Qizhi Yu, Xiaojun Chang, and Xiaodan Liang. Soon: Scenario oriented object navigation with graph-based exploration. In CVPR, pages 12689–12699, 2021.

Supplementary Material

This document provides more details of our method, experimental details and visualization examples, which are organized as follows:

  • Model Details (cf. § A);

  • Experiment Details (cf. § B);

  • Visualization Example (cf. § C);

  • Discussion Analysis (cf. § D).

The anonymous code of UnitedVLN is available: https://anonymous.4open.science/r/UnitedVLN-08B6/

Appendix A Model Details

A.1 Details of Feature Cloud.

Feature cloud $\mathcal{M}$ focuses on fine-grained contexts of observed environments, consisting of neural-point positions and grid features of feature maps. Following HNR [51], we use a pre-trained CLIP-ViT-B [39] model to extract grid features $\{\boldsymbol{h}_{t,i}\in\mathbb{R}^{H^{\prime}\times W^{\prime}\times D^{\prime}}\}_{i=1}^{12}$ from the 12 observed RGB images $\mathcal{R}_{t}=\{r_{t,i}\}_{i=1}^{12}$ at time step $t$. Here, $H^{\prime}\times W^{\prime}\times D^{\prime}$ denotes the resolution of the feature maps extracted by CLIP. For brevity, we collapse the subscripts $(i,h^{\prime},w^{\prime})$ into a single index $u$, where $u$ ranges from 1 to $U$ and $U=12\cdot H^{\prime}\cdot W^{\prime}$. Then, each grid feature $\boldsymbol{g}_{t,u}\in\mathbb{R}^{D}$ in $\mathcal{M}$ is mapped to its 3D world position $\boldsymbol{Q}_{t,u}=[q_{x},q_{y},q_{z}]$ following the mapping in Eq. 1. Besides the horizontal orientation $\boldsymbol{\theta}_{t,u}$, we also calculate the grid feature scale $\boldsymbol{s}_{t,u}$ using the camera's horizontal field-of-view $\Theta_{HFOV}$, as follows:

\boldsymbol{s}_{t,u}=\frac{1}{W^{\prime}}\left[\tan(\Theta_{HFOV}/2)\cdot\boldsymbol{d}_{t,u}\right], \qquad (16)

where $W^{\prime}$ is the width of the feature maps extracted by CLIP-ViT-B for each image. In this way, all grid features and their spatial positions are perceived in the feature cloud $\mathcal{M}$:

\mathcal{M}_{t}=\mathcal{M}_{t-1}\cup\{[\boldsymbol{Q}_{t,u},\boldsymbol{H}_{t,u},\boldsymbol{\theta}_{t,u},\boldsymbol{s}_{t,u}]\}_{u=1}^{U}. \qquad (17)
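A minimal sketch of Eq. (16) is shown below, computing the grid-feature scale from the depth, the horizontal field of view, and the feature-map width $W^{\prime}$; the numeric values are illustrative.

```python
import math

def feature_scale(depth, hfov_deg=90.0, W_prime=14):
    # Eq. (16): s_{t,u} = (1 / W') * tan(HFOV / 2) * d_{t,u}
    return (1.0 / W_prime) * math.tan(math.radians(hfov_deg) / 2.0) * depth

s = feature_scale(depth=2.5)   # scale of a grid feature observed 2.5 m away
```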

A.2 Details of Pcd U-Net.

As shown in Fig. 6, we utilize a multi-input, single-output UNet-like architecture (Pcd U-Net) to encode points of the point cloud at different scales and obtain neural descriptors. Specifically, the UNet-like architecture has three set abstractions (SA) and three feature propagations (FP), including multilayer perceptrons (MLP), point-voxel convolutions (PVC) [33], Grouper blocks (in SA) [37], and nearest-neighbor interpolation (NNI). Through these modules, we downsample the original point cloud at decreasing rates and feed the downsampled point clouds, concatenated with different levels of feature maps, as extra inputs. Thus, the neural descriptors $\boldsymbol{F}_{p}^{\prime}$ can be formulated as

\boldsymbol{F}_{p}^{\prime}=\mathcal{U}\big((\boldsymbol{P}^{\prime},\boldsymbol{C}^{\prime}),\ (\boldsymbol{P}^{\prime},\boldsymbol{C}^{\prime})_{\downarrow r_{1}},\ (\boldsymbol{P}^{\prime},\boldsymbol{C}^{\prime})_{\downarrow r_{2}}\big), \qquad (18)

where $\mathcal{U}$ denotes the UNet-based extractor network for point clouds, and $\boldsymbol{P}^{\prime}$ and $\boldsymbol{C}^{\prime}$ denote the sampled point coordinates and colors (cf. Eq. 5). The $\downarrow$ represents a uniform downsampling operation and $r$ denotes the sampling rate, where $r_{1}>r_{2}$. This operation allows the extractor to identify features across various scales and receptive fields. The obtained single-output feature vector serves as the descriptor for all points.
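The sketch below illustrates the multi-input idea of Eq. (18): the colored points are uniformly downsampled at two rates $r_{1}>r_{2}$ and passed, together with the full-resolution cloud, to the point encoder. `PcdUNet` is only a placeholder for the SA/FP/PVC architecture of Fig. 6, and the rates are assumptions.

```python
import torch

def uniform_downsample(points, colors, rate):
    # Uniform stride sampling, a simple stand-in for the ↓r operation.
    idx = torch.arange(0, points.size(0), step=rate)
    return points[idx], colors[idx]

def multi_scale_inputs(P, C, r1=4, r2=2):
    # (P', C'), (P', C')_{↓r1}, (P', C')_{↓r2} with r1 > r2, as in Eq. (18).
    return [(P, C), uniform_downsample(P, C, r1), uniform_downsample(P, C, r2)]

inputs = multi_scale_inputs(torch.randn(8192, 3), torch.rand(8192, 3))
# descriptors F_p' = PcdUNet(inputs)   # placeholder call to the multi-input point encoder
```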

After that, given the obtained neural descriptors, coordinates, and colors, we use different heads to regress the corresponding Gaussian properties in a point-wise manner. As shown in Fig. 6, the image Gaussian regressor contains three independent heads (convolutions with corresponding activation functions) for the rotation quaternion ($\boldsymbol{R}$), scale factor ($\boldsymbol{S}$), and opacity ($\alpha$). Meanwhile, we use the neural descriptors $\boldsymbol{F}_{p}^{\prime}$ in place of colors to obtain the feature Gaussians, following general point-rendering practice [47]. Thus, the image Gaussians $\mathcal{G}_{\triangle}$ and feature Gaussians $\mathcal{G}_{\triangledown}$ in the 3DGS branch can be formulated as

\mathcal{G}_{\triangle},\ \mathcal{G}_{\triangledown}=\{\boldsymbol{R},\boldsymbol{S},\alpha,\boldsymbol{P}^{\prime},\boldsymbol{C}^{\prime}\},\ \{\boldsymbol{R},\boldsymbol{S},\alpha,\boldsymbol{P}^{\prime},\boldsymbol{F}_{p}^{\prime}\}. \qquad (19)

A.3 Details of Volume Rendering.

We set the K-nearest search radius $R$ to 1 meter, and the radius $\hat{R}$ of the sparse sampling strategy is also set to 1 meter. Each rendered ray is uniformly sampled from 0 to 10 meters, and the number of sampled points is set to 256.
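A minimal PyTorch sketch of the volume rendering on the NeRF branch under these settings is given below (uniform samples between 0 and 10 m); it composites per-sample feature radiance and density with the standard transmittance weights and is illustrative rather than the authors' implementation.

```python
import torch

def render_ray_feature(radiance, sigma, near=0.0, far=10.0):
    # radiance: (N, D) per-sample feature radiance; sigma: (N,) volume densities along one ray
    n = sigma.shape[0]
    t = torch.linspace(near, far, n)                                    # uniform samples along the ray
    delta = torch.cat([t[1:] - t[:-1], torch.tensor([1e10])])           # segment lengths
    alpha = 1.0 - torch.exp(-sigma * delta)                             # per-sample opacity
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha[:-1] + 1e-10]), dim=0)
    weights = alpha * trans                                             # T_i * alpha_i
    return (weights[:, None] * radiance).sum(dim=0)                     # rendered feature F_f

F_f = render_ray_feature(torch.randn(256, 512), torch.rand(256))        # 256 samples, as set above
```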

Figure 6: Architecture of Point cloud U-Net. It mainly consists of four stages: point cloud downsampling; set abstractions (SA); point-voxel convolution (PVC) and Gaussian properties prediction.

A.4 Details of Gaussian Splatting.

3DGS is an explicit representation approach for 3D scenes, parameterizing the scene as a set of 3D Gaussian primitives, where each 3D Gaussian is characterized by a full 3D covariance matrix Σ\Sigma, its center point xx, and the spherical harmonic (SHSH). The Gaussian’s mean value is expressed as:

G(x)=e12(x)TΣ1(x).G(x)=e^{-\frac{1}{2}(x)^{T}\Sigma^{-1}(x)}. (20)

To enable optimization via backpropagation, the covariance matrix Σ\Sigma could be decomposed into a rotation matrix (RR) and a scaling matrix (SS), as

Σ=RSSTRT.\Sigma=RSS^{T}R^{T}. (21)

Given the camera pose, the projection of the 3D Gaussians to the 2D image plane can be characterized by a view transform matrix (WW) and the Jacobian of the affine approximation of the projective transformation (JJ), as,

\Sigma^{\prime}=JW\Sigma W^{T}J^{T}, \quad (22)

where $\Sigma^{\prime}$ is the covariance matrix in the 2D image plane. Thus, the $\alpha$-blending of $\mathcal{N}$ ordered points overlapping a pixel is used to compute the final color $C$ of that pixel:

C=\sum_{i\in\mathcal{N}}c_{i}\alpha_{i}\prod_{j=1}^{i-1}(1-\alpha_{j}), \quad (23)

where $c_{i}$ and $\alpha_{i}$ denote the color and density of the $i$-th Gaussian evaluated at the pixel, given its corresponding Gaussian parameters.
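For reference, a minimal sketch of the compositing in Eq. (23): given the colors and $\alpha$ values of the Gaussians overlapping one pixel, sorted front-to-back, the pixel color is accumulated with the transmittance product $\prod_{j<i}(1-\alpha_{j})$. The inputs are illustrative placeholders.

```python
# Hedged sketch of the alpha-blending in Eq. (23) for a single pixel.
import torch

def alpha_blend(colors, alphas):
    """colors: (N, 3), alphas: (N,) for the N front-to-back sorted Gaussians."""
    transmittance = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alphas[:-1]]), dim=0)   # prod_{j<i}(1 - alpha_j)
    weights = alphas * transmittance                            # alpha_i * transmittance_i
    return (weights.unsqueeze(-1) * colors).sum(dim=0)          # final pixel color C

colors = torch.rand(8, 3)             # c_i of the overlapping Gaussians
alphas = torch.rand(8).clamp(0, 0.9)  # alpha_i after 2D projection
pixel = alpha_blend(colors, alphas)
```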

A.5 Objective Function

Corresponding to the two stages of VLN-CE, UnitedVLN has two objectives: one aims to achieve better rendering quality of images and features (cf. Eq. 12) in the pre-training stage, and the other aims for better navigation performance (cf. Eq. 15) in the training stage. The details are as follows.

Pre-training Loss.

For the rendered image (cf. Eq. 12), we use $\mathcal{L}_{1}^{r}$ and $\mathcal{L}_{2}^{r}$ losses to constrain the color similarity to the ground-truth image by computing the L1 and L2 distances in the pixel domain. In addition, we use an SSIM loss $\mathcal{L}_{ssim}^{r}$ to constrain the geometric structure between rendered and ground-truth images. For the aggregated rendered features (cf. Eq. 13 and Eq. 14), we use $\mathcal{L}_{2}^{f}$ to constrain the cosine similarity to the ground-truth features, which are extracted from the ground-truth image with the pre-trained image encoder of CLIP. Thus, the overall pre-training loss can be formulated as follows:

\mathcal{L}_{pretrain}=\mathcal{L}_{1}^{r}+\mathcal{L}_{2}^{r}+\mathcal{L}_{ssim}^{r}+\mathcal{L}_{2}^{f}. \quad (24)
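A compact sketch of how the four terms in Eq. (24) could be combined is given below; the SSIM term uses a simplified global-statistics variant purely for illustration, and the ground-truth CLIP features are assumed to be provided.

```python
# Illustrative combination of the pre-training losses in Eq. (24).
import torch
import torch.nn.functional as F

def ssim_loss(x, y, c1=0.01**2, c2=0.03**2):
    # Simplified global-statistics SSIM surrogate (the real loss is windowed).
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    ssim = ((2*mx*my + c1) * (2*cov + c2)) / ((mx**2 + my**2 + c1) * (vx + vy + c2))
    return 1.0 - ssim

def pretrain_loss(rendered_img, gt_img, rendered_feat, gt_feat):
    l1 = F.l1_loss(rendered_img, gt_img)                                       # L_1^r
    l2 = F.mse_loss(rendered_img, gt_img)                                      # L_2^r
    lssim = ssim_loss(rendered_img, gt_img)                                    # L_ssim^r
    lfeat = 1.0 - F.cosine_similarity(rendered_feat, gt_feat, dim=-1).mean()   # L_2^f
    return l1 + l2 + lssim + lfeat

img_pred, img_gt = torch.rand(3, 224, 224), torch.rand(3, 224, 224)
feat_pred, feat_gt = torch.rand(1, 768), torch.rand(1, 768)
loss = pretrain_loss(img_pred, img_gt, feat_pred, feat_gt)
```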

Training Loss.

For training the VLN-CE agent, we use the pre-trained model from the previous stage, run in inference mode, to generate future visual images and features. Following the practice of previous methods [51, 52], with the soft target supervision $\mathcal{A}_{soft}$, the goal scores in navigation (cf. Eq. 15) are constrained by the cross-entropy (CE) loss:

\mathcal{L}_{nav}=\texttt{CE}(S^{path},\mathcal{A}_{soft}). \quad (25)
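A minimal sketch of Eq. (25), assuming the path scores and $\mathcal{A}_{soft}$ are per-node distributions of shape (batch, candidates); the soft-target cross-entropy is written out explicitly.

```python
# Cross-entropy with soft targets for the navigation goal scores (Eq. 25).
import torch
import torch.nn.functional as F

def nav_loss(path_scores, soft_targets):
    """path_scores: (B, K) unnormalized scores over K candidate nodes;
    soft_targets: (B, K) non-negative weights summing to 1 per row (A_soft)."""
    log_probs = F.log_softmax(path_scores, dim=-1)
    return -(soft_targets * log_probs).sum(dim=-1).mean()

scores = torch.randn(2, 5)
targets = torch.softmax(torch.randn(2, 5), dim=-1)   # stand-in for A_soft
loss = nav_loss(scores, targets)
```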
Figure 7: Visualization example of the navigation strategy on the val unseen split of the R2R-CE dataset. (a) The navigation strategy of ETPNav (baseline method). (b) The RGB-united-Feature exploration strategy of our UnitedVLN.

Appendix B Experimental Details

B.1 Datasets

To improve the rendering quality of images and features, we first pre-train the proposed 3DGS-based UnitedVLN on the large-scale indoor HM-3D dataset. Following the practice of prior VLN-CE works [17, 46, 3], we evaluate UnitedVLN on two public VLN-CE benchmarks, i.e., R2R-CE [26] and RxR-CE [27].

R2R-CE [26] is derived from the discrete Matterport3D environments [5] based on the Habitat simulator [40], enabling the VLN agent to navigate in continuous environments. The horizontal field-of-view and the turning angle are $90^{\circ}$ and $15^{\circ}$, respectively. It provides step-by-step language instructions, with an average instruction length of 32 words.

RxR-CE [27] is an extensive multilingual VLN dataset containing 126K instructions across English, Hindi, and Telugu. The dataset includes diverse trajectory lengths, averaging 15 meters, making navigation in continuous environments more challenging. In this dataset, the horizontal field-of-view and the turning angle are $79^{\circ}$ and $30^{\circ}$, respectively.

B.2 Settings of Pre-training.

The UnitedVLN model is pre-trained on the large-scale HM3D [40] dataset with 800 training scenes. Specifically, we randomly select a starting location in a scene and randomly move to a navigable candidate location at each step. At each step, up to 3 unvisited candidate locations are randomly picked, for which we predict a future view at a random horizontal orientation, rendering semantic features via NeRF and an image/feature map via 3DGS.
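The episode sampling above could look roughly like the sketch below, where the navigation graph and the rendering call are placeholders rather than the actual HM3D/Habitat interface.

```python
# Illustrative pseudo-sampler for the pre-training episodes: random start,
# random moves along navigable candidates, and up to three unvisited
# candidates "rendered" per step. The graph and render call are stand-ins.
import random

MAX_STEPS, MAX_CANDIDATES = 15, 3

def render_future_view(node, heading):
    """Placeholder for NeRF feature / 3DGS image+feature rendering."""
    pass

def sample_episode(nav_graph):
    """nav_graph: dict mapping a node id to its navigable candidate nodes."""
    node = random.choice(list(nav_graph))
    visited = {node}
    for _ in range(MAX_STEPS):
        candidates = [c for c in nav_graph[node] if c not in visited]
        if not candidates:
            break
        for cand in random.sample(candidates, min(MAX_CANDIDATES, len(candidates))):
            render_future_view(cand, heading=random.uniform(0.0, 360.0))
        node = random.choice(candidates)   # move to a navigable candidate
        visited.add(node)

toy_graph = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2]}
sample_episode(toy_graph)
```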

Figure 8: Visualization example of the navigation strategy on the val unseen split of the R2R-CE dataset. (a) The navigation strategy of HNR (SOTA method). (b) The RGB-united-Feature exploration strategy of our UnitedVLN.

On the 3DGS branch, the resolutions of the rendered image and feature map are $224\times 224\times 3$ and $224\times 224\times 768$, respectively; they are then fed to the visual encoder of CLIP-ViT-B/16 [39] to extract the corresponding feature embeddings, i.e., $\boldsymbol{f}_{g}^{r}\in\mathbb{R}^{1\times 768}$ and $\boldsymbol{f}_{g}^{f}\in\mathbb{R}^{1\times 768}$. Similarly, we use the Transformer [44] encoder $\varphi_{\text{f}}$ to extract the feature embedding $\boldsymbol{f}_{n}\in\mathbb{R}^{1\times 768}$ in NeRF. During pre-training, the horizontal field-of-view of each view is set to $90^{\circ}$. The maximum number of action steps per episode is set to 15. Using 8 Nvidia Tesla A800 GPUs, the UnitedVLN model is pre-trained with a batch size of 4 and a learning rate of 1e-4 for 20k episodes.
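For illustration, a rendered RGB view could be encoded into the $1\times 768$ embedding as in the sketch below, assuming the HuggingFace transformers implementation of the CLIP ViT-B/16 vision encoder and its 768-d pooled output; the paper's exact encoder wrapper may differ.

```python
# Encoding a rendered 224x224 RGB view into a (1, 768) CLIP embedding,
# assuming the HuggingFace `transformers` CLIP vision encoder.
import numpy as np
import torch
from transformers import CLIPImageProcessor, CLIPVisionModel

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch16")
encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch16").eval()

# stand-in for a rendered image from the 3DGS branch
rendered_rgb = (np.random.rand(224, 224, 3) * 255).astype(np.uint8)
inputs = processor(images=rendered_rgb, return_tensors="pt")
with torch.no_grad():
    f_g_r = encoder(**inputs).pooler_output   # (1, 768) image embedding
```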

B.3 Settings of the Training.

Settings of R2R-CE dataset.

For R2R-CE, we adopt a revised ETPNav [3] model as our baseline VLN model, initialized with the parameters of the ETPNav model trained on the R2R-CE dataset. The UnitedVLN model is trained for 20k episodes on 4 NVIDIA Tesla A800 GPUs, with a batch size of 8 and a learning rate of 1e-5.

Settings of RxR-CE dataset.

Similarly, for RxR-CE, the baseline VLN model is initialized with the parameters of the ETPNav [3] model trained on the RxR-CE dataset. The UnitedVLN model is trained for 100k episodes on 4 NVIDIA Tesla A800 GPUs, with a batch size of 8 and a learning rate of 1e-5.

Appendix C Visualization Example

Visualization Example of Navigation.

To validate the effect of UnitedVLN on navigation in continuous environments, we visually compare the navigation strategies of the baseline model (revised ETPNav) and our UnitedVLN, reporting the navigation score of each node for clarity, as shown in Figure 7. The baseline model makes a navigation error because it obtains only limited observations from a pre-trained waypoint model [17], whereas our UnitedVLN makes the correct navigation decision by fully exploring future environments, aggregating intuitive appearance and complex semantic information. In addition, we visualize a comparison of navigation strategies between HNR [51] (the SOTA method) and our UnitedVLN in Figure 8. Compared to HNR, which relies on limited future features, our UnitedVLN aggregates intuitive appearance and complex semantic information for decision-making and achieves correct navigation. This demonstrates the effectiveness of RGB-united-feature future representations in improving VLN-CE performance.

Appendix D Discussion Analysis

Limitations of 3DGS.

While 3DGS enables efficient rendering through view-space z-coordinate sorting, this sorting mechanism can introduce popping artifacts (sudden color changes) during camera rotations due to shifts in the relative depths of primitives. Additionally, switching between Gaussians may lead to a loss of fine details. In future work, we will explore more advanced Gaussian representations and optimization techniques to mitigate popping artifacts and Gaussian switching.

The role of rendered images.

In our method, it is possible to render semantic features directly with 3DGS or NeRF, but doing so yields suboptimal performance. Our design stems from three key findings: 1) although the predicted RGB images and feature maps are both eventually encoded into features, the features obtained by encoding RGB images and the directly rendered features are similar in semantics but differ in appearance details (RGB retains more complete appearance-level details than the high-level features rendered directly by 3DGS/NeRF); 2) our performance gain mainly derives from multi-source representation fusion (cf. Table 3, "Base + STQ + STU"), i.e., low-level RGB (in 3DGS), middle-level semantics (in 3DGS), and high-level semantics (in NeRF), which improves the agent's robustness in complicated environments; 3) the future visual images created by 3DGS explain the intention of UnitedVLN in a way that humans can understand, making the agent's behavior more interpretable at each decision-making step of navigation.