
N3-Mapping: Normal Guided Neural Non-Projective Signed Distance Fields for Large-scale 3D Mapping

Shuangfu Song1, Junqiao Zhao∗,2, Kai Huang1, Jiaye Lin2, Chen Ye2, Tiantian Feng1

Manuscript received: December 31, 2023; Revised: March 28, 2024; Accepted: April 27, 2024. This paper was recommended for publication by Editor Sven Behnke upon evaluation of the Associate Editor and Reviewers' comments. This work is supported by the National Key Research and Development Program of China (No. 2021YFB2501104). (Corresponding Author: Junqiao Zhao.)

1 Shuangfu Song, Kai Huang and Tiantian Feng are with the School of Surveying and Geo-Informatics, Tongji University, Shanghai 200092, China. E-mail: {songshuangfu, huangkai, fengtiantian}@tongji.edu.cn.

2 Junqiao Zhao, Jiaye Lin and Chen Ye are with the Department of Computer Science and Technology, School of Electronics and Information Engineering, Tongji University, Shanghai 201804, China. E-mail: {zhaojunqiao, jiayelin, yechen}@tongji.edu.cn.

Digital Object Identifier (DOI): see top of this page.
Abstract

Accurate and dense mapping in large-scale environments is essential for various robot applications. Recently, implicit neural signed distance fields (SDFs) have shown promising advances in this task. However, most existing approaches employ projective distances from range data as SDF supervision, introducing approximation errors and thus degrading the mapping quality. To address this problem, we introduce N3-Mapping, an implicit neural mapping system featuring normal-guided neural non-projective signed distance fields. Specifically, we directly sample points along the surface normal, instead of the ray, to obtain more accurate non-projective distance values from range data. Then these distance values are used as supervision to train the implicit map. For large-scale mapping, we apply a voxel-oriented sliding window mechanism to alleviate the forgetting issue with a bounded memory footprint. Besides, considering the uneven distribution of measured point clouds, a hierarchical sampling strategy is designed to improve training efficiency. Experiments demonstrate that our method effectively mitigates SDF approximation errors and achieves state-of-the-art mapping quality compared to existing approaches. The code will be released at https://github.com/tiev-tongji/N3-Mapping.

Index Terms:
Mapping; SLAM; Implicit Neural Representations

I INTRODUCTION

Building accurate and dense maps within large-scale environments is crucial for various robot applications, such as autonomous driving. The signed distance field (SDF) has been an important map representation employed to accomplish this task, due to its advantages in characterizing geometric information and compatibility with downstream tasks [1]. Unfortunately, traditional SDF mapping methods [2, 3, 4] have faced two longstanding challenges. The first is the trade-off between mapping accuracy and memory consumption. The second is the approximation error introduced by using projective distance, i.e. the distance along range sensor rays to the measured surface, to estimate SDF values. This can easily lead to an overestimation of the actual distance to the nearest surface.

Recently, neural implicit representations [5, 6, 7] have shown promising potential in modeling 3D scenes. Neural network-based distance fields exhibit continuity, enabling them to overcome resolution limitations and achieve high-fidelity reconstructions. However, learning accurate distance fields from raw range sensor data is not trivial, as it is challenging to obtain true distance supervision. Most current methods [8, 9, 10, 11, 12] still employ projective distance to approximate the ground-truth SDF supervision for training efficiency, while neglecting the associated systematic errors. Several works [13, 14] leverage surface normals to correct the projective distance and reduce the approximation error. Nonetheless, the former [13] focuses solely on large planar surfaces, limiting its applicability. The latter [14], on the other hand, lacks training stability as it approximates the normal direction using the unstable gradients of neural networks during training. Currently, it is vital to develop a practical non-projective SDF mapping system to overcome these limitations.

To this end, we propose N3-Mapping, a large-scale mapping approach with normal guided neural non-projective signed distance fields. We observe that, for any point around the surface, the normal generally provides the direction to the nearest surface point. Therefore, our method directly samples points and corresponding distance values along the normal direction near the surface. Such sampled SDF labels tend to be close to the ground truth, leading to improved mapping quality. For the implicit map, our method utilizes octree-based hierarchical sparse voxels to store optimizable feature vectors and a shallow MLP to decode queried local features into SDF values.

For large-scale mapping, we employ a voxel-oriented training strategy to alleviate catastrophic forgetting by caching historical supervision signals into their corresponding map voxels. A sliding window mechanism is then applied to promptly drop data outside the window, ensuring a bounded memory footprint. To further improve efficiency, we propose a hierarchical sampling strategy to avoid redundant training in densely observed regions and insufficient training in sparsely observed regions.

Our method achieves efficient, high-quality and incremental mapping in large-scale environments. To summarize, our contributions are as follows:

  • 1) A simple yet effective neural non-projective SDF training method guided by surface normals, enabling accurate and complete 3D dense mapping.

  • 2) A voxel-oriented training strategy combined with a sliding window mechanism and hierarchical sampling, which mitigates the forgetting issue and enhances training efficiency.

  • 3) Extensive experiments demonstrating that our method outperforms existing approaches in terms of mapping accuracy and completeness.

II RELATED WORK

Various scene representations have been explored for 3D dense mapping, including occupancy grids [15], meshes [16] and surfels [17]. In addition to these, SDFs have gained significant popularity in many robotic tasks such as planning [18, 2], localization [19, 14], and 3D mapping [4, 3]. As a milestone of using truncated signed distance function (TSDF) as map representation, KinectFusion [4] uses depth images as input and introduces the volumetric integration algorithm to enable real-time mapping. Subsequent studies follow the manner of TSDF integration and work on enhancing efficiency [2, 3], accuracy [20], and the capability to handle dynamic [21] and large-scale environments [22]. However, constrained by their explicit representations, these methods require substantial memory resources to achieve high resolutions.

Recently, research on implicit neural representations has made significant progress. Seminal works [5, 6, 7] use implicit neural networks to represent 3D objects and scenes, overcoming the limitations of traditional explicit representations. Inspired by them, [23, 24, 8] achieve incremental neural mapping by continuously training an MLP to represent the environment with sequentially input data. Considering the limited capacity of a single MLP, NICE-SLAM [25] combines dense grids to store optimizable local features and employs shallow MLPs to decode hidden information. This combination allows for more accurate and faster reconstruction but dense grids are not memory-efficient. Subsequent works tackle this issue by employing various data structures such as sparse octree [11, 10], hash encoding [26], and neural points [27]. In our approach, we adopt a sparse octree to store multi-layer local features, as done in SHINE-Mapping [11].

For training SDF with range data, recent studies [8, 9, 10, 11, 12] commonly use projective distance, i.e. the signed distance from a sampled point along the ray to the endpoint, as the supervision signal. This SDF approximation can lead to faster convergence while it also introduces errors, especially when the incidence angle of the ray is small. To obtain more accurate non-projective SDF, Voxfield [20] performs normals integration for each voxel and uses the fused normal to correct the distance value. Similarly, NeRF-LOAM [13] applies projective distance correction with normals specifically for ground points, and LocNDF [14] uses the gradient of the distance field to rectify SDF labels. Nevertheless, NeRF-LOAM only considers the ground plane, and LocNDF faces training instability due to circular dependency. Additionally, [24, 28] compute the distance between sampled points and the nearest measured surface points as the SDF label. However, such methods are sensitive to noise and rely on pre-built dense point clouds [28]. In our method, accurate SDF labels can be easily obtained by directly sampling points along the normal direction.

For incremental mapping with implicit representations, it has been shown that both neural network-based methods and feature grid-based methods face the challenge of catastrophic forgetting [11]. Most of these approaches [23, 25, 24, 10, 8] mitigate this issue by replaying historical keyframes. However, such replay-based methods are memory inefficient because they require storing historical keyframes, making them challenging to handle large-scale environments. To solve this problem, SHINE-Mapping [11] designs an update regularization strategy, albeit with a decrease in mapping performance. RIM [12] prioritizes robot-centric local implicit maps to achieve more efficient training but decouples the local map and global map. In contrast, our approach seamlessly integrates a voxel-oriented sliding window into the global implicit map. Supervisions from different viewpoints are accumulated in their respective voxels for batch training, resulting in smooth and consistent mapping results. Additionally, existing methods typically rely on naive random sampling to select samples for training, leading to redundant computations in densely observed regions while somewhat neglecting sparsely observed regions [29]. Our hierarchical sampling strategy ensures that all regions receive appropriate training attention.

III N3-Mapping

Figure 1: Overview of our approach. With a sequence of range data and corresponding poses, our approach samples non-projective distance values to obtain accurate SDF labels along the normal direction. During training, these labels are used to supervise the learning of our implicit map through the voxel-oriented training strategy.

In this paper, we present N3-Mapping, a framework designed for large-scale and high-quality 3D mapping, as outlined in Figure 1. Given sequentially input points and normals from LiDAR or RGB-D cameras with associated poses, we obtain accurate non-projective SDF labels, along with the corresponding sampled points, through normal-guided sampling (Section III-B). These training pairs are then maintained in our voxel-oriented sliding window (Section III-C1). For subsequent training, we employ a hierarchical sampling strategy (Section III-C2) to select a batch of training pairs at each iteration and optimize our octree-based implicit map (Section III-A2). Finally, we visualize and evaluate the mapping results as 3D meshes generated with the marching cubes algorithm [16].

III-A Preliminaries

III-A1 Signed Distance Fields

A signed distance field represents the 3D scene by assigning to each point in space a signed distance relative to the closest surface. The sign of the distance indicates whether the point is inside or outside the surface. The signed distance function $f$ can be defined as $f(\mathbf{x}) = s$, which maps a 3D coordinate $\mathbf{x}$ to the signed distance value $s$.

Notably, the gradient of the SDF, $\nabla_{\mathbf{x}} f(\mathbf{x})$, points towards the closest surface, and on the surface the gradient equals the surface normal: $\nabla_{\mathbf{x}} f(\mathbf{x}) = \mathbf{n}$ (or $-\mathbf{n}$, depending on the normal's orientation). This implies that normal priors can provide informative guidance for neural SDF learning.

III-A2 Implicit Neural Map Representation

In this paper, we utilize octree-based hierarchical sparse voxels to store learnable features at each node vertex and a shallow MLP to decode hidden features into SDF values. Following SHINE-Mapping [11], we build a hash table with Morton codes as keys to query features efficiently and facilitate map scalability. For an arbitrary point $\mathbf{x}$ within the map, its local feature vector at a given level $l$ can be obtained through the trilinear interpolation function $\mathrm{TriLerp}(\cdot)$ from its eight neighboring feature vectors $\{\mathbf{v}_{k}^{l}\}$. The features of the different levels are then aggregated by summation and fed into a shallow MLP $f_{\theta}$ with globally shared weights to decode the SDF value $s$:

s = f_{\theta}\Big(\sum_{l=1}^{L} \mathrm{TriLerp}(\mathbf{x}, \{\mathbf{v}_{i}^{l}\})\Big), \quad i \in \{1, \ldots, 8\}.

We denote this function as $s = f_{\theta}(\mathbf{x})$ for brevity in the following sections. Since the entire process is differentiable, we can jointly optimize the feature vectors and the parameters of the MLP using the generated signed distance values as supervision.
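To make the decoding step concrete, the following is a minimal PyTorch sketch of the feature aggregation and decoding described above. The class and function names are hypothetical; only the feature length (8) and the decoder size (2 hidden layers of 32 units, see Section IV-A3) are taken from the paper, and the corner features and trilinear weights are assumed to have already been gathered from the hash-indexed octree.

```python
import torch
import torch.nn as nn

class SDFDecoder(nn.Module):
    """Shallow MLP that maps an aggregated local feature to an SDF value."""
    def __init__(self, feat_dim=8, hidden=32):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, feat):
        return self.mlp(feat).squeeze(-1)  # predicted SDF value s


def trilerp(corner_feats, weights):
    """Interpolate the 8 corner feature vectors of a voxel.
    corner_feats: (N, 8, F) features, weights: (N, 8) trilinear weights."""
    return (weights.unsqueeze(-1) * corner_feats).sum(dim=1)


def query_sdf(decoder, corner_feats_per_level, weights_per_level):
    # Sum the interpolated features over all octree levels, then decode.
    feat = sum(trilerp(c, w) for c, w in
               zip(corner_feats_per_level, weights_per_level))
    return decoder(feat)
```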

III-B Non-projective SDF Learning

Figure 2: Different methods for sampling signed distance values. Both projective distance along the ray and corrected distance with normals lead to large errors on curved surfaces. Our normal-guided sampling method can produce more accurate distance values.

III-B1 Normal Guided Non-projective Distance

Ideally, we aim to use true signed distance values to supervise the learning of our implicit map. As discussed in Section II, existing methods often resort to the projective distance as an SDF approximation, which is obtained by sampling points along the ray $\mathbf{r}$ and computing the distance $s$ from the sampled point $\mathbf{x}_{r}$ to the endpoint $\mathbf{p}$:

s = \mathrm{sign}\big(\mathbf{r} \cdot (\mathbf{p} - \mathbf{x}_{r})\big)\,\|\mathbf{p} - \mathbf{x}_{r}\|.

This distance can be further projected along the surface normal $\mathbf{n}$ to reduce the approximation error [13, 14]:

s = \mathrm{sign}\big(\mathbf{r} \cdot (\mathbf{p} - \mathbf{x}_{r})\big)\,(\mathbf{p} - \mathbf{x}_{r}) \cdot \mathbf{n}.

However, this solution struggles with curved or irregular surfaces, which are common in real-world scenarios. As illustrated in Figure 2, both the projective distance (blue line) and the corrected distance (yellow line) still deviate substantially from the true distance (red line).

As discussed in Section III-A1, points sampled along the negative normal direction move toward the closest surface, aligning with the gradient direction of the distance field. Therefore, we propose to directly sample points along the normal direction $\mathbf{n}$ and use the signed distance from the sampled point $\mathbf{x}_{n}$ to the surface point $\mathbf{p}$ as supervision:

s = \mathrm{sign}\big(\mathbf{n} \cdot (\mathbf{p} - \mathbf{x}_{n})\big)\,\|\mathbf{p} - \mathbf{x}_{n}\|.

Such sampled SDF values are quite close to the ground truth. However, as the sampling distance increases, this approximation becomes less reliable because some sampled points may lie closer to other surfaces. Hence, we set a truncation interval $[-tr, tr]$ around the surface and only perform normal-guided sampling inside this region according to the Gaussian distribution $\mathcal{N}(0, \sigma)$, where $\sigma$ is a hyperparameter that indicates the magnitude of the measurement noise. We also sample points in free space along the ray between the sensor and the truncation region. The SDF labels of these points are set to the truncation value $tr = 3\sigma$. This primarily aims to eliminate potential artifacts caused by dynamic objects, without compromising the quality of the surface reconstruction.
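The sampling procedure above can be summarized in a short sketch. This is a minimal NumPy illustration for a single endpoint $\mathbf{p}$ with unit normal $\mathbf{n}$; the function and parameter names are hypothetical, and $\sigma = 0.1$ m is only an assumed default consistent with $tr = 0.3$ m in Section IV-A3.

```python
import numpy as np

def sample_training_pairs(p, n, origin, sigma=0.1, n_surf=3, n_free=3, rng=None):
    """Return sampled points and SDF labels for one measured point p."""
    rng = np.random.default_rng() if rng is None else rng
    tr = 3.0 * sigma                       # truncation distance tr = 3*sigma
    n = n / np.linalg.norm(n)              # unit surface normal at endpoint p
    depth = np.linalg.norm(p - origin)
    ray = (p - origin) / depth

    # Near-surface samples drawn along the normal with Gaussian offsets in [-tr, tr].
    offs = np.clip(rng.normal(0.0, sigma, n_surf), -tr, tr)
    x_n = p[None, :] + offs[:, None] * n[None, :]
    d = p[None, :] - x_n
    # Non-projective label: s = sign(n . (p - x_n)) * ||p - x_n||
    s_surf = np.sign(np.sum(d * n, axis=-1)) * np.linalg.norm(d, axis=-1)

    # Free-space samples along the ray before the truncation region, labelled with tr.
    ts = rng.uniform(0.0, max(depth - tr, 0.0), n_free)
    x_f = origin[None, :] + ts[:, None] * ray[None, :]
    s_free = np.full(n_free, tr)

    return np.vstack([x_n, x_f]), np.concatenate([s_surf, s_free])
```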

III-B2 Loss Function

We apply the binary cross entropy (BCE) loss as in SHINE-Mapping [11], since it inherently increases the importance of SDF labels that are close to zero. First, we map the SDF value to an occupancy probability in the range $(0, 1)$ through the sigmoid function $S(x) = 1/(1 + e^{-x/\beta})$, where $\beta$ controls the reconstruction sharpness. Given a sampled point $\mathbf{x}_{i}$ with SDF label $s_{i}$, forming a training pair, we use its corresponding occupancy value $o_{i} = S(s_{i})$ as supervision. The BCE loss can then be formulated as follows:

\mathcal{L}_{bce} = -\left[o_{i}\log(\hat{o}_{i}) + (1 - o_{i})\log(1 - \hat{o}_{i})\right], \quad (1)

where $\hat{o}_{i} = S(f_{\theta}(\mathbf{x}_{i}))$ is the prediction of our implicit model after the sigmoid mapping.

Furthermore, if $\mathbf{x}_{i}$ lies inside the truncation region, we add an eikonal term to ensure valid and continuous SDF values. This term serves as a regularization that encourages the SDF gradient w.r.t. the 3D query point coordinates to have unit length:

\mathcal{L}_{eik} = \left(\|\nabla_{\mathbf{x}_{i}} f_{\theta}(\mathbf{x}_{i})\| - 1\right)^{2}. \quad (2)

In summary, the final loss function is designed as follows:

\mathcal{L}_{total} = \mathcal{L}_{bce} + \lambda_{e}\mathcal{L}_{eik}, \quad (3)

where $\lambda_{e}$ represents the weight of the eikonal term. It is worth noting that we do not explicitly supervise the SDF gradient prediction using surface normals as done in [30, 31]. This is because the normal information has already been implicitly encoded in the SDF labels through the preceding normal-guided sampling process. Consequently, this simplified training process leads to improved mapping efficiency.
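For clarity, the following is a minimal PyTorch sketch of the complete objective. Here `model` stands for the implicit map $f_\theta(\mathbf{x})$, the function names are hypothetical, and the eikonal weight value is only a placeholder since $\lambda_e$ is not specified above.

```python
import torch
import torch.nn.functional as F

def sdf_losses(model, x_surf, s_surf, x_free, s_free, beta=0.1, lambda_e=0.1):
    """BCE occupancy loss on all samples plus eikonal term on near-surface samples.
    lambda_e is a placeholder value, not a setting reported in the paper."""
    # Occupancy supervision: S(s) = 1 / (1 + exp(-s / beta)).
    x = torch.cat([x_surf, x_free], dim=0)
    s = torch.cat([s_surf, s_free], dim=0)
    o_target = torch.sigmoid(s / beta)
    o_pred = torch.sigmoid(model(x) / beta)
    l_bce = F.binary_cross_entropy(o_pred, o_target)

    # Eikonal regularisation on near-surface samples: encourage ||grad f(x)|| = 1.
    x_g = x_surf.clone().detach().requires_grad_(True)
    grad = torch.autograd.grad(model(x_g).sum(), x_g, create_graph=True)[0]
    l_eik = ((grad.norm(dim=-1) - 1.0) ** 2).mean()

    return l_bce + lambda_e * l_eik
```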

III-C Voxel-oriented Training

III-C1 Voxel-oriented Sliding Window

Figure 3: Illustration of the difference between the keyframe-oriented and our proposed voxel-oriented sliding window. For the voxel $V_{0}$ at the edge of the local window, our strategy preserves all supervision from various views, while the keyframe-oriented method retains only partial observations, potentially leading to a forgetting issue.

Current feature grid-based methods [25, 10, 9] often select recently observed keyframes from the global set for efficient local optimization, similar to the sliding window method employed in traditional SLAM systems. However, as illustrated in Figure 3, this keyframe-oriented sliding window omits observations from frame $T_{n-3}$ for voxel $V_{0}$. It has been shown that optimizing local features with such partial observations can lead to non-smooth reconstruction results [11].

To tackle this issue, we propose a voxel-oriented sliding window strategy in which each observation is associated with a voxel instead of a keyframe. Historical data are accumulated in their corresponding voxels via hash querying. Benefiting from this design, each voxel within the local window retains complete supervision signals for training. For simplicity, we define the sliding window as a cube within the octree structure of our implicit map (as detailed in Section III-A2). The local window is centered at the current sensor origin and its main axes are aligned with the global octree. Given the sensor origin coordinates $\mathbf{o}$ and the maximum perception range $r$, the 3D bounding box of this cube in voxel units can be derived as follows:

\mathbf{o}^{v} = \left\lfloor \frac{\mathbf{o}}{v} \right\rfloor, \quad r^{v} = \left\lfloor \frac{r}{v} \right\rfloor, \qquad [\mathbf{b}_{u}^{v}, \mathbf{b}_{l}^{v}] = [\mathbf{o}^{v} + r^{v}, \mathbf{o}^{v} - r^{v}], \quad (4)

where $v$ is the leaf voxel size, $\lfloor\cdot\rfloor$ denotes the floor function, $\mathbf{o}^{v}$ is the voxel coordinate of the sensor origin, $r^{v}$ is the perception range in voxel units, and $[\mathbf{b}_{u}^{v}, \mathbf{b}_{l}^{v}]$ are the upper and lower bounds of the 3D bounding box in the voxel coordinate frame. This local window allows our approach to focus on mapping newly explored regions with limited memory usage while avoiding the forgetting issue. Moreover, to ensure global consistency, the MLP decoder is frozen once convergence is achieved on a certain number of initial frames.
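A minimal NumPy sketch of Eq. (4) follows; the function names are hypothetical.

```python
import numpy as np

def window_bounds(origin, r, voxel_size):
    """Eq. (4): axis-aligned voxel-space bounds of the sliding window.
    origin: sensor position (3,), r: max perception range, voxel_size: leaf voxel size v."""
    o_v = np.floor(origin / voxel_size).astype(int)   # sensor origin in voxel units
    r_v = int(np.floor(r / voxel_size))               # perception range in voxel units
    return o_v - r_v, o_v + r_v                       # lower and upper bounds

def in_window(voxel_coord, lower, upper):
    # A voxel (and its cached training pairs) is kept iff it lies inside the box.
    return np.all(voxel_coord >= lower) and np.all(voxel_coord <= upper)
```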

III-C2 Hierarchical Sampling

We note that the point cloud measurements tend to be unevenly distributed on the surface. If we simply employ random sampling at each training iteration, regions with higher point cloud density are more likely to be trained, while sparsely observed regions lack sufficient optimization to achieve complete convergence.

To address this problem, we adopt a hierarchical sampling strategy. In our implementation, each voxel maintains a block of training pairs, consisting of query points and their corresponding SDF labels, stored in an array. These arrays are organized in the list $L_{P}$. The association between each voxel and its index in $L_{P}$ is maintained through a look-up table $T$. As detailed in Algorithm 1, our sampling strategy involves two stochastic stages. In the first stage, we randomly sample $N_{v}$ voxels within the sliding window; this ensures spatially uniform voxel-wise sampling, independent of the point cloud's distribution. Then, within each sampled voxel, we further sample $N_{p}$ training pairs. If a voxel contains fewer pairs than a threshold $N_{t}$, we reduce its sampling number until it accumulates more observations. After collecting all sampled training pairs, we compute the loss functions described in Section III-B2. In practice, we perform hierarchical sampling in a parallel manner to accelerate this process.

Algorithm 1 Hierarchical Sampling
Input: list of training pair arrays $L_{P}$; voxel IDs (Morton codes) $M$ in the local window; look-up table $T$; sampling numbers $N_{v}$, $N_{p}$; threshold $N_{t}$.
Output: sampled subset of training pairs $P_{\text{sub}}$.
1: $P_{\text{sub}} \leftarrow$ EmptyArray
2: $M' \leftarrow$ RandSample($M$, $N_{v}$)
3: for $m_{i} \in M'$ do
4:   $\text{idx} \leftarrow T[m_{i}]$  ▷ index of the pairs array
5:   $P_{i} \leftarrow L_{P}[\text{idx}]$  ▷ get the pairs array
6:   if len($P_{i}$) $> N_{t}$ then
7:     $P' \leftarrow$ RandSample($P_{i}$, $N_{p}$)
8:   else
9:     $P' \leftarrow$ RandSample($P_{i}$, $N_{p}/3$)
10:   end if
11:   $P_{\text{sub}} \leftarrow P_{\text{sub}} + P'$
12: end for
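Below is a minimal Python sketch of Algorithm 1 with hypothetical container names; it mirrors the two-stage sampling and the reduced budget $N_p/3$ for sparsely observed voxels, without the parallelization mentioned above.

```python
import random

def hierarchical_sampling(pairs_list, window_codes, lookup, n_v, n_p, n_t):
    """Two-stage sampling: first voxels in the window, then pairs per voxel."""
    batch = []
    for code in random.sample(window_codes, min(n_v, len(window_codes))):
        pairs = pairs_list[lookup[code]]            # training pairs cached in this voxel
        budget = n_p if len(pairs) > n_t else max(n_p // 3, 1)
        batch.extend(random.sample(pairs, min(budget, len(pairs))))
    return batch
```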

IV EXPERIMENTS

IV-A Experimental Setup

IV-A1 Baselines

For comparison, we choose two advanced TSDF integration-based methods, Voxblox [2] and Voxfield [20], along with two state-of-the-art implicit representation-based methods, SHINE-Mapping [11] and NeRF-LOAM [13], as our baselines. The odometry module of NeRF-LOAM is omitted since our focus lies in mapping performance. All methods run incremental mapping using their official open-source code with ground truth poses and the same voxel size.

IV-A2 Metrics

To quantitatively evaluate mapping results, we adopt standard reconstruction metrics including Accuracy (Acc., cm), Completion (Comp., cm), Chamfer-L1 distance (C-L1, cm), Completion Ratio (Comp.Ratio, %), and F-score (%). Since the observed regions may not fully cover the ground truth model, the unobserved portions are culled before evaluation. The final mesh used for evaluation is generated by marching cubes on the same fixed-size grid to ensure fairness.
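As a reference for how these metrics are typically computed, the following is a minimal sketch based on nearest-neighbor distances between points sampled from the predicted mesh and the ground-truth model. The exact sampling and culling protocol of our evaluation is not reproduced here, and the helper name is hypothetical.

```python
import numpy as np
from scipy.spatial import cKDTree

def mesh_metrics(pred_pts, gt_pts, thresh=0.10):
    """Accuracy, Completion, Chamfer-L1, Completion Ratio, and F-score
    between predicted and ground-truth point sets (metres); thresh matches
    the 10 cm / 20 cm columns in the tables."""
    d_pred_to_gt = cKDTree(gt_pts).query(pred_pts)[0]   # accuracy distances
    d_gt_to_pred = cKDTree(pred_pts).query(gt_pts)[0]   # completion distances
    acc = d_pred_to_gt.mean()
    comp = d_gt_to_pred.mean()
    chamfer_l1 = 0.5 * (acc + comp)
    precision = (d_pred_to_gt < thresh).mean()
    recall = (d_gt_to_pred < thresh).mean()              # completion ratio
    fscore = 2 * precision * recall / (precision + recall + 1e-12)
    return acc, comp, chamfer_l1, recall, fscore
```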

IV-A3 Implementation Details

Our method takes as input point-wise normals, which are estimated using Open3D [32] with the number of nearest neighbors set to 20. For the implicit map representation, feature vectors are stored in the lowest three levels of the octree, each with a length of 8. The leaf voxel size is set to 0.2 m and the truncation distance to 0.3 m. We sample 3 points near the surface and another 3 in free space. Our shared MLP decoder contains 2 fully-connected layers with ReLU activations, each with 32 hidden units. For training, we set $N_{v} = 1024$, $N_{p} = 8$ and the sharpness parameter $\beta = 0.1$. All experiments are conducted on a desktop PC with a 3.7 GHz Intel i9-10900X CPU and an NVIDIA RTX 3090 GPU.
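As an illustration of the normal-estimation step, a minimal Open3D sketch is given below. The k-nearest-neighbor setting follows the text; orienting the normals towards the sensor origin is an assumption about the sign convention, not something stated above.

```python
import numpy as np
import open3d as o3d

def estimate_normals(points, sensor_origin):
    """Per-point normals from 20 nearest neighbours (Open3D)."""
    pcd = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(points))
    pcd.estimate_normals(o3d.geometry.KDTreeSearchParamKNN(knn=20))
    # Assumed convention: orient normals towards the sensor for a consistent sign.
    pcd.orient_normals_towards_camera_location(camera_location=sensor_origin)
    return np.asarray(pcd.normals)
```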

IV-B Mapping Quality

IV-B1 Maicity

Figure 4: A qualitative comparison of different methods on the Maicity and the Newer College dataset. The odd rows show the reconstructed mesh colored by surface normals. The even rows present the error map with the ground truth point cloud as a reference where the redder points represent larger errors.
TABLE I: Quantitative mapping results of different methods on the MaiCity dataset.

Method     | Acc. ↓ [cm] | Comp. ↓ [cm] | C-L1 ↓ [cm] | Comp.Ratio ↑ [10 cm, %] | F-score ↑ [10 cm, %]
Voxblox    | 4.76        | 18.92        | 11.84       | 69.65                   | 80.24
Voxfield   | 3.64        | 11.84        | 7.74        | 82.76                   | 88.78
SHINE      | 3.82        | 14.15        | 8.99        | 80.28                   | 87.51
NeRF-LOAM  | 2.93        | 5.60         | 4.27        | 92.07                   | 94.22
Ours       | 2.22        | 5.60         | 3.91        | 93.64                   | 96.05

In our first experiment, we evaluate mapping quality on the simulated MaiCity dataset [33], which provides 64-beam noise-free LiDAR scans and a ground truth map of an urban scenario. Table I presents a quantitative comparison of our method and the baselines. Our method achieves the best performance across all metrics. A qualitative comparison is shown in Figure 4, where it is evident that our approach produces the most accurate and complete reconstruction. The meshes obtained by the traditional methods, Voxblox [2] and Voxfield [20], appear overly smoothed and fail to preserve details such as the tree and the pedestrian. SHINE-Mapping [11] uses a regularization-based continual learning strategy but still encounters the forgetting issue, resulting in an incomplete and non-smooth mesh. NeRF-LOAM [13] achieves good ground reconstruction thanks to its ground separation strategy. Nonetheless, our method shows superior overall mapping accuracy, particularly in fine-grained structures and sparsely observed marginal regions.

IV-B2 Newer College

We also choose the real-world Newer College dataset [34] for evaluation. This dataset comprises a hand-carried LiDAR sequence captured at Oxford University with noticeable measurement noise and motion distortion. A high-precision point cloud acquired by a terrestrial scanner serves as pseudo ground truth. In accordance with NeRF-LOAM [13], we use one out of every five scans for mapping in all methods, which makes the task more challenging. The quantitative and qualitative results are shown in Table II and Figure 4, respectively. Our approach outperforms all baseline methods on this noisy dataset. Voxblox and Voxfield struggle with sparse observations and dynamic objects, producing large holes in the ground and wrongly eliminating the tree region. SHINE-Mapping and NeRF-LOAM can produce relatively complete maps, but the former shows non-smooth reconstruction and large errors, even on flat ground, while the latter tends to produce fragmented artifacts in edge areas and small holes in the ground. Our approach effectively balances reconstruction accuracy, completeness, and smoothness, yielding the best mapping performance.

TABLE II: Quantitative mapping results of different methods on the Newer College dataset.

Method     | Acc. ↓ [cm] | Comp. ↓ [cm] | C-L1 ↓ [cm] | Comp.Ratio ↑ [20 cm, %] | F-score ↑ [20 cm, %]
Voxblox    | 8.73        | 12.17        | 10.45       | 89.85                   | 90.81
Voxfield   | 8.32        | 10.82        | 9.57        | 91.17                   | 91.27
SHINE      | 6.98        | 10.44        | 8.71        | 90.23                   | 92.49
NeRF-LOAM  | 7.53        | 10.45        | 8.99        | 92.19                   | 92.93
Ours       | 6.32        | 9.75         | 8.04        | 92.86                   | 94.54

IV-C Ablation Study

TABLE III: Ablation study of our method on the MaiCity dataset.

Normal     | Vox. | Hier. | Acc. ↓ [cm] | Comp. ↓ [cm] | C-L1 ↓ [cm] | F-score ↑ [10 cm, %]
           |      |       | 3.03        | 8.28         | 5.66        | 91.87
           |      |       | 2.55        | 6.34         | 4.44        | 94.97
           |      |       | 2.31        | 5.91         | 4.11        | 95.82
           |      |       | 2.57        | 8.85         | 5.71        | 93.27
           |      |       | 2.22        | 5.60         | 3.91        | 96.05
Correction |      |       | 2.40        | 7.00         | 4.70        | 94.42
KF         |      |       | 2.81        | 6.45         | 4.63        | 94.78

  • "Normal" represents the option of normal-guided non-projective signed distance. "Vox." represents the voxel-oriented sliding window. "Hier." represents the hierarchical sampling strategy. "Correction" means applying distance correction with normals directly. "KF" means using the keyframe-oriented sliding window strategy.

We conduct ablation experiments to verify the effectiveness of each component and provide both quantitative and qualitative results in Table III and Figure 5.

IV-C1 Normal Guided Sampling

We compare the performance of our method with and without normal guided sampling. The results shown in Table III demonstrate that our normal guided sampling method significantly improves mapping accuracy and completeness. The corresponding visual evidence can be seen in Figure 5 (a1) and (b1). The area highlighted by the orange box illustrates that our method manages to avoid overestimating the signed distance with large incidence angles, leading to more valid distance fields. We further evaluate an alternative option that uses normals to correct the signed distance values for training [13], as depicted in Figure 5 (c1). The results show that it can provide some improvement over using the projective distance, but our reconstruction is more continuous and complete.

IV-C2 Voxel-oriented Sliding Window

We adopt a naive replay-based method as the default baseline, which randomly samples data from all historical keyframes to train the implicit map together with current measurements. Table III shows that the reconstruction errors significantly decrease when using the voxel-oriented sliding window. This improvement is particularly evident in newly observed regions, as marked by the orange circle in Figure 5 (a2) and (b2). This is because our method can rapidly achieve convergence within the sliding window, while the replay-based baseline dilutes the training attention to newly observed regions by historical data. We also provide the results of applying a keyframe-oriented sliding window for comparison. As shown in Figure 5 (c2), this design can enhance the reconstruction quality of the current local map. However, it leads to inconsistent reconstruction in historically visited areas, as highlighted by the red circle, due to the forgetting issue. In contrast, our method avoids this problem since each voxel retains complete supervision from various viewpoints.

IV-C3 Hierarchical Sampling Strategy

We compare the performance of our method with and without hierarchical sampling across different iteration settings to analyze its impact on training efficiency. The results in Figure 6 show that as the number of iterations increases, mapping performance gradually improves, reaching convergence at around 40 iterations. Notably, the mapping quality with hierarchical sampling is better even with fewer iterations. This illustrates the effectiveness of this strategy in enhancing training efficiency.

Figure 5: Ablation study for our contributions and alternative designs on the MaiCity dataset. Regions are highlighted by colored boxes and circles to distinguish improvements.
Figure 6: Ablation study of hierarchical sampling strategy.

IV-D Scalable Incremental Mapping

Figure 7: Top: the reconstruction result of KITTI odometry sequence 07. Bottom: the memory usage of historical data during the incremental mapping with and without our sliding window strategy.

To demonstrate the scalability of our method to larger outdoor environments, we perform incremental mapping on the KITTI odometry dataset [35]. Figure 7 showcases the reconstruction results of sequence 07 as an example. It can be seen that our method produces a complete, consistent, and dense map within a real-world driving scenario. Notably, thanks to our voxel-oriented sliding window strategy, the memory consumption of historical data remains stable at around 1GB throughout the mapping process. In contrast, the replay-based strategy experiences rapid memory growth during mapping, eventually leading to memory overflow.

IV-E Robustness Analysis

Figure 8: Indoor mapping examples of the Neural RGBD dataset. Our approach manages to handle thin and complex geometries.
Figure 9: Impact of truncation distance on reconstruction quality.

We test our method on the Neural RGBD dataset [8] to evaluate its robustness in different environments. Here we set the leaf voxel size to 4 cm and the truncation distance to 3 cm to accommodate indoor environments. Figure 8 demonstrates that our approach effectively captures intricate details of thin and complex structures, such as the chair and the crate. As a reference, the results obtained by using projective distance exhibit a "bulging effect" and suffer from noticeable artifacts.

Figure 10: Impact of moving obstacles during the mapping process on the Newer College dataset. Our approach manages to eliminate such dynamic objects.

We also evaluate the impact of truncation distance on reconstruction quality. As indicated in Figure 9, our method achieves lower C-L1 than those using projective SDF. However, our method also exhibits a slightly higher sensitivity to the truncation distance in the Staircase scenario with complex geometries. This is because, in such cases, normal guided sampled points can easily interfere with other nearby surfaces, resulting in poorer reconstruction. When applied in structurally simpler urban scenes (Maicity), our method demonstrates increased robustness to the choice of the truncation interval. It is also worth noting that an excessively small truncation distance can compromise the completeness of the reconstruction, leading to a significant increase in overall error.

Lastly, we take a close look at the impact of moving obstacles on the mapping results, using the Newer College dataset as an example. As depicted in Figure 10, during the mapping process, moving objects like the pedestrians leave artifacts along their motion trajectories. However, thanks to our free space sampling design, these artifacts can be gradually suppressed and eliminated, while static objects such as the van are retained. Consequently, the final mapping results are complete and clean.

IV-F Runtime Analysis

We choose MaiCity-01 and KITTI-07 to compare runtime in Table IV. Although our method takes longer for training-pair allocation in the preprocessing stage, it avoids the time-consuming regularization update during training, resulting in better overall efficiency. Currently, our method does not meet real-time requirements and performs incremental mapping offline. It can be further sped up by an optimized C++ implementation and more efficient implicit representations [12, 26].

TABLE IV: Average per-frame time consumption (seconds).

Dataset | Method | Preprocess | Mapping | Total
MaiCity | SHINE  | 0.13       | 2.37    | 2.50
MaiCity | Ours   | 0.25       | 1.84    | 2.09
KITTI   | SHINE  | 0.17       | 2.92    | 3.09
KITTI   | Ours   | 0.31       | 2.14    | 2.45

V CONCLUSION

We introduce N3-Mapping, a novel large-scale mapping approach using normal-guided neural non-projective SDFs. Experiments demonstrate that our method overcomes the limitations associated with projective distance and achieves more accurate and complete reconstruction. Additionally, our voxel-oriented training strategy not only enables efficient incremental mapping without the forgetting issue but also ensures bounded memory consumption. In future work, we plan to explore more robust non-projective SDF sampling methods. Moreover, beyond surface reconstruction, our implicit maps, with their inherent continuity, have the potential to be reused for accurate robot localization [14, 13], which is the focus of our next work.

References

  • [1] H. Oleynikova, A. Millane, Z. Taylor, E. Galceran, J. Nieto, and R. Siegwart, “Signed distance fields: A natural representation for both mapping and planning,” in RSS 2016 workshop: geometry and beyond-representations, physics, and scene understanding for robotics.   University of Michigan, 2016.
  • [2] H. Oleynikova, Z. Taylor, M. Fehr, R. Siegwart, and J. Nieto, “Voxblox: Incremental 3D Euclidean Signed Distance Fields for on-board MAV planning,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017, pp. 1366–1373.
  • [3] I. Vizzo, T. Guadagnino, J. Behley, and C. Stachniss, “VDBFusion: Flexible and Efficient TSDF Integration of Range Sensor Data,” Sensors, vol. 22, no. 3, p. 1296, 2022.
  • [4] S. Izadi, D. Kim, O. Hilliges, D. Molyneaux, R. Newcombe, P. Kohli, J. Shotton, S. Hodges, D. Freeman, A. Davison, and A. Fitzgibbon, “KinectFusion: Real-time 3D reconstruction and interaction using a moving depth camera,” in Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology, 2011, pp. 559–568.
  • [5] J. J. Park, P. Florence, J. Straub, R. Newcombe, and S. Lovegrove, “DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 165–174.
  • [6] L. Mescheder, M. Oechsle, M. Niemeyer, S. Nowozin, and A. Geiger, “Occupancy Networks: Learning 3D Reconstruction in Function Space,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 4460–4470.
  • [7] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis,” in European Conference on Computer Vision, 2020, pp. 405–421.
  • [8] D. Azinović, R. Martin-Brualla, D. B. Goldman, M. Nießner, and J. Thies, “Neural RGB-D Surface Reconstruction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 6290–6301.
  • [9] J. Wang, T. Bleja, and L. Agapito, “GO-Surf: Neural Feature Grid Optimization for Fast, High-Fidelity RGB-D Surface Reconstruction,” in 2022 International Conference on 3D Vision (3DV), 2022, pp. 433–442.
  • [10] X. Yang, H. Li, H. Zhai, Y. Ming, Y. Liu, and G. Zhang, “Vox-Fusion: Dense Tracking and Mapping with Voxel-based Neural Implicit Representation,” in 2022 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), 2022, pp. 499–507.
  • [11] X. Zhong, Y. Pan, J. Behley, and C. Stachniss, “SHINE-Mapping: Large-Scale 3D Mapping Using Sparse Hierarchical Implicit Neural Representations,” in 2023 IEEE International Conference on Robotics and Automation (ICRA), 2023, pp. 8371–8377.
  • [12] J. Liu and H. Chen, “Towards Real-time Scalable Dense Mapping using Robot-centric Implicit Representation,” arXiv preprint arXiv:2306.10472, 2023.
  • [13] J. Deng, Q. Wu, X. Chen, S. Xia, Z. Sun, G. Liu, W. Yu, and L. Pei, “NeRF-LOAM: Neural Implicit Representation for Large-Scale Incremental LiDAR Odometry and Mapping,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 8218–8227.
  • [14] L. Wiesmann, T. Guadagnino, I. Vizzo, N. Zimmerman, Y. Pan, H. Kuang, J. Behley, and C. Stachniss, “LocNDF: Neural Distance Field Mapping for Robot Localization,” IEEE Robotics and Automation Letters, vol. 8, no. 8, pp. 4999–5006, 2023.
  • [15] A. Elfes, “Using occupancy grids for mobile robot perception and navigation,” Computer, vol. 22, no. 6, pp. 46–57, 1989.
  • [16] W. E. Lorensen and H. E. Cline, “Marching cubes: A high resolution 3D surface construction algorithm,” ACM SIGGRAPH Computer Graphics, vol. 21, no. 4, pp. 163–169, 1987.
  • [17] H. Pfister, M. Zwicker, J. Van Baar, and M. Gross, “Surfels: Surface elements as rendering primitives,” in Proceedings of the 27th annual conference on Computer graphics and interactive techniques.   ACM Press, 2000, pp. 335–342.
  • [18] M. Zucker, N. Ratliff, A. D. Dragan, M. Pivtoraiko, M. Klingensmith, C. M. Dellin, J. A. Bagnell, and S. S. Srinivasa, “CHOMP: Covariant Hamiltonian optimization for motion planning,” The International Journal of Robotics Research, vol. 32, no. 9-10, pp. 1164–1193, 2013.
  • [19] H. Huang, Y. Sun, H. Ye, and M. Liu, “Metric Monocular Localization Using Signed Distance Fields,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2019, pp. 1195–1201.
  • [20] Y. Pan, Y. Kompis, L. Bartolomei, R. Mascaro, C. Stachniss, and M. Chli, “Voxfield: Non-Projective Signed Distance Fields for Online Planning and 3D Reconstruction,” in 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2022, pp. 5331–5338.
  • [21] R. A. Newcombe, D. Fox, and S. M. Seitz, “DynamicFusion: Reconstruction and Tracking of Non-Rigid Scenes in Real-Time,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 343–352.
  • [22] T. Whelan, M. Kaess, H. Johannsson, M. Fallon, J. J. Leonard, and J. McDonald, “Real-time large-scale dense RGB-D SLAM with volumetric fusion,” The International Journal of Robotics Research, vol. 34, no. 4-5, pp. 598–626, 2015.
  • [23] E. Sucar, S. Liu, J. Ortiz, and A. J. Davison, “iMAP: Implicit Mapping and Positioning in Real-Time,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 6229–6238.
  • [24] J. Ortiz, A. Clegg, J. Dong, E. Sucar, D. Novotny, M. Zollhoefer, and M. Mukadam, “iSDF: Real-Time Neural Signed Distance Fields for Robot Perception,” in Robotics: Science and Systems, 2022.
  • [25] Z. Zhu, S. Peng, V. Larsson, W. Xu, H. Bao, Z. Cui, M. R. Oswald, and M. Pollefeys, “NICE-SLAM: Neural Implicit Scalable Encoding for SLAM,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 12 786–12 796.
  • [26] T. Müller, A. Evans, C. Schied, and A. Keller, “Instant neural graphics primitives with a multiresolution hash encoding,” ACM Transactions on Graphics, vol. 41, no. 4, pp. 1–15, 2022.
  • [27] E. Sandström, Y. Li, L. Van Gool, and M. R. Oswald, “Point-SLAM: Dense Neural Point Cloud-based SLAM,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 18 433–18 444.
  • [28] C. Shi, F. Tang, Y. Wu, X. Jin, and G. Ma, “Accurate implicit neural mapping with more compact representation in large-scale scenes using ranging data,” IEEE Robotics and Automation Letters, vol. 8, no. 10, pp. 6683–6690, 2023.
  • [29] C. Jiang, H. Zhang, P. Liu, Z. Yu, H. Cheng, B. Zhou, and S. Shen, “H2-Mapping: Real-time Dense Mapping Using Hierarchical Hybrid Representation,” IEEE Robotics and Automation Letters, vol. 8, no. 10, pp. 6787–6794, 2023.
  • [30] V. Sitzmann, J. Martel, A. Bergman, D. Lindell, and G. Wetzstein, “Implicit Neural Representations with Periodic Activation Functions,” in Advances in Neural Information Processing Systems, vol. 33, 2020, pp. 7462–7473.
  • [31] A. Gropp, L. Yariv, N. Haim, M. Atzmon, and Y. Lipman, “Implicit geometric regularization for learning shapes,” in International Conference on Machine Learning, 2020, pp. 3789–3799.
  • [32] Q.-Y. Zhou, J. Park, and V. Koltun, “Open3D: A modern library for 3D data processing,” arXiv:1801.09847, 2018.
  • [33] I. Vizzo, X. Chen, N. Chebrolu, J. Behley, and C. Stachniss, “Poisson Surface Reconstruction for LiDAR Odometry and Mapping,” in 2021 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2021, pp. 5624–5630.
  • [34] M. Ramezani, Y. Wang, M. Camurri, D. Wisth, M. Mattamala, and M. Fallon, “The newer college dataset: Handheld lidar, inertial and vision with ground truth,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020, pp. 4353–4360.
  • [35] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The kitti dataset,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013.