MonoNeuralFusion: Online Monocular Neural 3D Reconstruction with Geometric Priors

Zi-Xin Zou, Shi-Sheng Huang, Yan-Pei Cao, Tai-Jiang Mu,
Ying Shan, Hongbo Fu Zi-Xin Zou and Tai-Jiang Mu are with BNRist, the Department of Computer Science and Technology, Tsinghua University, Beijing, China. E-mail: [email protected], [email protected] Shi-Sheng Huang is with the School of Artificial Intelligence, Beijing Normal University, Beijing, China. Email: [email protected]. Hongbo Fu is with the School of Creative Media, City University of Hong Kong, Hong Kong, China. Email: [email protected]. Yan-Pei Cao and Ying Shan are with the ARC Lab, Tencent PCG. Email: [email protected], [email protected].

Abstract

High-fidelity 3D scene reconstruction from monocular videos continues to be challenging, especially for complete and fine-grained geometry reconstruction. The previous 3D reconstruction approaches with neural implicit representations have shown a promising ability for complete scene reconstruction, while their results are often over-smooth and lack enough geometric details. This paper introduces a novel neural implicit scene representation with volume rendering for high-fidelity online 3D scene reconstruction from monocular videos. For fine-grained reconstruction, our key insight is to incorporate geometric priors into both the neural implicit scene representation and neural volume rendering, thus leading to an effective geometry learning mechanism based on volume rendering optimization. Benefiting from this, we present MonoNeuralFusion to perform the online neural 3D reconstruction from monocular videos, by which the 3D scene geometry is efficiently generated and optimized during the on-the-fly 3D monocular scanning. The extensive comparisons with state-of-the-art approaches show that our MonoNeuralFusion consistently generates much better complete and fine-grained reconstruction results, both quantitatively and qualitatively.

Index Terms:

online monocular reconstruction, neural implicit scene representation, volume rendering, geometric prior guidance

Refer to caption — Figure 1: Online 3D scene reconstruction from a monocular video using our MonoNeuralFusion. Our MonoNeuralFusion utilizes a neural implicit scene representation with volume rendering and incrementally builds surface reconstruction from a sequence of posed RGB images by a geometry learning mechanism guided by geometric priors. Such a mechanism enables high-fidelity surface reconstruction with fine geometric details. We illustrate the final mesh and two close-ups from two views.

1 Introduction

Online reconstruction of 3D indoor scenes from monocular videos continues to be an important research topic in the computer graphics and computer vision communities, and benefits various applications in virtual/augmented reality, robotics, video games, etc. Although the state-of-the-art visual simultaneous localization and mapping (vSLAM) techniques [1, 2] can calculate accurate camera poses from monocular images, it is still challenging for the current monocular 3D reconstruction solutions to achieve complete, coherent and fine-grained reconstruction results.

Most depth-based online 3D reconstruction approaches perform 3D volumetric fusion based on depth maps, with a surface represented by truncated signed distance function (TSDF) [3, 4, 5, 6, 7] or neural implicit function [8, 9, 10]. However, one main drawback of a monocular video input is the lack of physically reliable depth, making it difficult to apply the current mainstream volumetric fusion techniques to the monocular 3D reconstruction task. Although the technique of Mutli-View-Stereo (MVS) or Structure-from-Motion (SfM) can provide coherent depth estimation [11, 12, 13, 14, 15], the resulting semi-dense or sparse depth maps often lead to incomplete reconstruction results. In addition, their time-consuming computation makes them not suitable for interactive applications. With the progress of deep learning, some pioneering works [16, 17, 18] adopt the single-view depth estimation to monocular 3D reconstruction, and have achieved impressive surface reconstruction results. However, given effective deep learning based monocular depth estimation approaches [19, 20, 21], it is still challenging to generate consistent depth estimation across different views, making it difficult to build coherent 3D reconstruction for large-scale VA/AR applications.

The recent work of NeuralRecon [22] proposes to reconstruct the 3D geometry with a neural network instead of multi-view depth maps, and has achieved coherent 3D surface reconstruction results from monocular videos. However, their simple average pooling for multi-view 3D volume feature fusion often leads to a over-smooth geometry reconstruction without enough geometric details. Instead, TransformerFusion [23] introduces spatial-aware feature fusion for better geometric detail reconstruction. However, its ability is still limited for fine-grained geometry reconstruction of certain objects, such as chair legs, monitor stands, etc. The recent success of NeRF [24] utilizes powerful volume rendering with an implicit representation, which enables impressive surface reconstruction from multi-view images. Although subsequent works achieve even better fine-grained surface reconstruction with the aid of geometry regularization (NeuS [25], MonoSDF [26], NeuRIS [27]) or Manhattan-world assumption[28], the time-consuming geometry learning of implicit scene representations keeps them away from online surface reconstruction for monocular videos. An efficient and effective geometry learning mechanism remains unexplored for online monocular 3D reconstruction, towards fine-grained surface reconstruction with better geometric details.

Aiming at much better fine-grained surface reconstruction quality during the on-the-fly 3D monocular scans, we introduce a novel neural implicit scene representation with volume rendering for the online monocular 3D reconstruction task. Instead of encoding the 3D scene geometry as a single MLP [24, 25, 29, 30], we formulate a neural implicit scene representation (NISR), which encodes a 3D scene as sparse feature volumes and decodes it as a continuous signed distance function (SDF), instead of a resolution-dependent SDF as NeuralRecon [22]. Based on NISR, we develop a novel volume rendering approach and an efficient volume rendering optimization, which is especially suitable for incremental geometry learning during the online surface reconstruction task. Moreover, we further introduce geometric priors (surface normal prior, eikonal regularization prior, and normal map prior) to both the neural implicit scene representation learning and the volume rendering optimization, leading to an effective geometric prior guided geometry learning mechanism for high-fidelity surface reconstruction with geometric details.

Based on such an efficient and effective geometric prior guided neural implicit scene representation with volume rendering, we propose MonoNeuralFusion to perform coherent 3D reconstruction from on-the-fly monocular 3D scans, with much better fine-grained surface quality. To demonstrate its effectiveness, we have extensively evaluated our approach on various public 3D indoor scan datasets, such as ScanNet [31], TUM-RGBD [32], and Replica [33], in comparison with state-of-the-art online monocular 3D reconstruction approaches, such as NeuralRecon [22] and TransformerFusion [23]. Results show that our approach achieves better surface reconstruction in quantitative accuracy metrics, with much better fine-detailed geometric details qualitatively, making itself a new state-of-the-art online monocular 3D reconstruction approach. We summarise our main contributions as follows:

•

We introduce the neural implicit scene representation (NISR) with volume rendering, serving as an efficient scene geometry representation for the online geometry learning task.
•

We propose an effective geometric prior guided geometry learning mechanism, by leveraging the geometric priors to neural implicit scene representation learning and volume rendering optimization, towards high-fidelity surface reconstruction with geometric details.
•

We introduce MonoNeuralFusion, an online system to incrementally build surface reconstruction, achieving much better fine-grained reconstruction quality thanks to our geometric prior guidance.

2 Related Work

Surface Reconstruction from Monocular Images. Previous monocular surface reconstruction approaches can be mainly divided into two types. The first type is a depth-based approach, which first estimates depth from single-view or multi-view images and then performs the geometry reconstruction by volumetric fusion. For instance, CNN-SLAM [16] is probably the first to perform the monocular 3D reconstruction by predicting a depth map from a single view with a CNN-based network and refining the depth through a traditional depth filter. CodeSLAM [34] and CodeMapping [35] propose a compact and optimizable code to represent the depth by using a conditional variational auto-encoder(VAE) and jointly optimize it from multi-view dense bundle adjustment. Different from depth estimation from single-view images, which completely relies on the learning ability of the depth prediction network, depth estimation from multi-view images could achieve locally more coherent depth estimation. Mobile3DRecon [18] uses a multi-view semi-global matching method followed by a depth refinement post-processing for robust monocular depth estimation. Recently, learning-based multi-view depth estimation methods benefit from priors with a data-driven approach and achieve much better depth estimation quality [36]. Bayesian filtering [37], Gaussian processing [38], and ConvLSTM [39] are further used to propagate past information to improve the global consistency of reconstruction results. Some following works combine volumetric convolution [40, 41] or readily available metadata [42] with MVSNet, producing globally more coherent results and outperforming the other depth-based methods.

The second type is a volume-based method, which directly generates a volumetric representation such as an occupancy field or SDF field. SurfaceNet [43] proposes to back-project color from two input views to build a color volume and predicts the occupancy probability of the volume grid by using 3D CNNs. Atlas [44] uses deep image features extracted from 2D CNNs instead of color and extends this method to multi-view images. VoRTX [45] replaces the average-based feature fusion by using a transformer architecture [46]. However, all the methods mentioned above are performed offline. To adapt to online applications, NeuralRecon [22] and TransformerFusion [23] perform the incremental feature fusion using the gated recurrent unit (GRU) and transformer, respectively. Compared to the depth-based method, the volume-based method could achieve globally coherent reconstruction but usually lacks local details. Thus, our goal is to improve the level of detail in reconstruction results. Different from the aforementioned volume-based methods, which focus on improving feature fusion, we focus on the geometric details themselves and leverage the effective geometric prior guided geometry learning mechanism to improve the quality of online reconstruction.

Neural Implicit Representation. Neural implicit representation has shown promising results in surface reconstruction in recent years due to its continuous representation and ability to learn geometric priors from large datasets. DeepSDF [47] and Occupancy Network [48], for the first time, propose to formulate an implicit function as a Multi-Layer Perceptron (MLP) with global features to predict SDF or an occupancy value for each query point. Some following works divide space into voxels [49, 50, 51, 8, 52] or multi-layer voxels [9, 53, 54] to improve the ability of reconstructing complex geometry with details. Some recent online systems, such as DI-Fusion [8] and NICE-SLAM [9], also take advantage of neural implicit representation for surface reconstruction from RGB-D sequences. TransformerFusion [23] is probably the first approach based on a neural implicit representation with an occupancy field for the online 3D monocular reconstruction task. Unlike TransformerFusion, we perform the geometry learning with the guidance of geometric priors instead, which helps to learn finer surface geometric details.

Neural Volume Rendering. NeRF [24] brings a boom of neural volume rendering in the novel view synthesis task, and its variants improve it in terms of speed [55], representation [30], sampling strategy [56], camera pose [57], generalization across scenes [58], etc. Some works [25, 29, 59] adapt this technology into neural implicit surfaces to achieve high-fidelity reconstruction from RGB images. [28] and [26] additionally use semantic or geometric cues to improve reconstruction quality in indoor scenes. NeuRIS [27] further improves the results by checking the multi-view consistency to eliminate the effects of unreliably predicted normals. These methods require hours of optimization and are time-consuming. When depth sensors are available, iMAP [10] and NICE-SLAM [9] are two representative methods that apply volume rendering to real-time SLAM systems. Inspired from these works, our method utilizes volume rendering to improve the geometric details of online scene reconstruction by considering surface normals. As for as we know, no other work is leveraging volume rendering on the online monocular 3D reconstruction task.

3 MonoNeuralFusion

Given a sequence of $N$ monocular RGB images $\{I_{n}\}_{n=1}^{N}$ with the corresponding camera poses $\{T_{n}\}_{n=1}^{N}$ , our goal is to perform incremental geometry learning for monocular 3D reconstruction with high quality and fine details.

To this end, we first formulate a neural implicit scene representation (NISR) (Sec. 3.1), which encodes 3D scene geometry as a sparse feature volume extracted from multi-view images in a coarse-to-fine fashion, and decodes it as a continuous SDF. Based on this, we further introduce a volume rendering technique (Sec. 3.2), with a hierarchical sampling that is suitable for the NISR representation. For fine-grained surface reconstruction, we develop a geometry learning mechanism, by leveraging geometric priors both in NISR pre-training and volume rendering optimization (Sec. 3.3). Finally, we provide an online system (Sec. 3.4) to process incremental surface mapping for coherent reconstruction with fine-grained geometric details. Fig. 2 shows the main components of MonoNeuralFusion, and Fig. 4 illustrates the pipeline of our online reconstruction system.

3.1 Neural Implicit Scene Representation

Most of previous online reconstruction works represent the 3D geometry as resolution-dependent SDFs [22, 44] or explicit depth maps [38, 39]. These two representations have disadvantages in the ability of detail expression and result coherence, compared with neural implicit representations, and they are not suitable for volume rendering optimization. Although some methods based on volume rendering optimization achieve impressive and fine-grained surface reconstruction [25, 26, 28, 27], they require a time-consuming optimization from scratch and is hard to extend to an online version without a good initial prediction. Besides, those encoding an entire scene to the parameters of MLP as global features is difficult to learn general geometry priors from datasets. Based on the above, to both leverage the advantages of stronger expression ability of neural implicit scene representations with volume rendering for fine-grained reconstruction, we propose to formulate a neural implicit scene representation (NISR) for a more flexible and effective geometry representation.

Our NISR encodes an entire 3D scene as a sparse feature volume, each voxel containing a continuous SDF decoded by a latent vector fused from image features of multi-view images. Note that compared with the continuous occupancy field in [23], the continuous SDF has a stronger ability for detail expression and is reasonable to supervise on the gradient domain. Besides, the sparse structure would lead to more efficency and effectivness for higher resolution. Finally, the surface mesh can be extracted using Marching Cubes [60] at an arbitrary resolution. One benefit of our NISR is the enabled efficient geometry learning based on volume rendering optimization in the space of latent vectors, which is suitable for the online 3D reconstruction task. Besides, we further pre-train our NISR with the guidance of geometric priors, leading to more effective feature latent vector extraction for fine-grained surface reconstruction.

Sparse Feature Volume Construction. We adopt the sparse feature volume data structure from NeuralRecon [22] to organize the 3D scene’s geometry content, where the entire 3D space is divided into a set of sparse voxels. Each voxel defines a neural SDF decoded by a feature latent vector trilinearly interpolated by the features from its eight corners. Specifically, we gradually construct the sparse feature volumes $F_{\theta}^{l},l\in\{c,m,f\}$ ( $c,m,f$ denotes the coarse, middle and fine levels, respectively) in a coarse-to-fine fashion, with $F_{\theta}^{l}$ being feature latent vectors fused by the 2D CNN-based image encoder from the input multi-view images (Fig. 2). To further improve the overall quality, we adopt a spatial-aware feature fusion based on a transformer module [61] instead of channel-wise average feature fusion used in NeuralRecon. The transformer-based fusion helps to build a more expressive feature latent vectors $F_{\theta}^{l}$ . These feature latent vectors $F_{\theta}^{l}$ are further merged with the previous feature latent vectors using a GRU module after sparse 3D CNN, as global feature latent vectors [22]. For brevity, we also notate the final global feature latent vectors as $F_{\theta}^{l}$ in each sparse feature volume.

Sparse Feature Volume Decoder. We decode the feature latent vector $F_{\theta}^{f}$ in the sparse feature volume in the fine level as a continuous SDF, which implicitly represents the geometry content. Specifically, for any query point $\mathbf{x}\in R^{3}$ , we use an MLP-based decoder $f_{\theta}$ to predict its signed distance $s$ as:

s=f_{\theta}(\mathbf{x},interp(F_{\theta}^{f},\mathbf{x})),

(1)

where $interp$ means the trilinear interpolation in $F_{\theta}^{f}$ . Besides, we additionally decode a radiance field $F_{\omega}$ for $F_{\theta}^{f}$ using an MLP-based radiance decoder $f_{\omega}$ :

c=f_{\omega}(\mathbf{x},interp(F_{\omega},\mathbf{x})).

(2)

3.2 Volume Rendering

In this section, we propose a novel volume rendering approach to render the geometry content represented by our NISR to any given novel views. Besides rendering the color map, we also render a normal map, which brings extra geometric regularization for the later volume rendering optimization. Specifically, given a casting ray $\{{\mathbf{p_{i}}}=\mathbf{o}+d_{i}\mathbf{v}\}$ , where $\mathbf{o}$ is the camera center and $v$ is the view direction, we render both the color map and the normal map along this casting ray by accumulating the measurements of the $N$ sampling points (see Fig. 2) as:

\displaystyle\hat{C}(\mathbf{r})=\sum_{i=1}^{N}T_{i}\alpha_{i}c_{i},\hat{N}(\mathbf{r})=\sum_{i=1}^{N}T_{i}\alpha_{i}n_{i},

(3)

where $\alpha_{i}$ is the discrete opacity value transformed from the SDF value $s_{i}$ decoded by $f_{\theta}$ following NeuS [25], and $T_{i}=\prod_{j=1}^{i-1}(1-\alpha_{i})$ is the accumulated transmittance. $c_{i}$ is the color value decoded by $f_{\omega}$ and $n_{i}$ is the normal estimation that can be calculated by the automatic derivation of the SDF decoder $f_{\theta}$ .

Since NISR is a sparse feature volume representation, and is different from the ordinary MLP-based representation like NeRF [24] and NeuS [25], we propose a hierarchical sampling strategy (see Fig. 3(c)) for a better point sampling inside the sparse voxels with the following two steps: (1) ray-voxel intersection sampling; and (2) inside-voxel sampling. Specifically, we first perform a ray-voxel intersection sampling to select sparse voxels that intersect with each casting ray. Then within each intersected voxel, we perform hierarchical inside-voxel sampling to sample specific points by converting the distance of a sampling point from the camera origin in the world space to the sparse space. Note that both uniformly coarse sampling and importance sampling on the top of the coarse probability estimation are processed in the sparse space. Given ray-voxel intersection pairs $\mathcal{P}=\{(r,v)\}$ , with their distances of ray-voxel out position in the world space $d_{w}^{(r,v)}$ and in the sparse space $d_{s}^{(r,v)}$ from the camera origin, we perform distance conversion of any sampling point as follows:

\displaystyle d_{s}=d_{w}-(d_{w}^{(r,v)}-d_{s}^{(r,v)}),

(4)

where $v$ is the corresponding intersected voxel id of ray $r$ for position sampling. In the end, we get a total of $N_{c}$ coarse sampling points and $N_{i}$ importance sampling points for each ray rendering.

Applying the hierarchical sampling strategy proposed in NeuS without any constraint would result in invalid sampling outside voxels, as illustrated in Fig. 3(a). NSVF, as using a similar scene representation with sparse voxels, yields a self-pruning from dense voxels and samples points guided by sparse voxels which are near the surface as optimization proceeds, thus eliminating the need of hierarchical sampling. However, sparse voxels would not guide point sampling very well in our method due to its more thickness near the underlying surface, and directly applying importance sampling would also lead to invalid sampling outside sparse voxels (Fig. 3(b)). Instead, our hierarchical sampling strategy can achieve more accurate point sampling, i.e., all samplings locate inside voxels. The second row of Fig. 3(a-c) illustrates the rendering results of these three sampling strategies. We can see that some rendered points (in blue boxes) in (a) and (b) are in the air due to the sampled points out of the sparse voxel space, while the rendered points of our method are on the surface (c).

3.3 Geometric Prior Guided Geometry Learning

In this subsection, we propose a geometric prior guided geometry learning mechanism, by taking full advantage of the geometric priors both in NISR pre-training and volume rendering optimization for high-fidelity reconstruction. In the NISR pre-training, we leverage effective regularization from geometric priors, such as normal prior and eikonal prior, to learn the encoder-decoder parameters of our NISR for a more detailed surface representation. In the volume rendering optimization, since it does not provide ground-truth surface normals sampled from a ground-truth mesh as in the pre-training stage, we leverage the geometric cues from the extra normal map prediction from RGB images to enhance the geometric details of reconstruction.

NISR Pre-training. For better feature latent vector extraction, we propose to pre-train our NISR beforehand with the geometric priors. Here we formulate the geometric prior as the surface normal loss with eikonal regularization as:

\displaystyle\mathcal{L}_{n}=(1-\langle\nabla f_{\theta}(\mathbf{x}),\hat{n}\rangle)+\lambda_{e}||\nabla f_{\theta}(\mathbf{x})^{2}-1||^{2},

(5)

where $f_{\theta}$ is the MLP-based decoder introduced in Equation 1. We sample points on the surface of a ground-truth mesh and compute their normals $\hat{n}$ for supervision. Finally, we train our NISR similar to NeuralRecon[22] but with a geometric prior guided loss function:

\displaystyle\mathcal{L}=\mathcal{L}_{o}+\lambda_{s}\mathcal{L}_{s}+\lambda_{n}\mathcal{L}_{n}+\lambda_{f}\mathcal{L}_{f},

(6)

where $\mathcal{L}_{o}=\sum_{l=1}^{L}\lambda_{o}^{l}\rm{BCE}(o^{l},\hat{o}^{l})$ denotes the binary cross-entropy (BCE) loss, $\mathcal{L}_{s}=|clamp(s,\tau)-clamp(\hat{s},\tau)|$ represents the clipped L1 loss with $clamp(x,\tau)=\min(\tau,\max(-\tau,x))$ , and $\mathcal{L}_{f}=\frac{1}{M}\sum_{i=1}^{M}||f_{i}||_{2}^{2}$ is used to regularize scene feature vectors $f_{i}\in F_{\theta}^{l}$ .

Volume Rendering Optimization. Based on the novel volume rendering from our NISR, we propose to optimize the latent vector of the sparse feature volume to perform accurate geometry learning for high-fidelity surface reconstruction. Specifically, we perform the optimization with the following loss function:

\displaystyle\mathcal{L}=\mathcal{L}_{rgb}+\lambda\mathcal{L}_{normal},

(7)

where $\mathcal{L}_{rgb}=\frac{1}{M}\sum|\hat{C}(\mathbf{r})-C(\mathbf{r})|$ is the color error between the rendered color $\hat{C}(\mathbf{r})$ and predicted color $C(\mathbf{r})$ commonly used in other reconstruction works [25, 29], and $\mathcal{L}_{normal}=\frac{1}{M}\sum\{|\hat{N}(\mathbf{r})-N(\mathbf{r})|+(1-\langle\hat{N}(\mathbf{r}),N(\mathbf{r})\rangle)\}$ describes the L1 distance and the cosine angle between the rendered normal $\hat{N}(\mathbf{r})$ and predicted normal $N(\mathbf{r})$ . $M$ is the number of the rendered pixel in a mini-batch. We use a pretrained out-of-the-box Omnidata model [62] to predict the normal map for each image.

3.4 Online Reconstruction

In this subsection, we combine all the components introduced above to build an online system to reconstruct the scene geometry. We build our system similar to a common SLAM system [2], except for the tracking component. In the implementation, We run a mapping thread and an optimization thread in parallel at the back-end, as illustrated in Fig. 4.

Mapping Thread. The mapping thread receives the incoming frames from RGB sequences and selects key-frames in sliding windows for efficiency as in NeuralRecon. Specifically, when the distance between a current frame and the last key-frame is larger than $\tau_{d}=0.1$ or the rotation angle from the last key-frame is larger than $\tau_{a}=15^{\circ}$ , a key-frame will be selected. When the size of sliding windows reaches $K=9$ , all images and their corresponding camera intrinsic and extrinsic parameters will be used to construct the sparse feature volume. When the feature volume is updated, Marching Cubes is applied to extract the underlying surface mesh. For every key-frame, it will be applied by a normal prediction and inserted into a database of key-frames, preparing for the subsequent volume rendering optimization.

Optimization Thread. The optimization thread runs loops. It selects $B$ optimization key-frames in the key-frame database each time and randomly samples $M$ pixels on each of them to build the optimization function as in Equation 7. We use Adam optimizer to refine the scene representation parameters for $T$ iterations in a loop.

Per-scene Fine-tuning. To further enhance the results for a specific scene, we apply an additional fine-tuning for the whole scene. The same as the optimization thread, we optimize the scene with sampling $M$ pixels on each of $B$ images for $T$ iterations. Since the reconstruction from the online system is already good, a few minutes (less than 9 minutes) fine-tuning is typically enough for getting high-fidelity results.

TABLE I: Quantitative comparisons of surface reconstruction on the ScanNet dataset. The top block includes the compared offline methods while the bottom block includes the compared online methods. The best results are in boldface and the second best are underlined. Normal consistency of TransformerFusion is missing due to the lack of its publicly released code or mesh results.

	Acc $\downarrow$	Comp $\downarrow$	Chamfer $\downarrow$	Precision $\uparrow$	Recall $\uparrow$	F-score $\uparrow$	Normal Consistency $\uparrow$
FastMVSNet	0.052	0.103	0.077	0.652	0.538	0.588	0.701
PointMVSNet	0.048	0.115	0.082	0.677	0.536	0.595	0.695
Atlas	0.072	0.078	0.075	0.675	0.609	0.638	0.819
GPMVS	0.058	0.078	0.068	0.621	0.543	0.578	0.715
DeepVideoMVS	0.066	0.082	0.074	0.590	0.535	0.560	0.765
TransformerFusion	0.055	0.083	0.069	0.728	0.600	0.655	-
NeuralRecon	0.038	0.123	0.080	0.769	0.506	0.608	0.816
Ours	0.039	0.094	0.067	0.775	0.604	0.677	0.842

TABLE II: Quantitative comparisons of depth estimation on the ScanNet dataset. The top block includes the compared offline methods while the bottom block includes the compared online methods. The best results are in boldface and the second best are underlined.

	Abs Rel $\downarrow$	Abs Diff $\downarrow$	Sq Rel $\downarrow$	RMSE $\downarrow$	$\delta<1.25\uparrow$	Comp $\uparrow$
FastMVSNet	0.064	0.112	0.023	0.188	0.954	0.786
PointMVSNet	0.057	0.098	0.017	0.159	0.965	0.683
Atlas	0.064	0.120	0.043	0.244	0.925	0.979
GPMVS	0.063	0.124	0.022	0.202	0.957	1.000
DeepVideoMVS	0.061	0.128	0.022	0.204	0.962	1.000
NeuralRecon	0.066	0.099	0.038	0.197	0.932	0.891
Ours	0.048	0.079	0.024	0.164	0.951	0.921

4 Experiments

In this section, we first give the implementation details of our method and then demonstrate the effectiveness of our method on public datasets, by comparing our approach with the other surface reconstruction methods both qualitatively and quantitatively.

4.1 Implementation Details

For the NISR, we use MnasNet [63] as a 2D CNN image encoder to extract multi-level image features and we perform the 3D sparse CNN proposed by [64]. For NISR training, we empirically set the weights of each component in the loss as $\lambda_{o}^{l}=0.8^{l-1},\lambda_{s}=1.0,\lambda_{n}=1.0,\lambda_{e}=0.25,\lambda_{f}=0.001$ , and set the clamp threshold $\tau=0.2$ . We train our model using Adam optimizer with an initial learning rate of 0.001. For the volume rendering optimization, we empirically set the weights of normal to be 0.05. We select $B=9$ images and sample $M=512$ pixels on each of them as a mini-batch, and sample $N_{c}=16$ points for coarse sampling and $N_{i}=16$ points for importance sampling on each ray and $T=100$ for one loop in our online system. The radiance field $f_{\omega}$ is initialized randomly from the zero-mean Guassian distribution. Moreover, we set $N_{c}=N_{i}=32$ and optimize the scene in $T=5,000$ iterations for further per-scene fine-tuning.

4.2 Datasets, Metrics, and Baselines

Datasets. We select the training subset of ScanNet [31] as supervision to train our NISR. ScanNet [31] is a popular real-scan RGB-D dataset, containing 2.5 million views with ground-truth 3D camera poses and surface reconstruction in more than 1,500 scans. For evaluation, we first evaluate our approach on the testing subset (100 scenes) of ScanNet, but using the monocular RGB sequences only. Additionally, to evaluate the generalization ability of our approach, we also perform evaluation on other datasets including TUM RGB-D dataset [32] (with 10 monocular RGB sequences) and Replica [33] (with the same 8 monocular sequences by [10]), using the pre-trained NISR from ScanNet without further fine-tuning on these two datasets. Since the TUM RGB-D dataset does not provide ground-truth 3D surface meshes, we apply TSDF fusion [65] to generate 3D surface meshes at a resolution of 2cm for the subsequent evaluation.

Methods for Comparison. We compare our method with state-of-the-art online monocular reconstruction approaches, including GPMVS [38], DeepVideoMVS [39], NeuralRecon [22], and TransformerFusion [23], which are most relevant to ours. Additionally, to evaluate the final reconstruction quality of our approach, we also compare our approach with some previous offline monocular reconstruction approaches, such as Atlas [44], Point-MVSNet [66], and FastMVSNet [67]. During the comparison, since Atlas, NeuralRecon, TransformerFusion and our approach directly generate a surface mesh as output while others only perform multi-view depth estimation, we fuse the multi-view depth maps into the final surface mesh using TSDF fusion [65], thus enabling surface quality comparison for all the compared approaches. Besides, we fine-tune those pre-trained models which are not trained on the ScanNet dataset for a fair comparison.

Metrics. To evaluate the surface reconstruction quality, we adopt several popular 3D surface quality metrics, including accuracy, completion, chamfer distance, F-score (with both precision and recall) [44], and normal consistency [48]. For a comprehensive comparison, we additionally measure the multi-view depth estimation quality, using the widely used depth map accuracy metrics such as Abs Rel, Abs Diff, Seq Rel, RMSE, $\delta$ and Comp [68]. For approaches that do not directly generate multi-view depth estimation like ours, we choose to render the depth maps based on the final surface mesh using pyrender¹¹1https://github.com/mmatl/pyrender.

Since TransformerFusion only provides its evaluation script without releasing the test code or resulting reconstruction meshes, for a fair comparison we conduct comparison using its evaluation script for 3D metrics on the ScanNet dataset. For the other datasets and 2D metrics evaluation, we use the evaluation script from NeuralRecon, which is different from TransformerFusion’s in mesh point sampling. More details can be found in the supplementary materials.

4.3 Quantitative and Qualitative Evaluation

Evaluation on ScanNet. The quantitative comparison results for surface reconstruction and depth estimation on ScanNet Dataset are shown in Tables I and II, respectively. Compared to the approaches that perform multi-view depth estimation like FastMVSNet, PointMVSNet, GPMVS, and DeepVideoMVS, those surface reconstruction approaches that perform volumetric fusion including Atlas, NeuralRecon, TransformerFusion and ours can achieve globally more coherent reconstruction quality with a consistently large improvement in F-score as shown in Table I. It demonstrates that our method achieves the lowest Chamfer Distance and the highest F-score and Normal Consistency accuracy values among all the online and offline methods. Besides, almost the same Acc accuracy as NerualRecon (only a very slight increase about 0.001), which is much better the other online surface reconstruction approaches. As an offline method, Atlas uses more views together with dense voxel grids and naturally gets higher recall, while our method outperforms NeuralRecon and TransformerFusion and achieves comparable results on recall to Atlas. Furthermore, benefiting from our geometric prior guidance in both training and optimization, our method outperforms all the others with a significant improvement in terms of normal consistency, which measures the ability of capturing higher-order information. The normal consistency of TransformerFusion is missing due to the lack of its publicly released code or reconstruction mesh results. Since TransformerFusion releases an example mesh of one scene on its Github repository²²2https://github.com/AljazBozic/TransformerFusion, we provide the qualitative and quantitative comparison with it on this scene in the supplementary materials and the comparisons show that our method still significantly outperforms TransformerFusion on normal consistency with finer details. For depth estimation accuracy, our method also significantly outperforms the other methods in terms of Abs Rel and Abs Diff, and achieves comparable results in terms of Sq Rel and RMSE, as shown in Table II.

Fig. 5 shows visual comparison results of all the compared online methods and two representative offline methods, including Atlas for a volume-based method and PointMVS for a depth-based method. It demonstrates that the final mesh results of the surface reconstruction approaches are consistently more coherent than multi-view depth estimation methods even in offline configurations. Although NeuralRecon improves details compared with Atlas, it is still far away from fine-grained surface reconstruction. Benefiting from the proposed geometric prior guided NISR with volume rendering, our method not only successfully recovers thin parts or small objects, but also achieves sharper geometry (e.g., sofa in the fourth row), significantly improving the detail of reconstruction.

4.4 Generalization on Other Datasets

Evaluation on TUM RGB-D. We also conduct an evaluation on the TUM RGB-D dataset to show the generalization ability of our model. Table III shows the results on some major metrics compared with the other online methods. It can be observed that our method achieves the best performance, in all the surface reconstruction accuracy metrics and depth estimation accuracy metrics except F-score. The main reason of our method achieving a less F-score than GPMVS and DVMVS is the drop on recall (with ours (0.323), GPMVS (0.458), and DVMVS (0.507)). However, our method outperforms them on precision (with ours (0.464), GPMVS (0.343), and DVMVS (0.417)) and also outperforms NeuralRecon both on precison (0.464) and recall (0.258). For qualitative results, we can see that our method still achieves the high-fidelity results with the richest details (e.g., small objects on the table, chairs and monitors) among the compared methods, as shown in Fig. 6. Note that since the multi-view depth maps estimation would not be always robust for globally coherent surface reconstruction, we only show the visual effect of DeepVideoMVS as the best representative. This evaluation demonstrates that our geometric prior guided geometry learning for surface reconstruction can be generalized well into a new dataset, despite only pre-trained on the ScanNet dataset.

TABLE III: Evaluation on 10 selected sequences of the TUM RGB-D dataset. We conduct a comparison with online methods in major metrics for both 3D (Chamfer distance, F-score, and Normal Consistency) and 2D (Abs Rel, Abs Diff, and first inlier ratio metric).

	Chamfer $\downarrow$	F-score $\uparrow$	N.C. $\uparrow$	Abs Rel $\downarrow$	Abs Diff $\downarrow$	$\delta<1.25\uparrow$
GPMVS	0.201	0.387	0.649	0.079	0.202	0.915
DVMVS	0.152	0.452	0.682	0.078	0.222	0.918
NeuralRecon	0.134	0.325	0.788	0.090	0.140	0.928
Ours	0.107	0.375	0.806	0.062	0.108	0.959

Evaluation on Replica. Similarly, we also evaluate our results qualitatively in comparison with NeuralRecon on the Replica dataset. Fig. 7 shows some comparisons of visual effects. We can see that our method recovers more complete surfaces as well as fine-grained details, such as the pillow on the sofa. Especially, our method successfully reconstructs TV cabinet and garbage can with high-quality while NeuralRecon fails.

TABLE IV: Ablation study on the ScanNet dataset. We conduct several experiments of different settings as below. ✓ and ✗ denote using or not using a certain component.

\mathcal{L}_{n}

denotes surface normal, “fusion” denotes multi-view feature fusion, “opt” denotes volume rendering optimization, and “v.s.” denotes voxel size (measured in centimeter) of the last level of feature volume. Especially, in the column of “fusion”, “avg” means average and “transf” means transformer. In the column of “opt”, “online” means online volume rendering optimization and “ft” means we perform a per-scene fine-tuning afterwards.

Exp.	$\mathcal{L}_{n}$	fusion	opt.	v.s.	Acc. $\downarrow$	Comp. $\downarrow$	Prec. $\uparrow$	Recall $\uparrow$	F-score $\uparrow$	N.C. $\uparrow$	Abs Rel $\downarrow$	Abs Diff $\downarrow$	Sq Rel $\downarrow$	RMSE $\downarrow$	$\delta<1.25\uparrow$
a	✓	avg	✗	4	0.040	0.102	0.765	0.571	0.652	0.834	0.055	0.085	0.028	0.171	0.945
b	✗	transf	✗	4	0.040	0.104	0.752	0.562	0.642	0.814	0.059	0.091	0.033	0.185	0.941
c	✓	transf	✗	4	0.039	0.095	0.775	0.599	0.674	0.841	0.051	0.082	0.026	0.170	0.949
d	✓	transf	✗	8	0.049	0.096	0.702	0.544	0.612	0.826	0.067	0.102	0.037	0.202	0.931
e	✓	transf	✗	16	0.052	0.124	0.680	0.480	0.560	0.792	0.095	0.135	0.064	0.250	0.896
f	✓	transf	online	4	0.039	0.094	0.775	0.604	0.677	0.842	0.048	0.079	0.024	0.164	0.951
g	✓	transf	ft	4	0.039	0.093	0.777	0.609	0.681	0.842	0.045	0.076	0.022	0.158	0.955

4.5 Time Analysis

We evaluate the runtime of our system on a platform with an Nvidia RTX 3090 GPU and Intel Xeon(R) Gold 5218R CPU. Table V provides the detailed timing of each main component among all the test scenes in ScanNet. Our system performs normal prediction for every key-frame in 61.48ms and performs image encoding and feature volume construction for every local fragment (with 9 key-frames) in 34.35ms and 257.13ms, respectively. For volume rendering optimization, the timing of each iteration (including rendering and backward propagation) is 79.29ms. A final mesh is extracted in 295.79ms when our NISR is updated. Since the mapping thread and optimization thread run in parallel, the mapping thread can run at up to 2.83 fragments (about 25.5 key-frames) per second.

We additionally evaluate the influence of results at different interactive frame rates of input stream and plot the curve of quality results on the Chamfer distance, F-score, and normal consistency in Fig. 8. We can observe that the more time for optimization with a lower frame rate, the better results it can achieve. Furthermore, it also demonstrates that our method outperforms the NeuralRecon (with the Chamfer distance 0.080, F-score 0.608, and normal consistency 0.816) and TransformerFusion (with the Chamfer distance 0.069 and F-score 0.655), regardless of the frame rates.

TABLE V: Timing of each main component. “kf” is the abbreviation for key-frame, “frag” is the abbreviation for local fragment, and “iter” is the abbreviation for iterations.

Task	Normal prediction	Image encoding	Feature volume construction	Optimization	Mesh extraction
Timing (ms)	61.84/kf	34.35/frag	257.13/frag	79.29/iter	295.79

4.6 Ablative Analysis

To demonstrate the effectiveness of each main component of our full system, we conduct an ablation study experiment. Table IV lists the quantitative results under different settings. (a)-(g) in the following paragraph denotes the experiment of different settings as illustrated in Table IV with their quantitative results. Fig. 9 and 10 show their qualitative results corresponding to (a)-(g) in Table IV.

Surface Normal Loss. We conduct an experiment of not using surface normal (b) and using surface normal (c) to show the effect of the geometric prior in the training stage. Table IV shows when removing the surface normal loss, F-score and normal consistency (N.C.) of (b) show a significant decrease. Fig. 9 shows the visual effect without (b) or with (c) this geometric prior. We can observe that the reconstruction surface becomes rough and loses geometric details when removing it. In conclusion, the surface normal as the geometric prior could improve details and regularize the NISR for smoother surfaces, playing an important role in the training stage for high-fidelity reconstruction results.

Multi-view Feature Fusion. We evaluate the effect of different types of multi-view feature fusion with channel-wise average (a) or average from the output of transformer blocks (c) . Although we apply the transformer blocks in local fragments instead of global regions used in [45], Table IV(c) shows its effectiveness in feature fusion for improving the final result. From Fig. 9, we can see that the improvement of multi-view feature fusion cannot directly influence the qualitative results of surface reconstruction, where the reconstructed surface (see Fig. 9(b)) is still rough and lacks details. Thus, surface normal is a more relevant factor for fine-grained quality surface reconstruction, while the transformer module is more about improving sparse voxel prediction, leading to a higher recall.

Voxel Size. Although we can obtain a continuous SDF from our NISR and the final mesh can be extracted at an arbitrary resolution, the voxel size of the sparse feature volume also influences the quality of reconstruction. (c), (d), (e) in Table IV and Fig. 9 show the influence of different voxel sizes for the fine level (using 4cm, 8cm and 16cm respectively). It demonstrates that a smaller voxel size will lead to higher quality and more details of reconstruction. Theoretically, further a smaller voxel size can obtain even better results, but it also causes more timing and memory consumption. In our experiment, we find that setting voxel size as 4cm is fine for surface reconstruction in high quality with enough geometric details.

Volume Rendering Optimization. Volume rendering optimization provide a refinement for the sparse feature volume of scene to improve the quality of reconstruction. Compared with (c), using online optimization (f) would improve the recall (e.g. filling some missing regions of a computer display as shown in Fig. 10) and thereby increase the F-score value in Table IV. Besides, geometry cues from normal map prediction could add more details with a slight improvement of normal consistency and enhance the visual effects (e.g., adding sofa crevice in Fig. 10). Furthermore, we also provide a per-scene fine-tuning (g) with a whole scene optimization in only a few minutes (less than 9 minutes). This would further slightly improve the surface reconstruction results both quantitatively and qualitatively.

Color Representation. We evaluate the effect of color representation in the optimization step. Although the radiance field cannot be optimized well in such a short time, the color representation could provide some details which the normal map does not have, further improving the quality of surface reconstruction. Fig. 11 shows that the reconstruction result has more details when using the color representation.

4.7 Limitations and Discussion

Although we formulate the NISR as the sparse feature volume for a more flexible and effective geometry representation, this representation is heavily dependent on the coverage of predicted occupied voxels. Thus, our approach is limited in its ability to complete missing regions if some voxels cannot be predicted successfully. This missing region is hard to complete during online optimization, leading to a lower recall than the offline methods (see Fig. 12(a)). Second, the quality of normal prediction influences the online optimization and some inconsistent normal prediction would worsen the reconstruction results (see Fig. 12(b)). It might be improved by using multi-view normal consistency check to reduce the impact of inconsistent normals, or even treating normal vectors as optimization parameters to participate in the optimization. Third, although our method could achieve finer details of surface reconstruction, there is still a gap between the reconstructed models and depth scans data for complex thin parts (see Fig. 12(d)) and it is still challenging to reconstruct mirrors (Fig. 12(c)). Lastly, there is still room to improve the timing of our online optimization. The more iterations it optimizes, the better results it could achieve. Therefore, techniques to speed up the rendering optimization [54] can be further adopted to improve the results.

5 Conclusion

In this paper, we introduced MonoNeuralFusion for online 3D scene reconstruction from monocular videos. We formulate a geometry prior guided neural implicit scene representation (NISR) with volume rendering to achieving better fine-grained surface reconstruction. We pre-train our NISR with the guidance of geometry priors, leading to more effective feature latent vector extraction for fine-grained surface reconstruction. To efficiently and effectively render color and normal maps in the sparse feature volume, we propose a hierarchical sampling strategy, which ensures sampling inside sparse voxels. Base on these aforementioned, we run our online system to incrementally build surface reconstruction, meanwhile, performing the online volume rendering optimization to leverage the geometry cues from normal prediction to enhance geometric details of reconstruction. We demonstrate that our approach can achieve state-of-the-art quality of indoor scene reconstruction with fine geometric details on different datasets, even using pre-trained weights from the Scannet dataset without further fine-tuning.

References

[1] T. Qin, P. Li, and S. Shen, “Vins-mono: A robust and versatile monocular visual-inertial state estimator,” IEEE Transactions on Robotics, vol. 34, no. 4, pp. 1004–1020, 2018.
[2] C. Campos, R. Elvira, J. J. G. Rodríguez, J. M. M. Montiel, and J. D. Tardós, “ORB-SLAM3: an accurate open-source library for visual, visual-inertial, and multimap SLAM,” IEEE Transactions on Robotics, vol. 37, no. 6, pp. 1874–1890, 2021.
[3] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. W. Fitzgibbon, “Kinectfusion: Real-time dense surface mapping and tracking,” in IEEE ISMAR, 2011, pp. 127–136.
[4] A. Dai, M. Nießner, M. Zollhöfer, S. Izadi, and C. Theobalt, “Bundlefusion: Real-time globally consistent 3d reconstruction using on-the-fly surface reintegration,” ACM Transactions on Graphics, vol. 36, no. 3, pp. 24:1–24:18, 2017.
[5] Y.-P. Cao, L. Kobbelt, and S.-M. Hu, “Real-time high-accuracy three-dimensional reconstruction with consumer RGB-D cameras,” ACM Transactions on Graphics, vol. 37, no. 5, pp. 171:1–171:16, 2018.
[6] S.-S. Huang, H. Chen, J. Huang, H. Fu, and S.-M. Hu, “Real-time globally consistent 3d reconstruction with semantic priors,” IEEE Transactions on Visualization and Computer Graphics, 2021. [Online]. Available: https://doi.org/10.1109/TVCG.2021.3137912
[7] O. Kähler, V. A. Prisacariu, C. Y. Ren, X. Sun, P. H. S. Torr, and D. W. Murray, “Very high frame rate volumetric integration of depth images on mobile devices,” IEEE Transactions on Visualization and Computer Graphics, vol. 21, no. 11, pp. 1241–1250, 2015.
[8] J. Huang, S.-S. Huang, H. Song, and S.-M. Hu, “Di-fusion: Online implicit 3d reconstruction with deep priors,” in IEEE CVPR, 2021, pp. 8932–8941.
[9] Z. Zhu, S. Peng, V. Larsson, W. Xu, H. Bao, Z. Cui, M. R. Oswald, and M. Pollefeys, “Nice-slam: Neural implicit scalable encoding for slam,” in IEEE CVPR, 2022, pp. 12 786–12 796.
[10] E. Sucar, S. Liu, J. Ortiz, and A. J. Davison, “imap: Implicit mapping and positioning in real-time,” in IEEE ICCV, 2021, pp. 6209–6218.
[11] J. P. C. Valentin, A. Kowdle, J. T. Barron, N. Wadhwa, M. Dzitsiuk, M. Schoenberg, V. Verma, A. Csaszar, E. Turner, I. Dryanovski, J. Afonso, J. Pascoal, K. Tsotsos, M. Leung, M. Schmidt, O. G. Guleryuz, S. Khamis, V. Tankovich, S. R. Fanello, S. Izadi, and C. Rhemann, “Depth from motion for smartphone AR,” ACM Transactions on Graphics, vol. 37, no. 6, pp. 193:1–193:19, 2018.
[12] J. L. Schönberger, E. Zheng, J. Frahm, and M. Pollefeys, “Pixelwise view selection for unstructured multi-view stereo,” in ECCV, 2016, pp. 501–518.
[13] J. L. Schönberger and J. Frahm, “Structure-from-motion revisited,” in IEEE CVPR, 2016, pp. 4104–4113.
[14] S. Agarwal, Y. Furukawa, N. Snavely, I. Simon, B. Curless, S. M. Seitz, and R. Szeliski, “Building rome in a day,” Communications of the ACM, vol. 54, no. 10, pp. 105–112, 2011.
[15] N. Snavely, S. M. Seitz, and R. Szeliski, “Photo tourism: exploring photo collections in 3d,” in ACM SIGGRAPH, 2006, pp. 835–846.
[16] K. Tateno, F. Tombari, I. Laina, and N. Navab, “Cnn-slam: Real-time dense monocular slam with learned depth prediction,” in IEEE CVPR, 2017, pp. 6243–6252.
[17] S. Zhi, M. Bloesch, S. Leutenegger, and A. J. Davison, “Scenecode: Monocular dense semantic reconstruction using learned encoded scene representations,” in IEEE CVPR, 2019, pp. 11 776–11 785.
[18] X. Yang, L. Zhou, H. Jiang, Z. Tang, Y. Wang, H. Bao, and G. Zhang, “Mobile3drecon: real-time monocular 3d reconstruction on a mobile phone,” IEEE Transactions on Visualization and Computer Graphics, vol. 26, no. 12, pp. 3446–3456, 2020.
[19] A. Gordon, H. Li, R. Jonschkowski, and A. Angelova, “Depth from videos in the wild: Unsupervised monocular depth learning from unknown cameras,” in IEEE ICCV, 2019, pp. 8977–8986.
[20] C. Godard, O. M. Aodha, M. Firman, and G. J. Brostow, “Digging into self-supervised monocular depth estimation,” in IEEE ICCV, 2019, pp. 3827–3837.
[21] Y. Li, F. Luo, and C. Xiao, “Self-supervised coarse-to-fine monocular depth estimation using a lightweight attention module,” Computational Visual Media, vol. 8, no. 4, pp. 631–647, 2022.
[22] J. Sun, Y. Xie, L. Chen, X. Zhou, and H. Bao, “Neuralrecon: Real-time coherent 3d reconstruction from monocular video,” in IEEE CVPR, 2021, pp. 15 598–15 607.
[23] A. Bozic, P. R. Palafox, J. Thies, A. Dai, and M. Nießner, “Transformerfusion: Monocular RGB scene reconstruction using transformers,” in NeurIPS, 2021, pp. 1403–1414.
[24] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “Nerf: Representing scenes as neural radiance fields for view synthesis,” in ECCV, 2020, pp. 405–421.
[25] P. Wang, L. Liu, Y. Liu, C. Theobalt, T. Komura, and W. Wang, “Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction,” in NeurIPS, 2021, pp. 27 171–27 183.
[26] Z. Yu, S. Peng, M. Niemeyer, T. Sattler, and A. Geiger, “Monosdf: Exploring monocular geometric cues for neural implicit surface reconstruction,” arXiv:2022.00665, 2022.
[27] J. Wang, P. Wang, X. Long, C. Theobalt, T. Komura, L. Liu, and W. Wang, “Neuris: Neural reconstruction of indoor scenes using normal priors,” in ECCV, 2022.
[28] H. Guo, S. Peng, H. Lin, Q. Wang, G. Zhang, H. Bao, and X. Zhou, “Neural 3d scene reconstruction with the manhattan-world assumption,” in IEEE CVPR, 2022, pp. 5511–5520.
[29] L. Yariv, J. Gu, Y. Kasten, and Y. Lipman, “Volume rendering of neural implicit surfaces,” in NeurIPS, 2021, pp. 4805–4815.
[30] L. Liu, J. Gu, K. Z. Lin, T. Chua, and C. Theobalt, “Neural sparse voxel fields,” in NeurIPS, 2020, pp. 15 651–15 663.
[31] A. Dai, A. X. Chang, M. Savva, M. Halber, T. A. Funkhouser, and M. Nießner, “Scannet: Richly-annotated 3d reconstructions of indoor scenes,” in IEEE CVPR, 2017, pp. 2432–2443.
[32] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, “A benchmark for the evaluation of RGB-D SLAM systems,” in IEEE/RSJ IROS, 2012, pp. 573–580.
[33] J. Straub, T. Whelan, L. Ma, Y. Chen, E. Wijmans, S. Green, J. J. Engel, R. Mur-Artal, C. Ren, S. Verma, A. Clarkson, M. Yan, B. Budge, Y. Yan, X. Pan, J. Yon, Y. Zou, K. Leon, N. Carter, J. Briales, T. Gillingham, E. Mueggler, L. Pesqueira, M. Savva, D. Batra, H. M. Strasdat, R. D. Nardi, M. Goesele, S. Lovegrove, and R. Newcombe, “The Replica dataset: A digital replica of indoor spaces,” arXiv preprint arXiv:1906.05797, 2019.
[34] M. Bloesch, J. Czarnowski, R. Clark, S. Leutenegger, and A. J. Davison, “Codeslam - learning a compact, optimisable representation for dense visual SLAM,” in IEEE CVPR, 2018, pp. 2560–2568.
[35] H. Matsuki, R. Scona, J. Czarnowski, and A. J. Davison, “Codemapping: Real-time dense mapping for sparse slam using compact scene representations,” IEEE Robotics and Automation Letters, vol. 6, no. 4, pp. 7105–7112, 2021.
[36] K. Wang and S. Shen, “Mvdepthnet: Real-time multiview depth estimation neural network,” in 3DV, 2018, pp. 248–257.
[37] C. Liu, J. Gu, K. Kim, S. G. Narasimhan, and J. Kautz, “Neural rgb(r)d sensing: Depth and uncertainty from a video camera,” in IEEE CVPR, 2019, pp. 10 986–10 995.
[38] Y. Hou, J. Kannala, and A. Solin, “Multi-view stereo by temporal nonparametric fusion,” in IEEE ICCV, 2019, pp. 2651–2660.
[39] A. Düzçeker, S. Galliani, C. Vogel, P. Speciale, M. Dusmanu, and M. Pollefeys, “Deepvideomvs: Multi-view stereo on video with recurrent spatio-temporal fusion,” in IEEE CVPR, 2021, pp. 15 324–15 333.
[40] A. Rich, N. Stier, P. Sen, and T. Höllerer, “3dvnet: Multi-view depth prediction and volumetric refinement,” in 3DV, 2021, pp. 700–709.
[41] J. Choe, S. Im, F. Rameau, M. Kang, and I. S. Kweon, “Volumefusion: Deep depth fusion for 3d scene reconstruction,” in IEEE CVPR, 2021, pp. 16 086–16 095.
[42] M. Sayed, J. Gibson, J. Watson, V. Prisacariu, M. Firman, and C. Godard, “Simplerecon: 3d reconstruction without 3d convolutions,” in ECCV, 2022.
[43] M. Ji, J. Gall, H. Zheng, Y. Liu, and L. Fang, “Surfacenet: An end-to-end 3d neural network for multiview stereopsis,” in IEEE ICCV, 2017, pp. 2326–2334.
[44] Z. Murez, T. van As, J. Bartolozzi, A. Sinha, V. Badrinarayanan, and A. Rabinovich, “Atlas: End-to-end 3d scene reconstruction from posed images,” in ECCV, 2020, pp. 414–431.
[45] N. Stier, A. Rich, P. Sen, and T. Höllerer, “Vortx: Volumetric 3d reconstruction with transformers for voxelwise view selection and fusion,” in 3DV, 2021, pp. 320–330.
[46] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in NeurIPS, 2017, pp. 5998–6008.
[47] J. J. Park, P. Florence, J. Straub, R. A. Newcombe, and S. Lovegrove, “Deepsdf: Learning continuous signed distance functions for shape representation,” in IEEE CVPR, 2019, pp. 165–174.
[48] L. M. Mescheder, M. Oechsle, M. Niemeyer, S. Nowozin, and A. Geiger, “Occupancy networks: Learning 3d reconstruction in function space,” in IEEE CVPR, 2019, pp. 4460–4470.
[49] C. M. Jiang, A. Sud, A. Makadia, J. Huang, M. Nießner, and T. A. Funkhouser, “Local implicit grid representations for 3d scenes,” in IEEE CVPR, 2020, pp. 6000–6009.
[50] R. Chabra, J. E. Lenssen, E. Ilg, T. Schmidt, J. Straub, S. Lovegrove, and R. A. Newcombe, “Deep local shapes: Learning local SDF priors for detailed 3d reconstruction,” in ECCV, 2020, pp. 608–625.
[51] S. Peng, M. Niemeyer, L. M. Mescheder, M. Pollefeys, and A. Geiger, “Convolutional occupancy networks,” in ECCV, 2020, pp. 523–540.
[52] H. Chen, J. Huang, T.-J. Mu, and S.-M. Hu, “Circle: Convolutional implicit reconstruction and completion for large-scale indoor scene,” in ECCV, 2022.
[53] J. Chibane, T. Alldieck, and G. Pons-Moll, “Implicit functions in feature space for 3d shape reconstruction and completion,” in IEEE CVPR, 2020, pp. 6968–6979.
[54] T. Müller, A. Evans, C. Schied, and A. Keller, “Instant neural graphics primitives with a multiresolution hash encoding,” ACM Transactions on Graphics, vol. 41, no. 4, pp. 102:1–102:15, 2022.
[55] S. J. Garbin, M. Kowalski, M. Johnson, J. Shotton, and J. P. C. Valentin, “Fastnerf: High-fidelity neural rendering at 200fps,” in IEEE ICCV, 2021, pp. 14 326–14 335.
[56] J. T. Barron, B. Mildenhall, M. Tancik, P. Hedman, R. Martin-Brualla, and P. P. Srinivasan, “Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields,” in IEEE ICCV, 2021, pp. 5835–5844.
[57] C. Lin, W. Ma, A. Torralba, and S. Lucey, “BARF: bundle-adjusting neural radiance fields,” in IEEE ICCV, 2021, pp. 5721–5731.
[58] X. Zhang, S. Bi, K. Sunkavalli, H. Su, and Z. Xu, “Nerfusion: Fusing radiance fields for large-scale scene reconstruction,” in IEEE CVPR, 2022, pp. 5449–5458.
[59] M. Oechsle, S. Peng, and A. Geiger, “UNISURF: unifying neural implicit surfaces and radiance fields for multi-view reconstruction,” in IEEE ICCV, 2021, pp. 5569–5579.
[60] W. E. Lorensen and H. E. Cline, “Marching cubes: A high resolution 3d surface construction algorithm,” in ACM SIGGRAPH, 1987, pp. 163–169.
[61] N. Stier, A. Rich, P. Sen, and T. Höllerer, “Vortx: Volumetric 3d reconstruction with transformers for voxelwise view selection and fusion,” in 3DV, 2021, pp. 320–330.
[62] A. Eftekhar, A. Sax, J. Malik, and A. R. Zamir, “Omnidata: A scalable pipeline for making multi-task mid-level vision datasets from 3d scans,” in IEEE ICCV, 2021, pp. 10 766–10 776.
[63] M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, and Q. V. Le, “Mnasnet: Platform-aware neural architecture search for mobile,” in IEEE CVPR, 2019, pp. 2820–2828.
[64] H. Tang, Z. Liu, S. Zhao, Y. Lin, J. Lin, H. Wang, and S. Han, “Searching efficient 3d architectures with sparse point-voxel convolution,” in ECCV, 2020, pp. 685–702.
[65] B. Curless and M. Levoy, “A volumetric method for building complex models from range images,” in ACM SIGGRAPH, 1996, pp. 303–312.
[66] R. Chen, S. Han, J. Xu, and H. Su, “Point-based multi-view stereo network,” in IEEE ICCV, 2019, pp. 1538–1547.
[67] Z. Yu and S. Gao, “Fast-mvsnet: Sparse-to-dense multi-view stereo with learned propagation and gauss-newton refinement,” in IEEE CVPR, 2020, pp. 1946–1955.
[68] D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from a single image using a multi-scale deep network,” in NeurIPS, 2014, pp. 2366–2374.