
GaussianRoom: Improving 3D Gaussian Splatting with SDF Guidance and Monocular Cues for
Indoor Scene Reconstruction

Haodong Xiang1*   Xinghui Li1*   Xiansong Lai1   Wanting Zhang1   Zhichao Liao1     Kai Cheng2†  Xueping Liu1†
1Tsinghua University
2University of Science and Technology of China
https://xhd0612.github.io/GaussianRoom.github.io/
Abstract

Recently, 3D Gaussian Splatting (3DGS) has revolutionized neural rendering with high-quality rendering and real-time speed. However, in indoor scenes with a significant number of textureless areas, 3DGS yields incomplete and noisy reconstruction results due to poor initialization of the point cloud and under-constrained optimization. Inspired by the continuity of the signed distance field (SDF), which naturally has advantages in modeling surfaces, we present a unified optimization framework integrating a neural SDF with 3DGS. This framework incorporates a learnable neural SDF field to guide the densification and pruning of Gaussians, enabling Gaussians to accurately model scenes even with poorly initialized point clouds. At the same time, the geometry represented by the Gaussians improves the efficiency of the SDF field by piloting its point sampling. Additionally, we regularize the optimization with normal and edge priors to eliminate geometric ambiguity in textureless areas and improve the details. Extensive experiments on ScanNet and ScanNet++ show that our method achieves state-of-the-art performance in both surface reconstruction and novel view synthesis.

*Equal contribution. †Corresponding author.

1 Introduction

3D reconstruction from multi-view RGB images is a fundamental task in computer vision and computer graphics. The reconstructed models serve various applications such as virtual reality, video games, autonomous driving, and robotics. One typical scenario for 3D reconstruction is an indoor scene, which is characterized by large textureless areas. MVS-based methods [43, 35, 55] often produce incomplete or geometrically incorrect reconstruction results, primarily due to the geometric ambiguities introduced by the textureless areas.

Recently, neural-radiance-field-based methods [52, 51, 62] that model scenes with a signed distance field (SDF) have achieved complete and accurate mesh reconstruction in indoor scenes, benefiting from the continuity of neural SDFs and the introduction of monocular geometric priors [62]. However, they suffer from long optimization times due to dense ray sampling in volume rendering. Fortunately, 3D Gaussian Splatting (3DGS) [24] accelerates the optimization and rendering of neural rendering through its differentiable rasterization technique, which also opens new possibilities for 3D scene reconstruction. Despite its impressive rendering efficiency, 3DGS often produces noisy and incomplete reconstruction results in indoor scenes, primarily due to the poor initialization of the SfM point cloud in textureless regions and the under-constrained densification and optimization of Gaussians.

Considering both the advantages of neural SDFs in modeling surfaces and the efficiency of 3DGS, we introduce a novel approach named GaussianRoom, which incorporates a neural SDF within 3DGS to improve geometry reconstruction in indoor scenes while preserving rendering efficiency. We design a joint optimization strategy that enables 3DGS and the neural SDF to facilitate each other.

First, we propose an SDF-guided primitive distribution strategy, which utilizes the surface represented by the SDF to guide the densification and pruning of Gaussians. For surface areas lacking initial Gaussians, we deploy new Gaussians using the SDF-guided Global Densification strategy; simultaneously, we perform SDF-guided Densification and Pruning on existing Gaussians based on their positions relative to the scene surface. At the same time, we leverage the rendered depth from 3DGS to guide point sampling along the rays of the neural SDF, which reduces invalid sampling in free space and further improves the efficiency of optimization. Since photometric consistency alone is insufficient to constrain texture-less areas, we additionally introduce monocular normal priors in both the Gaussian and neural SDF fields to regularize the geometry of textureless areas. Furthermore, to improve the details of indoor scenes, such as objects with fine structures, we enhance the focus on these challenging regions during training by incorporating edge priors.

Extensive experiments on ScanNet and ScanNet++ show that GaussianRoom produces high-quality reconstruction results while maintaining the efficient rendering of 3DGS. Compared to state-of-the-art methods, our approach surpasses them in both rendering and reconstruction quality. In summary, our contributions are as follows:

  • We propose GaussianRoom, a novel unified framework incorporating a neural SDF with 3DGS. An SDF-guided primitive distribution strategy is proposed to guide the densification and pruning of Gaussians. At the same time, the geometry represented by the Gaussians improves the efficiency of the SDF field by piloting its point sampling.

  • We design an edge-aware regularization term to improve the details of reconstruction and rendering and further incorporate monocular normal priors in optimization to provide the geometric cues for textureless regions.

  • Our method achieves both high-quality surface reconstruction and rendering for indoor scenes. Extensive experiments on various scenes show that our method achieves SOTA performance in multiple metrics.

2 Related Works

Multi-view Stereo Feature-based Multi-View Stereo (MVS) methods [2, 16, 43, 45] construct explicit representations of objects and scenes by matching image features across multiple views to estimate the 3D coordinates of pixels; a surface is then obtained by applying Poisson surface reconstruction [23]. In indoor scenes, especially in large texture-less areas, these methods often struggle due to the sparsity of features. Voxel-based approaches [6, 11, 44, 33] avoid poor feature matching by optimizing spatial occupancy and color within a voxel grid, but they are limited by memory usage at high resolutions, resulting in lower reconstruction quality. Learning-based multi-view stereo methods replace feature matching [35, 49, 63] and related steps [41] with neural networks that directly predict depth or 3D volumes from images [22, 55, 56, 61, 65]. However, even when trained on large-scale data, errors still occur in areas with occlusions, complex lighting, or subtle textures.

Neural Radiance Field Neural Radiance Fields (NeRF) [36] represents a scene as a continuous volumetric function of density and color, using neural networks to enable realistic new view synthesis. Methods such as Mip-NeRF [3, 54, 4] enhance rendering efficiency through optimized ray sampling techniques. Other works [37, 32, 14, 7, 29] accelerate training and rendering by leveraging spatial data structures, using alternative encodings, or resizing MLPs. Some works focus on improving the rendering quality through regularization terms. Depth regularization [12, 53], for instance, explicitly supervises ray termination to reduce unnecessary sampling time. Other approaches explore imposing smoothness constraints on rendered depth maps [38] or employing multi-view consistency regularization in sparse view settings [50, 25].

Although NeRF can produce realistic renderings for novel view synthesis, the geometry directly extracted via Marching Cubes [34] is of poor quality. Consequently, some research replaces NeRF’s volumetric density field with alternative implicit functions, such as occupancy grids [39, 40] and signed distance functions (SDFs) [58, 51, 52, 57, 29]. To further improve reconstruction quality, [15, 64] regularize the optimization using SfM points, and [19, 62] leverage priors such as the Manhattan-world assumption and pseudo depth supervision. However, these methods tend to miss parts of the scene while requiring long optimization times.

3D Gaussian Splatting 3D Gaussian Splatting [24] has recently become very popular in the field of neural rendering, providing an explicit representation of scenes and enabling novel view synthesis without depending on neural networks. During training, 3DGS consumes a significant amount of GPU memory, and several efforts [13, 26] aim to compress the memory footprint of Gaussian representations. GaussianPro [9] introduces a novel Gaussian propagation strategy that guides the densification process, resulting in more compact and precise Gaussians, especially in areas with limited texture details; related works [5, 31, 28, 20] enhance the performance of Gaussian rendering across various scenarios. DN-Splatter [48] uses depth and normal priors during optimization to significantly enhance reconstruction, achieving smoother and more geometrically accurate results. Similarly, SuGaR [18], 2DGS [21], and NeuSG [8] employ Gaussian-based methods for reconstruction. Our concurrent work, GSDF [60], also jointly optimizes a neural SDF and 3DGS. In contrast, we focus on indoor scenes and address the challenges posed by large textureless areas in 3DGS.

3 Preliminary

3.1 3D Gaussian Splatting

3DGS [24] represents the 3D scene as differentiable 3D Gaussian primitives, achieving state-of-the-art visual quality and rendering speed. Each Gaussian primitive is defined by a mean $\boldsymbol{\mu}\in\mathbb{R}^{3}$ and a covariance $\boldsymbol{\Sigma}\in\mathbb{R}^{3\times 3}$:

G(\boldsymbol{x})=e^{-\frac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu})^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}(\boldsymbol{x}-\boldsymbol{\mu})}.   (1)

To preserve the positive semi-definiteness of the covariance matrix during optimization, $\boldsymbol{\Sigma}$ is further formulated as $\boldsymbol{\Sigma}=\boldsymbol{R}\boldsymbol{S}\boldsymbol{S}^{\mathrm{T}}\boldsymbol{R}^{\mathrm{T}}$, where the rotation matrix $\boldsymbol{R}$ is orthogonal and the scale matrix $\boldsymbol{S}$ is diagonal. For rendering, the 3D Gaussians are projected onto the 2D image plane as 2D Gaussians following depth-based sorting, and the color of pixel $\boldsymbol{p}$ is calculated as:

\hat{C}(\boldsymbol{p})=\sum_{i\in N}\boldsymbol{c}_{i}\sigma_{i}\prod_{j=1}^{i-1}(1-\sigma_{j}),\quad\sigma_{i}=\alpha_{i}G_{i}^{\prime}(\boldsymbol{p}),   (2)

where $\alpha_{i}$ represents the opacity of the $i$-th 3D Gaussian, $G_{i}^{\prime}$ is the projected 2D Gaussian, $N$ denotes the number of sorted 2D Gaussians associated with pixel $\boldsymbol{p}$, and $\boldsymbol{c}_{i}$ is the color of $G_{i}^{\prime}$. Similar to [9], we use the direction of the shortest axis as the normal $n_{i}$ of Gaussian primitive $G_{i}$, and then apply $\alpha$-blending to render the normal map: $\mathcal{N}_{gs}=\sum_{i\in N}n_{i}\sigma_{i}\prod_{j=1}^{i-1}(1-\sigma_{j})$.
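
To make Eq. 2 concrete, the following is a minimal PyTorch sketch of the per-pixel compositing, assuming the Gaussians affecting a pixel have already been projected, sorted front-to-back, and evaluated at the pixel; all tensor names are illustrative and not taken from the official 3DGS implementation.

```python
import torch

def composite_pixel(opacities, gauss2d_vals, colors, normals):
    """Front-to-back alpha compositing for one pixel (Eq. 2).

    opacities:    (N,)  alpha_i of each sorted Gaussian
    gauss2d_vals: (N,)  G'_i(p), the projected 2D Gaussian evaluated at the pixel
    colors:       (N,3) c_i
    normals:      (N,3) shortest-axis normals n_i
    """
    sigma = opacities * gauss2d_vals                                    # sigma_i = alpha_i * G'_i(p)
    transmittance = torch.cumprod(
        torch.cat([sigma.new_ones(1), 1.0 - sigma[:-1]]), dim=0)        # prod_{j<i}(1 - sigma_j)
    weights = sigma * transmittance                                      # per-Gaussian blending weights
    color = (weights[:, None] * colors).sum(dim=0)                       # rendered color C(p)
    normal = (weights[:, None] * normals).sum(dim=0)                     # rendered normal N_gs(p)
    return color, normal
```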

3.2 Neural Implicit SDFs

NeRF [36] implicitly learns a 3D scene as a continuous volume density and radiance field from multi-view images. This representation struggles to define clear surfaces, often yielding noisy geometry when extracted from the density field. In contrast, an SDF offers a common way to represent surfaces implicitly as a zero-level set, $\{\boldsymbol{x}\in\mathbb{R}^{3}\mid f_{g}(\boldsymbol{x})=0\}$, where $f_{g}(\boldsymbol{x})$ is the SDF value predicted by an MLP $f_{g}(\cdot)$. Following NeuS [52], we replace the volume density with the SDF and convert the SDF value to the opacity $\alpha_{i}$ with a logistic function:

\alpha_{i}=\max\left(\frac{\phi_{s}(f(\boldsymbol{x}_{i}))-\phi_{s}(f(\boldsymbol{x}_{i+1}))}{\phi_{s}(f(\boldsymbol{x}_{i}))},0\right),   (3)

where $\phi_{s}$ denotes a Sigmoid function. Following the volume rendering methodology, the predicted color of pixel $\boldsymbol{p}$ is calculated by accumulating the weighted colors of the sample points along the ray $\boldsymbol{r}$:

\hat{C}(\boldsymbol{p})=\sum_{i=1}^{N}T_{i}\alpha_{i}\boldsymbol{c}_{i},\quad T_{i}=\exp\left(-\sum_{j=1}^{i-1}\alpha_{j}\delta_{j}\right),   (4)

where $T_{i}$ is the accumulated transmittance and $N$ is the number of sample points along the ray $\boldsymbol{r}$. Similarly, the normal can be rendered as $\hat{\mathcal{N}}(\boldsymbol{p})=\sum_{i=1}^{N}T_{i}\alpha_{i}\hat{\boldsymbol{n}}_{i}$, where $\hat{\boldsymbol{n}}_{i}=\nabla f_{g}(\boldsymbol{x}_{i})$ denotes the gradient of the SDF at sample point $\boldsymbol{x}_{i}$, computed by automatic differentiation.
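
As a minimal sketch of Eqs. 3 and 4 as written above, the snippet below converts SDF values along one ray into opacities and accumulates color and normal; it assumes the per-point radiance, normals, and inter-sample distances are precomputed, and all names and the sharpness default are illustrative.

```python
import torch

def render_ray_from_sdf(sdf_vals, colors, normals, deltas, s=64.0):
    """Volume rendering along one ray from SDF samples (Eqs. 3-4).

    sdf_vals: (N+1,) SDF values f(x_i) at consecutive sample points
    colors:   (N,3)  radiance c_i at the first N samples
    normals:  (N,3)  SDF gradients at the first N samples
    deltas:   (N,)   distances between consecutive samples
    s:        sharpness of the logistic function phi_s
    """
    phi = torch.sigmoid(s * sdf_vals)                                          # phi_s(f(x_i))
    alpha = torch.clamp((phi[:-1] - phi[1:]) / (phi[:-1] + 1e-8), min=0.0)     # Eq. 3
    T = torch.exp(-torch.cumsum(
        torch.cat([alpha.new_zeros(1), alpha[:-1] * deltas[:-1]]), dim=0))     # Eq. 4 transmittance
    weights = T * alpha
    color = (weights[:, None] * colors).sum(dim=0)                             # \hat{C}(p)
    normal = (weights[:, None] * normals).sum(dim=0)                           # \hat{N}(p)
    return color, normal
```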

Figure 1: Overview. GaussianRoom integrates neural SDF within 3DGS and forms a positive cycle improving each other. (a) We employ the geometric information from the SDF to constrain the Gaussian primitives, ensuring their spatial distribution aligns with the scene surface. (b) We utilize rasterized depth from Gaussian to efficiently provide coarse geometry information, narrowing down the sampling range to accelerate the optimization of neural SDF. (c) We introduce monocular normal prior and edge prior, addressing the challenges of texture-less areas and fine structures indoors.

4 Methodology

Given multi-view posed images and point clouds obtained from the Structure from Motion (SfM) algorithm, our objective is to enable 3DGS to accurately reconstruct indoor scene geometry while preserving its rendering quality and efficiency. To this end, we first incorporate an implicit SDF field within 3DGS and design a mutual learning strategy to realize high-quality reconstruction and rendering (Sec. 4.1). Furthermore, we present the monocular geometric cues, i.e. normal prior and edge prior, to improve the reconstruction of the planes and fine details in indoor scenes respectively (Sec. 4.2). Finally, we discuss the loss functions and the overall optimization process (Sec. 4.3). An overview of our framework is provided in Fig. 1.

4.1 Mutual Learning of 3D Gaussian and Neural SDF

In this section, we propose to model an indoor scene using 3D Gaussians and neural SDF with a mutual learning strategy. For 3DGS optimization, we utilize SDF to guide the distribution of Gaussian primitives, which densifies the Gaussians around the surface and significantly reduces the floaters in non-surface space. Furthermore, we introduce a Gaussian-guided sampling methodology to pilot point sampling for neural SDF, which improves the training efficiency.

Figure 2: (a) Distribution of Gaussian primitives. (b) Ground-truth scene surface with the distribution of Gaussian primitives. (c) Red Gaussian points represent new Gaussians generated by the SDF-guided Global Densification strategy, while green Gaussian points indicate those adjusted through the SDF-guided Densification and Pruning process.

SDF-guided Primitive Distribution for 3D Gaussian As illustrated in Fig. 2 (a), due to the lack of underlying geometric constraints, Gaussian primitives become disorganized during optimization in indoor scenes, resulting in randomly scattered floaters in non-surface areas.

SuGaR [18] introduces a regularization term to align flattened Gaussian primitives with the scene’s surface, improving mesh extraction performance but resulting in degraded rendering quality. In contrast, our approach preserves the original shape of the Gaussian primitives while utilizing the scene surface geometry information provided by the neural SDF to guide their distribution, as shown in Fig. 2 (c). Specifically, the guidance of the SDF includes two aspects, one is to deploy Gaussian primitives to achieve global densification in spatial locations with low SDF values that lack Gaussian primitives, and the other is to guide the densification and pruning of existing Gaussian primitives.

SDF-guided Global Densification. Note that the original 3DGS algorithm globally prunes Gaussian primitives based on opacity thresholds, yet it restricts its densification strategy solely to local operations like cloning or splitting, thereby posing challenges when generating new Gaussian primitives during densification in areas lacking initial Gaussians. Especially in indoor scene reconstruction tasks, such scarcity of initial Gaussians is prevalent in texture-less areas. To address this issue, we develop a global densification strategy that leverages the geometric information of the entire scene provided by the neural SDF.

As depicted by the Global Densification strategy in Fig. 2 (c), we partition the scene space into $\mathrm{N}^{3}$ cubic grids and calculate the SDF value at the center of each grid. If the value falls below a threshold ($S_{c}<\tau_{s}$), the grid is in proximity to the scene surface. We then count the existing Gaussian primitives within each such grid. If the number of Gaussian primitives is insufficient ($N_{g}<\tau_{n}$), we select the $K$ nearest Gaussian neighbors of the grid's center point and generate $K$ new Gaussian primitives within the grid. The initial attributes of these newly generated Gaussian primitives are sampled from a normal distribution defined by the mean and variance of the $K$ neighboring Gaussians.
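
A minimal sketch of this global densification step is given below. It assumes access to an SDF network sdf_fn, the current Gaussian centers, and axis-aligned scene bounds; the thresholds follow the values reported in the supplementary material and all names are illustrative.

```python
import torch

def sdf_guided_global_densification(sdf_fn, gaussian_means, scene_min, scene_max,
                                    N=128, tau_s=0.01, tau_n=10):
    """Find near-surface grid cells that lack Gaussians and should receive K new primitives."""
    # Build an N^3 grid of cell centers covering the scene bounds.
    axes = [torch.linspace(float(lo), float(hi), N) for lo, hi in zip(scene_min, scene_max)]
    centers = torch.stack(torch.meshgrid(*axes, indexing="ij"), dim=-1).reshape(-1, 3)
    cell_size = (scene_max - scene_min) / N

    with torch.no_grad():
        sdf_vals = sdf_fn(centers).squeeze(-1)            # S_c at each cell center (chunk in practice)
    near_surface = sdf_vals < tau_s                       # cells in proximity to the scene surface

    # Count existing Gaussians per cell.
    idx = ((gaussian_means - scene_min) / cell_size).long().clamp(0, N - 1)
    flat = idx[:, 0] * N * N + idx[:, 1] * N + idx[:, 2]
    counts = torch.bincount(flat, minlength=N ** 3)

    # Near-surface cells with too few Gaussians (N_g < tau_n) are populated with K new
    # Gaussians, initialized from the statistics of the K nearest existing Gaussians.
    return centers[near_surface & (counts < tau_n)]
```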

SDF-guided Densification and Pruning. For regions where a sufficient number of Gaussian primitives already exist, we employ an enhanced version of the Densification and Pruning strategy that integrates SDF geometric information. For each Gaussian primitive at position $\boldsymbol{x}$, its SDF value is given by $S=f_{g}(\boldsymbol{x})$. The criterion for determining whether a Gaussian primitive should be densified or pruned is expressed as follows:

\eta=\exp\left(-\frac{S^{2}}{\lambda_{\sigma}\sigma^{2}}\right),   (5)

where $\sigma$ denotes the opacity of the Gaussian primitive and $\lambda_{\sigma}$ is its coefficient. A small $\eta$ signifies that the Gaussian is either far from the SDF zero-level set or has low opacity; in such cases, if $\eta<\tau_{p}$, the Gaussian primitive is pruned. Conversely, when $\eta>\tau_{d}$ and the gradient of the Gaussian satisfies $\nabla_{g}>\tau_{g}$, the Gaussian primitive is densified. GSDF [60] introduces a similar mechanism, but our $\eta$ in Eq. 5 integrates both the SDF value and the Gaussian's opacity while serving as a pruning criterion independent of the Gaussian's gradient, thus avoiding the trade-off between densifying Gaussians in high-gradient regions and removing floaters. In other words, our method will not mistakenly densify Gaussians detached from the SDF surface.
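
The criterion in Eq. 5 translates into a few lines of code. The sketch below assumes per-Gaussian centers, opacities, and accumulated view-space gradient norms are available; the thresholds follow the supplementary material, while the default for $\lambda_{\sigma}$ is not specified in the paper and is purely illustrative.

```python
import torch

def sdf_guided_densify_prune(sdf_fn, means, opacities, grad_norms,
                             lambda_sigma=1.0, tau_p=0.998, tau_d=0.999, tau_g=2e-4):
    """Compute densification and pruning masks from the SDF/opacity criterion eta (Eq. 5)."""
    with torch.no_grad():
        S = sdf_fn(means).squeeze(-1)                     # SDF value at each Gaussian center
    eta = torch.exp(-S ** 2 / (lambda_sigma * opacities ** 2 + 1e-12))
    prune_mask = eta < tau_p                              # far from the zero-level set or low opacity
    densify_mask = (eta > tau_d) & (grad_norms > tau_g)   # near the surface and high gradient
    return densify_mask, prune_mask
```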

Gaussian-guided Point Sampling for Neural SDF To reconstruct an accurate geometric surface efficiently, it is advisable to sample as many points around the true surface as possible. Previous works [47, 27] use the predicted SDF values to pilot sampling along the ray. However, this approach is time-consuming and suffers from a chicken-and-egg problem. Instead, we employ the 3D Gaussians as coarse geometric guidance for point sampling, similar to [60]. Specifically, we leverage the rasterized depth maps from the 3D Gaussians to narrow down the ray sampling range of the neural SDF field. The rendered depth value $D$ of pixel $\boldsymbol{p}$ is defined as follows:

D(\boldsymbol{p})=\sum_{i\in N}d_{i}\sigma_{i}\prod_{j=1}^{i-1}(1-\sigma_{j}),   (6)

where $N$ is the number of 3D Gaussians encountered by the ray $\boldsymbol{r}$ corresponding to $\boldsymbol{p}$, $d_{i}$ represents the depth of the $i$-th 3D Gaussian, and $\sigma_{i}$ denotes the opacity of the projected Gaussian primitive.

Once the rendered depth $D(\boldsymbol{p})$ has been obtained, we define the sampling range as $\boldsymbol{r}=[\boldsymbol{o}+(D(\boldsymbol{p})-\gamma\left|S\right|)\cdot\boldsymbol{v},\ \boldsymbol{o}+(D(\boldsymbol{p})+\gamma\left|S\right|)\cdot\boldsymbol{v}]$, where $\boldsymbol{o}$ and $\boldsymbol{v}$ respectively denote the camera center and the view direction of pixel $\boldsymbol{p}$, $S$ represents the corresponding SDF value, and $\gamma$ is a hyperparameter controlling the length of the sampling range.
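
The sampling range can be implemented as in the following sketch, which assumes rasterized depths and the SDF value at the back-projected depth point are available for a batch of rays; all names are illustrative.

```python
import torch

def gaussian_guided_sample_points(ray_o, ray_v, rendered_depth, sdf_at_depth,
                                  n_samples=64, gamma=5.0):
    """Sample points inside [D - gamma|S|, D + gamma|S|] along each ray (Sec. 4.1).

    ray_o: (R,3) camera centers, ray_v: (R,3) unit view directions,
    rendered_depth: (R,) depth D(p) rasterized from the 3D Gaussians (Eq. 6),
    sdf_at_depth:   (R,) SDF value S at the point o + D(p) * v.
    """
    half_range = gamma * sdf_at_depth.abs()
    t_near = (rendered_depth - half_range).clamp(min=0.0)
    t_far = rendered_depth + half_range
    t = torch.linspace(0.0, 1.0, n_samples, device=ray_o.device)        # (n_samples,)
    t_vals = t_near[:, None] + (t_far - t_near)[:, None] * t[None, :]   # (R, n_samples)
    points = ray_o[:, None, :] + t_vals[..., None] * ray_v[:, None, :]  # (R, n_samples, 3)
    return points, t_vals
```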

4.2 Monocular Cues constrained Optimization

It is important to note that indoor scenes contain not only large texture-less areas such as walls and floors but also detailed regions. However, the original 3DGS and neural SDF rely only on the image reconstruction loss without incorporating any geometric constraints, resulting in noisy and blurry results. In this section, we incorporate monocular geometric cues into our mutual learning pipeline, namely an edge prior and a normal prior, which are used to constrain detailed regions and flat areas respectively.

Edge-guided Details Optimization Detailed regions account for a small proportion of indoor scenes compared with texture-less flat areas such as floors and walls, so they tend to be relatively neglected during optimization. Based on the observation that detailed regions mostly have sharp shapes or rich textures with obvious edge information, we propose a novel edge-guided optimization strategy. We use a pre-trained edge detection network [46] to generate edge maps.

Edge-aware Neural SDF Optimization. For neural SDF, we design an edge-guided ray sampling strategy, which performs explicit sampling at detailed areas. Specifically, we define an image-variant weight for ray sampling according to the ratio of edge pixels:

\omega_{i}=\delta\cdot N_{edge}^{i}/(H\times W),   (7)

where $N_{edge}^{i}$ is the number of edge pixels in image $I_{i}$, $H\times W$ is the total number of pixels in image $I_{i}$, and $\delta$ is a hyperparameter indicating the importance of edges. For the $q$ training rays sampled from image $I_{i}$, $\omega_{i}\cdot q$ rays are sampled from the edge-region set and $(1-\omega_{i})\cdot q$ rays are sampled from a random set. This hybrid sampling ensures that sharp boundary information is sampled adaptively in each iteration.
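
A minimal sketch of this hybrid sampling, assuming a binary edge map and a per-image ray budget q; all names are illustrative.

```python
import torch

def edge_guided_ray_indices(edge_map, q, delta=2.0):
    """Pick q pixel indices per image: a fraction omega_i from edge pixels, the rest at random (Eq. 7)."""
    H, W = edge_map.shape
    edge_idx = torch.nonzero(edge_map.reshape(-1) > 0.5).squeeze(-1)   # indices of edge pixels
    if edge_idx.numel() > 0:
        omega = min(delta * edge_idx.numel() / (H * W), 1.0)           # image-variant weight
        n_edge = int(omega * q)
        pick_edge = edge_idx[torch.randint(edge_idx.numel(), (n_edge,))]
    else:
        pick_edge = edge_idx                                           # no edges detected
    pick_rand = torch.randint(H * W, (q - pick_edge.numel(),))         # remaining rays, uniformly
    return torch.cat([pick_edge, pick_rand])
```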

Edge-aware 3DGS Optimization. Since regions with high-frequency information are relatively difficult to recover, we design a simple yet effective regularization strategy to improve their rendering. Our method is based on the insight that, in 3DGS, pixels in detailed areas and pixels in flat areas influence the loss equally, which is inefficient for indoor scenes. Thus, we utilize edge priors as weights to regularize photometric consistency. For each pixel $\boldsymbol{p}$ of image $I_{i}$, the corresponding training weight of the photometric loss is defined as follows:

w_{\boldsymbol{p}}=2\phi(e_{\boldsymbol{p}}),   (8)

where $e_{\boldsymbol{p}}$ represents the edge-map value at pixel $\boldsymbol{p}$ and $\phi(\cdot)$ denotes a Sigmoid function. The mapping function confines the loss weight within the interval $[1,2]$ and increases the importance of edge areas in the optimization, which helps recover fine details.
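
The weighting of Eq. 8, combined with the per-pixel L1 term of Eq. 10, reduces to a short sketch; edge_map is assumed to hold non-negative edge responses and all names are illustrative.

```python
import torch

def edge_weighted_photometric_loss(rendered, target, edge_map):
    """L1 photometric loss with per-pixel edge weights w_p = 2*sigmoid(e_p) in [1, 2] (Eqs. 8, 10)."""
    w = 2.0 * torch.sigmoid(edge_map)                        # e_p >= 0, so w lies in [1, 2]
    return (w[..., None] * (rendered - target).abs()).mean()
```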

Normal-guided Geometry Constraint Based on the observation that texture-less areas usually exhibit well-defined planarity, we utilize normal information to provide geometric constraints. Specifically, we use [1] to obtain monocular normal priors and, inspired by [62, 9], use these priors to simultaneously constrain the rendered normal maps of the neural SDF and 3DGS.

4.3 Loss Functions

The Gaussians are supervised by the rendering losses $\mathcal{L}_{c}$ and $\mathcal{L}_{D-SSIM}$ and the normal loss $\mathcal{L}_{normal}$:

\mathcal{L}_{gs}=\lambda_{1}\mathcal{L}_{c}+(1-\lambda_{1})\mathcal{L}_{D-SSIM}+\lambda_{2}\mathcal{L}_{normal},   (9)

where the rendering loss $\mathcal{L}_{c}$ is defined as:

\mathcal{L}_{c}=\frac{1}{q}\sum_{k}\|C_{k}-\hat{C}_{k}\|_{1}\cdot w_{k},   (10)

where $C_{k}$ and $\hat{C}_{k}$ are the ground-truth and rendered colors respectively, and $w_{k}$ is the weighting term from the edge map described in Sec. 4.2. The normal loss $\mathcal{L}_{normal}$ is defined as:

\mathcal{L}_{normal}=\frac{1}{q}\sum_{k}\|\mathcal{N}_{k}-\hat{\mathcal{N}}_{k}\|_{1},   (11)

where $\mathcal{N}_{k}$ and $\hat{\mathcal{N}}_{k}$ denote the predicted monocular normals and the rendered normals respectively.

The neural SDF is supervised by the rendering loss $\mathcal{L}_{c}$, the normal loss $\mathcal{L}_{normal}$, and the Eikonal loss $\mathcal{L}_{eik}$:

\mathcal{L}_{sdf}=\mathcal{L}_{c}+\mathcal{L}_{normal}+\lambda_{eik}\mathcal{L}_{eik},   (12)

where the Eikonal loss regularizes the SDF following [17]. Our total loss is defined as:

\mathcal{L}=\mathcal{L}_{gs}+\mathcal{L}_{sdf}.   (13)

The hyper-parameter settings are detailed in the supplementary material.
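
As a sketch of how the terms in Eqs. 9-13 combine, the snippet below assumes the rendered quantities from both branches and the monocular priors are available for the current batch; the D-SSIM term is assumed to be precomputed, both branches are assumed, for simplicity, to be supervised on the same pixels, and the loss weights follow the supplementary material, with all other names illustrative.

```python
import torch
import torch.nn.functional as F

def total_loss(gs_out, sdf_out, gt_rgb, prior_normal, edge_w,
               lambda1=0.8, lambda2=0.01, lambda_eik=0.1):
    """Combine the 3DGS and neural-SDF objectives (Eqs. 9-13)."""
    # 3DGS branch (Eq. 9): edge-weighted color term (Eq. 10), D-SSIM, and normal term (Eq. 11).
    l_c = (edge_w[..., None] * (gs_out["rgb"] - gt_rgb).abs()).mean()
    l_normal_gs = F.l1_loss(gs_out["normal"], prior_normal)
    l_gs = lambda1 * l_c + (1.0 - lambda1) * gs_out["dssim"] + lambda2 * l_normal_gs

    # Neural SDF branch (Eq. 12): color, normal, and Eikonal terms.
    l_c_sdf = F.l1_loss(sdf_out["rgb"], gt_rgb)
    l_normal_sdf = F.l1_loss(sdf_out["normal"], prior_normal)
    l_eik = ((sdf_out["grad_norm"] - 1.0) ** 2).mean()
    l_sdf = l_c_sdf + l_normal_sdf + lambda_eik * l_eik

    return l_gs + l_sdf  # Eq. 13
```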

Figure 3: Qualitative reconstruction comparisons. For each indoor scene, the first row is the top view of the whole room, and the second row is the details of the masked region. The reconstruction results of GaussianRoom visually have better scene integrity than other methods, especially in details.

5 Experiment

5.1 Experimental Setup

Dataset We evaluate the reconstruction and rendering quality of our approach on 10 real-world indoor scenes from publicly available datasets: 8 scenes from ScanNet (V2) [10] and 2 scenes from ScanNet++ [59].

Implementation Details We build our code on 3DGS [24] and NeuS [52]. Training is divided into three stages: we first pre-train 3DGS for 15k iterations and then co-optimize for 80k iterations, during the first 6k of which the 3DGS and the neural SDF do not guide each other, so that each learns a rough representation of the scene before mutual learning begins. All experiments are conducted on a single Tesla V100 GPU.

Baselines We evaluate our method against SOTA reconstruction and rendering approaches. For reconstruction, we compare with COLMAP [42], NeRF [36], NeuS [52], MonoSDF [62], HelixSurf [30], 3DGS [24], GaussianPro [9], SuGaR [18], and DN-Splatter [48]. For rendering, we compare with Gaussian-based methods due to their impressive rendering quality, including 3DGS [24], SuGaR [18], GaussianPro [9], and DN-Splatter [48]. For DN-Splatter [48], we follow its optional experimental setting of using monocular depths instead of real sensor depths as supervision and retrain the model to ensure a fair comparison.

Metrics We follow the evaluation protocol from [51, 62] and report Accuracy, Completion, Precision, Recall, and F-score with a threshold of 5cm for 3D geometry evaluation. For rendering evaluation, we follow standard practice and report SSIM, PSNR, and LPIPS metrics.

5.2 Reconstruction Evaluation

Tab. 1 shows that our method significantly outperforms both Gaussian-based and NeRF-based methods in geometry metrics on the ScanNet and ScanNet++ datasets. As illustrated in Fig. 3, Gaussian-based scene reconstruction methods such as SuGaR [18] and DN-Splatter [48] are impacted by the geometric disorder of Gaussians, leading to uneven Poisson reconstruction results. In contrast, our method achieves smooth and continuous surfaces thanks to the continuity of the neural SDF. Compared with NeRF-based methods, GaussianRoom achieves more complete and detailed reconstruction while greatly shortening training time, because the integrated 3DGS encourages more efficient point sampling near the surface and the edge prior draws more attention to detailed regions. For example, MonoSDF requires about 18 hours of optimization, whereas GaussianRoom achieves better results within 4 hours of training. Please refer to the supplementary material for more qualitative results on the ScanNet++ [59] dataset compared with existing methods.

Method  Accuracy↓  Completion↓  Precision↑  Recall↑  F-score↑
COLMAP [42] 0.062 / 0.091 0.090 / 0.093 0.640 / 0.519 0.569 / 0.520 0.600 / 0.517
NeRF [36] 0.160 / 0.135 0.065 / 0.082 0.378 / 0.421 0.576 / 0.569 0.454 / 0.484
NeuS [52] 0.105 / 0.163 0.124 / 0.196 0.448 / 0.316 0.378 / 0.265 0.409 / 0.288
MonoSDF [62] 0.048 / 0.039 0.068 / 0.043 0.673 / 0.816 0.558 / 0.840 0.609 / 0.827
HelixSurf [30] 0.063 /     - 0.134 /     - 0.657 /     - 0.504 /     - 0.567 /     -
3DGS [24] 0.338 / 0.113 0.406 / 0.790 0.129 / 0.445 0.067 / 0.103 0.085 / 0.163
GaussianPro [9] 0.313 / 0.141 0.394 / 1.283 0.112 / 0.353 0.075 / 0.081 0.088 / 0.129
SuGaR [18] 0.167 / 0.129 0.148 / 0.121 0.361 / 0.435 0.373 / 0.444 0.366 / 0.439
DN-Splatter [48] 0.212 / 0.294 0.210 / 0.276 0.153 / 0.108 0.182 / 0.108 0.166 / 0.107
Ours 0.047 / 0.035 0.043 / 0.037 0.800 / 0.894 0.739 / 0.852 0.768 / 0.872
Table 1: Quantitative reconstruction comparison on ScanNet / ScanNet++ datasets. We report the average results over 8 and 2 scenes respectively. The best results are marked in bold.
Method  ScanNet (SSIM↑ / PSNR↑ / LPIPS↓)  ScanNet++ (SSIM↑ / PSNR↑ / LPIPS↓)
3DGS [24] 0.731 22.133 0.387 0.843 21.816 0.294
SuGaR [18] 0.737 22.290 0.382 0.831 20.611 0.318
GaussianPro [9] 0.721 22.676 0.395 0.831 21.285 0.320
DN-Splatter [48] 0.639 21.621 0.312 0.826 20.445 0.268
Ours 0.758 23.601 0.391 0.844 22.001 0.296
Table 2: Quantitative rendering comparison with existing methods on ScanNet and ScanNet++ datasets. We report the average results over 8 and 2 scenes respectively.
Figure 4: Qualitative rendering comparisons. As shown from the above-highlighted patches, the rendering results of GaussianRoom significantly outperform other GS-based methods, including texture-less regions and details.

5.3 Rendering Evaluation

According to Tab. 2, our method achieves superior rendering metrics on both the ScanNet and ScanNet++ datasets compared to 3DGS [24], SuGaR [18], GaussianPro [9], and DN-Splatter [48]. The substantial improvement in PSNR indicates that our rendered images exhibit significantly reduced distortion, thanks to the SDF-guided Primitive Distribution Strategy, which effectively prunes floating Gaussian primitives in non-surface space and deploys Gaussians near surface regions that lack initial Gaussians, as demonstrated in Fig. 2 (c). As shown in Fig. 4, our method achieves sharp details and superior renderings in both rich-texture and low-texture regions, benefiting from the rational use of the geometric information within the neural SDF to align Gaussian primitives with the scene surface.

5.4 Ablation Study

In this section, we conduct experiments by individually removing each improvement module from our full model to verify the effectiveness of each design. Quantitative results and qualitative visualizations can be found in Tab. 3 and Fig. 5.

Method  3D Reconstruction (Accuracy↓ / Completion↓ / Precision↑ / Recall↑ / F-score↑)  Novel View Synthesis (SSIM↑ / PSNR↑ / LPIPS↓)
w/o SGD 0.048 0.044 0.790 0.730 0.758 0.756 23.545 0.394
w/o SDP 0.048 0.044 0.785 0.724 0.753 0.738 22.589 0.395
w/o Gaussian-guided sampling 0.049 0.045 0.777 0.718 0.746 0.754 23.463 0.395
w/o normal prior 0.127 0.120 0.412 0.367 0.388 0.755 23.625 0.395
w/o edge prior 0.055 0.046 0.780 0.722 0.750 0.754 23.350 0.397
GaussianRoom (Full) 0.047 0.043 0.800 0.739 0.768 0.758 23.601 0.391
Table 3: Ablation study on the proposed modules. The ablation experiments were conducted on 8 scenes from the ScanNet dataset, with the best metrics highlighted in bold.
Figure 5: Ablation studies. The above highlighted patches visually show the rendering degradation caused by each missing module.

The experimental results demonstrate that our SDF-guided Densification and Pruning (SDP) and SDF-guided Global Densification (SGD) modules effectively impose geometric constraints on disordered Gaussians, thereby enhancing rendering quality and providing more accurate depth information. With the depth information, improvements can be made to the sampling precision and convergence speed of the neural SDF through the Gaussian-guided Point Sampling (GPS) module, creating a mutually beneficial feedback loop. The ablation studies of the normal prior and edge prior demonstrate that these two components respectively enhance 3D reconstruction quality in low-texture regions and detailed regions. Given the large low-texture areas typical in indoor reconstruction tasks, the supervision provided by the normal prior proves highly effective.

However, similar to DN-Splatter [48], utilizing the normal prior as a regularization term to constrain the orientation of Gaussians results in a decrease in the PSNR metric. Despite this, our full model excels in both SSIM and LPIPS metrics, achieving superior visualization quality in detailed regions, as depicted in Fig. 5.

6 Conclusion

In conclusion, we present a novel unified framework that combines neural SDF with 3DGS. By incorporating a learnable neural SDF field with the SDF-guided primitive distribution strategy, our method overcomes the limitations of 3DGS in reconstructing indoor scenes with textureless areas. Meanwhile, the integration of Gaussians improves the efficiency of the neural SDF, while regularization with normal and edge priors enhances geometry details. Extensive experiments demonstrate the state-of-the-art performance of our method in surface reconstruction and novel view synthesis on ScanNet and ScanNet++ datasets.

References

  • [1] Gwangbin Bae, Ignas Budvytis, and Roberto Cipolla. Estimating and exploiting the aleatoric uncertainty in surface normal estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13137–13146, 2021.
  • [2] Connelly Barnes, Eli Shechtman, Adam Finkelstein, and Dan B Goldman. Patchmatch: A randomized correspondence algorithm for structural image editing. ACM Trans. Graph., 28(3):24, 2009.
  • [3] Jonathan T Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, and Pratul P Srinivasan. Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5855–5864, 2021.
  • [4] Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Zip-nerf: Anti-aliased grid-based neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19697–19705, 2023.
  • [5] Luis Bolanos, Shih-Yang Su, and Helge Rhodin. Gaussian shadow casting for neural characters. arXiv preprint arXiv:2401.06116, 2024.
  • [6] Adrian Broadhurst, Tom W Drummond, and Roberto Cipolla. A probabilistic framework for space carving. In Proceedings eighth IEEE international conference on computer vision. ICCV 2001, volume 1, pages 388–393. IEEE, 2001.
  • [7] Anpei Chen, Zexiang Xu, Andreas Geiger, Jingyi Yu, and Hao Su. Tensorf: Tensorial radiance fields. In European Conference on Computer Vision, pages 333–350. Springer, 2022.
  • [8] Hanlin Chen, Chen Li, and Gim Hee Lee. Neusg: Neural implicit surface reconstruction with 3d gaussian splatting guidance. arXiv preprint arXiv:2312.00846, 2023.
  • [9] Kai Cheng, Xiaoxiao Long, Kaizhi Yang, Yao Yao, Wei Yin, Yuexin Ma, Wenping Wang, and Xuejin Chen. Gaussianpro: 3d gaussian splatting with progressive propagation. arXiv preprint arXiv:2402.14650, 2024.
  • [10] Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In CVPR, pages 5828–5839, 2017.
  • [11] Jeremy S De Bonet and Paul Viola. Poxels: Probabilistic voxelized volume reconstruction. In Proceedings of International Conference on Computer Vision (ICCV), volume 2, page 2. Citeseer, 1999.
  • [12] Kangle Deng, Andrew Liu, Jun-Yan Zhu, and Deva Ramanan. Depth-supervised nerf: Fewer views and faster training for free. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12882–12891, 2022.
  • [13] Zhiwen Fan, Kevin Wang, Kairun Wen, Zehao Zhu, Dejia Xu, and Zhangyang Wang. Lightgaussian: Unbounded 3d gaussian compression with 15x reduction and 200+ fps. arXiv preprint arXiv:2311.17245, 2023.
  • [14] Sara Fridovich-Keil, Alex Yu, Matthew Tancik, Qinhong Chen, Benjamin Recht, and Angjoo Kanazawa. Plenoxels: Radiance fields without neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5501–5510, 2022.
  • [15] Qiancheng Fu, Qingshan Xu, Yew Soon Ong, and Wenbing Tao. Geo-neus: Geometry-consistent neural implicit surfaces learning for multi-view reconstruction. Advances in Neural Information Processing Systems, 35:3403–3416, 2022.
  • [16] Silvano Galliani, Katrin Lasinger, and Konrad Schindler. Gipuma: Massively parallel multi-view stereo reconstruction. Publikationen der Deutschen Gesellschaft für Photogrammetrie, Fernerkundung und Geoinformation e. V, 25(361-369):2, 2016.
  • [17] Amos Gropp, Lior Yariv, Niv Haim, Matan Atzmon, and Yaron Lipman. Implicit geometric regularization for learning shapes. arXiv preprint arXiv:2002.10099, 2020.
  • [18] Antoine Guédon and Vincent Lepetit. Sugar: Surface-aligned gaussian splatting for efficient 3d mesh reconstruction and high-quality mesh rendering. arXiv preprint arXiv:2311.12775, 2023.
  • [19] Haoyu Guo, Sida Peng, Haotong Lin, Qianqian Wang, Guofeng Zhang, Hujun Bao, and Xiaowei Zhou. Neural 3d scene reconstruction with the manhattan-world assumption. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5511–5520, 2022.
  • [20] Abdullah Hamdi, Luke Melas-Kyriazi, Guocheng Qian, Jinjie Mai, Ruoshi Liu, Carl Vondrick, Bernard Ghanem, and Andrea Vedaldi. Ges: Generalized exponential splatting for efficient radiance field rendering. arXiv preprint arXiv:2402.10128, 2024.
  • [21] Binbin Huang, Zehao Yu, Anpei Chen, Andreas Geiger, and Shenghua Gao. 2d gaussian splatting for geometrically accurate radiance fields. arXiv preprint arXiv:2403.17888, 2024.
  • [22] Po-Han Huang, Kevin Matzen, Johannes Kopf, Narendra Ahuja, and Jia-Bin Huang. Deepmvs: Learning multi-view stereopsis. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2821–2830, 2018.
  • [23] Michael Kazhdan and Hugues Hoppe. Screened poisson surface reconstruction. ACM Transactions on Graphics (ToG), 32(3):1–13, 2013.
  • [24] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4):1–14, 2023.
  • [25] Yixing Lao, Xiaogang Xu, Xihui Liu, Hengshuang Zhao, et al. Corresnerf: Image correspondence priors for neural radiance fields. Advances in Neural Information Processing Systems, 36, 2024.
  • [26] Joo Chan Lee, Daniel Rho, Xiangyu Sun, Jong Hwan Ko, and Eunbyung Park. Compact 3d gaussian representation for radiance field. arXiv preprint arXiv:2311.13681, 2023.
  • [27] Xinghui Li, Yikang Ding, Jia Guo, Xiansong Lai, Shihao Ren, Wensen Feng, and Long Zeng. Edge-aware neural implicit surface reconstruction. In 2023 IEEE International Conference on Multimedia and Expo (ICME), pages 1643–1648. IEEE, 2023.
  • [28] Yanyan Li, Chenyu Lyu, Yan Di, Guangyao Zhai, Gim Hee Lee, and Federico Tombari. Geogaussian: Geometry-aware gaussian splatting for scene rendering. arXiv preprint arXiv:2403.11324, 2024.
  • [29] Zhaoshuo Li, Thomas Müller, Alex Evans, Russell H Taylor, Mathias Unberath, Ming-Yu Liu, and Chen-Hsuan Lin. Neuralangelo: High-fidelity neural surface reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8456–8465, 2023.
  • [30] Zhihao Liang, Zhangjin Huang, Changxing Ding, and Kui Jia. Helixsurf: A robust and efficient neural implicit surface learning of indoor scenes with iterative intertwined regularization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13165–13174, 2023.
  • [31] Jiaqi Lin, Zhihao Li, Xiao Tang, Jianzhuang Liu, Shiyong Liu, Jiayue Liu, Yangdi Lu, Xiaofei Wu, Songcen Xu, Youliang Yan, et al. Vastgaussian: Vast 3d gaussians for large scene reconstruction. arXiv preprint arXiv:2402.17427, 2024.
  • [32] Lingjie Liu, Jiatao Gu, Kyaw Zaw Lin, Tat-Seng Chua, and Christian Theobalt. Neural sparse voxel fields. Advances in Neural Information Processing Systems, 33:15651–15663, 2020.
  • [33] Shaohui Liu, Yinda Zhang, Songyou Peng, Boxin Shi, Marc Pollefeys, and Zhaopeng Cui. Dist: Rendering deep implicit signed distance function with differentiable sphere tracing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2019–2028, 2020.
  • [34] William E Lorensen and Harvey E Cline. Marching cubes: A high resolution 3d surface construction algorithm. In Seminal graphics: pioneering efforts that shaped the field, pages 347–353. 1998.
  • [35] Wenjie Luo, Alexander G Schwing, and Raquel Urtasun. Efficient deep learning for stereo matching. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5695–5703, 2016.
  • [36] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.
  • [37] Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM transactions on graphics (TOG), 41(4):1–15, 2022.
  • [38] Michael Niemeyer, Jonathan T Barron, Ben Mildenhall, Mehdi SM Sajjadi, Andreas Geiger, and Noha Radwan. Regnerf: Regularizing neural radiance fields for view synthesis from sparse inputs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5480–5490, 2022.
  • [39] Michael Niemeyer, Lars Mescheder, Michael Oechsle, and Andreas Geiger. Differentiable volumetric rendering: Learning implicit 3d representations without 3d supervision. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3504–3515, 2020.
  • [40] Michael Oechsle, Songyou Peng, and Andreas Geiger. Unisurf: Unifying neural implicit surfaces and radiance fields for multi-view reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5589–5599, 2021.
  • [41] Gernot Riegler, Ali Osman Ulusoy, Horst Bischof, and Andreas Geiger. Octnetfusion: Learning depth fusion from data. In 2017 International Conference on 3D Vision (3DV), pages 57–66. IEEE, 2017.
  • [42] Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In CVPR, pages 4104–4113, 2016.
  • [43] Johannes L Schönberger, Enliang Zheng, Jan-Michael Frahm, and Marc Pollefeys. Pixelwise view selection for unstructured multi-view stereo. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III 14, pages 501–518. Springer, 2016.
  • [44] Steven M Seitz and Charles R Dyer. Photorealistic scene reconstruction by voxel coloring. International journal of computer vision, 35:151–173, 1999.
  • [45] Yasutaka Furukawa and Jean Ponce. Accurate, dense, and robust multiview stereopsis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(8), 2010.
  • [46] Zhuo Su, Wenzhe Liu, Zitong Yu, Dewen Hu, Qing Liao, Qi Tian, Matti Pietikäinen, and Li Liu. Pixel difference networks for efficient edge detection. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5117–5127, 2021.
  • [47] Jiaming Sun, Xi Chen, Qianqian Wang, Zhengqi Li, Hadar Averbuch-Elor, Xiaowei Zhou, and Noah Snavely. Neural 3d reconstruction in the wild. In ACM SIGGRAPH 2022 conference proceedings, pages 1–9, 2022.
  • [48] Matias Turkulainen, Xuqian Ren, Iaroslav Melekhov, Otto Seiskari, Esa Rahtu, and Juho Kannala. Dn-splatter: Depth and normal priors for gaussian splatting and meshing. arXiv preprint arXiv:2403.17822, 2024.
  • [49] Benjamin Ummenhofer, Huizhong Zhou, Jonas Uhrig, Nikolaus Mayer, Eddy Ilg, Alexey Dosovitskiy, and Thomas Brox. Demon: Depth and motion network for learning monocular stereo. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5038–5047, 2017.
  • [50] Guangcong Wang, Zhaoxi Chen, Chen Change Loy, and Ziwei Liu. Sparsenerf: Distilling depth ranking for few-shot novel view synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9065–9076, 2023.
  • [51] Jiepeng Wang, Peng Wang, Xiaoxiao Long, Christian Theobalt, Taku Komura, Lingjie Liu, and Wenping Wang. Neuris: Neural reconstruction of indoor scenes using normal priors. In European Conference on Computer Vision, pages 139–155. Springer, 2022.
  • [52] Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang. Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. NeurIPS, 2021.
  • [53] Yi Wei, Shaohui Liu, Yongming Rao, Wang Zhao, Jiwen Lu, and Jie Zhou. Nerfingmvs: Guided optimization of neural radiance fields for indoor multi-view stereo. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5610–5619, 2021.
  • [54] Qiangeng Xu, Zexiang Xu, Julien Philip, Sai Bi, Zhixin Shu, Kalyan Sunkavalli, and Ulrich Neumann. Point-nerf: Point-based neural radiance fields. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5438–5448, 2022.
  • [55] Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long Quan. Mvsnet: Depth inference for unstructured multi-view stereo. In Proceedings of the European conference on computer vision (ECCV), pages 767–783, 2018.
  • [56] Yao Yao, Zixin Luo, Shiwei Li, Tianwei Shen, Tian Fang, and Long Quan. Recurrent mvsnet for high-resolution multi-view stereo depth inference. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5525–5534, 2019.
  • [57] Lior Yariv, Jiatao Gu, Yoni Kasten, and Yaron Lipman. Volume rendering of neural implicit surfaces. Advances in Neural Information Processing Systems, 34:4805–4815, 2021.
  • [58] Lior Yariv, Yoni Kasten, Dror Moran, Meirav Galun, Matan Atzmon, Basri Ronen, and Yaron Lipman. Multiview neural surface reconstruction by disentangling geometry and appearance. Advances in Neural Information Processing Systems, 33:2492–2502, 2020.
  • [59] Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. Scannet++: A high-fidelity dataset of 3d indoor scenes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12–22, 2023.
  • [60] Mulin Yu, Tao Lu, Linning Xu, Lihan Jiang, Yuanbo Xiangli, and Bo Dai. Gsdf: 3dgs meets sdf for improved rendering and reconstruction. arXiv preprint arXiv:2403.16964, 2024.
  • [61] Zehao Yu and Shenghua Gao. Fast-mvsnet: Sparse-to-dense multi-view stereo with learned propagation and gauss-newton refinement. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1949–1958, 2020.
  • [62] Zehao Yu, Songyou Peng, Michael Niemeyer, Torsten Sattler, and Andreas Geiger. Monosdf: Exploring monocular geometric cues for neural implicit surface reconstruction. Advances in neural information processing systems, 35:25018–25032, 2022.
  • [63] Sergey Zagoruyko and Nikos Komodakis. Learning to compare image patches via convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4353–4361, 2015.
  • [64] Jingyang Zhang, Yao Yao, Shiwei Li, Tian Fang, David McKinnon, Yanghai Tsin, and Long Quan. Critical regularizations for neural surface reconstruction in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6270–6279, 2022.
  • [65] Jingyang Zhang, Yao Yao, Shiwei Li, Zixin Luo, and Tian Fang. Visibility-aware multi-view stereo network. arXiv preprint arXiv:2008.07928, 2020.

Appendix A Supplementary Material

The supplementary material section is organized as follows: 1) The first section elaborates on the implementation details of GaussianRoom, covering hyper-parameters and geometry evaluation details. 2) We then show additional experimental results. 3) Lastly, we delve into limitations and future directions.

A.1 Implementation Details

During the co-optimization stage of 3DGS and the neural SDF, we apply the SDF-guided Densification and Pruning strategy every 100 iterations and the SDF-guided Global Densification strategy every 2000 iterations instead of the original Gaussian densification and pruning strategy, with the other hyper-parameters remaining largely consistent with 3DGS. For the neural SDF, we sample 1024 rays per batch and 64+64 points per ray. After training, we extract a mesh from the SDF with the Marching Cubes algorithm [34] at a volume resolution of $512^{3}$. For geometry comparison, we use Poisson reconstruction with an octree depth of 11 to obtain meshes for COLMAP [42] and the Gaussian-based methods [24, 9, 48].
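
A minimal sketch of this mesh-extraction step is shown below, assuming the trained SDF network can be queried in chunks over a dense grid; the grid bounds, chunk size, and the use of scikit-image's Marching Cubes are illustrative choices rather than our exact implementation.

```python
import numpy as np
import torch
from skimage import measure

@torch.no_grad()
def extract_mesh(sdf_fn, bound_min, bound_max, resolution=512, chunk=65536):
    """Query the SDF on a dense grid and run Marching Cubes at the zero-level set."""
    axes = [np.linspace(bound_min[i], bound_max[i], resolution, dtype=np.float32) for i in range(3)]
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, 3)

    # A full 512^3 query is memory-heavy; it is chunked here and could be streamed in practice.
    sdf = np.concatenate([
        sdf_fn(torch.from_numpy(p)).squeeze(-1).cpu().numpy()
        for p in np.array_split(grid, max(1, grid.shape[0] // chunk))
    ])
    volume = sdf.reshape(resolution, resolution, resolution)

    spacing = tuple((bound_max[i] - bound_min[i]) / (resolution - 1) for i in range(3))
    verts, faces, _, _ = measure.marching_cubes(volume, level=0.0, spacing=spacing)
    return verts + np.asarray(bound_min), faces
```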

Below, we introduce the hyper-parameters used in our experiments:

  • For the edge-aware neural SDF optimization strategy, the hyperparameter $\delta$ indicating the importance of edges is set to 2.

  • For the resolution of the partitioned scene, we set $\mathrm{N}=128$.

  • For the loss weights discussed in Sec. 4.3, we set $\lambda_{1}=0.8$, $\lambda_{2}=0.01$, and $\lambda_{eik}=0.1$.

  • For initializing new Gaussians from the $K$ nearest neighboring Gaussians, we set $K=10$.

  • We execute the SDF-guided Densification and Pruning strategy every $F_{SDP}$ iterations, with $F_{SDP}=100$.

  • We execute the SDF-guided Global Densification strategy every $F_{SGD}$ iterations, with $F_{SGD}=1999$.

  • The remaining hyperparameters are: $\tau_{s}=0.01$, $\tau_{n}=10$, $\tau_{p}=0.998$, $\tau_{d}=0.999$, $\tau_{g}=0.0002$, $\gamma=5$.

Eikonal Loss Following [17], the loss $\mathcal{L}_{eik}$ that regularizes the gradients of the SDF is defined as:

\mathcal{L}_{eik}=\frac{1}{nq}\sum_{n,q}\left(\|\nabla f_{g}(\mathbf{x}_{n,q})\|_{2}-1\right)^{2},   (14)

where $f_{g}(\cdot)$ represents the geometry network of the neural SDF.
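
A minimal sketch of Eq. 14 using automatic differentiation; the sampled points and the SDF network handle are assumed given and all names are illustrative.

```python
import torch

def eikonal_loss(sdf_fn, points):
    """Eikonal regularizer: penalize deviation of the SDF gradient norm from 1 (Eq. 14)."""
    points = points.clone().requires_grad_(True)
    sdf = sdf_fn(points)
    grad = torch.autograd.grad(sdf.sum(), points, create_graph=True)[0]   # gradient of f_g
    return ((grad.norm(dim=-1) - 1.0) ** 2).mean()
```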

Evaluation Details For 3D geometry evaluation, the meshes produced by some methods [52, 51, 62] cover areas outside the scope of the ground truth, so a direct comparison would be unfair. To address this, similar to [51], we first remove faces that cannot be observed from any input camera pose, and then filter faces in empty areas of the ground truth using 2D masks rendered from the ground truth for each input view. Tab. 4 defines the 3D geometry metrics used in the main paper. Among these metrics, F-score is considered the most representative measure of geometric quality, since it reflects both accuracy and completeness.

3D Metrics
Acc.     $\mathrm{mean}_{p\in P}\left(\min_{p^{*}\in P^{*}}\|p-p^{*}\|\right)$
Comp.    $\mathrm{mean}_{p^{*}\in P^{*}}\left(\min_{p\in P}\|p-p^{*}\|\right)$
Prec.    $\mathrm{mean}_{p\in P}\left(\min_{p^{*}\in P^{*}}\|p-p^{*}\|<0.05\right)$
Recall   $\mathrm{mean}_{p^{*}\in P^{*}}\left(\min_{p\in P}\|p-p^{*}\|<0.05\right)$
F-score  $\frac{2\times\mathrm{Prec}\times\mathrm{Recall}}{\mathrm{Prec}+\mathrm{Recall}}$
Table 4: Definitions of 3D geometry metrics: $p\in P$ and $p^{*}\in P^{*}$ are points sampled from the predicted and ground-truth meshes respectively.
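
The metrics in Tab. 4 can be computed from point sets sampled on the predicted and ground-truth meshes, as in the following sketch; names are illustrative and pairwise distances would be chunked for large point sets.

```python
import torch

def geometry_metrics(pred_pts, gt_pts, thresh=0.05):
    """Accuracy, Completion, Precision, Recall, and F-score from sampled points (Tab. 4)."""
    d = torch.cdist(pred_pts, gt_pts)            # pairwise distances; chunk for large point sets
    d_pred_to_gt = d.min(dim=1).values           # nearest GT point for each predicted point
    d_gt_to_pred = d.min(dim=0).values           # nearest predicted point for each GT point

    acc = d_pred_to_gt.mean()
    comp = d_gt_to_pred.mean()
    prec = (d_pred_to_gt < thresh).float().mean()
    recall = (d_gt_to_pred < thresh).float().mean()
    fscore = 2 * prec * recall / (prec + recall + 1e-8)
    return acc, comp, prec, recall, fscore
```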

A.2 More Visualization Results

In this section, we give more visualization results of our method compared with other methods.

Rendering Comparisons We visually compare rendering results with 3DGS [24], SuGaR [18], and DN-Splatter [48]. As shown in Fig. 6 and Fig. 7, GaussianRoom significantly outperforms other GS-based methods in the highlighted patches, in both texture-less regions and detailed areas.

Figure 6: Qualitative rendering comparisons on ScanNet dataset. As shown from the above highlighted patches, the rendering results of GaussianRoom significantly outperform other GS-based methods, including texture-less regions and details.
Figure 7: Qualitative rendering comparisons on ScanNet and ScanNet++ datasets. As shown from the above highlighted patches, the rendering results of GaussianRoom significantly outperform other GS-based methods, including texture-less regions and details.

Reconstruction Comparisons For the ScanNet++ dataset, we compare with NeRF [36], NeuS [52], MonoSDF [62], SuGaR [18], and DN-Splatter [48]. As shown in Fig. 8, our method reconstructs finer details of indoor scenes.

Figure 8: Qualitative reconstruction comparisons on ScanNet++. For each indoor scene, the first row is the top view of the whole room, and the second row is the details of the masked region.

A.3 Limitations and Future Works

Although the training time of our method is much shorter than that of existing NeRF-based reconstruction methods, the efficiency of our neural SDF still lags behind 3DGS, leading to longer training time than 3DGS alone. Hence, improving the efficiency of the MLP-based neural SDF is a promising direction for future work.