
NeRF-In: Free-Form NeRF Inpainting with RGB-D Priors

Hao-Kang Liu*
National Taiwan University

I-Chao Shen*
The University of Tokyo

Bing-Yu Chen
National Taiwan University

* Authors contributed equally to this work.
Abstract

Though Neural Radiance Field (NeRF) demonstrates compelling novel view synthesis results, it is still unintuitive to edit a pre-trained NeRF because the neural network's parameters and the scene geometry/appearance are often not explicitly associated. In this paper, we introduce the first framework that enables users to remove unwanted objects or retouch undesired regions in a 3D scene represented by a pre-trained NeRF without any category-specific data and training. The user first draws a free-form mask to specify a region containing unwanted objects over a rendered view from the pre-trained NeRF. Our framework then transfers the user-provided mask to other rendered views and estimates guiding color and depth images within these transferred masked regions. Next, we formulate an optimization problem that jointly inpaints the image content in all masked regions across multiple views by updating the NeRF model's parameters. We demonstrate our framework on diverse scenes and show that it obtains visually plausible and structurally consistent results across multiple views while requiring less time and manual effort from the user.

Figure: Given a pre-trained NeRF model, the user can (a) choose a view and (b) draw a mask to specify the unwanted object in the 3D scene. Our framework optimizes the NeRF model based on the user-provided mask and removes the unwanted object in the masked region. The optimized NeRF synthesizes inpainted results that resemble the ground truth in different views.

1 Introduction

Recent advances in neural rendering, such as Neural Radiance Fields (NeRF) [23], have emerged as powerful representations for the task of novel view synthesis, where the goal is to render unseen viewpoints of a scene from a given set of input images. NeRF encodes the volumetric density and color of a scene within the weights of a coordinate-based multi-layer perceptron. Several follow-up works extend the original NeRF to handle different tasks, such as pose estimation [18, 38], 3D-aware image synthesis [32, 25, 4], deformable 3D reconstruction [29, 20, 26], and modeling dynamic scenes [39, 12, 10].

Though NeRF achieves great performance on photo-realistic scene reconstruction and novel view synthesis, enormous challenges remain in editing the geometry and appearance of a scene represented by a pre-trained NeRF model. Unlike traditional image editing, a user needs to transfer their edits on a rendered view to the NeRF model to edit the whole scene, which introduces multiple challenges. First, it is unclear where the edited regions appear on other rendered views. Second, because millions of parameters are used in a pre-trained NeRF model, it is unclear which parameters control the different aspects of the rendered shape and how to change the parameters according to sparse local user input. Previous work [21] enables users to perform color and shape editing on a category-level NeRF. However, such methods require additional category-specific training and data to support the desired edits.

In this paper, we focus on the NeRF inpainting problem, i.e., removing unwanted objects in a 3D scene represented by a pre-trained NeRF. Although we can ask a user to provide a mask and an inpainted image for each rendered view and use these images to train a new NeRF, this has several disadvantages. First, it is labor-intensive to provide masks for many rendered views. Second, inpainting each view separately introduces visual inconsistencies across the inpainted views.

To address these issues, we propose a framework that helps users easily remove unwanted objects by updating a pre-trained NeRF model. Given a pre-trained NeRF, the user first draws a mask over a rendered view. Given the user-drawn mask, our framework renders several views sampled from a preset trajectory. Next, we transfer the user-drawn mask to these sampled views using an existing video object segmentation method [6]. Our framework then generates (i) guiding color image regions using [3] and (ii) guiding depth images using the Bilateral Solver [2] within these masked regions. Note that our framework can use any existing methods for generating the guiding color and depth images. Finally, we formulate an optimization problem that jointly inpaints the image content within all transferred masked regions with respect to the guiding color and depth images.

We demonstrate our framework on several scenes represented by pre-trained NeRFs in the LLFF dataset and show that it generates visually plausible and consistent results. Furthermore, we conduct experiments on a custom dataset to compare the inpainted results against ground-truth results.

2 Related Work

2.1 Novel view synthesis

Constructing novel views of a scene captured by multiple images is a long-standing problem in computer graphics and computer vision. Traditional methods use structure-from-motion [13] and bundle adjustment [35] to reconstruct an explicit point cloud structure and recover camera parameters. Other methods synthesize novel views by interpolating within 4D light fields [11, 16] or by compositing warped layers in the multiplane image (MPI) representation [36, 45]. Recently, coordinate-based neural representations have shown significant promise as an alternative to discrete, grid-based scene representations. Neural Radiance Fields (NeRF) [23] use a multi-layer perceptron (MLP) and positional encoding to model a radiance field at an unprecedented level of fidelity. However, NeRF requires a large number of input images and lengthy training time. Many works attempt to reduce the number of images NeRF requires by introducing depth-supervised losses [7, 31] and category-specific priors [42]. Meanwhile, other works reduce the training time by optimizing voxel grids of features [34, 41] or factoring the radiance field [5].

These recent advances greatly improve the practical use of NeRF. However, it is still unintuitive for a user to edit a pre-trained NeRF model. The main reason is that the neural network of a NeRF model has millions of parameters; which parameters control the different aspects of the rendered shape and how to change them to achieve the desired edits are still unknown. Previous works enable users to select certain objects [30], or to edit a NeRF model using strokes [21], natural language [37], or by directly manipulating a 3D model [44]. However, these methods require learning additional category-level conditional radiance fields or segmentation networks to facilitate such edits. Unlike these methods, our framework does not require the user to prepare any additional category-specific training data or training procedure for removing unwanted objects in a pre-trained NeRF model.

2.2 Image inpainting

In recent years, two broad approaches to image inpainting exist. Patch-based methods [9, 33, 1] fill holes by searching for patches with similar low-level image features such as RGB values. The search space can be the non-hole region of the input image or other reference images. The inpainted results are obtained by a global optimization after the relevant patches are retrieved. These methods often fail to handle large holes where the color and texture variance is high, and they often cannot make semantically aware patch selections. Deep learning-based methods predict the pixel values inside the masks directly in a semantically aware fashion. Thus, they can synthesize more visually plausible content, especially for images of faces [17, 40], objects [28], and natural scenes [14]. However, these methods often focus on regular masks only. To handle irregular masks, partial convolution [19] was proposed, where the convolution is masked and re-normalized to utilize valid pixels only. Yu et al. [43] use a GAN mechanism to maintain local and global consistency in the final results. Nazeri et al. [24] focus on improving the image structure in the inpainting results by conditioning their image inpainting network on edges in the masked regions. MST inpainting [3] and ZITS [8] further consider both edge and line structures to synthesize more reasonable results. In this work, we use the MST inpainting network [3] to obtain the guided inpainted result because of its superior performance on inpainting images while preserving structures. Our framework can replace MST inpainting with other inpainting methods since we only use the inpainted results as a guiding signal for our optimization problem.

Refer to caption
Figure 1: (a) Given a pre-trained NeRF $F_{\Theta}$, a user specifies the unwanted region on a user-chosen view with a user-drawn mask. Our framework samples initial images and initial depth images and generates both guiding images and guiding depth images. (b) Our framework updates $\Theta$ by optimizing both the color-guiding loss ($L_{\text{color}}$) and the depth-guiding loss ($L_{\text{depth}}$). (In the figure, one type of arrow denotes rendering a view from a NeRF model and the other denotes updating $\Theta$ by optimizing the losses.)

3 Method

In this section, we first summarize the mechanism of NeRF [23] and formulate our problem setting.

3.1 Preliminaries: NeRF

NeRF is a continuous volumetric radiance field $F_{\Theta}:(\mathbf{x},\mathbf{d})\rightarrow(\mathbf{c},\sigma)$ represented by an MLP network with $\Theta$ as its weights. It takes a 3D position $\mathbf{x}=\{x, y, z\}$ and a 2D viewing direction $\mathbf{d}=\{\theta, \phi\}$ as input and outputs a volume density $\sigma$ and a directional emitted color $\mathbf{c}$. NeRF renders the color of each camera ray passing through the scene by computing the volume rendering integral using numerical quadrature. The expected color $C(\mathbf{r})$ of camera ray $\mathbf{r}(t)=\mathbf{o}+t\mathbf{d}$ is defined as:

$\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T(t_{i})\,(1-\exp(-\sigma(t_{i})\delta_{i}))\,c(t_{i})$,  (1)
$\text{where}\quad T(t_{i}) = \exp\big(-\sum_{j=1}^{i-1}\sigma(t_{j})\delta_{j}\big)$,  (2)

where $N$ denotes the number of quadrature points sampled between the near plane $t_{n}$ and the far plane $t_{f}$ of the camera, and $\delta_{i}=t_{i+1}-t_{i}$ is the distance between two adjacent points. We denote the color and density at point $t_{i}$ produced by the NeRF model $F_{\Theta}$ as $c(t_{i})$ and $\sigma(t_{i})$.
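The quadrature in Eqs. 1-2 maps directly to a short differentiable routine. Below is a minimal PyTorch sketch (not the authors' implementation) that composites per-sample densities and colors into ray colors; the tensor names `sigmas`, `rgbs`, and `t_vals` are our own.

```python
import torch

def render_ray_colors(sigmas, rgbs, t_vals):
    """Composite per-sample (sigma, color) into ray colors following Eqs. (1)-(2).

    sigmas: (num_rays, N) volume densities at the quadrature points
    rgbs:   (num_rays, N, 3) emitted colors at the quadrature points
    t_vals: (num_rays, N) sample distances along each ray
    """
    # delta_i = t_{i+1} - t_i; pad the last interval with a large value
    deltas = t_vals[..., 1:] - t_vals[..., :-1]
    deltas = torch.cat([deltas, 1e10 * torch.ones_like(deltas[..., :1])], dim=-1)

    alpha = 1.0 - torch.exp(-sigmas * deltas)            # per-segment opacity
    # T(t_i) = prod_{j<i} (1 - alpha_j): exclusive cumulative product (Eq. 2)
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[..., :1]), 1.0 - alpha + 1e-10], dim=-1),
        dim=-1)[..., :-1]
    weights = trans * alpha                              # T(t_i)(1 - exp(-sigma_i * delta_i))

    color = (weights[..., None] * rgbs).sum(dim=-2)      # hat{C}(r), Eq. (1)
    return color, weights
```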

Using the above differentiable rendering equation, we can propagate the errors and update $\Theta$ through the mean squared error:

$\mathcal{L}_{\text{mse}}=\sum_{\mathbf{r}\in\mathcal{R}}\|\hat{C}^{c}(\mathbf{r})-C(\mathbf{r})\|^{2}_{2}+\|\hat{C}^{f}(\mathbf{r})-C(\mathbf{r})\|^{2}_{2}$,  (3)

where $\mathcal{R}$ is a ray batch, and $C(\mathbf{r})$, $\hat{C}^{c}(\mathbf{r})$, $\hat{C}^{f}(\mathbf{r})$ are the ground-truth, coarse-volume predicted, and fine-volume predicted RGB colors for ray $\mathbf{r}$, respectively. For simplicity, we further define $F_{\Theta}^{\text{image}}:\mathbf{o}\rightarrow I$ as a function that takes a camera position $\mathbf{o}$ as input and outputs the rendered image of a pre-trained NeRF model $F_{\Theta}$.
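For completeness, a minimal sketch of the coarse/fine photometric loss in Eq. 3, assuming the ray colors come from a routine like `render_ray_colors` above; we average over the ray batch, whereas Eq. 3 writes a sum, which only changes the scale.

```python
def nerf_mse_loss(coarse_rgb, fine_rgb, target_rgb):
    # Eq. (3): supervise both the coarse and the fine prediction with the ground-truth ray colors.
    return ((coarse_rgb - target_rgb) ** 2).sum(dim=-1).mean() + \
           ((fine_rgb - target_rgb) ** 2).sum(dim=-1).mean()
```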

Refer to caption
Figure 2: Our sampling strategy follows the trajectory in (a). Note that the target scene faces toward the "+y" direction as shown in (b). Each blue dot represents a view we can sample, while the red dot represents the sampled view used in the optimization framework. Also, we use the sampled images to construct the point cloud for better understanding. We conduct all the experiments based on this setting.

3.2 Overview

Given a pre-trained NeRF model $F_{\Theta}$, a user can specify the unwanted region by drawing a mask $M_{u}$ over a user-chosen rendered view $I_{u}=F_{\Theta}^{\text{image}}(\mathbf{o}_{u})$, where $M_{u}=1$ for pixels outside the masked region and $\mathbf{o}_{u}$ is the user-chosen camera position. Our goal is to obtain an updated NeRF model $F_{\tilde{\Theta}}$ such that the unwanted region masked by $M_{u}$ is removed in every rendered view. As shown in Figure 1, our method first samples $K$ camera positions $\mathbf{O}=\{\mathbf{o}_{s}\,|\,s=1\ldots K\}$ along the test-set camera trajectory used in LLFF [22] (Figure 2). For each camera position, we render an RGB image $I_{s}$ and a depth image $D_{s}$ using $F_{\Theta}$, obtaining all rendered views $\mathbf{I}=\{I_{s}\,|\,s=1\ldots K\}$ and their depth images $\mathbf{D}=\{D_{s}\,|\,s=1\ldots K\}$. We use $I_{s}$ and $I(\mathbf{o}_{s})$ interchangeably throughout the paper to represent the image rendered from camera position $\mathbf{o}_{s}$. One could potentially remove the unwanted object specified by $M_{u}$ using the following naive method. First, remove the content within the transferred masked region on each sampled rendered view. Then, update $\Theta$ using only the image content outside all transferred masked regions by optimizing the "masked MSE (mmse)" function:

$L_{\text{mmse}}=\sum_{\mathbf{r}\in\mathcal{R}}\|(\hat{C}^{c}(\mathbf{r})-C(\mathbf{r}))\odot M(\mathbf{r})\|^{2}_{2}+\|(\hat{C}^{f}(\mathbf{r})-C(\mathbf{r}))\odot M(\mathbf{r})\|^{2}_{2}$,  (4)

where $M$ is the transferred mask on the same view as the sampled ray $\mathbf{r}$. However, because there is no explicit guidance on what image content and structure should appear in the masked region, the unwanted object will remain after optimizing Eq. 4. To provide explicit guidance, our method takes the user-drawn mask $M_{u}$, sampled rendered views $\mathbf{I}$, and sampled depth images $\mathbf{D}$ as input, and outputs

  • guiding user-chosen image and guiding depth image: $I_{u}^{G}$ and $D_{u}^{G}$.

  • transferred masks: $\mathbf{M}=\{M_{s}\,|\,s=1\ldots K\}$.

  • guiding sampled images and their guiding depth images: $\mathbf{I}^{G}=\{I_{s}^{G}\,|\,s=1\ldots K\}$ and $\mathbf{D}^{G}=\{D_{s}^{G}\,|\,s=1\ldots K\}$.

Finally, our method obtains the updated parameters $\tilde{\Theta}$ by optimizing our NeRF inpainting formulation $\Phi(M_{u},I_{u}^{G},\mathbf{M},\mathbf{D}^{G})$.
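For concreteness, the per-ray masking in Eq. 4 (the same masking reused outside the masked regions in Eq. 10 below) can be written as a short sketch, assuming `(num_rays, 3)` ray-color tensors from the preliminaries and a per-ray mask that is 1 outside the masked region; the names are ours, not the authors' code.

```python
def masked_mse_loss(coarse_rgb, fine_rgb, target_rgb, mask):
    """Eq. (4): only rays whose pixels fall outside the transferred mask
    (mask == 1) contribute to the photometric error."""
    m = mask[..., None]  # (num_rays,) -> (num_rays, 1), broadcast over RGB channels
    return (((coarse_rgb - target_rgb) * m) ** 2).sum(dim=-1).mean() + \
           (((fine_rgb - target_rgb) * m) ** 2).sum(dim=-1).mean()
```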

3.3 Guiding material generation

For each sampled rendered view $I_{s}$, our goal is to generate a mask $M_{s}$ that covers the same object as the user-drawn mask $M_{u}$. We use a video object segmentation method (STCN) [6] to generate $M_{s}$. With the transferred masks $M_{s}$, we then generate the guiding images and guiding depth images. The guiding image generation can be described as

$I^{G}_{s}=\rho(I_{s},M_{s})$,  (5)

where $I^{G}_{s}$ is the guiding image and $\rho$ is a single-image inpainting method (we use the MST inpainting network [3]). After obtaining $I^{G}_{s}$, we can obtain the guiding depth image using

$D_{s}^{G}=\tau(D_{s},M_{s},I^{G}_{s})$,  (6)

where $D^{G}_{s}$ is the guiding depth image and $\tau$ is a depth-image completion method (we use the Fast Bilateral Solver [2]). Note that our framework can replace $\rho$ with any other single-image inpainting method and $\tau$ with any other depth-image completion method.
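A minimal sketch of Eqs. 5-6 follows: guiding materials are produced view by view, with `rho` and `tau` passed in as callables so that any single-image inpainting or depth-completion method can be plugged in (MST inpainting and the Fast Bilateral Solver in our setting). The function and argument names are illustrative placeholders, not an existing API.

```python
def generate_guiding_materials(images, depths, masks, rho, tau):
    """Eqs. (5)-(6): produce guiding RGB and depth images for every sampled view.

    rho(image, mask)            -> inpainted RGB image   (e.g., MST inpainting [3])
    tau(depth, mask, rgb_guide) -> completed depth image (e.g., Fast Bilateral Solver [2])
    """
    guide_rgbs, guide_depths = [], []
    for img, depth, mask in zip(images, depths, masks):
        rgb_g = rho(img, mask)               # I_s^G = rho(I_s, M_s)
        depth_g = tau(depth, mask, rgb_g)    # D_s^G = tau(D_s, M_s, I_s^G)
        guide_rgbs.append(rgb_g)
        guide_depths.append(depth_g)
    return guide_rgbs, guide_depths
```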

3.4 NeRF inpainting optimization

We obtain the updated parameters $\tilde{\Theta}$ that remove the unwanted object in the 3D scene by optimizing:

$\tilde{\Theta}\coloneqq\operatorname{arg\,min}_{\Theta}\; L_{\text{color}}(\Theta)+L_{\text{depth}}(\Theta)$,  (7)

where $L_{\text{color}}$ is the color-guiding loss and $L_{\text{depth}}$ is the depth-guiding loss.

3.4.1 Color-guiding loss

The color-guiding loss is defined as

$L_{\text{color}}(\Theta)=L_{\text{color}}^{\text{all}}(\Theta)+L_{\text{color}}^{\text{out}}(\Theta)$,  (8)

where $L_{\text{color}}^{\text{all}}$ is defined on views $\mathbf{O}^{\text{all}}$, $L_{\text{color}}^{\text{out}}$ is defined on views $\mathbf{O}^{\text{out}}$, and $\mathbf{O}^{\text{all}}\cup\mathbf{O}^{\text{out}}=\{\mathbf{O},\mathbf{o}_{u}\}$. $L_{\text{color}}^{\text{all}}$ measures the color difference of the entire image (inside and outside of the mask) on the rendered view and is defined as:

$L_{\text{color}}^{\text{all}}(\Theta)=\sum_{\mathbf{o}\in\mathbf{O}^{\text{all}}}F^{\text{image}}_{\Theta}(\mathbf{o})-I_{o}^{G}$.  (9)

$L_{\text{color}}^{\text{out}}$ measures the color difference outside the mask on the rendered view and is defined as:

$L_{\text{color}}^{\text{out}}(\Theta)=\sum_{\mathbf{o}\in\mathbf{O}^{\text{out}}}(F^{\text{image}}_{\Theta}(\mathbf{o})-I_{o}^{G})\odot M_{o}$.  (10)

In our framework, we set $\mathbf{O}^{\text{all}}=\mathbf{o}_{u}$ and $\mathbf{O}^{\text{out}}=\mathbf{O}$.
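A hedged sketch of Eqs. 8-10 follows. Since Eqs. 9-10 are written as raw differences, we assume a squared per-pixel penalty, consistent with the other losses; `render_image`, `guide_rgbs`, and `masks` are placeholder names, not part of the authors' code.

```python
def color_guiding_loss(render_image, views_all, views_out, guide_rgbs, masks):
    """Eqs. (8)-(10); in our setting views_all contains only the user-chosen view
    and views_out contains all K sampled views.

    render_image(v)            -> differentiable rendered RGB image (H, W, 3) for view v
    guide_rgbs[v], masks[v]    -> guiding image I_v^G and transferred mask M_v (H, W)
    """
    loss_all = 0.0
    for v in views_all:          # Eq. (9): full-image guidance, inside and outside the mask
        loss_all = loss_all + ((render_image(v) - guide_rgbs[v]) ** 2).mean()

    loss_out = 0.0
    for v in views_out:          # Eq. (10): guidance restricted to pixels outside the mask
        diff = (render_image(v) - guide_rgbs[v]) * masks[v][..., None]
        loss_out = loss_out + (diff ** 2).mean()

    return loss_all + loss_out   # Eq. (8)
```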

3.4.2 Depth-guiding loss

While we can obtain visually plausible inpainted color results using the color-guiding loss, it often generates incorrect depth, which might cause incorrect geometry and keep parts of the unwanted object in the scene. To fix these incorrect geometries, we introduce a depth-guiding loss, which is defined as:

$L_{\text{depth}}(\Theta)=\sum_{\mathbf{o}_{s}\in\mathbf{O}}\|D^{f}(\mathbf{o}_{s})-D^{G}(\mathbf{o}_{s})\|^{2}_{2}+\|D^{c}(\mathbf{o}_{s})-D^{G}(\mathbf{o}_{s})\|^{2}_{2}$,  (11)

where $D^{f}(\mathbf{o}_{s})$ is the fine-volume predicted depth image, $D^{c}(\mathbf{o}_{s})$ is the coarse-volume predicted depth image, and both depth images are rendered from a sampled camera position $\mathbf{o}_{s}$ using $F_{\Theta}$. We compute the depth $D^{f}(\mathbf{o}_{s})$ by accumulating $\sigma$ along the rays in each batch.
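Below is a minimal sketch of Eq. 11 together with one common way to realize the depth accumulation from the compositing weights; the expected-depth formulation is an assumption on our part, since the paper only states that depth is accumulated from $\sigma$.

```python
def render_ray_depths(weights, t_vals):
    """Expected per-ray depth from the same compositing weights used for color
    (see render_ray_colors above)."""
    return (weights * t_vals).sum(dim=-1)

def depth_guiding_loss(coarse_depth, fine_depth, guide_depth):
    """Eq. (11): pull both the coarse and the fine predicted depth images
    toward the guiding depth image D^G."""
    return ((fine_depth - guide_depth) ** 2).mean() + \
           ((coarse_depth - guide_depth) ** 2).mean()
```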

Refer to caption
Figure 3: Qualitative comparison - LLFF dataset. For each scene, we show the user-chosen view image and the user-provided mask on the left. We then show the color image and depth image generated by different methods: our method (ours), baseline1 (b1), and baseline2 (b2). The depth map of b1 still keeps the depth of the unwanted object. Meanwhile, the color of b2 may introduce noise or shadows in the scene (shown in Horns). Compared to these two baselines, our method produces better color and correct geometry in the final results.
Refer to caption
Figure 4: Qualitative comparison - custom dataset. For each custom scene, we show the ground-truth rendered image and the results generated by our framework, baseline1 (b1), and baseline2 (b2). Our framework generates more accurate depth maps and synthesizes finer structures than baseline1. Compared to baseline2, our framework synthesizes more realistic and sharper results.
Refer to caption
Figure 5: Qualitative comparison - visual consistency. The rendered views generated by baseline1 show severe visual inconsistency across different views (within the red box region). Meanwhile, our method synthesizes visually consistent results across different views.
Refer to caption
Figure 6: Depth-guiding ablation result. We show the optimization results using different guiding losses. We observe that the geometry of the unwanted object cannot be removed using the color-guiding term only (depth inside the blue box). On the other hand, using the depth-guiding term only helps obtain correct geometry but introduces color noise outside the masked region (red box and yellow box). Our method combines both terms to generate correct geometry (blue box) without introducing any color noise (red box and yellow box).
Refer to caption
Figure 7: Depth-guiding loss discussion. The unwanted object in the region with high depth variation can be removed using the depth-guiding loss (red box). However, using only the depth-guiding loss loses color information in the flat region (blue box). Combining both losses removes the unwanted object without losing any color information.
Refer to caption
Figure 8: Color-guiding ablation result. We show the optimization results using different numbers of views to compute $L_{\text{color}}^{\text{all}}$. The visual inconsistency in the masked region (red box) becomes larger as we increase the number of views used to compute $L_{\text{color}}^{\text{all}}$.

4 Experiments and evaluations

In this section, we show qualitative results on the LLFF dataset [22] and our custom dataset, followed by ablation studies.

4.1 Implementation detail

We implement our framework in PyTorch [27] and Python 3.9. We train and test our framework on a machine with an Intel i7-7800X CPU and a GTX 1080 graphics card. For each scene, we first train a model initialized with random weights and optimize it for 200,000 steps with a batch size of 4,096 using Adam [15], which takes about 18 to 20 hours. The numbers of sample points used in the fine and coarse models are 128 and 64, respectively. To inpaint each scene, we optimize Eq. 7 for 50,000 steps, which takes about five hours in total.
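For reference, the hyper-parameters above can be collected into a small configuration sketch. The optimizer construction is generic, and the learning rate shown is an assumed typical value, since it is not stated in the paper.

```python
import torch

# Hyper-parameters reported in Section 4.1.
PRETRAIN_STEPS = 200_000     # per-scene NeRF pre-training steps
INPAINT_STEPS  = 50_000      # steps for the inpainting optimization (Eq. 7)
RAY_BATCH      = 4_096       # rays per batch
N_COARSE, N_FINE = 64, 128   # sample points for the coarse / fine model

def make_optimizer(parameters, lr=5e-4):
    # lr is an assumed value; the paper only specifies the use of Adam.
    return torch.optim.Adam(parameters, lr=lr)
```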

4.2 Evaluation

4.2.1 Datasets

To verify our framework's performance, we create a custom dataset that contains three custom scenes: figyua, desuku, and terebi. The purpose is to obtain ground-truth results for NeRF inpainting. For each custom scene, we collect a pair of photo sets (original and removed). For the original set, we keep all the objects in the scene and take photos from 24 camera positions. For the removed set, we manually remove one object from the scene and take photos from the same 24 camera positions. The original set is used as the input to our framework, and the removed set is used as the ground truth for the results after the inpainting optimization.

4.2.2 Experiment setup

As we are the first to perform free-form inpainting on NeRF, we propose two baseline methods for comparison:

baseline1: per-view color updating

We update the pre-trained NeRF model $F_{\Theta}$ with all guiding images $\mathbf{I}^{G}$ by optimizing

$\tilde{\Theta}\coloneqq\operatorname{arg\,min}_{\Theta}\;\sum_{I_{s}^{G}\in\mathbf{I}^{G}}(F^{\text{image}}_{\Theta}(\mathbf{o}_{s})-I_{s}^{G})$  (12)
baseline2: masked MSE retraining

We re-train a new NeRF model using all guiding images $\mathbf{I}^{G}$ by optimizing:

$\tilde{\Theta}\coloneqq\operatorname{arg\,min}_{\Theta}\;\sum_{I_{s}^{G}\in\mathbf{I}^{G}}(F^{\text{image}}_{\Theta}(\mathbf{o}_{s})-I_{s}^{G})\odot M_{s}$,  (13)

where $\Theta$ is randomly initialized.

Neither baseline1 nor baseline2 considers depth information when updating the pre-trained NeRF model or re-training a new NeRF model.
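For reference, the two baseline objectives (Eqs. 12-13) can be sketched as below; as with the color-guiding loss, we assume a squared per-pixel penalty, and all names are placeholders rather than the authors' code.

```python
def baseline1_loss(render_image, sampled_views, guide_rgbs):
    """Eq. (12): push the pre-trained NeRF toward every per-view inpainted image."""
    return sum(((render_image(v) - guide_rgbs[v]) ** 2).mean() for v in sampled_views)

def baseline2_loss(render_image, sampled_views, guide_rgbs, masks):
    """Eq. (13): retrain from scratch using only pixels outside the transferred masks."""
    return sum((((render_image(v) - guide_rgbs[v]) * masks[v][..., None]) ** 2).mean()
               for v in sampled_views)
```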

We compare the inpainted results of our framework with those of the two baseline methods on the LLFF dataset and our custom dataset described in Section 4.2.1. For the LLFF dataset, we perform a qualitative evaluation by applying the three methods to each pre-trained model. For our custom dataset, we perform both qualitative and quantitative evaluations. For each scene, we train two separate NeRF models on the original set and the removed set. We apply the three methods to the NeRF model pre-trained on the original set, and then compare the inpainted results with the images generated by the NeRF model trained on the removed set.

4.2.3 Results and discussions

LLFF dataset

We show the qualitative comparison between our method and the two baseline methods on the LLFF dataset in Figure 3 and Figure 5. In Figure 3, we observe that the depth maps of the inpainted NeRF generated by baseline1 do not match the inpainted image content. In Figure 5, we show that there are obvious visual inconsistencies between views in the results generated by baseline1. To avoid these visual inconsistencies, we choose to provide color guidance using only the user-chosen view and let the NeRF model maintain view consistency by itself. Baseline2 recovers visually satisfactory image content without any color guidance inside the masked region. However, baseline2 still generates results that lose fine structures or contain unnatural patches in complicated regions, which can be observed in Horns and Orchids.

Custom dataset

We show the qualitative comparison between our method and the two baseline methods on our custom dataset in the teaser figure and Figure 4. In Figure 4, we observe that although baseline1 can synthesize RGB content closer to the ground truth, it still fails to generate correct depth maps. On the other hand, baseline2 recovers the content in the masked region guided by the content from different views but still creates noisy and blurry content. Our framework generates color and depth images closer to the ground-truth rendered results than the two baseline methods.

Overall, our inpainting optimization updates a pre-trained NeRF model to obtain correct geometry and preserve visual consistency across views.

4.3 Ablation study

4.3.1 How important is the depth-guiding loss?

Introducing the depth-guiding loss ($L_{\text{depth}}$) is one of the major contributions of our framework. We validate its effectiveness by comparing the optimization results using the color-guiding loss ($L_{\text{color}}$) only, the depth-guiding loss ($L_{\text{depth}}$) only, and both losses.

We show the results in Figure 6. We observe that optimizing using $L_{\text{depth}}$ only already leads to correct geometry inside the masked region but introduces color noise outside the masked region. Our method optimizes both losses and generates correct geometry without color noise. In Figure 7, we also observe that the unwanted object in a region with high depth variation can be removed using $L_{\text{depth}}$ only (red box). However, using only $L_{\text{depth}}$ loses color information in the flat region (blue box). Our method combines the two losses and removes the unwanted object without losing color information in the flat region.

4.3.2 How important is color-guiding within the masked regions from sampled views?

In our framework, we only use $L_{\text{color}}^{\text{all}}$ to guide the color reconstruction inside the masked regions. We validate this design choice by adjusting the number of views used to compute $L_{\text{color}}^{\text{all}}$ during the optimization.

We compare the results of the following three settings:

  1. Only the user-chosen view is used to guide the color inside the masked region, i.e., $\mathbf{O}^{\text{all}}=\mathbf{o}_{u}$ and $\mathbf{O}^{\text{out}}=\mathbf{O}$.

  2. Three sampled views are used to guide the color inside the masked region, i.e., $\mathbf{O}^{\text{all}}=\{\mathbf{o}_{i},\mathbf{o}_{j},\mathbf{o}_{k}\}$ where $i,j,k$ are randomly sampled, and $\mathbf{O}^{\text{out}}=\mathbf{O}\setminus\mathbf{O}^{\text{all}}$.

  3. All sampled views (i.e., 24) are used to guide the color inside the masked region, i.e., $\mathbf{O}^{\text{all}}=\mathbf{O}$ and $\mathbf{O}^{\text{out}}=\mathbf{o}_{u}$.

As shown in Figure 8, more visual inconsistency is introduced when we use more inpainted images as color guidance. Our framework obtains stable results for most scenes using the user-chosen view only; thus, we choose not to include other inpainted images in the computation of $L_{\text{color}}^{\text{all}}$.

5 Limitations and future work

Fuse color and depth guidance.

Our framework leverages existing image inpainting and depth completion methods to generate the initial guidance materials. When there are appearance or geometry artifacts in the initial guidance materials, the optimized NeRF might output undesired or incorrect results. Meanwhile, our framework shares the same limitations as the image inpainting method we use. For example, our framework fails to inpaint image regions with highly reflective content (Figure 9) or thin structures. It is possible to design a fusion method that fuses color and depth guidance from multiple methods.

Update masks and guidance materials.

In our current framework, we use fixed masks and guidance materials during the optimization. However, this is sub-optimal when the unwanted object is occluded in some views. In the future, we plan to extend our framework to update the masks in every optimization step using 3D volume features extracted from the NeRF model. We also plan to use a discriminator to constrain the synthesized results and improve the view consistency.

Volume feature for mask transferring.

Our current framework uses an existing video object segmentation method to transfer the user-drawn mask. It is possible to perform mask transferring by conducting 3D volume segmentation using the volume features extracted from the pre-trained NeRF.

Refer to caption
Figure 9: Our framework fails to inpaint the masked region on the left (red box) and introduces artifacts in the optimized results.

6 Conclusion

In this paper, we propose the first framework that enables users to remove unwanted objects or retouch undesired regions in a 3D scene represented by a pre-trained NeRF. Our framework requires no additional category-specific data and training. Instead, we formulate a novel optimization that inpaints the pre-trained NeRF with the generated RGB-D guidance. We demonstrated that our framework handles a variety of scenes well, and we validated it using a custom dataset where ground-truth inpainted results are available. We believe that the custom dataset we propose and our framework can foster future research on neural radiance field editing.

References

  • [1] Connelly Barnes, Eli Shechtman, Adam Finkelstein, and Dan B Goldman. Patchmatch: A randomized correspondence algorithm for structural image editing. ACM Trans. Graph., 28(3):24, 2009.
  • [2] Jonathan T. Barron and Ben Poole. The fast bilateral solver. ECCV, 2016.
  • [3] Chenjie Cao and Yanwei Fu. Learning a sketch tensor space for image inpainting of man-made scenes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14509–14518, 2021.
  • [4] Eric R Chan, Marco Monteiro, Petr Kellnhofer, Jiajun Wu, and Gordon Wetzstein. pi-gan: Periodic implicit generative adversarial networks for 3d-aware image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5799–5809, 2021.
  • [5] Anpei Chen, Zexiang Xu, Andreas Geiger, Jingyi Yu, and Hao Su. Tensorf: Tensorial radiance fields. arXiv preprint arXiv:2203.09517, 2022.
  • [6] Ho Kei Cheng, Yu-Wing Tai, and Chi-Keung Tang. Rethinking space-time networks with improved memory coverage for efficient video object segmentation. Advances in Neural Information Processing Systems, 34, 2021.
  • [7] Kangle Deng, Andrew Liu, Jun-Yan Zhu, and Deva Ramanan. Depth-supervised nerf: Fewer views and faster training for free. arXiv preprint arXiv:2107.02791, 2021.
  • [8] Qiaole Dong, Chenjie Cao, and Yanwei Fu. Incremental transformer structure enhanced image inpainting with masking positional encoding. arXiv preprint arXiv:2203.00867, 2022.
  • [9] Alexei A Efros and Thomas K Leung. Texture synthesis by non-parametric sampling. In Proceedings of the seventh IEEE international conference on computer vision, volume 2, pages 1033–1038. IEEE, 1999.
  • [10] Chen Gao, Ayush Saraf, Johannes Kopf, and Jia-Bin Huang. Dynamic view synthesis from dynamic monocular video. arXiv preprint arXiv:2105.06468, 2021.
  • [11] Steven J Gortler, Radek Grzeszczuk, Richard Szeliski, and Michael F Cohen. The lumigraph. In Proceedings of the 23rd annual conference on Computer graphics and interactive techniques, pages 43–54, 1996.
  • [12] Yudong Guo, Keyu Chen, Sen Liang, Yong-Jin Liu, Hujun Bao, and Juyong Zhang. Ad-nerf: Audio driven neural radiance fields for talking head synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5784–5794, 2021.
  • [13] Richard Hartley and Andrew Zisserman. Multiple view geometry in computer vision. Cambridge university press, 2003.
  • [14] Satoshi Iizuka, Edgar Simo-Serra, and Hiroshi Ishikawa. Globally and locally consistent image completion. ACM Transactions on Graphics (ToG), 36(4):1–14, 2017.
  • [15] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [16] Marc Levoy and Pat Hanrahan. Light field rendering. In Proceedings of the 23rd annual conference on Computer graphics and interactive techniques, pages 31–42, 1996.
  • [17] Yijun Li, Sifei Liu, Jimei Yang, and Ming-Hsuan Yang. Generative face completion. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3911–3919, 2017.
  • [18] Chen-Hsuan Lin, Wei-Chiu Ma, Antonio Torralba, and Simon Lucey. Barf: Bundle-adjusting neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5741–5751, 2021.
  • [19] Guilin Liu, Fitsum A Reda, Kevin J Shih, Ting-Chun Wang, Andrew Tao, and Bryan Catanzaro. Image inpainting for irregular holes using partial convolutions. In Proceedings of the European conference on computer vision (ECCV), pages 85–100, 2018.
  • [20] Lingjie Liu, Marc Habermann, Viktor Rudnev, Kripasindhu Sarkar, Jiatao Gu, and Christian Theobalt. Neural actor: Neural free-view synthesis of human actors with pose control. ACM SIGGRAPH Asia, 2021.
  • [21] Steven Liu, Xiuming Zhang, Zhoutong Zhang, Richard Zhang, Jun-Yan Zhu, and Bryan Russell. Editing conditional radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5773–5783, 2021.
  • [22] Ben Mildenhall, Pratul P. Srinivasan, Rodrigo Ortiz-Cayon, Nima Khademi Kalantari, Ravi Ramamoorthi, Ren Ng, and Abhishek Kar. Local light field fusion: Practical view synthesis with prescriptive sampling guidelines. ACM Transactions on Graphics (TOG), 2019.
  • [23] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020.
  • [24] Kamyar Nazeri, Eric Ng, Tony Joseph, Faisal Z Qureshi, and Mehran Ebrahimi. Edgeconnect: Generative image inpainting with adversarial edge learning. arXiv preprint arXiv:1901.00212, 2019.
  • [25] Michael Niemeyer and Andreas Geiger. Giraffe: Representing scenes as compositional generative neural feature fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11453–11464, 2021.
  • [26] Keunhong Park, Utkarsh Sinha, Jonathan T Barron, Sofien Bouaziz, Dan B Goldman, Steven M Seitz, and Ricardo Martin-Brualla. Nerfies: Deformable neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5865–5874, 2021.
  • [27] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
  • [28] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2536–2544, 2016.
  • [29] Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. D-NeRF: Neural radiance fields for dynamic scenes. https://arxiv.org/abs/2011.13961, 2020.
  • [30] Zhongzheng Ren, Aseem Agarwala, Bryan Russell, Alexander G. Schwing, and Oliver Wang. Neural volumetric object selection. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. (Alphabetical author ordering.)
  • [31] Barbara Roessle, Jonathan T Barron, Ben Mildenhall, Pratul P Srinivasan, and Matthias Nießner. Dense depth priors for neural radiance fields from sparse input views. arXiv preprint arXiv:2112.03288, 2021.
  • [32] Katja Schwarz, Yiyi Liao, Michael Niemeyer, and Andreas Geiger. Graf: Generative radiance fields for 3d-aware image synthesis. Advances in Neural Information Processing Systems, 33:20154–20166, 2020.
  • [33] Denis Simakov, Yaron Caspi, Eli Shechtman, and Michal Irani. Summarizing visual data using bidirectional similarity. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2008.
  • [34] Cheng Sun, Min Sun, and Hwann-Tzong Chen. Direct voxel grid optimization: Super-fast convergence for radiance fields reconstruction. arXiv preprint arXiv:2111.11215, 2021.
  • [35] Bill Triggs, Philip F McLauchlan, Richard I Hartley, and Andrew W Fitzgibbon. Bundle adjustment—a modern synthesis. In International workshop on vision algorithms, pages 298–372. Springer, 1999.
  • [36] Richard Tucker and Noah Snavely. Single-view view synthesis with multiplane images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 551–560, 2020.
  • [37] Can Wang, Menglei Chai, Mingming He, Dongdong Chen, and Jing Liao. Clip-nerf: Text-and-image driven manipulation of neural radiance fields. arXiv preprint arXiv:2112.05139, 2021.
  • [38] Zirui Wang, Shangzhe Wu, Weidi Xie, Min Chen, and Victor Adrian Prisacariu. Nerf–: Neural radiance fields without known camera parameters. arXiv preprint arXiv:2102.07064, 2021.
  • [39] Wenqi Xian, Jia-Bin Huang, Johannes Kopf, and Changil Kim. Space-time neural irradiance fields for free-viewpoint video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9421–9431, 2021.
  • [40] Raymond A Yeh, Chen Chen, Teck Yian Lim, Alexander G Schwing, Mark Hasegawa-Johnson, and Minh N Do. Semantic image inpainting with deep generative models. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5485–5493, 2017.
  • [41] Alex Yu, Sara Fridovich-Keil, Matthew Tancik, Qinhong Chen, Benjamin Recht, and Angjoo Kanazawa. Plenoxels: Radiance fields without neural networks. arXiv preprint arXiv:2112.05131, 2021.
  • [42] Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. pixelnerf: Neural radiance fields from one or few images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4578–4587, 2021.
  • [43] Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S Huang. Generative image inpainting with contextual attention. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5505–5514, 2018.
  • [44] Yu-Jie Yuan, Yang-Tian Sun, Yu-Kun Lai, Yuewen Ma, Rongfei Jia, and Lin Gao. Nerf-editing: geometry editing of neural radiance fields. arXiv preprint arXiv:2205.04978, 2022.
  • [45] Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: Learning view synthesis using multiplane images. arXiv preprint arXiv:1805.09817, 2018.