\UseRawInputEncoding

¹¹institutetext: Computer Vision Group, Chalmers University of Technology
¹¹email: {bjosef,david.nilsson,chetsung,fredrik.kahl}@chalmers.com ²²institutetext: KTH Royal Institute of Technology
²²email: [email protected]

Adjustable Visual Appearance for Generalizable Novel View Synthesis

Josef Bengtson 11 David Nilsson 11 Che-Tsung Lin 11 Marcel Büsching 22 Fredrik Kahl 11

Abstract

We present a generalizable novel view synthesis method which enables modifying the visual appearance of an observed scene so rendered views match a target weather or lighting condition without any scene specific training or access to reference views at the target condition. Our method is based on a pretrained generalizable transformer architecture and is fine-tuned on synthetically generated scenes under different appearance conditions. This allows for rendering novel views in a consistent manner for 3D scenes that were not included in the training set, along with the ability to (i) modify their appearance to match the target condition and (ii) smoothly interpolate between different conditions. Experiments on real and synthetic scenes show that our method is able to generate 3D consistent renderings while making realistic appearance changes, including qualitative and quantitative comparisons. Please refer to our project page for video results: https://ava-nvs.github.io

Keywords:

3D Style transfer Generalizable Novel View Synthesis NeRFs.

1 Introduction

The field of novel view synthesis has seen rapid progress in the last few years after the success of Neural Radiance Fields (NeRFs) [25] and follow-up works [26, 24, 1]. A desired quality for these types of 3D scene representations is to be able to disentangle different scene properties from each other, for instance, being able to change the visual appearance without changing the content of the scene. There exists some works in this direction [24, 36], but they are limited to interpolating between observed visual appearances of the 3D scene, thus requiring images of the scene with the desired visual appearance. In contrast, we develop a method that is able to generalize to 3D scenes not used in training, and that thus can adjust the appearance of a scene without having access to any images of that scene at the target visual appearance, see Fig. 1.

Refer to caption — Figure 1: Given multiple views of a scene in one weather and lighting condition, we want to generate novel views of the given scene with adjusted visual appearance corresponding to a target condition without scene specific optimization.

For traditional NeRF-based methods, the properties of the 3D scene are encoded in the weights of a multilayer perceptron (MLP), so each trained model is exclusive to that particular scene. A main challenge is thus that a separate optimization process has to be performed for each individual scene. One approach to handle this is to find ways to improve the efficiency of the training process [26, 21]. A different approach is to avoid per-scene training and instead train cross-scene generalizable methods [46, 43, 5, 23, 35], which are able to synthesize novel views of a scene given just images and corresponding camera poses, and do not require expensive scene-specific optimization.

We present a generalizable novel view synthesis method that allows for changing the visual appearance of a scene while ensuring multi-view consistency. For this we build upon Generalizable NeRF Transformer (GNT) [35], a transformer-based [39] novel view synthesis method. Specifically, we introduce a latent appearance variable to enable the control of the visual appearance of rendered views. By using a generalizable NeRF model and the introduced latent appearance variable, we are able to render novel views and change the appearance of scenes that are not seen when training our model without the need for observations of the scene at the target appearance. We will release code and trained models.

In summary, our main contributions are:

•

We introduce a method that allows for changing the appearance of a novel scene, while ensuring multi-view consistency, by using a latent appearance variable conditioned on a target visual appearance.
•

We propose a novel loss function which is designed to align the views rendered with a target appearance to the scene observed in that target condition, which enables jointly learning novel view synthesis and appearance change.
•

We create a synthetic dataset containing urban scenes, with each scene available at four different diverse weather and lighting conditions. The dataset is used for training our model for visual appearance change and enables quantitative evaluation. The dataset will be made publicly available.

2 Related work

Here we will review progress on NeRFs, focusing on generalizable methods. We will then review 2D style transfer methods and stylized NeRFs methods.

Neural Radiance Fields.

NeRFs [25] synthesize consistent and photo-realistic novel views of a scene, by representing each scene as a continuous 5D radiance field parameterized by an MLP mapping 3D positions and 2D viewing directions to volume densities and view-dependent emitted radiances. Views are synthesized by querying points along camera rays and and utilizing volumetric rendering to aggregate the output colors and densities into RGB values. There have been several works improving NeRFs further, e.g. to improve the efficiency [26, 21] and handling few input views [46, 27].

Generalizable Novel View Synthesis.

The original NeRF methodology is constrained to training a neural network for representing a single scene, optimizing from scratch for each new scene, without leveraging any prior knowledge. Methods for generalizable neural rendering address this limitation by training on multiple scenes, enabling the learning of a general understanding of how to utilize source observations to synthesize novel views. Earlier methods such as [46, 5] use a multilayer perceptron (MLP) conditioned on feature vectors extracted from the source images to predict color and radiance values which are aggregated with volumetric rendering. To enhance generalization capabilities and rendering quality, recent approaches have incorporated transformer-based architectures [39, 9] for feature aggregation from the source images [23, 42], computing densities along the camera ray [43], and even for the entire rendering pipeline [34, 35, 33, 11]. While these methods have demonstrated impressive rendering quality, they are currently incapable of modifying the appearance of the rendered views.

2D Style Transfer.

The success of Generative Adversarial Networks (GANs) [13] has largely driven advances in 2D style transfer. Methods such as Pix2Pix [20], Pix2Pix-HD [44], and BicycleGAN [50] utilize paired training data, which consists of corresponding images in the source and target conditions. CycleGAN [49] and CyEDA [2] employ cycle-consistency constraints to learn from unpaired data. NICE-GAN [6] reuses the discriminator for encoding the images of the target domain. In addition to GANs, the style-attentional network (SANet) [28] can synthesize a content image with the style of another image. Diffusion models [8, 17] have recently achieved superior results in image generation. Palette [32] introduced the first diffusion-based paired image-translation model, and DiffuseIT [22] recently presented a diffusion-based unsupervised image translation method. Instruct-Pix2Pix [3] re-trains a latent diffusion model [30] using paired images generated by prompt-to-prompt [16] and massive instructions generated by GPT-3 [4] to facilitate instruction-based style transfer. While the images translated by these 2D style transfer methods can individually appear realistic, they do not ensure temporal consistency. In contrast, our method inherently ensures 3D consistency. We experimentally compare our results with 2D style transfer methods applied frame by frame on rendered views.

Visual Appearance Change for NeRF Models.

Prior work to enable changing the visual appearance of a NeRF model [24, 36] typically assign an appearance embedding vector to each image which affect the appearance but not geometry, and is optimized alongside the NeRF model parameters. In [24], low dimensional embeddings allows for smooth interpolation between lighting conditions. One limitation of this approach is that it requires access to images of the scene at both lighting conditions as input. In contrast, our method is a generalizable method that does not require images at both lighting conditions as input when rendering novel views with changed visual appearance. Another line of research is to edit the style of a NeRF model based on a given style prompt [18, 14, 19, 47] typically given as a reference image. More recent works [40, 41] use the joint language-image embedding space of CLIP [29] to enable specifying the desired style using a text prompt. These methods focus on artistic style changes and have thus not been specifically trained and evaluated on realistic appearance changes such as differences in weather or lighting, in contrast to our method. The recent method Instruct-NeRF2NeRF [15] enables editing a NeRF model based on a text-prompt, by iteratively updating dataset images using a pretrained 2D editing model [3]. This method allows for a variety of appearance changes since they utilize pre-trained diffusion models to perform the editing, but they require training per-scene NeRF models and additional per-scene optimization, in contrast to our method that does not require per-scene training.

3 Method

We now give an overview of Generalizable NeRF Transformer (GNT) [35] and present our method for adjusting the visual appearance of synthesized views.

3.1 Basics of GNT

GNT utilizes a two-stage transformer-based architecture that allows for novel view synthesis from source views. The first stage is a view transformer that aggregates information from neighboring views using epipolar geometry. The second stage is a ray transformer that performs rendering.

View Transformer.

The view transformer computes a coordinate aligned feature field $\mathcal{F}:(\mathbf{x},\mathbf{\theta})\rightarrow\mathbf{f}\in\mathbb{R}^{d}$ that maps a 3D position $\mathbf{x}$ and viewing direction $\mathbf{\theta}$ to a feature vector $\mathbf{f}$ . Firstly each source view is encoded to a feature map using a U-Net [31] Image Encoder $\mathbf{F}_{i}=\text{U-Net}(\mathbf{I}_{i})$ . The feature representation of a 3D point $\mathbf{x}$ is obtained by projecting it to every source image via the projections $\Pi_{i}(\mathbf{x})$ and fetching the corresponding value of $\mathbf{F}_{i}$ . The view transformer (VT) is then used to combine all these feature vectors through attention as

\mathcal{F}(\mathbf{x},\mathbf{\theta})=\text{VT}(\mathbf{F}_{1}(\Pi_{1}(\mathbf{x}),\mathbf{\theta}),\cdots,\mathbf{F}_{N}(\Pi_{N}(\mathbf{x}),\mathbf{\theta})).

(1)

Ray Transformer.

The ray transformer aggregates information along a given camera ray by performing attention between feature values $\mathbf{f}_{i}=\mathcal{F}(\mathbf{x}_{i},\mathbf{\theta})$ on the ray. The GNT pipeline consists of stacking several view and ray transformers, which iteratively refines the feature field. The final ray transformer computes the RGB value $\mathbf{C(r)}$ corresponding to a camera ray $\mathbf{r}$ by feeding the sequence of feature vectors along the ray $\{\mathbf{f}_{1},\cdots,\mathbf{f}_{M}\}$ into the ray transformer, performing mean pooling followed by an MLP as

\mathbf{C(r)}=\text{MLP}\circ\text{Mean}\circ\text{RT}(\mathbf{f}_{1},\ldots,\mathbf{f}_{M}).

(2)

This enables training the method using the standard color prediction loss term commonly used by NeRFs. The attention values from the ray transformer correspond to the importance of each feature vector $\mathbf{f}_{i}$ along the ray, filling a similar role as the opacity in a traditional NeRF method.

3.2 Adjusting Visual Appearance

To change the visual appearance of rendered views to match a target appearance, we propose to introduce a latent appearance variable $\mathbf{z_{c^{\prime}}}$ as an additional input to the ray transformer, to condition the rendering on the target appearance. The proposed architecture can be seen in Fig. 2.

The latent variable should correspond to a predefined appearance condition and the value for each condition is jointly optimized with the rest of the network. Since our goal is to change the visual appearance, we include $\mathbf{z_{c^{\prime}}}$ so that the geometry is kept unchanged. To ensure this, it is used to update the value-tokens in the ray transformer while keeping the attention values unchanged, i.e., $V_{c^{\prime}}=f_{z}([V;\bf{z}_{c^{\prime}}])$ where $f_{z}$ is a single layer MLP that takes in the original value tokens $V$ concatenated with the latent appearance variable $\mathbf{z_{c^{\prime}}}$ and generates new value tokens $V_{c^{\prime}}$ . This enables computing a visual appearance change loss,

\mathcal{L}_{appearance}=\left\|\mathbf{C}(r,\mathbf{I}_{s,c},\mathbf{z}_{c^{\prime}})-\mathbf{\hat{C}}(r,\mathbf{I}_{t,c^{\prime}})\right\|_{2}^{2}.

(3)

This loss enforces that when inputting source views $\mathbf{I}_{s,c}$ from the condition $c$ together with the latent appearance variable $\mathbf{z}_{c^{\prime}}$ of the condition $c^{\prime}$ , then the predicted color $\mathbf{C}(r,\mathbf{I}_{s,c},\mathbf{z}_{c^{\prime}})$ should match the ground truth color $\mathbf{\hat{C}}(r,\mathbf{I}_{t,c^{\prime}})$ for the corresponding target images $\mathbf{I}_{t,c^{\prime}}$ , making it possible for the method to learn to adapt the appearance to match a target condition. If the condition for the target image $\mathbf{I}_{t,c}$ corresponds to that of the source images $\mathbf{I}_{s,c}$ , then this becomes a traditional reconstruction loss $\mathcal{L}_{rec}$ , and our full loss is $\mathcal{L}=\mathcal{L}_{rec}+\mathcal{L}_{appearance}$ . Rendering images with changed visual appearance is done by computing the rendered color $\mathbf{C}(r,I_{s,c},\mathbf{z}_{c^{\prime}})$ for all pixels in an image, giving as input source views from one condition and a latent variable $\mathbf{z}_{c^{\prime}}$ corresponding to the desired target condition.

4 Experiments

Qualitative and quantitative experiments are performed to test our method’s ability to adapt the visual appearance of real and synthetic scenes that have not been seen during training.

Dataset.

The used dataset is generated using the autonomous driving simulator CARLA [10], which enables the generation of synthetic images within a simulated city environment along with their ground truth camera poses. Additionally, weather and lighting conditions can easily be changed.

For our experiments, four conditions were defined, corresponding to night, day, rain and evening. A scene was defined as a sequence of 10 observations taken along a road. With four different conditions, this led to a total of 40 images per scene. All generated images are $800\times 600$ pixels. The CARLA map was split into two regions, one to generate 145 training scenes, and the other to generate 38 evaluation scenes, ensuring separation between training and evaluation scenes. This dataset alongside the code will be made publicly available. We also show qualitative examples evaluating our trained model on scenes from the Spaces dataset [12] to show that our method can generalize to real images.

Implementation.

The model was initialized with weights from a GNT network pretrained on a combination of synthetic and real data [35]. The model was trained to perform visual appearance change using the 145 training scenes from the introduced CARLA dataset, including the proposed appearance change loss term (3). The training was performed on a single A100 GPU, taking approximately 8 hours, and the method was then able to generalize to scenes not seen during training. Note that the model was trained for all training scenes at the same time, and there is no scene-specific training for the test scenes. When we test the model, we only use images of the test scene in the source condition $\mathbf{I}_{s,c}$ and not any images of the test scene in the target condition $c^{\prime}\neq c$ .

Baselines.

The GAN-based methods, such as Pix2Pix-HD [44], BicycleGAN [50], NICE-GAN[6], and CyEDA [2], along with the diffusion model-based method Palette [32], have been retrained using our synthetic dataset. The reference-based methods, DiffuseIT [22] and SANet [28], are capable of performing image translation using a reference images at the target condition from the synthetic dataset. Instruct-Pix2Pix [3] is pre-trained on editing images based on text prompts and was not retrained on our synthetic dataset.

Furthermore, we compared our method with the Instruct-NeRF2NeRF [15] model, utilizing the official implementation that employs the Nerfstudio [37] Nerfacto NeRF model. Due to the unsatisfactory quality of the Nerfacto models when using 10 images, we increased the number of images in the sequence to 25 images. More details can be found in appendix 0.C.

Table 1: Comparison of similarity of rendered views for our method with ground truth images for all combinations of weather and lighting conditions (PSNR

\uparrow

|

SSIM

\uparrow

|

LPIPS

\downarrow

). The values along the diagonal correspond to novel view synthesis without appearance change. The off-diagonal values correspond to evaluating novel views with changed visual appearance to match the target condition.

From Day From Night From Evening From Rain Into Day 23.9 $|$ 0.77 $|$ 0.60 15.3 $|$ 0.56 $|$ 0.62 16.7 $|$ 0.64 $|$ 0.61 15.7 $|$ 0.59 $|$ 0.60 Into Night 21.0 $|$ 0.56 $|$ 0.55 27.4 $|$ 0.68 $|$ 0.57 20.7 $|$ 0.54 $|$ 0.55 21.2 $|$ 0.57 $|$ 0.55 Into Evening 24.1 $|$ 0.75 $|$ 0.58 20.0 $|$ 0.62 $|$ 0.57 25.4 $|$ 0.76 $|$ 0.58 21.4 $|$ 0.69 $|$ 0.57 Into Rain 23.4 $|$ 0.71 $|$ 0.58 21.7 $|$ 0.66 $|$ 0.57 21.3 $|$ 0.69 $|$ 0.56 26.8 $|$ 0.78 $|$ 0.58

Table 2: Qualitative comparison of rendering quality against 2D style transfer methods (PSNR

\uparrow

|

SSIM

\uparrow

|

LPIPS

\downarrow

). We observe that our method outperforms all 2D style transfer methods on these metrics, with significant increases in performance on PSNR and SSIM for most scenarios

Type Method Scenarios Day to Night Day to Evening Day to Rain Night to Day Evening to Day Rain to Day Non- diffusion Pix2Pix-HD [44] 19.7 $|$ 0.36 $|$ 0.565 18.4 $|$ 0.35 $|$ 0.603 19.7 $|$ 0.53 $|$ 0.582 13.8 $|$ 0.40 $|$ 0.629 15.3 $|$ 0.46 $|$ 0.619 13.9 $|$ 0.43 $|$ 0.629 BicycleGAN [50] 19.0 $|$ 0.38 $|$ 0.556 18.8 $|$ 0.41 $|$ 0.587 22.7 $|$ 0.66 $|$ 0.578 14.2 $|$ 0.47 $|$ 0.630 15.9 $|$ 0.56 $|$ 0.627 15.0 $|$ 0.54 $|$ 0.630 NICE-GAN [6] 18.3 $|$ 0.29 $|$ 0.553 18.8 $|$ 0.39 $|$ 0.589 20.8 $|$ 0.56 $|$ 0.583 12.9 $|$ 0.29 $|$ 0.626 14.6 $|$ 0.45 $|$ 0.618 14.3 $|$ 0.47 $|$ 0.624 CyEDA [2] 17.9 $|$ 0.32 $|$ 0.556 18.8 $|$ 0.40 $|$ 0.597 20.0 $|$ 0.67 $|$ 0.579 11.7 $|$ 0.47 $|$ 0.625 14.0 $|$ 0.59 $|$ 0.633 12.3 $|$ 0.53 $|$ 0.634 SANet [28] 18.9 $|$ 0.50 $|$ 0.571 20.2 $|$ 0.64 $|$ 0.606 20.1 $|$ 0.66 $|$ 0.581 14.5 $|$ 0.52 $|$ 0.629 15.5 $|$ 0.59 $|$ 0.618 12.6 $|$ 0.45 $|$ 0.616 Diffusion Models Palette [32] 19.4 $|$ 0.42 $|$ 0.577 20.6 $|$ 0.54 $|$ 0.618 22.6 $|$ 0.66 $|$ 0.601 12.1 $|$ 0.41 $|$ 0.689 9.8 $|$ 0.38 $|$ 0.688 9.8 $|$ 0.38 $|$ 0.695 DiffuseIT [22] 17.1 $|$ 0.32 $|$ 0.594 17.2 $|$ 0.44 $|$ 0.627 16.0 $|$ 0.43 $|$ 0.613 11.2 $|$ 0.35 $|$ 0.626 12.3 $|$ 0.38 $|$ 0.618 11.8 $|$ 0.35 $|$ 0.625 Instruct-Pix2Pix [3]¹¹1Instruct-Pix2Pix is pre-trained on editing images based on text prompts and was therefore not retrained on our synthetic dataset. 15.9 $|$ 0.34 $|$ 0.579 14.3 $|$ 0.53 $|$ 0.653 14.1 $|$ 0.53 $|$ 0.638 11.5 $|$ 0.46 $|$ 0.647 8.7 $|$ 0.34 $|$ 0.674 13.4 $|$ 0.52 $|$ 0.640 Ours 21.0 $|$ 0.56 $|$ 0.549 24.1 $|$ 0.75 $|$ 0.585 23.4 $|$ 0.71 $|$ 0.577 15.3 $|$ 0.56 $|$ 0.624 16.7 $|$ 0.64 $|$ 0.615 15.7 $|$ 0.59 $|$ 0.602

Table 3: Quantitative comparison of the consistency of novel view rendering against 2D style transfer methods and instruct-NeRF2NeRF (tOF

\downarrow

|

tLP

\downarrow

[7]). We can observe that our method significantly outperforms most of the 2D methods. Please see our project page for a video illustrating the rendering consistency: https://ava-nvs.github.io

Type	Method	Scenarios
Type	Method	Day to Night	Day to Evening	Day to Rain	Night to Day
Non-diffusion	Pix2Pix-HD [44]	2.59 $\|$ 0.147	1.58 $\|$ 0.169	1.60 $\|$ 0.030	2.49 $\|$ 0.078
	BicycleGAN [50]	5.13 $\|$ 0.053	4.79 $\|$ 0.083	5.10 $\|$ 0.024	5.20 $\|$ 0.047
	NICE-GAN [6]	1.93 $\|$ 0.040	1.24 $\|$ 0.081	1.25 $\|$ 0.014	2.09 $\|$ 0.055
	CyEDA [2]	1.62 $\|$ 0.027	1.21 $\|$ 0.115	0.96 $\|$ 0.022	1.25 $\|$ 0.032
	SANet [28]	2.37 $\|$ 0.069	2.05 $\|$ 0.097	1.73 $\|$ 0.088	2.01 $\|$ 0.092
Diffusion Models	Palette [32]	10.12 $\|$ 0.115	7.21 $\|$ 0.109	8.41 $\|$ 0.057	18.77 $\|$ 0.050
	DiffuseIT [22]	24.62 $\|$ 0.236	29.23 $\|$ 0.242	27.43 $\|$ 0.166	29.75 $\|$ 0.188
	Instruct-Pix2Pix [3]	1.62 $\|$ 0.123	1.37 $\|$ 0.121	1.45 $\|$ 0.088	1.48 $\|$ 0.092
NeRF Editing	IN2N [15]	7.29 $\|$ 0.137	6.09 $\|$ 0.085	5.65 $\|$ 0.051	- $\|$ - ¹¹1We could not get satisfactory renderings for this condition, more details can be found in appendix 0.C.
Ours	Ours	1.44 $\|$ 0.026	1.10 $\|$ 0.035	0.97 $\|$ 0.013	1.25 $\|$ 0.032

Qualitative Results.

Our model is evaluated on the 38 evaluation scenes not seen during training. The method is capable of synthesizing novel views using only a set of images with corresponding camera poses. Furthermore, it is able to adapt the visual appearance of the scene to specified weather and lighting conditions, without having access to observations of the scene under those target conditions. We show several qualitative examples of this. Fig. 3 shows that our method is able to change the visual appearance of images to match a target weather and lighting condition, and Fig. 4 shows a comparison with other methods.

It also becomes possible to interpolate between two latent variables corresponding to different conditions by defining $\mathbf{z}_{\alpha}=\alpha\mathbf{z_{c}}+(1-\alpha)\mathbf{z_{c^{\prime}}}$ for $\alpha\in[0,1]$ . In Fig. 5, we observe that this enables getting realistic intermediate visual appearances that are not included in the original images. The model trained on appearance change of synthetic scenes can also be applied to change appearance of real scenes [12], as seen in Fig. 6 where we can see realistic appearance changes even though the model is not trained on any scenes from that dataset.

Quantitative Results.

We now show quantitative rendering quality results. We show PSNR, SSIM [45] and LPIPS [48], where the images with changed appearance are evaluated against the corresponding ground truth images for the target weather and lighting conditions. In Table 1, we show how our method performs on all possible combinations of source and target conditions. Using the same source and target conditions corresponds to novel view synthesis without appearance change, which, as expected, gives better metrics, but the gap is small for some combinations, e.g. comparing Day into Evening with Evening into Evening. In Table 2, we compare our method with several 2D style transfer methods. We see that our method outperforms the 2D methods on the performance metrics for all combinations. We observe that performance varies for the different conditions and that adapting images from another condition into day is the most challenging, while transforming from day gives significantly higher performance for all methods. A comparison against Instruct-NeRF2NeRF [15] was also performed, but results varied largely for different scenes and prompts. Further details are included in appendix 0.C.

We show two consistency metrics [7] in Table 3. If $(x_{1},\ldots,x_{n})$ and $(y_{1},\ldots,y_{n})$ are two image sequences rendered from the same pose sequences, we define $\text{tOF}=\|\text{OF}(y_{t+1},y_{t})-\text{OF}(x_{t+1},x_{t})\|_{1}$ , where OF is the optical flow computed via RAFT [38] and $\text{tLP}=\|\text{LPIPS}(y_{t+1},y_{t})-\text{LPIPS}(x_{t+1},x_{t})\|_{1}$ . The metrics are low if the reference images and the rendered images yield similar optical flow and similar changes in LPIPS, which is assumed to correspond to a consistent rendering. We observe that our method significantly outperforms most of the 2D style transfer methods. Notably, Instruct-NeRF2NeRF exhibits poorer consistency results than anticipated, primarily stemming from two key factors. Firstly, the NeRF models generate low-quality novel views for some scenes. Secondly, there are inconsistent appearance changes in response to certain prompts, which results in unrealistic alterations that do not clearly preserve the scene content. CyEDA gives comparable consistency metrics for some scenarios, but gives less realistic rendered views as seen in Fig. 4 and Table 2.

Table 4: Ablation Study comparing two approaches for generating latent appearance variables, comparing the similarity of rendered views with ground truth images (PSNR

\uparrow

|

SSIM

\uparrow

|

LPIPS

\downarrow

). We observe that both approaches give similar performance for changing appearance from one condition to another.

Latent Variables Scenarios Day to Night Day to Evening Day to Rain Night to Day Enforced structure 21.0 $|$ 0.56 $|$ 0.55 24.1 $|$ 0.75 $|$ 0.58 23.4 $|$ 0.71 $|$ 0.58 15.3 $|$ 0.56 $|$ 0.62 No enforced structure 21.6 $|$ 0.57 $|$ 0.55 23.2 $|$ 0.71 $|$ 0.57 22.8 $|$ 0.70 $|$ 0.55 15.3 $|$ 0.56 $|$ 0.61

Ablation Study.

We compare two different ways of learning latent appearance variables $\mathbf{z}_{c}\in\mathbb{R}^{d}$ . One approach is to initialize a random $d$ -dimensional vector with no enforced structure for each condition as a learnable parameter that is optimized jointly with the rest of the model. Another approach is to enforce structure by representing each condition as a fixed 2D-coordinate, placing them such that the evening condition is in between day and night, based on the assumption that one should pass through evening when going from day to night. These fixed 2D coordinates are then fed through a small learned fully-connected network to generate $\mathbf{z}_{c}$ . Comparing the performance metrics for these two approaches, as can be seen in Table 4, shows that both approaches give similar performance. However, enforcing a structure on the latent space leads to more realistic lighting effects when interpolating, as can be seen in Fig. 5, giving the appearance of a sunset. Based on this, it was decided to use the latent appearance variable with the enforced structure for our experiments. The choice of dimension $d=136$ , for the latent appearance variable, was made by observing that a higher dimension leads to a better ability to handle local appearance changes, such as turning on street lamps and removing shadows. More details and qualitative examples can be found in appendix 0.B.

5 Conclusions

We present a transformer based generalizable novel view synthesis method that allows for change of visual appearance without any scene-specific training. This is achieved by introducing a latent appearance variable that is used to change the visual appearance to match a given weather and lighting condition while keeping the scene structure unchanged. We also introduce a synthetic dataset based on CARLA for training and evaluating the methods and present experiments that show that this method is able to change the visual appearance of both synthetic and real scenes, to match a specified weather and lighting condition without any scene-specific training. The generated latent variables also make it possible to smoothly interpolate between different weather and lighting conditions. Compared to 2D style transfer, our method is view consistent by design. We experimentally show that our method outperforms multiple 2D style transfer methods, both in terms of rendering quality and that the rendering of nearby views are more consistent. A comparison with Instruct-NeRF2NeRF shows that our method is more robust in providing desired appearance changes while ensuring multi-view consistency and preserving scene content. Our generalizable approach is also more flexible, not requiring training a NeRF model for each scene, and also allows using fewer input images.

5.0.1 Acknowledgements

This work received full support from the Wallenberg AI, Autonomous Systems, and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation. Computational resources were provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS) at Chalmers Centre for Computational Science and Engineering (C3SE), partially funded by the Swedish Research Council under grant agreement no. 2022-06725.

References

[1] Barron, J.T., Mildenhall, B., Tancik, M., Hedman, P., Martin-Brualla, R., Srinivasan, P.P.: Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields. ICCV (2021)
[2] Beh, J.C., Ng, K.W., Kew, J.L., Lin, C.T., Chan, C.S., Lai, S.H., Zach, C.: Cyeda: Cycle-object edge consistency domain adaptation. In: ICIP (2022)
[3] Brooks, T., Holynski, A., Efros, A.A.: Instructpix2pix: Learning to follow image editing instructions. In: CVPR (2023)
[4] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. NeurIPS 33, 1877–1901 (2020)
[5] Chen, A., Xu, Z., Zhao, F., Zhang, X., Xiang, F., Yu, J., Su, H.: Mvsnerf: Fast generalizable radiance field reconstruction from multi-view stereo. In: ICCV (2021)
[6] Chen, R., Huang, W., Huang, B., Sun, F., Fang, B.: Reusing discriminators for encoding: Towards unsupervised image-to-image translation. In: CVPR (2020)
[7] Chu, M., Xie, Y., Mayer, J., Leal-Taixé, L., Thuerey, N.: Learning temporal coherence via self-supervision for gan-based video generation. ACM Transactions on Graphics (TOG) 39(4), 75–1 (2020)
[8] Dhariwal, P., Nichol, A.: Diffusion models beat gans on image synthesis. NeurIPS (2021)
[9] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: ICLR (2022)
[10] Dosovitskiy, A., Ros, G., Codevilla, F., Lopez, A., Koltun, V.: CARLA: An open urban driving simulator. In: Proceedings of the 1st Annual Conference on Robot Learning. pp. 1–16 (2017)
[11] Du, Y., Smith, C., Tewari, A., Sitzmann, V.: Learning To Render Novel Views From Wide-Baseline Stereo Pairs. In: CVPR (2023)
[12] Flynn, J., Broxton, M., Debevec, P., DuVall, M., Fyffe, G., Overbeck, R., Snavely, N., Tucker, R.: Deepview: View synthesis with learned gradient descent. In: CVPR (2019)
[13] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020)
[14] Gu, J., Liu, L., Wang, P., Theobalt, C.: StyleneRF: A style-based 3d aware generator for high-resolution image synthesis. In: ICLR (2022)
[15] Haque, A., Tancik, M., Efros, A., Holynski, A., Kanazawa, A.: Instruct-nerf2nerf: Editing 3d scenes with instructions. In: ICCV (2023)
[16] Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., Cohen-Or, D.: Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626 (2022)
[17] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. neuriPS (2020)
[18] Huang, X., Belongie, S.: Arbitrary style transfer in real-time with adaptive instance normalization. In: ICCV (Oct 2017)
[19] Huang, Y.H., He, Y., Yuan, Y.J., Lai, Y.K., Gao, L.: Stylizednerf: Consistent 3d scene stylization as stylized nerf via 2d-3d mutual learning. In: CVPR (2022)
[20] Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: CVPR (2017)
[21] Kurz, A., Neff, T., Lv, Z., Zollhöfer, M., Steinberger, M.: Adanerf: Adaptive sampling for real-time rendering of neural radiance fields. In: ECCV (2022)
[22] Kwon, G., Ye, J.C.: Diffusion-based image translation using disentangled style and content representation. arXiv preprint arXiv:2209.15264 (2022)
[23] Liu, Y., Peng, S., Liu, L., Wang, Q., Wang, P., Christian, T., Zhou, X., Wang, W.: Neural rays for occlusion-aware image-based rendering. In: CVPR (2022)
[24] Martin-Brualla, R., Radwan, N., Sajjadi, M.S.M., Barron, J.T., Dosovitskiy, A., Duckworth, D.: Nerf in the wild: Neural radiance fields for unconstrained photo collections. In: CVPR (2021)
[25] Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: Nerf: Representing scenes as neural radiance fields for view synthesis. In: ECCV (2020)
[26] Müller, T., Evans, A., Schied, C., Keller, A.: Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans. Graph. 41(4), 102:1–102:15 (Jul 2022)
[27] Niemeyer, M., Barron, J.T., Mildenhall, B., Sajjadi, M.S.M., Geiger, A., Radwan, N.: Regnerf: Regularizing neural radiance fields for view synthesis from sparse inputs. In: CVPR (2022)
[28] Park, D.Y., Lee, K.H.: Arbitrary style transfer with style-attentional networks. In: CVPR (2019)
[29] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: ICML (2021)
[30] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: CVPR (2022)
[31] Ronneberger, O., P.Fischer, Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention (MICCAI). LNCS, vol. 9351, pp. 234–241. Springer (2015)
[32] Saharia, C., Chan, W., Chang, H., Lee, C., Ho, J., Salimans, T., Fleet, D., Norouzi, M.: Palette: Image-to-image diffusion models. In: ACM SIGGRAPH (2022)
[33] Sajjadi, M.S., Meyer, H., Pot, E., Bergmann, U., Greff, K., Radwan, N., Vora, S., Lucic, M., Duckworth, D., Dosovitskiy, A., Uszkoreit, J., Funkhouser, T., Tagliasacchi, A.: Scene Representation Transformer: Geometry-Free Novel View Synthesis Through Set-Latent Scene Representations. In: CVPR (2022)
[34] Suhail, M., Esteves, C., Sigal, L., Makadia, A.: Generalizable Patch-Based Neural Rendering. In: ECCV (2022)
[35] T, M.V., Wang, P., Chen, X., Chen, T., Venugopalan, S., Wang, Z.: Is attention all that neRF needs? In: ICLR (2023)
[36] Tancik, M., Casser, V., Yan, X., Pradhan, S., Mildenhall, B., Srinivasan, P.P., Barron, J.T., Kretzschmar, H.: Block-nerf: Scalable large scene neural view synthesis. In: CVPR (2022)
[37] Tancik, M., Weber, E., Ng, E., Li, R., Yi, B., Wang, T., Kristoffersen, A., Austin, J., Salahi, K., Ahuja, A., Mcallister, D., Kerr, J., Kanazawa, A.: Nerfstudio: A modular framework for neural radiance field development. In: ACM SIGGRAPH 2023. Association for Computing Machinery, New York, NY, USA
[38] Teed, Z., Deng, J.: Raft: Recurrent all-pairs field transforms for optical flow. In: ECCV (2020)
[39] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, ., Polosukhin, I.: Attention is All you Need. In: NIPS (2017)
[40] Wang, C., Chai, M., He, M., Chen, D., Liao, J.: Clip-nerf: Text-and-image driven manipulation of neural radiance fields. In: CVPR (2022)
[41] Wang, C., Jiang, R., Chai, M., He, M., Chen, D., Liao, J.: Nerf-art: Text-driven neural radiance fields stylization. IEEE Transactions on Visualization and Computer Graphics pp. 1–15 (2023)
[42] Wang, D., Cui, X., Salcudean, S., Wang, Z.J.: Generalizable neural radiance fields for novel view synthesis with transformer (2022)
[43] Wang, Q., Wang, Z., Genova, K., Srinivasan, P.P., Zhou, H., Barron, J.T., Martin-Brualla, R., Snavely, N., Funkhouser, T.: Ibrnet: Learning multi-view image-based rendering. In: CVPR (2021)
[44] Wang, T.C., Liu, M.Y., Zhu, J.Y., Tao, A., Kautz, J., Catanzaro, B.: High-resolution image synthesis and semantic manipulation with conditional gans. In: CVPR (2018)
[45] Wang, Z., Bovik, A., Sheikh, H., Simoncelli, E.: Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13(4), 600–612 (2004)
[46] Yu, A., Ye, V., Tancik, M., Kanazawa, A.: pixelnerf: Neural radiance fields from one or few images. In: CVPR (2021)
[47] Zhang, K., Kolkin, N., Bi, S., Luan, F., Xu, Z., Shechtman, E., Snavely, N.: Arf: Artistic radiance fields (2022)
[48] Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: CVPR (2018)
[49] Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: ICCV (2017)
[50] Zhu, J.Y., Zhang, R., Pathak, D., Darrell, T., Efros, A.A., Wang, O., Shechtman, E.: Toward multimodal image-to-image translation. NIPS (2017)

Appendix

A supplementary video is available at https://ava-nvs.github.io. The video results are discussed in 0.A. In 0.B we give more detail regarding the generation of latent appearance variables. In 0.C we give additional details regarding the comparison with Instruct-NeRF2NeRF[15]. In 0.D we show qualitative comparisons for additional scenarios.

Appendix 0.A Video

The supplementary video presents additional findings regarding interpolation, demonstrating our method’s capacity to given images at one appearance condition smoothly transition the appearance to match a different weather and lighting condition while ensuring multi-view consistency.

Furthermore, the video includes a temporal consistency comparison with applying 2D style transfer on rendered images, for the Day to Night scenario. Our observations reveal that nearly all 2D methods exhibit some degree of flickering and temporal inconsistencies, with DiffuseIT [22] displaying the most significant issues, frequently hallucinating objects or structures. CyEDA [2] appears to generate random unexpected bright spots. Of the 2D methods only Palette [32] successfully learns to fully activate street lamps, but noticeable pixel intensity fluctuations result in inconsistent renderings. In contrast, our method delivers both multi-view consistency and realistic lighting changes.

Finally the video also contains a comparison with the Instruct-NeRF2NeRF method for the Day to Night and Day to Rain scenarios. We can observe that Instruct-NeRF2NeRF method gives multiview consistent renderings, but it struggles with increased blurriness and cloudy artifacts. We can for the Day to Night scenario observe that the method struggles with clearly preserving the scene content. More details about the Instruct-NeRF2NeRF comparison can be found in the next section, including results for additional prompts.

Appendix 0.B Generating Latent Variables

We propose two approaches for generating latent appearance variables $\mathbf{z_{c}}\in\mathbb{R}^{d}$ . One approach is to initialize a random $d$ -dimensional vector for each condition and then include it as a learnable parameter that is optimized jointly with the rest of the model. In this case, the latent variables are fully learned with no enforced structure.

Another approach is to enforce structure on the latent appearance variables by defining fixed 2D-coordinates $c$ corresponding to each condition that is then passed through a small fully-connected network to generate $\mathbf{z_{c}}=f_{z}(c)$ , where the parameters of this additional fully-connected network are learned jointly with the rest of the model. For our case with four weather and lighting conditions, we define the fixed 2D coordinates as shown in Fig. 7.

The reason behind this placement is to get the desired behavior when interpolating between two conditions, ensuring that the evening condition is passed through when interpolating between day and night conditions, and places rain on a separate axis since it corresponds to appearance change not directly connected to variations in daylight. The fully connected network $f_{z}(c)$ takes in a 2D coordinate corresponding to a condition and outputs a latent appearance variable $\mathbf{z_{c}}$ of dimension $d$ . For the performed experiments, we used $d=136$ , and two hidden layers of size $16$ and $68$ , respectively. The choice of $d$ was made after testing different values and observing that a higher dimension leads to better ability to handle local appearance changes such as turning on lamps and removing shadows, as seen in Fig. 8.

Comparing performance metrics for learnable latent variables with no enforced structure in Table 5 with the ones in Table 1 where latent appearance variables with enforced structure are used, shows that both approaches for generating the latent variable $\mathbf{z_{c}}$ give similar performance when changing appearance from one condition to another. However, enforcing a structure on the latent space leads to more realistic lighting effects when interpolating between two conditions, as can be seen in Fig. 5, giving the appearance of a sunset. Based on this, it was decided to use the latent appearance variable with the enforced structure for our experiments.

Table 5: Comparison of similarity of rendered views when using learnable latent variables with no enforced structure. Comparing with ground truth images for all combinations of weather and lighting conditions (PSNR

\uparrow

|

SSIM

\uparrow

|

LPIPS

\downarrow

). The values along the diagonal correspond to novel view synthesis without appearance change.

From Day From Night From Evening From Rain Into Day 23.3 $|$ 0.76 $|$ 0.61 15.3 $|$ 0.56 $|$ 0.61 16.4 $|$ 0.61 $|$ 0.60 15.9 $|$ 0.61 $|$ 0.61 Into Night 21.6 $|$ 0.57 $|$ 0.55 28.1 $|$ 0.73 $|$ 0.55 21.1 $|$ 0.56 $|$ 0.54 21.4 $|$ 0.58 $|$ 0.55 Into Evening 23.2 $|$ 0.71 $|$ 0.57 19.9 $|$ 0.47 $|$ 0.58 23.1 $|$ 0.55 $|$ 0.57 20.3 $|$ 0.50 $|$ 0.57 Into Rain 22.8 $|$ 0.70 $|$ 0.55 20.9 $|$ 0.66 $|$ 0.56 21.3 $|$ 0.69 $|$ 0.57 23.1 $|$ 0.55 $|$ 0.58

Appendix 0.C Comparison with Instruct-NeRF2NeRF

We used the official Instruct-NeRF2NeRF implementation. This implementation uses the Nerfstudio [37] Nerfacto NeRF model. The quality of the Nerfacto models when using 10 images was very poor, so we increased the number of images in the sequence to 25. There were still issues of the quality of the rendered views for some scenes, especially for the night scenes, as can be seen in Fig. 9. This is the reason why consistency metrics for the Night to Day scenario for the Instruct-NeRF2NeRF method are not included in Table 3.

Figures 10 and 11 show results for Instruct-NeRF2NeRF for additional prompts. This shows that the quality of rendered views can vary strongly based on the prompt that is used, and some prompts such as "Make it midnight" and "Make it stormy" led to the scene content almost completely disappearing. The figures also show that the same prompt can result in differing appearance changes when used on different scenes, e.g. the prompt "Turn it into evening" leads to images with very different color schemes for the two different scenes. In contrast our method gives similar types of appearance change for different scenes.

Appendix 0.D Comparisons for additional scenarios

We further compare against other methods for additional scenarios, including Day to Evening (Fig. 12), Day to Rain (Fig. 13), and the most challenging scenario Night to Day (Fig. 14). Our method is able to clearly preserve content and 3D consistency while making appropriate adjustments to the appearance of the scene. For instance, we effectively eliminate shadows from images for the Day to Evening scenario. In the Day to Rain scenario, our model maintains scene content while altering visual appearance, ensuring multi-view consistency. Lastly, our model can even learn to deactivate interior lighting in buildings in the most challenging Night to Day scenario.

Type	Method	Scenarios
Type	Method	Day to Night	Day to Evening	Day to Rain	Night to Day
Non-diffusion	Pix2Pix-HD [44]	2.59 $\|$ 0.147	1.58 $\|$ 0.169	1.60 $\|$ 0.030	2.49 $\|$ 0.078
	BicycleGAN [50]	5.13 $\|$ 0.053	4.79 $\|$ 0.083	5.10 $\|$ 0.024	5.20 $\|$ 0.047
	NICE-GAN [6]	1.93 $\|$ 0.040	1.24 $\|$ 0.081	1.25 $\|$ 0.014	2.09 $\|$ 0.055
	CyEDA [2]	1.62 $\|$ 0.027	1.21 $\|$ 0.115	0.96 $\|$ 0.022	1.25 $\|$ 0.032
	SANet [28]	2.37 $\|$ 0.069	2.05 $\|$ 0.097	1.73 $\|$ 0.088	2.01 $\|$ 0.092
Diffusion Models	Palette [32]	10.12 $\|$ 0.115	7.21 $\|$ 0.109	8.41 $\|$ 0.057	18.77 $\|$ 0.050
	DiffuseIT [22]	24.62 $\|$ 0.236	29.23 $\|$ 0.242	27.43 $\|$ 0.166	29.75 $\|$ 0.188
	Instruct-Pix2Pix [3]	1.62 $\|$ 0.123	1.37 $\|$ 0.121	1.45 $\|$ 0.088	1.48 $\|$ 0.092
NeRF Editing	IN2N [15]	7.29 $\|$ 0.137	6.09 $\|$ 0.085	5.65 $\|$ 0.051	- $\|$ - ¹¹1We could not get satisfactory renderings for this condition, more details can be found in appendix 0.C.
Ours	Ours	1.44 $\|$ 0.026	1.10 $\|$ 0.035	0.97 $\|$ 0.013	1.25 $\|$ 0.032