
CG-NeRF: Conditional Generative Neural Radiance Fields

Kyungmin Jo*, Gyumin Shim*, Sanghun Jung, Soyoung Yang, Jaegul Choo (*Equal contribution)
Abstract

While recent NeRF-based generative models achieve the generation of diverse 3D-aware images, these approaches have limitations when generating images that contain user-specified characteristics. In this paper, we propose a novel model, referred to as the conditional generative neural radiance fields (CG-NeRF), which can generate multi-view images reflecting extra input conditions such as images or texts. While preserving the common characteristics of a given input condition, the proposed model generates diverse images in fine detail. We propose: 1) a novel unified architecture which disentangles the shape and appearance from a condition given in various forms and 2) the pose-consistent diversity loss for generating multimodal outputs while maintaining consistency of the view. Experimental results show that the proposed method maintains consistent image quality on various condition types and achieves superior fidelity and diversity compared to existing NeRF-based generative models.

Introduction

The neural radiance field (NeRF) (Mildenhall et al. 2020) successfully addresses unseen view synthesis, a long-standing problem in computer vision, by learning to construct a 3D scene from a set of images taken from multiple viewpoints via a differentiable rendering technique. Because NeRF takes the 3D coordinate and the viewpoint of a target scene as inputs, it is capable of synthesizing view-consistent images (i.e., images corresponding to the input viewpoint). Due to the success of NeRF, this approach has been widely extended to various fields, such as video (Li et al. 2021; Xian et al. 2021), pose estimation (Su et al. 2021), scene labeling and understanding (Zhi et al. 2021), and object category modeling (Xie et al. 2021).

While these techniques utilize NeRF only for synthesizing unseen views, recent studies have emerged that generate photorealistic multi-view images based on generative adversarial networks (GANs) (Schwarz et al. 2020; Niemeyer and Geiger 2021c; Chan et al. 2021). Compared to existing 2D-based generative models, these studies can produce 3D-aware images by generating view-consistent images for given camera poses. However, because these generative models produce images without any user-specified condition, they cannot generate images that contain the desired characteristics of a condition, as shown in the teaser figure.

To extend the task of existing generative NeRF models, we propose a novel task, termed conditional generative NeRF (CG-NeRF), which performs 3D-aware image synthesis for a given condition. The proposed task aims to create view-consistent images with diverse styles in fine detail while reflecting the characteristics of the condition. To the best of our knowledge, we are the first to tackle this task, which extends existing generative NeRF approaches that do not consider user-specified conditions.

First, while aiming to reflect user-specified conditions, we note that conditions can exist in various forms (e.g., texts and images). Thus, to accommodate such user-specified conditions, we propose a unified method adaptively applicable to various condition types, as processing these various types of conditions in a unified model can significantly enhance its applicability for users. We consider various condition types, including color images, grayscale images, sketches, low-resolution images, and text, as shown in the condition inputs of the teaser figure. To support these various types of conditions, we leverage a pre-trained semantic multimodal encoder, CLIP, followed by the disentanglement of shape and appearance from the encoded vector of the input condition.

Generating diverse images is another important factor. To produce varied images, we design a generative model capable of generating fine details while reflecting the common characteristics of the input conditions. This is challenging because fully relying on the input condition is an overly simple solution from the generator's perspective and can degrade diversity. To further enhance the diversity of the output images, we propose a novel pose-consistent diversity (PD) loss that explicitly penalizes view inconsistencies.

In summary, our contributions are as follows:

  • We define a novel task, conditional generative neural radiance fields, referred to as CG-NeRF.

  • The proposed method generates diverse and photorealistic images reflecting the condition inputs, effectively disentangling the shape and appearance from the input condition.

  • For improved diversity of the output images, we propose the pose-consistent diversity (PD) loss, which helps to maximize only style differences while maintaining consistency of the view.

  • We conduct extensive experiments and demonstrate that our unified model can generate diverse images, reflecting various conditions.

Figure 1: Illustration of our main architecture. Notations are summarized in Table 1.

Related Work

Neural Radiance Fields

Recent advancements (Mildenhall et al. 2020; Hedman et al. 2021; Jain, Tancik, and Abbeel 2021; Srinivasan et al. 2021; Yu et al. 2021; Lindell, Martel, and Wetzstein 2021) in the area of novel view synthesis have been accomplished by employing the NeRF. The seminal work (Mildenhall et al. 2020) has proven the effectiveness of volume rendering with NeRF, and later studies (Hedman et al. 2021; Wang et al. 2021; Zhang et al. 2020) proposed further improvements over the original NeRF. While some NeRF studies enhance the original NeRF in terms of both quality and efficiency, our work is more related to generative NeRF methods, which have attracted attention recently.

Generative NeRF

Along with improvements to NeRF itself, generative NeRF models (Schwarz et al. 2020; Niemeyer and Geiger 2021c; Chan et al. 2021; Niemeyer and Geiger 2021a) have also emerged. GRAF (Schwarz et al. 2020) proposes a generative model with implicit radiance fields for novel scene generation. Moreover, GIRAFFE (Niemeyer and Geiger 2021c) improves GRAF by separating object instances in a controllable way, giving users more ability to compose new scenes. Another study, pi-GAN (Chan et al. 2021), which is more closely related to our work, employs the SIREN (Sitzmann et al. 2020) activation function along with a multilayer perceptron (MLP), which is effective for novel scene generation. However, none of the previous approaches allows conditioning the generation process, which would let users generate various scenes according to diverse conditions. Therefore, in this work, we propose a novel model, CG-NeRF, which can significantly improve the applicability of NeRF-based generative methods.

CLIP

The conditions from which we want to generate images can exist in various forms, typically images or texts. To address both cases at the same time, a model that can take multimodal inputs is required. Among such models (Xu et al. 2018; Tao et al. 2020; Radford et al. 2021a), CLIP (Radford et al. 2021a) shows an impressive ability to embed text and image information into the same semantic space. We adopt CLIP as our global feature extractor for the various condition types, making our model widely applicable to both images and text.

Proposed Approach

Overview

We propose a novel method called conditional generative NeRF (CG-NeRF), which can generate camera-pose-dependent images conditioned on various types of input data. Unlike recent unconditional generative models that learn neural radiance fields from unlabeled 2D images, we extend the generative model to a conditional model that utilizes extra information as input, such as text, sketches, grayscale images, low-resolution images, or even color images. We design a model that can generate diverse images with different details while sharing the common characteristics of the condition input. As shown in Fig. 1, the global feature vector $\mathbf{c}$ extracted from the input condition is fed to the network along with the noise codes $\mathbf{z}^{s}$ and $\mathbf{z}^{a}$ randomly sampled from a standard Gaussian distribution $p_{z}$. The noise codes specify fine details that are not contained in the given global features. In the proposed model, the generator $G_{\theta}$ (2 in Fig. 1) learns radiance field representations and synthesizes images $\hat{\mathbf{I}}$ corresponding to the given global feature vector $\mathbf{c}$ and noise codes $\mathbf{z}^{s}$ and $\mathbf{z}^{a}$, i.e.,

$$G_{\theta}(\boldsymbol{\xi},\mathbf{c},\mathbf{z}^{s},\mathbf{z}^{a})=\hat{\mathbf{I}}, \qquad (1)$$

where $\boldsymbol{\xi}$ is the camera pose used to calculate the 3D coordinate $\mathbf{x}$ and the viewing direction $\mathbf{d}$ (Schwarz et al. 2020). Below, we describe the model structure designed for CG-NeRF in detail.
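For concreteness, a minimal sketch of how the inputs of Eq. (1) could be assembled is shown below; the dimensions, the sample_pose helper, and the commented generator call are illustrative assumptions rather than the authors' implementation.

```python
# Hypothetical sketch of assembling the generator inputs of Eq. (1).
import math
import torch

def sample_pose(batch_size: int) -> torch.Tensor:
    """Sample camera poses xi = (rotation, elevation) from the prior p_xi used in the paper."""
    rotation = torch.rand(batch_size) * 2 * math.pi - math.pi   # kappa_r in [-pi, pi]
    elevation = torch.rand(batch_size) * math.pi                # kappa_e in [0, pi]
    return torch.stack([rotation, elevation], dim=-1)

# Assumed dimensions: L_c from the CLIP encoder, L_s and L_a chosen freely.
L_c, L_s, L_a, B = 512, 128, 128, 4
c = torch.randn(B, L_c)      # global feature vector from the condition encoder E
z_s = torch.randn(B, L_s)    # shape noise code sampled from N(0, I)
z_a = torch.randn(B, L_a)    # appearance noise code sampled from N(0, I)
xi = sample_pose(B)          # camera pose

# images = generator(xi, c, z_s, z_a)   # hypothetical G_theta call producing (B, 3, H, W) images
```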

Input
  $\mathbf{x}\in\mathbb{R}^{3}$ : 3D coordinate
  $\mathbf{d}\in\mathbb{R}^{2}$ : Viewing direction
  $\mathbf{c}\in\mathbb{R}^{L_{c}}$ : Global feature vector
  $\mathbf{z}^{s}\in\mathbb{R}^{L_{s}}$ : Shape noise code
  $\mathbf{z}^{a}\in\mathbb{R}^{L_{a}}$ : Appearance noise code
Output
  $\boldsymbol{\gamma}_{i}^{s},\boldsymbol{\gamma}_{i}^{a}\in\mathbb{R}^{L_{\gamma}}$ : Frequency
  $\boldsymbol{\beta}_{i}^{s},\boldsymbol{\beta}_{i}^{a}\in\mathbb{R}^{L_{\beta}}$ : Phase shift
  $\sigma_{pj}\in\mathbb{R}$ : Density
  $\mathbf{f}_{pj}\in\mathbb{R}^{L_{f}}$ : Feature vector
  $\mathbf{F}_{p}\in\mathbb{R}^{L_{f}}$ : Rendered feature
  $\mathbf{I},\hat{\mathbf{I}}\in\mathbb{R}^{H\times W\times 3}$ : Real/Generated image
Function
  $g_{\theta}:\mathbb{R}^{L_{c}+L_{s}+L_{a}+5}\mapsto\mathbb{R}^{2L_{f}}$ : Feature fields generator
  $M^{s}:\mathbb{R}^{L_{c}+L_{s}}\mapsto\mathbb{R}^{N^{s}\times(L_{\gamma}+L_{\beta})}$ : Shape mapping network
  $M^{a}:\mathbb{R}^{L_{c}+L_{a}}\mapsto\mathbb{R}^{(N^{a}+1)\times(L_{\gamma}+L_{\beta})}$ : Appearance mapping network
  $\Phi^{s}:\mathbb{R}^{3}\mapsto\mathbb{R}^{L_{f}}$ : Shape block
  $\Phi^{a}:\mathbb{R}^{L_{f}+2}\mapsto\mathbb{R}^{2L_{f}}$ : Appearance block
Table 1: Summarized notations. $p\in\{1,\cdots,H_{V}W_{V}\}$ and $j\in\{1,\cdots,J\}$, where $J$ indicates the number of sampling points per ray. $H\times W$ and $H_{V}\times W_{V}$ are the spatial resolutions of the image and the feature map, respectively.

Architecture: Conditional Generative NeRF

As illustrated in Figure 1, the main architecture consists of three components: (1) a feature extractor $E$ that extracts global feature vectors from the given conditions, (2) a generator that creates an image reflecting the conditions, and (3) a discriminator that distinguishes real images from fake images based on the condition input and predicts the camera poses of fake images for the PD loss, which will be described in detail later.

As CG-NeRF aims to synthesize conditional 3D-aware images, the condition input is encoded into a global feature vector through the global feature extractor $E$ (1 in Fig. 1). To extract global semantic features from the given condition inputs, we adopt CLIP (Radford et al. 2021b), a state-of-the-art multimodal encoder that accommodates various types of input conditions such as images and text.
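As a rough illustration of this step (not the authors' exact pipeline), the snippet below encodes either an image path or a text prompt into a shared embedding space using the public CLIP package; the backbone choice and unit normalization are assumptions.

```python
# Sketch of extracting the global feature vector c with CLIP (Radford et al. 2021).
# Assumes the open-source "clip" package (pip install git+https://github.com/openai/CLIP).
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # backbone choice is an assumption

@torch.no_grad()
def encode_condition(condition, is_text: bool) -> torch.Tensor:
    """Map an image path or a text string to the global feature vector c."""
    if is_text:
        tokens = clip.tokenize([condition]).to(device)
        c = model.encode_text(tokens)
    else:
        image = preprocess(Image.open(condition)).unsqueeze(0).to(device)
        c = model.encode_image(image)
    return c / c.norm(dim=-1, keepdim=True)  # unit-normalized embedding (common practice)

# c_img = encode_condition("face.png", is_text=False)
# c_txt = encode_condition("round bird with a red body", is_text=True)
```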

We design our generator network by combining two recent promising techniques that have proven to generate high-quality images for the generative neural radiance field task: a SIREN-based backbone (Chan et al. 2021) and a feature-level volume rendering method (Niemeyer and Geiger 2021c). The SIREN-based (Sitzmann et al. 2020) network architecture enhances the visual quality of NeRF-based generative models but requires a large amount of memory for training due to color-level volume rendering at the full image resolution (Chan et al. 2021). To address this issue, we leverage feature-level volume rendering, inspired by a recently proposed method (Niemeyer and Geiger 2021c). The feature-level volume rendering process substantially mitigates the problem because the volume is rendered at the level of the feature vector $\mathbf{f}$, at a spatial resolution smaller than that of the image.

Given a global feature vector $\mathbf{c}$, a shape noise code $\mathbf{z}^{s}$, and an appearance noise code $\mathbf{z}^{a}$, the feature fields generator $g_{\theta}$ (2-1 in Fig. 1) produces the density $\sigma$ and feature vector $\mathbf{f}$ at the corresponding $\mathbf{x}$ and $\mathbf{d}$ as

$$g_{\theta}(\mathbf{x}_{pj},\mathbf{d}_{p},\mathbf{c},\mathbf{z}^{s},\mathbf{z}^{a})=(\sigma_{pj},\mathbf{f}_{pj}), \qquad (2)$$

where $\sigma_{pj}$ and $\mathbf{f}_{pj}$ denote the density and the feature vector, respectively, at the corresponding 3D coordinate. Further details are described in the next section.

Once the density $\sigma$ and the feature vector $\mathbf{f}$ are estimated by the feature fields generator $g_{\theta}$ (2-1 in Fig. 1) at each 3D coordinate, the final feature $\mathbf{F}_{p}\in\mathbb{R}^{L_{f}}$ is computed through a feature-level volume rendering process as

$$\mathbf{F}_{p}=\sum_{j=1}^{J}T_{pj}\,\alpha_{pj}\,\mathbf{f}_{pj}, \qquad (3)$$

where the transmittance $T_{pj}=\prod_{k=1}^{j-1}(1-\alpha_{pk})$. The alpha value for $\mathbf{x}_{pj}$ is calculated as $\alpha_{pj}=1-e^{-\sigma_{pj}\delta_{pj}}$, where $\delta_{pj}$ is the distance between neighboring sample points along the ray direction (Mildenhall et al. 2020). The 2D feature map $\mathbf{F}\in\mathbb{R}^{H_{V}\times W_{V}\times L_{f}}$ rendered through the volume rendering process is then upsampled to an RGB image at a higher resolution $\hat{\mathbf{I}}\in\mathbb{R}^{H\times W\times 3}$ using a 2D convolutional neural network (CNN) decoder (2-2 in Fig. 1). The decoder consists of CNN layers with leaky ReLU activation functions (Xu et al. 2015) and nearest-neighbor upsampling layers.
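To make the rendering step concrete, the following is a minimal PyTorch sketch of Eq. (3) and the alpha compositing described above; the tensor layout and the exclusive cumulative-product trick are our own implementation choices.

```python
import torch

def render_features(sigma: torch.Tensor, feat: torch.Tensor, delta: torch.Tensor) -> torch.Tensor:
    """Feature-level volume rendering (Eq. 3).

    sigma: (P, J)      densities sigma_pj along each of P rays with J samples
    feat:  (P, J, Lf)  feature vectors f_pj
    delta: (P, J)      distances between neighboring samples along the ray
    returns F: (P, Lf) rendered features F_p
    """
    alpha = 1.0 - torch.exp(-sigma * delta)                        # alpha_pj
    # transmittance T_pj = prod_{k<j} (1 - alpha_pk): exclusive cumulative product
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha[:, :-1]], dim=1), dim=1)
    weights = trans * alpha                                        # T_pj * alpha_pj
    return (weights.unsqueeze(-1) * feat).sum(dim=1)               # sum over the J samples

# Example: 16x16 feature map (P = 256 rays), 24 samples per ray, Lf = 128.
# F = render_features(torch.rand(256, 24), torch.randn(256, 24, 128), torch.full((256, 24), 0.05))
# F can then be reshaped to (Hv, Wv, Lf) and upsampled to an RGB image by the CNN decoder.
```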

Condition-based Disentangling Network

We propose a novel approach that aims to disentangle the shape and appearance contained in a given global feature vector. For example, in the text condition "round bird with a red body", "round" and "bird" describe the shape, while "red" is an attribute indicating the appearance. Two mapping networks $M^{s}$ and $M^{a}$ serve to generate the styles of the shape and appearance, respectively, from the global feature vector $\mathbf{c}$ and noise codes $\mathbf{z}^{s}$ and $\mathbf{z}^{a}$. The global feature vector $\mathbf{c}\in\mathbb{R}^{L_{c}}$ contains the prominent attributes of the condition. In contrast, the noise codes $\mathbf{z}^{s}\in\mathbb{R}^{L_{s}}$ and $\mathbf{z}^{a}\in\mathbb{R}^{L_{a}}$ are responsible for the details that the global feature vector does not include. Each mapping network consists of pairs of a linear layer and ReLU and produces frequencies $\boldsymbol{\gamma}$ and phase shifts $\boldsymbol{\beta}$ as

$$M^{s}(\mathbf{c},\mathbf{z}^{s})=\text{cat}\{(\boldsymbol{\gamma}_{i}^{s},\boldsymbol{\beta}_{i}^{s})\}_{i=1\cdots N^{s}}, \qquad (4)$$
$$M^{a}(\mathbf{c},\mathbf{z}^{a})=\text{cat}\{(\boldsymbol{\gamma}_{i}^{a},\boldsymbol{\beta}_{i}^{a})\}_{i=1\cdots N^{a}+1},$$

where $N^{s}$ and $N^{a}$ denote the numbers of MLPs in each block, and cat indicates channel-wise concatenation. The predicted frequencies and phase shifts are fed to the two blocks $\Phi^{s}$ and $\Phi^{a}$ in the feature fields generator.

Taking these as inputs along with the 3D coordinate $\mathbf{x}$ and the viewing direction $\mathbf{d}$, two consecutive blocks encode features using pairs of a linear layer and a feature-wise linear modulation (FiLM) SIREN activation. The sine function of the FiLM SIREN layer, modulated by the predicted frequency and phase shift, is applied to the output of each linear layer as an activation function, i.e.,

$$\phi_{i}(\mathbf{y}_{i})=\sin\left(\boldsymbol{\gamma}_{i}(\mathbf{W}_{i}\mathbf{y}_{i}+\mathbf{b}_{i})+\boldsymbol{\beta}_{i}\right), \qquad (5)$$

where $\phi_{i}:\mathbb{R}^{M_{i}}\mapsto\mathbb{R}^{N_{i}}$ is the $i$-th MLP of each of $\Phi^{s}$ and $\Phi^{a}$, and $\mathbf{W}_{i}\in\mathbb{R}^{N_{i}\times M_{i}}$ and $\mathbf{b}_{i}\in\mathbb{R}^{N_{i}}$ are the weight and bias applied to the input $\mathbf{y}_{i}\in\mathbb{R}^{M_{i}}$. The two blocks in the feature fields generator have the following formulations:

$$\Phi^{s}(\mathbf{x}_{pj})=\phi_{N^{s}}^{s}\left(\phi_{N^{s}-1}^{s}(\cdots\phi_{1}^{s}(\mathbf{x}_{pj}))\right), \qquad (6)$$
$$\Phi^{a}(\Phi^{s}(\mathbf{x}_{pj}),\mathbf{d}_{p})=\phi_{N^{a}+1}^{a}\left(\text{cat}\left(\phi_{N^{a}}^{a}(\cdots\phi_{1}^{a}(\Phi^{s}(\mathbf{x}_{pj}))),\mathbf{d}_{p}\right)\right).$$

Inspired by an existing approach (Schwarz et al. 2020), we assign the role of reflecting the shape to the first block, close to the input, and the appearance to the second block, close to the output. The shape block takes the 3D coordinate as input and generates shape-encoded features, while the appearance block takes the output of the previous block as input and generates features encoding both shape and appearance. By also taking the viewing direction as input, the last layer of the appearance block generates features that reflect the viewing direction.
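The sketch below illustrates, under our own assumptions about layer widths and depths, how a mapping network could emit the frequencies and phase shifts of Eq. (4) and how a FiLM SIREN layer applies them as in Eq. (5); it is an illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """Maps (c, z) to per-layer frequencies gamma_i and phase shifts beta_i for one block."""
    def __init__(self, cond_dim: int, noise_dim: int, hidden: int, n_layers: int):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(cond_dim + noise_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.head = nn.Linear(hidden, n_layers * 2 * hidden)  # gamma and beta per FiLM SIREN layer
        self.n_layers, self.hidden = n_layers, hidden

    def forward(self, c: torch.Tensor, z: torch.Tensor):
        out = self.head(self.backbone(torch.cat([c, z], dim=-1)))
        out = out.view(-1, self.n_layers, 2, self.hidden)
        return out[:, :, 0], out[:, :, 1]        # gammas, betas: (B, n_layers, hidden)

class FiLMSiren(nn.Module):
    """One FiLM SIREN layer: phi(y) = sin(gamma * (W y + b) + beta), as in Eq. (5)."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, y: torch.Tensor, gamma: torch.Tensor, beta: torch.Tensor) -> torch.Tensor:
        return torch.sin(gamma * self.linear(y) + beta)

# Example: a shape-block style stack conditioned by M^s, consuming the 3D coordinate x_pj.
# mapping = MappingNetwork(cond_dim=512, noise_dim=128, hidden=256, n_layers=4)
# gammas, betas = mapping(torch.randn(2, 512), torch.randn(2, 128))
# layer = FiLMSiren(3, 256)
# h = layer(torch.randn(2, 3), gammas[:, 0], betas[:, 0])   # (2, 256)
```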

Pose-consistent Diversity Loss

As our method generates images conditioned on extra inputs, variations of the output images are restricted, especially when a color image is given as a condition input. To enable the generator network to produce semantically diverse images based on the condition input, we regularize the generator network with the diversity-sensitive loss  (Yang et al. 2019a). This is defined as

$$\mathcal{L}_{\text{div}}(\theta)=\mathbb{E}_{\mathbf{z}^{s},\mathbf{z}^{a}\sim p_{z},\,\boldsymbol{\xi}\sim p_{\xi},\,\mathbf{c}\sim p_{r}}\left[\left\|\hat{\mathbf{I}}_{1}-\hat{\mathbf{I}}_{2}\right\|_{1}\right], \qquad (7)$$

where $\hat{\mathbf{I}}_{1}=G_{\theta}(\boldsymbol{\xi},\mathbf{c},\mathbf{z}^{s1},\mathbf{z}^{a1})$ and $\hat{\mathbf{I}}_{2}=G_{\theta}(\boldsymbol{\xi},\mathbf{c},\mathbf{z}^{s2},\mathbf{z}^{a2})$.

However, we empirically find that simply applying the diversity-sensitive loss causes an undesirable effect: it attempts to change not only the style but also the pose of the output images (Fig. 5). Because the pose of the output images should be determined only by the input camera pose $\boldsymbol{\xi}$, pose changes in the output images are a significant side effect. We analyze this phenomenon as follows: from the generator network's point of view, the model can maximize the pixel difference in two ways, (1) changing the style of the output images, which is desired, or (2) changing the poses between two output images generated with the same camera pose, which is strongly undesired.

To explicitly address this issue, we propose a pose regularization term applicable to the original diversity-sensitive loss, which explicitly penalizes the pose difference between images generated from different noise codes $\mathbf{z}^{s}$ and $\mathbf{z}^{a}$ but the same camera pose. The intuition behind the proposed regularization is that the model should generate two images that differ only in style while being constrained to have the same pose, which can be learned by an auxiliary network. We add the regularization term $\mathcal{L}_{\text{pose}}$ to the diversity-sensitive loss $\mathcal{L}_{\text{div}}$, defined as

$$\mathcal{L}_{\text{pose}}(\theta)=\mathbb{E}_{\mathbf{z}^{s},\mathbf{z}^{a}\sim p_{z},\,\boldsymbol{\xi}\sim p_{\xi},\,\mathbf{c}\sim p_{r}}\left[1-\cos\left(D_{\psi}^{\xi}(\hat{\mathbf{I}}_{1})-D_{\psi}^{\xi}(\hat{\mathbf{I}}_{2})\right)\right], \qquad (8)$$

where $D_{\psi}^{\xi}$ is the auxiliary pose estimator network that we train for the pose penalty loss jointly with the discriminator.

The proposed method simultaneously learns the output images' poses by training the pose estimator network. We modify our discriminator network to contain an auxiliary pose estimator by adjusting the channel size of the last layer so that it estimates the camera pose of the output image. Because we randomly sample camera poses $\boldsymbol{\xi}$ from the prior distribution $p_{\xi}$ to generate view-consistent images, the sampled camera pose is used as the ground truth when training the pose estimator. We define the camera pose $\boldsymbol{\xi}$ with radius $r_{cam}$, rotation angle $\kappa_{r}\in[-\pi,\pi]$, and elevation angle $\kappa_{e}\in[0,\pi]$. Given that we use a fixed value of $r_{cam}=1$, the pose estimator predicts the rotation and elevation angles by applying the sigmoid function to the output values and multiplying them by $2\pi$ and $\pi$, respectively. The camera pose reconstruction loss is defined as

$$\mathcal{L}_{\text{pose}}(\psi)=\mathbb{E}_{\mathbf{z}^{s},\mathbf{z}^{a}\sim p_{z},\,\boldsymbol{\xi}\sim p_{\xi},\,\mathbf{c}\sim p_{r}}\left[1-\cos\left(D_{\psi}^{\xi}(\hat{\mathbf{I}})-\boldsymbol{\xi}_{\text{gt}}\right)\right], \qquad (9)$$

where $D_{\psi}^{\xi}(\hat{\mathbf{I}})=\boldsymbol{\xi}_{\text{pred}}=(\hat{\kappa}_{r},\hat{\kappa}_{e})$. $D_{\psi}^{\xi}$ denotes the auxiliary pose estimator, and $\boldsymbol{\xi}_{\text{gt}}$ is the randomly sampled camera pose used to generate $\hat{\mathbf{I}}$. Because an angle can be represented by a periodic function, we design the pose reconstruction loss with the cosine function so that it penalizes the angle difference while handling the discontinuity at $2\pi$.
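A compact sketch of how the diversity term (Eq. 7), the pose penalty (Eq. 8), and the pose reconstruction loss (Eq. 9) could be computed is given below; the pose estimator is assumed to output (rotation, elevation) in radians, and the helper names are ours.

```python
import torch

def pd_loss(img1, img2, pose_pred1, pose_pred2):
    """Pose-consistent diversity terms for two images rendered from the same camera pose.

    img1, img2:             (B, 3, H, W) images generated with different noise codes
    pose_pred1, pose_pred2: (B, 2) predicted (rotation, elevation) from the auxiliary
                            pose estimator D_psi^xi
    Returns (L_div, L_pose(theta)); the generator maximizes L_div and minimizes L_pose.
    """
    l_div = (img1 - img2).abs().mean()                           # Eq. (7), L1 pixel difference
    l_pose = (1.0 - torch.cos(pose_pred1 - pose_pred2)).mean()   # Eq. (8), pose penalty
    return l_div, l_pose

def pose_recon_loss(pose_pred, pose_gt):
    """Camera pose reconstruction loss for the pose estimator (Eq. 9)."""
    return (1.0 - torch.cos(pose_pred - pose_gt)).mean()

# Generator objective (Eq. 11): L_total = L_adv - lambda_div * l_div + lambda_pose * l_pose.
```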

Training Objective

To synthesize conditional outputs, we adopt a conditional GAN (Isola et al. 2017) by training a discriminator that learns to match images and condition feature vectors. As shown in Figure 1, the discriminator extracts an image feature through a series of 2D convolution layers, and the image feature is then concatenated with the matching condition $\mathbf{e}$ to predict the condition-image semantic consistency. The matching condition $\mathbf{e}\in\mathbb{R}^{L_{c}+L_{s}+L_{a}}$ is the global feature vector $\mathbf{c}$ concatenated with the detail codes $\mathbf{z}^{s}$ and $\mathbf{z}^{a}$. The number of feature-extracting layers is determined by the resolution of the training images. The discriminator network simultaneously learns whether the given image is real or fake and whether it matches its condition feature vector.

At training time, we use the non-saturating GAN loss with a matching-aware gradient penalty (Mescheder, Geiger, and Nowozin 2018; Tao et al. 2020). Instead of the $R_{1}$ gradient penalty (Mescheder, Geiger, and Nowozin 2018), we adopt the matching-aware gradient penalty, which encourages the generator to synthesize images that are more realistic and semantically consistent with the condition. We define three different types of data items: synthetic images with a matching condition, real images with a matching condition, and real images with a mismatching condition. The data points to which the gradient penalty is applied are real images with their matching condition feature vectors. The full conditional GAN loss is formulated as

$$\begin{aligned}
\mathcal{L}_{\text{adv}}(\psi)={}&\mathbb{E}_{\mathbf{I}\sim p_{r}}\left[f\left(D_{\psi}(\mathbf{I},\mathbf{e})\right)\right] \\
&+\tfrac{1}{2}\,\mathbb{E}_{\mathbf{I}\sim p_{mis}}\left[f\left(-D_{\psi}(\mathbf{I},\mathbf{e})\right)\right] \\
&+\tfrac{1}{2}\,\mathbb{E}_{\boldsymbol{\xi}\sim p_{\xi},\,\mathbf{e}\sim p_{r},\,p_{z}}\left[f\left(-D_{\psi}\left(G_{\theta}(\boldsymbol{\xi},\mathbf{c},\mathbf{z}^{s},\mathbf{z}^{a}),\mathbf{e}\right)\right)\right] \\
&+k\,\mathbb{E}_{\mathbf{I}\sim p_{r}}\left[\left(\left\|\nabla_{\mathbf{I}}D_{\psi}(\mathbf{I},\mathbf{e})\right\|+\left\|\nabla_{\mathbf{e}}D_{\psi}(\mathbf{I},\mathbf{e})\right\|\right)^{p}\right], \\
\mathcal{L}_{\text{adv}}(\theta)={}&\mathbb{E}_{\boldsymbol{\xi}\sim p_{\xi},\,\mathbf{e}\sim p_{r},\,p_{z}}\left[f\left(D_{\psi}\left(G_{\theta}(\boldsymbol{\xi},\mathbf{c},\mathbf{z}^{s},\mathbf{z}^{a}),\mathbf{e}\right)\right)\right],
\end{aligned} \qquad (10)$$

where $f(u)=-\log(1+\exp(-u))$. $p_{r}$ and $p_{mis}$ denote the real data distribution and the mismatching data distribution, respectively. $k$ and $p$ are two hyper-parameters that balance the gradient penalty effects.
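For illustration, the snippet below writes $f(u)$ and the matching-aware gradient penalty of Eq. (10) in a conventional minimization form; the hyper-parameter values and the sign bookkeeping are our own assumptions rather than the paper's exact training code.

```python
import torch
import torch.nn.functional as F

def f(u: torch.Tensor) -> torch.Tensor:
    # f(u) = -log(1 + exp(-u)) = -softplus(-u), the non-saturating GAN objective
    return -F.softplus(-u)

def matching_aware_grad_penalty(discriminator, real_img, match_cond, k=2.0, p=6.0):
    """Gradient penalty on real images paired with their matching condition e = cat(c, z_s, z_a).
    The values of k and p are placeholders; the paper treats them as tunable hyper-parameters."""
    real_img = real_img.detach().requires_grad_(True)
    match_cond = match_cond.detach().requires_grad_(True)
    out = discriminator(real_img, match_cond)
    g_img, g_cond = torch.autograd.grad(out.sum(), [real_img, match_cond], create_graph=True)
    penalty = (g_img.flatten(1).norm(dim=1) + g_cond.flatten(1).norm(dim=1)).pow(p).mean()
    return k * penalty

def discriminator_loss(d_real_match, d_real_mismatch, d_fake_match, penalty):
    """Discriminator terms of Eq. (10), written as a quantity to minimize."""
    return -(f(d_real_match).mean()
             + 0.5 * f(-d_real_mismatch).mean()
             + 0.5 * f(-d_fake_match).mean()) + penalty

def generator_adv_loss(d_fake_match):
    """Adversarial term for the generator, to be minimized."""
    return -f(d_fake_match).mean()
```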

Our full training objective for the generator network $G_{\theta}$ is summarized as

$$\mathcal{L}_{\text{total}}=\mathcal{L}_{\text{adv}}-\lambda_{\text{div}}\mathcal{L}_{\text{div}}+\lambda_{\text{pose}}\mathcal{L}_{\text{pose}}, \qquad (11)$$

where $\lambda_{\text{div}}$ and $\lambda_{\text{pose}}$ are weights for each loss term.

[Figure 2 layout: Input | Average | Random Shape / Appearance]
Figure 2: Qualitative results with various condition inputs. For each condition type, the average output image generated with zero-value noise codes and output images generated from five different shape noise codes (row 1) and appearance noise codes (row 2) are visualized.

Experiments

Dataset setups

We evaluate CG-NeRF on various datasets: CelebA-HQ (Karras et al. 2017), CUB-200 (Wah et al. 2011), and Cats (Zhang, Sun, and Tang 2008). Because the proposed method generates output images given extra input information, unlike NeRF-based generative models (Chan et al. 2021; Niemeyer and Geiger 2021c; Schwarz et al. 2020), we evaluate our method only on the test set for a fair comparison. For the condition inputs, we select five data forms that cover different properties in terms of shape and appearance: color images, grayscale images, sketches, low-resolution images, and text. To generate 3D-aware images from sketch conditions, we first apply a Sobel filter to extract pseudo sketch information from the image (Richardson et al. 2021), and then apply a sketch simplification method (Simo-Serra et al. 2016). For low-resolution image conditions, we apply bilinear downsampling to images with a ratio of 1/16. Training images are resized to a resolution of 128×128. To extract the global feature of only the object, we remove the background for the CelebA-HQ and CUB-200 datasets.
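As a rough sketch of this preprocessing (omitting the sketch simplification step of Simo-Serra et al. 2016), the snippet below derives a pseudo sketch with a Sobel filter and a low-resolution condition with bilinear downsampling; kernel sizes and normalization are assumptions.

```python
import cv2
import numpy as np
import torch
import torch.nn.functional as F

def pseudo_sketch(image_bgr: np.ndarray) -> np.ndarray:
    """Extract pseudo sketch information with a Sobel filter (before sketch simplification)."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)
    edges = cv2.magnitude(gx, gy)
    edges = 255.0 - cv2.normalize(edges, None, 0, 255, cv2.NORM_MINMAX)  # dark strokes on white
    return edges.astype(np.uint8)

def low_resolution(image: torch.Tensor, ratio: float = 1.0 / 16) -> torch.Tensor:
    """Bilinear downsampling with a ratio of 1/16, as used for the low-resolution condition."""
    return F.interpolate(image, scale_factor=ratio, mode="bilinear", align_corners=False)

# img = cv2.imread("face.png")                          # hypothetical 128x128 training image
# sketch = pseudo_sketch(img)
# lr = low_resolution(torch.from_numpy(img).permute(2, 0, 1)[None].float() / 255.0)
```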

Experimental results

Dataset / Method (setting)                               Image Resolution   FID↓    Precision↑   Recall↑
CelebA
  GRAF (with background)                                 128                53.13   0.75         0.01
  GIRAFFE (with background)                              64                 25.39   0.88         0.10
  pi-GAN (with background)                               128                20.92   0.75         0.49
CelebA-HQ
  GRAF (with background)                                 256                64.35   0.57         0.00
  GIRAFFE (with background)                              256                38.16   0.69         0.03
  Ours (without background, conditioned on color image)  128                7.01    0.90         0.55
Cats
  GRAF (with background)                                 64                 13.73   0.86         0.20
  GIRAFFE (with background)                              64                 16.05   0.74         0.37
  pi-GAN (with background)                               128                22.57   0.61         0.25
  Ours (with background, conditioned on color image)     128                13.86   0.91         0.52
CUB-200
  GRAF (without background)                              64                 41.65   0.80         0.09
  Ours (without background, conditioned on text)         128                26.53   0.82         0.22
Table 2: Quantitative comparison in terms of FID, precision, and recall. A low FID score means that the distribution of the generated image is close to that of the real image in terms of the mean and standard deviation. A high precision score implies that the generated image is realistic, and a high recall score indicates that the generated images capture greater variation of the real images. To guarantee the optimal performance of each algorithm, we measure the performance based on publicly available pre-trained models from previous work.
Figure 3: Comparison of qualitative results with previous studies on the CelebA (pi-GAN), CelebA-HQ (GRAF, GIRAFFE, Ours), and Cats datasets. For each dataset, the distance between the generated and the real image increases from left to right. For the Cats dataset, since black cat images dominate the top ranks, we sample results with a larger rank interval than for the CelebA (CelebA-HQ) dataset.

To the best of our knowledge, no previous work on the conditional generative NeRF task has been published. Hence, we perform quantitative and qualitative comparisons of our model with existing NeRF-based generative models (Schwarz et al. 2020; Niemeyer and Geiger 2021c; Chan et al. 2021) to demonstrate the competitive performance of the proposed method.

Quantitative comparison

To evaluate our approach quantitatively, we measure three metrics: the Fréchet Inception Distance (FID) (Heusel et al. 2017), precision, and recall, using publicly available libraries (https://github.com/toshas/torch-fidelity, https://github.com/clovaai/generative-evaluation-prdc) (Obukhov et al. 2020; Naeem et al. 2020). FID is the most popular metric for evaluating the quality of GANs, as it reveals the discrepancy between the distributions of real and fake images. Precision and recall, on the other hand, measure the quality of GANs in terms of fidelity and diversity, respectively. As reported in Table 2, to guarantee the most reliable performance of the previous methods, we evaluate the comparison results using publicly available pre-trained models and their corresponding experimental settings. Because these visual quality metrics are highly sensitive to the measurement tools, with inconsistent implementations across different image processing libraries (Parmar, Zhang, and Zhu 2021), we evaluate all of the comparison methods using the same measurement tools. Based on the performance we measured, the proposed method shows better scores in terms of FID, precision, and recall compared to the existing methods for the most part. For the Cats dataset, which has results for all four methods, our method still produces competitive performance on FID as well as the best performance on precision and recall.
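As a usage illustration of the cited libraries, the snippet below computes FID with torch-fidelity and precision/recall with the prdc package; the folder paths, pre-extracted feature files, and nearest_k value are illustrative assumptions.

```python
# FID via torch-fidelity (Obukhov et al. 2020) and precision/recall via prdc (Naeem et al. 2020).
# pip install torch-fidelity prdc
import numpy as np
import torch_fidelity
from prdc import compute_prdc

metrics = torch_fidelity.calculate_metrics(
    input1="generated_images/",   # folder of generated images (hypothetical path)
    input2="real_test_images/",   # folder of real test images (hypothetical path)
    fid=True,
)
print("FID:", metrics["frechet_inception_distance"])

# Precision/recall are computed on feature embeddings (e.g., Inception features) of both sets.
real_feats = np.load("real_features.npy")    # (N, D), hypothetical pre-extracted features
fake_feats = np.load("fake_features.npy")    # (M, D)
prdc = compute_prdc(real_features=real_feats, fake_features=fake_feats, nearest_k=5)
print("precision:", prdc["precision"], "recall:", prdc["recall"])
```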

Qualitative comparison

Figure 3 compares our method with other NeRF-based generative models in terms of visual quality. For a fair comparison, following the definition of precision (Kynkäänniemi et al. 2019), we select images in order of the closest distance to the real image among fake images lying in the manifold of the real images. The distance is measured using features of the real and fake images in Euclidean space, due to the high dimensionality of the image and the lack of semantics in the RGB space. To display diverse images across all the methods, images are sampled at spaced rank intervals. Our method shows competitive visual quality on both the CelebA (CelebA-HQ) and Cats datasets (Fig. 3).

CelebA-HQ        FID↓    IS↑
Color Image      7.01    2.14
Grayscale        7.23    2.12
Sketch           7.01    2.16
Low-Resolution   7.91    2.05
Text             7.31    2.13

Cats             FID↓    IS↑
Color Image      13.86   2.06
Grayscale        12.51   2.02
Low-Resolution   19.40   2.13

CUB-200          FID↓    IS↑
Text             26.53   3.52
Table 3: Quantitative comparisons (FID / IS) on the CelebA-HQ, Cats, and CUB-200 datasets with different condition types in terms of the image quality.

Effects of various condition types

In this section, we perform experiments to analyze the training behavior of our method depending on the input condition type. We compare the results with five different types of condition input to validate that our method yields consistent generation performance. As shown in Fig. 2, as the color image has the largest amount of condition information among the five different condition types, it restricts the range of style variation of output images generated with random noise codes. In contrast, weak conditions such as text or low-resolution images show dynamic changes in their results with random shapes or appearances. To evaluate our approach in terms of condition types quantitatively, we measure the FID  (Heusel et al. 2017) and Inception Score (IS)  (Salimans et al. 2016) as shown in Table 3. For each dataset, our method consistently maintains high visual quality across all types of input conditions.

Analysis of Experiments

(a) Trained without PD loss (b) Trained with PD loss
Figure 4: Qualitative analysis of the PD loss. Along with the condition inputs, which are marked with red rectangles (grayscale in row 1, sketch in row 2), eleven output images generated with different noise codes are visualized.
                  without PD loss           with PD loss
Condition Type    Precision↑   Recall↑      Precision↑   Recall↑
Color Image       0.899        0.520        0.900        0.550
Grayscale         0.897        0.532        0.900        0.536
Sketch            0.904        0.547        0.892        0.567
Low-Resolution    0.910        0.497        0.896        0.514
Text              0.898        0.489        0.891        0.510
Average           0.902        0.517        0.895        0.535
Table 4: Effect of the PD loss on precision and recall for measuring the fidelity and diversity, respectively, on the CelebA-HQ Dataset.

Enhanced Diversity

Because the PD loss proposed in this paper is intended to improve the diversity of the generated images, we analyze its effect using recall and precision measurements. As shown in Table 4, applying the PD loss improves the recall value by about 3.5% while decreasing precision by about 0.77% on average, indicating minimal degradation of visual quality. In addition, recall improves for all conditions; in particular, for the color and grayscale condition settings, both precision and recall improve. This result shows that applying the PD loss can increase diversity while maintaining similar fidelity. Fig. 4 visualizes a qualitative comparison of cases with and without the PD loss. The PD loss encourages the model to generate more diverse images, not only in hair and skin color but also in illumination.

Pose Penalty

To validate the importance of the pose penalty when the diversity-sensitive loss (Yang et al. 2019a) is used with our method, we conduct an ablation study on its effect. As shown in Fig. 5 (a), the diversity-sensitive loss alone prevents the network from properly learning the canonical views of objects. This implies that the model maximizes the pixel-level difference by changing the poses of the output images, which is an undesirable effect. With the PD loss, the network properly learns to maximize the style difference while maintaining the pose. For a quantitative validation, we measure the head poses of randomly generated canonical-view images using a pre-trained head pose estimator (Yang et al. 2019b). As shown in Fig. 5 (b), view consistency is maintained far better with the pose penalty than without it, as indicated by the lower standard deviation of angles among images generated for the same view. Note that the difference in the standard deviation of the rotation angle is larger than that of the elevation angle because the prior camera pose distribution has a broader range for the rotation angle.

(a) Trained without and with a pose-penalty
(b) Standard deviations of head poses.
Figure 5: Effects of the pose penalty when the diversity-sensitive loss is attached during training. As shown in (a), for the result trained without the pose penalty, the canonical view varies as different shape noise codes are sampled. In contrast, the result trained with the pose penalty maintains the canonical view across different shape noise codes. (b) shows the standard deviation of head poses of randomly generated canonical-view images.

Results of CUB-200

Fig. 6 shows qualitative results on the CUB-200 dataset for the text input condition. Our proposed model successfully utilizes the contextual information in the given text input to generate conditional multi-view images. However, for most existing NeRF-based generative models, we empirically find that the visual quality degrades on the CUB-200 dataset in a certain range of viewpoints. We suppose this performance degradation comes from a large discrepancy between the prior camera pose distribution and the real one, as described in (Niemeyer and Geiger 2021b). We plan to address this issue in future work.

[Figure 6 layout: input text (left) and multi-view output images (right).]
Figure 6: Multi-view output images on the CUB-200 dataset.

Conclusion

In this paper, we propose a novel conditional generative model called CG-NeRF, which takes the existing generative NeRF to the next level. CG-NeRF creates photorealistic view-consistent images reflecting the condition input, such as sketches or text. Our framework effectively extracts both the shape and appearance from the condition and generates diverse images by adding details through noise codes. In addition, we propose the PD loss to enhance the variety of generated images while maintaining view consistency. Experimental results demonstrate that our method achieves state-of-the-art performance qualitatively and quantitatively based on the quality metrics of FID, precision, and recall. In addition, the proposed method generates various images reflecting the properties of the condition types in terms of the shape and appearance.

References

  • Chan et al. (2021) Chan, E. R.; Monteiro, M.; Kellnhofer, P.; Wu, J.; and Wetzstein, G. 2021. pi-gan: Periodic implicit generative adversarial networks for 3d-aware image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5799–5809.
  • Hedman et al. (2021) Hedman, P.; Srinivasan, P. P.; Mildenhall, B.; Barron, J. T.; and Debevec, P. 2021. Baking Neural Radiance Fields for Real-Time View Synthesis. arXiv:2103.14645.
  • Heusel et al. (2017) Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; and Hochreiter, S. 2017. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30.
  • Isola et al. (2017) Isola, P.; Zhu, J.-Y.; Zhou, T.; and Efros, A. A. 2017. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, 1125–1134.
  • Jain, Tancik, and Abbeel (2021) Jain, A.; Tancik, M.; and Abbeel, P. 2021. Putting NeRF on a Diet: Semantically Consistent Few-Shot View Synthesis. arXiv:2104.00677.
  • Karras et al. (2017) Karras, T.; Aila, T.; Laine, S.; and Lehtinen, J. 2017. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196.
  • Kynkäänniemi et al. (2019) Kynkäänniemi, T.; Karras, T.; Laine, S.; Lehtinen, J.; and Aila, T. 2019. Improved precision and recall metric for assessing generative models. arXiv preprint arXiv:1904.06991.
  • Li et al. (2021) Li, Z.; Niklaus, S.; Snavely, N.; and Wang, O. 2021. Neural scene flow fields for space-time view synthesis of dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6498–6508.
  • Lindell, Martel, and Wetzstein (2021) Lindell, D. B.; Martel, J. N.; and Wetzstein, G. 2021. Autoint: Automatic integration for fast neural volume rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 14556–14565.
  • Mescheder, Geiger, and Nowozin (2018) Mescheder, L.; Geiger, A.; and Nowozin, S. 2018. Which training methods for GANs do actually converge? In International conference on machine learning, 3481–3490. PMLR.
  • Mildenhall et al. (2020) Mildenhall, B.; Srinivasan, P. P.; Tancik, M.; Barron, J. T.; Ramamoorthi, R.; and Ng, R. 2020. Nerf: Representing scenes as neural radiance fields for view synthesis. In European conference on computer vision, 405–421. Springer.
  • Naeem et al. (2020) Naeem, M. F.; Oh, S. J.; Uh, Y.; Choi, Y.; and Yoo, J. 2020. Reliable Fidelity and Diversity Metrics for Generative Models.
  • Niemeyer and Geiger (2021a) Niemeyer, M.; and Geiger, A. 2021a. CAMPARI: Camera-Aware Decomposed Generative Neural Radiance Fields. arXiv:2103.17269.
  • Niemeyer and Geiger (2021b) Niemeyer, M.; and Geiger, A. 2021b. CAMPARI: Camera-Aware Decomposed Generative Neural Radiance Fields. arXiv preprint arXiv:2103.17269.
  • Niemeyer and Geiger (2021c) Niemeyer, M.; and Geiger, A. 2021c. GIRAFFE: Representing scenes as compositional generative neural feature fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 11453–11464.
  • Obukhov et al. (2020) Obukhov, A.; Seitzer, M.; Wu, P.-W.; Zhydenko, S.; Kyl, J.; and Lin, E. Y.-J. 2020. High-fidelity performance metrics for generative models in PyTorch. Version: 0.3.0, DOI: 10.5281/zenodo.4957738.
  • Parmar, Zhang, and Zhu (2021) Parmar, G.; Zhang, R.; and Zhu, J.-Y. 2021. On Buggy Resizing Libraries and Surprising Subtleties in FID Calculation. arXiv preprint arXiv:2104.11222.
  • Radford et al. (2021a) Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; Krueger, G.; and Sutskever, I. 2021a. Learning Transferable Visual Models From Natural Language Supervision. arXiv:2103.00020.
  • Radford et al. (2021b) Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. 2021b. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020.
  • Richardson et al. (2021) Richardson, E.; Alaluf, Y.; Patashnik, O.; Nitzan, Y.; Azar, Y.; Shapiro, S.; and Cohen-Or, D. 2021. Encoding in style: a stylegan encoder for image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2287–2296.
  • Salimans et al. (2016) Salimans, T.; Goodfellow, I.; Zaremba, W.; Cheung, V.; Radford, A.; and Chen, X. 2016. Improved techniques for training gans. Advances in neural information processing systems, 29: 2234–2242.
  • Schwarz et al. (2020) Schwarz, K.; Liao, Y.; Niemeyer, M.; and Geiger, A. 2020. GRAF: Generative radiance fields for 3d-aware image synthesis. arXiv preprint arXiv:2007.02442.
  • Simo-Serra et al. (2016) Simo-Serra, E.; Iizuka, S.; Sasaki, K.; and Ishikawa, H. 2016. Learning to simplify: fully convolutional networks for rough sketch cleanup. ACM Transactions on Graphics (TOG), 35(4): 1–11.
  • Sitzmann et al. (2020) Sitzmann, V.; Martel, J.; Bergman, A.; Lindell, D.; and Wetzstein, G. 2020. Implicit neural representations with periodic activation functions. Advances in Neural Information Processing Systems, 33.
  • Srinivasan et al. (2021) Srinivasan, P. P.; Deng, B.; Zhang, X.; Tancik, M.; Mildenhall, B.; and Barron, J. T. 2021. NeRV: Neural Reflectance and Visibility Fields for Relighting and View Synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • Su et al. (2021) Su, S.-Y.; Yu, F.; Zollhoefer, M.; and Rhodin, H. 2021. A-NeRF: Surface-free Human 3D Pose Refinement via Neural Rendering. arXiv preprint arXiv:2102.06199.
  • Tao et al. (2020) Tao, M.; Tang, H.; Wu, S.; Sebe, N.; Jing, X.-Y.; Wu, F.; and Bao, B. 2020. Df-gan: Deep fusion generative adversarial networks for text-to-image synthesis. arXiv preprint arXiv:2008.05865.
  • Wah et al. (2011) Wah, C.; Branson, S.; Welinder, P.; Perona, P.; and Belongie, S. 2011. The caltech-ucsd birds-200-2011 dataset.
  • Wang et al. (2021) Wang, Z.; Wu, S.; Xie, W.; Chen, M.; and Prisacariu, V. A. 2021. NeRF–: Neural Radiance Fields Without Known Camera Parameters. arXiv preprint arXiv:2102.07064.
  • Xian et al. (2021) Xian, W.; Huang, J.-B.; Kopf, J.; and Kim, C. 2021. Space-time neural irradiance fields for free-viewpoint video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9421–9431.
  • Xie et al. (2021) Xie, C.; Park, K.; Martin-Brualla, R.; and Brown, M. 2021. FiG-NeRF: Figure-Ground Neural Radiance Fields for 3D Object Category Modelling. arXiv preprint arXiv:2104.08418.
  • Xu et al. (2015) Xu, B.; Wang, N.; Chen, T.; and Li, M. 2015. Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853.
  • Xu et al. (2018) Xu, T.; Zhang, P.; Huang, Q.; Zhang, H.; Gan, Z.; Huang, X.; and He, X. 2018. Attngan: Fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, 1316–1324.
  • Yang et al. (2019a) Yang, D.; Hong, S.; Jang, Y.; Zhao, T.; and Lee, H. 2019a. Diversity-sensitive conditional generative adversarial networks. arXiv preprint arXiv:1901.09024.
  • Yang et al. (2019b) Yang, T.-Y.; Chen, Y.-T.; Lin, Y.-Y.; and Chuang, Y.-Y. 2019b. Fsa-net: Learning fine-grained structure aggregation for head pose estimation from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1087–1096.
  • Yu et al. (2021) Yu, A.; Ye, V.; Tancik, M.; and Kanazawa, A. 2021. pixelnerf: Neural radiance fields from one or few images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4578–4587.
  • Zhang et al. (2020) Zhang, K.; Riegler, G.; Snavely, N.; and Koltun, V. 2020. Nerf++: Analyzing and improving neural radiance fields. arXiv preprint arXiv:2010.07492.
  • Zhang, Sun, and Tang (2008) Zhang, W.; Sun, J.; and Tang, X. 2008. Cat head detection-how to effectively exploit shape and texture features. In European Conference on Computer Vision, 802–816. Springer.
  • Zhi et al. (2021) Zhi, S.; Laidlow, T.; Leutenegger, S.; and Davison, A. J. 2021. In-Place Scene Labelling and Understanding with Implicit Scene Representation. arXiv preprint arXiv:2103.15875.