Neural Rendering of Humans in Novel View and Pose from Monocular Video
Abstract
We introduce a new method that generates photo-realistic humans under novel views and poses given a monocular video as input. Despite significant recent progress on this topic, with several methods exploring shared canonical neural radiance fields in dynamic scene scenarios, learning a user-controlled model for unseen poses remains a challenging task. To tackle this problem, we introduce an effective method to a) integrate observations across several frames and b) encode the appearance at each individual frame. We accomplish this by utilizing both the human pose that models the body shape as well as point clouds that partially cover the human as input. Our approach simultaneously learns a shared set of latent codes anchored to the human pose among several frames, and an appearance-dependent code anchored to incomplete point clouds generated by each frame and its predicted depth. The former human pose-based code models the shape of the performer, whereas the latter point cloud-based code predicts fine-level details and reasons about missing structures at unseen poses. To further recover non-visible regions in query frames, we employ a temporal transformer to integrate features of points in query frames and tracked body points from automatically-selected key frames. Experiments on various sequences of dynamic humans from different datasets, including ZJU-MoCap, show that our method significantly outperforms existing approaches under unseen poses and novel views given monocular videos as input.

1 Introduction
We set out to develop a method that generates photo-realistic humans under novel views and unseen poses from monocular RGB videos. To represent static scenes, neural radiance fields (NeRF) [29] learn an implicit representation using neural networks, which has enabled photo-realistic rendering of shape and appearance from images. With dense multi-view observations as input, NeRF encodes density and color as a function of 3D coordinates and viewing directions using MLPs, along with a differentiable renderer to synthesize novel views. While it shows unprecedented visual quality on static scenes, applying it to high-quality free-viewpoint rendering of humans in dynamic videos remains a challenging task. Aiming to generalize NeRF to dynamic videos, D-NeRF [36] encodes a time step to differentiate motions across frames and converts scenes from the observation space to a shared canonical space to model the neural radiance field. As such, it can handle dynamic scenes to some extent, but the poses remain uncontrollable by users. Furthermore, some approaches [25, 35] introduce the human pose as an additional input to serve as geometric guidance for different frames. However, they either cannot generalize to novel poses or need more than one input view.
To overcome these limitations, we propose a novel approach by learning implicit radiance fields based on pose and appearance representations for high fidelity novel view and pose synthesis. We leverage the human pose extracted from the parametric body model as a geometric prior to model motion information across frames. Shared latent codes anchored to the human poses are optimized, so that they integrate information across frames. However, a model that only formulates latent codes in a shared space will not generalize well to unseen poses without test-time optimization of the latent codes. To address this, we propose to model the appearance information by utilizing single-view depth information obtained by a depth estimation network. Our model learns the appearance code anchored to incomplete point clouds in the 3D space. Point clouds are obtained by using single-view depth information to lift the RGB image to the 3D space, which provides partial information of the visible parts of the human body. The learned implicit representation enables reasoning of the unknown regions and complements the missing details on the human body.
To further leverage the temporal information from multiple frames, we introduce a temporal transformer that aggregates the trackable information. We utilize the parametric body model to track points from the query frame to the key frames. Following that, based on the learned implicit representation, we extract the pose code across frames and feed it into the temporal transformer for feature aggregation. Our method is extensively evaluated against state-of-the-art techniques on several sequences of humans in motion and exhibits significantly higher rendering quality under new views and unseen poses. In addition, we reconstruct fine-level details such as cloth wrinkles and hand details at a resolution and fidelity that several prior top-performing methods such as NeuralBody [35] or HumanNeRF [46] fail to recover (Figure 1). The contributions of this work are:
- A new novel view synthesis framework that shows significant improvement on unseen poses compared to existing video methods, with high-fidelity reconstruction of fine-level facial, cloth and body details.
- We combine pose and appearance representations by modeling shared information across frames and specific information at each individual frame. These two representations help generalize better to novel poses compared to only utilizing the pose representation.
- A temporal transformer is introduced to combine information across frames, which helps to recover non-visible details in the query frame (at unseen views).
2 Related Work
3D Neural Representations. Early 3D shape representation works can be classified into three categories: point-based methods [1, 37], voxel-based methods [5, 47] and mesh-based methods [2, 13, 50].

Implicit representations have since been used to represent shapes as a continuous surface geometry, taking spatial coordinates as input and outputting signed distances or occupancy values. With advances in differentiable rendering methods, geometry and appearance can be learned from multi-view observations. Related works can be categorized into static [26, 29, 40, 41, 51, 52] and dynamic scenes [4, 8, 9, 10, 15, 17, 20, 22, 31, 32, 33, 34, 35, 36, 39, 44].
Static Scene Representations. SRN [41] represents scenes as continuous functions that map 3D coordinates to a feature representation of local scene properties and renders images with a differentiable ray-marching algorithm. NSVF [26] utilizes a sparse voxel octree to represent a set of voxel-bounded implicit fields; a differentiable ray-marching operation is adopted to render views from a set of posed RGB images. NeRF [29] optimizes a neural radiance field for a scene, which maps 3D coordinates and viewing directions to density and color using a neural network. While NeRF can render photo-realistic images given dense images as input, it is limited mostly to static scenes.
Dynamic Scene Representations. Dynamic NeRFs [35, 36] extend NeRF to dynamic scenes by introducing a latent deformation field or human poses. NeuralBody [35] proposes a set of latent codes shared across all frames and anchored to a human body model in order to replay character motions from arbitrary viewpoints under training poses. HumanNeRF [46] and A-NeRF [42] learn motion information by combining skeletal and non-rigid transformations. For these methods, the synthesis fails under novel poses. A human-pose-based representation can model the body shape at any time step but fails to capture detailed appearance. To overcome this problem, we propose to construct an appearance-based representation by utilizing 2D features anchored to the point clouds as an input.
Dynamic Scene Fusion. To model the temporal cues across frames, previous works [7, 20, 22, 48, 53] combine motion information and introduce animatable avatar approaches [11, 12, 16, 21, 27, 28, 42, 43, 45, 46, 49, 55, 56]. Some of these approaches rely on keypoints [28], correspondences [21] or vertex normal alignment [49] to generate details. Li et al. [22] learn dense scene flow fields that predict 3D offset vectors from a point at time $t$ to the same point at times $t-1$ and $t+1$; the offsets are implicitly supervised with 2D optical flow. Kwon et al. [20] employ a temporal transformer to integrate skeletal features across different frames: the vertices of the human body are first reprojected to the 2D plane and image features are then sampled to obtain the skeletal features. Although both [20] and our method use a temporal transformer, the way we use the transformer is substantially different. Kwon et al. [20] use the transformer to combine pixel-aligned skeletal features obtained by projecting the vertices to the 2D image plane and then sampling from images using bilinear interpolation. They require multiple views as input, since inaccurate features are extracted when projecting the 3D vertices into a single view. Instead of combining skeletal features, we propose to use a transformer to combine the pose codes of any 3D point and its tracked points. Our method optimizes 3D feature volumes and does not require multi-view inputs.
3 Methodology
Given a monocular video of a human in motion, we synthesize free-viewpoint videos of the person under novel views and new poses. During training, multi-view videos are utilized to train our pipeline. We denote the set of input video frames as $\{I_t\}_{t=1}^{T}$, where $t$ represents the frame index and $T$ is the number of frames. To avoid the influence of background changes due to the camera movement, we remove the background with the mask estimated by [23] and only focus on the human in the foreground. The overview of our approach is illustrated in Figure 2.
3.1 Neural Radiance Fields
NeRF [29] represents a static scene as a radiance field and renders color using volume rendering [18]. It takes a 3D location $\mathbf{x}$ and a 2D viewing direction $\mathbf{d}$ as input and outputs the color $\mathbf{c}$ and volume density $\sigma$ with a network $F_{\Theta}$ for any 3D point:
$(\mathbf{c}, \sigma) = F_{\Theta}\big(\gamma_x(\mathbf{x}), \gamma_d(\mathbf{d})\big)$  (1)
$\gamma_d$ and $\gamma_x$ are the positional encoding functions for the viewing direction and the spatial location, respectively. To render the pixel color, NeRF uses the volume rendering integral by accumulating volume densities and colors of all sampled points along the ray. Let $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$ be the camera ray emitted from the center of projection $\mathbf{o}$ to a pixel on the image. The pixel color bounded by the near and far bounds $t_n$ and $t_f$ is given by:
$C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t), \mathbf{d})\,dt, \qquad T(t) = \exp\!\Big(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\,ds\Big)$  (2)
where $T(t)$ denotes the accumulated transmittance along the ray from $t_n$ to $t$. NeRF is scene-specific with known camera parameters and renders photo-realistic novel views of static scenes given dense input images. To extend NeRF to model dynamic humans, we propose to learn an implicit representation of the shape and appearance of the human. Specifically, we introduce a pose-conditioned representation shared by all frames and an appearance-conditioned representation specific to each frame.
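To make the rendering step concrete, below is a minimal sketch of the discretized form of Eq. (2) (alpha compositing of per-sample densities and colors along a ray); the tensor names and the sample count are illustrative assumptions rather than the exact implementation.

```python
# Minimal sketch of the discretized volume-rendering integral of Eq. (2).
import torch

def render_ray(densities, colors, deltas):
    """Composite per-sample (density, color) pairs along one ray.

    densities: (N,) non-negative sigma values at the sampled points.
    colors:    (N, 3) RGB values at the sampled points.
    deltas:    (N,) distances between consecutive samples along the ray.
    """
    alpha = 1.0 - torch.exp(-densities * deltas)                 # opacity of each sample
    # Accumulated transmittance T_i = prod_{j < i} (1 - alpha_j)
    trans = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alpha + 1e-10])[:-1], dim=0)
    weights = alpha * trans                                      # contribution of each sample
    rgb = (weights[:, None] * colors).sum(dim=0)                 # final pixel color
    return rgb, weights

# Example: 64 samples on one ray with placeholder network outputs.
sigma = torch.rand(64)
rgb_samples = torch.rand(64, 3)
deltas = torch.full((64,), 2.0 / 64)
pixel_rgb, w = render_ray(sigma, rgb_samples, deltas)
```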
3.2 Pose-conditioned Representation
Following [34, 35], we assume the 3D human model is given for each frame (i.e., we use the pre-computed available SMPL body fits or perform body fitting at each frame as a pre-processing step). We first extract the vertices from the posed 3D mesh and aim to learn a set of pose codes $Z = \{z_1, z_2, \ldots, z_{N_v}\}$ anchored to the vertices of the human body model, where $N_v$ denotes the number of codes and the dimension of each pose code is set to 16, similar to [35]. The implicit representation is then learned by forwarding the pose code into a neural network, which represents the geometry and shape of a human performer. The pose space is shared across all frames, which can be treated as a common canonical space and enables the representation of a dynamic human based on the NeRF. Finally, a neural network predicts the density and color for any 3D point and volume rendering is used to render per-pixel RGB values.
The pose codes anchored to the body model are relatively sparse in the 3D space and, as such, directly calculating the pose codes using trilinear interpolation would lead to less effective features for most points. During our experimental investigation, we identified that a SparseConvNet is the right design choice, as it propagates the codes defined on the mesh surface to the nearby 3D space. The SparseConvNet encodes the pose codes anchored to the vertices, and the pose codes are optimized during training. To acquire the pose code for each point sampled along the camera ray, we use trilinear interpolation to query the code at continuous 3D locations. The resulting pose code for the $i$-th point at frame $t$ is then fed to a neural network to predict the density and color. The pose codes learned in the shared space of all frames model the human shape well in both known and unseen poses. However, the synthesized views still lack details under novel poses without optimizing the pose codes. Hence, to model per-frame details, we propose an appearance-conditioned implicit representation using the monocular image and its predicted depth as the reference inputs.
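Below is a minimal sketch of this pose-code pipeline, assuming a coarse feature grid: per-vertex codes are scattered into a volume, diffused to nearby space (a dense Conv3d stands in for the SparseConvNet), and queried at continuous locations with trilinear interpolation. Grid size, code dimensions and helper names are illustrative.

```python
# Sketch: latent codes on body-model vertices, diffused into a volume and queried.
import torch
import torch.nn.functional as F

N_VERTS, CODE_DIM, GRID = 6890, 16, 32
pose_codes = torch.randn(N_VERTS, CODE_DIM) * 0.01     # in practice an nn.Parameter optimized jointly
diffuse = torch.nn.Conv3d(CODE_DIM, CODE_DIM, kernel_size=3, padding=1)  # stand-in for SparseConvNet

def build_volume(verts, codes, bb_min, bb_max):
    """Scatter per-vertex codes into a (1, C, D, H, W) feature volume and diffuse them."""
    idx = ((verts - bb_min) / (bb_max - bb_min) * (GRID - 1)).long().clamp(0, GRID - 1)
    vol = torch.zeros(CODE_DIM, GRID, GRID, GRID)
    vol[:, idx[:, 2], idx[:, 1], idx[:, 0]] = codes.t()
    return diffuse(vol[None])

def query_pose_code(volume, pts, bb_min, bb_max):
    """Trilinearly interpolate the volume at continuous 3D points (M, 3)."""
    grid = (pts - bb_min) / (bb_max - bb_min) * 2 - 1    # normalize to [-1, 1]
    grid = grid[None, None, None]                        # (1, 1, 1, M, 3) for grid_sample
    feat = F.grid_sample(volume, grid, align_corners=True)   # (1, C, 1, 1, M)
    return feat[0, :, 0, 0].t()                          # (M, C) pose codes

verts = torch.rand(N_VERTS, 3)                           # posed body vertices (placeholder)
bb_min, bb_max = verts.min(0).values, verts.max(0).values
vol = build_volume(verts, pose_codes, bb_min, bb_max)
codes_at_samples = query_pose_code(vol, torch.rand(1024, 3), bb_min, bb_max)
```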
3.3 Appearance-conditioned Representation
An image along with its predicted depth can serve as an appearance prior for the human body under a single view. To learn detailed information at each individual frame, we learn the appearance code anchored to the point clouds. The point clouds are obtained by lifting the RGB image to the 3D space using the depth image, which is generated by finetuning a state-of-the-art depth estimation method [54] on our dataset. In that way, the point clouds model the partially visible body of the human performer and capture details such as cloth wrinkles. Given a 2D pixel $p$ and its corresponding depth value $d_p$, the point cloud generation process is formulated as $X_t = \phi(p, d_p, P)$, where $X_t$ is the generated 3D point for frame $t$ and $\phi$ is the function generating a 3D point given a 2D pixel and a camera pose $P$. Different from the pose-conditioned latent codes that are shared across all frames, the proposed appearance-conditioned codes are anchored to the point clouds and are obtained from the pixel-aligned features extracted by the image encoder $E$. To take advantage of the rich semantic and detailed cues from images, we use ResNet34 [14] to encode the feature map of the input image $I_t$. Specifically, we first extract features from the ResNet34, which are passed to three Conv2D layers that reduce the dimension, followed by a SparseConvNet that encodes the features anchored to the sparse point clouds. To obtain the appearance code for each point sampled along the camera ray, we use trilinear interpolation to query the code at the continuous 3D locations; the interpolated feature is adopted as the appearance code for that point. The appearance code together with the pose code are forwarded into a neural network to predict the density and color. The appearance code learned on each single frame models the details on the human body and recovers some missing pixels in the 3D space.
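The lifting step can be sketched as follows with a standard pinhole back-projection; the intrinsics, the camera-to-world matrix and the foreground mask used here are illustrative assumptions rather than the exact preprocessing code.

```python
# Sketch: lifting a masked RGB frame to a partial point cloud with its predicted depth.
import numpy as np

def backproject(depth, rgb, K, cam_to_world, mask):
    """depth: (H, W), rgb: (H, W, 3), K: (3, 3), cam_to_world: (4, 4),
    mask: (H, W) bool foreground mask. Returns (N, 3) points and (N, 3) colors."""
    v, u = np.nonzero(mask)                              # pixel coordinates of the person
    z = depth[v, u]
    x = (u - K[0, 2]) * z / K[0, 0]                      # X = (u - cx) * Z / fx
    y = (v - K[1, 2]) * z / K[1, 1]                      # Y = (v - cy) * Z / fy
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)   # homogeneous camera coordinates
    pts_world = (cam_to_world @ pts_cam.T).T[:, :3]
    return pts_world, rgb[v, u]

# Usage with dummy data: a 512x512 frame and identity extrinsics.
K = np.array([[500.0, 0, 256], [0, 500.0, 256], [0, 0, 1]])
depth = np.full((512, 512), 2.0)
rgb = np.zeros((512, 512, 3))
mask = np.ones((512, 512), dtype=bool)
points, colors = backproject(depth, rgb, K, np.eye(4), mask)
```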
3.4 Temporal Fusion Module
Frames from different timesteps provide complementary information to the query frame and will be referred to as key frames. A temporal transformer then effectively integrates the features extracted from the query and key frames. To obtain the corresponding pixels in the key frames, we use the parametric body model of each frame to track the points.
Point Tracking. First, $K$ points are randomly sampled on each face of the mesh, which results in $K \times N_f$ points on the whole body surface, where $N_f$ represents the number of faces. We calculate the distance between a 3D point $x$ sampled on the camera ray and all points on the surface at the query frame $t$. We keep a sample for rendering the color only if it is close to the surface, i.e., $\min_{v \in S_t} \lVert x - v \rVert < \epsilon$, and obtain its nearest point on the surface at frame $t$, where $S_t$ is the set of sampled surface points. Furthermore, we track this surface point to the other frames via the body motion and assign the features of the tracked points to $x$.
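Since the posed meshes at all frames share the same topology, a sampled surface point can be stored as a face index plus barycentric coordinates and re-evaluated on each key-frame mesh. The sketch below illustrates this idea; the per-face sample count, the distance threshold and the helper names are assumptions, not the exact implementation.

```python
# Sketch: snap a ray sample to the nearest surface point and track it to a key frame.
import numpy as np

def sample_surface(verts, faces, k=4, seed=0):
    """Randomly sample k points per face; returns points, face ids and barycentrics."""
    rng = np.random.default_rng(seed)
    bary = rng.dirichlet(np.ones(3), size=(len(faces), k))       # (F, k, 3)
    tri = verts[faces]                                           # (F, 3, 3)
    pts = np.einsum('fkb,fbc->fkc', bary, tri).reshape(-1, 3)    # (F*k, 3)
    face_id = np.repeat(np.arange(len(faces)), k)
    return pts, face_id, bary.reshape(-1, 3)

def track_to_frame(x, query_verts, key_verts, faces, thresh=0.05):
    """Snap ray sample x to its nearest surface point at the query frame and return
    the corresponding 3D location on the key-frame mesh (or None if too far)."""
    pts, face_id, bary = sample_surface(query_verts, faces)
    d = np.linalg.norm(pts - x, axis=1)
    i = d.argmin()
    if d[i] > thresh:                                            # sample too far from the body
        return None
    tri_key = key_verts[faces[face_id[i]]]                       # same face at the key frame
    return bary[i] @ tri_key                                     # tracked 3D point

# Usage with a toy single-triangle "mesh" deforming between two frames.
faces = np.array([[0, 1, 2]])
v_query = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0]], dtype=float)
v_key = v_query + np.array([0.0, 0.0, 0.5])
tracked = track_to_frame(np.array([0.3, 0.3, 0.01]), v_query, v_key, faces)
```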
Key Frame Selection. We automatically select three key frames from the training frames. We first rotate the human pose of the query frame along the Y-axis by 90°, 180° and 270°, and calculate the distance between each training pose and each rotated pose as the distance between their body-mesh vertex coordinates. For each rotated pose, we keep the training frame with the $k$-NN smallest distance, where $k$ is set to 1, yielding three key frames in total.
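A minimal sketch of this selection heuristic is shown below; rotating about the body centroid and the array shapes are illustrative assumptions.

```python
# Sketch: pick one training frame per Y-axis rotation of the query pose.
import numpy as np

def rot_y(deg):
    a = np.deg2rad(deg)
    return np.array([[np.cos(a), 0, np.sin(a)],
                     [0, 1, 0],
                     [-np.sin(a), 0, np.cos(a)]])

def select_key_frames(query_verts, train_verts_list, angles=(90, 180, 270)):
    """query_verts: (V, 3); train_verts_list: list of (V, 3) vertex arrays per training frame."""
    center = query_verts.mean(0)
    keys = []
    for deg in angles:
        rotated = (query_verts - center) @ rot_y(deg).T + center   # rotate about the body centroid
        dists = [np.linalg.norm(tv - rotated) for tv in train_verts_list]
        keys.append(int(np.argmin(dists)))                         # 1-NN per rotation
    return keys

# Usage with random stand-in vertices (shrunk here for brevity).
train = [np.random.rand(100, 3) for _ in range(20)]
key_ids = select_key_frames(np.random.rand(100, 3), train)
```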
Temporal Fusion. After obtaining the pose codes from the four frames (three key frames and one query frame), a transformer-based structure [6] is introduced that takes these features as input and employs a multi-head attention mechanism along with an MLP for feature aggregation. The fusion module is described in Figure 2. The query pose code is combined with the key-frame pose codes using attention weights. Here we use $Q$, $K$ and $V$, generated by fully-connected layers, to represent the query, key and value. The query and the key are multiplied to calculate the attention map, which represents the correlation between the query pose code and the key pose codes. The attention map retrieves all key pose codes and combines them with the values by a weighted summation. Formally, the attention weight between a point in frame $t$ and its tracked point in frame $k$ is calculated by:
$A_{t,k} = \mathrm{softmax}\!\left(\frac{Q_t K_k^{\top}}{\sqrt{d}}\right)$  (3)
where $\sqrt{d}$ is a scaling factor based on the feature dimension $d$, and $\mathrm{softmax}(\cdot)$ denotes the softmax operation taken over the combined frames. The aggregated feature is then formulated as:
$\hat{F}_t = \sum_{k \in \mathcal{K}} A_{t,k}\, V_k$  (4)
where $\mathcal{K}$ denotes the index set of the combined frames. In this work, multi-head self-attention is adopted by running multiple self-attention operations in parallel, and the results from the different heads are integrated to obtain the final output. After the self-attention mechanism, each input feature contains its original information and also takes into account the information from all other frames. As such, the information from the key frames and the query frame is combined together. Average pooling is then employed to integrate all features, which serves as the output of the temporal fusion module. In our implementation, we do not adopt any positional encoding on the input feature sequence.
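A minimal sketch of such a fusion module is given below, assuming standard multi-head self-attention over the stacked query and key-frame codes followed by average pooling (no positional encoding); the feature dimension and layer sizes are illustrative, not the authors' exact module.

```python
# Sketch: multi-head self-attention over one query code and three key-frame codes.
import torch
import torch.nn as nn

class TemporalFusion(nn.Module):
    def __init__(self, dim=48, heads=3):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, codes):
        """codes: (B, F, C) pose codes for F frames (one query + the key frames)."""
        fused, _ = self.attn(codes, codes, codes)    # softmax(Q K^T / sqrt(d)) V, cf. Eqs. (3)-(4)
        fused = fused + self.mlp(fused)              # per-token MLP with a residual connection
        return fused.mean(dim=1)                     # average pooling over the frame axis

# Usage: 1024 ray samples, each with a query code and three tracked key-frame codes.
fusion = TemporalFusion()
out = fusion(torch.randn(1024, 4, 48))               # (1024, 48)
```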
3.5 Density and Color Regression
Figure 2 shows the prediction of density and color, which are represented by a neural network. For each frame, the network takes the pose code, appearance code, spatial location and viewing direction as input and outputs the density and color for each point in the 3D space. Similar to [29, 38], we apply positional encoding to both the viewing direction and the spatial location by mapping the inputs to a higher dimensional space. For frame $t$, the volume density $\sigma$ and color $\mathbf{c}$ at point $\mathbf{x}$ are predicted as a function of the latent codes:
$(\mathbf{c}, \sigma) = F_{\Phi}\big(\psi_p(\mathbf{x}, t), \psi_a(\mathbf{x}, t), \gamma_x(\mathbf{x}), \gamma_d(\mathbf{d})\big)$  (5)
where $F_{\Phi}$ represents a neural network, $\psi_p(\mathbf{x}, t)$ and $\psi_a(\mathbf{x}, t)$ denote the pose and appearance codes, and $\gamma_d$ and $\gamma_x$ are the positional encoding functions for the viewing direction and the spatial location, respectively.
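A minimal sketch of such a regression head is shown below, assuming NeRF-style frequency encodings and illustrative layer sizes; the concatenation order and dimensions are assumptions.

```python
# Sketch: map (pose code, appearance code, encoded location, encoded direction) to (sigma, rgb).
import torch
import torch.nn as nn

def positional_encoding(x, n_freqs):
    """Map each coordinate to [sin(2^k * pi * x), cos(2^k * pi * x)] for k = 0..n_freqs-1."""
    freqs = 2.0 ** torch.arange(n_freqs) * torch.pi
    angles = x[..., None] * freqs                                  # (..., D, n_freqs)
    return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(-2)

class RadianceHead(nn.Module):
    def __init__(self, pose_dim=16, app_dim=256, n_freq_x=10, n_freq_d=4):
        super().__init__()
        in_dim = pose_dim + app_dim + 3 * 2 * n_freq_x
        self.trunk = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                   nn.Linear(256, 256), nn.ReLU())
        self.sigma = nn.Linear(256, 1)
        self.rgb = nn.Sequential(nn.Linear(256 + 3 * 2 * n_freq_d, 128), nn.ReLU(),
                                 nn.Linear(128, 3), nn.Sigmoid())
        self.n_freq_x, self.n_freq_d = n_freq_x, n_freq_d

    def forward(self, x, d, pose_code, app_code):
        h = self.trunk(torch.cat([pose_code, app_code,
                                  positional_encoding(x, self.n_freq_x)], dim=-1))
        sigma = torch.relu(self.sigma(h))                          # density is view-independent
        rgb = self.rgb(torch.cat([h, positional_encoding(d, self.n_freq_d)], dim=-1))
        return sigma, rgb

head = RadianceHead()
sig, col = head(torch.rand(1024, 3), torch.rand(1024, 3),
                torch.rand(1024, 16), torch.rand(1024, 256))
```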
3.6 Objective Functions
The objective function of our approach is defined as $L = L_{rgb} + L_{img}$, where $L_{rgb}$ is the reconstruction loss for the rendered pixels and $L_{img}$ is the image loss for the image decoder network $D$. The image decoder comprises multiple Conv2D layers attached after the ResNet34 and aims to reconstruct the input image. The reconstruction loss forces the encoder to be optimized and to generate better pixel-aligned features. We render the color of each ray using both the coarse and fine sets of samples, and minimize the mean squared error between the rendered pixel color and the ground-truth color for training:
$L_{rgb} = \sum_{\mathbf{r} \in \mathcal{R}} \Big[\big\lVert \tilde{C}_c(\mathbf{r}) - C(\mathbf{r}) \big\rVert_2^2 + \big\lVert \tilde{C}_f(\mathbf{r}) - C(\mathbf{r}) \big\rVert_2^2\Big]$  (6)
where $\mathcal{R}$ is the set of rays, and $\tilde{C}_c(\mathbf{r})$ and $\tilde{C}_f(\mathbf{r})$ denote the predictions of the coarse and fine networks. Finally, $\tilde{I}(p)$ and $I(p)$ are the reconstructed and ground-truth colors for pixel $p$ in the set of pixels $\mathcal{P}$ and are used to compute the image loss:
$L_{img} = \sum_{p \in \mathcal{P}} \big\lVert \tilde{I}(p) - I(p) \big\rVert_2^2$  (7)
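A minimal sketch of the combined objective is given below; the optional weighting factor on the image loss and the tensor shapes are assumptions.

```python
# Sketch: coarse/fine MSE on rendered rays (Eq. 6) plus the decoder image loss (Eq. 7).
import torch

def total_loss(rgb_coarse, rgb_fine, rgb_gt, decoded_img, img_gt, w_img=1.0):
    """rgb_coarse/rgb_fine/rgb_gt: (R, 3) per-ray colors;
    decoded_img/img_gt: (3, H, W) decoder output and input image."""
    l_rgb = ((rgb_coarse - rgb_gt) ** 2).sum(-1).mean() + \
            ((rgb_fine - rgb_gt) ** 2).sum(-1).mean()        # Eq. (6)
    l_img = ((decoded_img - img_gt) ** 2).mean()             # Eq. (7)
    return l_rgb + w_img * l_img

loss = total_loss(torch.rand(512, 3), torch.rand(512, 3), torch.rand(512, 3),
                  torch.rand(3, 256, 256), torch.rand(3, 256, 256))
```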
| Models | Train Views | Sequence1 PSNR | Sequence1 SSIM | Sequence2 PSNR | Sequence2 SSIM | Sequence3 PSNR | Sequence3 SSIM | Sequence4 PSNR | Sequence4 SSIM | Sequence5 PSNR | Sequence5 SSIM |
|---|---|---|---|---|---|---|---|---|---|---|---|
| NeuralBody | ✓ | 22.83 | 0.79 | 14.12 | 0.46 | 19.08 | 0.77 | 23.84 | 0.73 | 24.66 | 0.83 |
| NHP | ✓ | 22.16 | 0.75 | 13.28 | 0.43 | 18.89 | 0.76 | 23.45 | 0.75 | 24.34 | 0.80 |
| Ani-NeRF | ✓ | 22.03 | 0.74 | 13.26 | 0.40 | 18.65 | 0.69 | 23.33 | 0.68 | 24.21 | 0.79 |
| HumanNeRF | ✓ | 23.52 | 0.80 | 14.36 | 0.47 | 18.97 | 0.79 | 24.62 | 0.74 | 24.59 | 0.82 |
| Ours | ✓ | 24.76 | 0.81 | 15.51 | 0.49 | 18.78 | 0.72 | 24.84 | 0.76 | 24.78 | 0.83 |
| NeuralBody | | 22.76 | 0.79 | 13.52 | 0.41 | 19.91 | 0.79 | 23.81 | 0.77 | 23.17 | 0.78 |
| NHP | | 21.96 | 0.77 | 13.18 | 0.41 | 19.67 | 0.77 | 23.42 | 0.73 | 22.76 | 0.74 |
| Ani-NeRF | | 21.86 | 0.71 | 13.03 | 0.37 | 18.32 | 0.64 | 22.67 | 0.66 | 22.68 | 0.75 |
| HumanNeRF | | 23.13 | 0.79 | 13.93 | 0.42 | 19.88 | 0.76 | 23.89 | 0.76 | 23.83 | 0.79 |
| Ours | | 24.67 | 0.80 | 15.01 | 0.45 | 18.70 | 0.71 | 24.63 | 0.75 | 23.91 | 0.81 |
4 Implementation Details
Network Details. For the encoder $E$, we extract a feature pyramid [24] from each image, similar to [52]. A ResNet34 backbone pretrained on ImageNet is utilized for our experiments. The output feature of the decoder has a lower spatial resolution than the input image, and multi-scale features are extracted prior to the fourth pooling layer. We extract pixel-aligned features using bilinear interpolation and then concatenate them to form a latent vector of size 256. To construct the image decoder $D$, we simply connect several Conv2D/Upsampling layers to reconstruct the input image.
For the depth prediction network, we utilize [54], which takes a single frame as input and outputs a depth map. This network is fine-tuned using our training images and ground-truth depths. At inference, the predicted depth map is used to lift the pixels to the 3D space, and points outside the dilated posed mesh are removed. For videos without ground-truth depths for training, we use the depth map predicted by NeuralBody [35].
For the transformer network, we utilize three heads for the self-attention module, which has a similar structure as [6]. Following NeRF [29], we perform hierarchical volume sampling and simultaneously optimize a coarse and a fine network with identical architectures. At the coarse scale, we sample a set of $N_c$ points using stratified sampling. Using the prediction of the coarse network, we then sample another set of $N_f$ points along each camera ray, where samples are more likely to be located at regions relevant for rendering. We use all $N_c + N_f$ locations to render the fine results, where both $N_c$ and $N_f$ are set to 64.
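A minimal sketch of this two-stage sampling, following the standard NeRF recipe with inverse-transform sampling from the coarse weights, is shown below; the ray bounds and shapes are illustrative.

```python
# Sketch: 64 stratified coarse samples, then 64 importance samples from the coarse weights.
import torch

def stratified_samples(near, far, n=64):
    bins = torch.linspace(near, far, n + 1)
    return bins[:-1] + (bins[1:] - bins[:-1]) * torch.rand(n)

def importance_samples(bins, weights, n=64):
    """Sample n depths from the piecewise-constant PDF given by the coarse weights."""
    pdf = weights + 1e-5
    cdf = torch.cumsum(pdf / pdf.sum(), dim=0)
    u = torch.rand(n)
    idx = torch.searchsorted(cdf, u).clamp(max=len(bins) - 2)
    return bins[idx] + torch.rand(n) * (bins[idx + 1] - bins[idx])

t_coarse = stratified_samples(0.5, 4.0, 64)
w = torch.rand(64)                                   # weights from the coarse network (placeholder)
bins = torch.linspace(0.5, 4.0, 65)
t_fine = importance_samples(bins, w, 64)
t_all, _ = torch.sort(torch.cat([t_coarse, t_fine]))  # all 128 samples render the fine result
```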
Training Details. We train all layers using Adam [19], with separate base learning rates for the encoder-decoder network and the remaining layers that decay exponentially during the optimization. Additional network architecture and implementation details are provided in the supplementary material. It takes about 48 hours using 4 GeForce RTX 3090 GPUs to train our method on 200 frames with 30 views each. Note that fast training is not a primary goal of our work, and several recent techniques [3, 30] for accelerating our training regime could be adopted in the future.
5 Experiments
Datasets. To train our method, we rely on the proposed dataset (four sequences of real humans in motion captured with a 3dMD full-body scanner and a single sequence of a synthetic human in motion) and the public ZJU-MoCap dataset [35]. The 3dMD body scanner comprises 18 calibrated RGB cameras that capture a human in motion performing various actions and facial expressions, and outputs a reconstructed 3D geometry and texture per frame. These scans are noisy but capture facial expressions and fine-level details like cloth wrinkles. The synthetic scan is a high-resolution animated 3D human model with simulated synthetic clothes (T-shirt and pants). Unlike the 3dMD scans, the 3D geometry is clean but lacks facial expressions. We render RGB and depth for all sequences from 30 views that cover the whole hemisphere (similarly to the way NeRF data are generated) at 6 fps using Blender Cycles.
Each video has more than 200 frames at 1024×1024 resolution. For the proposed dataset, we select the first half of the frames for training and the rest for inference. For the ZJU-MoCap dataset, we use the same training frames as [35]. Both training and test frames contain large variations in terms of motion and facial expressions. At training and testing, a single image at each frame is used as the input. All the input images at different frames share the same static camera pose. In addition, 29 (proposed dataset) or 14 (ZJU-MoCap) more views with different camera poses are used to train the network. The output is a rendered view given any camera pose (not including the camera pose of the input image).

Baselines for Quantitative Comparisons.
- NeuralBody [35] models dynamic scenes using latent codes anchored to the human pose as an extra input besides the coordinates and viewing direction.
- NHP [20] extends NeuralBody to a generalizable model by aggregating temporal and multi-view pixel-aligned features. We remove the multi-view branch since only a monocular video is available as input.
- Ani-NeRF [34] combines NeRF and 3D human skeletons by learning a blend weight field to recover animatable human models.
- HumanNeRF [46] learns a volumetric representation of the person in a canonical T-pose and a motion field that maps the estimated canonical representation to every frame of the video.
Evaluation Metrics. Following existing approaches [29, 35], we evaluate the performance on the proposed dataset using two metrics, including the peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM).
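For completeness, one common way to compute these metrics is sketched below; the paper does not specify an implementation, so delegating SSIM to scikit-image (version 0.19+ for channel_axis) is an assumption.

```python
# Sketch: PSNR computed directly, SSIM delegated to scikit-image.
import numpy as np
from skimage.metrics import structural_similarity

def psnr(pred, gt):
    """pred, gt: float arrays in [0, 1] of shape (H, W, 3)."""
    mse = np.mean((pred - gt) ** 2)
    return 10.0 * np.log10(1.0 / mse)

def ssim(pred, gt):
    return structural_similarity(pred, gt, channel_axis=-1, data_range=1.0)

a, b = np.random.rand(256, 256, 3), np.random.rand(256, 256, 3)
print(psnr(a, b), ssim(a, b))
```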
5.1 Experimental Results
We conduct a wide range of quantitative and qualitative comparisons to demonstrate the key contributions of our work along with ablation studies against simplified variants where proposed modules are removed. We evaluate our approach against the aforementioned works on the task of view synthesis for new unseen poses. All methods except HumanNeRF (monocular video for training) use the same camera views and human poses during the training and test stages for fair comparison.
Novel View and Novel Pose. We quantitatively evaluate our approach on novel views of all sequences and report our findings in the bottom part of Table 1 and Table 2. The proposed method clearly outperforms other competitive approaches on most video sequences. All other methods have a hard time generating realistic renders when the test poses deviate significantly from what was seen during training. On the contrary, our proposed method generates fine-level details on the human body, which indicates that the point-cloud-based representation and the temporal transformer help reason about the missing structure and recover the non-visible parts at the unseen poses.
Training View and Novel Pose. The quantitative results on training views and novel poses are shown in the top part of Table 1 and Table 2. We conduct this experiment to showcase that our approach generalizes well to unseen poses, as the difference between training and testing views is fairly small. On most video sequences, the proposed method performs better than all other baselines.
ZJU-MoCap Results. In Table 2 we provide quantitative comparisons on the publicly available ZJU-MoCap dataset and show that our method clearly outperforms all prior work on both training and novel views under unseen poses. Our method benefits from the intra-frame point cloud representation and the inter-frame temporal information.
| Models | Train Views | Subject 313 PSNR | Subject 313 SSIM | Subject 393 PSNR | Subject 393 SSIM |
|---|---|---|---|---|---|
| NeuralBody | ✓ | 23.86 | 0.88 | 22.78 | 0.86 |
| Ani-NeRF | ✓ | 23.94 | 0.88 | 23.01 | 0.87 |
| HumanNeRF | ✓ | 22.61 | 0.83 | 21.84 | 0.84 |
| Ours | ✓ | 24.41 | 0.89 | 23.63 | 0.87 |
| NeuralBody | | 23.70 | 0.87 | 22.53 | 0.84 |
| Ani-NeRF | | 23.47 | 0.86 | 22.39 | 0.85 |
| HumanNeRF | | 22.53 | 0.83 | 21.72 | 0.83 |
| Ours | | 23.93 | 0.88 | 22.90 | 0.86 |
Qualitative Results. We qualitatively compare our approach with other top-performing methods under novel poses in Fig. 4. With the human pose as geometric guidance, NeuralBody predicts the body shape well but fails to render fine-level details on the human body; it does not generalize well to novel poses because the shared latent codes across all frames are not optimized during the test stage. Ani-NeRF and HumanNeRF use the skeleton as input, which ends up overfitting to the training poses and generates blurry results for unseen poses. As shown in Fig. 4, our approach captures fine-level details on the body and head better than prior works [34, 35, 46].
Limitations. The temporal transformer recovers more non-visible pixels on the body. Without encoding facial expressions, our method can handle humans without substantial expression variations. However, when the query frame has a facial expression different from the key frames, our method predicts blurred facial expressions, as combining the key frames with the query frame makes the network unable to differentiate the specific facial characteristics of the query frame. Future work will encode the facial expression of each frame as a separate code and thus be able to render such diverse expressions under new views.

5.2 Ablation Studies
| Model Variant | PSNR | SSIM |
|---|---|---|
| W/o appearance-based repr. | 23.65 | 0.79 |
| W/o temporal transformer | 23.70 | 0.77 |
| W/o both | 22.78 | 0.75 |
| Ours | 24.76 | 0.81 |

| # frames | PSNR | SSIM |
|---|---|---|
| 1 | 24.17 | 0.80 |
| 3 (Ours) | 24.76 | 0.81 |
| 5 | 24.84 | 0.81 |
Effect of the Appearance-based Representation. Using the appearance code brings performance improvements (Table 3 and Fig. 3) on the fine structures (cloth wrinkles, facial expressions) in different parts of the body, which demonstrates that the appearance code anchored to the point clouds helps recover the missing pixels in the query view.
Effect of the Temporal Transformer. As shown in Table 3 and Fig. 3, the temporal fusion module helps the model achieve better rendering quality. We observe that details like the wrinkles on the pants are finer, the hands are cleaner and the face is significantly crisper.
Effect of the Number of Key Frames. To evaluate the impact of the number of key frames, we report the performance in Table 3. We observe that the performance increases with more key frames and saturates with 5 frames.
Effect of the Depth Estimation. Our proposed approach relies on depth estimation to lift the RGB input to 3D and obtain the point cloud that is then fed to our 3D backbone architecture. To identify the impact of the depth estimation module, we conducted an ablation study where ground-truth depth is utilized for Sequence 4 of the proposed dataset. We observed that PSNR and SSIM are 24.96 and 0.77 (compared to 24.84 and 0.76) when using ground-truth depth, which is only slightly higher than using our depth estimation model. This indicates that even with inaccurate depth information, our method can generalize well to the unseen poses. For videos without ground-truth depth (i.e., ZJU-MoCap) we use the depth predicted by NeuralBody and, despite relying on their estimation, we clearly outperform them (shown in Tab. 2), which showcases the importance of using depth information for the rendering of articulated avatars.
6 Conclusion
In this paper, we built upon recent advances in neural radiance fields pertaining to digital humans and addressed key challenges that prevent existing human-body-based methods from generalizing well to unseen poses. Towards that direction, we proposed to integrate a pose code and an appearance code to synthesize humans in novel views and different poses with high fidelity. The pose code anchored to the human pose models the human shape, whereas the appearance code anchored to the point clouds infers the fine-level details and recovers the missing parts. The point clouds are generated by lifting the 2D information to the 3D space using an estimated depth map. To leverage temporal information, we proposed to use the body motion to track points from the query frame to a few automatically-selected key frames and adopted a temporal transformer to aggregate information across multiple frames. The transformer-based fusion module recovers the non-visible parts in the query frame. Our approach achieves significantly better results against several prior methods under novel views and unseen poses, with quality that has not been observed in prior work. We provided a plethora of experimental comparisons, qualitative results and ablation studies to back up our claims and showcased that fine-level information such as fingers, logos, cloth wrinkles and face details is faithfully rendered with high fidelity.
References
- [1] Panos Achlioptas, Olga Diamanti, Ioannis Mitliagkas, and Leonidas Guibas. Learning representations and generative models for 3d point clouds. In ICML, 2018.
- [2] Heli Ben-Hamu, Haggai Maron, Itay Kezurer, Gal Avineri, and Yaron Lipman. Multi-chart generative surface modeling. TOG, 2018.
- [3] Anpei Chen, Zexiang Xu, Andreas Geiger, Jingyi Yu, and Hao Su. Tensorf: Tensorial radiance fields. In ECCV, 2022.
- [4] Jianchuan Chen, Ying Zhang, Di Kang, Xuefei Zhe, Linchao Bao, Xu Jia, and Huchuan Lu. Animatable neural radiance fields from monocular rgb videos. arXiv preprint arXiv:2106.13629, 2021.
- [5] Christopher B Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese. 3d-r2n2: A unified approach for single and multi-view 3d object reconstruction. In ECCV, 2016.
- [6] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- [7] Mingsong Dou, Sameh Khamis, Yury Degtyarev, Philip Davidson, Sean Ryan Fanello, Adarsh Kowdle, Sergio Orts Escolano, Christoph Rhemann, David Kim, Jonathan Taylor, et al. Fusion4d: Real-time performance capture of challenging scenes. ACM TOG, 2016.
- [8] Yilun Du, Yinan Zhang, Hong-Xing Yu, Joshua B Tenenbaum, and Jiajun Wu. Neural radiance flow for 4d view synthesis and video processing. In ICCV, 2021.
- [9] Guy Gafni, Justus Thies, Michael Zollhofer, and Matthias Nießner. Dynamic neural radiance fields for monocular 4d facial avatar reconstruction. In CVPR, 2021.
- [10] Chen Gao, Ayush Saraf, Johannes Kopf, and Jia-Bin Huang. Dynamic view synthesis from dynamic monocular video. arXiv preprint arXiv:2105.06468, 2021.
- [11] Xiangjun Gao, Jiaolong Yang, Jongyoo Kim, Sida Peng, Zicheng Liu, and Xin Tong. Mps-nerf: Generalizable 3d human rendering from multiview images. TPAMI, 2022.
- [12] Chen Geng, Sida Peng, Zhen Xu, Hujun Bao, and Xiaowei Zhou. Learning neural volumetric representations of dynamic humans in minutes. In CVPR, 2023.
- [13] Thibault Groueix, Matthew Fisher, Vladimir G Kim, Bryan C Russell, and Mathieu Aubry. A papier-mâché approach to learning 3d surface generation. In CVPR, 2018.
- [14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
- [15] Peter Hedman, Pratul P. Srinivasan, Ben Mildenhall, Jonathan T. Barron, and Paul Debevec. Baking neural radiance fields for real-time view synthesis. In ICCV, 2021.
- [16] Tianjian Jiang, Xu Chen, Jie Song, and Otmar Hilliges. Instantavatar: Learning avatars from monocular video in 60 seconds. arXiv preprint arXiv:2212.10550, 2022.
- [17] Wei Jiang, Kwang Moo Yi, Golnoosh Samei, Oncel Tuzel, and Anurag Ranjan. Neuman: Neural human radiance field from a single video. In ECCV, 2022.
- [18] James T Kajiya and Brian P Von Herzen. Ray tracing volume densities. SIGGRAPH, 1984.
- [19] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
- [20] Youngjoong Kwon, Dahun Kim, Duygu Ceylan, and Henry Fuchs. Neural human performer: Learning generalizable radiance fields for human performance rendering. arXiv preprint arXiv:2109.07448, 2021.
- [21] Ruilong Li, Julian Tanke, Minh Vo, Michael Zollhöfer, Jürgen Gall, Angjoo Kanazawa, and Christoph Lassner. Tava: Template-free animatable volumetric actors. In ECCV, 2022.
- [22] Zhengqi Li, Simon Niklaus, Noah Snavely, and Oliver Wang. Neural scene flow fields for space-time view synthesis of dynamic scenes. In CVPR, 2021.
- [23] Shanchuan Lin, Linjie Yang, Imran Saleemi, and Soumyadip Sengupta. Robust high-resolution video matting with temporal guidance. In WACV, 2022.
- [24] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, pages 2117–2125, 2017.
- [25] Yariv Lior, Kasten Yoni, Moran Dror, Galun Meirav, Atzmon Matan, Basri Ronen, and Lipman Yaron. Multiview neural surface reconstruction by disentangling geometry and appearance. In NeurIPS, 2020.
- [26] Lingjie Liu, Jiatao Gu, Kyaw Zaw Lin, Tat-Seng Chua, and Christian Theobalt. Neural sparse voxel fields. In NeurIPS, 2020.
- [27] Lingjie Liu, Marc Habermann, Viktor Rudnev, Kripasindhu Sarkar, Jiatao Gu, and Christian Theobalt. Neural actor: Neural free-view synthesis of human actors with pose control. TOG, 2021.
- [28] Marko Mihajlovic, Aayush Bansal, Michael Zollhoefer, Siyu Tang, and Shunsuke Saito. Keypointnerf: Generalizing image-based volumetric avatars using relative spatial encoding of keypoints. In ECCV, 2022.
- [29] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020.
- [30] Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. TOG, 2022.
- [31] Phong Nguyen-Ha, Nikolaos Sarafianos, Christoph Lassner, Janne Heikkilä, and Tony Tung. Free-viewpoint rgb-d human performance capture and rendering. In ECCV, 2022.
- [32] Keunhong Park, Utkarsh Sinha, Jonathan T Barron, Sofien Bouaziz, Dan B Goldman, Steven M Seitz, and Ricardo Martin-Brualla. Deformable neural radiance fields. arXiv preprint arXiv:2011.12948, 2020.
- [33] Bo Peng, Jun Hu, Jingtao Zhou, and Juyong Zhang. Selfnerf: Fast training nerf for human from monocular self-rotating video. arXiv preprint arXiv:2210.01651, 2022.
- [34] Sida Peng, Junting Dong, Qianqian Wang, Shangzhan Zhang, Qing Shuai, Xiaowei Zhou, and Hujun Bao. Animatable neural radiance fields for modeling dynamic human bodies. In ICCV, 2021.
- [35] Sida Peng, Yuanqing Zhang, Yinghao Xu, Qianqian Wang, Qing Shuai, Hujun Bao, and Xiaowei Zhou. Neural body: Implicit neural representations with structured latent codes for novel view synthesis of dynamic humans. In CVPR, 2021.
- [36] Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. D-nerf: Neural radiance fields for dynamic scenes. In CVPR, 2021.
- [37] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In CVPR, 2017.
- [38] Nasim Rahaman, Aristide Baratin, Devansh Arpit, Felix Draxler, Min Lin, Fred Hamprecht, Yoshua Bengio, and Aaron Courville. On the spectral bias of neural networks. In ICML, 2019.
- [39] Amit Raj, Michael Zollhoefer, Tomas Simon, Jason Saragih, Shunsuke Saito, James Hays, and Stephen Lombardi. Pva: Pixel-aligned volumetric avatars. arXiv preprint arXiv:2101.02697, 2021.
- [40] Vincent Sitzmann, Justus Thies, Felix Heide, Matthias Nießner, Gordon Wetzstein, and Michael Zollhofer. Deepvoxels: Learning persistent 3d feature embeddings. In CVPR, 2019.
- [41] Vincent Sitzmann, Michael Zollhöfer, and Gordon Wetzstein. Scene representation networks: Continuous 3d-structure-aware neural scene representations. In NeurIPS, 2019.
- [42] Shih-Yang Su, Frank Yu, Michael Zollhöfer, and Helge Rhodin. A-nerf: Articulated neural radiance fields for learning human shape, appearance, and pose. In NeurIPS, 2021.
- [43] Gusi Te, Xiu Li, Xiao Li, Jinglu Wang, Wei Hu, and Yan Lu. Neural capture of animatable 3d human from monocular video. In ECCV, 2022.
- [44] Edgar Tretschk, Ayush Tewari, Vladislav Golyanik, Michael Zollhöfer, Christoph Lassner, and Christian Theobalt. Non-rigid neural radiance fields: Reconstruction and novel view synthesis of a dynamic scene from monocular video. In ICCV, 2021.
- [45] Shaofei Wang, Katja Schwarz, Andreas Geiger, and Siyu Tang. Arah: Animatable volume rendering of articulated human sdfs. In ECCV, 2022.
- [46] Chung-Yi Weng, Brian Curless, Pratul P Srinivasan, Jonathan T Barron, and Ira Kemelmacher-Shlizerman. HumanNeRF: Free-viewpoint rendering of moving people from monocular video. In CVPR, 2022.
- [47] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3d shapenets: A deep representation for volumetric shapes. In CVPR, 2015.
- [48] Hongyi Xu, Thiemo Alldieck, and Cristian Sminchisescu. H-nerf: Neural radiance fields for rendering and temporal reconstruction of humans in motion. In NeurIPS, 2021.
- [49] Tianhan Xu, Yasuhiro Fujita, and Eiichi Matsumoto. Surface-aligned neural radiance fields for controllable 3d human synthesis. In CVPR, 2022.
- [50] Yaoqing Yang, Chen Feng, Yiru Shen, and Dong Tian. Foldingnet: Interpretable unsupervised learning on 3d point clouds. arXiv preprint arXiv:1712.07262, 2017.
- [51] Alex Yu, Ruilong Li, Matthew Tancik, Hao Li, Ren Ng, and Angjoo Kanazawa. PlenOctrees for real-time rendering of neural radiance fields. In ICCV, 2021.
- [52] Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. pixelnerf: Neural radiance fields from one or few images. In CVPR, 2021.
- [53] Tao Yu, Zerong Zheng, Kaiwen Guo, Jianhui Zhao, Qionghai Dai, Hao Li, Gerard Pons-Moll, and Yebin Liu. Doublefusion: Real-time capture of human performances with inner body shapes from a single depth sensor. In CVPR, 2018.
- [54] Weihao Yuan, Xiaodong Gu, Zuozhuo Dai, Siyu Zhu, and Ping Tan. New crfs: Neural window fully-connected crfs for monocular depth estimation. arXiv preprint arXiv:2203.01502, 2022.
- [55] Fuqiang Zhao, Wei Yang, Jiakai Zhang, Pei Lin, Yingliang Zhang, Jingyi Yu, and Lan Xu. Humannerf: Efficiently generated human radiance field from sparse inputs. In CVPR, 2022.
- [56] Zerong Zheng, Han Huang, Tao Yu, Hongwen Zhang, Yandong Guo, and Yebin Liu. Structured local radiance fields for human avatar modeling. In CVPR, 2022.