
Casual 6-DoF: free-viewpoint panorama
using a handheld 360° camera

Rongsen Chen, Fang-Lue Zhang, Simon Finnie, Andrew Chalmers, Taehyun Rhee

R. Chen, S. Finnie, A. Chalmers and T. Rhee are with the Computational Media Innovation Centre, Victoria University of Wellington, New Zealand. E-mail: {rongsen.chen, simon.finnie, andrew.chalmers, taehyun.rhee}@vuw.ac.nz. F.-L. Zhang is with the School of Engineering and Computer Science, Victoria University of Wellington, New Zealand. E-mail: [email protected].
T. Rhee and F.-L. Zhang are the corresponding authors.
Abstract

Six degrees-of-freedom (6-DoF) video provides telepresence by enabling users to move around in the captured scene with a wide field of regard. Compared to methods requiring sophisticated camera setups, image-based rendering methods based on photogrammetry can work with images captured from arbitrary poses, which is more suitable for casual users. However, existing image-based rendering methods are based on perspective images. When used to reconstruct 6-DoF views, they often require capturing hundreds of images, making data capture a tedious and time-consuming process. In contrast to traditional perspective images, 360° images capture the entire surrounding view in a single shot, thus providing a faster capturing process for 6-DoF view reconstruction. This paper presents a novel method to provide 6-DoF experiences over a wide area using an unstructured collection of 360° panoramas captured by a conventional 360° camera. Our method consists of 360° data capturing, novel depth estimation to produce a high-quality spherical depth panorama, and high-fidelity free-viewpoint generation. We compared our method against state-of-the-art methods, using data captured in various environments. Our method shows better visual quality and robustness in the tested scenes.

Index Terms:
6 Degrees-of-freedom, 6-DoF, Reference View Synthesis, Free-Viewpoint Images, Panoramic Depth Estimation

1 Introduction

Recent advancements in Virtual and Mixed Reality (VR/MR) have led to a surge in the popularity of 360° panoramic media. They are well suited for VR/MR due to their wide field-of-view (FoV), which provides complete rotational freedom. In its current form, however, 360° media lacks freedom for translational motion, which can break the user’s immersion [39, 36].

Recent research on 6 degrees-of-freedom (6-DoF) media has been able to generate motion parallax in accordance with user motion. However, current approaches, such as Facebook’s Manifold RED camera [31], Google’s Welcome to Light Field [27], and layered meshes [6], require sophisticated camera setups that additionally rely on professional capturing devices. The motion parallax generated by these methods is constrained to a small area, limiting their ability to provide free-viewpoint navigation over moderate to large distances.

Figure 1: Our method provides 6-DoF experiences using 360° panoramic images captured by a handheld 360° camera. (a) The input 360° panoramas captured by a conventional 360° camera, (b) the corresponding depth panoramas estimated by our method, (c) synthesized novel views with comparisons, and (d) their corresponding zoomed-in images.

Image-Based Rendering (IBR) methods [37, 38] based on photogrammetry [18, 35], on the other hand, only require an unstructured set of overlapping images, which is more accessible for conventional (casual) users. These approaches can operate over larger areas due to their flexible camera setup and recovered geometric information. Inside-Out [15] synthesizes high-quality free-viewpoint navigation using unstructured sets of RGBD images to enable 6-DoF. The method was further refined in DeepBlending [14], where the captured depth maps were replaced with estimated depth. These methods are still, however, limited by their use of perspective cameras, requiring large quantities of images to accurately capture full environments. This process is time-consuming, complex, and unpleasant for the casual user.

The spherical 360° camera, in contrast to the perspective camera, captures a complete view of the environment in each image (a perspective camera would normally take 5-8 shots to cover the same area). Using a 360° camera would therefore greatly simplify the capturing process for IBR-based 6-DoF methods. The use of 360° cameras in view synthesis has been explored recently in OmniPhotos [4]. However, their approach is limited to generating views within a structured capture circle, restricting the user’s range of motion while eliminating all vertical motion. In this paper, we overcome these limitations with the use of casually captured panoramas from a single 360° camera.

Our method provides real-time 6-DoF viewing experiences using 360° panoramic images captured by a handheld 360° camera (Fig. 1). Given an unstructured collection of 360° monocular panoramic images, our novel panoramic view synthesis method can synthesize panoramic images from novel viewpoints (points in 3D space where no image was previously captured) at 30 fps. Our method starts with an offline process to recover the orientation and position of each input panorama. We then recover the sparse and dense depth panoramas of the scene. Unlike previous methods [14, 32] that use this information to generate dense 3D geometric models for rendering new viewpoints, we directly synthesize 360° RGB images using the recovered depth from the input panoramas. We present a novel depth interpolation and refinement scheme that ensures high visual quality and fast view synthesis.

We tested our method in various indoor and outdoor environments, on medium- to large-scale scenes, with data casually captured using a consumer-grade 360° camera. We evaluated our method against current state-of-the-art approaches [15, 14, 25] in each of our environments over short and long ranges of motion.

Our contributions are summarized as follows:

  • We present a novel platform with a complete pipeline to enable 6-DoF viewing experiences using a set of panoramas captured by a handheld consumer-grade 360° camera.

  • We present a robust approach to reconstruct the depth panoramas from a set of RGB 360° panoramic images.

  • We present a novel method to synthesize panoramic images at novel viewpoints in real time (30 fps), allowing users to walk around within a large-scale captured scene.

2 Related Work

2.1 6 Degrees-of-Freedom

6-DoF methods have been attracting much attention in recent years due to the need for motion parallax in VR applications. Thatte et al. [40] introduced stacked omnistereo, which uses two camera rigs stacked on top of each other in order to capture 6-DoF content. Welcome to Light Field [27] used a spinning camera setup for capturing high-density images and providing reliable depth estimation. Facebook Manifold RED [31] is a camera system that has been specifically designed to capture 6-DoF video. More recently, Broxton et al. [6] proposed a system that uses deep-learning to convert videos captured by a spherical camera rig into a layered mesh, which can be viewed in 6-DoF. These state-of-the-art capture methods were restricted to professional cameras and sophisticated setups. The photogrammetry-based methods such as depth image-based rendering (DIBR) [50, 26, 22] only require a set of overlapping photos which are not necessarily from the same camera, thus being more suitable for casually captured videos.

Inside-Out [15] is a photogrammetry-based method that enables wide-area, free-viewpoint rendering via tile selection. DeepBlending [14] uses a similar architecture to Inside-Out; one significant improvement is that it uses deep learning-based image blending to achieve higher visual fidelity. Recently, Xu et al. [44] further extended this work with the ability to reconstruct reflections on reflective surfaces. However, these methods require thousands of input images accompanied by depth from a Kinect sensor, and thus are not ideal for casual users such as tourists. Recent works such as NeRF [25, 47], free view synthesis [32], and stable view synthesis [33] used deep learning to render high-fidelity novel views. However, these methods are slow and require significant running time to synthesize each frame.

The methods discussed above were built for perspective cameras, meaning direct application to 360° panoramas would perform poorly. In this paper, we present a method for creating high fidelity 6-DoF scenes with 360° cameras.

2.2 Panoramic 6 Degrees-of-Freedom

Huang et al. [17] presented an approach that used geometry estimated via Structure-from-Motion (SfM) to guide the vertex warping of a 360° video. This technique simulates the feel of a 6-DoF experience. However, the experience is somewhat lacking due to its inability to produce motion parallax. Cho et al. [9] extended this method, allowing it to use multiple 360° panoramas as input; however, it still presents the same limitation. The lack of motion parallax, in either case, can lead to VR discomfort.

One of the early 6-DoF panorama applications which provided motion parallax was proposed by Serrano et al. [36]. In their system, they used deep learning to predict the depth map of a given 360° panorama and render a mesh representation of the environment (inherently allowing for motion parallax). They designed a three-layered representation to handle occlusion. However, their method only took input from a single viewpoint, meaning the output quality and range of motion were limited. MatryODShka [1] adapted Multi-Plane Images (MPI) [24, 49] to 360° panoramas, creating a layered representation of Omnidirectional Stereo (ODS) images with deep learning; however, this approach is also limited in its synthesis quality and range of available motion.

In this paper, we present a novel method of performing wide-area free-viewpoint navigation using casually captured 360° panoramas.

Figure 2: The overview of our method. (a) The input 360° panoramas. (b) Camera parameter estimation. (c) Depth estimation for the panoramas. (d) View synthesis that enables high-fidelity 6-DoF viewing.

2.3 Depth estimation

Recovering depth information is important for enabling accurate 6-DoF images. Previous depth estimation methods for 360° images have attempted to recover the depth map via matched features [34]; this type of method suffers from feature detection errors, often producing an incomplete depth map. Most recent works on 360° images perform depth estimation with a single [42] or a pair of [51, 43] 360° images. These works estimate depth without validating depth consistency across images. The inconsistency across depth images causes severe ghosting artifacts and is not suitable for view synthesis of large scenes, where dozens to hundreds of images are often used for reconstruction. To reconstruct consistent depth across multiple images, the depth needs to be estimated in the spirit of Multi-View Stereo (MVS).

Similar to other areas of computer vision, research has attempted to approach MVS using CNNs [45, 46]. However, these methods often perform poorly on images captured in the wild. Since their networks are trained by comparing the estimated depth with ground-truth depth, they adapt poorly to newly captured scenes when ground truth is not available. Neural radiance fields (NeRF) [25] use a Multi-Layer Perceptron (MLP) to learn a representation of the scene. Although this method requires training for every new scene, the advantage is that it can train the network by directly comparing the estimated result with the captured RGB images, and is thus more suitable for general scenes than methods trained with ground-truth depth. However, we observed that NeRF has difficulty converging when trained with 360° panoramas. We demonstrate this limitation in Sect. 8. Since the learning-based methods were unable to provide acceptable depths in our survey and tests, we developed our own depth estimation as described in Sect. 5.

3 Overview

An overview of our method is shown in Figure 2. It has an offline pre-processing phase and an online view synthesis phase. During the pre-processing phase, the captured 360° video sequence is first processed by registering camera parameters. Then the sparse and dense panoramic depths of each input panorama are estimated. In contrast to previous methods [15, 14], we estimate the dense depth map via a separate epipolar-based depth estimation instead of performing surface reconstruction on the sparse point cloud built from the sparse depth maps. The estimated sparse and dense depths are then refined to produce high-quality depth panoramas in several steps: sparse and dense depth fusion, depth correction via raymarching, and forward depth projection. The refinement process reduces the noise in the estimated depths, as well as ensuring depth consistency across input panoramas.

The online phase is for novel 360° view synthesis, which can be done in real time to support interactive VR applications. Given a target position at which to synthesize a novel view panorama, we first project all input panoramas to the target position based on their estimated depths, generating an initial RGBD 360° panorama. We generate dense depth panoramas by interpolating the depth values of the corresponding pixels from the input panoramas and further enhance them with the same depth correction algorithm as used in the pre-processing phase. Using the enhanced depth panorama, we are able to synthesize the RGB values of the novel view panorama by retrieving RGB pixels from the neighboring input panoramas and using weighted blending to combine them.

Figure 3: The overview of our depth estimation pipeline. The process starts by generating depth panoramas from RGB inputs using patch-based multi-view stereo and epipolar geometry. We then apply our depth refinement process to the fused depth panorama. Depth fusion consists of combining the depth panoramas in a back-to-front fashion, followed by morphological operations.

4 Capture and Registration

Our method takes a set of 360° panoramas as input, either discretely captured at different locations or selected from a series of frames in a video. Blurred images harm the performance of camera registration methods (producing inaccurate estimated camera poses), which leads to ghosting artifacts in the view synthesis and degrades the quality of the estimated depth. When sampling frames from a video, we therefore use the variance of the Laplacian operator [2] to reject any blurred frames and substitute their closest sharp frames instead. Similar to other methods based on photogrammetry [15, 14, 32, 33], our method makes no assumptions regarding the camera transforms (translation and rotation).
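A minimal sketch of such a blur check using OpenCV is shown below; the sharpness threshold and the outward search for a substitute frame are illustrative choices rather than the exact parameters of our implementation.

import cv2

def is_blurred(image_bgr, threshold=100.0):
    # Variance of the Laplacian response: low variance indicates few sharp
    # edges, i.e., a likely blurred frame (the threshold is scene dependent).
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var() < threshold

def select_sharp_frames(frames, step=10, threshold=100.0):
    # Sample every `step`-th frame; if it is blurred, substitute the closest
    # sharp frame by searching outwards from the sampled index.
    selected = []
    for idx in range(0, len(frames), step):
        candidates = [idx + offset * sign for offset in range(step) for sign in (1, -1)]
        for candidate in candidates:
            if 0 <= candidate < len(frames) and not is_blurred(frames[candidate], threshold):
                selected.append(frames[candidate])
                break
    return selected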

We recover the camera transforms corresponding to the selected frames using COLMAP (an SfM software package) [35]. We chose COLMAP over other software such as Meshroom [12] because we found that COLMAP produces more robust registration results, and because it ensures a fair comparison with the alternative methods [14, 25] (which also use COLMAP).

However, COLMAP is designed for perspective images. In order to use it on 360° panoramas, we first project them into a set of perspective sub-images with overlapping FoV. Every panorama is projected into 8 perspective images, each with a 110° horizontal FoV, ignoring the top and bottom poles of the 360° panorama. We sample perspective images in this way because we find it provides better input for COLMAP than cubemap projection in terms of feature quality and quantity. Since the 8 perspective images are sampled from the same 360° panorama, they share the same center point, which we use to recover the position of the 360° camera.
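A minimal sketch of this equirectangular-to-perspective resampling is given below; the pinhole camera conventions, output resolution, and helper names are illustrative assumptions, while the 110° horizontal FoV and the eight yaw directions follow the description above.

import numpy as np
import cv2

def equirect_to_perspective(pano, yaw_deg, h_fov_deg=110.0, out_size=(1280, 720)):
    # Sample a pinhole view from an equirectangular panorama at the given yaw.
    w_out, h_out = out_size
    f = (w_out / 2.0) / np.tan(np.radians(h_fov_deg) / 2.0)   # focal length in pixels
    x = np.arange(w_out) - w_out / 2.0
    y = np.arange(h_out) - h_out / 2.0
    xv, yv = np.meshgrid(x, y)
    # Ray directions in the camera frame (z forward, x right, y down).
    dirs = np.stack([xv, yv, np.full_like(xv, f)], axis=-1)
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    # Rotate the rays about the vertical axis by the requested yaw.
    yaw = np.radians(yaw_deg)
    rot = np.array([[np.cos(yaw), 0, np.sin(yaw)],
                    [0, 1, 0],
                    [-np.sin(yaw), 0, np.cos(yaw)]])
    dirs = dirs @ rot.T
    # Convert rays to spherical angles, then to equirectangular pixel coordinates.
    theta = np.arctan2(dirs[..., 0], dirs[..., 2])      # longitude in [-pi, pi]
    phi = np.arcsin(np.clip(dirs[..., 1], -1.0, 1.0))   # latitude in [-pi/2, pi/2]
    h_pano, w_pano = pano.shape[:2]
    map_x = ((theta / (2 * np.pi) + 0.5) * w_pano).astype(np.float32)
    map_y = ((phi / np.pi + 0.5) * h_pano).astype(np.float32)
    # Note: views crossing the ±180° seam would need wrap-around handling.
    return cv2.remap(pano, map_x, map_y, cv2.INTER_LINEAR)

# Eight overlapping views spaced 45° apart around the horizon:
# views = [equirect_to_perspective(pano, yaw) for yaw in range(0, 360, 45)]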

5 Depth Estimation for 360° Panorama

Recent methods for view synthesis [14, 33] follow a process of first estimating sparse depths from RGB inputs, and then densifying them via surface reconstruction [19, 8] to recover dense depth information for a given scene. However, since casually captured image sets from non-professional users often lead to imperfect reconstruction, the reconstructed sparse depth can contain large regions of missing depths, as in the complex outdoor scene shown in Fig. 3 (sparse depth panorama). This causes surface reconstruction errors, and thus poorer results for view synthesis. To address this issue, we propose a two-stage approach that estimates the sparse depth and dense depth separately, using a patch-match depth estimation method and an epipolar depth estimation method respectively. We further propose a depth refinement process to improve the quality of the estimated depth map. This refined depth is later used for synthesizing novel views.

5.1 Two Stage Panorama Depth Estimation

Sparse depth estimation: Our first step is to estimate a sparse depth panorama for each of the input RGB panoramas using a patch-match-based depth estimation method. We achieve this via COLMAP, as in other similar methods [14, 33]. This approach produces good results for both small and thin objects, but it has limitations in texture-less regions, resulting in a depth panorama that contains a lot of missing information (referred to as a sparse depth panorama). Directly performing view synthesis with such depth would result in scenes with missing geometry. Thus, denser depth information is required for a complete view synthesis result.

Dense depth estimation: Previous methods [14] obtain a dense representation of the geometry via surface reconstruction using the sparse depth. However, the sparse depth estimated from a casually captured image set may contain large gaps that are difficult for surface reconstruction to handle. Therefore, in contrast to previous methods, we propose to separately estimate the dense depth panorama using epipolar geometry [48]. Although this type of method often fails to detect thin geometry, it is able to recover more complete depth information in texture-less regions than patch-based methods.

We implement the dense depth estimation as follows. Given a 360° panorama, we first generate sweep volumes directly using the N closest 360° panoramas (we use N=4 in our experiments); their full FoV helps to effectively avoid the out-of-FoV issue found in narrow-FoV videos. We then compare the sweep volume with the given 360° panorama to generate a cost volume. We compute the cost volume using the classical AD-Census [23] method, because it demonstrates good results in texture-less regions. Similar to previous methods [16, 30], we adopt guided image filters to filter the matching cost. Guided image filters smooth the cost volume with respect to the edges of the guiding image, which helps sharpen the edges of the resulting depth images. We then follow the classical winner-takes-all depth selection to produce the dense depth map.
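The sketch below illustrates the plane-sweep and winner-takes-all structure of this step under simplifying assumptions: it substitutes an absolute-difference cost for AD-Census and a box filter for the guided image filter, and it assumes a warp_to_reference helper that reprojects a neighboring panorama to the reference view at a hypothesized depth.

import numpy as np
import cv2

def dense_depth_wta(reference, neighbors, depth_hypotheses, warp_to_reference,
                    filter_radius=9):
    # Plane-sweep stereo with winner-takes-all selection. For every depth
    # hypothesis, each neighbor is warped to the reference view and compared
    # against it; the per-pixel costs are aggregated, smoothed, and the
    # hypothesis with the lowest filtered cost is selected.
    h, w = reference.shape[:2]
    cost_volume = np.zeros((len(depth_hypotheses), h, w), dtype=np.float32)
    for d_idx, depth in enumerate(depth_hypotheses):
        cost = np.zeros((h, w), dtype=np.float32)
        for neighbor_img, neighbor_pose in neighbors:
            warped = warp_to_reference(neighbor_img, neighbor_pose, depth)
            # Mean absolute color difference as a simplified matching cost
            # (the full method uses AD-Census); images are HxWx3.
            cost += np.abs(reference.astype(np.float32) - warped.astype(np.float32)).mean(axis=-1)
        # Cost aggregation; a guided filter with the reference image as guide
        # would preserve depth edges better than this box filter.
        cost_volume[d_idx] = cv2.boxFilter(cost, -1, (filter_radius, filter_radius))
    # Winner-takes-all: pick the depth hypothesis with minimal filtered cost.
    best = np.argmin(cost_volume, axis=0)
    return np.asarray(depth_hypotheses, dtype=np.float32)[best]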

5.2 Depth Refinement

Our dense depth panorama is estimated on a per-view basis, meaning that the estimated depth information is based on the parallax present in neighboring panoramas. Each panorama has different neighbors, so the recovered depth panoramas may not be consistent, as shown in the top row of Fig. 4. This causes visual artifacts during view synthesis. We solve the depth inconsistency with an iterative depth refinement process consisting of three steps: depth fusion, depth correction via raymarching, and forward depth projection.

Depth Fusion: Our depth fusion is a straightforward process that combines the dense depth map and the sparse depth map in a back-to-front fashion. During depth fusion, we use the dense depth map as a hole filler to fill the regions where sparse depth estimation fails to reconstruct. We then apply a morphological operation to the combined depth map to reduce the floating noise that comes with the sparse depth.
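A minimal sketch of this fusion step is shown below, assuming missing pixels in the sparse depth panorama are marked with zero; the use of a morphological opening and its kernel size are illustrative choices for the speckle-removal operation.

import numpy as np
import cv2

def fuse_depth(sparse_depth, dense_depth, kernel_size=5):
    # Back-to-front fusion: keep the (more accurate) sparse depth where it
    # exists and fill its holes with the dense epipolar-based depth.
    fused = np.where(sparse_depth > 0, sparse_depth, dense_depth).astype(np.float32)
    # A grayscale morphological opening suppresses small floating speckles
    # that come with the sparse depth estimate.
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    return cv2.morphologyEx(fused, cv2.MORPH_OPEN, kernel)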

During the first iteration, we perform depth fusion using the estimated sparse and dense depth panoramas. Similar to Hedman et al. [14], our depth fusion allows us to combine the advantages of each depth type while avoiding their limitations. Nonetheless, the fused depth can still contain issues such as depth inconsistency inherited from the depth map estimated using the epipolar approach, and thus needs further refinement before being used in view synthesis.

Figure 4: The result from each refinement step. Initial Depth: obtained by fusing the patch-match-based and epipolar-based estimated depths together. Depth Correction via Raymarching: correcting depth values that are too close to the camera using raymarching. Forward Depth Projection: merging with neighboring depths via projection. Depth Fusion: the result of forward depth projection is used as the new dense depth and combined with the sparse depth map.

Depth Correction via Raymarching: The depth consistency is optimized by finding the optimal depth value among neighboring panoramas. Floating geometry artifacts often occur when the estimated depth map has outliers that are too close to the camera. For instance, in the top row of Fig. 4, the area around the pavilion has estimated depth values much smaller than its neighboring area. We first attempt to remove those outlier points by checking whether a depth value is the largest depth agreed upon by all neighboring depth panoramas within a given range. We check this consistency via a raymarching process similar to the one used in Koniaris et al. [20], as shown in Algorithm 1. The increase rate r and the number of nearest input panoramas K_rm need to be set manually for different scenes in order to obtain optimal results. In our experiments, we set r and K_rm to 0.005 and 4 respectively by default, and found that these values worked well in all our scenes.

Figure 5: View synthesis with each stage of the refinement result. The left-most and right-most images are the two closest reference views in the dataset. (a) Without refinement. (b) Result after applying depth fusion (initial depth). (c) Result after applying depth correction. (d) Result after forward depth projection. (e) Result after final depth fusion.
Algorithm 1 Depth Correction via Raymarching Algorithm
Require: a depth panorama D and its corresponding position t_new, the number of nearest neighbors K_rm, the collection of neighboring input depth panoramas {D_1, D_2, ..., D_K}, their corresponding positions {t_1, t_2, ..., t_K}, and the increase rate r.
Ensure: corrected depth D_cor
1:  D_cor = zero(Width, Height)
2:  for j = 1 to Height do
3:     for i = 1 to Width do
4:        d = D(i, j)
5:        for k = 1 to K_rm do
6:           i_rep, j_rep, d_rep = reprojection(i, j, d, t_new, t_k)
7:           if d_rep < D_k(i_rep, j_rep) then
8:              d = d + r·d
9:              k = 1
10:          end if
11:       end for
12:       D_cor(i, j) = d
13:    end for
14: end for
15: return D_cor

Although our raymarching process is able to remove outlier depths that are too close to the camera, it can also erase parts of objects. Due to the limitations of epipolar-based depth estimation, thin objects may not be correctly estimated in all neighboring views; if the sparse depth map also fails to capture them, the raymarching process ends up removing the depth information of those objects. An example is shown in the second row of Fig. 4, where the pillar of the pavilion has been partially erased. In such cases, the refined depth panorama loses the depth information related to the occluding object, causing artifacts where geometric details are missing. We overcome this issue via the following forward depth projection process.

Forward Depth Projection: Our depth correction refines the depth map by checking the consistency of depth values across neighboring panoramas within a given range. Since each panorama has different neighbors, the result of the depth correction will differ slightly between panoramas. Although an occluded region in one image might be partially erased during the first depth refinement process, that information may still exist in one of its neighboring panoramas. Our forward depth projection makes use of this view-dependent depth information by merging a given depth panorama and its neighbors to recover lost occlusion information. Our depth projection is shown in Algorithm 2, where the neighborhood range K_fp is scene dependent. We set K_fp = K_rm + 2 in our experiments. This slightly wider search range obtains information from neighboring panoramas that are less likely to have been affected by the same problematic depth panorama during raymarching.

Algorithm 2 Forward Depth Projection Algorithm
Require: position of the target viewpoint t_new, the number of nearest neighbors K_fp, the collection of neighboring input depth panoramas {D_1, D_2, ..., D_K}, and their corresponding positions {t_1, t_2, ..., t_K}.
Ensure: fused depth D_fus
1:  D_fus = infinity(Width, Height)
2:  for k = 1 to K_fp do
3:     for j = 1 to Height do
4:        for i = 1 to Width do
5:           d = D_k(i, j)
6:           i_rep, j_rep, d_rep = reprojection(i, j, d, t_k, t_new)
7:           if d_rep < D_fus(i_rep, j_rep) then
8:              D_fus(i_rep, j_rep) = d_rep
9:           end if
10:       end for
11:    end for
12: end for
13: return D_fus

For each iteration of depth refinement, we increase the values of K_rm and K_fp slightly (we use an increment of 2 in our experiments). This slightly increased checking range helps to improve the consistency of the depth over more panoramas; eventually K_rm and K_fp would equal the number of input panoramas. However, in our experiments, we found that around 3 iterations were sufficient for most of our test scenes. We illustrate the result of each depth refinement step in Fig. 4, and show views synthesized using the depth maps from each refinement step in Fig. 5 to demonstrate the quality improvement. A schematic sketch of the overall refinement loop is given below.
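In the sketch below, fuse_depth, raymarch_correct (Algorithm 1), and forward_project (Algorithm 2) stand in for implementations of the preceding steps and are assumed rather than shown here; the loop structure and parameter values follow the description above.

def refine_depths(sparse_depths, dense_depths, positions,
                  k_rm=4, r=0.005, iterations=3):
    # Iteratively enforce depth consistency across all input panoramas.
    refined = [fuse_depth(s, d) for s, d in zip(sparse_depths, dense_depths)]
    for _ in range(iterations):
        k_fp = k_rm + 2                     # wider range for forward projection
        # Depth correction via raymarching (Algorithm 1).
        corrected = [raymarch_correct(depth, positions[i], refined, positions, k_rm, r)
                     for i, depth in enumerate(refined)]
        # Forward depth projection (Algorithm 2) recovers occluders that
        # raymarching may have erased.
        projected = [forward_project(positions[i], corrected, positions, k_fp)
                     for i in range(len(corrected))]
        # Fuse the projected result (as the new dense depth) with the sparse depth.
        refined = [fuse_depth(s, p) for s, p in zip(sparse_depths, projected)]
        k_rm += 2                           # widen the consistency check each iteration
    return refined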

6 Free-Viewpoint Synthesis

Our next step is to synthesize novel view panoramas using the set of input RGB panoramas and their corresponding depths and transformations obtained by our pre-processing phase. Previous methods [15, 14] rely on mesh reconstruction for each input image, leading to accumulated geometric errors when blending the meshes for novel view synthesis, which limits quality. To avoid this, we approach view synthesis via depth-based image warping with real-time depth correction. During view synthesis, we first generate the depth panorama for the synthesized view, with depth correction, from the input panoramas. We then use this depth panorama to extract RGB pixel values from the inputs and blend them with weights to synthesize the pixel colors of the novel view panorama.

6.1 Spherical 3D reprojection

Image to 3D coordinates: We first align the orientations of the input panoramas to ensure they all face the same direction. Since we use the equirectangular representation of 360° panoramas, as in prior work [40], we first convert the pixels (i, j, d) of the input equirectangular RGBD images into spherical polar coordinates (θ, φ, d), where θ ∈ [−π, π], φ ∈ [−π/2, π/2], and d is the corresponding depth value from the input depth panorama D_k. We then project all the converted spherical coordinates of each input panorama into local 3D Cartesian coordinates v⃗ by Equation 1.

\begin{bmatrix} v_x \\ v_y \\ v_z \end{bmatrix} = \begin{bmatrix} d\cos(\phi)\cos(\theta) \\ d\sin(\phi) \\ -d\cos(\phi)\sin(\theta) \end{bmatrix}   (1)

Forming the novel panorama: The next step is to reproject v⃗ into the local coordinates of the target novel view panorama. If we define the center of the target novel view panorama as t_new, then the translation from any input panorama centered at t_k can be calculated as t⃗ = t_new − t_k. All 3D points v⃗ of the input panoramas are then reprojected to the target position t_new by computing (v⃗ − t⃗). The reprojected points of each input can subsequently be converted into spherical polar coordinates (θ_rep, φ_rep, d_rep) of t_new using Equation 2.

\begin{bmatrix} \theta_{rep} \\ \phi_{rep} \\ d_{rep} \end{bmatrix} = \begin{bmatrix} \arctan\left(-\frac{v_z - t_z}{v_x - t_x}\right) \\ \arcsin\left(\frac{v_y - t_y}{d_{rep}}\right) \\ \lVert \vec{v} - \vec{t} \rVert \end{bmatrix}   (2)

These spherical polar coordinates are then converted into pixels of the equirectangular representation of the novel view panorama.
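For concreteness, a vectorized sketch of Equations 1 and 2 is given below; the mapping from equirectangular pixel indices to (θ, φ) follows the ranges stated above, and the half-pixel offsets are an illustrative convention.

import numpy as np

def equirect_pixels_to_angles(width, height):
    # Map equirectangular pixel centers to (theta, phi) with
    # theta in [-pi, pi] and phi in [-pi/2, pi/2].
    i = (np.arange(width) + 0.5) / width
    j = (np.arange(height) + 0.5) / height
    theta = (i - 0.5) * 2.0 * np.pi
    phi = (j - 0.5) * np.pi
    return np.meshgrid(theta, phi)

def reproject_panorama(depth, t_k, t_new):
    # Equation 1: lift each pixel of the input depth panorama (centered at t_k)
    # into local 3D Cartesian coordinates.
    h, w = depth.shape
    theta, phi = equirect_pixels_to_angles(w, h)
    vx = depth * np.cos(phi) * np.cos(theta)
    vy = depth * np.sin(phi)
    vz = -depth * np.cos(phi) * np.sin(theta)
    # Translate into the frame of the target panorama centered at t_new.
    t = np.asarray(t_new) - np.asarray(t_k)
    px, py, pz = vx - t[0], vy - t[1], vz - t[2]
    d_rep = np.sqrt(px**2 + py**2 + pz**2)
    # Equation 2: convert back to spherical polar coordinates of the target view.
    theta_rep = np.arctan2(-pz, px)
    phi_rep = np.arcsin(np.clip(py / np.maximum(d_rep, 1e-8), -1.0, 1.0))
    # Back to equirectangular pixel coordinates of the target panorama.
    i_rep = (theta_rep / (2.0 * np.pi) + 0.5) * w
    j_rep = (phi_rep / np.pi + 0.5) * h
    return i_rep, j_rep, d_rep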

We use Equation 1 and 2 to extract the necessary data from the input panoramas, and synthesize the novel views in two steps: 1) Backwards warping with depth correction, and 2) image blending, described as follows.

Figure 6: Examples of novel view synthesis from various directions.
Figure 7: Synthesized views using different strategies. (a) Novel view synthesized with average weights, (b) blending with w_d only, (c) blending with both w_d and w_cam, and (d) the result when our full weighting formulation is applied.

6.2 Backwards Warping with Depth Correction

Direct projection of the input panoramas into a novel view tends to produce many missing pixels. Simple color interpolation, such as using a linear or cubic function, results in blending of foreground and background.

To overcome this, we first synthesize a depth panorama at the target novel view and interpolate the depth values to fill in any missing values. In our approach, we adopt a morphological closing [21] similar to the one used by Thatte et al. [40] to achieve the desired result. We then use the synthesized depth panorama to extract RGB pixel values from each input panorama using Equations 1 and 2 to synthesize the novel view panorama.
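A minimal sketch of this hole-filling step is shown below, assuming missing pixels in the forward-projected depth panorama are encoded as zero; the kernel size of the morphological closing is an illustrative choice.

import numpy as np
import cv2

def fill_depth_holes(projected_depth, kernel_size=7):
    # Forward projection leaves scattered missing pixels (encoded here as 0).
    # Grayscale morphological closing fills small holes from their
    # neighborhood without blending foreground and background depths the
    # way linear or cubic interpolation would.
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    closed = cv2.morphologyEx(projected_depth.astype(np.float32), cv2.MORPH_CLOSE, kernel)
    return np.where(projected_depth > 0, projected_depth, closed)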

However, there may be disagreement between the input depth panoramas for what the correct depth value is at the target position. A depth value that refers to the correct pixel in one input panorama may point to a different location in another input panorama, leading to visual artifacts in the synthesized panorama. To correct this, we apply Algorithm 1 to the synthesized depth panorama, improving the overall visual consistency. We perform backwards warping with the corrected depth panorama in order to obtain the RGB pixels from the input panoramas with reduced visual artifacts.

6.3 Image Blending

To synthesize the color values of our novel view panorama, we use the synthesized depth panorama to extract the RGB pixel values from the corresponding input panoramas. We extract the pixel values from the closest K neighboring panoramas, where K=4. If a suitable pixel cannot be found among the closest neighbors, we search further to find the corresponding pixels across other input panoramas.

After extracting the corresponding pixel values, their RGB values are blended by our weighting formula, which is based on three components: the correctness of the estimated depth w_d, the distance between the position of the target view and the input panorama w_cam, and the angle between the viewpoints and the relevant point in the scene w_ang. The final blending weight W for each pixel is computed by:

W = w_d \, w_{ang} \, w_{cam}   (3)

The weight w_d is computed from the difference between the reprojected depth and the actual depth value as:

w_d = (|d_{rep} - d| + 1)^{-1}   (4)

where d_rep is the depth estimated by reprojection from the position of the novel view depth panorama to the position of the input depth panorama, and d is the depth value stored in the input depth panorama. If the difference between d and d_rep is large, an occlusion has occurred, and the pixel from this input panorama should contribute less towards our output.

The camera weight w_cam measures the distance between the center of the novel view panorama and the center of the input panorama, defined as:

w_{cam} = \frac{s}{\lVert \vec{t} \rVert}   (5)

where t⃗ is described in Sect. 6.1. The constant s is a scale weight, which we set to 10 in our experiments.

Our angle weight w_ang reproduces view-dependent features in the synthesized image by favoring pixels with similar view angles, as considered in [7, 4]. It is calculated from the angle between the vector t⃗ and the vector v⃗, defined as:

w_{ang} = \pi - \arccos\left(\frac{\vec{t} \cdot \vec{v}}{\lVert\vec{t}\rVert\,\lVert\vec{v}\rVert}\right)   (6)

Finally, each pixel value I(p) of the novel panorama is synthesized by weighted blending of the corresponding pixel values from the K neighboring input panoramas I_k as:

I(p) = \frac{\sum_{k=1}^{K} W_k I_k(q_k)}{\sum_{k=1}^{K} W_k}   (7)

where q_k represents the corresponding pixel in the k-th input panorama. We illustrate the effect of different weighting formulations in Fig. 7. Our weighted blending successfully synthesizes the pixel values of the novel view panorama while reducing blending errors such as ghosting artifacts.
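The sketch below illustrates how Equations 3-7 combine per pixel, assuming the reprojected depths, stored depths, and the vectors t⃗ and v⃗ from Section 6.1 have already been gathered for each of the K neighboring panoramas; the array shapes are illustrative.

import numpy as np

def blend_pixels(colors, d_rep, d_input, t_vecs, v_vecs, s=10.0):
    # colors:  (K, H, W, 3) RGB values fetched from the K input panoramas
    # d_rep:   (K, H, W) depths reprojected from the novel view into each input
    # d_input: (K, H, W) depths stored in each input depth panorama
    # t_vecs:  (K, 3) translation t = t_new - t_k per input panorama
    # v_vecs:  (K, H, W, 3) 3D points of the scene in each input's local frame
    eps = 1e-8
    # Equation 4: penalize depth disagreement (occlusion indicator).
    w_d = 1.0 / (np.abs(d_rep - d_input) + 1.0)
    # Equation 5: favor input panoramas captured close to the novel viewpoint.
    t_norm = np.linalg.norm(t_vecs, axis=-1) + eps                  # (K,)
    w_cam = (s / t_norm)[:, None, None]
    # Equation 6: favor similar viewing angles for view-dependent effects.
    v_norm = np.linalg.norm(v_vecs, axis=-1) + eps
    cos_ang = np.einsum('khwc,kc->khw', v_vecs, t_vecs) / (v_norm * t_norm[:, None, None])
    w_ang = np.pi - np.arccos(np.clip(cos_ang, -1.0, 1.0))
    # Equations 3 and 7: combined weight and normalized weighted blend.
    w = w_d * w_ang * w_cam                                         # (K, H, W)
    return (w[..., None] * colors).sum(axis=0) / (w.sum(axis=0)[..., None] + eps)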

Figure 8: Test results for the three indoor scenes. ULR fails in all cases. Compared to Inside-Out, our method shows fewer visual artifacts on the borders and surfaces of objects. Unlike DeepBlending, our method looks sharper, with more texture detail.
Figure 9: Results of the three outdoor scenes. Our method presents more robust view synthesis for large outdoor scenes in situations where others failed.

7 Data and Implementation

We used the Nvidia CUDA library for GPU-based computations such as view synthesis. We rendered our novel view panorama to a VR HMD using the Oculus SDK and OpenGL libraries. The current implementation is able to work at 30fps.

We evaluated our method with our own captured datasets for the following two reasons. First, there are no available benchmark datasets of 360° panoramas for evaluating free-viewpoint synthesis methods. Second, one of the key features of our method is its ability to function on casually captured real-world 360° panoramas, meaning we needed a dataset that meets this criterion. We captured 360° videos of different environments using the Ricoh Theta V. Most of our scenes were captured with a handheld camera. We ensured that our dataset was captured in a variety of environments so that we could evaluate the scalability and robustness of our method. For details of the captured scenes, please refer to the appendix.

We sample every 10th frame from the original video for our dataset, while rejecting the few frames with motion blur caused by the casual capturing setup; this produces intervals of around 50 cm between input panoramas.

8 Results

We compared our results quantitatively and qualitatively with three closely related state-of-the-art methods: Unstructured Lumigraph Rendering (ULR) [7], Inside-Out [15], and DeepBlending [14], as implemented in the system provided by Bonopera et al. [5]. We also performed a qualitative comparison with the recent Neural Radiance Fields (NeRF) [25] and OmniPhotos [4]. Since NeRF's default projection method only supports front-facing and outside-in camera poses, we implemented a new projection method for 360° cameras so that we can directly use 360° panoramas as input to NeRF. We trained NeRF for 250k iterations on each of the tested scenes. Since the prior works [7, 15, 14] do not support 360° panoramas directly, the inputs to these methods are the same as those we used for COLMAP (Sect. 4), namely 8 perspective images per 360° panorama.

8.1 Qualitative Results

Fig. 6 shows the motion parallax of views synthesized in various directions. Fig. 12 shows examples of synthesized panoramas, which provide novel panoramic views away from the captured path.

Figure 10: Qualitative Comparison with NeRF on different outdoor scenes.

Comparisons of the view synthesis of the indoor scenes are shown in Fig. 8. We notice that learning-based methods such as DeepBlending effectively reduce the visual artifacts caused by geometry estimation errors, but often blur the scene. Both our method and Inside-Out preserve the sharp appearance of the scene; however, compared with Inside-Out, our method shows fewer visual artifacts in boundary regions.

Comparisons of view synthesis for the outdoor scenes are shown in Fig. 9, where our method outperforms the prior works. The geometry reconstruction of a large outdoor scene is very challenging. DeepBlending and Inside-Out are highly dependent on the quality of the mesh produced by surface reconstruction. The sparse depths produced by COLMAP tend to contain large missing regions, and therefore may cause surface reconstruction errors. Furthermore, their depth refinement does not consider depth consistency across views, and therefore often produces different errors across input panoramas. When blending multiple inputs for novel view synthesis, the errors from different inputs accumulate, producing visual artifacts.

Our depth estimation and refinement method generates reliable and consistent depth information for synthesizing panoramas in different viewpoints. Our method recovers detailed geometry with reduced errors and improves results by fusing sparse depths. However, one side effect is that it may remove some of the finer details, causing ghosting artifacts in some cases. Even with this side effect, our method consistently produces higher quality results than prior methods in all our test scenes.

We have tested our method with the scene data used in OmniPhotos [4], and the results are shown in Fig. 11. Similar to MegaParallax [3], OmniPhotos relies on optical flow to guide image warping. The estimated optical flow describes the movement of pixels from one image to another, and is thus only valid within the capture circle. When a user moves outside of the capture circle, optical flow can no longer provide adequate guidance for image warping, and the method has to rely on the estimated geometry proxy. However, their geometry proxy only provides rough geometry information; using it to perform image warping produces significant visual artifacts, such as twisted objects. Compared to OmniPhotos, our method and DeepBlending can both synthesize reasonable results outside the captured regions. In our test, our method shows better visual quality than DeepBlending.

In our test, NeRF struggles to generate novel views using our dataset. We hypothesize that because our test scenes attempt to cover very large areas with 360° panorama images, the scale of the scene exceeds the capacity of NeRF's MLP network, which leads to poor results. We further compared our method with NeRF, and the results are shown in Fig. 10.

Previous methods [15, 14] were designed for dense inputs, and we therefore also tested them with half the sampling interval. However, their results did not improve in our experiment. We hypothesize that the distortion of 360° panoramas produced by current 360° camera models is unfavorable for conventional reconstruction methods; increasing the number of sampled frames introduces more noisy features, which do not help improve the quality of the reconstructed mesh. Similar to previous methods, the number of images also impacts the quality of our method. We demonstrate how our method performs with different numbers of input images in Table I. Increasing the number of input images improves the result, but also requires more processing time. Nonetheless, as shown in the table, for our method an input of 20 images from a 14-second captured video sequence is sufficient for an acceptable reconstruction result.

Figure 11: Qualitative comparison with OmniPhotos on two selected scenes.
Figure 12: Examples of the synthesized views. The green star illustrates the position where the view was synthesized. The red points on the bottom right of each equirectangular image represent the positions (top-down view) where the input panoramas were captured. The gray area is the potential region for the user to move around with a novel synthesized view.

8.2 Quantitative Result

We quantitatively evaluated our method by comparing it with three closely related methods [7, 15, 14]. The evaluation follows the approach used by Waechter et al. [41]: we render novel view images at the same viewpoints as the input images, then use the input images as the ground truth for comparison in terms of completeness and visual quality [13].

We tested each method on six scenes with three indoor (PathWay, Hallway, Office) and three outdoor scenes (Bridge, Botanic Garden, Car Park). We did not evaluate the top/bottom view since viewers tend to fixate on the equator when exploring the panorama [4]. We measure the visual quality of the synthesized view at each location quantitatively by comparing the synthesized view with the ground truth image in several metrics such as Multiscale Structural Similarity Index (MS-SSIM), peak signal-to-noise ratio (PSNR), and the perceptual similarity (LPIPS). The average scores of our tests are in Table II. As shown in the table, our method outperforms others in all metrics in both indoor and outdoor scenes.
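For reference, a sketch of how such per-view scores could be computed with common open-source packages (scikit-image, pytorch-msssim, lpips) is shown below; these packages are illustrative choices and not necessarily the implementation used for the reported numbers.

import torch
import lpips                                  # pip install lpips
from pytorch_msssim import ms_ssim            # pip install pytorch-msssim
from skimage.metrics import peak_signal_noise_ratio

lpips_fn = lpips.LPIPS(net='alex')            # perceptual similarity network

def evaluate_view(synthesized, ground_truth):
    # Both inputs are HxWx3 uint8 images; the input panorama serves as ground truth.
    psnr = peak_signal_noise_ratio(ground_truth, synthesized)
    # Convert to NCHW float tensors for the structural/learned metrics.
    syn = torch.from_numpy(synthesized).permute(2, 0, 1)[None].float()
    gt = torch.from_numpy(ground_truth).permute(2, 0, 1)[None].float()
    msssim = ms_ssim(syn, gt, data_range=255).item()
    # LPIPS expects inputs scaled to [-1, 1].
    lpips_score = lpips_fn(syn / 127.5 - 1.0, gt / 127.5 - 1.0).item()
    return psnr, msssim, lpips_score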

We also performed a set of ablation studies, with the results shown in Table II. Synthesis with only sparse depth estimation performs worst, dense depth estimation alone improves the results, and the combination of the two outperforms either method on its own. The final depth refinement step shows a further clear improvement.

                  56 inputs    33 inputs    20 inputs
Process time      <18 hours    <8 hours     <5 hours
PSNR ↑            23.75        22.05        20.79
SSIM ↑            0.93         0.91         0.87
LPIPS ↓           0.05         0.07         0.10
TABLE I: The impact of the number of input panoramas on processing time and quality.
Indoor Scenes
                                  PSNR ↑    MS-SSIM ↑    LPIPS ↓
ULR                               21.49     0.85         0.13
Inside-Out                        22.51     0.85         0.08
DeepBlending                      22.31     0.82         0.10
Our method with full process      25.25     0.92         0.07
Ours with only sparse depth       14.74     0.68         0.19
Ours with only dense depth        22.46     0.85         0.11
Ours with only depth fusion       24.18     0.90         0.09

Outdoor Scenes
                                  PSNR ↑    MS-SSIM ↑    LPIPS ↓
ULR                               10.72     0.79         0.53
Inside-Out                        17.3      0.74         0.18
DeepBlending                      16.11     0.71         0.22
Our method with full process      21.93     0.84         0.12
Ours with only sparse depth       11.12     0.50         0.33
Ours with only dense depth        19.98     0.75         0.15
Ours with only depth fusion       20.09     0.81         0.14
TABLE II: Quantitative comparison of our method with prior works and ablated versions of our method. The scores are the mean over all indoor/outdoor scenes. ↑ means higher is better and ↓ means lower is better.

8.3 Performance

Our results were tested on a desktop PC with an Intel Xeon W-2133 CPU at 3.60GHz, 16GB of system RAM, and an Nvidia RTX 2080Ti GPU. The performance of novel panorama synthesis varies with the number of input panoramas. Throughout the experiments, we used the closest four input panoramas to synthesize a novel panorama. The data used for the view synthesis method is a set of 4K (4094×2048) equirectangular images and their corresponding 2K (2048×1024) depth panoramas. We are able to synthesize the panoramas at 30fps while rendering to the Oculus Rift headset. This frame rate increases to around 90fps when synthesizing a novel panorama at a resolution of 2048×1024.

The time it takes to process our captured panorama collection depends on the number of images, as shown in Table I. We report the time and storage required to process a set of 20 images using our method and prior methods in Table III.

                      DeepBlending    Inside-Out    Ours
COLMAP                <3 hours        <3 hours      <3 hours
Depth refinement      <1 hour         <1 hour       <1.5 hours
Storage required      2.9GB           3.9GB         47MB
TABLE III: The processing time and storage requirements for a scene of 20 input panoramas for the tested methods.

The storage space needed for the final processed dataset of 20 source views is about 47MB. Depth estimation requires significant temporary space for COLMAP to store its processing data; for instance, the 20-input scene required about 3GB of storage. However, once the sparse geometry reconstruction is completed, only the camera information and the reconstructed sparse point cloud (sparse depths) are kept, and the rest of the data can be discarded.

9 Discussion and future work

Our method shows promising results in our tests, outperforming the prior works. However, producing high fidelity 6-DoF video from a single handheld 360° camera is challenging, with a lot of room to improve.

Depth Estimation with Learning-based Methods: Our depth estimation could be improved for both better depth accuracy and faster processing. The recent NeRF [25] could provide a clue for this improvement, although we observed that the original NeRF has convergence issues when trained with 360° images, which we believe is related to its MLP network. The main idea of NeRF is to use a learning-based method to learn a representation of the scene; the scene information is still described using techniques similar to classical MVS, such as ray sampling. When performing ray sampling, the image pixels are converted to 3D points, so the format of the image does not affect this process. Furthermore, as 360° images have an FoV advantage over normal perspective images, their use should improve the result. Thus, replacing the MLP network in NeRF with a more effective learning structure, and using the improved NeRF for depth estimation, could be a good next step to improve our pipeline.

Reflective and Transparent Surfaces: Our depth estimation has limitations for reflective and transparent surfaces such as glass or water. An example of these limitations is shown in Fig. 12 (Hill scene), where our result contains ghosting artifacts on the glass window. The Multi-Plane Image (MPI) representation has demonstrated its ability to reproduce the behaviour of reflective and transparent surfaces [49, 11, 6]. Adapting similar methods into our pipeline to separate the transparent layers would be an interesting next step.

Dynamic Objects: Similar to previous work [15, 10, 14, 25], our method has limitations with dynamic objects. Given that the input of our method is captured by a single camera, moving objects appear at different positions in each image, making it difficult to estimate consistent depth for them. Furthermore, the different positions also cause inconsistent projection of the object (the same object is projected into different places). This could be solved by estimating the flow of the dynamic objects over time [28, 29] and using the flow information to guide image synthesis.

Stitching artifacts: 360° panoramas are produced by multi-image stitching, which with current methods still contains small distortions or gaps around the image boundaries, causing inconsistencies in object shape in these areas. We show an example of such artifacts in Fig. 13. This causes ghosting artifacts in the view synthesis when performing image blending. Reducing stitching artifacts in the captured 360° panoramas could reduce such artifacts.

Figure 13: Example of a stitching artifact. The red circle shows where the image stitching artifact appears.

Pre-processing: Our method relies on offline pre-processing, including the depth estimation and refinement steps. This is acceptable for a static scene, but limits adaptation to dynamic scenes.

User study: User studies were highly restricted due to COVID-19 at the time of this study, with a nationwide lockdown restricting any mass gathering. When possible, we plan to perform a user study to measure presence, immersion, VR sickness, and the overall usability of the 6-DoF experiences produced by our method.

10 Conclusion

We present a novel and complete pipeline to synthesize novel view panoramas using a set of 360° images captured by a handheld 360° camera. The result provides a 6-DoF experience: the sensation of walking around a captured large-scale scene with a full panoramic view at any location. We have tested our method on various scenes captured by a single handheld 360° camera, including indoor and outdoor scenes and mid- to large-scale scenes. We also compared our results with current state-of-the-art methods, showing better visual quality and robustness. Our method consistently produces high-fidelity results across all test scenes, while previous methods sometimes failed. We also outline the current limitations and potential future work that could lead to improvements. We believe our method, tested with a consumer-grade 360° camera, can be easily adapted to various applications that require 6-DoF experiences of captured large-scale scenes, particularly benefiting casual users.

References

  • [1] B. Attal, S. Ling, A. Gokaslan, C. Richardt, and J. Tompkin. Matryodshka: Real-time 6dof video view synthesis using multi-sphere images. In European Conference on Computer Vision, pp. 441–459. Springer, 2020.
  • [2] R. Bansal, G. Raj, and T. Choudhury. Blur image detection using laplacian operator and open-cv. In 2016 International Conference System Modeling & Advancement in Research Trends (SMART), pp. 63–67. IEEE, 2016.
  • [3] T. Bertel, N. D. Campbell, and C. Richardt. Megaparallax: Casual 360 panoramas with motion parallax. IEEE transactions on visualization and computer graphics, 25(5):1828–1835, 2019.
  • [4] T. Bertel, M. Yuan, R. Lindroos, and C. Richardt. Omniphotos: casual 360° vr photography. ACM Transactions on Graphics (TOG), 39(6):1–12, 2020.
  • [5] S. Bonopera, P. Hedman, J. Esnault, S. Prakash, S. Rodriguez, T. Thonat, M. Benadel, G. Chaurasia, J. Philip, and G. Drettakis. sibr: A system for image based rendering, 2020.
  • [6] M. Broxton, J. Flynn, R. Overbeck, D. Erickson, P. Hedman, M. Duvall, J. Dourgarian, J. Busch, M. Whalen, and P. Debevec. Immersive light field video with a layered mesh representation. ACM Transactions on Graphics (TOG), 39(4):86–1, 2020.
  • [7] C. Buehler, M. Bosse, L. McMillan, S. Gortler, and M. Cohen. Unstructured lumigraph rendering. In Proceedings of the 28th annual conference on Computer graphics and interactive techniques, pp. 425–432, 2001.
  • [8] F. Cazals and J. Giesen. Delaunay triangulation based surface reconstruction. In Effective computational geometry for curves and surfaces, pp. 231–276. Springer, 2006.
  • [9] H. Cho, J. Kim, and W. Woo. Novel view synthesis with multiple 360 images for large-scale 6-dof virtual reality system. In 2019 IEEE Conference on Virtual Reality and 3D User Interfaces (VR), pp. 880–881. IEEE, 2019.
  • [10] I. Choi, O. Gallo, A. Troccoli, M. H. Kim, and J. Kautz. Extreme view synthesis. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7781–7790, 2019.
  • [11] J. Flynn, M. Broxton, P. Debevec, M. DuVall, G. Fyffe, R. Overbeck, N. Snavely, and R. Tucker. Deepview: View synthesis with learned gradient descent. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2367–2376, 2019.
  • [12] C. Griwodz, S. Gasparini, L. Calvet, P. Gurdjos, F. Castan, B. Maujean, G. D. Lillo, and Y. Lanthony. Alicevision Meshroom: An open-source 3D reconstruction pipeline. In Proceedings of the 12th ACM Multimedia Systems Conference - MMSys ’21. ACM Press, 2021. doi: 10 . 1145/3458305 . 3478443
  • [13] P. Hedman, S. Alsisan, R. Szeliski, and J. Kopf. Casual 3d photography. ACM Transactions on Graphics (TOG), 36(6):1–15, 2017.
  • [14] P. Hedman, J. Philip, T. Price, J.-M. Frahm, G. Drettakis, and G. Brostow. Deep blending for free-viewpoint image-based rendering. ACM Transactions on Graphics (TOG), 37(6):1–15, 2018.
  • [15] P. Hedman, T. Ritschel, G. Drettakis, and G. Brostow. Scalable inside-out image-based rendering. ACM Transactions on Graphics (TOG), 35(6):1–11, 2016.
  • [16] A. Hosni, C. Rhemann, M. Bleyer, C. Rother, and M. Gelautz. Fast cost-volume filtering for visual correspondence and beyond. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(2):504–511, 2012.
  • [17] J. Huang, Z. Chen, D. Ceylan, and H. Jin. 6-dof vr videos with a single 360-camera. In 2017 IEEE Virtual Reality (VR), pp. 37–44. IEEE, 2017.
  • [18] M. Jancosek and T. Pajdla. Multi-view reconstruction preserving weakly-supported surfaces. In CVPR 2011, pp. 3121–3128. IEEE, 2011.
  • [19] M. Kazhdan, M. Bolitho, and H. Hoppe. Poisson surface reconstruction. In Proceedings of the fourth Eurographics symposium on Geometry processing, vol. 7, 2006.
  • [20] C. Koniaris, M. Kosek, D. Sinclair, and K. Mitchell. Compressed animated light fields with real-time view-dependent reconstruction. IEEE transactions on visualization and computer graphics, 25(4):1666–1680, 2018.
  • [21] L. Koskinen, J. T. Astola, and Y. A. Neuvo. Soft morphological filters. In Image Algebra and Morphological Image Processing II, vol. 1568, pp. 262–270. International Society for Optics and Photonics, 1991.
  • [22] C. Lipski, F. Klose, and M. Magnor. Correspondence and depth-image based rendering a hybrid approach for free-viewpoint video. IEEE Transactions on Circuits and Systems for Video Technology, 24(6):942–951, 2014.
  • [23] X. Mei, X. Sun, M. Zhou, S. Jiao, H. Wang, and X. Zhang. On building an accurate stereo matching system on graphics hardware. In 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), pp. 467–474. IEEE, 2011.
  • [24] B. Mildenhall, P. P. Srinivasan, R. Ortiz-Cayon, N. K. Kalantari, R. Ramamoorthi, R. Ng, and A. Kar. Local light field fusion: Practical view synthesis with prescriptive sampling guidelines. ACM Transactions on Graphics (TOG), 38(4):1–14, 2019.
  • [25] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In European Conference on Computer Vision, pp. 405–421. Springer, 2020.
  • [26] P. Ndjiki-Nya, M. Koppel, D. Doshkov, H. Lakshman, P. Merkle, K. Muller, and T. Wiegand. Depth image-based rendering with advanced texture synthesis for 3-d video. IEEE Transactions on Multimedia, 13(3):453–465, 2011.
  • [27] R. S. Overbeck, D. Erickson, D. Evangelakos, M. Pharr, and P. Debevec. A system for acquiring, processing, and rendering panoramic light field stills for virtual reality. ACM Transactions on Graphics (TOG), 37(6):1–15, 2018.
  • [28] K. Park, U. Sinha, J. T. Barron, S. Bouaziz, D. B. Goldman, S. M. Seitz, and R. Martin-Brualla. Nerfies: Deformable neural radiance fields. ICCV, 2021.
  • [29] K. Park, U. Sinha, P. Hedman, J. T. Barron, S. Bouaziz, D. B. Goldman, R. Martin-Brualla, and S. M. Seitz. Hypernerf: A higher-dimensional representation for topologically varying neural radiance fields. ACM Trans. Graph., 40(6), dec 2021.
  • [30] E. Penner and L. Zhang. Soft 3d reconstruction for view synthesis. ACM Transactions on Graphics (TOG), 36(6):1–11, 2017.
  • [31] A. P. Pozo, M. Toksvig, T. F. Schrager, J. Hsu, U. Mathur, A. Sorkine-Hornung, R. Szeliski, and B. Cabral. An integrated 6dof video camera and system design. ACM Transactions on Graphics (TOG), 38(6):1–16, 2019.
  • [32] G. Riegler and V. Koltun. Free view synthesis. In European Conference on Computer Vision, pp. 623–640. Springer, 2020.
  • [33] G. Riegler and V. Koltun. Stable view synthesis. arXiv preprint arXiv:2011.07233, 2020.
  • [34] T. Sato and N. Yokoya. Efficient hundreds-baseline stereo by counting interest points for moving omni-directional multi-camera system. Journal of Visual Communication and Image Representation, 21(5-6):416–426, 2010.
  • [35] J. L. Schonberger and J.-M. Frahm. Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4104–4113, 2016.
  • [36] A. Serrano, I. Kim, Z. Chen, S. DiVerdi, D. Gutierrez, A. Hertzmann, and B. Masia. Motion parallax for 360 rgbd video. IEEE Transactions on Visualization and Computer Graphics, 25(5):1817–1827, 2019.
  • [37] H. Shum and S. B. Kang. Review of image-based rendering techniques. In Visual Communications and Image Processing 2000, vol. 4067, pp. 2–13. International Society for Optics and Photonics, 2000.
  • [38] H.-Y. Shum, S.-C. Chan, and S. B. Kang. Image-based rendering. Springer Science & Business Media, 2008.
  • [39] J. Thatte and B. Girod. Towards perceptual evaluation of six degrees of freedom virtual reality rendering from stacked omnistereo representation. Electronic Imaging, 2018(5):352–1, 2018.
  • [40] J. Thatte, T. Lian, B. Wandell, and B. Girod. Stacked omnistereo for virtual reality with six degrees of freedom. In 2017 IEEE Visual Communications and Image Processing (VCIP), pp. 1–4. IEEE, 2017.
  • [41] M. Waechter, M. Beljan, S. Fuhrmann, N. Moehrle, J. Kopf, and M. Goesele. Virtual rephotography: Novel view prediction error for 3d reconstruction. ACM Transactions on Graphics (TOG), 36(1):1–11, 2017.
  • [42] F.-E. Wang, Y.-H. Yeh, M. Sun, W.-C. Chiu, and Y.-H. Tsai. Bifuse: Monocular 360 depth estimation via bi-projection fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 462–471, 2020.
  • [43] N.-H. Wang, B. Solarte, Y.-H. Tsai, W.-C. Chiu, and M. Sun. 360sd-net: 360° stereo depth estimation with learnable cost volume. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 582–588. IEEE, 2020.
  • [44] J. Xu, X. Wu, Z. Zhu, Q. Huang, Y. Yang, H. Bao, and W. Xu. Scalable image-based indoor scene rendering with reflections. ACM Transactions on Graphics (TOG), 40(4):1–14, 2021.
  • [45] Y. Yao, Z. Luo, S. Li, T. Fang, and L. Quan. Mvsnet: Depth inference for unstructured multi-view stereo. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 767–783, 2018.
  • [46] Y. Yao, Z. Luo, S. Li, T. Shen, T. Fang, and L. Quan. Recurrent mvsnet for high-resolution multi-view stereo depth inference. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5525–5534, 2019.
  • [47] K. Zhang, G. Riegler, N. Snavely, and V. Koltun. Nerf++: Analyzing and improving neural radiance fields. arXiv preprint arXiv:2010.07492, 2020.
  • [48] Z. Zhang. Determining the epipolar geometry and its uncertainty: A review. International journal of computer vision, 27(2):161–195, 1998.
  • [49] T. Zhou, R. Tucker, J. Flynn, G. Fyffe, and N. Snavely. Stereo magnification: Learning view synthesis using multiplane images. arXiv preprint arXiv:1805.09817, 2018.
  • [50] S. Zinger, L. Do, and P. de With. Free-viewpoint depth image based rendering. Journal of visual communication and image representation, 21(5-6):533–541, 2010.
  • [51] N. Zioulis, A. Karakottas, D. Zarpalas, F. Alvarez, and P. Daras. Spherical view synthesis for self-supervised 360 depth estimation. In 2019 International Conference on 3D Vision (3DV), pp. 690–699. IEEE, 2019.