Adaptive Joint Optimization for 3D Reconstruction with Differentiable Rendering

Jingbo Zhang, Ziyu Wan, and Jing Liao^∗ ^∗: corresponding author. J. Zhang, Z. Wan and J. Liao are with Department of Computer Science, City University of Hong Kong. E-mail: [email protected], [email protected], [email protected].

Abstract

Due to inevitable noises introduced during scanning and quantization, 3D reconstruction via RGB-D sensors suffers from errors both in geometry and texture, leading to artifacts such as camera drifting, mesh distortion, texture ghosting, and blurriness. Given an imperfect reconstructed 3D model, most previous methods have focused on the refinement of either geometry, texture, or camera pose. Or different optimization schemes and objectives for optimizing each component have been used in previous joint optimization methods, forming a complicated system. In this paper, we propose a novel optimization approach based on differentiable rendering, which integrates the optimization of camera pose, geometry, and texture into a unified framework by enforcing consistency between the rendered results and the corresponding RGB-D inputs. Based on the unified framework, we introduce a joint optimization approach to fully exploit the inter-relationships between geometry, texture, and camera pose, and describe an adaptive interleaving strategy to improve optimization stability and efficiency. Using differentiable rendering, an image-level adversarial loss is applied to further improve the 3D model, making it more photorealistic. Experiments on synthetic and real data using quantitative and qualitative evaluation demonstrated the superiority of our approach in recovering both fine-scale geometry and high-fidelity texture. Code is available at https://adjointopti.github.io/adjoin.github.io/.

Index Terms:

Texture optimization, geometry refinement, 3D reconstruction, adaptive interleaving strategy, differentiable rendering.

1 Introduction

Reconstructing real-world 3D objects and scenes with high-fidelity texture and geometry has been a long-standing, important problem, since it has broad application prospects in various fields such as VR/AR, animations, and video games [1, 2, 3]. With the emergence and wide availability of hand-held RGB-D cameras, it is now very convenient to reconstruct both the 3D geometry and the texture of an object. However, the reconstruction results are still far from satisfactory, as shown in Figure 1. There are several mixed reasons for the production of such an inferior 3D textured model [4, 5]. 1) The depth noise produced by RGB-D cameras leads to imperfect 3D geometry. 2) The estimation errors of camera poses accumulate and finally lead to camera drifting. 3) Misalignment occur between the depth and color frames. 4) Due to the above errors, the texture mapping process, which naïvely projects multi-view images into another view-independent texture map, will inevitably produce blurring and ghosting artifacts.

In order to alleviate these problems, researchers have developed various algorithms to optimize the initial imperfect 3D model, and generate a higher-quality one by improving the texture quality [5, 6, 7], adjusting camera poses [1, 6], or refining mesh vertices [8, 9]. However, these methods, which focus on single-component optimization, subjectively break the inter-relationship between geometry, texture, and camera pose. Therefore, they have limited capabilities to compensate for reconstruction errors derived from the aforementioned mixed reasons. Maier et al. [10] first proposed an Intrinsic3D method to jointly optimize mesh, texture, camera pose, and scene lighting based on shape from shading (SFS). However, due to the SFS-based method’s inherent limitations, Intrinsic3D is extremely slow and tends to suffer from the texture-copy problem (i.e., illogical geometric deformation caused by texture, shown in Figure 10b). JointGT [4] effectively solves these problems by avoiding SFS in the optimization. Still, its complicated framework, which involves multiple optimization schemes and objective functions for geometry, texture, and camera pose, makes it less scalable and robust, especially when reconstruction errors are large.

Refer to caption — Figure 1: A 3D mesh model reconstructed from RGB-D frames by KinectFusion lacks geometric and texture details. Compared with the state-of-the-art optimization schemes [10, 4], our method can obtain visually realistic textures and high-quality geometric models.

In this paper, we propose a unified framework for joint optimization of texture, geometry, and camera pose. This work was inspired by the recent successes of differentiable renderer [11, 12, 13]. With it, the 3D model to be optimized can be rendered into multi-view images. Image-level objectives comparing the consistency between the rendered image and its corresponding RGB-D input can then be calculated, and back-propagated through the differentiable renderer to update the camera parameters, vertex positions, and texture colors of the 3D model, either separately or simultaneously. This unified framework has three benefits. First, with the unified framework, we jointly optimize the geometry, texture, and camera pose to fully exploit the inner relationships between different components and let them mutually improve each other. Second, our unified objectives are simple but effective. Unlike previous methods, which have to define specialized objectives defined on vertex color, vertex position, depth, and camera parameters, we use general image-level losses to supervise the optimization of geometry, texture, and camera pose. Third, thanks to the end-to-end framework with differentiable rendering, more advanced image-level objectives, such as perceptual loss [14] and adversarial loss [15] are introduced in our optimization, which greatly improves the rendering photorealism of 3D models optimized using our method.

Considering the optimization stability and convergence rate, in our joint optimization framework we do not simply update the geometry, texture, and camera pose altogether in each iteration. Instead, we introduce an adaptive optimization strategy to smartly interleave the update of geometry, texture, and camera pose, based on the convergence of objectives. We performed quantitative and qualitative evaluations on our proposed joint optimization framework on various datasets. The experimental results demonstrated the value of our method in recovering both fine-scale geometry and high-fidelity texture, as shown in Figure 1.

To summarize, our contributions are three-fold:

•

We propose the first joint optimization framework for RGB-D reconstruction based on differentiable rendering. It unifies optimization schemes and objectives for geometry, texture, and camera pose; allows joint optimizations of different components to mutual benefit; and supports the use of adversarial loss to increase the photorealism of the reconstructed model.
•

We introduce a joint optimization strategy to adaptively interleave the update of geometry, texture, and camera pose, which leads to faster and more stable convergence.
•

The experimental results show that our method performs considerably better than state-of-the-art methods, using either separate or joint optimization.

2 Related Work

2.1 Geometry and Texture Optimization

In this section, we provide a brief overview of texture and geometry optimization for 3D reconstruction. The first research line focuses on geometry refinement. Zollhöfer et al. [16] optimized the geometry of reconstructed 3D model encoded in a truncated signed distance field (SDF). Choe et al. [17] exploited shading cues captured from infrared cameras to improve the quality of a 3D mesh. Romanoni et al. [18] refined the surface geometry by optimizing composite energy in a variational manner, meanwhile updating the semantic labels based on Markov Random Field (MRF). Jiang et al. [19] combined both facial priors and SFS to enhance the fine geometric details of a portrait model. Schmitt et al. [20] recovered geometry details via jointly correcting camera pose and estimated material properties.

Another research line is about texture optimization. Zhou et al. [6] employed local warping for texture images to rectify complex distortions of geometry. Bi et al. [5] used patch-based synthesis to correct the misalignment of multi-view reference images. Fu et al. [21] proposed a global-to-local non-rigid optimization method to achieve better texture mapping results. Li et al. developed a fast texture mapping scheme to reduce misalignment at texture boundaries. Lee et al. adopted a texture-fution method with an SDF voxel grid to optimize the texture of a real-time scanning model. Huang et al. [22] designed a misalignment-tolerant metric based on generalized adversarial networks (GANs) to produce photorealistic textures.

Different from the above-mentioned works, some recent research has attempted to optimize texture and geometry jointly. Wang et al. [23] used planar primitives to partition a model and jointly optimized plane parameters, camera poses, texture and geometry using photometric consistency and planar constraints. However, this method relies on plane priors, and is not suitable for complex nonplanar objects. Maier et al. [10] proposed an SFS-based method to simultaneously optimize texture, camera pose, geometry, and lighting, but SFS is an ill-posed problem with potentially ambiguous solutions. Such ambiguity will lead to inferior results with texture-copy artifacts. Fu et al. [4] suggested directly optimizing the 3D models and avoiding the SFS process, which successfully alleviates the problem of texture-coping. However, they used different optimization schemes and objectives for texture, geometry, and camera pose optimizations, which makes the system complicated and less robust, especially to large reconstruction errors. Unlike Fu et al. [4], we propose a unified framework to jointly optimize texture, geometry, and camera poses via differentiable rendering. Experiments showed that our unified framework is more robust to different levels of reconstruction errors and achieves better results than existing methods.

2.2 Differentiable Rendering

Triangle meshes, a well-established representation of a surface, have been used in almost every graphics pipeline due to their efficiency and flexibility in terms of vertex transformations and texturing. However, traditional graphics engines do not produce usable gradients for optimization purposes. To solve this problem, some early work [12, 24] approximated gradients on mesh vertices, and recent methods [25, 13] propose fully-differentiable formulations. The property of differentiable renderer bridges the connection between a traditional optimization pipeline and advanced deep learning techniques. We adopt a soft-rasterizer renderer [13] in this work, which allows us to optimize the parameters of a 3D model, including texture, geometry, and camera poses, by enforcing weakly supervised losses between the rendered images and the corresponding RGB-D inputs.

2.3 Generative Adversarial Networks

GANs [15] have achieved impressive progress in areas such as super-resolution [26], image restoration [27], image translation [28], and 3D topological representation [29]. The main principle of a GAN is to run a zero-sum game between a generator and a discriminator in an adversarial manner. Unlike previous photometric approaches such as L1 or L2, which will lead to over-smooth results, adversarial loss can ensure high-quality image generation with high-frequency details. However, it is difficult to include adversarial loss in a traditional optimization pipeline [4, 7] due to the unstable process caused by entangled objectives. Huang et al. [22] first propose to combine texture optimization and GANs to solve the misalignment problem through view transformation-based warping. In this work, we combine GANs with differentiable rendering to achieve effective joint texture and geometry optimization for 3D reconstruction.

3 Methods

Our method aims to produce both fine-scale geometric detail and high-fidelity texture for 3D model reconstructed from scanned RGB-D images. To this end, we propose a joint optimization framework to refine inaccurate camera poses, rough geometry surfaces, and unclear texture.

3.1 Problem Setting

The inputs of our method include a set of RGB-D images scanned by a consumer range camera, corresponding estimated camera poses and object silhouettes for each frame, and an initial imperfect 3D model reconstructed using an existing reconstructing method like BundleFusion [30], KinectFution [31], or COLMAP [32]. Let ${M}$ indicate the reconstructed 3D mesh, and ${T}$ represent the texture map to be optimized. $\boldsymbol{I}_{A}$ , $\boldsymbol{D}_{A}$ , and $\boldsymbol{S}_{A}$ respectively denote the color image, depth map, and silhouette map under view ${A}$ , and its corresponding camera pose, indicated as $\boldsymbol{P}_{A}$ , is composed of a rotation matrix and a translation vector, which can transform each vertex, $\boldsymbol{v}_{i}$ , of a mesh, ${M}$ , from the world coordinate system to the camera coordinate system by $\left[\widetilde{\boldsymbol{v}}_{i}\right]_{\text{world}}=\boldsymbol{P}_{A}\left[\widetilde{\boldsymbol{v}}_{i}\right]_{\text{camera}}$ , where $\left[\widetilde{\boldsymbol{v}}_{i}\right]_{X}$ indicates the homogeneous coordinate of the vertex $\boldsymbol{v}_{i}$ in the $X$ coordinate system. Now the problem is how to produce the optimized 3D reconstruction, based on known information.

3.2 Overview

Considering that 3D reconstruction errors come from mixed factors regarding both texture and geometry, we jointly optimize the vertex position, camera pose, and texture map with a unified framework based on differentiable rendering. As depicted in Sec. 3.3, the 3D mesh, ${M}$ , with texture, ${T}$ , is rendered with different camera poses, $\boldsymbol{P}$ . Then a series of image-level losses measuring the differences between the rendered frames and the RGB-D inputs are computed, to update ${M}$ , ${T}$ , and $\boldsymbol{P}$ . To achieve fast and stable convergence, we do not naïvely update ${M}$ , ${T}$ , and $\boldsymbol{P}$ together in every iteration. Instead, we designed a strategy to adaptively interleave the optimization of ${M}$ , ${T}$ , and $\boldsymbol{P}$ according to the convergence rate in Sec. 3.4. The overall pipeline is shown in Figure 2.

3.3 Unified Optimization Framework

To enable the effective and efficient joint optimization of the different components of a 3D model, a unified framework supporting the optimization of texture, geometry, and camera pose is essential. However, this task is not easy, since texture, geometry, and camera pose are in different domains. Specialized schemes and objectives have often been designed for the optimization of each component in previous work. Inspired by the fact that the rendering process is a function combining texture, geometry, and camera pose, we propose a novel unified optimization framework based on differentiable rendering. Next, we describe the details of this framework including the multi-view rendering process and common objectives.

3.3.1 Multi-view Rendering

During optimization, a 3D mesh ${M}^{t}$ with texture map ${T}^{t}$ and an associated camera pose $\boldsymbol{P}^{t}_{A}$ are fed to the differentiable renderer to generate a color image $\boldsymbol{I}_{A}^{DR,t}=DR\left(M^{t},T^{t}\mid\boldsymbol{P}_{A}^{t}\right)$ , depth image $\boldsymbol{D}_{A}^{DR,t}$ and silhouette image $\boldsymbol{S}_{A}^{DR,t}$ , where ${DR}$ indicates the differentiable rendering operation. Instead of directly approximating $\boldsymbol{I}_{A}^{DR,t}$ with its ground truth $\boldsymbol{I}_{A}$ in view $A$ , we generate more ground-truth references for comparison, by re-projecting RGB-D inputs under other auxiliary views to the target view $A$ . This approach can help to compensate for errors in a single view and is inspired by [22]. Specifically, we sequentially select a color image $\boldsymbol{I}_{A}$ with associated camera pose $\boldsymbol{P}_{A}$ from all the candidate RGB-D inputs as the target view, and randomly select another image $\boldsymbol{I}_{B}$ with camera pose $\boldsymbol{P}_{B}$ from adjacent views of $\boldsymbol{I}_{A}$ as the auxiliary view. For each pixel $\boldsymbol{q}_{B}$ in the auxiliary image $\boldsymbol{I}_{B}$ , we can compute its re-projected pixel $\boldsymbol{q}_{B\rightarrow A}$ via a simple spatial transformation:

\boldsymbol{q}_{B\rightarrow A}=\boldsymbol{K}\boldsymbol{P}_{A}\boldsymbol{P}_{B}^{-1}\left(d_{B}\boldsymbol{K}^{-1}\boldsymbol{q}_{B}\right),

(1)

where $d_{B}\boldsymbol{K}^{-1}\boldsymbol{q}_{B}$ represents the associated vertex position of view $B$ in camera space, which is produced by using the known depth value $d_{B}$ and intrinsic matrix $\boldsymbol{K}$ of the camera. Then, the two projection transformations—camera to world and world to camera—are employed to find the corresponding vertex position of the camera coordinate of expected view $A$ . According to the transformation of a single pixel in Eq. 1, we can re-project the RGB image under auxiliary view ${B}$ into the target view $A$ by: $\boldsymbol{I}_{B\rightarrow A}=\operatorname{RP}\left(\boldsymbol{I}_{B}\rightarrow\boldsymbol{I}_{A}\right)$ , where $\operatorname{RP}$ represents the re-projection process.

To avoid the extreme case where the re-projected image has few visible pixels, the auxiliary view $B$ will be selected within a maximum $15^{\circ}$ deviation of the target view $A$ . Practically, the neighboring views of $A$ , denoted as $J_{A}$ , are determined using the following metric:

J_{A}=\left\{\boldsymbol{I}_{B}\in\Psi_{\text{color}}:\angle\left(\boldsymbol{R}_{A},\boldsymbol{R}_{B}\right)\leq 15^{\circ}\right\}

(2)

where $\Psi_{\text{color}}$ is the color image set, and $\angle\left(\boldsymbol{R}_{A},\boldsymbol{R}_{B}\right)$ denotes the angle between the rotation matrix $\boldsymbol{R}_{A}$ and $\boldsymbol{R}_{B}$ in view $A$ and $B$ .

3.3.2 Common Objectives

In every iteration $t$ , the rendered color image $\boldsymbol{I}_{A}^{DR,t}$ , depth image $\boldsymbol{D}_{A}^{DR,t}$ and silhouette $\boldsymbol{S}_{A}^{DR,t}$ in view $A$ are compared with the re-projected image $\boldsymbol{I}_{B\rightarrow A}$ , scanned depth map $\boldsymbol{D}_{A}$ and object mask $\boldsymbol{S}_{A}$ , to achieve a common objective of reconstructing RGB-D inputs. The most fundamental loss is image-level L1, which is employed to guide the optimization process based on the re-projecting transformation, to enhance the color consistency between the rendered image and the ground truth, defined as follows:

L_{RGB}=\left\|\boldsymbol{I}_{B\rightarrow A}-\boldsymbol{I}_{A}^{DR,t}\right\|_{1}.

(3)

The depth loss $L_{depth}$ is calculated by measuring the L1 distance between the scanned and rendered depth images:

L_{depth}=\left\|\boldsymbol{D}_{A}-\boldsymbol{D}_{A}^{DR,t}\right\|_{1},

(4)

which is effective to enhance the geometric consistency.

We employ an Intersection-over-Union (IoU) loss to measure the silhouette consistency between mesh model and object:

L_{IoU}=1-\frac{\left\|\boldsymbol{S}_{A}\otimes\boldsymbol{S}_{A}^{DR,t}\right\|_{1}}{\left\|\boldsymbol{S}_{A}\oplus\boldsymbol{S}_{A}^{DR,t}-\boldsymbol{S}_{A}\otimes\boldsymbol{S}_{A}^{DR,t}\right\|_{1}},

(5)

where $\otimes$ and $\oplus$ represent the element-wise product and the sum operator, respectively.

Finally, we take the weighted sum of the RGB loss, the depth loss, and the IoU loss as the common objective to optimize texture color $T^{t}$ , mesh vertices $M^{t}$ , and camera parameters $\boldsymbol{P}^{t}$ :

L_{common}=\lambda_{C}L_{RGB}+\lambda_{D}L_{depth}+\lambda_{S}L_{IoU},

(6)

where $\lambda_{C}$ , $\lambda_{D}$ and $\lambda_{S}$ are constant coefficients balancing between terms. We empirically set $\lambda_{C}=0.1$ , $\lambda_{D}=1$ and $\lambda_{S}=1$ in all the experiments. Note that in each epoch, this common objective (Eq. 6) under all views is averaged.

3.4 Joint Optimization

With our unified optimization framework, one naïve joint optimization strategy is to update the texture color $T^{t}$ , mesh vertices $M^{t}$ , and camera parameters $\boldsymbol{P}^{t}$ together in every single iteration $t$ . However, we found this naïve strategy results in slow and unstable convergence. To solve this problem, we developed a joint optimization strategy to adaptively interleave updates of different components based on the common objective’s convergence rate (Eq. 6). With this strategy, the updating iterations for each component will be dynamically adjusted in different stages, leading to more stable and efficient optimization. In this way, the optimization of different components will still be beneficial to each other across iterations.

In addition to convergence acceleration, the interleaving strategy brings another benefit: it allows the addition of specialized objectives for a component in its own iteration. Specifically, we add Laplacian loss in the geometry iterations to constrain the mesh smoothness, while adding adversarial loss in the texture iterations to increase photorealism. Next, we introduce specific losses used in geometry and texture iterations, and then our adaptive interleaving strategy.

3.4.1 Geometry Iteration with Laplacian Loss

The initial geometry of an input 3D mesh usually suffers from noise introduced during the reconstruction procedure. In geometry iteration, vertex positions are adjusted to restore high-fidelity geometry by maximizing the consistency between the rendered results and the scanned RGB-D images. To further guarantee the local smoothness of the refined mesh, we add a Laplacian loss to the common objective (Eq. 6) to refine the geometry. Let $\boldsymbol{W}$ denote the uniform Laplacian matrix [33] of size $m\times m$ , where $m$ is the number of vertices. The Laplacian loss can be defined as an L2 norm of the Laplacian coordinates:

L_{Lap}=\left\|\boldsymbol{W}\boldsymbol{V}_{d}\right\|_{2},

(7)

where $\boldsymbol{V}_{d}=\left[\boldsymbol{v}_{1d},\boldsymbol{v}_{2d},\cdots,\boldsymbol{v}_{md}\right]^{T}$ represents the vertex matrix of mesh $M^{t}$ , and $d\in{x,y,z}$ indicates the spatial coordinate axis.

3.4.2 Texture Iteration with Adversarial Loss

Using only the common objective (Eq. 6) defined by the L1 norms, the optimized texture will appear blurry and lose fine detail. To solve this problem, we introduce adversarial learning in the texture iteration in addition to $L_{common}$ . By jointly training an image discriminator network $\mathcal{D}$ to distinguish ’real’ and ’fake’ images, we aim to produce a texture that is indistinguishable from re-projections of captured images from other views. Accordingly, the adversarial loss is defined as:

\displaystyle L_{adv}=\log\mathcal{D}\left(\boldsymbol{I}_{A},\boldsymbol{I}_{B\rightarrow A}\right)+\log\left(1-\mathcal{D}\left(\boldsymbol{I}_{A},\boldsymbol{I}_{A}^{DR,t}\right)\right)

During optimization, the discriminator $\mathcal{D}$ learns to recognize artifacts like seams, noise, or blurriness, that do not appear in real images and is tolerant to small misalignment between a rendered image $\boldsymbol{I}_{A}^{DR,t}$ and the real image $\boldsymbol{I}_{B\rightarrow A}$ . Therefore, the texture optimized to fool the discriminator appears more realistic than that produced using only the common objective.

Algorithm 1 Pseudocode for Adaptive Iterative Strategy

threshold

\delta=10^{-3}

, patience

\Omega=50

;

common loss

L=\left[L^{(1)},\cdots,L^{(t)},\cdots\right]

best value

\beta=L^{(1)}

, maximum steps

T_{max}

(e.g.,

10^{3}

)

Let counter

n=0

\text{strategy}=keep

for

t=0

T_{max}

L^{(t)}<\beta*(1-\delta)

then

n=0

\beta=L^{(t)}

;

else

n+=1

end if

while

n\geq\Omega

\text{strategy}=next

end while

end for

strategy

3.4.3 Adaptive Interleaving Strategy

In this strategy, we employ external iterations to achieve joint optimization and internal iterations to search for a temporary optimum based on current data for each stage. Specifically, in each external iteration, we first minimize $L_{common}$ to correct the camera poses $\boldsymbol{P}$ until a local optimum is reached, while fixing geometry $M$ and texture $T$ . Then, the 3D mesh $M$ is refined by minimizing the sum of the common loss and additional Laplacian loss, with both $\boldsymbol{P}$ and $T$ fixed. Finally, we optimize the texture map $T$ based on the updated and fixed $\boldsymbol{P}$ and $M$ , using adversarial learning. The external iteration cycle among camera pose, geometry, and texture is repeated three times. Interleaving the optimization of each component makes the convergence more stable.

The internal iteration numbers in each external iteration are not set manually. Instead, an adaptive strategy is adopted, to achieve a good trade-off between convergence speed and quality. The proposed adaptive strategy tested after each internal iteration is illustrated in Algorithm 3.4.2. It checks the convergence rate of the common objective $L_{common}$ , and returns a flag $next$ to terminate internal iteration, or otherwise continue.

4 Experiments

We first compared our method with state-of-the-art methods on a synthetic dataset by adding noise to high-quality laser scanning models. Then, we evaluated our method on public RGB-D datasets to demonstrate the effectiveness of our method on real 3D reconstructions. We used ablation studies to validate the performance of each major component of our pipeline. An extension study using our method for the optimization of RGB data is described in Sec. 4.4. Finally, we performed a user study to qualitatively compare the experimental results of the different methods on synthetic and real data. All of the experiments were conducted on a PC with Intel Core i7-9700 3.00GHz and GeForce RTX2080Ti 12GB, using author-released codes for compared methods [10, 21, 22, 4].

4.1 Evaluation on Synthetic Data

To quantitatively evaluate the performance of our method, we first tested it on a synthetic dataset for which both geometry and texture ground truth are known. Three representative baselines were compared with our method: G2LTex [21], in which only texture is optimized via a texture mapping strategy; Adversarial Texture Optimization (ATO) [22], in which only texture is optimized with adversarial loss; and JointTG [4], in which camera pose, geometry, and texture are jointly optimized. Here, we ignored Intrinsic3d [10], since its model refinement is based on a signed distance field (SDF) volume instead of a given noisy textured mesh, although this method was used in experiments with real data.

4.1.1 Synthetic Dataset

We collected nine high-quality 3D models (D1-D9) reconstructed by high-precision laser scanning to serve as the ground truth. We uniformly sampled 40 views on a unit sphere, and rendered the textured model into these views, to simulate the RGB-D images scanned by a consumer range camera. For each ground truth model, random noise was added to its geometry, texture, and camera poses, to synthesize an imperfect input model.

Specifically, we added uniformly distributed noise ranging from $\left[-e_{t},e_{t}\right]$ to the translation vector, and noise of Euler angle ranging from $\left[-e_{r},e_{r}\right]$ to the rotation matrix, to simulate inaccurate camera poses. For geometric errors, we added random disturbances to the mesh vertices, and applied three steps of Laplacian smoothing to these random values, in which the disturbances satisfy a uniform distribution ranging from $\left[-e_{g},e_{g}\right]$ . In our experiment, we set ${e}_{t}=0.01\times 1.5^{n}$ , $e_{r}=5$ , and $e_{g}=0.03\times 1.5^{n}$ , where $n$ is a measure of the degree of noise. We set $n=1.5$ in the general comparative experiment and changed $n$ from $0.25$ to $2.5$ at intervals of $0.25$ in the robustness study of the optimization methods. For texture errors, we intentionally misaligned and blurred the original texture, and randomly added irregular masks to simulate misalignment, blurriness, and seams on the textures. Some synthesized data are shown in the first column of Figure 3.

TABLE I: Quantitative comparison on the state-of-the-art optimization methods and ours for real datasets.

Datasets	Dolls (9 models)			Intrinsic3d (5 models)			Scene3d (12 models)			Chairs (35 models)
	PSNR $\uparrow$	SSIM $\uparrow$	Perceptual $\downarrow$	PSNR $\uparrow$	SSIM $\uparrow$	Perceptual $\downarrow$	PSNR $\uparrow$	SSIM $\uparrow$	Perceptual $\downarrow$	PSNR $\uparrow$	SSIM $\uparrow$	Perceptual $\downarrow$
G2LTex	22.758	0.920	0.070	21.684	0.649	0.276	17.637	0.583	0.287	23.316	0.791	0.167
ATO	23.734	0.941	0.058	25.463	0.770	0.269	21.235	0.680	0.250	25.011	0.819	0.148
Intrinsic3d	23.171	0.929	0.079	24.552	0.727	0.253	17.390	0.576	0.361	22.523	0.781	0.209
JointTG	23.354	0.930	0.079	22.262	0.697	0.327	19.756	0.640	0.325	24.113	0.820	0.214
Ours	25.185	0.949	0.042	26.716	0.793	0.251	21.462	0.696	0.247	26.001	0.831	0.135

4.1.2 Evaluation Metric

In order to measure the quality of the optimized texture and geometry compared to the ground truth of the synthetic dataset, we employed several evaluation metrics: peak-signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), and Perceptual loss [34]. Considering that rendered results integrate the camera poses, geometry, and texture of a 3D model, we calculated these metrics between the rendered and source images on the same viewpoint to comprehensively evaluate the overall quality of an optimized model. To illustrate the performance of the optimization algorithms with respect to geometry details, we adopted the average Hausdorff distance to measure the difference between the optimized and ground truth meshes. Since G2LTex [21] and ATO [22] will not optimize mesh, the Hausdorff distance metric was only used to evaluate JointTG [4] and our method.

4.1.3 Experimental Results

We report the quantitative evaluation results in Figure 4. Our method consistently outperformed all other baselines by a large margin on all metrics. This is also clearly reflected in Figure 3, in which we show visual comparisons of the results optimized by state-of-the-art methods and ours. G2LTex [21] is a texture mapping method which does not consider camera-pose and geometry errors, divides the model surface into several regions, and maps the texture of one specific image to each region. This makes it locally sharper than other methods that blend the color information of all views, but easily causes texture misalignment artifacts (black seams) when camera pose noise exists. Although ATO [22] also optimizes texture only, it is more robust to small amounts of camera-pose and geometry noise than G2LTex, because the adversarial loss, which we also adopted in our method, can tolerant small misalignment. Due to the joint optimization of camera pose, geometry, and texture, JointTG [4] performed better than either ATO or G2LTex. However, constrained by its complicated framework and different schemes used to optimize each component, it is less robust to camera and geometry noise than our method, especially when the errors are large. This point can be clearly observed in Figure 5. When the noise level increased in geometry or camera poses, our performance on perceptual loss was quite stable, while the performance of other methods, including JointTG [4], dropped quickly. To summarize, compared to previous methods, our performance gain comes largely from the combined effects of three aspects: a unified framework, adversarial loss, and an adaptive interleaving strategy.

4.2 Evaluation on Real Data

To demonstrate the performance of the proposed method on scanned objects and scenes in real scenarios, we compared our method to state-of-the-art baselines including two texture optimization methods (G2LTex [21] and ATO [22]) and two joint optimization methods (Intrisic3D [10] and JointTG [4]) on real datasets. The datasets contain 61 models from our works, and that of Zhou et al. [6], Maier et al. [10], and Huang et al. [22]. Figure 6 and Figure 7 display the visual comparisons of the rendered images and the geometry details of the optimized 3D models. The corresponding quantitative evaluations using image quality metrics for different datasets are listed in the Table I. Here, the Hausdorff distance metric is not given, because no ground truth mesh was available for the real data. Our method performed consistently better than the other methods.

Qualitative comparisons also demonstrated the superior performance of our method in recovering both fine-scale geometry and high-fidelity texture. As shown in Figure 7, G2LTex [21] based on texture mapping is sensitive to camera-poses and geometry noises, and finds it hard to obtain the correct texture when noise exists. ATO [22] obtains better texture by fusing texture information from multiple views and using adversarial loss. Still, it fails to recover geometric details, since only the texture is optimized. Intrinsic3d [10] and JointTG [4] can refine the geometry details to some extent. However, Intrinsic3d, based on SFS, is sensitive to lighting and has texture-copy artifacts (shown in Figure 10b), especially when the views are sparse. The texture and mesh generated by JointTG are visually harmonious, but it fails to recover both geometric and texture details as accurately as our method. Therefore, its rendered results look blurrier. More importantly, no previous methods could recover imperfect geometry shapes from some rough initial models, because of their limited geometric deforming capabilities (see the first example of Figure 7). Our method can handle geometric deformation problems, due to the IoU loss supported by our framework, as shown in Figure 10a.

To further compare the performance of different methods on real data, we report the average running time on different datasets in Table II. Since our method is implemented based on differentiable rendering, the running time of our method is linearly related to the model complexity and the number of scanned frames. As shown in Table II, the compared methods were more sensitive to the model complexity and the number of frames than ours. Whereas our method took a similar time to other methods in the low-complexity models (Dolls and Chairs), our method was faster in the high-complexity models (Intrinsic3d and Scene3d).

TABLE II: Running time (minutes) of different methods on different real datasets.

Datasets

G2LTex

ATO

Intrinsic3d

JointTG

Ours

Dolls (9 models)

24.16

15.47

18.54

16.26

26.84

Intrinsic3d (5 models)

805.75

31.18

160.26

72.94

49.88

Scene3d (12 models)

368.14

38.64

69.97

63.31

53.66

Chairs (35 models)

59.89

10.06

39.62

15.11

16.76

Average (all models)

176.39

18.21

52.37

29.50

28.22

4.3 Ablation Studies

We investigated the effectiveness of each major component in our method using the synthetic dataset. Firstly, to validate the effect of joint optimization, we conducted an experiment by removing camera pose correction, geometry refinement, or adversarial loss of texture optimization from our joint optimization framework. Without the camera pose correction, the rendered image was misaligned with the RGB-D inputs. Therefore, the reconstruction objectives were wrongly calculated and caused a severe performance drop, as shown in Table IV and Figure 8. Removing geometry refinement was less significant than removing the camera-pose correction, since the influence of geometry errors is more local. Still, the performance was worse than that of the complete method. Compared to the case without adversarial loss, our complete method can produce more realistic texture details. This ablation study showed that the camera pose, geometry, and texture affect each other, and thus a joint optimization is necessary.

To illustrate the effectiveness and superiority of our adaptive interleaving strategy, we compared it to a hybrid strategy in which texture color, mesh vertices, and camera parameters are updated together in each single iteration. The hybrid strategy was implemented based on common loss, since the Laplacian and adversarial loss cannot be added into the hybrid strategy. To be fair, we compared it to an interleaving strategy (common) without the Laplacian and adversarial loss. To further investigate the performance of our adaptive algorithm, we conducted experiments by removing the adaptive strategy and replacing it with fixed internal iteration numbers: 40 steps and 200 steps. The qualitative comparison and convergence curves of common loss are shown in Figure 9 and Figure 11, respectively. The common loss was averaged under a batch view in each step for all strategies, hence there are slight vibration on the curves. The hybrid strategy led to an unstable convergence, and found it hard to reach a lower convergence position, while other interleaving strategies can overcome this problem by optimizing texture, geometry, and poses in an iterative manner. However, the internal iteration number of interleaving strategies is an important hyper-parameter, since the use of a small number of iterations may result in inadequate optimization (Figure 11 (red)), while a large one can be unnecessarily time-consuming (Figure 11 (blue)). Thus, it is important to dynamically select a suitable internal iteration number via the adaptive algorithm.

We report the quantitative comparison among different strategies in Table III. These results show that our adaptive interleaving strategy had a similar performance as the 200-step interleaving strategy, but was much more time-efficient. Compared to the hybrid strategy and the 40-step interleaving strategy, our strategy was more effective and produced better optimized results. To sum up, the proposed adaptive interleaving strategy could better balance the performance and efficiency of the optimization algorithm.

TABLE III: Quantitative comparison between our adaptive interleaving strategy and other strategies on synthetic datasets.

Methods

Hybrid Strategy

Interleaving Strategy

(common)

Interleaving Strategy

(40 steps)

Interleaving Strategy

(200 steps)

Adaptive Interleaving

Strategy

PSNR

\uparrow

28.683

33.187

29.926

33.570

33.330

SSIM

\uparrow

0.942

0.967

0.945

0.967

0.964

Perceptual

\downarrow

0.054

0.023

0.039

0.021

0.023

Time (minutes)

\downarrow

62.589

78.182

23.872

120.393

72.283

TABLE IV: Quantitative comparison between our complete method and the methods without camera pose correction, geometry refinement, and adversarial loss, respectively.

Methods

No pose

correction

No geometry

refinement

No adversarial

loss

Complete

method

PSNR

\uparrow

25.515

29.666

32.763

33.330

SSIM

\uparrow

0.910

0.944

0.964

Perceptual

\downarrow

0.094

0.047

0.023

Hausdorff

\downarrow

(

\times 10^{-2}

)

0.422

0.517

0.165

0.159

4.4 Extension: Optimization with RGB Data

Although our adaptive joint optimization method was originally implemented based on RGB-D data, it is readily applied to RGB data, by removing the depth loss from the framework. Although there is still a certain gap in the geometry details compared with our complete framework, the version without depth loss can produce reasonably clear texture and a plausible geometric model, as shown in Figure 10c. A quantitative comparison of these two pipelines performed on our Doll dataset, as shown in Table V, indicated that our pipeline produced acceptable performance with only RGB data. Our method can be extended to joint optimization with RGB or RGB-D data, while the depth information can contribute to a better geometric details.

TABLE V: Quantitative comparison of optimization results produced by our method w/ and w/o depth information on our Doll dataset.

Methods	PSNR $\uparrow$	SSIM $\uparrow$	Perceptual $\downarrow$
Ours (RGB)	24.360	0.944	0.045
Ours (RGB-D)	25.185	0.949	0.042

4.5 User Study

In order to further evaluate the quality of the optimized results on both synthetic and real data, we performed a study in which we asked users to vote for the most visually realistic rendered image, as shown in Figure 12. A total of 138 participants were asked to vote for the best rendered result on both synthetic and real datasets. During the survey process, we showed the front rendering of each model, and there was no time limit for participants to vote. For some real datasets with low noise levels, it was sometimes difficult for individuals to distinguish between the rendered images produced by the different methods. Nevertheless, images produced by the method proposed in this paper was still preferred over those produced by the other methods.

5 Conclusions

We proposed a novel adaptively joint optimization method for 3D reconstruction based on differentiable rendering. This method integrates the optimization of camera pose, geometry, and texture into a unified framework, and achieved better performance in recovering both fine-scale geometry and high-fidelity texture, compared with state-of-the-art methods. We also conducted ablation studies to demonstrate the effectiveness and superiority of each major component and the adaptive interleaving strategy. In the future, approaches to migrating our framework from iterative optimization to a feed-forward network, to improve performance would be a direction worth exploring.

References

[1] J. Huang, A. Dai, L. J. Guibas, and M. Nießner, “3dlite: towards commodity 3d scanning for content creation.” ACM Trans. Graph., vol. 36, no. 6, pp. 203–1, 2017.
[2] D. Andersen, P. Villano, and V. Popescu, “Ar hmd guidance for controlled hand-held 3d acquisition,” IEEE Transactions on Visualization and Computer Graphics, vol. 25, no. 11, pp. 3073–3082, 2019.
[3] Z.-N. Liu, Y.-P. Cao, Z.-F. Kuang, L. Kobbelt, and S.-M. Hu, “High-quality textured 3d shape reconstruction with cascaded fully convolutional networks,” IEEE Transactions on Visualization and Computer Graphics, vol. 27, no. 1, pp. 83–97, 2019.
[4] Y. Fu, Q. Yan, J. Liao, and C. Xiao, “Joint texture and geometry optimization for rgb-d reconstruction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 5950–5959.
[5] S. Bi, N. K. Kalantari, and R. Ramamoorthi, “Patch-based optimization for image-based texture mapping.” ACM Trans. Graph., vol. 36, no. 4, pp. 106–1, 2017.
[6] Q.-Y. Zhou and V. Koltun, “Color map optimization for 3d reconstruction with consumer depth cameras,” ACM Transactions on Graphics (TOG), vol. 33, no. 4, pp. 1–10, 2014.
[7] W. Li, H. Gong, and R. Yang, “Fast texture mapping adjustment via local/global optimization,” IEEE Transactions on Visualization and Computer Graphics, vol. 25, no. 6, pp. 2296–2303, 2018.
[8] C. Wu, M. Zollhöfer, M. Nießner, M. Stamminger, S. Izadi, and C. Theobalt, “Real-time shading-based refinement for consumer depth cameras,” ACM Transactions on Graphics (ToG), vol. 33, no. 6, pp. 1–10, 2014.
[9] M. Zollhöfer, A. Dai, M. Innmann, C. Wu, M. Stamminger, C. Theobalt, and M. Nießner, “Shading-based refinement on volumetric signed distance functions,” ACM Transactions on Graphics (TOG), vol. 34, no. 4, pp. 1–14, 2015.
[10] R. Maier, K. Kim, D. Cremers, J. Kautz, and M. Nießner, “Intrinsic3d: High-quality 3d reconstruction by joint appearance and geometry optimization with spatially-varying lighting,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 3114–3122.
[11] T.-M. Li, M. Aittala, F. Durand, and J. Lehtinen, “Differentiable monte carlo ray tracing through edge sampling,” ACM Transactions on Graphics (TOG), vol. 37, no. 6, pp. 1–11, 2018.
[12] H. Kato, Y. Ushiku, and T. Harada, “Neural 3d mesh renderer,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3907–3916.
[13] S. Liu, T. Li, W. Chen, and H. Li, “Soft rasterizer: A differentiable renderer for image-based 3d reasoning,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 7708–7717.
[14] J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in European conference on computer vision. Springer, 2016, pp. 694–711.
[15] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, 2014, pp. 2672–2680.
[16] M. Zollhöfer, A. Dai, M. Innman, C. Wu, M. Stamminger, C. Theobalt, and M. Nießner, “Shading-based refinement on volumetric signed distance functions.” ACM, 2015.
[17] G. Choe, J. Park, Y.-W. Tai, and I. S. Kweon, “Refining geometry from depth sensors using ir shading images,” International Journal of Computer Vision, vol. 122, no. 1, pp. 1–16, 2017.
[18] A. Romanoni, M. Ciccone, F. Visin, and M. Matteucci, “Multi-view stereo with single-view semantic mesh refinement,” in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2017, pp. 706–715.
[19] L. Jiang, J. Zhang, B. Deng, H. Li, and L. Liu, “3d face reconstruction with geometry details from a single image,” IEEE Transactions on Image Processing, vol. 27, no. 10, pp. 4756–4770, 2018.
[20] C. Schmitt, S. Donne, G. Riegler, V. Koltun, and A. Geiger, “On joint estimation of pose, geometry and svbrdf from a handheld scanner,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 3493–3503.
[21] Y. Fu, Q. Yan, L. Yang, J. Liao, and C. Xiao, “Texture mapping for 3d reconstruction with rgb-d sensor,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 4645–4653.
[22] J. Huang, J. Thies, A. Dai, A. Kundu, C. Jiang, L. J. Guibas, M. Nießner, and T. Funkhouser, “Adversarial texture optimization from rgb-d scans,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 1559–1568.
[23] C. Wang and X. Guo, “Plane-based optimization of geometry and texture for rgb-d reconstruction of indoor scenes,” in 2018 International Conference on 3D Vision (3DV). IEEE, 2018, pp. 533–541.
[24] M. M. Loper and M. J. Black, “Opendr: An approximate differentiable renderer,” in European Conference on Computer Vision. Springer, 2014, pp. 154–169.
[25] W. Chen, H. Ling, J. Gao, E. Smith, J. Lehtinen, A. Jacobson, and S. Fidler, “Learning to predict 3d objects with an interpolation-based differentiable renderer,” in Advances in Neural Information Processing Systems, 2019, pp. 9609–9619.
[26] X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, Y. Qiao, and C. Change Loy, “Esrgan: Enhanced super-resolution generative adversarial networks,” in Proceedings of the European Conference on Computer Vision (ECCV) Workshops, September 2018.
[27] Z. Wan, B. Zhang, D. Chen, P. Zhang, D. Chen, J. Liao, and F. Wen, “Bringing old photos back to life,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[28] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1125–1134.
[29] Y. Li and G. Baciu, “Sg-gan: Adversarial self-attention gcn for point cloud topological parts generation,” IEEE Transactions on Visualization and Computer Graphics, no. 01, pp. 1–1, 2021.
[30] A. Dai, M. Nießner, M. Zollhöfer, S. Izadi, and C. Theobalt, “Bundlefusion: Real-time globally consistent 3d reconstruction using on-the-fly surface reintegration,” ACM Transactions on Graphics (ToG), vol. 36, no. 4, p. 1, 2017.
[31] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohi, J. Shotton, S. Hodges, and A. Fitzgibbon, “Kinectfusion: Real-time dense surface mapping and tracking,” in 2011 10th IEEE International Symposium on Mixed and Augmented Reality. IEEE, 2011, pp. 127–136.
[32] J. L. Schonberger and J.-M. Frahm, “Structure-from-motion revisited,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 4104–4113.
[33] A. Nealen, T. Igarashi, O. Sorkine, and M. Alexa, “Laplacian mesh optimization,” in Proceedings of the 4th international conference on Computer graphics and interactive techniques in Australasia and Southeast Asia, 2006, pp. 381–389.
[34] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 586–595.