View Transformation Robustness for Multi-View 3D Object Reconstruction With Reconstruction Error-Guided View Selection
Abstract
View transformation robustness (VTR) is critical for deep-learning-based multi-view 3D object reconstruction models, as it indicates a method's stability under inputs with various view transformations. However, existing research seldom focuses on view transformation robustness in multi-view 3D object reconstruction. One direct way to improve a model's VTR is to produce data with more view transformations and add them to model training. Recent progress on large vision models, particularly Stable Diffusion models, has provided great potential for generating 3D models or synthesizing novel view images from only a single input image. To fully utilize the power of Stable Diffusion models without incurring extra inference computation burdens, we propose to generate novel views with Stable Diffusion models for better view transformation robustness. Instead of synthesizing random views, we propose a reconstruction error-guided view selection method, which considers the spatial distribution of the reconstruction errors of the 3D predictions and chooses the views that cover the reconstruction errors as much as possible. The methods are trained and tested on sets with large view transformations to validate the 3D reconstruction models' robustness to view transformations. Extensive experiments demonstrate that the proposed method outperforms state-of-the-art 3D reconstruction methods and other view transformation robustness comparison methods. Code will be available at: https://github.com/zqyq/VTR.
1 Introduction
Voxel-based multi-view 3D object reconstruction outputs 3D voxels of an object by fusing various viewpoints with deep neural networks. It plays an important role in computer vision, robotics, augmented reality, and other domains. Naturally, multi-view 3D object reconstruction should perform stably with inputs from different viewpoints, i.e., view transformation robustness (VTR) is vital. However, few research works have studied view transformation robustness for multi-view 3D object reconstruction.

Instead, the VTR issue has been dealt with in other areas, such as multi-view classification [1] or point cloud recognition [2, 3]. However, these methods are not specially designed to address view transformation robustness for multi-view 3D object reconstruction. Hence, novel methods for boosting view transformation robustness are needed for multi-view 3D object reconstruction.
Recent progress in large vision models (LVMs), especially Stable Diffusion models (SD) [4, 5, 6, 7, 8, 9, 10, 11], has provided extra potential for multi-view 3D object reconstruction, such as training a large model for direct 3D object reconstruction [12, 13, 14], or novel view synthesis. However, utilizing LVMs for direct 3D reconstruction causes heavy computation burdens at the inference stage, and their view transformation robustness [15] is still limited. An effective way of utilizing Stable Diffusion models [15] to boost the view transformation robustness of multi-view 3D reconstruction is to use them as a data augmentation platform that aids model training by randomly synthesizing novel views (see Figure 1). However, even though these randomly generated views are useful for increasing the model's robustness to view transformations, repeated or similar views are also synthesized that provide no benefit to the model's robustness, reducing the approach's effectiveness. Thus, instead of generating random views, we consider the spatial distributions of the errors of 3D reconstruction predictions and propose a novel reconstruction error-guided view selection method that chooses the most effective views to improve the 3D reconstruction models' view transformation robustness (VTR).

Overall, the proposed method consists of three components (see Figure 2): the original 3D object reconstruction model, the view selection module, and the Stable Diffusion model-based view synthesis module. The original 3D object reconstruction model can be any deep learning-based multi-view 3D object reconstruction model (e.g., Xie et al. [16] or Yang et al. [17]). It is first trained on the training set with limited view angles, so it cannot perform well on test sets with large view transformations. The view selection module is guided by the 3D reconstruction error, where the view angles that cover more of the 3D object reconstruction errors are more likely to be selected. Then, the selected view parameters are fed into the view synthesis module, which generates novel views with the Stable Diffusion model from the learned view parameters. The newly synthesized views are then added to the training set to further fine-tune the 3D object reconstruction model. These steps are repeated iteratively to gradually improve the 3D reconstruction model's view transformation robustness.
Besides, since existing 3D object reconstruction datasets (e.g., ShapeNet [18]) are shot from view angles roughly around the objects with small viewpoint ranges (denoted as 'Aligned' data), to study view transformation robustness we generate a new dataset, ShapeNet-VTR, with wider view angle distributions: the 'Hemispherical' and 'Spherical' sets (see Figure 5). The 3D reconstruction model is trained on the training set of 'Aligned' data and tested on the 'Hemispherical' and 'Spherical' sets to validate the method's robustness to view transformations (see details in Sec. Experiments). The paper's contributions are summarized as follows.
• This is the first study on view transformation robustness (VTR) for multi-view 3D object reconstruction; little prior research has focused on the issue of view transformations in this area. Besides, we propose a new dataset specially designed for studying the issue.
• We propose a novel reconstruction error-guided view selection method for choosing the most effective views, which is much more effective than randomly generating views or standard data augmentation methods.
• We use existing Stable Diffusion models to boost the view transformation robustness of current 3D object reconstruction models, without training a new model or deploying the diffusion models at the inference stage, and thus without extra inference computation burdens. Thorough experiments demonstrate the advantages of the proposed method over 3D reconstruction SOTAs and other view transformation robustness comparison methods.
2 Related Work
2.1 Multi-view 3D object reconstruction
Deep learning-based multi-view 3D object reconstruction methods have achieved remarkable performance. Early methods focus on the fusion of multi-view features, such as [19, 20, 21, 22], where the multi-view features are reduced into a fixed size of feature maps. Later methods adopted recurrent neural network (RNN) fusion [23, 24, 25, 26], regarding the input views as a sequence. However, RNN fusion methods are not invariant to input order permutations and cannot handle a large number of views efficiently. Attention-based fusion methods [27, 28, 29, 16] were also proposed to fuse multiple views according to attention maps estimated from an attention subnet. Recent methods utilize transformer networks for more complicated fusion between the views, such as [30, 31, 32, 33, 34, 17, 35]. LRGT [17] proposed long-range grouping attention for grouping tokens from all views with separate attention operations. However, none of these methods pays attention to the robustness of multi-view 3D object reconstruction to input view transformations.
2.2 3D view transformation/rotation robustness
The issue of view transformation and rotation robustness has been studied in various areas, such as point cloud classification and multi-view image classification. Instead of using rotation-invariant descriptors [36, 37] as inputs or designing rotation-equivariant networks [38, 39], ART-Point [3] improved rotation robustness by training the point classifier on inputs with adversarial rotations. Considering that the camera viewpoints are often fixed for all shapes in multi-view 3D shape classification, Hamdi et al. [40] proposed the Multi-View Transformation Network (MVTN), which uses differentiable rendering to determine optimal viewpoints for 3D shape recognition. Recent methods have started to focus on novel view synthesis for boosting model viewpoint robustness. ViewFool [41] finds adversarial viewpoints that mislead 3D recognition models with an entropic regularizer. VIAT [42] improves the viewpoint robustness of multi-view image classification with inner diverse adversarial viewpoints and outer viewpoint-invariant classifier training. Overall, synthesizing novel viewpoints and selecting the most effective views are the keys for these methods to improve the models' view transformation or rotation robustness. Note that none of these methods has focused on the view transformation robustness issue for multi-view 3D object reconstruction.
3 Method
The main idea of the method is a novel reconstruction error-guided view selection scheme that selects the viewpoints which influence the reconstruction results most; these viewpoint parameters are then fed into an existing Stable Diffusion (SD) model to generate the selected view images, which are used to fine-tune the original 3D reconstruction model for better view transformation robustness. The proposed method consists of the multi-view 3D reconstruction model, the view selection module, and the view synthesis module, whose details are as follows.
3.1 Multi-view 3D reconstruction model
A multi-view 3D object reconstruction model usually consists of the following parts [23]: first, an encoder $E$ that extracts feature representations for reconstruction from the image set $\{I_1, \dots, I_N\}$ of an object; second, a merger $M$ that fuses the multiple 2D image features or 3D volume features from different views; third, a decoder $D$ that predicts the corresponding voxel-based 3D shape from the fused feature map. The overall prediction process is formulated as $\hat{V} = D\big(M(E(I_1), \dots, E(I_N))\big)$. We adopt two typical kinds of multi-view 3D object reconstruction models: CNN-based Pix2Vox++ [16] and Transformer-based LRGT [17]. Denoting the 3D voxel ground truth as $V$, the reconstruction loss is $\mathcal{L}(\hat{V}, V)$ (details in Supp.). Following [16] and [17], we use a binary cross entropy loss for Pix2Vox++ and a Dice loss [43] for LRGT. The model is first trained on the training set of the 'Aligned' data.
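For concreteness, the generic encoder–merger–decoder pipeline can be sketched in PyTorch as below; the module interfaces and tensor shapes are illustrative assumptions, not the exact Pix2Vox++ or LRGT implementations.

```python
import torch
import torch.nn as nn

class MultiViewReconstructor(nn.Module):
    """Generic encoder-merger-decoder pipeline for voxel-based multi-view reconstruction.
    The concrete sub-networks differ between Pix2Vox++ (CNN) and LRGT (Transformer)."""
    def __init__(self, encoder: nn.Module, merger: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder  # per-view image -> feature map
        self.merger = merger    # fuses per-view features into one representation
        self.decoder = decoder  # fused features -> occupancy voxel grid

    def forward(self, views: torch.Tensor) -> torch.Tensor:
        # views: (B, N, 3, H, W) -- batch of N input views per object
        b, n = views.shape[:2]
        feats = self.encoder(views.flatten(0, 1))      # (B*N, C, h, w)
        feats = feats.view(b, n, *feats.shape[1:])     # (B, N, C, h, w)
        fused = self.merger(feats)                     # fused multi-view representation
        return self.decoder(fused)                     # (B, D, D, D) occupancy prediction

# Training compares the prediction with the ground-truth voxels, e.g. with a
# binary cross entropy loss (Pix2Vox++) or a Dice loss (LRGT).
```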

3.2 View selection module
The view selection module first projects the 3D reconstruction error onto a series of viewpoints, and then selects the viewpoints containing the most reconstruction errors to be fed into the view synthesis module (see Figure 3). Specifically, we denote $v = (\alpha, \beta, \gamma)$ as the viewpoint parameters, where $(\alpha, \beta, \gamma)$ are the camera rotations along the z-y-x axes using the Tait-Bryan angles (yaw, pitch, roll), and the roll $\gamma$ is set to zero. Our target is to estimate $v$ from the 3D reconstruction error $E$, the discrepancy between the predicted voxels $\hat{V}$ and the ground truth $V$, which indicates the areas with bad reconstruction results of an object. Naturally, if we can provide more information (views) about the areas with large reconstruction errors and try to rectify these errors, the performance of the reconstruction model can be boosted. In particular, if a camera view covers the reconstruction error regions more completely, it is expected to aid the robustness of the 3D reconstruction model.
To reduce computation, we first divide the yaw and pitch into discrete values with a degree interval $\delta$ to obtain a set of dense discrete 3D viewpoints. The yaw $\alpha$ is divided into $N_{\alpha}$ intervals and the pitch $\beta$ into $N_{\beta}$ intervals, where $N_{\alpha}$ and $N_{\beta}$ refer to the numbers of intervals for the yaw and the pitch, respectively. Then we traverse each interval and use its median value as the viewpoint position for that interval, obtaining $\{\alpha_i\}_{i=1}^{N_{\alpha}}$ and $\{\beta_j\}_{j=1}^{N_{\beta}}$. As $\delta$ gets smaller, the traversed viewpoints become more continuous. By rotating the error voxel grid by the corresponding angles $(\alpha_i, \beta_j)$ and setting the viewing direction to the negative direction along the X-axis, we obtain $E_{i,j} = R(E, \alpha_i, \beta_j)$, where $R$ is the rotation operation.
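A small NumPy/SciPy sketch of this discretization and rotation step is given below; the angle ranges, grid handling, and helper names are illustrative assumptions rather than the exact implementation.

```python
import numpy as np
from scipy.ndimage import rotate

def discrete_viewpoints(delta=30.0):
    """Midpoints of the yaw/pitch bins for a degree interval `delta`.
    The angle ranges (full yaw circle, +/-90 deg pitch) are assumed for illustration."""
    yaws = np.arange(0.0, 360.0, delta) + delta / 2.0
    pitches = np.arange(-90.0, 90.0, delta) + delta / 2.0
    return yaws, pitches

def rotate_error_voxels(error_voxels, yaw, pitch):
    """Rotate a binary reconstruction-error voxel grid so that the candidate
    viewpoint looks along the negative X-axis (nearest-neighbour keeps it binary)."""
    v = rotate(error_voxels.astype(np.float32), yaw, axes=(0, 1), reshape=False, order=0)
    v = rotate(v, pitch, axes=(0, 2), reshape=False, order=0)
    return v > 0.5
```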
We then perform an orthographic projection of $E_{i,j}$ onto each discrete viewpoint to obtain the reconstruction error projection maps $P_{i,j}$, conducted as follows:

$$P_{i,j}(y, z) = E_{i,j}(x^{*}, y, z), \quad x^{*} = \min\{\, x \mid E_{i,j}(x, y, z) = 1 \,\}, \qquad (1)$$

where $(x, y, z)$ represents the integer 3D coordinate, $x$ ranges over the voxel grid resolution, and $P_{i,j}(y, z) = 0$ if no occupied error voxel lies along the line of sight. Intuitively, we take the first occupancy value along each line of sight, i.e., the surface of the error voxel grid.
After obtaining the error projection map $P_{i,j}$ under each viewpoint, we compute the sum of its pixels, $S_{i,j} = \sum_{k=1}^{M} P_{i,j}^{(k)}$, where $M$ is the number of pixels of a projection map and $P_{i,j}^{(k)}$ is the $k$-th pixel of the projection map $P_{i,j}$. Then we select the views with the maximum $S_{i,j}$, since a larger $S_{i,j}$ means the viewpoint covers more reconstruction error regions. Rather than directly using the selected viewpoints for reconstruction, which would decrease the diversity of input viewpoints, we model each selected viewpoint as a Gaussian distribution and sample views from it, $v \sim \mathcal{N}(v_{\mathrm{sel}}, \sigma^{2})$, where $\sigma$ refers to the standard deviation, set to a fixed value in our experiments. Finally, we obtain the corresponding viewpoints from the 3D reconstruction error of each object for $N$-view reconstruction.
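Putting the projection, scoring, and Gaussian sampling together, a minimal sketch is shown below; it reuses the `rotate_error_voxels` helper from the previous sketch, and `sigma` and the function names are assumed for illustration.

```python
import numpy as np

def project_error(rotated_error):
    """Orthographic projection along the negative X-axis: a (y, z) pixel is 1 if any
    error voxel is hit along its line of sight, i.e. the error surface of Eq. (1)."""
    return rotated_error.any(axis=0).astype(np.float32)

def select_viewpoints(error_voxels, yaws, pitches, n_views=3, sigma=5.0, rng=None):
    """Score each candidate viewpoint by its summed error projection and sample
    `n_views` viewpoints around the highest-scoring candidates (sigma is assumed)."""
    rng = rng or np.random.default_rng()
    scored = []
    for yaw in yaws:
        for pitch in pitches:
            proj = project_error(rotate_error_voxels(error_voxels, yaw, pitch))
            scored.append((proj.sum(), yaw, pitch))
    scored.sort(key=lambda s: s[0], reverse=True)
    selected = []
    for _, yaw, pitch in scored[:n_views]:
        # Sample around the selected viewpoint rather than using it directly,
        # to keep some diversity in the synthesized training views.
        selected.append((rng.normal(yaw, sigma), rng.normal(pitch, sigma)))
    return selected
```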
Viewpoint pool. During experiments, we found that if we construct a single viewpoint set for each object of each category, the learned viewpoint distribution degenerates as training iterations increase; that is, the diversity of the selected viewpoints decreases in the later training period, resulting in overfitting during iterative training. A simple solution is to construct multiple viewpoint sets over multiple iterations for each object of each category, but this would be very time-consuming. To solve this problem, following the observation in [3] that the selected viewpoint distributions of objects within the same category are highly similar and have strong transferability, we construct a viewpoint pool by saving the viewpoints selected for each object by category:
$$\mathcal{P} = \{\, v_{c}^{o} \mid o = 1, \dots, O_{c};\; c = 1, \dots, C \,\}, \qquad (2)$$

where $v_{c}^{o}$ refers to the viewpoints selected on object $o$ of category $c$, $O_{c}$ is the number of objects in category $c$, and $C$ is the number of categories. We save the viewpoints corresponding to all objects in a category and traverse all categories to construct the viewpoint pool. In the process of iterative optimization, it is then only necessary to sample from the viewpoint pool according to the category and convert the samples into the corresponding views. Due to the transferability of the viewpoint distribution within the same object category, a high reconstruction loss can still be induced.
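A possible implementation of the viewpoint pool is a per-category dictionary, sketched below; the data structure and sampling behavior are assumptions consistent with Eq. (2), not the exact code.

```python
import random
from collections import defaultdict

class ViewpointPool:
    """Per-category store of selected viewpoints (Eq. (2)); later iterations sample
    from it instead of re-running view selection for every object."""
    def __init__(self):
        self._pool = defaultdict(list)        # category -> list of (yaw, pitch) tuples

    def add(self, category, viewpoints):
        self._pool[category].extend(viewpoints)

    def sample(self, category, n_views=3):
        candidates = self._pool[category]
        if len(candidates) <= n_views:        # pool not yet populated enough
            return list(candidates)
        return random.sample(candidates, n_views)
```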

3.3 View synthesis from Stable Diffusion model
With the camera viewpoints selected by the view selection module, we synthesize novel images of an object with the Stable Diffusion model Zero-1-to-3 [15], a viewpoint-conditioned diffusion approach that can generate an image under a novel viewpoint given an image of the object and the viewpoint transformation matrix. Given a dataset of paired images and their relative camera extrinsics $(x, x_{(R,T)}, R, T)$, the model is trained by solving the following objective:
$$\min_{\theta} \; \mathbb{E}_{z \sim \mathcal{E}(x),\, t,\, \epsilon \sim \mathcal{N}(0, 1)} \left\| \epsilon - \epsilon_{\theta}\big(z_{t}, t, c(x, R, T)\big) \right\|_{2}^{2}, \qquad (3)$$

where $\mathcal{E}$ denotes the encoder, $\epsilon_{\theta}$ is a denoising U-Net, $t$ is the diffusion time step, and $z$ is the latent representation of the input image encoded by $\mathcal{E}$. After the model is trained, we can generate a novel image from any viewpoint by performing iterative denoising from a Gaussian noise image conditioned on $c(x, R, T)$.
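The objective in Eq. (3) can be written schematically as follows; `encoder`, `denoiser`, and `add_noise` are placeholder callables for the latent encoder, the denoising U-Net, and the forward noising step, not the actual Zero-1-to-3 API.

```python
import torch
import torch.nn.functional as F

def view_conditioned_diffusion_loss(encoder, denoiser, add_noise, x, rel_pose):
    """Schematic version of the training objective in Eq. (3)."""
    z = encoder(x)                                    # latent representation, z ~ E(x)
    t = torch.randint(0, 1000, (z.shape[0],), device=z.device)  # diffusion time step
    eps = torch.randn_like(z)                         # Gaussian noise
    z_t = add_noise(z, eps, t)                        # noised latent at step t
    eps_pred = denoiser(z_t, t, (x, rel_pose))        # conditioned on c(x, R, T)
    return F.mse_loss(eps_pred, eps)                  # || eps - eps_theta(...) ||_2^2
```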
Using Zero-1-to-3 for direct 3D reconstruction. Directly using Zero-1-to-3 for 3D reconstruction yields poor performance, as shown in Figure 4. It suffers drastically from view transformations, and its 3D reconstruction inference process is very time-consuming (see running times in the Supplement). Thus, we utilize Zero-1-to-3 for novel view synthesis and rely on the new views to boost the view transformation robustness of existing 3D reconstruction models, instead of using it for direct 3D reconstruction.

Table 1: View transformation robustness results of CNN-based multi-view 3D reconstruction methods on the three ShapeNet-VTR test sets (higher is better). All VTR methods use Pix2Vox++ as the 3D reconstruction backbone.

| Type | Method | Aligned mIoU | Aligned F_score | Hemispherical mIoU | Hemispherical F_score | Spherical mIoU | Spherical F_score |
|---|---|---|---|---|---|---|---|
| None-VTR | Pix2Vox++ [16] | 64.63% | 42.02% | 47.02% | 26.49% | 36.71% | 18.44% |
| None-VTR | GARNet [29] | 65.14% | 46.50% | 49.69% | 30.03% | 37.46% | 20.41% |
| VTR | Nature | 65.54% | 46.69% | 49.00% | 30.20% | 37.13% | 21.03% |
| VTR | Random | 64.89% | 46.67% | 50.89% | 31.77% | 41.62% | 24.45% |
| VTR | VIAT [42] | 65.66% | 47.30% | 51.50% | 31.88% | 40.36% | 23.47% |
| VTR | MVTN [44] | 65.67% | 47.03% | 49.58% | 30.20% | 37.30% | 21.36% |
| VTR | Ours (Pix2Vox++) | 65.52% | 47.22% | 53.53% | 33.53% | 46.12% | 27.01% |
Table 2: View transformation robustness results of Transformer-based multi-view 3D reconstruction methods on the three ShapeNet-VTR test sets (higher is better). All VTR methods use LRGT as the 3D reconstruction backbone.

| Type | Method | Aligned mIoU | Aligned F_score | Hemispherical mIoU | Hemispherical F_score | Spherical mIoU | Spherical F_score |
|---|---|---|---|---|---|---|---|
| None-VTR | UMIFormer [35] | 69.87% | 50.56% | 53.04% | 33.79% | 41.19% | 23.51% |
| None-VTR | LRGT [17] | 70.01% | 51.19% | 53.57% | 34.71% | 41.88% | 24.46% |
| VTR | Nature (LRGT) | 70.90% | 52.92% | 56.06% | 35.54% | 43.99% | 25.16% |
| VTR | Random (LRGT) | 70.92% | 52.68% | 56.02% | 35.28% | 44.18% | 25.32% |
| VTR | VIAT [42] (LRGT) | 70.86% | 52.33% | 56.07% | 35.29% | 44.70% | 25.27% |
| VTR | MVTN [44] (LRGT) | 71.09% | 52.97% | 55.76% | 34.92% | 43.58% | 24.07% |
| VTR | Ours (LRGT) | 70.95% | 52.78% | 57.11% | 35.89% | 44.97% | 26.02% |
3.4 Training method
The training adopts an iterative optimization scheme consisting of an inner maximization and an outer minimization. Specifically, in the first iteration, we evaluate the pre-trained 3D reconstruction model to select the viewpoints for the inner maximization, and then re-train the model on new view images generated by the view synthesis module from the previously selected viewpoints, obtaining a more robust model in the outer minimization. This process is repeated until the model converges to its most robust state. Our goal is formulated as $\min_{\theta} \mathbb{E}_{o} \big[ \max_{v} \mathcal{L}\big(F_{\theta}(I_{o}(v)), V_{o}\big) \big]$, where $\theta$ denotes the parameters of the reconstruction model $F$, $\mathcal{L}$ is a reconstruction loss function, $I_{o}(v)$ are the rendered images of the $o$-th object given the viewpoints $v$, and $V_{o}$ is the ground truth of the corresponding object. We next detail how we address the inefficiency of the iterative optimization scheme with a random update strategy.
Random update strategy. A random update strategy is adopted to reduce the training time. Instead of generating new views for all objects in each epoch, we randomly select 5% of the objects and generate new views for them with Zero-1-to-3, while the remaining 95% of the objects are still trained with views from the 'Aligned' training set. Meanwhile, to maintain viewpoint diversity and improve training effectiveness, the newly generated view images are added to the original view image set after each fine-tuning epoch, so that each epoch of iterative training can fully exploit the prior information left by previous epochs and help the 3D reconstruction model learn more efficiently. Note that only the multi-view 3D reconstruction model is needed at inference.
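The iterative scheme with the random update strategy can be summarized by the following sketch; the callables, the dataset interface, and the epoch count are placeholders rather than the exact training code.

```python
import random

def iterative_vtr_training(model, dataset, reconstruction_error, select_views,
                           synthesize_views, finetune_one_epoch, pool,
                           n_epochs=20, update_ratio=0.05):
    """Outer minimization (fine-tuning) wrapped around error-guided view selection.
    All callables and the dataset interface are hypothetical placeholders."""
    for _ in range(n_epochs):
        # Random update strategy: refresh views for only ~5% of the objects per epoch.
        n_update = max(1, int(update_ratio * len(dataset.objects)))
        for obj in random.sample(dataset.objects, n_update):
            err = reconstruction_error(model, obj)               # 3D error of current prediction
            pool.add(obj.category, select_views(err))            # error-guided viewpoints
            new_views = synthesize_views(obj.reference_image,    # Zero-1-to-3 synthesis
                                         pool.sample(obj.category))
            dataset.add_views(obj, new_views)                    # views accumulate across epochs
        finetune_one_epoch(model, dataset)                       # outer minimization step
    return model
```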
4 Experiments
4.1 Experiment settings
Dataset generation. Since no existing dataset is suitable for evaluating view transformation robustness, we generate a new dataset, ShapeNet-VTR, based on ShapeNet [18]; it covers larger viewpoint ranges than the original version by rendering new views from the 3D CAD models in ShapeNet. ShapeNet-VTR consists of 13 categories and 39,239 objects in total. The objects in each category are split into training, validation, and testing sets with a 7:1:2 ratio. ShapeNet-VTR consists of three sets, 'Aligned', 'Hemispherical', and 'Spherical', which share the same object split but differ in the viewpoint range, as shown in Figure 5. The 'Aligned' set is rendered at a fixed pitch and evenly spaced yaw angles; in the 'Hemispherical' set, images are randomly rendered from viewpoints on the upper hemisphere; in the 'Spherical' set, images are randomly rendered from viewpoints over the whole sphere. In each set, each object has 24 views, but drawn from the different camera angle ranges described above. We train the 3D reconstruction models on the training set of 'Aligned' first and use the models' performance on the 'Hemispherical' and 'Spherical' sets to evaluate the robustness of 3D object reconstruction models to view transformations.
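For illustration, per-object viewpoint sampling for the three regimes might look as follows; the concrete angle ranges and the fixed pitch of the 'Aligned' set are assumptions, since the exact dataset values may differ.

```python
import numpy as np

def sample_viewpoints(setting, n_views=24, rng=None):
    """Sample per-object camera viewpoints (yaw, pitch) for one ShapeNet-VTR regime.
    The concrete angle ranges below are assumed for illustration."""
    rng = rng or np.random.default_rng()
    if setting == "Hemispherical":
        yaw = rng.uniform(0.0, 360.0, n_views)
        pitch = rng.uniform(0.0, 90.0, n_views)        # upper hemisphere only
    elif setting == "Spherical":
        yaw = rng.uniform(0.0, 360.0, n_views)
        pitch = rng.uniform(-90.0, 90.0, n_views)      # whole sphere
    else:                                              # 'Aligned': fixed pitch, even yaw steps
        yaw = np.linspace(0.0, 360.0, n_views, endpoint=False)
        pitch = np.full(n_views, 30.0)                 # assumed fixed pitch value
    return np.stack([yaw, pitch], axis=1)
```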
Implementation details. We utilize two 3D reconstruction models: CNN-based Pix2Vox++ [16] and Transformer-based LRGT [17]. The number of input views is 3, and the input image resolution and the output voxel grid size are kept fixed for both backbones. We use a threshold of 0.4 to obtain the occupancy voxel grid and set the degree interval to $\delta = 30^{\circ}$ in the experiments.
Comparison methods. We compare with SOTA 3D object reconstruction methods Pix2Vox++ [16], GARNet [29], UMIFormer [35], and LRGT [17] (without view transformation robustness approaches, denoted as None-VTR) and four view transformation robustness comparison methods (denoted as VTR). Note that there are no existing VTR methods for multi-view 3D object reconstruction, and we propose reasonable baselines or adopt methods from other areas for comparison:
• Nature: data augmentation with the most common viewpoint renderings of training objects in their natural states (e.g., cars are usually viewed from the side).
• Random: standard data augmentation with random viewpoints, with the yaw and pitch sampled over the full view range.
• VIAT [42]: by regarding viewpoint transformation as an attack, VIAT solves the inner maximization problem to parameterize a Gaussian Mixture distribution of adversarial viewpoints with trainable parameters. In our experiment setting, VIAT keeps the roll fixed and learns the parameters of the Gaussian Mixture distribution of yaw and pitch to obtain an adversarial viewpoint. We then select the other two adversarial viewpoints at fixed azimuth intervals, starting from the first adversarial viewpoint.
• MVTN [44]: a viewpoint selection algorithm that uses differentiable rendering to determine optimal viewpoints for 3D shape recognition through gradient descent. We change its downstream task to multi-view 3D reconstruction, change the 3D representation from point clouds to voxels, and set the number of selected viewpoints to 3 to match the number of input views.
Besides, we also evaluate the method's single-view reconstruction robustness, comparing with state-of-the-art single-view reconstruction methods [45] that, like ours, do not incorporate prior knowledge of point clouds or depth maps to assist reconstruction.
Metrics. The evaluation metrics include the mean Intersection-over-Union (mIoU) and the F_score (see details in Supp.); for both, higher values indicate better performance.
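As a reference, voxel IoU and a simple voxel-level F-score can be computed as below; the F-score variant here is a simplified stand-in based on occupied-voxel precision/recall and may differ from the exact definition in the Supplement.

```python
import numpy as np

def voxel_iou(pred_prob, gt, threshold=0.4):
    """mIoU building block: IoU between the thresholded prediction and the GT voxels."""
    pred = pred_prob >= threshold
    gt = gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union > 0 else 1.0

def voxel_fscore(pred_prob, gt, threshold=0.4):
    """Voxel-level F-score from precision/recall over occupied voxels (simplified)."""
    pred = pred_prob >= threshold
    gt = gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    precision = tp / pred.sum() if pred.sum() > 0 else 0.0
    recall = tp / gt.sum() if gt.sum() > 0 else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom > 0 else 0.0
```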
4.2 Experiment results
4.2.1 Multi-view 3D reconstruction performance
Table 1 shows the view transformation robustness results of CNN-based multi-view 3D object reconstruction methods, where our method utilizes Pix2Vox++ [16] as the 3D reconstruction model. Compared with SOTA 3D reconstruction methods without view transformation robustness techniques (None-VTR), including Pix2Vox++ [16] and GARNet [29], the proposed method outperforms them by a large margin on 'Hemispherical' and 'Spherical', which involve much larger view transformations than 'Aligned'. The proposed method also outperforms all VTR comparisons (all of which use Pix2Vox++ as the 3D reconstruction model, as ours does) on both 'Hemispherical' and 'Spherical', indicating that the proposed method is more robust to view transformations than all comparisons. 'Nature' and 'Random' try to improve model robustness in a data augmentation fashion, but they also introduce redundant views into model training, limiting their efficiency and effectiveness. VIAT [42] and MVTN [44] are not specially designed for multi-view 3D reconstruction tasks: VIAT ignores the spatial relation between the reconstruction error and the target views, and MVTN relies purely on model learning for target view estimation without an explicit rationale for the view selection step, and thus their performance is reduced. In contrast, the proposed method considers the spatial connections between the 3D reconstruction error and the selected views, where the selected views cover the reconstruction errors as much as possible so that the multi-view 3D reconstruction model can be fine-tuned on these views in the next cycle.
Table 3: Ablation study on the degree interval δ (mIoU).

| δ (°) | Aligned | Hemispherical | Spherical |
|---|---|---|---|
| 15 | 65.37% | 51.44% | 43.27% |
| 30 | 65.52% | 53.53% | 46.12% |
| 60 | 65.40% | 52.72% | 45.03% |
| 90 | 65.38% | 52.32% | 44.26% |
Table 4: Ablation study on the viewpoint pool (mIoU).

| Viewpoint pool | Aligned | Hemispherical | Spherical |
|---|---|---|---|
| with | 65.52% | 53.53% | 46.12% |
| without | 65.28% | 52.07% | 42.39% |


Table 2 shows the VTR of Transformer-based multi-view 3D object reconstruction methods, where the VTR methods utilize LRGT [17] as the 3D reconstruction model. Similarly, the proposed method outperforms all Transformer SOTAs [35, 17] (None-VTR) and all view transformation robustness comparisons (VTR) on 'Hemispherical' and 'Spherical', which proves the advantage of the proposed method and its effectiveness in boosting the robustness of Transformer-based 3D reconstruction methods. The performance gain of our method over the comparisons is larger for CNN-based 3D reconstruction models, due to their weaker view fusion ability compared to Transformer-based 3D reconstruction models. Overall, from Tables 1 and 2, we conclude that the proposed method can boost the robustness of both CNN-based and Transformer-based 3D reconstruction models, indicating our method's generalization ability.
Visualization results. We show visualization examples of the proposed method and the comparisons in Figure 6 (CNN-based) and Figure 7 (Transformer-based). In each figure, for each object example, the left three columns show the three input views, the middle seven columns show the voxel predictions of the methods, and the last column is the ground-truth voxel grid. The first and second rows of each object example show results on 'Hemispherical' and 'Spherical', respectively. In Figures 6 and 7, our method recovers the rough shape of the plane and the chair, while the comparisons miss many details, such as the wings of the plane in Figure 6 and the chair legs in Figure 7. We also observe that each method's results for the same object are worse on 'Spherical' than on 'Hemispherical', because 'Spherical' contains larger view transformations and is more difficult. Transformer-based methods achieve better visualization results than CNN-based methods because of their stronger inter-view fusion ability. Regardless of the 3D reconstruction model or the testing set, the proposed method always achieves the best results, indicating its stronger robustness to view transformations.
Table 5: Single-view 3D reconstruction results on the three ShapeNet-VTR test sets.

| Method | Aligned mIoU | Aligned F_score | Hemispherical mIoU | Hemispherical F_score | Spherical mIoU | Spherical F_score |
|---|---|---|---|---|---|---|
| OCCNet [45] | 52.89% | 33.46% | 38.31% | 21.49% | 26.62% | 13.74% |
| Pix2Vox++ [16] | 56.30% | 38.93% | 39.12% | 23.90% | 30.06% | 17.39% |
| LRGT [17] | 62.62% | 42.63% | 45.17% | 26.63% | 34.70% | 19.52% |
| Ours (Pix2Vox++) | 59.47% | 40.32% | 43.81% | 25.84% | 36.54% | 20.71% |
| Ours (LRGT) | 63.21% | 43.35% | 47.14% | 28.12% | 36.13% | 20.20% |
Table 6: Single-view reconstruction results on the real-world Pix3D dataset (Chair category).

| Method | mIoU | F_score |
|---|---|---|
| Pix2Vox++ [16] | 16.24% | 7.76% |
| Ours (Pix2Vox++) | 16.45% | 7.77% |
| LRGT [17] | 17.08% | 7.79% |
| Ours (LRGT) | 17.47% | 8.59% |
4.2.2 Ablation study
Ablation study on the degree interval δ. We conduct an ablation study on the degree interval δ in the view selection module. Table 3 presents the reconstruction results of the model after iterative training with different δ, where the best results are achieved with δ = 30°. This shows that a suitable degree interval helps the model achieve better results: if the interval is too large, the viewpoint selection tends to be random, while if it is too small, it affects the viewpoint diversity and hinders reconstruction robustness. Thus, we set δ = 30° in our method.
Ablation study on the viewpoint pool. We conduct the ablation study on the effectiveness of the viewpoint pool in Table 4. It shows that the method with the viewpoint pool can achieve better performance than without the viewpoint pool on all testing sets, especially on the ‘Spherical’ set with much larger view transformations. This indicates the effectiveness of the viewpoint pool in boosting the multi-view 3D reconstruction model’s robustness to view transformations.
Single-view 3D reconstruction performance. We also validate the proposed method's robustness in single-view 3D reconstruction and compare it with single-view SOTA methods [45, 16, 17] in Table 5. Our method achieves better results than all comparisons on all sets, whether using Pix2Vox++ [16] or LRGT [17] as the 3D reconstruction model, which proves the effectiveness of the proposed method for single-view 3D reconstruction.
Evaluation on the real-world dataset Pix3D. We also test the proposed method on Pix3D [46] to verify its single-view reconstruction performance on real-world data with more complicated object viewpoint distributions. Following [16], we use the data from the Chair category in ShapeNet to generate a training set and render images with random backgrounds from the SUN dataset, and each object has 60 synthesized images. Table 6 shows that our method's performance is better than the SOTA methods, suggesting that our method helps the 3D reconstruction model generalize better to real-world datasets by improving the model's view transformation robustness.
5 Conclusion
In this paper, we propose a novel reconstruction error-guided view selection method, together with view synthesis via a Stable Diffusion model, to improve the view transformation robustness of existing multi-view 3D object reconstruction methods. Instead of randomly synthesizing new views from Stable Diffusion models and adding them to model training as a data augmentation approach, we consider the spatial distributions of the 3D reconstruction errors and use them to guide the view selection process, choosing the most effective views that cover the errors as much as possible. Our proposed method shows the best view transformation robustness compared to the latest multi-view 3D reconstruction SOTAs and various view transformation robustness comparison methods. We provide a new perspective on incorporating large vision models into existing, relatively 'small' 3D object reconstruction models for robustness gains without increasing the model deployment cost.
6 Acknowledgements
This work was supported in part by NSFC (62202312, U21B2023), Guangdong Basic and Applied Basic Research Foundation (2023B1515120026), Shenzhen Science and Technology Program (KQTD 20210811090044003, RCJC20200714114435012), and Scientific Development Funds from Shenzhen University.
References
- Wei et al. [2020] Wei, X., Yu, R., Sun, J.: View-gcn: View-based graph convolutional network for 3d shape analysis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1850–1859 (2020)
- Qi et al. [2017] Qi, C.R., Su, H., Mo, K., Guibas, L.J.: Pointnet: Deep learning on point sets for 3d classification and segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 652–660 (2017)
- Wang et al. [2022] Wang, R., Yang, Y., Tao, D.: Art-point: Improving rotation robustness of point cloud classifiers via adversarial rotation. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 14351–14360 (2022)
- Rombach et al. [2022] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695 (2022)
- Hu et al. [2024] Hu, J., Hui, K.-H., Liu, Z., Li, R., Fu, C.-W.: Neural wavelet-domain diffusion for 3d shape generation, inversion, and manipulation. ACM Transactions on Graphics 43(2), 1–18 (2024)
- Shi et al. [2023] Shi, R., Chen, H., Zhang, Z., Liu, M., Xu, C., Wei, X., Chen, L., Zeng, C., Su, H.: Zero123++: a single image to consistent multi-view diffusion base model. arXiv preprint arXiv:2310.15110 (2023)
- Qi et al. [2024] Qi, Z., Yu, M., Dong, R., Ma, K.: Vpp: Efficient conditional 3d generation via voxel-point progressive representation. Advances in Neural Information Processing Systems 36 (2024)
- Yoo et al. [2024] Yoo, P., Guo, J., Matsuo, Y., Gu, S.S.: Dreamsparse: Escaping from plato’s cave with 2d diffusion model given sparse views. Advances in Neural Information Processing Systems 36 (2024)
- Zou et al. [2024] Zou, Z., Cheng, W., Cao, Y.-P., Huang, S.-S., Shan, Y., Zhang, S.-H.: Sparse3d: Distilling multiview-consistent diffusion for object reconstruction from sparse views. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, pp. 7900–7908 (2024)
- Kwak et al. [2024] Kwak, J.-g., Dong, E., Jin, Y., Ko, H., Mahajan, S., Yi, K.M.: Vivid-1-to-3: Novel view synthesis with video diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6775–6785 (2024)
- Bauer et al. [2024] Bauer, D., Hönig, P., Weibel, J.-B., García-Rodríguez, J., Vincze, M., et al.: Challenges for monocular 6d object pose estimation in robotics. IEEE Transactions on Robotics (2024)
- Liu et al. [2024] Liu, M., Xu, C., Jin, H., Chen, L., Varma T, M., Xu, Z., Su, H.: One-2-3-45: Any single image to 3d mesh in 45 seconds without per-shape optimization. Advances in Neural Information Processing Systems 36 (2024)
- Liu et al. [2023a] Liu, Y., Lin, C., Zeng, Z., Long, X., Liu, L., Komura, T., Wang, W.: Syncdreamer: Generating multiview-consistent images from a single-view image. arXiv preprint arXiv:2309.03453 (2023)
- Liu et al. [2023b] Liu, M., Shi, R., Chen, L., Zhang, Z., Xu, C., Wei, X., Chen, H., Zeng, C., Gu, J., Su, H.: One-2-3-45++: Fast single image to 3d objects with consistent multi-view generation and 3d diffusion. arXiv preprint arXiv:2311.07885 (2023)
- Liu et al. [2023c] Liu, R., Wu, R., Hoorick, B.V., Tokmakov, P., Zakharov, S., Vondrick, C.: Zero-1-to-3: Zero-shot one image to 3d object. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 9264–9275 (2023)
- Xie et al. [2020] Xie, H., Yao, H., Zhang, S., Zhou, S., Sun, W.: Pix2vox++: Multi-scale context-aware 3d object reconstruction from single and multiple images. International Journal of Computer Vision 128(12), 2919–2935 (2020)
- Yang et al. [2023] Yang, L., Zhu, Z., Lin, X., Nong, J., Liang, Y.: Long-range grouping transformer for multi-view 3d reconstruction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 18257–18267 (2023)
- Chang et al. [2015] Chang, A.X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., et al.: Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012 (2015)
- Su et al. [2015] Su, H., Maji, S., Kalogerakis, E., Learned-Miller, E.: Multi-view convolutional neural networks for 3d shape recognition. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 945–953 (2015)
- Paschalidou et al. [2018] Paschalidou, D., Ulusoy, O., Schmitt, C., Van Gool, L., Geiger, A.: Raynet: Learning volumetric 3d reconstruction with ray potentials. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3897–3906 (2018)
- Huang et al. [2018] Huang, P.-H., Matzen, K., Kopf, J., Ahuja, N., Huang, J.-B.: Deepmvs: Learning multi-view stereopsis. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2821–2830 (2018)
- Yao et al. [2018] Yao, Y., Luo, Z., Li, S., Fang, T., Quan, L.: Mvsnet: Depth inference for unstructured multi-view stereo. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 767–783 (2018)
- Choy et al. [2016] Choy, C.B., Xu, D., Gwak, J., Chen, K., Savarese, S.: 3d-r2n2: A unified approach for single and multi-view 3d object reconstruction. In: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VIII 14, pp. 628–644 (2016). Springer
- Kar et al. [2017] Kar, A., Häne, C., Malik, J.: Learning a multi-view stereo machine. Advances in neural information processing systems 30 (2017)
- Dosovitskiy et al. [2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
- Yao et al. [2019] Yao, Y., Luo, Z., Li, S., Shen, T., Fang, T., Quan, L.: Recurrent mvsnet for high-resolution multi-view stereo depth inference. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5525–5534 (2019)
- Yang et al. [2020] Yang, B., Wang, S., Markham, A., Trigoni, N.: Robust attentional aggregation of deep feature sets for multi-view 3d reconstruction. International Journal of Computer Vision 128(1), 53–73 (2020)
- Xie et al. [2019] Xie, H., Yao, H., Sun, X., Zhou, S., Zhang, S.: Pix2vox: Context-aware 3d reconstruction from single and multi-view images. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2690–2698 (2019)
- Zhu et al. [2023] Zhu, Z., Yang, L., Lin, X., Yang, L., Liang, Y.: Garnet: Global-aware multi-view 3d reconstruction network and the cost-performance tradeoff. Pattern Recognition 142, 109674 (2023)
- Shi et al. [2021] Shi, Z., Meng, Z., Xing, Y., Ma, Y., Wattenhofer, R.: 3d-retr: End-to-end single and multi-view 3d reconstruction with transformers. arXiv preprint arXiv:2110.08861 (2021)
- Tiong et al. [2022] Tiong, L.C.O., Sigmund, D., Teoh, A.B.J.: 3d-c2ft: Coarse-to-fine transformer for multi-view 3d reconstruction. In: Proceedings of the Asian Conference on Computer Vision, pp. 1438–1454 (2022)
- Wang et al. [2021] Wang, D., Cui, X., Chen, X., Zou, Z., Shi, T., Salcudean, S., Wang, Z.J., Ward, R.: Multi-view 3d reconstruction with transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5722–5731 (2021)
- Yagubbayli et al. [2021] Yagubbayli, F., Wang, Y., Tonioni, A., Tombari, F.: Legoformer: Transformers for block-by-block multi-view 3d reconstruction. arXiv preprint arXiv:2106.12102 (2021)
- Arshad and Beksi [2023] Arshad, M.S., Beksi, W.J.: List: Learning implicitly from spatial transformers for single-view 3d reconstruction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9321–9330 (2023)
- Zhu et al. [2023] Zhu, Z., Yang, L., Li, N., Jiang, C., Liang, Y.: Umiformer: Mining the correlations between similar tokens for multi-view 3d reconstruction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 18226–18235 (2023)
- Spezialetti et al. [2019] Spezialetti, R., Salti, S., Stefano, L.D.: Learning an effective equivariant 3d descriptor without supervision. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6401–6410 (2019)
- Zhu et al. [2023] Zhu, B., Yang, C., Dai, J., Fan, J., Qin, Y., Ye, Y.: R 2 fd 2: Fast and robust matching of multimodal remote sensing images via repeatable feature detector and rotation-invariant feature descriptor. IEEE Transactions on Geoscience and Remote Sensing (2023)
- Shen et al. [2020] Shen, W., Zhang, B., Huang, S., Wei, Z., Zhang, Q.: 3d-rotation-equivariant quaternion neural networks. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XX 16, pp. 531–547 (2020). Springer
- Cohen et al. [2021] Cohen, T., et al.: Equivariant convolutional networks. PhD thesis, Taco Cohen (2021)
- Hamdi et al. [2022] Hamdi, A., AlZahrani, F., Giancola, S., Ghanem, B.: Mvtn: Learning multi-view transformations for 3d understanding. arXiv preprint arXiv:2212.13462 (2022)
- Dong et al. [2022] Dong, Y., Ruan, S., Su, H., Kang, C., Wei, X., Zhu, J.: Viewfool: Evaluating the robustness of visual recognition to adversarial viewpoints. In: Advances in Neural Information Processing Systems (NeurIPS) (2022)
- Ruan et al. [2023] Ruan, S., Dong, Y., Su, H., Peng, J., Chen, N., Wei, X.: Towards viewpoint-invariant visual recognition via adversarial training. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 4686–4696 (2023)
- Milletari et al. [2016] Milletari, F., Navab, N., Ahmadi, S.-A.: V-net: Fully convolutional neural networks for volumetric medical image segmentation. In: 2016 Fourth International Conference on 3D Vision (3DV), pp. 565–571 (2016). IEEE
- Hamdi et al. [2022] Hamdi, A., AlZahrani, F., Giancola, S., Ghanem, B.: Mvtn: Learning multi-view transformations for 3d understanding. arXiv preprint arXiv:2212.13462 (2022)
- Mescheder et al. [2019] Mescheder, L., Oechsle, M., Niemeyer, M., Nowozin, S., Geiger, A.: Occupancy networks: Learning 3d reconstruction in function space. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4460–4470 (2019)
- Sun et al. [2018] Sun, X., Wu, J., Zhang, X., Zhang, Z., Zhang, C., Xue, T., Tenenbaum, J.B., Freeman, W.T.: Pix3d: Dataset and methods for single-image 3d shape modeling. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2974–2983 (2018)