Progressive Text-to-3D Generation for Automatic 3D Prototyping

Han Yi¹, Zhedong Zheng¹, Xiangyu Xu² and Tat-seng Chua¹ ¹Han Yi, Zhedong Zheng and Tat-seng Chua are with School of Computing, National University of Singapore, Singapore 117417 [email protected], {zdzheng,dcscts}@nus.edu.sg²Xiangyu Xu is with School of Mathematics and Statistics, Xi’an Jiaotong University, China 710049 [email protected]

Abstract

Text-to-3D generation is to craft a 3D object according to a natural language description. This can significantly reduce the workload for manually designing 3D models and provide a more natural way of interaction for users. However, this problem remains challenging in recovering the fine-grained details effectively and optimizing a large-size 3D output efficiently. Inspired by the success of progressive learning, we propose a Multi-Scale Triplane Network (MTN) and a new progressive learning strategy. As the name implies, the Multi-Scale Triplane Network consists of four triplanes transitioning from low to high resolution. The low-resolution triplane could serve as an initial shape for the high-resolution ones, easing the optimization difficulty. To further enable the fine-grained details, we also introduce the progressive learning strategy, which explicitly demands the network to shift its focus of attention from simple coarse-grained patterns to difficult fine-grained patterns. Our experiment verifies that the proposed method performs favorably against existing methods. For even the most challenging descriptions, where most existing methods struggle to produce a viable shape, our proposed method consistently delivers. We aspire for our work to pave the way for automatic 3D prototyping via natural language descriptions.

I Introduction

3D prototyping, also known as 3D printing or additive manufacturing, is a technology that transforms digital models into tangible objects by adding material layer by layer. This technology has revolutionized sectors such as robotics [1, 2, 3], and manufacturing [4, 5]. However, a significant challenge is that digital design remains time-consuming and labor-intensive. As a solution, researchers are exploring simpler and more intuitive methods to guide 3D generation, like using text prompts. The aim of the text-to-3D generation task is to automatically create a 3D object draft from a natural description, thus cutting down the design efforts from the ground up.

In recent years, text-to-3D generation has reported rapid development due to the breakthrough of 2D text-to-image diffusion models [6, 7, 8]. For instance, the pioneer work DreamFusion [9] leverages the 2D Stable Diffusion and proposes Score Distillation Sampling (SDS) algorithm to generate a variety of 3D objects using only text prompts. However, there remain two problems. 1) The optimization difficulty of 3D high-resolution objects. It is hard to directly map one sentence to one high-dimension 3D object, especially in the form of Neural Radiance Fields (NeRF) [10]. This leads to either generation collapse or extended training duration for model convergence. 2) Lack of fine-grained details. We notice that some works report blurred results [9, 11, 12]. This is due to the use of a fixed training strategy, i.e., focusing on global fidelity all the time while ignoring the local parts.

Refer to caption — Figure 1: Pipeline for fast 3D prototyping. The proposed algorithm facilitates effortless and interactive creation of high-quality 3D objects from natural language descriptions, which can then be utilized for 3D printing.

In an attempt to overcome the above-mentioned challenges, we propose a progressive text-to-3D generation model that gradually refines the details of 3D objects. For the first problem, we introduce a novel network structure, namely, Multi-Scale Triplane Network (MTN) consisting of four triplanes ranging from low to high resolution. In the initial phases of training, we sample low-resolution features from the corresponding low-resolution triplane to capture the basic global geometric shape. As training advances, we fix the former low-resolution triplanes and gradually shift our focus to triplanes with a higher resolution. Such a progressive structure facilitates the model to capture different-level features in a step-by-step manner and thus enhances the geometric and textural nuances of the 3D model, such as color and texture.

For the second problem, we adopt a progressive learning strategy focusing on two key factors, i.e., time step $t$ and camera radius. In particular, unlike existing 2D diffusion models that utilize random sampling, we adopt a large $t$ during the initial stages to guide the global structure. As the training progresses, we transition to a smaller $t$ to refine visual details. Meanwhile, we gradually adjust the radius of the camera to approach the object more closely. This enables the camera to initially focus on capturing the global structure and later shift its attention to the local details.

To summarize, our contributions are as follows:

•

Most existing works on text-to-3D struggle to craft high-resolution outputs due to the optimization difficulty. In contrast, we introduce a Multi-Scale Triplane Network (MTN) to gradually create the 3D model in a bottom-up style, effectively alleviating the optimization issue.
•

We also propose a progressive learning strategy that simultaneously reduces the camera radius and time step $t$ in diffusion to refine details of the 3D model in a coarse-to-fine manner.
•

Albeit simple, extensive experiments show that the proposed method could achieve high-resolution outputs that align closely with natural language descriptions. We expect this work to pave the way for automatic 3D prototyping via easier human-machine interaction.

II Related Work

II-A Text-Guided Diffusion Models

The evolution of text-to-image generative paradigms has been notably characterized by the advent of diffusion models [6, 7, 8]. These models have gained prominence for their robust stability and unprecedented scalability. Significant exemplars in this category include GLIDE [13], DALL·E 2 [14], Imagen [15], and Stable Diffusion [16]. These models leverage textual prompts to synthesize images of photorealistic quality. Their efficacy is substantially bolstered by the availability of comprehensive datasets comprising billions of image-text pairs, thereby enhancing the semantic understanding of these generative systems. However, it is nontrivial to extend these diffusion models to 3D generation, which needs to accurately synthesize 3D attributes from textual prompts across a diversity of viewpoints [17].

II-B 3D Generative Modeling

The realm of 3D generative modeling has seen extensive exploration across diverse representation types, including voxel grids [18, 19], point clouds [20, 21, 22], meshes [23, 24, 25, 26, 27, 28], implicit fields [29, 30, 31, 32], and octrees [33]. While many traditional approaches hinge on 3D assets as training data, the challenge of acquiring such data at scale has spurred alternative strategies. Addressing the inherent challenge of obtaining 3D assets for training, some recent endeavors have turned to 2D supervision. Leveraging ubiquitous 2D images, models like pi-GAN [34], EG3D [35], MagicMirror [36] and GIRAFFE [37] have supervised 2D renderings of 3D models through adversarial loss against 2D image datasets. While these approaches hold potential, a recurring challenge is that they are often restricted to specific domains like human faces [38], limiting their versatility and hindering expansive creative freedom in 3D design. In our study, we pivot towards text-to-3D generation, with the objective of crafting a visually favorable 3D object guided by diverse text prompts.

II-C Text-to-3D Generation

The success of the text-to-image generation models has driven substantial progress in the emerging field of text-to-3D object generation. Notably, the integration of CLIP into models like Dream Fields [39] and CLIPmesh [40] has been a significant advancement. These approaches harness CLIP to optimize 3D representations, ensuring 2D renderings resonate with textual prompts. A defining advantage of such techniques is their ability to bypass the need for costly 3D training data, though a trade-off in terms of the realism of the resultant 3D models has been observed. More recent advancements, such as DreamFusion [9], SJC [11], Magic3D [41], and Latent-NeRF [12], have showcased the merits of employing robust text-to-image diffusion models as a robust 2D prior, elevating the quality and realism of text-to-3D generation. This innovation, capitalizing on the potential of diffusion models, has led to outcomes with higher fidelity and diversity, as well as reduced generation time. We follow the spirit of this line of work and present new techniques to effectively improve the quality of the 3D outputs.

III Method

In this section, we first explain our Multi-Scale Triplane Network (MTN) with discussion. Then we elaborate a progressive learning strategy, followed by implementation details. The brief scheme is shown in Fig. 2.

III-A Multi-Scale Triplane

An overview of our Multi-Scale Triplane Network (MTN) is shown in Fig. 2. In particular, MTN is composed of four triplanes [35] ranging from low to high resolutions. Each triplane leverages three axis-aligned 2D feature planes $\mathbf{F}^{m}_{xy},\mathbf{F}^{m}_{xz},\mathbf{F}^{m}_{yz}\in\mathbb{R}^{N_{m}\times N_{m}\times C},m=1,2,3$ . $N_{m}$ denotes spatial resolution, while $C$ is the dimension of channels and $m$ represents the training stage. Note that a large $N_{m}$ results in a substantial GPU memory cost. Therefore, for the last triplane, we essentially employ a trivector instead to optimize memory usage and support higher resolution. This trivector configuration leverages the axis-aligned vectors $\mathbf{F}^{4}_{x},\mathbf{F}^{4}_{y},\mathbf{F}^{4}_{z}\in\mathbb{R}^{N_{4}\times 1\times C}$ with a resolution of $N_{4}\times 1$ and $C$ .

Given any 3D coordinate point $p\in\mathbb{R}^{3}$ , we project this coordinate onto each of these orthogonal feature planes and sample feature vectors via interpolation. We then sum these three vectors $f^{m}(p)=\mathbf{F}^{m}_{xy}(p)+\mathbf{F}^{m}_{xz}(p)+\mathbf{F}^{m}_{yz}(p)$ for $m=1,2,3$ as position features for the first three triplanes, while $f^{4}(p)=\mathbf{F}^{4}_{x}(p)+\mathbf{F}^{4}_{y}(p)+\mathbf{F}^{4}_{z}(p)$ for the last trivector. To aggregate multi-scale features, we further fuse the different level position features together as $h^{m}(p)=\sum_{k=1}^{m}(f^{k}(p))$ . After obtaining the multi-scale representation, we follow [42] to transform the summed position features into the Fourier domain. Subsequently, the Fourier features are fed forward into a lightweight triplane decoder to estimate color and density [10]. We deploy a Multi-Layer Perceptron (MLP) as the triplane decoder. Finally, to calculate the loss, we apply neural volume rendering techniques [10] to project the 3D representation onto an RGB image $I$ , which is the input of the Diffusion model.

Discussion. Why is multi-scale important? As shown in Fig. 2, we utilize triplanes with different resolutions to obtain features at different scales. We intend to simulate the human recognition system to transition from basic elements to more intricate details when observing 3D objects. For instance, when a human sees a new object, he will first capture the overarching structure of the model and then refine the details via the foveal vision. Similarly, during the early stages of training, we extract low-resolution features from the corresponding low-resolution triplane. As one point on a low-resolution triplane is obtained through interpolation from a coarse grid and encompasses a broader field of view, these coarse features provide inherent global geometric insights.

As the training progresses, we gradually shift our focus from the initial low-resolution triplanes to the higher-resolution counterparts. The high-resolution triplane could acquire intricate features to refine details in the 3D model, such as subtle shade and texture nuances. Simultaneously, it is easier to refine high-scale features and facilitate the optimization process of high-scale features if low-scale features have been already well optimized. This process is also in spirit similar to the classical curriculum learning [43] where learning begins with simpler tasks and gradually advances to harder ones. In the experimental evaluations, we could observe that the proposed method achieves visual enhancements in both the shape and texture of the model, even for complex descriptions.

Optimization objective. Given the projected image $I$ , we apply Score Distillation Sampling (SDS) [9] to distill 2D image priors from the pretrained 2D diffusion model $\boldsymbol{\epsilon}_{\phi}$ . The loss on 2D projection is then back-propagated to update differentiable 3D representations. In particular, the proposed 3D model can be typically depicted as a parametric function $I=g_{\theta}(P)$ , where $I$ represents the images produced at distinct camera poses, and $P$ is the set of multiple positions $p$ . Here, $g$ denotes the volumetric rendering mechanism, and $\theta$ embodies a coordinate-based MLP and triplanes that portray a 3D scene. To estimate the projection quality, we adopt the pretrained diffusion model, which is well aligned with text prompts $y$ . The one-time denoising forward can be formulated as $\epsilon_{\phi}(I_{t};y,t)$ to predict the noise $\varepsilon$ given the noisy image $I_{t}$ , time step $t$ , and text embedding $y$ . Therefore, the gradient of the SDS loss can be formulated as:

\nabla_{\theta}\mathcal{L}_{SDS}(\phi,g_{\theta}(P))=\mathbb{E}_{t,\epsilon}\left[\left(\epsilon_{\phi}\left(I_{t};y,t\right)-\epsilon\right)\frac{\partial I_{t}}{\partial\theta}\right],

where $\boldsymbol{\epsilon}$ is a noise term following a standard normal distribution and $I_{t}$ denotes the noisy image. Following the setting in the diffusion model [6, 7, 8], the noisy image can be formulated as a linear process $I_{t}=\sqrt{\bar{\alpha}_{t}}I+\sqrt{1-\bar{\alpha}_{t}}\boldsymbol{\epsilon}$ , where $\bar{\alpha}_{t}$ is a predefined time-dependent constant. Besides, it is worth noting that the diffusion model parameter $\phi$ is frozen. The purpose of this denoising function is to offer the text-aware guidance to update $\theta$ . If the projection $I$ is well-aligned with the text $y$ , the noise on $I_{t}$ is easy to predict. Otherwise, we will punish the 3D model.

III-B Progressive Learning Strategy

Another essential element underlying the proposed method is the employment of a progressive learning strategy, focusing on two critical parameters: i.e., the time step $t$ and camera radius. Subsequent sections provide detailed illustrations for each of these components.

Progressive time step sampling. We first introduce a progressive time step $(t)$ sampling approach. It is motivated by the observation that the default uniform $t$ -sampling in SDS training often results in inefficiencies and inaccuracies due to the broad-range random sampling. Our approach, therefore, emphasizes a gradual reduction of the time step, directing the model to transition from coarse to detailed learning (See Fig. 2 (b)). In the early phases of training, we adopt larger time steps to add a substantial amount of noise into the image. During the noise recovery process, the network is driven to focus on the low-frequency global structure signal. As training advances and the global structure stabilizes, we decrease to smaller time steps. In this stage, the network is demanded to recover the high-frequency fine-grained pattern according to the context. It facilitates the model in refining local details, such as textures and shades.

We define the rate of change of variable $t$ as follows:

\displaystyle\frac{\mathrm{d}t}{\mathrm{d}i}=\beta v(t),

(1)

where $v(t)$ controls how $t$ changes with respect to the training iteration $i$ and is manually designed. $\beta$ is a positive constant. We define $v(t)$ piece-wise:

\displaystyle v(t)

\displaystyle=\begin{cases}-\exp(\frac{t-n_{2}}{m_{2}})&\text{ if }t>n_{2}\\ -1.0&\text{ if }n_{1}\leq t\leq n_{2}\\ -\exp(\frac{t-n_{1}}{m_{1}})&\text{ if }t<n_{1},\end{cases}.

(2)

Here, $v(t)<0$ implies $\frac{\mathrm{d}t}{\mathrm{d}i}<0$ , indicating that $t$ decreases as training progresses. Our design ensures that $t$ decreases rapidly at the beginning ( $t>n_{2}$ ), linearly in the middle ( $n_{1}\leq t\leq n_{2}$ ), and more mildly towards the end ( $t<n_{1}$ ). After the time step $t$ decreases to $t_{\text{min}}$ , we revert to random sampling:

t\sim\mathcal{U}(t_{\text{min}},t_{\text{max}}).

It reintroduces randomness to maintain the vibrancy of the coloration of the 3D model.

We notice that a concurrent pre-print work, Dreamtime [44], also employs a similar non-increasing $t$ -sampling strategy. However, such a strategy sometimes tends to overfit the local details, and inadvertently change the global illumination. Therefore, it is crucial to avoid the consistent use of extremely small time steps at the end of training. Unlike Dreamtime [44], our method decreases $t$ with the training step at a much steeper pace and employs a mixture of both deterministic and random sampling.

Progressive radius. Simultaneously, our approach also incorporates a dynamic camera radius considering the camera movements in the real world. Typically, eyes will move closer for detailed object observation. Motivated by this behavior, we dynamically adjust the camera radius during the multi-scale learning. During the low-scale triplane stage, which focuses on broader geometric structures, we utilize a large camera radius to cover the entire object. As we move to the high-scale triplane stage, which refines local model details, the camera radius is reduced to closely focus on finer details of the 3D scene. This progressive radius strategy is intuitive and directly impacts resolution, aiding in feature learning across varying scales. In the ablation study, we also verify the effectiveness of this strategy (See Section IV-D).

III-C Implementation Details

Neural field rendering structure. As mentioned, the proposed MTN consists of three triplanes and one trivector varying in resolution. The resolutions of the triplanes are $N_{1},N_{2},N_{3}=64,128,256$ , and the number of channels is $C=32$ . For the trivector, we set $N_{4}=512$ . During the Neural Field optimization, camera positions are randomly sampled in spherical coordinates. The azimuth angles, the polar angles and fovy range are randomly sampled between $\left[-180^{\circ},180^{\circ}\right]$ , $\left[45^{\circ},105^{\circ}\right]$ , and $\left[10^{\circ},30^{\circ}\right]$ , respectively. For spherical radius, the initial $R\in[3.0,3.5]$ and gradually decreases to $R\in[1.8,2.1]$ .

Diffusion model. We deploy DeepFloyd IF [46] as the guidance model to provide 2D image priors. For time step $(t)$ sampling in SDS, the Stable-DreamFusion uses random sampling $t\sim\mathcal{U}(20,980)$ . In our proposed approach, the time step $t$ is set to decrease from 980 to 20. Through a grid search, we empirically set an optimal prior weight configuration as $\{m_{1}=50,m_{2}=150,n_{1}=500,n_{2}=800\}$ to control the rate of decrease. Following existing works [9, 12, 41, 47], we also adopt the viewpoint-aware prompts by appending prompts such as “front view”, “side view”, and “back view”.

Optimization. The number of total iterations is 6000 and the batch size is 1. We employ the Adan optimizer [48] with learning rate of $1\times 10^{-3}$ , weight decay of $2\times 10^{-5}$ . We follow existing works [35] and apply two regularization terms, i.e., TV regularization and L2 regularization, to prevent floating clouds.

IV Experiment

In this section, we assess the capability of our method to produce high-fidelity 3D objects according to natural language prompts. We primarily consider two key evaluation aspects: (1) alignment with the text, particularly focusing on key words in the sentence; and (2) consistent geometric shape, especially in localized parts like ears and tails. Due to the space limitation, we mainly compare our approach against two widely-used text-to-3D frameworks. Since DreamFusion [9] is not publicly available, we utilize the open-source variant, Stable-DreamFusion [45]. Besides, we also compare the proposed method with the pioneer work, i.e., Latent-NeRF [12].

TABLE I: Quantitative comparisons with ground-truth images, Latent-NeRF [12], Stable-Dreamfusion [45] evaluated on 153 standard prompts in Dreamfields [39]. The best precision in every column is in bold.

Method	R-Precision (%) $\uparrow$
	CLIP B/32		CLIP B/16		CLIP L/14
	RGB	DEPTH	RGB	DEPTH	RGB	DEPTH
GT images	77.1	-	79.1	-	-	-
Latent-NeRF	48.4	37.1	52.9	40.6	59.5	40.9
Stable-Dreamfusion	56.4	45.9	60.3	45.8	58.3	42.9
Ours	62.6	53.1	62.6	51.9	64.8	47.6

TABLE II: Ablation study of different components. The best precision in every column is in bold.

Method	MTN	Progressive	Progressive	R-Precision (%) $\uparrow$
		Progressive	Progressive	CLIP B/32		CLIP B/16		CLIP L/14		Mean
		time step	radius	RGB	DEPTH	RGB	DEPTH	RGB	DEPTH	RGB	DEPTH
Single triplane				46.8	38.4	51.8	41.1	53.9	41.4	50.8	40.3
MTN	$\checkmark$			57.8	46.7	58.2	46.2	62.2	42.8	59.4	45.2
MTN-t	$\checkmark$	$\checkmark$		60.2	52.7	61.2	51.0	63.5	43.5	61.6	49.1
MTN-r	$\checkmark$		$\checkmark$	57.9	48.5	60.4	48.8	62.4	42.7	60.2	46.7
MTN-full	$\checkmark$	$\checkmark$	$\checkmark$	62.6	53.1	62.6	51.9	64.8	47.6	63.3	50.9

IV-A Qualitative Evaluation

As shown in Fig. 3, we could observe that our method outperforms prior competitive approaches in terms of text alignment, texture details, and geometric precision. For instance, in the first row, Latent-NeRF [12] struggles to generate a 3D model. Stable-DreamFusion [45] does generate a tiger avatar but misses the key word “doctor”. In contrast, the proposed method successfully crafts a tiger doctor with a book in his hands. In the second row, our method displays a better geometric shape and correct shading on the bust, while Stable-DreamFusion places a tail on the head and Latent-NeRF produces a head shape with three ears. In the third row, our method successfully captures the keyword “baby bunny” and shows a better geometric shape with one head and two ears. In contrast, Latent-NeRF [12] and Stable-DreamFusion [45] are both plagued with the multi-face and multi-ear issue. In summary, our method can generate reliable 3D representations that are aligned with the text prompts and exhibit natural geometric shapes, which are also well-aligned with human intuition.

IV-B Quantitative Evaluation

Since our task is a generation problem, we do not have 3D ground-truth meshes for direct comparison of differences. Therefore, we follow the existing work, i.e., DreamFusion [9] to evaluate the alignment between 2D projected images and the text prompt. In particular, we adopt the CLIP R-Precision [49] to evaluate the retrieval performance for both RGB images and depth maps. The RGB images serve as an indicator of texture quality, while the depth maps represent the geometric shape. A higher score indicates better performance. This evaluation is conducted using three pre-trained CLIP models with different model sizes i.e., CLIP B/32, CLIP B/16, and CLIP L/14. For a fair comparison, we also adopt 153 standard prompts from Dreamfields [39]. Our results are shown in Table I. We could see that our method consistently achieves the highest R-Precision scores across all three metrics, indicating a significant advantage.

IV-C User Study

For a more comprehensive evaluation, we conduct a user study with 96 participants. We evaluate our model against two prevailing approaches like Latent-NeRF [12] and Stable-DreamFusion [45] in three key aspects: 3D geometry, prompt consistency, and overall quality. We randomly select 96 prompts from the standard set of 153 prompts and generate 3D models using Stable-DreamFusion [45], Latent-NeRF [12], and our approach. Participants are then asked to rank the models based on the aforementioned criteria. As shown in Fig. 4, our visual results outperform other methods across multiple metrics, attracting preferences from $72.92\%$ of participants for overall quality, $62.5\%$ for 3D geometry, and $55.2\%$ for prompt consistency. This highlights the efficacy of our approach in delivering superior results across various evaluation criteria.

IV-D Ablation Study

Effectiveness of multi-scale triplanes. We first investigate the impact of the multi-scale triplane architecture to substantiate its advantages. As shown in Table II, we could observe that the multi-scale architecture facilitates both texture and geometric shape learning. Specifically, the RGB R-Precision is improved with a large margin $+8.6\%$ on average, while the mean depth R-Precision increases $+4.9\%$ . We also provide a visualization result in Fig. 5 (b). The basic single-scale triplane structure results in a 3D output that misses intricate details both texturally and geometrically, evident in incomplete hands, tails, and the presence of floating points. In contrast, the multi-scale network gradually leverages the multi-scale information, yielding a more smooth geometric shape with clear edges.

Effectiveness of progressive learning. Here we further evaluate the impact of progressive time step sampling and progressive radius. (1) As shown in the third row of Table II, the MTN with only progressive time step strategy could further improve the text alignment by $+2.2\%$ texture and $+3.9\%$ geometry quality on average. This is because the small time step towards the end of learning shifts the focus to high-frequency details, significantly improving the overall visual quality. As shown in the third column in Fig. 5, we could notice more fine-grained texture patterns are generated on the surface. (2) Similarly to how humans often take a closer look to examine object details, our model, when applying the progressive radius approach, performs even better, showing a $+1.7\%$ improvement on the local texture details. As the camera gets closer, the 2D projection and the optimization objects both emphasizes local quality, resulting in a refined 3D model. As a result, the culmination of these strategies leads to a final output that is both detailed and visually appealing (See Fig. 5 (d)).

V Conclusion

In this work, inspired by the bottom-up spirit, we introduce the Multi-Scale Triplane Network (MTN) and a progressive learning strategy, both of which effectively ease the optimization difficulty during high-fidelity generation. The Multi-Scale Triplane Network operates at the structure level to aggregate the multi-scale representation, while the progressive learning strategy functions at the recognition level to gradually refine high-frequency details. Extensive experiments verify the effectiveness of every component. We envision our approach offers a preliminary attempt for automatic 3D prototyping, bridging the gap between natural language descriptions and intricate 3D design. In the future, we will continue to explore the potential to complete occluded 3D objects [50] via language prior and discriminative language guidance [51].

References

[1] G. Hunt, F. Mitzalis, T. Alhinai, P. A. Hooper, and M. Kovac, “3d printing with flying robots,” in 2014 IEEE international conference on robotics and automation (ICRA). IEEE, 2014, pp. 4493–4499.
[2] R. MacCurdy, R. Katzschmann, Y. Kim, and D. Rus, “Printable hydraulics: A method for fabricating robots by 3d co-printing solids and liquids,” in 2016 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2016, pp. 3878–3885.
[3] J. D. Carrico, K. J. Kim, and K. K. Leang, “3d-printed ionic polymer-metal composite soft crawling robot,” in 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 4313–4320.
[4] D. Drotman, S. Jadhav, M. Karimi, P. de Zonia, and M. T. Tolley, “3d printed soft actuators for a legged robot capable of navigating unstructured terrain,” in 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 5532–5538.
[5] T. Fujii, J. Dang, and H. Tanaka, “Hummingbird-bat hybrid wing by 3-d printing,” in 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023, pp. 3404–3410.
[6] P. Dhariwal and A. Nichol, “Diffusion models beat gans on image synthesis,” Advances in neural information processing systems, vol. 34, pp. 8780–8794, 2021.
[7] A. Q. Nichol and P. Dhariwal, “Improved denoising diffusion probabilistic models,” in International Conference on Machine Learning. PMLR, 2021, pp. 8162–8171.
[8] J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” in International Conference on Learning Representations, 2021. [Online]. Available: https://openreview.net/forum?id=St1giarCHLP
[9] B. Poole, A. Jain, J. T. Barron, and B. Mildenhall, “Dreamfusion: Text-to-3d using 2d diffusion,” arXiv preprint arXiv:2209.14988, 2022.
[10] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “Nerf: Representing scenes as neural radiance fields for view synthesis,” Communications of the ACM, vol. 65, no. 1, pp. 99–106, 2021.
[11] H. Wang, X. Du, J. Li, R. A. Yeh, and G. Shakhnarovich, “Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 12 619–12 629.
[12] G. Metzer, E. Richardson, O. Patashnik, R. Giryes, and D. Cohen-Or, “Latent-nerf for shape-guided generation of 3d shapes and textures,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 12 663–12 673.
[13] A. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. McGrew, I. Sutskever, and M. Chen, “Glide: Towards photorealistic image generation and editing with text-guided diffusion models,” arXiv preprint arXiv:2112.10741, 2021.
[14] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hierarchical text-conditional image generation with clip latents,” arXiv preprint arXiv:2204.06125, vol. 1, no. 2, p. 3, 2022.
[15] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, K. Ghasemipour, R. Gontijo Lopes, B. Karagol Ayan, T. Salimans et al., “Photorealistic text-to-image diffusion models with deep language understanding,” Advances in Neural Information Processing Systems, vol. 35, pp. 36 479–36 494, 2022.
[16] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10 684–10 695.
[17] X. Zhang, Z. Zheng, D. Gao, B. Zhang, Y. Yang, and T.-S. Chua, “Multi-view consistent generative adversarial networks for compositional 3d-aware image synthesis,” International Journal of Computer Vision, pp. 1–24, 2023.
[18] M. Tatarchenko, A. Dosovitskiy, and T. Brox, “Octree generating networks: Efficient convolutional architectures for high-resolution 3d outputs,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2088–2096.
[19] J. Li, K. Xu, S. Chaudhuri, E. Yumer, H. Zhang, and L. Guibas, “Grass: Generative recursive autoencoders for shape structures,” ACM Transactions on Graphics (TOG), vol. 36, no. 4, pp. 1–14, 2017.
[20] A. Luo, T. Li, W.-H. Zhang, and T. S. Lee, “Surfgen: Adversarial 3d shape synthesis with explicit surface discriminators,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 16 238–16 248.
[21] L. Zhou, Y. Du, and J. Wu, “3d shape generation and completion through point-voxel diffusion,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 5826–5835.
[22] A. Vahdat, F. Williams, Z. Gojcic, O. Litany, S. Fidler, K. Kreis et al., “Lion: Latent point diffusion models for 3d shape generation,” Advances in Neural Information Processing Systems, vol. 35, pp. 10 021–10 039, 2022.
[23] L. Gao, J. Yang, T. Wu, Y.-J. Yuan, H. Fu, Y.-K. Lai, and H. Zhang, “Sdm-net: Deep generative network for structured deformable mesh,” ACM Transactions on Graphics (TOG), vol. 38, no. 6, pp. 1–15, 2019.
[24] L. Gao, T. Wu, Y.-J. Yuan, M.-X. Lin, Y.-K. Lai, and H. Zhang, “Tm-net: Deep generative networks for textured meshes,” ACM Transactions on Graphics (TOG), vol. 40, no. 6, pp. 1–15, 2021.
[25] C. Nash, Y. Ganin, S. A. Eslami, and P. Battaglia, “Polygen: An autoregressive generative model of 3d meshes,” in International conference on machine learning. PMLR, 2020, pp. 7220–7229.
[26] P. Henderson, V. Tsiminaki, and C. H. Lampert, “Leveraging 2d data to learn textured 3d mesh generation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 7498–7507.
[27] K. Gupta, Neural mesh flow: 3d manifold mesh generation via diffeomorphic flows. University of California, San Diego, 2020.
[28] A. Rosinol, T. Sattler, M. Pollefeys, and L. Carlone, “Incremental visual-inertial 3d mesh generation with structural regularities,” in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 8220–8226.
[29] Z. Cheng, M. Chai, J. Ren, H.-Y. Lee, K. Olszewski, Z. Huang, S. Maji, and S. Tulyakov, “Cross-modal 3d shape generation and manipulation,” in European Conference on Computer Vision. Springer, 2022, pp. 303–321.
[30] R. Wu, Y. Zhuang, K. Xu, H. Zhang, and B. Chen, “Pq-net: A generative part seq2seq network for 3d shapes,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 829–838.
[31] R. Wu and C. Zheng, “Learning to generate 3d shapes from a single example,” arXiv preprint arXiv:2208.02946, 2022.
[32] X. Zheng, Y. Liu, P. Wang, and X. Tong, “Sdf-stylegan: Implicit sdf-based stylegan for 3d shape generation,” in Computer Graphics Forum, vol. 41, no. 5. Wiley Online Library, 2022, pp. 52–63.
[33] M. Ibing, G. Kobsik, and L. Kobbelt, “Octree transformer: Autoregressive 3d shape generation on hierarchically structured sequences,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 2697–2706.
[34] E. R. Chan, M. Monteiro, P. Kellnhofer, J. Wu, and G. Wetzstein, “pi-gan: Periodic implicit generative adversarial networks for 3d-aware image synthesis,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 5799–5809.
[35] E. R. Chan, C. Z. Lin, M. A. Chan, K. Nagano, B. Pan, S. De Mello, O. Gallo, L. J. Guibas, J. Tremblay, S. Khamis et al., “Efficient geometry-aware 3d generative adversarial networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16 123–16 133.
[36] Z. Zheng, J. Zhu, W. Ji, Y. Yang, and T.-S. Chua, “3d magic mirror: Clothing reconstruction from a single image via a causal perspective,” arXiv preprint arXiv:2204.13096, 2022.
[37] M. Niemeyer and A. Geiger, “Giraffe: Representing scenes as compositional generative neural feature fields,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 11 453–11 464.
[38] T. Karras, S. Laine, and T. Aila, “A style-based generator architecture for generative adversarial networks,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 4401–4410.
[39] A. Jain, B. Mildenhall, J. T. Barron, P. Abbeel, and B. Poole, “Zero-shot text-guided object generation with dream fields,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 867–876.
[40] N. Mohammad Khalid, T. Xie, E. Belilovsky, and T. Popa, “Clip-mesh: Generating textured meshes from text using pretrained image-text models,” in SIGGRAPH Asia 2022 conference papers, 2022, pp. 1–8.
[41] C.-H. Lin, J. Gao, L. Tang, T. Takikawa, X. Zeng, X. Huang, K. Kreis, S. Fidler, M.-Y. Liu, and T.-Y. Lin, “Magic3d: High-resolution text-to-3d content creation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 300–309.
[42] M. Tancik, P. Srinivasan, B. Mildenhall, S. Fridovich-Keil, N. Raghavan, U. Singhal, R. Ramamoorthi, J. Barron, and R. Ng, “Fourier features let networks learn high frequency functions in low dimensional domains,” Advances in Neural Information Processing Systems, vol. 33, pp. 7537–7547, 2020.
[43] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, “Curriculum learning,” in Proceedings of the 26th annual international conference on machine learning, 2009, pp. 41–48.
[44] Y. Huang, J. Wang, Y. Shi, X. Qi, Z.-J. Zha, and L. Zhang, “Dreamtime: An improved optimization strategy for text-to-3d content creation,” arXiv preprint arXiv:2306.12422, 2023.
[45] J. Tang, “Stable-dreamfusion: Text-to-3d with stable-diffusion,” 2022, https://github.com/ashawkey/stable-dreamfusion.
[46] M. Konstantinov. (2023) Deepfloyd-if. [Online]. Available: https://github.com/deep-floyd/IF
[47] M. Armandpour, H. Zheng, A. Sadeghian, A. Sadeghian, and M. Zhou, “Re-imagine the negative prompt algorithm: Transform 2d diffusion into 3d, alleviate janus problem and beyond,” arXiv preprint arXiv:2304.04968, 2023.
[48] X. Xie, P. Zhou, H. Li, Z. Lin, and S. Yan, “Adan: Adaptive nesterov momentum algorithm for faster optimizing deep models,” arXiv preprint arXiv:2208.06677, 2022.
[49] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning. PMLR, 2021, pp. 8748–8763.
[50] S. S. Mohammadi, N. F. Duarte, D. Dimou, Y. Wang, M. Taiana, P. Morerio, A. Dehban, P. Moreno, A. Bernardino, A. Del Bue et al., “3dsgrasp: 3d shape-completion for robotic grasp,” in 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023, pp. 3815–3822.
[51] F. Matsuzawa, Y. Qiu, K. Iwata, H. Kataoka, and Y. Satoh, “Question generation for uncertainty elimination in referring expressions in 3d environments,” in 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023, pp. 6146–6152.