Text-Image Conditioned Diffusion for Consistent Text-to-3D Generation
Abstract
By lifting pre-trained 2D diffusion models into Neural Radiance Fields (NeRFs), text-to-3D generation methods have made great progress. Many state-of-the-art approaches apply score distillation sampling (SDS) to optimize the NeRF representation, supervising the NeRF optimization with pre-trained text-conditioned 2D diffusion models such as Imagen. However, the supervision signal provided by such pre-trained diffusion models depends only on the text prompt and does not constrain multi-view consistency. To inject cross-view consistency into diffusion priors, some recent works finetune the 2D diffusion model with multi-view data but still lack fine-grained view coherence. To tackle this challenge, we incorporate multi-view image conditions into the supervision signal of the NeRF optimization, which explicitly enforces fine-grained view consistency. With this stronger supervision, our proposed text-to-3D method effectively mitigates the generation of floaters (due to excessive densities) and completely empty spaces (due to insufficient densities). Quantitative evaluations on the T3Bench dataset demonstrate that our method achieves state-of-the-art performance over existing text-to-3D methods. We will make the code publicly available.
keywords:
Text-to-3D Generation, AIGC, Diffusion, Neural Radiance Fields
1 Introduction
The advent of text-to-3D technology marks a revolutionary stride in digital asset generation, profoundly impacting fields such as gaming and other 3D applications involving a creative design process. In the gaming industry, this technology heralds a new era of creativity and efficiency, allowing designers to rapidly transform textual descriptions into intricate 3D models, thereby reducing development time and fostering more dynamic and immersive gaming experiences. Recently, researchers have increasingly concentrated on improving the geometric priors of large-scale diffusion models for generative text-to-3D tasks.

Prominent methods in text-to-3D generation proceed in two steps: they first implement the generative model as a differentiable mapping from parameters to images (Poole et al. [2022], Shi et al. [2023b], Wang et al. [2023b], Lin et al. [2023a]); then a neural inverse renderer, NeRF (Mildenhall et al. [2020]), is trained to render images that belong to the distribution of the pre-trained diffusion model. In other words, this class of methods distills the score function of a pre-trained diffusion model, a procedure known as score distillation (Poole et al. [2022]). However, effectively training a neural inverse renderer requires consistent multi-view images, a requirement current diffusion models struggle to meet due to their inherent stochasticity (Poole et al. [2022], Lin et al. [2023a], Tsalicoglou et al. [2023]). Consequently, the resulting objects often suffer from the Janus problem (Poole et al. [2022]), which refers to regenerating multiple copies of the content described by the text prompt, such as multiple faces or drifted content.
Additionally, since the generative process is inherently a 3D reconstruction from 2D observations, the problem is highly ill-posed. Without enough consistency between views to constrain the reconstruction, NeRF tends to predict excessive or insufficient densities. Excessive densities produce floaters that visually occlude the object, while insufficient densities lead to failures in generating objects such as transparent ones. Overall, achieving both appropriate densities and high color consistency remains a key challenge for stable 3D generation.
Recently, ProlificDreamer (Wang et al. [2023b]) takes the stochasticity of the optimization process into account, yielding high-quality 3D assets. However, its optimization is prohibitively expensive, making it impractical for real-time applications. Other state-of-the-art works such as MVDream (Shi et al. [2023b]) finetune the diffusion model to yield multi-view images by generating a set of orthogonal views with respect to the text prompt. Nevertheless, MVDream does not always produce consistent multi-view images, as its fine-tuning process generates orthogonal views that share minimal overlap with one another and are constrained only by the text. Consequently, the optimization process does not provide enough constraints, and the resulting NeRF tends to predict excessive or insufficient densities on certain text prompts. To address this challenge, in this paper we investigate how to provide sufficient constraints to guide the diffusion models toward generating high-quality objects.
Drawing on the above considerations, we propose to add an additional image-conditioned diffusion model to constrain the 3D reconstruction during score distillation, as shown in Figure 1. This additional guidance enforces fine-grained view consistency in the rendered images. We start by sampling a set of orthogonal views, which we refer to as reference views. Then, we randomly sample a reference view and use it as a condition for the image-guided diffusion model to produce a novel view. We then apply score distillation as evaluated by two different diffusion models. On the one hand, a text-conditioned multi-view diffusion model gives an update direction that follows multi-view consistency with respect to the text (accuracy), constraining the coarse consistency of the 3D reconstruction; on the other hand, the image-conditioned novel-view diffusion model gives an update direction that follows consistency between views, enforcing fine-grained consistency between views rather than with the text (quality). Finally, the two complementary signals collaborate to ensure both the accuracy and the quality of the generated 3D target. Our experimental results on T3Bench show that our method outperforms existing methods and yields state-of-the-art results.
We organize this paper as follows. First, we present related work with score distillation in Sec. 2. Then, we describe our method in Sec. 3, which leverages an image-conditioned diffusion model to explicitly enforce multi-view consistencies. Experimental results are shown in Sec. 4. Finally, we conclude our work in Sec. 5.
2 Related Work
2.1 Text-to-3D
With the rise of large pre-trained 2D models (Radford et al. [2021], Saharia et al. [2022], Rombach et al. [2021], DeepFloyd [2023]), text-to-3D generation has experienced a remarkable leap in progress. Existing text-to-3D methods can be grouped into two main approaches. The first group of works (Jun and Nichol [2023], Nichol et al. [2022], Gupta et al. [2023]) relies solely on captioned 3D data for the text-to-3D generation task. However, because captioned 3D data is limited in scale and diversity, these methods depend on access to large-scale 3D data and can only generate objects from a limited set of categories.
A second widely used paradigm leverages 2D supervision to optimize a 3D representation model such as NeRF (Mildenhall et al. [2020]). Early works use CLIP (Radford et al. [2021]) to optimize NeRF by projecting the text and rendered images into a shared latent space and aligning them with the CLIP loss (Jain et al. [2022], Mohammad Khalid et al. [2022], Xu et al. [2023b], Lee and Chang [2022], Michel et al. [2022], Wei et al. [2023]). Because CLIP is not a generative model, these methods tend to produce generations that lack geometric fidelity. Improving upon previous works, Poole et al. [2022] and Wang et al. [2023a] are the pioneering works that leverage 2D pre-trained diffusion models for text-to-3D generation. They propose score distillation sampling (SDS), which supervises 3D representation models, such as NeRF (Mildenhall et al. [2020]) or DMTet (Shen et al. [2021]), by guiding their rendered images towards regions of high probability density conditioned on the text. Expanding upon this research based on SDS, existing works explore different supervision methods (Wang et al. [2023b], Yu et al. [2023b]) and various 3D representations such as Gaussian Splatting (Vilesov et al. [2023], Tang et al. [2023a], Yi et al. [2023], Chen et al. [2023d]).
Building upon these seminal works, a large collection of 2D-lifting methods (Huang et al. [2023], Wang et al. [2023b], Yu et al. [2023b], Seo et al. [2023a], Tsalicoglou et al. [2023]) focus on improving fidelity and ameliorating existing issues, such as over-saturation (generations tend to appear cartoon-like). In addition, prior works also tackle the poor view-consistency issue and the Janus problem (Shi et al. [2023b], Zhao et al. [2023], Zhu and Zhuang [2023], Yu et al. [2023a], Ma et al. [2023], Li et al. [2023b], Seo et al. [2023b], Hong et al. [2023]). Moreover, efficiency-centric methods emphasize reducing the generation time through parallel sampling (Zhou et al. [2023]), distilling a multi-view 2D diffusion model (Chen et al. [2023b]), and splitting the task into sparse view generation and reconstruction (Li et al. [2023a]). Other methods (Lin et al. [2023a], Chen et al. [2023a]) focus on generating high-resolution 3D contents using a coarse-to-fine approach and exploring efficient 3D representations, as well as producing generations with fine details (Qiu et al. [2023]). Several other works improve generation diversity (Huang et al. [2023], Wang et al. [2023b], Seo et al. [2023a], Lu et al. [2023]).
2.2 Image-to-3D
NeRF (Mildenhall et al. [2020]), with its ability to take a collection of images from different viewpoints and reconstruct the underlying 3D scene, has shown tremendous potential in advancing the image-to-3D research direction. However, NeRF is greatly constrained by its demand for dense viewpoint data with significant viewpoint overlap. Subsequent works focus on relaxing this constraint under the single-image-to-3D paradigm, such as performing generation by progressively deforming a mesh (Wang et al. [2018]), leveraging NeRF with learned scene priors (Yu et al. [2021]), adding semantic/geometric regularization (Xu et al. [2022]), or leveraging differentiable surface modeling and differentiable rendering (Gao et al. [2022]). However, these methods are limited by the lack of large-scale 3D data and suffer from issues such as inconsistency and long generation times.
Later works leverage 2D models as priors in the 3D generation process (Deng et al. [2023]) and take a coarse-to-fine approach (Tang et al. [2023b], Seo et al. [2023a], Qian et al. [2023]). Subsequent methods likewise adapt the 2D diffusion prior to the novel view synthesis task (Watson et al. [2022], Chan et al. [2023], Guangcong et al. [2023], Liu et al. [2023c], Lin et al. [2023b]). Expanding upon these foundations, the current image-to-3D task features several orthogonal research focuses, including enabling 360-degree reconstruction of objects (Melas-Kyriazi et al. [2023], Xu et al. [2023a]) and scenes (Sargent et al. [2023]), improving generation efficiency (Liu et al. [2023b, a]), and ameliorating 3D inconsistencies (Ye et al. [2023], Lin et al. [2023c], Weng et al. [2023], Long et al. [2023], Liu et al. [2023d]).
2.3 Multi-view Diffusion
One major approach to ameliorate the multi-view consistency issue in 2D-lifting methods is through leveraging multi-view diffusion models, which requires modeling the joint distribution of an object’s information from multiple views. The multi-view information can take the form of latent intermediary features (Tang et al. [2023c]), epipolar lines (Tseng et al. [2023]), or whole images (Shi et al. [2023b]). Information-sharing is achieved with a variety of techniques, such as placing different viewpoint images into a tiling layout (Shi et al. [2023a], Tsalicoglou et al. [2023], Li et al. [2023a]), denoising several image views with different noises and then sharing information across images with an attention mechanism during each denoising step (Shi et al. [2023b], Liu et al. [2023d], Tang et al. [2023c], Xu et al. [2023c]), or adopting a coarse-to-fine approach by incorporating an orthogonal-view diffusion prior (Zhao et al. [2023]).
3 Our Text-Image Conditioned Diffusion (TICD) Method
In Sec. 3.1, we provide some preliminaries that are necessary to our method. In Sec. 3.2, we present a novel score distillation that leverages text-conditioned and image-conditioned score functions. We propose to constrain the optimization process to generate multi-view images that are consistent both with the text and across different views. The detailed implementation is presented in Sec. 3.3.
3.1 Preliminaries
NeRF (Mildenhall et al. [2020]) is a neural inverse rendering approach based on volume rendering. Formally, for each point $\mathbf{p} \in \mathbb{R}^3$ in space and viewing direction unit vector $\mathbf{d} \in \mathbb{S}^2$, NeRF is a differentiable volumetric renderer parameterized by a neural network with parameters $\theta$ that returns the density $\sigma(\mathbf{p})$ and RGB color $\mathbf{c}(\mathbf{p}, \mathbf{d})$. In particular, NeRF renders each pixel via volume ray casting. Given a ray $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$ with ray origin $\mathbf{o}$ the camera center, it approximates the following integral via numerical quadrature:

$\hat{C}(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t), \mathbf{d})\,\mathrm{d}t, \qquad (1)$

where $T(t) = \exp\!\big(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\,\mathrm{d}s\big)$ is the transmittance function that predicts the probability that the ray travels from $t_n$ to $t$ without terminating. NeRF is trained in an end-to-end manner on a set of posed images to minimize the reconstruction loss between the rendered color $\hat{C}(\mathbf{r})$ and the ground-truth color of the posed images. For simplicity, we denote by $\mathbf{x} = g(\theta, c)$ a rendered view at some desired camera pose $c$, which is obtained by applying Eq. 1 to every pixel.
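As a minimal illustration of Eq. 1, the sketch below approximates the rendering integral with piecewise-constant quadrature. The names `density_fn`, `color_fn`, and the toy constant-density sphere are hypothetical stand-ins, not the NeRF network or sampling strategy used in this paper.

```python
# Minimal sketch of the volume-rendering quadrature in Eq. 1 (toy fields, single ray).
import torch

def render_ray(density_fn, color_fn, ray_o, ray_d, t_near=0.1, t_far=4.0, n_samples=64):
    """Approximate the rendering integral with piecewise-constant quadrature."""
    t = torch.linspace(t_near, t_far, n_samples)                 # sample depths along the ray
    pts = ray_o[None, :] + t[:, None] * ray_d[None, :]           # 3D sample positions
    sigma = density_fn(pts)                                      # per-sample density
    rgb = color_fn(pts, ray_d.expand_as(pts))                    # per-sample color
    delta = torch.cat([t[1:] - t[:-1], torch.tensor([1e10])])    # interval lengths
    alpha = 1.0 - torch.exp(-sigma * delta)                      # opacity of each interval
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha + 1e-10]), dim=0)[:-1]  # T(t)
    weights = trans * alpha                                      # contribution of each sample
    return (weights[:, None] * rgb).sum(dim=0)                   # composited pixel color

# Toy usage with a hypothetical constant-density sphere of radius 1.
density_fn = lambda p: torch.where(p.norm(dim=-1) < 1.0,
                                   torch.full((p.shape[0],), 5.0),
                                   torch.zeros(p.shape[0]))
color_fn = lambda p, d: torch.sigmoid(p)                         # placeholder view-independent color
pixel = render_ray(density_fn, color_fn, torch.tensor([0., 0., -3.]), torch.tensor([0., 0., 1.]))
```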
DDPM (Ho et al. [2020]) is described as two Markov processes yielding latent variables $\mathbf{x}_{1:T}$ from a data distribution $q(\mathbf{x}_0)$. Specifically, it defines a forward process $q(\mathbf{x}_t \mid \mathbf{x}_{t-1})$ and a reverse process $p_\phi(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$, both being Gaussians. While $q$ is hand-crafted, we learn $p_\phi$ using a deep neural network (DNN) with parameters $\phi$, which we denote $\epsilon_\phi$. Formally, given $\mathbf{x}_0 \sim q(\mathbf{x}_0)$, the diffusion process generates latent variables with decreasing signal-to-noise ratio such that:

$q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}\big(\mathbf{x}_t;\ \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0,\ (1-\bar{\alpha}_t)\mathbf{I}\big), \qquad (2)$

where $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$, $\alpha_t = 1 - \beta_t$, and $\beta_t$ is a hyper-parameter that controls the noise level at timestep $t$. Since $q(\mathbf{x}_T \mid \mathbf{x}_0)$ converges towards an isotropic Gaussian distribution as $T \to \infty$, we define the prior of the reverse process as an isotropic Gaussian distribution, $p(\mathbf{x}_T) = \mathcal{N}(\mathbf{0}, \mathbf{I})$. Given a noising schedule $\{\beta_t\}_{t=1}^{T}$, the reverse process runs from $t = T$ to $t = 1$, and at each step we sample from $p_\phi(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}\big(\mathbf{x}_{t-1};\ \mu_\phi(\mathbf{x}_t, t),\ \sigma_t^2 \mathbf{I}\big)$, where:

$\mu_\phi(\mathbf{x}_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\phi(\mathbf{x}_t, t)\right). \qquad (3)$
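The closed-form noising of Eq. 2 and the reverse mean of Eq. 3 can be sketched as follows, assuming a simple linear beta schedule; the actual schedules of the pre-trained models used later may differ.

```python
# Illustrative sketch of the forward noising (Eq. 2) and reverse mean (Eq. 3).
import torch

T = 1000
betas = torch.linspace(1e-4, 2e-2, T)          # beta_t, per-step noise level (assumed linear)
alphas = 1.0 - betas                           # alpha_t
alphas_bar = torch.cumprod(alphas, dim=0)      # \bar{alpha}_t

def q_sample(x0, t, eps):
    """Sample x_t ~ q(x_t | x_0) in closed form (Eq. 2)."""
    a_bar = alphas_bar[t]
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps

def reverse_mean(x_t, t, eps_pred):
    """Mean of p_phi(x_{t-1} | x_t) given the predicted noise (Eq. 3)."""
    return (x_t - betas[t] / (1.0 - alphas_bar[t]).sqrt() * eps_pred) / alphas[t].sqrt()

# Toy usage on a random "image".
x0 = torch.randn(3, 64, 64)
eps = torch.randn_like(x0)
x_t = q_sample(x0, t=500, eps=eps)
mu = reverse_mean(x_t, t=500, eps_pred=eps)    # with the true noise, mu moves back toward x0
```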
In particular, the DNN predicts the noise signal $\epsilon$ such that it minimizes the following loss:

$\mathcal{L}_{\text{DDPM}}(\phi) = \mathbb{E}_{t, \mathbf{x}_0, \epsilon}\big[\, w(t)\, \| \epsilon_\phi(\mathbf{x}_t, t) - \epsilon \|_2^2 \,\big], \qquad (4)$

where $w(t)$ is a positive weighting function that depends on $t$, and $\mathbf{x}_t$ is obtained using Eq. 2. Additionally, one can use classifier-free guidance to decrease the diversity of the samples while increasing the quality of each individual sample w.r.t. some condition $y$ (e.g., text). Notably, the neural network's output is modified as follows:

$\hat{\epsilon}_\phi(\mathbf{x}_t, t, y) = (1 + s)\,\epsilon_\phi(\mathbf{x}_t, t, y) - s\,\epsilon_\phi(\mathbf{x}_t, t, \varnothing), \qquad (5)$

where $s$ is a scalar coefficient that balances the aforementioned trade-off.
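A minimal sketch of the classifier-free guidance combination in Eq. 5 is shown below; `eps_model` is a placeholder noise predictor, not the specific diffusion models used in this work.

```python
# Minimal sketch of classifier-free guidance (Eq. 5).
import torch

def cfg_epsilon(eps_model, x_t, t, cond, uncond, s=7.5):
    eps_cond = eps_model(x_t, t, cond)        # prediction with the text/image condition
    eps_uncond = eps_model(x_t, t, uncond)    # prediction with an empty condition
    return (1.0 + s) * eps_cond - s * eps_uncond

# Toy usage with a dummy network that ignores its inputs.
dummy = lambda x, t, c: torch.zeros_like(x)
eps_hat = cfg_epsilon(dummy, torch.randn(1, 4, 64, 64), t=500, cond="a dog", uncond="")
```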

3.2 Image and Text View Consistency
Neural inverse rendering techniques such as NeRF optimize a 3D scene representation based on image observations. Thus, one can use a text-to-image generative model for optimizing a 3D scene representation. Prior work refers to this as score distillation sampling (Poole et al. [2022]), for the sampling process of this generative model is done by distilling the learned score function of a 2D diffusion model into NeRF. Leveraging text-to-image generative models has the following advantages. First, compared to 3D data, there is abundant 2D data. Second, it allows leveraging other abilities in 2D generative models, such as editing or controllability (Li et al. [2023c], Haque et al. [2023], Chen et al. [2023c]). Finally, it can also be applied in a similar manner to video generation (Singer et al. [2023]). In practice, we generate multi-view images using a text-to-image diffusion model and minimize the score distillation loss on these images by optimizing NeRF.
Score Distillation (Poole et al. [2022]) encourages NeRF to render an image $\mathbf{x} = g(\theta, c)$ such that it belongs to the distribution of plausible images as evaluated by a diffusion model $\epsilon_\phi$. In practice, it consists of optimizing $\theta$ to minimize the following loss function:

$\mathcal{L}_{\text{SDS}}(\theta) = \mathbb{E}_{t, \epsilon}\big[\, w(t)\, \| \epsilon_\phi(\mathbf{x}_t, t, y) - \epsilon \|_2^2 \,\big], \qquad (6)$

where $\mathbf{x}_t$ is sampled from the forward process and $y$ is some condition (e.g., text). In practice, the gradient of the diffusion model is detached, yielding the following gradient for NeRF's parameters $\theta$:

$\nabla_\theta \mathcal{L}_{\text{SDS}}(\theta) = \mathbb{E}_{t, \epsilon}\!\left[ w(t)\, \big( \epsilon_\phi(\mathbf{x}_t, t, y) - \epsilon \big) \frac{\partial \mathbf{x}}{\partial \theta} \right]. \qquad (7)$
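The gradient in Eq. 7 can be sketched as a single optimization step, assuming a differentiable renderer `render` and a frozen noise predictor `eps_model`; both names and the timestep range are placeholders.

```python
# Sketch of one score distillation step implementing the detached gradient of Eq. 7.
import torch

def sds_step(render, eps_model, theta_params, cond, alphas_bar, w=lambda t: 1.0):
    x = render()                                           # differentiable render g(theta, c)
    t = torch.randint(20, 980, (1,)).item()                # sample a timestep
    eps = torch.randn_like(x)
    a_bar = alphas_bar[t]
    x_t = a_bar.sqrt() * x + (1.0 - a_bar).sqrt() * eps    # forward process (Eq. 2)
    with torch.no_grad():                                  # the diffusion model is frozen
        eps_pred = eps_model(x_t, t, cond)
    grad = w(t) * (eps_pred - eps)                         # per-pixel SDS gradient
    # Backpropagate grad through the renderer only (it is detached from the U-Net).
    x.backward(gradient=grad)
    return [p.grad for p in theta_params]
```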
Eq. 1 shows that the optimization process in NeRF is applied in a pixel-wise manner. Thus, pixel-wise view consistency is the signal from which density and color are learned. Additionally, the gradient signal in Eq. 7 must maintain multi-view consistency to match the condition $y$. In other words, accurate text-to-3D asset generation requires view consistency both w.r.t. $y$ and between views at the pixel level.
Our score distillation uses two different scores for supervision: a text-conditioned multi-view diffusion model and an image-conditioned novel-view diffusion model. These two learning signals are complementary: one maintains multi-view consistency with the text, while the other maintains consistency between views. In this paper, we denote a single view as $\mathbf{x}^i$ and a set of views (i.e., a multi-view set) as $\{\mathbf{x}^i\}_{i=1}^{N}$, where $i$ indexes each view.
Text-conditioned generation. Denote with $i$ the index of views generated from the text condition. Following prior work (Shi et al. [2023b]), we start by sampling a set of camera poses $\{c^i\}$ and render these views $\{\mathbf{x}^i\}$, which we call reference views. In particular, the views are chosen such that they are orthogonal to each other. For each view, we sample a timestep $t$ and compute the forward process of the diffusion process, $q(\mathbf{x}_t^i \mid \mathbf{x}^i)$. Given the text $y$ and the set of noised views rendered from NeRF, the text-conditioned diffusion model computes the score function w.r.t. $y$, yielding an update direction towards higher-density regions. Additionally, we modify the score function to include classifier-free guidance as in Eq. 5 and compute score distillation to obtain the gradient update as in Eq. 7.
Image-conditioned generation. Prior work utilized image-conditioned diffusion to generate 3D content from a single 2D reference image, while our approach leverages the model differently. Instead of providing the reference as an additional input to the model, we employ it as extra supervision to guide different views and ensure fine-grained multi-view consistency. Denote with $j$ the index of novel views generated from images. We render extra views $\mathbf{x}^j$ at camera extrinsics $c^j$, and denote by $\Delta c^{i \to j}$ the relative camera extrinsic from camera pose $c^i$ to $c^j$. Formally, the image-conditioned diffusion model takes a rendered reference image $\mathbf{x}^i$ along with the relative camera extrinsic $\Delta c^{i \to j}$ as conditioning. In a similar manner, we sample $t$ from the uniform distribution. The model is trained to compute the score function for novel views $\mathbf{x}^j$, denoted as $\epsilon_\phi(\mathbf{x}_t^j, t, \mathbf{x}^i, \Delta c^{i \to j})$. Following Liu et al. [2023c], we compute the score distillation at $\mathbf{x}^j$, which was previously rendered. The resulting score distillation encourages NeRF to render consistent views $\mathbf{x}^i$ and $\mathbf{x}^j$, such that the pre-trained diffusion model can predict accurate novel views $\mathbf{x}^j$.
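For illustration, the relative camera extrinsic conditioning could be computed from two world-to-camera matrices as sketched below; the exact convention (Zero-1-to-3, for instance, conditions on a relative spherical parameterization) may differ from this homogeneous-matrix form.

```python
# Sketch of a relative camera extrinsic from a reference view to a novel view.
import torch

def relative_extrinsic(w2c_ref, w2c_novel):
    """Return the 4x4 transform mapping the reference camera frame to the novel one."""
    return w2c_novel @ torch.linalg.inv(w2c_ref)

# Toy usage with two identity-rotation cameras at different distances along -z.
w2c_a = torch.eye(4); w2c_a[2, 3] = -2.0
w2c_b = torch.eye(4); w2c_b[2, 3] = -3.0
rel = relative_extrinsic(w2c_a, w2c_b)
```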
Text and image score distillation. We modify Eq. 6 such that the gradient signal becomes:

$\nabla_\theta \mathcal{L}_{\text{TICD}}(\theta) = \lambda_{\text{text}}\,\nabla_\theta \mathcal{L}_{\text{SDS}}^{\text{text}}(\theta) + \lambda_{\text{img}}\,\nabla_\theta \mathcal{L}_{\text{SDS}}^{\text{img}}(\theta), \qquad (8)$

where $\lambda_{\text{text}}$ and $\lambda_{\text{img}}$ are scale factors of the text and image diffusion model, respectively. The score distillation process thus adds one score toward cross-view consistency and another toward text-conditioned multi-view consistency. This optimization process provides NeRF with enough constraints to predict accurate densities and colors, as shown in Figure 4.
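As a sketch, the combined update of Eq. 8 simply scales and sums the two SDS gradients before the optimizer step; `sds_grad_text` and `sds_grad_image` are placeholders for the per-parameter gradients produced by the two diffusion models.

```python
# Sketch of the combined text + image score distillation gradient (Eq. 8).
def combined_grad(sds_grad_text, sds_grad_image, lam_text=1.0, lam_img=1.0):
    # Weighted sum of the two gradient lists, parameter by parameter.
    return [lam_text * g_t + lam_img * g_i
            for g_t, g_i in zip(sds_grad_text, sds_grad_image)]
```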
3.3 Implementation Details
Models and Representations. We utilize MVDream's (Shi et al. [2023b]) pre-trained model as our multi-view diffusion model, and Zero123-xl, provided by Zero-1-to-3 (Liu et al. [2023c]), as our novel-view image-conditioned diffusion model. For the 3D representation, we adopt the implicit volume approach of ThreeStudio (Guo et al. [2023a]), consisting of a multi-resolution hash grid and an MLP network for predicting voxel density and RGB values.
View Selection. For each rendered view, we first randomly sample cameras with a field-of-view (FoV) between 15° and 60° and an elevation between 0° and 30° for the multi-view diffusion model. The camera distance is set to the object size (0.5) multiplied by the NDC focal length and a random scaling factor in [0.8, 1.0]. We then randomly select views from the above set as reference views for the novel-view diffusion model. For each reference view, we choose an additional random camera with the same FoV and an elevation between -30° and 80° before applying the novel-view image-conditioned diffusion model. The batch sizes start at 8 and 12 for the multi-view and novel-view models, respectively, and decrease to 4 and 4 after 5,000 iterations.
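A sketch of this view-sampling procedure is given below, assuming uniform sampling within the stated ranges; the NDC-focal-based distance formula is our paraphrase of the description above, not the released code.

```python
# Illustrative sketch of the camera sampling described in "View Selection".
import math, random

def sample_reference_camera():
    fov = random.uniform(15.0, 60.0)                       # field of view in degrees
    elevation = random.uniform(0.0, 30.0)                  # elevation in degrees
    azimuth = random.uniform(0.0, 360.0)
    ndc_focal = 0.5 / math.tan(0.5 * math.radians(fov))    # focal length in NDC units (assumed)
    distance = 0.5 * ndc_focal * random.uniform(0.8, 1.0)  # object size (0.5) x focal x scale
    return dict(fov=fov, elevation=elevation, azimuth=azimuth, distance=distance)

def sample_novel_camera(reference):
    # Same FoV as the reference view, but a wider elevation range for the novel view.
    return dict(fov=reference["fov"],
                elevation=random.uniform(-30.0, 80.0),
                azimuth=random.uniform(0.0, 360.0),
                distance=reference["distance"])
```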
Optimization. The 3D model is optimized for 10,000 steps using the AdamW (Kingma and Ba [2014]) optimizer. The learning rates for the hash grid and the MLP are set to 0.01 and 0.001, respectively. Score distillation sampling is applied with the maximum and minimum timesteps annealed from 0.98 to 0.5 and from 0.98 to 0.02, respectively, over the first 8,000 steps. Both loss scale factors $\lambda_{\text{text}}$ and $\lambda_{\text{img}}$ are set to 1.0. The rendering resolution begins at 64×64 and is increased to 256×256 after 5,000 steps. Guidance scales of 50.0 and 3.0 are used for the multi-view and novel-view models, respectively.
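The annealing schedules above can be sketched as follows, assuming linear interpolation over the warm-up window (the interpolation shape is an assumption, not stated in the text).

```python
# Sketch of the timestep and resolution schedules described in "Optimization".
def timestep_range(step, warmup=8000, t_max=(0.98, 0.50), t_min=(0.98, 0.02)):
    frac = min(step / warmup, 1.0)
    hi = t_max[0] + frac * (t_max[1] - t_max[0])   # max timestep: 0.98 -> 0.5
    lo = t_min[0] + frac * (t_min[1] - t_min[0])   # min timestep: 0.98 -> 0.02
    return lo, hi

def render_resolution(step):
    return 64 if step < 5000 else 256              # 64x64 first, then 256x256
```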
4 Experiment
4.1 Dataset and Settings
We evaluate our method on the T3Bench (He et al. [2023]), a comprehensive text-to-3D benchmark containing diverse text prompts across three categories: Single Object, Single Object with Surroundings, and Multiple Objects, with 100 distinct prompts in each category. We also leverage the two automatic metrics proposed in T3Bench (He et al. [2023]) for evaluating the quality and text alignment of generated 3D scenes. The quality metric captures multi-view scene images and utilizes text-image scoring models and a regional convolution mechanism to measure overall quality and detect view inconsistency issues. The alignment metric employs multi-view scene captioning with BLIP (Li et al. [2022]) and aggregates the captions with a large language model, which measures how well the caption covers information in the original text prompt.
Apart from the analysis of 3D content, we also conduct a text-2D consistency analysis by computing the CLIP (Radford et al. [2021]) cosine similarity between the original text prompt and an image captured at a fixed position and orientation. In practice, we first normalize the 3D mesh into a cube of fixed range, then capture an image from the front of the mesh at a fixed distance and focal length.
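A sketch of this text-2D CLIP consistency metric is shown below, assuming the Hugging Face transformers CLIP implementation; the CLIP variant shown is an illustrative choice, not necessarily the one used in our evaluation.

```python
# Sketch of the CLIP cosine similarity between a rendered front view and the text prompt.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_similarity(image_path, prompt):
    image = Image.open(image_path)
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img_feat = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_feat = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)   # L2-normalize embeddings
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return (img_feat * txt_feat).sum().item()                   # cosine similarity
```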
For mesh extraction, we adopt the Marching Cubes (Lorensen and Cline [1998]) algorithm for our method and the original mesh extraction algorithm of T3Bench (He et al. [2023]) for the other methods. We also follow the T3Bench settings and apply mesh geometry simplification to a maximum of 40,000 faces before texture extraction.
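Mesh extraction with Marching Cubes from a sampled density grid can be sketched with scikit-image as follows; the density threshold and grid resolution are illustrative choices, not the values used in our pipeline.

```python
# Sketch of Marching Cubes mesh extraction from a NeRF density grid.
import numpy as np
from skimage import measure

def extract_mesh(density_grid, level=10.0, bbox=(-1.0, 1.0)):
    """density_grid: (N, N, N) array of densities sampled on a regular grid."""
    verts, faces, normals, _ = measure.marching_cubes(density_grid, level=level)
    # Map voxel indices back to world coordinates inside the bounding box.
    n = density_grid.shape[0]
    verts = bbox[0] + verts / (n - 1) * (bbox[1] - bbox[0])
    return verts, faces, normals

# Toy usage: a spherical density blob.
grid = np.linspace(-1, 1, 128)
x, y, z = np.meshgrid(grid, grid, grid, indexing="ij")
density = 50.0 * (np.sqrt(x**2 + y**2 + z**2) < 0.5).astype(np.float32)
verts, faces, normals = extract_mesh(density, level=10.0)
```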
Table 1: Average T3Bench scores across the three prompt sets.

| Method | Single Obj. | Single Obj. w/ Surr. | Multiple Obj. |
|---|---|---|---|
| DreamFusion | 24.4 | 24.6 | 16.1 |
| Magic3D | 37.0 | 35.4 | 25.7 |
| LatentNeRF | 33.1 | 30.6 | 20.6 |
| Fantasia3D | 26.4 | 27.0 | 18.5 |
| SJC | 24.7 | 19.8 | 11.7 |
| ProlificDreamer | 49.4 | 44.8 | 35.8 |
| MVDream | 47.8 | 42.4 | 33.8 |
| TICD (ours) | 50.0 | 45.6 | 36.0 |
Table 2: CLIP cosine similarity between the front-view image and the original text prompt.

| Method | Single Obj. | Single Obj. w/ Surr. | Multiple Obj. |
|---|---|---|---|
| DreamFusion | 0.255 | 0.253 | 0.240 |
| Magic3D | 0.264 | 0.268 | 0.253 |
| LatentNeRF | 0.271 | 0.267 | 0.247 |
| Fantasia3D | 0.250 | 0.248 | 0.240 |
| SJC | 0.245 | 0.232 | 0.233 |
| ProlificDreamer | 0.283 | 0.261 | 0.251 |
| MVDream | 0.281 | 0.277 | 0.265 |
| TICD (ours) | 0.285 | 0.279 | 0.268 |
4.2 Comparisons with State-of-the-Art
We compare our method with seven state-of-the-art text-to-3D methods: DreamFusion (Poole et al. [2022]), Magic3D (Lin et al. [2023a]), LatentNeRF (Metzer et al. [2023]), Fantasia3D (Chen et al. [2023a]), SJC (Wang et al. [2023a]), ProlificDreamer (Wang et al. [2023b]), and MVDream (Shi et al. [2023b]). For all methods, we adopt the original implementations in ThreeStudio (Guo et al. [2023b]) to generate 3D content, and the original T3Bench codebase to extract 3D meshes and evaluate metrics.
Table 3: Ablation on the image-conditioned diffusion module (average T3Bench scores).

| Setting | Single Obj. | Single Obj. w/ Surr. | Multiple Obj. |
|---|---|---|---|
| without image condition | 48.5 | 43.2 | 34.0 |
| with image condition | 50.0 | 45.6 | 36.0 |

Our method achieves the best performance among all text-to-3D methods in terms of the average T3Bench score across all prompt sets, as shown in Table 1. On the one hand, for single-object prompts, our reconstruction is noticeably more precise than that of other methods. On the other hand, for the more difficult multiple-object prompts, our method also delivers strong results, which we attribute to the co-supervision of text and image conditions.
As illustrated in Figure 3, the 3D content generated by most existing methods lacks texture detail and clear geometry, making it difficult to faithfully reproduce all the information in the textual prompts. DreamFusion produces renderings lacking fine texture details and suffers from a high failure rate. Magic3D and LatentNeRF generate improved texture details but still demonstrate poor geometric quality. SJC tends to output less compact geometry, which is unfavorable to the final shape of the 3D content. By contrast, Fantasia3D typically outputs compact and well-defined geometry alongside richer textures; however, its performance declines when processing complex textual prompts, often yielding completely erroneous geometry. ProlificDreamer utilizes LoRA (Hu et al. [2021]) to finetune the diffusion model during optimization together with variational score distillation, which benefits the generation of rich details, but its geometry quality is often poor, with severely distorted shapes. MVDream performs much better geometrically, but its overall generation quality and alignment with the original text prompt still leave room for improvement. Our proposed method demonstrates superior performance across all aspects; it generates detailed 3D content with accurate, high-quality geometry that effectively reflects the textual description, even on challenging prompt sets.
To further validate the effectiveness of our method, we also compare CLIP cosine similarities between the front-view image and the original text prompt, as shown in Table 2. The results confirm that our method retains better consistency with the original text prompt than other methods.
4.3 Ablation Study
To demonstrate the effectiveness of the image-conditioned diffusion module, we first conducted quantitative experiments, toggling only the inclusion of the module while keeping all other experimental conditions constant. As shown in Table 3, adding the image-conditioned module effectively improves generation quality and alignment with the original text prompt.

To show how the image-conditioned module effectively enforces consistency as a robust, plug-in component, we conducted a case study. The first row of Figure 4 demonstrates how viewpoint inconsistencies in text-conditioned diffusion models can destabilize the optimization when rendering special materials like glass, often resulting in a loss of density. By continuing to constrain cross-view consistency during training, the addition of the image-conditioned module ameliorates such density collapse. Furthermore, for certain prompts (e.g., the second row of Figure 4), the inconsistency of text-conditioned diffusion guidance gradually accumulates erroneous excess density. Because these erroneous densities tend not to match the 3D prior of the image-conditioned module, they are suppressed by our extra module during optimization. Additionally, the third row of Figure 4 illustrates the color drift and blending issues that emerge with 2D diffusion guidance; our module, which enforces cross-view consistency, effectively addresses these as well.
5 Conclusion
In this paper, we propose to optimize a neural radiance field by distilling the score of a text-conditioned multi-view diffusion model and an image-based novel view diffusion model. Our method is based on the observation of view inconsistency in existing text-based multi-view generative models. We propose to explicitly enforce a constraint between views by leveraging an image-based diffusion model for supervision. Thus, unlike prior work, our method can produce accurate densities. Currently, our work uses two diffusion models that introduce additional parameters. Future work might explore the design of a diffusion model that can generate multiple views given a text as well as generate a novel view given an image.
References
- Chan et al. [2023] Chan, E.R., Nagano, K., Chan, M.A., Bergman, A.W., Park, J.J., Levy, A., Aittala, M., Mello, S.D., Karras, T., Wetzstein, G., 2023. GeNVS: Generative novel view synthesis with 3D-aware diffusion models, in: arXiv.
- Chen et al. [2023a] Chen, R., Chen, Y., Jiao, N., Jia, K., 2023a. Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation. arXiv:2303.13873.
- Chen et al. [2023b] Chen, Y., Li, Z., Liu, P., 2023b. Et3d: Efficient text-to-3d generation via multi-view distillation. arXiv:2311.15561.
- Chen et al. [2023c] Chen, Y., Pan, Y., Li, Y., Yao, T., Mei, T., 2023c. Control3d: Towards controllable text-to-3d generation, in: Proceedings of the 31st ACM International Conference on Multimedia, Association for Computing Machinery, New York, NY, USA. p. 1148–1156. URL: https://doi.org/10.1145/3581783.3612489, doi:10.1145/3581783.3612489.
- Chen et al. [2023d] Chen, Z., Wang, F., Liu, H., 2023d. Text-to-3d using gaussian splatting. arXiv:2309.16585.
- DeepFloyd [2023] DeepFloyd, 2023. If. URL: https://github.com/deep-floyd/IF. gitHub repository.
- Deng et al. [2023] Deng, C., Jiang, C.“., Qi, C.R., Yan, X., Zhou, Y., Guibas, L., Anguelov, D., 2023. Nerdi: Single-view nerf synthesis with language-guided diffusion as general image priors, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 20637--20647.
- Gao et al. [2022] Gao, J., Shen, T., Wang, Z., Chen, W., Yin, K., Li, D., Litany, O., Gojcic, Z., Fidler, S., 2022. Get3d: A generative model of high quality 3d textured shapes learned from images, in: Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A. (Eds.), Advances in Neural Information Processing Systems, Curran Associates, Inc.. pp. 31841--31854. URL: https://proceedings.neurips.cc/paper_files/paper/2022/file/cebbd24f1e50bcb63d015611fe0fe767-Paper-Conference.pdf.
- Guangcong et al. [2023] Guangcong, Chen, Z., Loy, C.C., Liu, Z., 2023. Sparsenerf: Distilling depth ranking for few-shot novel view synthesis. IEEE/CVF International Conference on Computer Vision (ICCV) .
- Guo et al. [2023a] Guo, Y.C., Liu, Y.T., Shao, R., Laforte, C., Voleti, V., Luo, G., Chen, C.H., Zou, Z.X., Wang, C., Cao, Y.P., Zhang, S.H., 2023a. threestudio: A unified framework for 3d content generation. https://github.com/threestudio-project/threestudio.
- Guo et al. [2023b] Guo, Y.C., Liu, Y.T., Shao, R., Laforte, C., Voleti, V., Luo, G., Chen, C.H., Zou, Z.X., Wang, C., Cao, Y.P., Zhang, S.H., 2023b. threestudio: A unified framework for 3d content generation. https://github.com/threestudio-project/threestudio.
- Gupta et al. [2023] Gupta, A., Xiong, W., Nie, Y., Jones, I., Oğuz, B., 2023. 3dgen: Triplane latent diffusion for textured mesh generation. arXiv:2303.05371.
- Haque et al. [2023] Haque, A., Tancik, M., Efros, A., Holynski, A., Kanazawa, A., 2023. Instruct-nerf2nerf: Editing 3d scenes with instructions, in: Proceedings of the IEEE/CVF International Conference on Computer Vision.
- He et al. [2023] He, Y., Bai, Y., Lin, M., Zhao, W., Hu, Y., Sheng, J., Yi, R., Li, J., Liu, Y.J., 2023. T3bench: Benchmarking current progress in text-to-3d generation. arXiv preprint arXiv:2310.02977 .
- Ho et al. [2020] Ho, J., Jain, A., Abbeel, P., 2020. Denoising diffusion probabilistic models, in: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H. (Eds.), Advances in Neural Information Processing Systems, Curran Associates, Inc.. pp. 6840--6851. URL: https://proceedings.neurips.cc/paper_files/paper/2020/file/4c5bcfec8584af0d967f1ab10179ca4b-Paper.pdf.
- Hong et al. [2023] Hong, S., Ahn, D., Kim, S., 2023. Debiasing scores and prompts of 2d diffusion for view-consistent text-to-3d generation. arXiv:2303.15413.
- Hu et al. [2021] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., 2021. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 .
- Huang et al. [2023] Huang, Y., Wang, J., Shi, Y., Qi, X., Zha, Z.J., Zhang, L., 2023. Dreamtime: An improved optimization strategy for text-to-3d content creation. arXiv:2306.12422.
- Jain et al. [2022] Jain, A., Mildenhall, B., Barron, J.T., Abbeel, P., Poole, B., 2022. Zero-shot text-guided object generation with dream fields, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 867--876.
- Jun and Nichol [2023] Jun, H., Nichol, A., 2023. Shap-e: Generating conditional 3d implicit functions. arXiv:2305.02463.
- Kingma and Ba [2014] Kingma, D.P., Ba, J., 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 .
- Lee and Chang [2022] Lee, H.H., Chang, A.X., 2022. Understanding pure clip guidance for voxel grid nerf models. arXiv:2209.15172.
- Li et al. [2022] Li, J., Li, D., Xiong, C., Hoi, S., 2022. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation, in: International Conference on Machine Learning, PMLR. pp. 12888--12900.
- Li et al. [2023a] Li, J., Tan, H., Zhang, K., Xu, Z., Luan, F., Xu, Y., Hong, Y., Sunkavalli, K., Shakhnarovich, G., Bi, S., 2023a. Instant3d: Fast text-to-3d with sparse-view generation and large reconstruction model. arXiv preprint arXiv:2311.06214 .
- Li et al. [2023b] Li, W., Chen, R., Chen, X., Tan, P., 2023b. Sweetdreamer: Aligning geometric priors in 2d diffusion for consistent text-to-3d. arXiv:2310.02596.
- Li et al. [2023c] Li, Z., Chen, Y., Zhao, L., Liu, P., 2023c. Mvcontrol: Adding conditional control to multi-view diffusion for controllable text-to-3d generation. arXiv:2311.14494.
- Lin et al. [2023a] Lin, C.H., Gao, J., Tang, L., Takikawa, T., Zeng, X., Huang, X., Kreis, K., Fidler, S., Liu, M.Y., Lin, T.Y., 2023a. Magic3d: High-resolution text-to-3d content creation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 300--309.
- Lin et al. [2023b] Lin, K.E., Lin, Y.C., Lai, W.S., Lin, T.Y., Shih, Y.C., Ramamoorthi, R., 2023b. Vision transformer for nerf-based view synthesis from a single input image, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 806--815.
- Lin et al. [2023c] Lin, Y., Han, H., Gong, C., Xu, Z., Zhang, Y., Li, X., 2023c. Consistent123: One image to highly consistent 3d asset using case-aware diffusion priors. arXiv:2309.17261.
- Liu et al. [2023a] Liu, M., Shi, R., Chen, L., Zhang, Z., Xu, C., Wei, X., Chen, H., Zeng, C., Gu, J., Su, H., 2023a. One-2-3-45++: Fast single image to 3d objects with consistent multi-view generation and 3d diffusion. arXiv:2311.07885.
- Liu et al. [2023b] Liu, M., Xu, C., Jin, H., Chen, L., T, M.V., Xu, Z., Su, H., 2023b. One-2-3-45: Any single image to 3d mesh in 45 seconds without per-shape optimization. arXiv:2306.16928.
- Liu et al. [2023c] Liu, R., Wu, R., Van Hoorick, B., Tokmakov, P., Zakharov, S., Vondrick, C., 2023c. Zero-1-to-3: Zero-shot one image to 3d object, in: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9298--9309.
- Liu et al. [2023d] Liu, Y., Lin, C., Zeng, Z., Long, X., Liu, L., Komura, T., Wang, W., 2023d. Syncdreamer: Generating multiview-consistent images from a single-view image. arXiv:2309.03453.
- Long et al. [2023] Long, X., Guo, Y.C., Lin, C., Liu, Y., Dou, Z., Liu, L., Ma, Y., Zhang, S.H., Habermann, M., Theobalt, C., Wang, W., 2023. Wonder3d: Single image to 3d using cross-domain diffusion. arXiv:2310.15008.
- Lorensen and Cline [1998] Lorensen, W.E., Cline, H.E., 1998. Marching cubes: A high resolution 3d surface construction algorithm, in: Seminal graphics: pioneering efforts that shaped the field, pp. 347--353.
- Lu et al. [2023] Lu, Y., Zhang, J., Li, S., Fang, T., McKinnon, D., Tsin, Y., Quan, L., Cao, X., Yao, Y., 2023. Direct2.5: Diverse text-to-3d generation via multi-view 2.5d diffusion. arXiv:2311.15980.
- Ma et al. [2023] Ma, B., Deng, H., Zhou, J., Liu, Y.S., Huang, T., Wang, X., 2023. Geodream: Disentangling 2d and geometric priors for high-fidelity and consistent 3d generation. arXiv:2311.17971.
- Melas-Kyriazi et al. [2023] Melas-Kyriazi, L., Laina, I., Rupprecht, C., Vedaldi, A., 2023. Realfusion: 360deg reconstruction of any object from a single image, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8446--8455.
- Metzer et al. [2023] Metzer, G., Richardson, E., Patashnik, O., Giryes, R., Cohen-Or, D., 2023. Latent-nerf for shape-guided generation of 3d shapes and textures, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12663--12673.
- Michel et al. [2022] Michel, O., Bar-On, R., Liu, R., Benaim, S., Hanocka, R., 2022. Text2mesh: Text-driven neural stylization for meshes, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 13492--13502.
- Mildenhall et al. [2020] Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R., 2020. Nerf: Representing scenes as neural radiance fields for view synthesis, in: ECCV.
- Mohammad Khalid et al. [2022] Mohammad Khalid, N., Xie, T., Belilovsky, E., Popa, T., 2022. Clip-mesh: Generating textured meshes from text using pretrained image-text models, in: SIGGRAPH Asia 2022 Conference Papers, Association for Computing Machinery, New York, NY, USA. URL: https://doi.org/10.1145/3550469.3555392, doi:10.1145/3550469.3555392.
- Nichol et al. [2022] Nichol, A., Jun, H., Dhariwal, P., Mishkin, P., Chen, M., 2022. Point-e: A system for generating 3d point clouds from complex prompts. arXiv:2212.08751.
- Poole et al. [2022] Poole, B., Jain, A., Barron, J.T., Mildenhall, B., 2022. Dreamfusion: Text-to-3d using 2d diffusion. arXiv:2209.14988.
- Qian et al. [2023] Qian, G., Mai, J., Hamdi, A., Ren, J., Siarohin, A., Li, B., Lee, H.Y., Skorokhodov, I., Wonka, P., Tulyakov, S., Ghanem, B., 2023. Magic123: One image to high-quality 3d object generation using both 2d and 3d diffusion priors. arXiv:2306.17843.
- Qiu et al. [2023] Qiu, L., Chen, G., Gu, X., Zuo, Q., Xu, M., Wu, Y., Yuan, W., Dong, Z., Bo, L., Han, X., 2023. Richdreamer: A generalizable normal-depth diffusion model for detail richness in text-to-3d. arXiv:2311.16918.
- Radford et al. [2021] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I., 2021. Learning transferable visual models from natural language supervision, in: Meila, M., Zhang, T. (Eds.), Proceedings of the 38th International Conference on Machine Learning, PMLR. pp. 8748--8763. URL: https://proceedings.mlr.press/v139/radford21a.html.
- Rombach et al. [2021] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B., 2021. High-resolution image synthesis with latent diffusion models. arXiv:2112.10752.
- Saharia et al. [2022] Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E.L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T., Ho, J., Fleet, D.J., Norouzi, M., 2022. Photorealistic text-to-image diffusion models with deep language understanding, in: Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A. (Eds.), Advances in Neural Information Processing Systems, Curran Associates, Inc.. pp. 36479--36494. URL: https://proceedings.neurips.cc/paper_files/paper/2022/file/ec795aeadae0b7d230fa35cbaf04c041-Paper-Conference.pdf.
- Sargent et al. [2023] Sargent, K., Li, Z., Shah, T., Herrmann, C., Yu, H.X., Zhang, Y., Chan, E.R., Lagun, D., Fei-Fei, L., Sun, D., Wu, J., 2023. Zeronvs: Zero-shot 360-degree view synthesis from a single real image. arXiv:2310.17994.
- Seo et al. [2023a] Seo, H., Kim, H., Kim, G., Chun, S.Y., 2023a. Ditto-nerf: Diffusion-based iterative text to omni-directional 3d model. arXiv:2304.02827.
- Seo et al. [2023b] Seo, J., Jang, W., Kwak, M.S., Ko, J., Kim, H., Kim, J., Kim, J.H., Lee, J., Kim, S., 2023b. Let 2d diffusion model know 3d-consistency for robust text-to-3d generation. arXiv:2303.07937.
- Shen et al. [2021] Shen, T., Gao, J., Yin, K., Liu, M.Y., Fidler, S., 2021. Deep marching tetrahedra: a hybrid representation for high-resolution 3d shape synthesis, in: Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W. (Eds.), Advances in Neural Information Processing Systems, Curran Associates, Inc.. pp. 6087--6101. URL: https://proceedings.neurips.cc/paper_files/paper/2021/file/30a237d18c50f563cba4531f1db44acf-Paper.pdf.
- Shi et al. [2023a] Shi, R., Chen, H., Zhang, Z., Liu, M., Xu, C., Wei, X., Chen, L., Zeng, C., Su, H., 2023a. Zero123++: a single image to consistent multi-view diffusion base model. arXiv:2310.15110.
- Shi et al. [2023b] Shi, Y., Wang, P., Ye, J., Long, M., Li, K., Yang, X., 2023b. Mvdream: Multi-view diffusion for 3d generation. arXiv:2308.16512.
- Singer et al. [2023] Singer, U., Sheynin, S., Polyak, A., Ashual, O., Makarov, I., Kokkinos, F., Goyal, N., Vedaldi, A., Parikh, D., Johnson, J., Taigman, Y., 2023. Text-to-4d dynamic scene generation. arXiv:2301.11280.
- Tang et al. [2023a] Tang, J., Ren, J., Zhou, H., Liu, Z., Zeng, G., 2023a. Dreamgaussian: Generative gaussian splatting for efficient 3d content creation. arXiv:2309.16653.
- Tang et al. [2023b] Tang, J., Wang, T., Zhang, B., Zhang, T., Yi, R., Ma, L., Chen, D., 2023b. Make-it-3d: High-fidelity 3d creation from a single image with diffusion prior. arXiv:2303.14184.
- Tang et al. [2023c] Tang, S., Zhang, F., Chen, J., Wang, P., Furukawa, Y., 2023c. Mvdiffusion: Enabling holistic multi-view image generation with correspondence-aware diffusion. arXiv:2307.01097.
- Tsalicoglou et al. [2023] Tsalicoglou, C., Manhardt, F., Tonioni, A., Niemeyer, M., Tombari, F., 2023. Textmesh: Generation of realistic 3d meshes from text prompts. arXiv:2304.12439.
- Tseng et al. [2023] Tseng, H.Y., Li, Q., Kim, C., Alsisan, S., Huang, J.B., Kopf, J., 2023. Consistent view synthesis with pose-guided diffusion models, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 16773--16783.
- Vilesov et al. [2023] Vilesov, A., Chari, P., Kadambi, A., 2023. Cg3d: Compositional generation for text-to-3d via gaussian splatting. arXiv:2311.17907.
- Wang et al. [2023a] Wang, H., Du, X., Li, J., Yeh, R.A., Shakhnarovich, G., 2023a. Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12619--12629.
- Wang et al. [2018] Wang, N., Zhang, Y., Li, Z., Fu, Y., Liu, W., Jiang, Y.G., 2018. Pixel2mesh: Generating 3d mesh models from single rgb images, in: Proceedings of the European Conference on Computer Vision (ECCV).
- Wang et al. [2023b] Wang, Z., Lu, C., Wang, Y., Bao, F., Li, C., Su, H., Zhu, J., 2023b. Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation, in: Advances in Neural Information Processing Systems (NeurIPS).
- Watson et al. [2022] Watson, D., Chan, W., Martin-Brualla, R., Ho, J., Tagliasacchi, A., Norouzi, M., 2022. Novel view synthesis with diffusion models. arXiv:2210.04628.
- Wei et al. [2023] Wei, J., Wang, H., Feng, J., Lin, G., Yap, K.H., 2023. Taps3d: Text-guided 3d textured shape generation from pseudo supervision, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 16805--16815.
- Weng et al. [2023] Weng, H., Yang, T., Wang, J., Li, Y., Zhang, T., Chen, C.L.P., Zhang, L., 2023. Consistent123: Improve consistency for one image to 3d object synthesis. arXiv:2310.08092.
- Xu et al. [2022] Xu, D., Jiang, Y., Wang, P., Fan, Z., Shi, H., Wang, Z., 2022. Sinnerf: Training neural radiance fields on complex scenes from a single image, in: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (Eds.), Computer Vision -- ECCV 2022, Springer Nature Switzerland, Cham. pp. 736--753.
- Xu et al. [2023a] Xu, D., Jiang, Y., Wang, P., Fan, Z., Wang, Y., Wang, Z., 2023a. Neurallift-360: Lifting an in-the-wild 2d photo to a 3d object with 360deg views, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4479--4489.
- Xu et al. [2023b] Xu, J., Wang, X., Cheng, W., Cao, Y.P., Shan, Y., Qie, X., Gao, S., 2023b. Dream3d: Zero-shot text-to-3d synthesis using 3d shape prior and text-to-image diffusion models, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 20908--20918.
- Xu et al. [2023c] Xu, Y., Tan, H., Luan, F., Bi, S., Wang, P., Li, J., Shi, Z., Sunkavalli, K., Wetzstein, G., Xu, Z., Zhang, K., 2023c. Dmv3d: Denoising multi-view diffusion using 3d large reconstruction model. arXiv:2311.09217.
- Ye et al. [2023] Ye, J., Wang, P., Li, K., Shi, Y., Wang, H., 2023. Consistent-1-to-3: Consistent image to 3d view synthesis via geometry-aware diffusion models. arXiv:2310.03020.
- Yi et al. [2023] Yi, T., Fang, J., Wu, G., Xie, L., Zhang, X., Liu, W., Tian, Q., Wang, X., 2023. Gaussiandreamer: Fast generation from text to 3d gaussian splatting with point cloud priors. arxiv:2310.08529 .
- Yu et al. [2021] Yu, A., Ye, V., Tancik, M., Kanazawa, A., 2021. pixelnerf: Neural radiance fields from one or few images, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4578--4587.
- Yu et al. [2023a] Yu, K., Liu, J., Feng, M., Cui, M., Xie, X., 2023a. Boosting3d: High-fidelity image-to-3d by boosting 2d diffusion prior to 3d prior with progressive learning. arXiv:2311.13617.
- Yu et al. [2023b] Yu, X., Guo, Y.C., Li, Y., Liang, D., Zhang, S.H., Qi, X., 2023b. Text-to-3d with classifier score distillation. arXiv:2310.19415.
- Zhao et al. [2023] Zhao, M., Zhao, C., Liang, X., Li, L., Zhao, Z., Hu, Z., Fan, C., Yu, X., 2023. Efficientdreamer: High-fidelity and robust 3d creation via orthogonal-view diffusion prior. arXiv:2308.13223.
- Zhou et al. [2023] Zhou, L., Shih, A., Meng, C., Ermon, S., 2023. Dreampropeller: Supercharge text-to-3d generation with parallel sampling. arXiv:2311.17082.
- Zhu and Zhuang [2023] Zhu, J., Zhuang, P., 2023. Hifa: High-fidelity text-to-3d generation with advanced diffusion guidance. arXiv:2305.18766.