
Beyond Skeletons: Integrative Latent Mapping for Coherent 4D Sequence Generation

Qitong Yang¹, Mingtao Feng¹ (Corresponding Author), Zijie Wu², Shijie Sun³, Weisheng Dong¹, Yaonan Wang², and Ajmal Mian⁴

¹Xidian University, ²Hunan University, ³Chang'an University, ⁴The University of Western Australia
Abstract

Directly learning to model 4D content, including shape, color and motion, is challenging. Existing methods depend on skeleton-based motion control and offer limited continuity in detail. To address this, we propose a novel framework that generates coherent 4D sequences, animating 3D shapes under given conditions so that shape and color evolve dynamically over time, through integrative latent mapping. We first employ an integrative latent unified representation to encode the shape and color information of each detailed 3D geometry frame. The proposed skeleton-free latent 4D sequence joint representation allows us to leverage diffusion models in a low-dimensional space to control the generation of 4D sequences. Finally, temporally coherent 4D sequences are generated that conform well to the input images and text prompts. Extensive experiments on the ShapeNet, 3DBiCar and DeformingThings4D datasets across several tasks demonstrate that our method effectively learns to generate high-quality 3D shapes with color and 4D mesh animations, improving over the current state-of-the-art. Source code will be released.

Keywords:
4D Generation · Integrative Latent Mapping · Diffusion

1 Introduction

Generative modeling of dynamic 3D scenes can potentially revolutionize how we create animations, games, movies, simulations, and entire virtual worlds. Encouraging progress has been made in synthesizing diverse 3D objects via generative models. However, current methods typically synthesize static 3D scenes, and the quality and diversity of the generated shapes still require improvement. While image diffusion models [33, 35] have been successfully extended to video generation [41, 4], there is little research on similarly extending 3D synthesis [19, 16, 38] to 4D generation by adding a temporal dimension.

To address this gap, recent works [12, 32, 5, 3, 36] combine the controllability of dynamic 3D objects with the expressivity of emerging diffusion models. 4D representations are proposed to construct an underlying representation of the dynamic 3D objects, allowing the diffusion model to learn the distribution of 3D synthesis with an additional temporal dimension. In general, 4D representations can be categorized into prior-based and free-form methods depending on the 3D representation of the output shape. Prior-based methods are mostly derived from skeleton-based shape parameters and a series of pose parameters to model dynamic sequences [17, 48]. Although they produce plausible results, their motion representations are not compact or only support a small time span. Free-form methods leverage dynamic Neural Radiance Fields (NeRF) to combine the benefits of video and 3D generative models [18, 36, 6, 46], but require heavy computation to generate a 4D NeRF. Moreover, their generated motions are not well-controllable. The 4D Gaussian splatting based free-form method [32] requires camera pose information during training and suffers from low quality in unseen views. To this end, we pose the question: how can we represent a dynamic scene, without relying on priors, in an efficient low-dimensional way to generate coherent 4D sequences of animated 3D shapes, while considering the three-way tradeoff between 3D shape quality, appearance and coherence of the sequence?

In this paper, we propose several techniques to ensure coherent 4D sequence generation under given conditions, with dynamic evolution of shape and color over time, through integrative latent mapping. We first design an integrative latent unified representation to encode the shape and color information of each frame, which achieves better generation quality. Our method does not require an off-the-shelf 2D diffusion model, unlike texture rendering based colored 3D object generation [9]. Moreover, instead of using an individual vector to encode color, we unify shape and color into the same latent representation. Considering that 3D shapes share local structural similarities, we propose a structural similarity loss to refine the consistency of geometry and appearance of the 3D shapes while simultaneously decoding the unified values of shape and color. Next, since directly applying diffusion models to the 3D matrices of Signed Distance Function (SDF) values and vertex color matrices of each frame is computationally expensive, we introduce a skeleton-free latent 4D sequence joint representation, extending the integrative latent mapping along the temporal dimension. The unified shape and color latent codes across all frames are jointly concatenated into a low-dimensional matrix capturing the 4D distribution of the sequence. This 4D representation allows the diffusion model to efficiently operate on the temporal dimension, resulting in the generation of intricate 4D sequences with high fidelity and temporal coherence. Thirdly, equipped with this efficient 4D representation, we propose an image-text conditioned latent diffusion model for 4D sequence generation, which captures intra-frame information along with frame-to-frame and frame-to-global relationships through attention mechanisms. The 4D representation allows the diffusion model to enforce temporal consistency on the content during generation. Our contributions can be summarized as follows:

  • We propose a novel image-text conditioned 4D sequence generation framework that balances the three-way tradeoff between 3D shape quality, color and sequence coherence.

  • We present an integrative latent mapping approach that simultaneously models the 3D shape and color information within a unified latent space.

  • We introduce a low-dimensional 4D representation method, independent of skeleton priors, that facilitates efficient correlation learning along the temporal dimension.

  • We design frame coherent self-attention and condition injection cross attention to enforce temporal consistency in the generated content during the denoising process.

  • We assess various tasks (e.g. 3D object generation, unconditional/conditional 4D sequence generation) across multiple datasets to show that our approach achieves promising generation performance compared to the state-of-the-art.

2 Related Work

3D Generative Models have been extensively explored based on point clouds [2, 25, 43], voxel grids [37, 44, 24], meshes [45], and SDFs [8, 27, 10, 9]. Beyond the generation of 3D shapes, efforts have been made to generate colored objects [9, 26]. Most methods are based on texture rendering, NeRF, score distillation, or Gaussian splatting. Recent works [12, 32, 5, 3, 36] combine the controllability of colored dynamic 3D objects with the expressivity of emerging diffusion models. However, their downsides are long optimization times and poor performance beyond the training views. Arguably, these methods are ineffective for applications involving dynamic signals, e.g. colored 3D moving objects, since the temporal information is not captured.

4D Representations have primarily taken two directions. One approach treats 4D scenes as functions of spatial dimensions (x, y, z) extended by the temporal dimension (t) or latent codes [42, 15, 21]. The alternative approach involves integrating deformation fields with static, canonical 3D models [30, 29, 11, 39, 23]. A key challenge in 4D representation is maintaining computational efficiency along with temporal consistency. Various strategies have been proposed to address this, including explicit and hybrid representations that yield notable improvements in speed and reconstruction quality. Techniques such as planar decomposition for 4D space-time grids  [6, 14, 34], hash representations [40], and other innovative structures [13, 1] have shown promise. However, current research lacks methods for unifying the representation of shape and color in low-dimensional spaces.

4D Generative Models. A prominent research direction employs text-to-video diffusion models to refine 4D representations, exemplified by the optimization of structures like Hexplane [6] or K-planes [14]. This process typically involves crafting camera trajectories and applying Score Distillation Sampling on the rendered video sequences. A recent method [3] focuses on enhancing the photorealism of these representations. However, its motion elements are tightly coupled with the 3D content, resulting in a lack of diversity and control over the generated motions. An alternative approach uses priors, mostly derived from skeleton-based shape parameters and a series of pose parameters, to model dynamic sequences [17, 48]. Their motion representations, while plausible, lack compactness and are limited to short time spans.

3 Coherent 4D Sequence Generation

We propose an image-text conditioned coherent 4D sequence generation method that effectively handles 4D sequence representation (see Fig. 1). We represent the shape of each frame in the 4D sequence using SDF and assign colors to the reconstructed mesh vertices. However, directly applying diffusion models to the 3D matrices of SDF values and vertex colors is computationally intensive. Therefore, we employ an integrative latent unified representation to encode the shape and color information of each frame (Sec. 3.1). The proposed skeleton-free representation allows us to leverage diffusion models in a low-dimensional space to control the sequence generation (Sec. 3.2). Our method simplifies the process of generating 4D sequences, reducing complexity while maintaining high temporal coherence (Sec. 3.3).

Refer to caption
Figure 1: Overview of our method. The unified shape and color latent vectors of full sequences are concatenated jointly in the 4D representation \mathcal{M}. The input image mask and text prompt are then encoded by CLIP [31] to condition the diffusion process of \mathcal{M}. The 4D sequences are finally reconstructed from the generated \mathcal{M}^{\prime}.

3.1 Integrative Latent Mapping for Unified Modeling

To learn the high-quality 4D sequence distribution with color via diffusion, we propose an integrative latent mapping approach that models both 3D shape and color information within a unified latent space. The 3D geometry is first represented by a continuous SDF with a shape latent vector z_{i}\in\mathbb{R}^{d_{1}}. To establish an initial shape representation, we use an auto-decoder D_{\theta} to decode a coarse {SDF}_{c}^{i}.

{SDF}_{c}^{i}=D_{\theta}(z_{i},x_{i}),\qquad{SDF}^{i}=U_{\phi}({SDF}_{c}^{i}), (1)

where the latent vector z_{i}\in\mathbb{R}^{d_{1}} encapsulates the implicit shape information of the i-th object and x_{i}\in\mathbb{R}^{3} denotes the spatial query coordinate. D_{\theta} is used to reconstruct the initial shape, yielding a coarse {SDF}_{c}^{i}. Building upon this coarse representation, the 3D U-Net U_{\phi} further refines the geometric details, resulting in the final {SDF}^{i} and providing an accurate shape representation S_{re}^{i}.
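For concreteness, a minimal PyTorch sketch of this coarse-to-fine pipeline is given below. The MLP width, activation choice and the small convolutional stack standing in for the 3D U-Net U_{\Phi} are illustrative assumptions; only the overall structure of Eq. (1) follows the text.

```python
# Hedged sketch of the coarse-to-fine shape pipeline of Eq. (1), assuming an
# 8-layer MLP auto-decoder and a dense 64^3 SDF grid; layer widths and the
# placeholder refinement network are illustrative, not the authors' exact settings.
import torch
import torch.nn as nn

class SDFAutoDecoder(nn.Module):
    """D_theta: maps (shape latent z_i, query coordinates x_i) to coarse SDF values."""
    def __init__(self, latent_dim=256, hidden=512, n_layers=8):
        super().__init__()
        dims = [latent_dim + 3] + [hidden] * (n_layers - 1) + [1]
        layers = []
        for i in range(len(dims) - 1):
            layers.append(nn.Linear(dims[i], dims[i + 1]))
            if i < len(dims) - 2:
                layers.append(nn.ReLU(inplace=True))
        self.mlp = nn.Sequential(*layers)

    def forward(self, z, x):
        # z: (B, latent_dim), x: (B, N, 3) query points
        z = z.unsqueeze(1).expand(-1, x.shape[1], -1)
        return self.mlp(torch.cat([z, x], dim=-1)).squeeze(-1)  # (B, N) coarse SDF

# The coarse SDF values are reshaped to a 64^3 grid and refined by a 3D network
# U_Phi (here a placeholder stack of 3D convolutions) to recover geometric detail.
refine_unet = nn.Sequential(
    nn.Conv3d(1, 16, 3, padding=1), nn.ReLU(inplace=True),
    nn.Conv3d(16, 16, 3, padding=1), nn.ReLU(inplace=True),
    nn.Conv3d(16, 1, 3, padding=1),
)

decoder = SDFAutoDecoder()
z_i = torch.randn(1, 256)               # per-object shape latent, optimized with the decoder
lin = torch.linspace(-1, 1, 64)
grid = torch.stack(torch.meshgrid(lin, lin, lin, indexing="ij"), dim=-1).reshape(1, -1, 3)
with torch.no_grad():
    sdf_coarse = decoder(z_i, grid).reshape(1, 1, 64, 64, 64)   # SDF_c^i
    sdf_refined = refine_unet(sdf_coarse)                        # SDF^i in Eq. (1)
```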

For geometrically consistent color learning, we unify the shape and color representation rather than using an additional isolated latent vector to encode color information (see Fig. 2). We employ a remeshing method [20] to subdivide the reconstructed shape surface S_{re}^{i} and increase the number of vertices to enhance the color details. This process is followed by aligning the vertices and using a KD-tree to calculate the color values P_{c}^{i} from the ground truth object S_{gt}^{i}. We sample random coordinates p_{i} and random colors (R^{p}_{i},G^{p}_{i},B^{p}_{i}) from P_{c}^{i}.

(R^{p}_{i},G^{p}_{i},B^{p}_{i})_{gt}=\Phi_{\theta}((R_{i},G_{i},B_{i}),v_{i},p_{i}), (2)

where (R^{p}_{i},G^{p}_{i},B^{p}_{i})_{gt} is the color at the random coordinate p_{i} obtained from the vertices v_{i} of the i-th object, and \Phi_{\theta} is the interpolation function. Concurrently, we construct a continuous color field in the neighborhood of the shape surface through interpolation to ensure that the decoder can fully learn the color distribution of the object and generalizes to a broader range of geometries.
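A minimal sketch of the KD-tree based color supervision described above (and in Fig. 2) is shown below, assuming the remeshed reconstruction and the ground-truth mesh are given as vertex arrays; the variable names, jitter radius and subset size are illustrative assumptions.

```python
# Sketch of the KD-tree color transfer used for supervision; the exact sampling
# parameters are assumptions made only to keep the example self-contained.
import numpy as np
from scipy.spatial import cKDTree

def transfer_vertex_colors(vertices_re, vertices_gt, colors_gt):
    """Assign each remeshed vertex the color of its nearest ground-truth vertex."""
    tree = cKDTree(vertices_gt)
    _, nearest = tree.query(vertices_re, k=1)
    return colors_gt[nearest]              # P_c^i: per-vertex (R, G, B) targets

# Toy example: 10k remeshed vertices, 5k ground-truth colored vertices.
v_re = np.random.rand(10000, 3)
v_gt = np.random.rand(5000, 3)
c_gt = np.random.rand(5000, 3)
p_c = transfer_vertex_colors(v_re, v_gt, c_gt)

# During each training epoch a random subset of vertices (plus slightly jittered
# points) is sampled so the color decoder also learns a continuous color field
# in the neighborhood of the surface.
idx = np.random.choice(len(v_re), size=2048, replace=False)
p_query = v_re[idx] + 0.01 * np.random.randn(2048, 3)   # neighborhood samples
c_target = p_c[idx]
```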

Refer to caption
Figure 2: Pipeline for geometrically consistent color learning. We use a remeshing method to augment the number of vertices, then leverage a KD-tree structure to efficiently retrieve the true color values of the vertices for training. During each training epoch, a subset of vertices is randomly selected. C_{\theta} is the color decoder, c_{i} is the latent vector, S_{re}^{i} is the reconstructed mesh, S_{gt}^{i} is the ground-truth mesh, and P_{c}^{i} is the set of colored vertices computed from the KD-tree.

We employ a decoder C_{\theta} capable of decoding the color of the shape, which is inherently mapped to the latent shape representation z_{i}.

(R_{i}^{p},G_{i}^{p},B_{i}^{p})=C_{\theta}(c_{i},p_{i}), (3)

where (R_{i}^{p},G_{i}^{p},B_{i}^{p}) is the color at the random coordinates p_{i} of the i-th object, and c_{i}\in\mathbb{R}^{d_{2}} is the randomly initialized color latent vector, optimized during training.

At training time, we simultaneously optimize the shape decoder D_{\theta}, the 3D U-Net U_{\Phi}, and the color decoder C_{\theta}.

\mathcal{L}=\lambda_{1}\mathcal{L}_{s}+\lambda_{2}\mathcal{L}_{c}, (4)

where \mathcal{L}_{s} is the shape representation loss, \mathcal{L}_{c} is the color mapping loss, and \lambda_{1} and \lambda_{2} are balancing weights. \mathcal{L}_{s} maximizes the joint log posterior over all the training shapes to optimize the shape decoder and is defined as

\mathcal{L}_{s}=L_{1}(D_{\theta}(z_{i},x_{i}),U_{\Phi}(SDF^{i}_{c}),SDF^{i}_{gt})+\frac{1}{\sigma^{2}}\|z_{i}\|_{2}^{2}, (5)

where x_{i} denotes the query coordinates, SDF^{i}_{gt} is the ground truth SDF, and L_{1} is the L1-norm. The color decoder C_{\theta} is optimized by a combination of a regularized L1-norm loss and a structural similarity loss, since 3D shapes share structural similarities at local scales. We find that the color decoder optimized with the structural similarity loss achieves significantly better results compared to using only the L1-norm loss.

\mathcal{L}_{c}=\lambda_{3}(L_{1}(C_{\theta}(c_{i},p_{i}),(R^{p}_{i},G^{p}_{i},B^{p}_{i})_{gt})+\frac{1}{\sigma^{2}}\|c_{i}\|_{2}^{2})
+\lambda_{4}L_{SSIM}(C_{\theta}(c_{i},p_{i}),(R^{p}_{i},G^{p}_{i},B^{p}_{i})_{gt}), (6)

where \lambda_{3} and \lambda_{4} are balancing weights for the two terms.
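The paper does not spell out how the SSIM term is formed over unordered vertex colors, so the sketch below is only one possible instantiation: it computes a global SSIM-style statistic over the batch of sampled colors and uses 1 - SSIM as the loss term, alongside the regularized L1 term of Eq. (6).

```python
# Hedged sketch of the color loss in Eq. (6); the global-SSIM formulation and the
# unit loss weights are assumptions, not the authors' exact implementation.
import torch

def ssim_global(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """SSIM-style statistic over two batches of sampled colors."""
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2) /
            ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)))

def color_loss(rgb_pred, rgb_gt, c_latent, sigma=1.0, lam3=1.0, lam4=1.0):
    l1 = torch.nn.functional.l1_loss(rgb_pred, rgb_gt)        # basic color term
    reg = c_latent.pow(2).sum() / sigma ** 2                   # latent regularizer
    return lam3 * (l1 + reg) + lam4 * (1.0 - ssim_global(rgb_pred, rgb_gt))
```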

At inference time, after training and fixing the model parameters, a shape latent code \hat{z_{i}} and a color latent code \hat{c_{i}} for a target object S_{i} can be estimated via Maximum-a-Posteriori (MAP) estimation as:

\begin{split}\hat{z_{i}}=\mathop{\arg\min}_{z}\displaystyle\sum^{N}_{i=1}\mathcal{L}(D_{\theta}(z_{i},x_{i}),SDF^{i}_{gt})+\frac{1}{\sigma^{2}}\|z_{i}\|_{2}^{2},\\ \hat{c_{i}}=\mathop{\arg\min}_{c}\displaystyle\sum^{N}_{i=1}\mathcal{L}(C_{\theta}(c_{i},p_{i}),(R^{p}_{i},G^{p}_{i},B^{p}_{i})_{gt})+\frac{1}{\sigma^{2}}\|c_{i}\|_{2}^{2}.\end{split} (7)

This demonstrates that despite employing an auto-decoder architecture, our model is capable of accepting unknown inputs during inference and providing an implicit representation for new objects. Accordingly, the integrative latent mapping for unified 3D shape and color modeling is capable of simultaneously decoding the SDF^{i} values and the color (R_{i},G_{i},B_{i}) of shape S_{i}.
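A sketch of this MAP estimation is given below, assuming d_theta and c_theta are the trained decoders with the signatures of Eqs. (1) and (3); the optimizer, step count, learning rate and sigma are illustrative assumptions.

```python
# Hedged sketch of the per-object latent optimization of Eq. (7): with the decoders
# frozen, only the latent codes of a new object are optimized.
import torch
import torch.nn.functional as F

def infer_latents(d_theta, c_theta, x_q, sdf_gt, p_q, rgb_gt,
                  latent_dim=256, steps=800, lr=5e-3, sigma=1.0):
    """Estimate (z_hat, c_hat) for an unseen object from its SDF and color samples."""
    z_hat = torch.zeros(1, latent_dim, requires_grad=True)
    c_hat = torch.zeros(1, latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z_hat, c_hat], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = (F.l1_loss(d_theta(z_hat, x_q), sdf_gt)                    # shape term
                + F.l1_loss(c_theta(c_hat, p_q), rgb_gt)                  # color term
                + (z_hat.pow(2).sum() + c_hat.pow(2).sum()) / sigma ** 2) # latent priors
        loss.backward()
        opt.step()
    return z_hat.detach(), c_hat.detach()
```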

Refer to caption
Figure 3: Pipeline of 4D sequence reconstruction from the 4D representation \mathcal{M}^{\prime}. The generated \mathcal{M}^{\prime} is split along the temporal dimension and then decoded into the shape and color components with the trained models U_{\Phi}, D_{\theta} and C_{\theta} to recover the full sequence.

3.2 Skeleton-free Latent 4D Sequence Joint Representation

Our 4D sequence representation marks a significant innovation by eliminating reliance on traditional skeletal structures. We propose a skeleton-free latent joint representation and then extend the implicit expression along the temporal dimension. Equipped with the trained models D_{\theta}, U_{\Phi}, and C_{\theta}, we decode the shape and color of an animated 3D object S_{i} at an arbitrary frame from its shape latent vector z_{i}\in\mathbb{R}^{d_{1}} and color latent vector c_{i}\in\mathbb{R}^{d_{2}}. This allows us to represent a 4D sequence as a matrix \mathcal{M}\in\mathbb{R}^{N\times(d_{1}+d_{2})}, which jointly concatenates the latent vectors \{z_{i}\in\mathbb{R}^{d_{1}}\} and \{c_{i}\in\mathbb{R}^{d_{2}}\} across all frames, effectively capturing the 4D distribution of the sequence.

\mathcal{M}=\begin{bmatrix}z_{1}&z_{2}&\cdots&z_{N}\\ c_{1}&c_{2}&\cdots&c_{N}\end{bmatrix} (8)

where N denotes the number of frames, with each column of \mathcal{M} corresponding to a specific frame n, and z_{n} and c_{n} representing the shape and color latent vectors of that frame, respectively. The full 4D sequence is now represented by a single 2D matrix \mathcal{M} in a low-dimensional space. \mathcal{M} provides a compact and efficient representation of the entire 4D sequence. We can then train a diffusion model (see Sec. 3.3) with the previously created matrix \mathcal{M} as data points, enabling manipulation and generation of animated objects S_{1}\cdots S_{N}.
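The sketch below illustrates how \mathcal{M} might be assembled from, and split back into, per-frame latents; the split d_{1}=d_{2}=128 is an assumption chosen only to make the example concrete.

```python
# Sketch of assembling the skeleton-free 4D representation M of Eq. (8): the per-frame
# shape and color latents are stacked along the temporal axis into one low-dimensional
# matrix. The latent sizes below are illustrative assumptions.
import torch

N, d1, d2 = 20, 128, 128
z_frames = [torch.randn(d1) for _ in range(N)]   # shape latents z_1 ... z_N
c_frames = [torch.randn(d2) for _ in range(N)]   # color latents c_1 ... c_N

# One row per frame: [z_n | c_n], giving M of size N x (d1 + d2).
M = torch.stack([torch.cat([z, c], dim=0) for z, c in zip(z_frames, c_frames)], dim=0)
assert M.shape == (N, d1 + d2)

# Splitting M back into per-frame latents for decoding (Fig. 3).
z_back, c_back = M[:, :d1], M[:, d1:]
```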

After the diffusion denoising process, the generated matrix \mathcal{M^{\prime}} is used to reconstruct the full 4D sequence (see Fig. 3). We split the matrix \mathcal{M^{\prime}} along the temporal dimension to get the shape and color latent vectors z_{n} and c_{n} for each frame. We then use the trained models U_{\Phi}, D_{\theta} and C_{\theta} to decode each frame S_{n} of the 4D sequence. We first decode the shape \{v_{n},f_{n}\} by:

\{v_{n},f_{n}\}=\text{MCubes}(U_{\Phi}(D_{\theta}(z_{n},x_{n}))), (9)

where v_{n} and f_{n} are the vertices and triangle faces of shape S_{n} at time step n, MCubes denotes the marching cubes algorithm that reconstructs vertices and triangle faces from SDF values, and x_{n} is the query coordinate, the same as x_{i} in Sec. 3.1.

We significantly reduce the dimensionality and computational complexity and export a high-quality 4D sequence by using C_{\theta}, c_{n} and v_{n} to decode the color (R_{n},G_{n},B_{n}) and obtain the object frame S_{n}:

S_{n}=\{v_{n},f_{n},(R_{n},G_{n},B_{n})\},\quad(R_{n},G_{n},B_{n})=C_{\theta}(c_{n},v_{n}), (10)

where Eqs. 9 and 10 represent the process of decoding one frame S_{n} of the 4D sequence S_{1}\cdots S_{N} using \mathcal{M}. For brevity, we denote the above operations as \mathcal{F}. We repeat this process to reconstruct the full 4D sequence as

\{S_{1}\cdots S_{N}\}=\mathcal{F}(x_{n},D_{\theta},U_{\Phi},C_{\theta},\mathcal{M}). (11)

The 4D representation \mathcal{M} allows the diffusion model to efficiently operate across the temporal dimension while imposing temporal consistency (details in Sec. 3.3), resulting in the generation of intricate 4D sequences S_{1}\cdots S_{N} with high fidelity and temporal coherence.
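A hedged sketch of decoding a single frame from \mathcal{M} (Eqs. 9 and 10) is given below, assuming the trained D_{\theta}, U_{\Phi} and C_{\theta} are available as callables; marching cubes is taken from scikit-image, and the 64^3 grid matches the resolution reported in Sec. 4.

```python
# Sketch of decoding one frame S_n from a row (z_n, c_n) of M; callable signatures
# follow Eqs. (1) and (3), everything else is an illustrative assumption.
import torch
from skimage import measure

@torch.no_grad()
def decode_frame(z_n, c_n, d_theta, u_phi, c_theta, res=64):
    # Dense query grid x_n in [-1, 1]^3.
    lin = torch.linspace(-1, 1, res)
    x_n = torch.stack(torch.meshgrid(lin, lin, lin, indexing="ij"), dim=-1).reshape(1, -1, 3)
    sdf = u_phi(d_theta(z_n, x_n).reshape(1, 1, res, res, res))       # refined SDF grid
    # Marching cubes on the refined SDF gives vertices and triangle faces.
    verts, faces, _, _ = measure.marching_cubes(sdf[0, 0].cpu().numpy(), level=0.0)
    verts = torch.from_numpy(verts.copy()).float() / (res - 1) * 2 - 1  # back to [-1, 1]
    rgb = c_theta(c_n, verts.unsqueeze(0))                              # (R_n, G_n, B_n)
    return verts, torch.from_numpy(faces.copy()), rgb

# Repeating decode_frame over every column of M recovers the full sequence S_1 ... S_N.
```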

3.3 Learning Conditional Diffusion for 4D Sequence

We can now apply diffusion to the 4D sequence representation \mathcal{M} to generate sequences, rather than to the individual 3D animated frames. During the forward diffusion process, data is transformed into Gaussian noise following a Markov chain, while the reverse process performs iterative denoising to recover data from a single Gaussian distribution. For the forward process of the latent diffusion, the original 4D representation \mathcal{M}_{0}\sim q(\mathcal{M}_{0}) is gradually corrupted over T pre-defined noising steps following the Markov chain assumption until a Gaussian distribution is reached. Based on the Markov property, the joint distribution of \mathcal{M}_{1:T} is derived from the original 4D representation \mathcal{M}_{0}:

\begin{gathered}q\left(\mathcal{M}_{0:T}\right)=q\left(\mathcal{M}_{0}\right)\prod_{t=1}^{T}q\left(\mathcal{M}_{t}\mid\mathcal{M}_{t-1}\right),\\ q\left(\mathcal{M}_{t}\mid\mathcal{M}_{t-1}\right)=\mathcal{N}\left(\sqrt{1-\beta_{t}}\mathcal{M}_{t-1},~\beta_{t}\mathbf{I}\right),\end{gathered} (12)

where \mathcal{N}\left(\mu,\sigma^{2}\right) denotes a Gaussian distribution and \beta_{t} is the known variance defined during the diffusion process. The reverse process starts from a sample drawn from a Gaussian distribution and aims to recover the target 4D representation \mathcal{M}_{0}:

\begin{gathered}p_{\theta}\left(\mathcal{M}_{0:T}\right)=p\left(\mathcal{M}_{T}\right)\prod_{t=1}^{T}p_{\theta}\left(\mathcal{M}_{t-1}\mid\mathcal{M}_{t}\right),\\ p_{\theta}\left(\mathcal{M}_{t-1}\mid\mathcal{M}_{t}\right)=\mathcal{N}\left(\epsilon_{\theta}\left(\mathcal{M}_{t},t\right),\sigma_{t}^{2}\mathbf{I}\right),\end{gathered} (13)

where \epsilon_{\theta} is the noise prediction network of the reverse chain.
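For reference, the closed-form sampling of \mathcal{M}_{t} from \mathcal{M}_{0} implied by Eq. (12) can be sketched as follows; the linear \beta schedule and T=1000 are common defaults and are assumptions here, not necessarily the authors' settings.

```python
# Minimal sketch of the forward noising of Eq. (12) applied to a batch of 4D
# representations M; the schedule endpoints are illustrative.
import torch

T = 1000
betas = torch.linspace(1e-4, 2e-2, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def q_sample(M0, t, noise=None):
    """Sample M_t ~ q(M_t | M_0) in closed form and return the noise used."""
    if noise is None:
        noise = torch.randn_like(M0)
    a_bar = alphas_bar[t].view(-1, *([1] * (M0.dim() - 1)))
    return a_bar.sqrt() * M0 + (1 - a_bar).sqrt() * noise, noise

M0 = torch.randn(8, 256, 256)          # a batch of 4D representations M_0
t = torch.randint(0, T, (8,))
Mt, eps = q_sample(M0, t)
```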

Refer to caption
Figure 4: Illustration of the attention mechanisms. The frame coherent self-attention is first used to enhance the consistency within \mathcal{M}_{t}, and the condition features F_{t} are then injected by the cross-attention to guide the 4D sequence generation.

Our diffusion model operates in the latent space to generate a 4D latent of the animated object sequence, conditioned on the image-text input (see Fig. 1). We apply a unified description template p_{n}, embedded with the frame index n, to condition coherent sequences. Additionally, to maintain visual consistency, we replicate the image features, using the same image condition I for generation across all frames:

F_{n}^{T}=\tau_{\beta}(p_{n}\circ t_{n}),\qquad F_{n}^{I}=\tau_{\theta}(I), (14)

where \tau_{\beta} and \tau_{\theta} are the text encoder and image encoder of CLIP [31], t_{n} is the text condition of the n-th frame, \circ denotes concatenation, and F_{n}^{T},F_{n}^{I} are the text and image features of the n-th frame.
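A sketch of this condition encoding with an off-the-shelf CLIP from HuggingFace transformers is shown below; the checkpoint name and the per-frame prompt template are assumptions, since the paper only states that the CLIP text and image encoders are used.

```python
# Hedged sketch of Eq. (14): per-frame text features plus a shared, replicated
# image feature. The prompt wording and checkpoint are illustrative assumptions.
import numpy as np
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

N = 20
prompts = [f"frame {n} of 20: a cartoon character raising its arms" for n in range(N)]
image = Image.fromarray((np.random.rand(224, 224, 3) * 255).astype("uint8"))  # masked input image

text_in = processor(text=prompts, return_tensors="pt", padding=True)
img_in = processor(images=image, return_tensors="pt")
F_T = model.get_text_features(**text_in)      # per-frame text features F_n^T
F_I = model.get_image_features(**img_in)      # shared image feature
F_I = F_I.expand(N, -1)                       # replicated across frames for visual consistency
```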

We employ attention mechanisms \emph{Att.}(Q,K,V)=\text{Softmax}(\frac{QK^{T}}{\sqrt{d}})V to guide the generative process, striking a balance between conditional guidance and the temporal coherence of the latent representation \mathcal{M}_{t} (see Fig. 4). A frame coherent self-attention module is first used to enhance the consistency among the latent vectors z_{t} and c_{t} within the 4D sequence representation \mathcal{M}_{t}:

Q=W_{Q}^{(i)}\phi_{i}(\mathcal{M}_{t}),~K=W_{K}^{(i)}\phi_{i}(\mathcal{M}_{t}),~V=W_{V}^{(i)}\phi_{i}(\mathcal{M}_{t}), (15)

where \phi_{i}(\mathcal{M}_{t}) denotes an intermediate state of \mathcal{M}_{t} in the noise prediction network \epsilon_{\theta}, and W_{Q}^{(i)}\in\mathbb{R}^{d\times d_{\epsilon}^{i}}, W_{K}^{(i)}\in\mathbb{R}^{d\times d_{\epsilon}^{i}}, and W_{V}^{(i)}\in\mathbb{R}^{d\times d_{\epsilon}^{i}} are learnable projection matrices. The encoded features F_{n}^{T} and F_{n}^{I} are then used to control the denoising process of \mathcal{M}_{t} with the condition injection cross-attention module:

Q=W_{Q}^{(i)}\phi_{i}(\mathcal{M}_{t}),~K=W_{K}^{(i)}\tau_{i}(F_{n}^{T}\circ F_{n}^{I}),~V=W_{V}^{(i)}\tau_{i}(F_{n}^{T}\circ F_{n}^{I}), (16)

where \tau_{i} is the task-specific encoder for the i-th modality.
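The two attention modules can be sketched with standard scaled dot-product attention as below; the feature widths, the shared projection class and the residual connections are illustrative choices, not the exact architecture.

```python
# Hedged sketch of the frame coherent self-attention (Eq. 15) and condition
# injection cross-attention (Eq. 16) over one token per frame of M_t.
import torch
import torch.nn as nn

class FrameAttention(nn.Module):
    def __init__(self, dim=256, cond_dim=None):
        super().__init__()
        kv_dim = cond_dim if cond_dim is not None else dim
        self.to_q = nn.Linear(dim, dim, bias=False)      # W_Q
        self.to_k = nn.Linear(kv_dim, dim, bias=False)   # W_K
        self.to_v = nn.Linear(kv_dim, dim, bias=False)   # W_V
        self.scale = dim ** -0.5

    def forward(self, phi_Mt, context=None):
        # phi_Mt: (B, N, dim) intermediate features of M_t, one token per frame.
        context = phi_Mt if context is None else context
        q, k, v = self.to_q(phi_Mt), self.to_k(context), self.to_v(context)
        attn = torch.softmax(q @ k.transpose(-1, -2) * self.scale, dim=-1)
        return attn @ v

self_attn = FrameAttention(dim=256)                 # frame coherent self-attention
cross_attn = FrameAttention(dim=256, cond_dim=1024) # condition injection cross-attention

phi = torch.randn(2, 20, 256)                       # features of M_t inside epsilon_theta
cond = torch.randn(2, 20, 1024)                     # concatenated CLIP text+image features
phi = phi + self_attn(phi)                          # consistency among frame latents
phi = phi + cross_attn(phi, context=cond)           # guidance from image-text conditions
```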

We adopt classifier-free guidance for the image-text conditional generation of the 4D representation \mathcal{M}_{t}. The simplified objective proposed by [33] reads as follows:

\mathcal{L}_{LDM}=\mathbb{E}_{\mathcal{M},f,\epsilon,t}\left[\|\epsilon-\epsilon_{\theta}(\mathcal{M}_{t},t,F\{D\circ\tau_{i}(f_{i})\})\|^{2}\right], (17)

where D is a dropout operation enabling classifier-free guidance, and F refers to a simple concatenation.
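A sketch of the corresponding training step is given below: conditions are randomly dropped so the same network learns both conditional and unconditional denoising, which is what enables classifier-free guidance at sampling time. The 10% drop rate and the zeroed-out null condition are assumptions.

```python
# Hedged sketch of a training step for Eq. (17) with condition dropout; eps_net
# stands for epsilon_theta and q_sample for the closed-form sampler of Eq. (12).
import torch

def cfg_training_loss(eps_net, M0, cond, q_sample, T=1000, drop_prob=0.1):
    t = torch.randint(0, T, (M0.shape[0],), device=M0.device)
    Mt, eps = q_sample(M0, t)
    # Randomly replace conditions by a null (zero) condition for classifier-free guidance.
    keep = (torch.rand(M0.shape[0], device=M0.device) > drop_prob).float()
    cond = cond * keep.view(-1, *([1] * (cond.dim() - 1)))
    eps_pred = eps_net(Mt, t, cond)
    return torch.nn.functional.mse_loss(eps_pred, eps)
```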

4 Experiments

We conduct extensive experiments on three tasks: single 3D object generation, unconditional 4D generation, and conditional 4D sequence generation.

Data Preparation: We use the car, chair and airplane categories of the ShapeNet [7] dataset for 3D geometry evaluation, and the T-pose objects of 3DBiCar [26] for colored 3D object generation evaluation. We use 16-frame animal animation sequences from the DeformingThings4D (DT4D) [22] dataset, comprising 1772 sequences, for unconditional 4D generation. We use 20-frame colored cartoon animation sequences from the 3DBiCar dataset, comprising 1500 sequences, for unconditional and conditional 4D generation. We use the manually annotated image descriptions and corresponding image masks from 3DBiCar. Details are in the supplementary material.

Evaluation Metrics: Following [25, 47], we use Minimum Matching Distance (MMD), Coverage (COV), and 1-Nearest-Neighbor Accuracy (1-NNA) to evaluate geometry generation. Lower MMD is better, higher COV is better, and a 1-NNA closest to 50% is optimal. To evaluate a 4D geometry sequence, we compute the 3D geometry metrics for the object of each frame and then average them across the full sequence to obtain the final results. To evaluate the inter-frame consistency of a 4D sequence, we calculate the Chamfer Distance and COV for each frame and then report their variance over the full sequence, denoted as CD-VAR and COV-VAR. For colored shapes, we evaluate color quality by randomly sampling 16,384 surface points, repeated 10 times, to compute the Peak Signal-to-Noise Ratio (PSNR) and structural similarity (SSIM) metrics; we also report the CLIP score [31] (denoted as CLIP-S).
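As a reference for how the inter-frame consistency metrics can be computed, the sketch below evaluates a per-frame symmetric Chamfer Distance and reports its variance over the sequence as CD-VAR; the choice of reference point sets is an assumption here and would follow the MMD-style protocol in practice.

```python
# Hedged sketch of CD-VAR: per-frame Chamfer Distance, then its variance over
# the sequence. Point counts and the reference sets are illustrative.
import torch

def chamfer_distance(a, b):
    """Symmetric Chamfer Distance between point sets a: (Na, 3) and b: (Nb, 3)."""
    d = torch.cdist(a, b)                      # pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

def cd_var(generated_frames, reference_frames):
    cds = torch.stack([chamfer_distance(g, r)
                       for g, r in zip(generated_frames, reference_frames)])
    return cds.mean().item(), cds.var().item()   # mean per-frame CD and CD-VAR

gen = [torch.rand(2048, 3) for _ in range(16)]   # generated frames (sampled points)
ref = [torch.rand(2048, 3) for _ in range(16)]   # reference point sets per frame
mean_cd, var_cd = cd_var(gen, ref)
```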

Implementation Details: For the shape auto-decoder D_{\theta}, we employ an 8-layer MLP to decode the SDF values and a 6-layer 3D U-Net U_{\Phi} operating at a resolution of 64\times 64\times 64. For the color auto-decoder C_{\theta}, we use an 8-layer MLP. We use 256 as the dimension of the latent representation for the unified shape latent vector z and color latent vector c; thus the dimension of the 4D representation \mathcal{M} is 256\times 256. For the denoising network, we use an 8-layer U-Net augmented with the attention mechanisms. The loss weights \lambda_{1},\lambda_{2},\lambda_{3} and \lambda_{4} are set to 1. Training is performed on 4 NVIDIA Tesla V100 GPUs for all experiments.

4.1 Ablation Studies

We first conduct extensive ablation studies to validate our design choices.

Refer to caption
Figure 5: Qualitative comparison for the 3D shape generation ablation study. We sample 3D objects from models trained on 3DBiCar and ShapeNet.

Refinement Module U_{\Phi}. The refinement 3D U-Net U_{\Phi} improves the quality of the coarse SDF. Tab. 1 shows that it improves results on all metrics of 3D generation, especially MMD and COV. U_{\Phi} learns more details for SDF^{i} compared to a single decoder D_{\theta}, resulting in highly detailed objects S_{i} (see Fig. 5). Tab. 3 shows the ablation study of U_{\Phi} in 4D sequence generation, where U_{\Phi} improves all metrics: by improving the geometric quality of every frame in the 4D sequence, it also indirectly promotes consistency between frames.

Table 1: Ablation study of refinement module U_{\Phi} for single 3D shape.
Category   U_{\Phi}   MMD   COV(%)   1-NNA(%)
Airplane   \times   0.198   55.64   61.72
Airplane   \checkmark   0.032   65.22   61.29
Car   \times   0.221   59.86   60.58
Car   \checkmark   0.201   66.45   54.45
Chair   \times   0.201   70.12   27.98
Chair   \checkmark   0.04   95.56   33.91
3DBiCar   \times   0.098   72.13   39.32
3DBiCar   \checkmark   0.067   74.63   41.92
Table 2: Quantitative comparison for the color ablation study. L1: L1-norm loss, R: remeshing operation, SSIM: structural similarity loss, F: shape-agnostic learning strategy.
L1   R   SSIM   F   PSNR   SSIM
\checkmark   \checkmark   \times   \times   29.15   0.79
\checkmark   \times   \checkmark   \times   26.37   0.72
\times   \checkmark   \checkmark   \times   28.76   0.75
\checkmark   \checkmark   \checkmark   \checkmark   9.64   0.64
\checkmark   \checkmark   \checkmark   \times   34.51   0.85
Table 3: Ablation study of the refinement module U_{\Phi} for 4D sequence generation.
Category   U_{\Phi}   MMD   CD-VAR   COV(%)   COV-VAR   1-NNA(%)
3DBiCar   \times   0.012   0.007   69   2.791   36
3DBiCar   \checkmark   0.009   0.004   73   0.834   41
DT4D   \times   0.037   0.003   69   3.179   36
DT4D   \checkmark   0.030   0.001   76   2.91   43

Integrative Latent Mapping. We analyze the L1 loss, the remeshing operation (R) and the structural similarity loss (SSIM). We also define a shape-agnostic learning strategy (denoted as F), where the colors of points are directly learned from a fixed uniform distribution for each object. We ablate these color-related components in Table 2. The combination of L1, R and SSIM achieves the best result for color learning: L1 ensures the correctness of the basic color, SSIM exploits the local structural similarities of 3D shapes to refine the basic color, and the remeshing operation increases the number of vertices to enhance the color details. The strategy F reduces the accuracy of the colored shape as it cannot correlate color with shape consistently, whereas our proposed unified representation learns geometrically consistent color. As shown in Fig. 6, we also conduct qualitative comparisons of the different strategy combinations. The results show that our proposed unified representation (L1+R+SSIM) generates the most visually coherent and highest-fidelity colors, preserving local similarity of shape-consistent colors. We can see that F destroys color consistency, once again confirming the superiority of our proposed unified representation.

Refer to caption
Figure 6: Qualitative comparison for the color ablation study. The unified representation contributes most to the learning of color, while the SSIM loss and remeshing operation contribute notably to the learning of color details.

Inter-frame Consistency. Table 4 shows that our attention mechanism significantly improves temporal consistency, as reflected by improvements across all metrics. The attention mechanism provides valuable cues to improve the consistency of z_{n} and c_{n} in \mathcal{M}, leading to better shape and color consistency over time.

Table 4: Ablation study of the attention mechanism for 4D sequence generation.
Category   Att.   MMD   CD-VAR   COV(%)   COV-VAR   1-NNA(%)
3DBiCar   \times   0.011   0.018   69   2.988   36
3DBiCar   \checkmark   0.009   0.007   73   1.834   41
DT4D   \times   0.034   0.015   73   3.77   36
DT4D   \checkmark   0.030   0.001   76   2.91   43

4.2 3D Object Generation

We conduct a comparative analysis of the quality of generated 3D shapes among DeepSDF [28], SDFusion [9], and our method. SDFusion generates a 3D shape using a 3D latent diffusion model with a VQ-VAE decoder, and then a texture rendering based strategy colors the shape with an off-the-shelf 2D diffusion model. DeepSDF uses an MLP decoder to decode SDF values of query points, without color, from a given latent vector. We compare geometry-only accuracy against these methods on the ShapeNet (car, chair, and airplane) and 3DBiCar (T-pose objects) datasets. Tab. 5 shows that our method outperforms SDFusion and DeepSDF across all metrics on geometry generation and outperforms the texture rendering based color generation of SDFusion on color generation, which confirms that our proposed latent mapping for unified modeling learns more shape details and retains shape-consistent color.

Table 5: Quantitative comparison of single shape generation. Our method surpasses other models across the majority of evaluation metrics.
Category   Method   MMD   COV(%)   1-NNA(%)   CLIP-S
Airplane   DeepSDF [28]   0.264   63.11   71.11   -
Airplane   SDFusion [9]   0.088   29.8   83.71   0.823
Airplane   Ours   0.032   65.22   61.29   0.897
Car   DeepSDF [28]   0.240   64.53   68.85   -
Car   SDFusion [9]   0.599   20.59   54.85   0.842
Car   Ours   0.201   66.45   54.45   0.866
Chair   DeepSDF [28]   0.368   90.42   18.93   -
Chair   SDFusion [9]   0.184   50.83   84.85   0.869
Chair   Ours   0.04   95.56   33.91   0.885
3DBiCar   DeepSDF [28]   0.112   71.98   38.46   -
3DBiCar   SDFusion [9]   0.076   73.19   40.03   0.812
3DBiCar   Ours   0.067   74.63   41.92   0.911
Table 6: Quantitative comparison of 4D sequence generation.
Category   Method   MMD   CD-VAR   COV(%)   COV-VAR   1-NNA(%)
3DBiCar   Voxel   0.293   0.035   30   1.857   91
3DBiCar   HyperDiffusion [12]   0.044   0.011   52   2.138   69.3
3DBiCar   Ours   0.009   0.007   73   1.834   41
DT4D   Voxel   0.219   0.044   35   1.673   85
DT4D   HyperDiffusion [12]   0.155   0.013   45   1.735   62
DT4D   Ours   0.030   0.001   76   2.91   43
Refer to caption
Figure 7: Unconditional 4D Sequence Generation Comparison. Our model achieves high quality 4D sequence generation.
Refer to caption
Figure 8: Comparison of 4D sequence generation with diverse motions. Our model is capable of generating varied motions for the same object with high-quality results.
Table 7: Comparison of 4D generation methods. Comparative analysis of memory usage, speed, and quality among different 4D generation models.
Method Parameters Speed (one sequence) Color CLIP-S
DG4D [32] 2254.4M 6.5 min \checkmark 0.833
Hyper. [12] 190.3M 0.54s ×\times -
Ours 249.6M 0.625s \checkmark 0.911
Refer to caption
Figure 9: Comparison of different views for conditional 4D sequence generation. Our model maintains high temporal consistency across different views.
Refer to caption
Figure 10: Conditional 4D Sequence Generation. Our model is capable of generating high-quality, colored 4D sequences under given conditions.

4.3 4D Object Sequence Generation

Unconditional 4D Object Sequence Shape Generation. We conduct unconditional 4D sequence generation evaluations on the 3DBiCar and DeformingThings4D datasets, as shown in Tab. 6. We adopt the voxel baseline from [12], which uses a 3D latent diffusion model to directly generate 4D sequences. HyperDiffusion [12] synthesizes 4D sequences by generating a sequence of high-dimensional MLP weights; the sparsity of high-dimensional weight spaces can weaken the model's ability to produce results with high temporal consistency. As shown in Tab. 6, voxels represent most shapes poorly in general, hence their CD-VAR and COV-VAR do not fluctuate significantly. Our model benefits from the refinement module, the attention mechanisms and the joint 4D representation, which enables it to achieve superior results across all metrics.

Fig. 7 illustrates the detailed performance of our model and HyperDiffusion [12] in generating 4D sequences, highlighting our model’s ability to maintain temporal consistency while generating sequences with significant motion amplitudes and preserving fine details.

Fig. 8 demonstrates the capability of our 4D representation \mathcal{M} to capture different actions of the same object. It showcases our model’s ability to generate diverse poses for the same object, indicating that our model not only produces 4D sequences rich in detail but also supports a variety of poses.

Conditional Colored 4D Object Sequence Generation. Our model accepts conditions from images and text, using latent diffusion models to generate the corresponding 4D sequence representation \mathcal{M}. We compare our method to the rendering based model DreamGaussian4D (denoted as DG4D) [32] and HyperDiffusion [12] in Tab. 7. Although DreamGaussian4D supports colored 4D sequence generation from an image, it takes substantial time and GPU memory to fit a 4D sequence. HyperDiffusion supports fast generation of 4D sequences but cannot generate color and also suffers from low geometry quality. Our method strikes a balance between speed and memory efficiency.

We then compare our method with [32], as shown in Fig. 9. To maintain consistency with DreamGaussian4D, we train our latent diffusion model only with image conditions, although our approach supports image-text conditions. We set the number of supervision views for DreamGaussian4D to 1 due to GPU memory limitations. Using the 4D sequence S_{1}\cdots S_{N} exported from DreamGaussian4D, we render images from multiple viewpoints. For viewpoints outside the range of supervised views used for training, the rendering quality is poor, e.g. incomplete shapes and blurred colors, because single-view optimization degrades texture and geometry generation. Consequently, our approach achieves superior results under the same GPU resources, striking a balance between memory consumption and high-quality sequence generation.

Finally, we showcase our colored 4D sequence generation results in Fig. 10. The results indicate that our method can balance the three-way tradeoff between 3D shape quality, color and sequence coherence, since we can effectively generate the 4D representation \mathcal{M} and decode it into the 4D sequence S_{1}\cdots S_{N} under the given image-text condition.

5 Conclusion and Limitations

Conclusion. We presented a novel image-text conditioned 4D sequence generation method that balances the tradeoff between 3D shape quality, color, and sequence coherence. We introduced an integrative latent mapping method that simultaneously models the 3D shape and color information within a unified latent space. This mapping enables a skeleton-free, low-dimensional 4D representation by extending the mapping along the temporal dimension to support efficient correlation learning over time. We proposed frame coherent self-attention and condition injection cross-attention to improve temporal consistency in the conditional generation of colored 4D sequences. Experiments on multiple datasets demonstrate that our method can generate diverse and high-quality 4D sequences. We believe our method is a first attempt that opens up new possibilities for the representation and generation of complex 4D sequences.

Limitations. Unlike images and videos, where large amounts of captioned data are readily available, animated 3D shapes (with or without text descriptions) are scarce. Although our method produces compelling 4D sequences, training a large-scale model on extensive 4D sequences would further improve generalization to complex motions. Investigating diffusion models pre-trained on more common data modalities and transferring them to 4D sequences is an exciting future direction.

References

  • [1] Abou-Chakra, J., Dayoub, F., Sünderhauf, N.: Particlenerf: A particle-based encoding for online neural radiance fields. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 5975–5984 (2024)
  • [2] Achlioptas, P., Diamanti, O., Mitliagkas, I., Guibas, L.: Learning representations and generative models for 3d point clouds. In: International conference on machine learning. pp. 40–49 (2018)
  • [3] Bahmani, S., Skorokhodov, I., Rong, V., Wetzstein, G., Guibas, L., Wonka, P., Tulyakov, S., Park, J.J., Tagliasacchi, A., Lindell, D.B.: 4d-fy: Text-to-4d generation using hybrid score distillation sampling. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)
  • [4] Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al.: Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127 (2023)
  • [5] Cai, S., Ceylan, D., Gadelha, M., Huang, C.H.P., Wang, T.Y., Wetzstein, G.: Generative rendering: Controllable 4d-guided video generation with 2d diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2024)
  • [6] Cao, A., Johnson, J.: Hexplane: A fast representation for dynamic scenes. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 130–141 (2023)
  • [7] Chang, A.X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., et al.: Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012 (2015)
  • [8] Chen, Z., Zhang, H.: Learning implicit fields for generative shape modeling. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5939–5948 (2019)
  • [9] Cheng, Y.C., Lee, H.Y., Tulyakov, S., Schwing, A.G., Gui, L.Y.: Sdfusion: Multimodal 3d shape completion, reconstruction, and generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4456–4465 (2023)
  • [10] Cheng, Z., Chai, M., Ren, J., Lee, H.Y., Olszewski, K., Huang, Z., Maji, S., Tulyakov, S.: Cross-modal 3d shape generation and manipulation. In: European Conference on Computer Vision. pp. 303–321. Springer (2022)
  • [11] Du, Y., Zhang, Y., Yu, H.X., Tenenbaum, J.B., Wu, J.: Neural radiance flow for 4d view synthesis and video processing. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV). pp. 14304–14314 (2021)
  • [12] Erkoç, Z., Ma, F., Shan, Q., Nießner, M., Dai, A.: Hyperdiffusion: Generating implicit neural fields with weight-space diffusion. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 14300–14310 (2023)
  • [13] Fang, J., Yi, T., Wang, X., Xie, L., Zhang, X., Liu, W., Nießner, M., Tian, Q.: Fast dynamic radiance fields with time-aware neural voxels. In: SIGGRAPH Asia 2022 Conference Papers. pp. 1–9 (2022)
  • [14] Fridovich-Keil, S., Meanti, G., Warburg, F.R., Recht, B., Kanazawa, A.: K-planes: Explicit radiance fields in space, time, and appearance. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12479–12488 (2023)
  • [15] Gao, C., Saraf, A., Kopf, J., Huang, J.B.: Dynamic view synthesis from dynamic monocular video. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 5712–5721 (2021)
  • [16] Hong, Y., Zhang, K., Gu, J., Bi, S., Zhou, Y., Liu, D., Liu, F., Sunkavalli, K., Bui, T., Tan, H.: Lrm: Large reconstruction model for single image to 3d (2024)
  • [17] Jiang, B., Zhang, Y., Wei, X., Xue, X., Fu, Y.: H4d: human 4d modeling by learning neural compositional representation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 19355–19365 (2022)
  • [18] Jiang, Y., Zhang, L., Gao, J., Hu, W., Yao, Y.: Consistent4d: Consistent 360° dynamic object generation from monocular video. In: International Conference on Learning Representations (2023)
  • [19] Jun, H., Nichol, A.: Shap-e: Generating conditional 3d implicit functions. arXiv preprint arXiv:2305.02463 (2023)
  • [20] Khan, D., Plopski, A., Fujimoto, Y., Kanbara, M., Jabeen, G., Zhang, Y.J., Zhang, X., Kato, H.: Surface remeshing: A systematic literature review of methods and research directions. IEEE Transactions on Visualization and Computer Graphics 28(3), 1680–1713 (2020)
  • [21] Li, T., Slavcheva, M., Zollhoefer, M., Green, S., Lassner, C., Kim, C., Schmidt, T., Lovegrove, S., Goesele, M., Newcombe, R., et al.: Neural 3d video synthesis from multi-view video. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5521–5531 (2022)
  • [22] Li, Y., Takehara, H., Taketomi, T., Zheng, B., Nießner, M.: 4dcomplete: Non-rigid motion estimation beyond the observable surface. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 12706–12716 (2021)
  • [23] Li, Z., Wang, Q., Cole, F., Tucker, R., Snavely, N.: Dynibar: Neural dynamic image-based rendering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4273–4284 (2023)
  • [24] Lin, C.H., Lee, H.Y., Menapace, W., Chai, M., Siarohin, A., Yang, M.H., Tulyakov, S.: Infinicity: Infinite-scale city synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision (2023)
  • [25] Luo, S., Hu, W.: Diffusion probabilistic models for 3d point cloud generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2837–2845 (2021)
  • [26] Luo, Z., Cai, S., Dong, J., Ming, R., Qiu, L., Zhan, X., Han, X.: Rabit: Parametric modeling of 3d biped cartoon characters with a topological-consistent dataset. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12825–12835 (2023)
  • [27] Mittal, P., Cheng, Y.C., Singh, M., Tulsiani, S.: Autosdf: Shape priors for 3d completion, reconstruction and generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 306–315 (2022)
  • [28] Park, J.J., Florence, P., Straub, J., Newcombe, R., Lovegrove, S.: Deepsdf: Learning continuous signed distance functions for shape representation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 165–174 (2019)
  • [29] Park, K., Sinha, U., Barron, J.T., Bouaziz, S., Goldman, D.B., Seitz, S.M., Martin-Brualla, R.: Nerfies: Deformable neural radiance fields. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 5865–5874 (2021)
  • [30] Pumarola, A., Corona, E., Pons-Moll, G., Moreno-Noguer, F.: D-nerf: Neural radiance fields for dynamic scenes. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10318–10327 (2021)
  • [31] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763 (2021)
  • [32] Ren, J., Pan, L., Tang, J., Zhang, C., Cao, A., Zeng, G., Liu, Z.: Dreamgaussian4d: Generative 4d gaussian splatting. arXiv preprint arXiv:2312.17142 (2023)
  • [33] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)
  • [34] Shao, R., Zheng, Z., Tu, H., Liu, B., Zhang, H., Liu, Y.: Tensor4d: Efficient neural 4d decomposition for high-fidelity dynamic reconstruction and rendering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16632–16642 (2023)
  • [35] Sheynin, S., Ashual, O., Polyak, A., Singer, U., Gafni, O., Nachmani, E., Taigman, Y.: Knn-diffusion: Image generation via large-scale retrieval. In: International Conference on Learning Representations (2023)
  • [36] Singer, U., Sheynin, S., Polyak, A., Ashual, O., Makarov, I., Kokkinos, F., Goyal, N., Vedaldi, A., Parikh, D., Johnson, J., et al.: Text-to-4d dynamic scene generation. In: Proceedings of International Conference on Machine Learning (2023)
  • [37] Smith, E.J., Meger, D.: Improved adversarial systems for 3d object generation and reconstruction. In: Conference on Robot Learning. pp. 87–96. PMLR (2017)
  • [38] Tang, J., Ren, J., Zhou, H., Liu, Z., Zeng, G.: Dreamgaussian: Generative gaussian splatting for efficient 3d content creation. In: International Conference on Learning Representations (2024)
  • [39] Tretschk, E., Tewari, A., Golyanik, V., Zollhöfer, M., Lassner, C., Theobalt, C.: Non-rigid neural radiance fields: Reconstruction and novel view synthesis of a dynamic scene from monocular video. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 12959–12970 (2021)
  • [40] Turki, H., Zhang, J.Y., Ferroni, F., Ramanan, D.: Suds: Scalable urban dynamic scenes. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12375–12385 (2023)
  • [41] Wang, Y., Chen, X., Ma, X., Zhou, S., Huang, Z., Wang, Y., Yang, C., He, Y., Yu, J., Yang, P., et al.: Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:2309.15103 (2023)
  • [42] Wu, G., Yi, T., Fang, J., Xie, L., Zhang, X., Wei, W., Liu, W., Tian, Q., Wang, X.: 4d gaussian splatting for real-time dynamic scene rendering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)
  • [43] Wu, Z., Wang, Y., Feng, M., Xie, H., Mian, A.: Sketch and text guided diffusion model for colored point cloud generation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 8929–8939 (2023)
  • [44] Xie, J., Zheng, Z., Gao, R., Wang, W., Zhu, S.C., Wu, Y.N.: Learning descriptor networks for 3d shape synthesis and analysis. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 8629–8638 (2018)
  • [45] Zhang, S.H., Guo, Y.C., Gu, Q.W.: Sketch2model: View-aware 3d modeling from single free-hand sketches. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6012–6021 (2021)
  • [46] Zhao, Y., Yan, Z., Xie, E., Hong, L., Li, Z., Lee, G.H.: Animate124: Animating one image to 4d dynamic scene. arXiv preprint arXiv:2311.14603 (2023)
  • [47] Zhou, L., Du, Y., Wu, J.: 3d shape generation and completion through point-voxel diffusion. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 5826–5835 (2021)
  • [48] Zou, K., Faisan, S., Yu, B., Valette, S., Seo, H.: 4d facial expression diffusion model. arXiv preprint arXiv:2303.16611 (2023)