3DFacePolicy: Speech-Driven 3D Facial Animation with Diffusion Policy
Abstract.
Audio-driven 3D facial animation has made impressive progress in both research and applications. The latest approaches focus on Transformer-based and diffusion-based methods; however, a gap in vividness and emotional expression remains between generated animations and real human faces. To tackle this limitation, we propose 3DFacePolicy, a diffusion policy model for 3D facial animation prediction. Rather than generating the face anew for every frame, it produces varied and realistic facial movements by predicting vertex trajectories on a 3D facial template with a diffusion policy. The model takes audio and vertex states as observations to predict vertex trajectories and imitate real facial expressions, preserving the continuous and natural flow of human emotion. Experiments show that our approach is effective at synthesizing varied and dynamic facial motion.

1. Introduction
Speech-driven 3D facial animation is a crucial topic in both research and application fields. It focuses on creating realistic and precise 3D facial movements on 3D vertex or blendshape templates from speech input, and is widely deployed in live streaming, film production, and video games. Its goal is to create vivid and natural facial expressions that resemble real human facial movements (Yang et al., 2024).
Traditional generative methods can produce promising facial animations and maintain lip-sync accuracy under audio guidance. Karras et al. (Karras et al., 2017) propose an end-to-end CNN-based method that maps input waveforms to 3D vertex coordinates. VOCA (Cudeiro et al., 2019) is a CNN-based method built on a pretrained DeepSpeech (Hannun, 2014) model that also provides identity control. More recent progress comes from Transformer-based autoregressive approaches. FaceFormer (Fan et al., 2022) exploits the long-range contextual capabilities of the Transformer for autoregressive facial motion generation. CodeTalker (Xing et al., 2023) combines self-supervised Wav2Vec 2.0 with a latent codebook built with a VQ-VAE, inspired by (Ng et al., 2022). Although these deterministic regression methods achieve relatively promising results, they overlook the dynamic and variable nature of human expressions.
In contrast to deterministic methods, diffusion models (Ho et al., 2020) gradually remove noise from a signal rather than learning a direct mapping from a noise distribution to the data distribution, which supports various forms of data, especially high-dimensional generation (Yang et al., 2023). They also accept conditional inputs that guide the denoising process toward the desired output. In this sense, diffusion models are a better fit for 3D facial animation generation. However, few works leverage diffusion models for this task. FaceDiffuser (Stan et al., 2023) is the first work to integrate a diffusion model into 3D facial animation and shows that the diffusion mechanism is effective for handling diverse facial animation sequences. DiffSpeaker (Ma et al., 2024) combines a Transformer architecture with diffusion-based animation sequence generation and proposes a biased conditional self/cross-attention mechanism. However, these methods struggle to process highly intensive contexts.
To address the limitations of the aforementioned methods, our approach aims to generate flexible, varied, and dynamic facial motions from audio input while maintaining accurate facial expressions. To this end, we propose 3DFacePolicy, a facial motion prediction model based on a diffusion policy framework. A conceptual comparison of our method with the two other mainstream approaches is shown in Figure 1. Diffusion policy was first introduced by (Chi et al., 2023) as a visuomotor imitation learning algorithm designed to teach robots a broad range of motor skills. It conditions on high-dimensional visual inputs, such as images or depth maps, and generates denoised robot actions for visuomotor tasks using Denoising Diffusion Probabilistic Models (DDPMs) (Ho et al., 2020).
Leveraging this approach, we decompose a 3D facial animation into sequences of vertices, audio, and actions, where the action sequence represents frame-by-frame deviations of the vertices. By sampling a noisy action sequence from Gaussian noise and conditioning it on the audio and facial vertex sequences, the diffusion policy model outputs a denoised action sequence. To handle complex, high-dimensional data distributions such as actions, we employ a sequence sampler following (Ze et al., 2024), which segments the long data sequence into smaller fractions. This technique generates denoised actions over a limited horizon, effectively reducing vibrational artifacts. Consequently, our architecture yields not only smooth 3D facial animations but also more flexible and varied facial motions.
We evaluate our proposed method quantitatively. With VOCASET as the benchmark, 3DFacePolicy holds notable advantages over other state-of-the-art methods. The main contributions of our work are as follows:
- 3DFacePolicy is, to our knowledge, the first work to integrate the robot imitation learning framework of diffusion policy with the 3D facial animation synthesis task.
- Our model introduces a new approach to disentangling the animation sequence, refining per-vertex motion sequences from the single vertex sequence of a whole animation.
- We leverage a sequence sampler for smooth action generation within a limited local context during policy training.
2. Related Work
2.1. Speech-Driven 3D Facial Animation
Speech-driven 3D facial animation aims to create realistic and natural facial movements from a driving speech signal. The most crucial aspect of the problem is synchronizing the tone, rhythm, and dynamics of real human expressions with the speech. Over the years, research has mainly focused on traditional generative methods and diffusion-based methods.
Traditional generative methods: These methods learn deterministic mappings from audio to facial motion with deep neural networks. Taylor et al. (Taylor et al., 2017) propose a sliding-window approach that transforms audio information into animation parameters with a deep neural network. Karras et al. (Karras et al., 2017) use an end-to-end CNN-based method to animate a face directly from audio with an emotion component. Edwards et al. (Edwards et al., 2016) propose two anatomical actions to synchronize the lips and jaw with the speech signal. Zhou et al. (Zhou et al., 2018) focus on phoneme processing with a model combining JALI and an LSTM-based neural network. However, these methods are limited to lip-only animation and cannot handle identity variations. Subsequent methods train on high-resolution, audio-mesh paired, vertex-based datasets and raise the results to a higher level. Cudeiro et al. (Cudeiro et al., 2019) design a method that captures both audio features and speaking styles to generate facial animation from a mesh template, but it still lacks upper-face movement. Richard et al. propose MeshTalk (Richard et al., 2021), which learns a categorical latent space that disentangles facial movements into audio-correlated and audio-uncorrelated components and addresses upper-face motion, but it neglects long-term audio information. Building on the Transformer (Vaswani, 2017), FaceFormer (Fan et al., 2022) adopts the Transformer architecture to overcome the limitation on short-term audio data, together with the self-supervised pretrained speech model Wav2Vec 2.0 (Baevski et al., 2020a). CodeTalker (Xing et al., 2023) builds a codebook for the complex data distribution of the motion space, inspired by VQ-VAE (Van Den Oord et al., 2017), leading to a notable performance improvement. Although these deterministic regression methods can generate 3D facial animation naturally and precisely, dynamic facial movements and varied expressions remain a weak point.
Diffusion-based methods: With the rapid progress of Denoising Diffusion Probabilistic Models (Ho et al., 2020), diffusion is widely applied to high-dimensional data generation such as media content. It gradually generates a target data distribution from randomly sampled Gaussian noise through a Markov-chain-like denoising process, and conditional guidance during generation provides additional variety, producing impressive data distributions that meet given requirements (Croitoru et al., 2023). For speech-driven 3D facial animation, FaceDiffuser (Stan et al., 2023) is the first to integrate a diffusion model: it predicts the facial sequence within a diffusion framework, with HuBERT (Hsu et al., 2021) as the pretrained audio encoder and a GRU as the facial decoder, achieving performance comparable to state-of-the-art deterministic methods. Furthermore, DiffPoseTalk (Sun et al., 2024) and 3DiFACE (Thambiraja et al., 2023) address head poses and personalized speaker styles with conditional diffusion models. DiffSpeaker (Ma et al., 2024) leverages a biased conditional attention module to further control the denoising process and to cope with the limited audio-4D data. Nevertheless, these diffusion-based methods suffer from unstable generation, such as blurred facial motions.
2.2. Diffusion Policy Models
Visuomotor policies come from robotics, where visual observations such as images or depth maps guide a robot in performing tasks. The methods can be classified into reinforcement learning, reward learning, motion planning, and imitation learning. Diffusion Policy (Chi et al., 2023) is a simple but powerful approach that leverages a conditional denoising diffusion model within visual imitation learning for robot visuomotor policy generation. It takes 2D image sequences as conditions and generates robot actions from randomly sampled noise with a diffusion model inside a small closed-loop action space with horizon control, achieving notable results on multiple robot control tasks. Building on this work, 3D Diffuser Actor (Ke et al., 2024) lifts the 2D single-view visual condition to multi-view observations and disentangles robot actions into position, rotation, and end-effector pose. 3D Diffusion Policy (Ze et al., 2024) is the closest to our work. It consists of two critical parts, perception and decision. In perception, it directly takes a 3D point cloud and the robot state as conditions with a highly efficient 3D encoder. In decision, a conditional denoising diffusion model generates the robot actions from random Gaussian noise through the denoising process. In our work, we treat facial motions as the actions: with the facial mesh vertex sequence and audio sequence as conditions, we gradually generate the denoised motion distribution from a Gaussian-noise motion sequence with a conditional denoising diffusion model. This prevents blurred facial animations and yields dynamic and varied facial movements.
3. Methods
3.1. Problem Formulation
The conventional diffusion-based approach to speech-driven 3D facial animation aims to generate denoised vertices $v_{1:T}$ conditioned on speech audio $s_{1:T}$, where $T$ is the number of visual frames sampled from the whole speech sequence. Each frame carries the vertex information of the speech dataset, represented as $v_t \in \mathbb{R}^{N \times 3}$, where $N$ is the number of mesh vertices of the template face mesh and each vertex has 3D coordinates. However, such approaches are not explicitly tied to the facial movement within the animation sequence. In our method, we design a facial movement diffusion policy model (3DFacePolicy) following (Ze et al., 2024) that predicts the trajectory of vertex movements for every frame, represented as actions $a_{1:T}$, from noised actions $a^k_{1:T}$ conditioned on the audio and vertex states, where $k$ is the diffusion step. The vertices are disentangled into actions of the same shape, and the vertex state is treated as one of the conditions. The goal of the proposed architecture 3DFacePolicy, denoted $\mathcal{F}$, is therefore to predict vertex movements that are nearly identical to the real ones throughout a given animation sequence, starting from noised actions sampled from Gaussian noise and conditioned on the audio and vertex states. The problem is formulated as:

(1)   $\hat{a}_{1:T} = \mathcal{F}\big(a^K_{1:T},\, v_{1:T},\, s_{1:T}\big)$

where $\hat{a}_{1:T}$ is the predicted action of each vertex of the topology over the animation sequence, $a^K_{1:T}$ is the action after $K$ diffusion steps, and $v_{1:T}$ and $s_{1:T}$ are the vertex and audio sequences, respectively. With the predicted action sequence $\hat{a}_{1:T}$ and the vertices of the first frame $v_1$, the vertices of frame $t$ are given by:

(2)   $\hat{v}_t = v_1 + \sum_{i=1}^{t-1} \hat{a}_i$

where $\hat{v}_{1:T}$ is the output animation sequence driven by the audio input, following the frame-by-frame predicted actions.
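To make the vertex-action disentanglement concrete, here is a minimal PyTorch sketch of Eq. (2): actions are frame-to-frame vertex displacements, and the animation is recovered by accumulating them onto the first frame. The function names (`actions_from_vertices`, `vertices_from_actions`) are illustrative, not part of any released implementation.

```python
import torch

def actions_from_vertices(vertices: torch.Tensor) -> torch.Tensor:
    """Disentangle a vertex sequence into actions: frame-by-frame deviations.
    vertices: (T, N, 3) -> actions: (T-1, N, 3)"""
    return vertices[1:] - vertices[:-1]

def vertices_from_actions(v_first: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
    """Recover the vertex animation from per-frame actions (Eq. 2).
    v_first: (N, 3), actions: (T-1, N, 3) -> vertices: (T, N, 3)"""
    offsets = torch.cumsum(actions, dim=0)                 # cumulative displacement
    later_frames = v_first.unsqueeze(0) + offsets          # frames 2..T
    return torch.cat([v_first.unsqueeze(0), later_frames], dim=0)

# Round-trip check on a FLAME-sized mesh (5023 vertices), T = 60 frames:
T, N = 60, 5023
gt_vertices = torch.randn(T, N, 3)
acts = actions_from_vertices(gt_vertices)
recon = vertices_from_actions(gt_vertices[0], acts)
assert torch.allclose(recon, gt_vertices, atol=1e-4)
```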

3.2. Diffusion Policy based Action Prediction Model
We aim to learn an action-predicting policy from visual and audio perception via a diffusion model. To this end, we introduce 3DFacePolicy, which contains three critical parts. (a) Preprocessing: we disentangle the 3D animation sequence into vertices, audio, and a template; a sequence sampler resamples the long vertex and audio data into several limited-length sequences stored in a buffer. (b) Perception: the vertex and audio sequences are encoded into features with pretrained encoders, and the visual and audio features together form the observation features. (c) Decision: using 3D Diffusion Policy (Ze et al., 2024) as the backbone, it predicts the action sequence conditioned on the observation features. An overview of the proposed model is illustrated in Figure 2.
3.2.1. Preprocessing
It has been shown that 3D facial animation synthesis is limited by the highly intensive context of the data distribution (Stan et al., 2023). To address this, we split the long sequence into a series of fixed-length segments and train the motion-predicting policy on this relatively local context, which provides a specific context for motion prediction. This is achieved by a sequence sampler following (Chi et al., 2023). In the data preprocessing module, we resample the long vertex and audio sequences into a list of short-term sequences with the sequence sampler and save them in a buffer; we then sample small vertex and audio sequences from the buffer and encode them into observations for policy training in the perception module.
Sequence Sampler: To split a long sequence into fixed-length fractions, we define the horizon $H$ as the time length of a sampled fraction for action prediction, $T_o$ as the time length of the observation condition, and $T_a$ as the time length of the executed action steps. With these preliminaries, at a given time step $t$ the policy takes $T_o$ steps of observations as conditions to predict $H$ steps of actions, but only a shorter span of $T_a$ action steps is used, which keeps action prediction smooth and consistent. During training, these lengths are fixed to constant values. A minimal sketch of this sampler is given below.
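The sketch assumes frame-aligned per-frame audio features and uses illustrative values for $H$, $T_o$, and $T_a$; the paper's actual settings are not reproduced here.

```python
import torch

def sample_windows(vertices, audio_feats, H=16, T_o=2, T_a=8):
    """Split a long paired (vertex, audio) sequence into fixed-length windows.

    vertices:    (T, N, 3) vertex sequence
    audio_feats: (T, D)    per-frame audio features (assumed frame-aligned)
    H:   prediction horizon (length of each sampled fraction)
    T_o: frames used as the observation condition
    T_a: action steps executed before re-planning at inference time (T_a <= H)
    """
    buffer = []
    T = len(vertices)
    for t in range(T_o - 1, T - H):
        window = {
            "obs_vertices": vertices[t - T_o + 1 : t + 1],     # (T_o, N, 3)
            "obs_audio":    audio_feats[t - T_o + 1 : t + 1],  # (T_o, D)
            # supervision: frame-to-frame vertex displacements over the horizon
            "actions": vertices[t + 1 : t + 1 + H] - vertices[t : t + H],
        }
        buffer.append(window)
    # At inference, only the first T_a of the H predicted actions would be
    # applied before re-planning (receding-horizon control).
    return buffer

buffer = sample_windows(torch.randn(240, 5023, 3), torch.randn(240, 1024))
```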
3.2.2. Perception
Fixed-duration segments of vertices and audio are sampled from the buffer, and the pretrained visual and audio encoders map each segment to features. Each segment is treated as a temporal observation fraction, and its audio and visual features serve as the conditions for action prediction with the diffusion policy.
Pretrained Encoder: For the visual encoder, we design a simple lightweight encoder following (Stan et al., 2023). It contains a linear layer, a convolutional layer, and a max-pooling layer, which downsample the 3D feature to 512 dimensions. For the audio encoder, we use the pretrained large speech model HuBERT (Hsu et al., 2021) to generate audio features, which has been shown to be superior to Wav2Vec 2.0 (Baevski et al., 2020b) in (Haque and Yumak, 2023). We use the released pretrained hubert-large-ls960-ft version, which consists of a temporal convolutional audio feature extractor and a multi-layer Transformer encoder. The representations from the visual and audio encoders are concatenated into a single 1024-dimensional representation, on which the decision backbone conditions its action generation.
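The snippet below sketches one plausible realization of this perception module: a lightweight vertex encoder (linear, 1D convolution, max pooling) producing a 512-d feature, and frozen HuBERT hidden states projected to 512-d and concatenated into a 1024-d observation. Layer sizes, the pooling choice, and the projection are assumptions; only the use of hubert-large-ls960-ft and the 512 + 512 = 1024 dimensions come from the text.

```python
import torch
import torch.nn as nn
from transformers import HubertModel

class VertexEncoder(nn.Module):
    """Lightweight vertex-sequence encoder: linear -> 1D conv -> max pool,
    producing one 512-d feature per observation window (sketch)."""
    def __init__(self, n_vertices=5023, hidden=512):
        super().__init__()
        self.proj = nn.Linear(n_vertices * 3, hidden)      # flatten each frame
        self.conv = nn.Conv1d(hidden, hidden, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveMaxPool1d(1)                # pool over time

    def forward(self, verts):                              # verts: (B, T_o, N, 3)
        b, t, n, _ = verts.shape
        x = self.proj(verts.reshape(b, t, n * 3))          # (B, T_o, 512)
        x = self.conv(x.transpose(1, 2))                   # (B, 512, T_o)
        return self.pool(x).squeeze(-1)                    # (B, 512)

class Perception(nn.Module):
    """Concatenate the 512-d vertex feature with a 512-d audio feature
    derived from frozen HuBERT hidden states."""
    def __init__(self):
        super().__init__()
        self.vertex_enc = VertexEncoder()
        self.hubert = HubertModel.from_pretrained("facebook/hubert-large-ls960-ft")
        self.hubert.requires_grad_(False)
        self.audio_proj = nn.Linear(self.hubert.config.hidden_size, 512)

    def forward(self, verts, waveform):                    # waveform: (B, samples)
        f_v = self.vertex_enc(verts)                        # (B, 512)
        h = self.hubert(waveform).last_hidden_state         # (B, T', 1024)
        f_a = self.audio_proj(h.mean(dim=1))                # (B, 512)
        return torch.cat([f_v, f_a], dim=-1)                # (B, 1024) observation
```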
3.2.3. Decision
The conditional denoising diffusion model (Ho et al., 2020) is chosen as the backbone of our decision-making module, following (Ze et al., 2024). A noised action sequence $a^K$ is sampled from Gaussian noise and, conditioned on the visual features $f_v$ and audio features $f_a$, is gradually denoised over $K$ iterations into a noise-free action sequence $a^0$ via the reverse process (Ho et al., 2020), formulated as follows:

(3)   $a^{k-1} = \alpha_k \big( a^k - \gamma_k\, \epsilon_\theta(a^k, k, f_v, f_a) \big) + \mathcal{N}\big(0, \sigma_k^2 I\big)$

where $\epsilon_\theta$ is the denoising network, $\alpha_k$, $\gamma_k$, and $\sigma_k$ are functions of the iteration $k$ (the noise schedule), and $\mathcal{N}(0, \sigma_k^2 I)$ is Gaussian noise. After $K$ iterations, we obtain the actions $a^0$ for the current sequence.
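A hedged sketch of the reverse process in Eq. (3): starting from Gaussian noise, the action sequence is iteratively denoised under the observation condition. `eps_net` stands in for the denoising network $\epsilon_\theta$, and the schedule tensors are assumed to be precomputed.

```python
import torch

@torch.no_grad()
def denoise_actions(eps_net, obs_feat, action_shape, K, alphas, gammas, sigmas,
                    device="cpu"):
    """DDPM-style reverse process of Eq. (3), conditioned on the observation.

    eps_net(a_k, k, obs_feat) -> predicted noise with the same shape as a_k
    obs_feat: (B, 1024) concatenated vertex + audio features
    alphas, gammas, sigmas: per-iteration schedule coefficients, indexable by k
    """
    a_k = torch.randn(action_shape, device=device)          # a^K ~ N(0, I)
    for k in range(K, 0, -1):
        eps = eps_net(a_k, k, obs_feat)                      # predicted noise
        a_k = alphas[k] * (a_k - gammas[k] * eps)            # deterministic update
        if k > 1:                                            # add noise except at k = 1
            a_k = a_k + sigmas[k] * torch.randn_like(a_k)
    return a_k                                               # a^0, denoised actions
```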
Training Objective: In the diffusion process (Ho et al., 2020), noise $\epsilon^k$ is added to a randomly sampled clean action sequence $a^0$ at iteration $k$ to train the denoising network $\epsilon_\theta$. The objective is to predict the noise added to the sequence:

(4)   $\mathcal{L} = \mathrm{MSE}\big(\epsilon^k,\ \epsilon_\theta(\bar{\alpha}_k a^0 + \bar{\beta}_k \epsilon^k,\, k,\, f_v, f_a)\big)$

where $\bar{\alpha}_k$ and $\bar{\beta}_k$ are noise schedules over the diffusion steps.
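A corresponding sketch of the training objective in Eq. (4), where a clean action window is corrupted at a random iteration and the network is supervised to recover the injected noise; the function name and tensor shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def diffusion_loss(eps_net, a0, obs_feat, alpha_bar, beta_bar, K):
    """Noise-prediction objective of Eq. (4).

    a0:        (B, H, N*3) clean action windows from the sequence sampler
    obs_feat:  (B, 1024)   observation condition
    alpha_bar, beta_bar: noise-schedule tensors of length K + 1
    """
    B = a0.shape[0]
    k = torch.randint(1, K + 1, (B,), device=a0.device)      # random iteration per sample
    eps = torch.randn_like(a0)                                # target noise
    ab = alpha_bar[k].view(B, 1, 1)                           # broadcast over action dims
    bb = beta_bar[k].view(B, 1, 1)
    a_k = ab * a0 + bb * eps                                  # noised actions
    eps_hat = eps_net(a_k, k, obs_feat)                       # predicted noise
    return F.mse_loss(eps_hat, eps)
```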
4. Experiments
We train and test our model on the representative 3D facial animation dataset VOCASET (Cudeiro et al., 2019) and compare our method with other benchmarks through quantitative evaluation, showing strong and competitive results against state-of-the-art methods.
4.1. Datasets
VOCASET dataset: It contains 480 3D facial animation sequences, each paired with audio, from 12 subjects. Each sequence is recorded at 60 frames per second and is about 3 to 4 seconds long. The template mesh for facial motion uses the FLAME (Li et al., 2017) topology, which consists of 5023 vertices. For a fair comparison, we use the same training (VOCA-Train), validation (VOCA-Val), and test (VOCA-Test) splits as in (Xing et al., 2023) and (Fan et al., 2022).
4.2. Experimental Settings
Training: We train for 50 epochs with a batch size of 1. The audio and visual encoders each produce 512-dimensional hidden states, which are concatenated into 1024-dimensional observation features. The model was trained on a single V100 GPU with 32 GB of memory, which took 5 hours. We use the AdamW optimizer with a learning rate of 0.0001.
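For reference, a minimal setup reflecting the stated hyperparameters (AdamW, learning rate 1e-4, 50 epochs, batch size 1, 512 + 512 observation features); the network here is a placeholder, not the actual 3DFacePolicy architecture.

```python
import torch
import torch.nn as nn

# Placeholder module standing in for the denoising backbone.
policy_net = nn.Sequential(nn.Linear(1024, 512), nn.ReLU(), nn.Linear(512, 512))

# Reported settings: AdamW, lr = 1e-4, 50 epochs, batch size 1.
optimizer = torch.optim.AdamW(policy_net.parameters(), lr=1e-4)
NUM_EPOCHS = 50
BATCH_SIZE = 1
OBS_DIM = 512 + 512   # concatenated visual + audio observation features
```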
Metrics: We choose facial dynamics deviation (FDD) and mean vertex error (MVE) as evaluation metrics, following (Xing et al., 2023). FDD measures the deviation between the dynamics of the generated facial movements and the ground truth, which suits the evaluation of our motion-policy-based generation. MVE measures the deviation of all face vertices of a sequence with respect to the ground truth.
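A rough sketch of how the two metrics can be computed; the exact published FDD definition (e.g., restriction to upper-face vertices and the precise dynamics measure) may differ from this simplified version.

```python
import torch

def mean_vertex_error(pred, gt):
    """MVE: average L2 distance between predicted and ground-truth vertices.
    pred, gt: (T, N, 3) vertex sequences."""
    return (pred - gt).norm(dim=-1).mean()

def facial_dynamics_deviation(pred, gt):
    """FDD sketch: per-vertex dynamics, taken here as the std over time of the
    frame-to-frame motion magnitude, compared between prediction and ground
    truth and averaged over all vertices."""
    def dyn(seq):                        # (T, N, 3) -> (N,)
        motion = seq[1:] - seq[:-1]      # frame-to-frame displacement
        return motion.norm(dim=-1).std(dim=0)
    return (dyn(pred) - dyn(gt)).abs().mean()

# Example usage on random sequences of a FLAME-sized mesh:
pred, gt = torch.randn(100, 5023, 3), torch.randn(100, 5023, 3)
print(mean_vertex_error(pred, gt), facial_dynamics_deviation(pred, gt))
```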
4.3. Quantitative Evaluation
Table 1 shows the quantitative results of our method and the baselines. Our proposed 3DFacePolicy outperforms the other approaches on the FDD metric by a large margin, which benefits from our motion policy module: its direct action prediction makes the generated facial motion mirror realistic facial movements in speech animation. This result shows that 3DFacePolicy can generate highly dynamic facial motions synchronized with the ground-truth motions. However, the other state-of-the-art methods outperform our method on the MVE metric. This is a consequence of predicting motion instead of per-frame animation: our method tends to generate dynamic facial motions, which can produce exaggerated facial expressions in vertex space.
Method | FDD (mm) | MVE (mm)
---|---|---
FaceFormer (Fan et al., 2022) | 25.352 | 1.117
CodeTalker (Xing et al., 2023) | 22.830 | 0.912
FaceDiffuser (Stan et al., 2023) | 17.335 | 0.925
3DFacePolicy (Ours) | 6.276 | 2.062
5. Conclusion
In this paper, we present 3DFacePolicy, which disentangles facial motions from the 3D facial animation sequence and innovatively leverages a diffusion policy, together with a sequence sampler, to predict facial motions conditioned on the vertex and audio sequences. The core contribution of our method is bringing the diffusion policy mechanism from robotics to 3D facial motion prediction. The quantitative results demonstrate that our method outperforms the current baselines in dynamic facial motion synthesis and produces realistic and vivid facial expressions compared to other methods. We argue that our work provides a novel motion-prediction approach for the 3D facial animation task that benefits from the diffusion policy mechanism.
Limitation and future work: 3DFacePolicy performs well on dynamic facial motion generation, but it is less consistent when synthesizing a whole animation sequence in vertex space, and it has not yet been evaluated on additional benchmarks. In future work, we will add a global condition in vertex space to keep animation generation consistent and evaluate our method on more benchmarks across multiple aspects.
References
- Baevski et al. (2020a) Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020a. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems 33 (2020), 12449–12460.
- Baevski et al. (2020b) Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020b. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems 33 (2020), 12449–12460.
- Chi et al. (2023) Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. 2023. Diffusion policy: Visuomotor policy learning via action diffusion. arXiv preprint arXiv:2303.04137 (2023).
- Croitoru et al. (2023) Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, and Mubarak Shah. 2023. Diffusion models in vision: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 45, 9 (2023), 10850–10869.
- Cudeiro et al. (2019) Daniel Cudeiro, Timo Bolkart, Cassidy Laidlaw, Anurag Ranjan, and Michael J Black. 2019. Capture, learning, and synthesis of 3D speaking styles. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 10101–10111.
- Edwards et al. (2016) Pif Edwards, Chris Landreth, Eugene Fiume, and Karan Singh. 2016. Jali: an animator-centric viseme model for expressive lip synchronization. ACM Transactions on graphics (TOG) 35, 4 (2016), 1–11.
- Fan et al. (2022) Yingruo Fan, Zhaojiang Lin, Jun Saito, Wenping Wang, and Taku Komura. 2022. Faceformer: Speech-driven 3d facial animation with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 18770–18780.
- Hannun (2014) A Hannun. 2014. Deep Speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567 (2014).
- Haque and Yumak (2023) Kazi Injamamul Haque and Zerrin Yumak. 2023. FaceXHuBERT: Text-less speech-driven E(X)pressive 3D facial animation synthesis using self-supervised speech representation learning. In Proceedings of the 25th International Conference on Multimodal Interaction. 282–291.
- Ho et al. (2020) Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. Advances in neural information processing systems 33 (2020), 6840–6851.
- Hsu et al. (2021) Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. 2021. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM transactions on audio, speech, and language processing 29 (2021), 3451–3460.
- Karras et al. (2017) Tero Karras, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. 2017. Audio-driven facial animation by joint end-to-end learning of pose and emotion. ACM Transactions on Graphics (Aug 2017), 1–12. https://doi.org/10.1145/3072959.3073658
- Ke et al. (2024) Tsung-Wei Ke, Nikolaos Gkanatsios, and Katerina Fragkiadaki. 2024. 3d diffuser actor: Policy diffusion with 3d scene representations. arXiv preprint arXiv:2402.10885 (2024).
- Li et al. (2017) Tianye Li, Timo Bolkart, Michael J Black, Hao Li, and Javier Romero. 2017. Learning a model of facial shape and expression from 4D scans. ACM Trans. Graph. 36, 6 (2017), 194–1.
- Ma et al. (2024) Zhiyuan Ma, Xiangyu Zhu, Guojun Qi, Chen Qian, Zhaoxiang Zhang, and Zhen Lei. 2024. DiffSpeaker: Speech-Driven 3D Facial Animation with Diffusion Transformer. arXiv preprint arXiv:2402.05712 (2024).
- Ng et al. (2022) Evonne Ng, Hanbyul Joo, Liwen Hu, Hao Li, Trevor Darrell, Angjoo Kanazawa, and Shiry Ginosar. 2022. Learning to listen: Modeling non-deterministic dyadic facial motion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 20395–20405.
- Richard et al. (2021) Alexander Richard, Michael Zollhöfer, Yandong Wen, Fernando De la Torre, and Yaser Sheikh. 2021. Meshtalk: 3d face animation from speech using cross-modality disentanglement. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 1173–1182.
- Song et al. (2020) Jiaming Song, Chenlin Meng, and Stefano Ermon. 2020. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020).
- Stan et al. (2023) Stefan Stan, Kazi Injamamul Haque, and Zerrin Yumak. 2023. Facediffuser: Speech-driven 3d facial animation synthesis using diffusion. In Proceedings of the 16th ACM SIGGRAPH Conference on Motion, Interaction and Games. 1–11.
- Sun et al. (2024) Zhiyao Sun, Tian Lv, Sheng Ye, Matthieu Lin, Jenny Sheng, Yu-Hui Wen, Minjing Yu, and Yong-jin Liu. 2024. Diffposetalk: Speech-driven stylistic 3d facial animation and head pose generation via diffusion models. ACM Transactions on Graphics (TOG) 43, 4 (2024), 1–9.
- Taylor et al. (2017) Sarah Taylor, Taehwan Kim, Yisong Yue, Moshe Mahler, James Krahe, Anastasio Garcia Rodriguez, Jessica Hodgins, and Iain Matthews. 2017. A deep learning approach for generalized speech animation. ACM Transactions On Graphics (TOG) 36, 4 (2017), 1–11.
- Thambiraja et al. (2023) Balamurugan Thambiraja, Sadegh Aliakbarian, Darren Cosker, and Justus Thies. 2023. 3DiFACE: Diffusion-based Speech-driven 3D Facial Animation and Editing. arXiv preprint arXiv:2312.00870 (2023).
- Van Den Oord et al. (2017) Aaron Van Den Oord, Oriol Vinyals, et al. 2017. Neural discrete representation learning. Advances in neural information processing systems 30 (2017).
- Vaswani (2017) A Vaswani. 2017. Attention is all you need. Advances in Neural Information Processing Systems (2017).
- Xing et al. (2023) Jinbo Xing, Menghan Xia, Yuechen Zhang, Xiaodong Cun, Jue Wang, and Tien-Tsin Wong. 2023. Codetalker: Speech-driven 3d facial animation with discrete motion prior. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12780–12790.
- Yang et al. (2024) Karren D Yang, Anurag Ranjan, Jen-Hao Rick Chang, Raviteja Vemulapalli, and Oncel Tuzel. 2024. Probabilistic Speech-Driven 3D Facial Motion Synthesis: New Benchmarks Methods and Applications. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 27294–27303.
- Yang et al. (2023) Ling Yang, Zhilong Zhang, Yang Song, Shenda Hong, Runsheng Xu, Yue Zhao, Wentao Zhang, Bin Cui, and Ming-Hsuan Yang. 2023. Diffusion models: A comprehensive survey of methods and applications. Comput. Surveys 56, 4 (2023), 1–39.
- Ze et al. (2024) Yanjie Ze, Gu Zhang, Kangning Zhang, Chenyuan Hu, Muhan Wang, and Huazhe Xu. 2024. 3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations. In ICRA 2024 Workshop on 3D Visual Representations for Robot Manipulation.
- Zhou et al. (2018) Yang Zhou, Zhan Xu, Chris Landreth, Evangelos Kalogerakis, Subhransu Maji, and Karan Singh. 2018. Visemenet: Audio-driven animator-centric speech animation. ACM Transactions on Graphics (TOG) 37, 4 (2018), 1–10.