High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model

Weizhi Zhong, Junfan Lin, Peixin Chen, Liang Lin, and Guanbin Li This work was supported in part by the National Natural Science Foundation of China (NO. 62322608). (Corresponding author is Guanbin Li).W. Zhong, J. Lin, P. Chen, L. Lin and G. Li are with the School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China. E-mail: [email protected], [email protected], [email protected], [email protected], [email protected]
Abstract

Audio-driven talking face video generation has attracted increasing attention due to its huge industrial potential. Some previous methods focus on learning a direct mapping from audio to visual content. Despite progress, they often struggle with the ambiguity of the mapping process, leading to flawed results. An alternative strategy involves facial structural representations (e.g., facial landmarks) as intermediaries. This multi-stage approach better preserves the appearance details but suffers from error accumulation due to the independent optimization of different stages. Moreover, most previous methods rely on generative adversarial networks, prone to training instability and mode collapse. To address these challenges, our study proposes a novel landmark-based diffusion model for talking face generation, which leverages facial landmarks as intermediate representations while enabling end-to-end optimization. Specifically, we first establish the less ambiguous mapping from audio to landmark motion of lip and jaw. Then, we introduce an innovative conditioning module called TalkFormer to align the synthesized motion with the motion represented by landmarks via differentiable cross-attention, which enables end-to-end optimization for improved lip synchronization. Besides, TalkFormer employs implicit feature warping to align the reference image features with the target motion for preserving more appearance details. Extensive experiments demonstrate that our approach can synthesize high-fidelity and lip-synced talking face videos, preserving more subject appearance details from the reference image.

Index Terms:
Talking face generation, Landmark-based, End-to-End optimization, Diffusion.

I Introduction

Audio-driven talking face video generation, a challenging task of cross-modal synthesis, aims to create talking videos with lip movements that are accurately synchronized with the input audio. This task has attracted increasing interest from the research community due to its broad applications such as visual dubbing [1], virtual avatars [2], and digital humans [3]. To generate high-fidelity talking face videos, a prevalent manner is to gather data of the target individual to learn a person-specific model [4, 5, 6, 7]. While effective, these person-specific methods are hindered by the costly data collection and extensive training process. In contrast, person-generic methods can generalize to unseen subjects without further training. However, these methods often grapple with challenges in maintaining the appearance details of subjects as well as lip-audio synchronization. In this study, we endeavor to develop a person-generic talking face generation framework to generate high-fidelity facial details for general subjects while ensuring accurate lip-audio synchronization.

Refer to caption
Figure 1: An illustration of three strategies to learn the audio-visual relationship for talking face generation. (a) Previous methods optimize a network to learn the direct audio-visual mapping, resulting in flawed results. (b) Previous methods optimize a network to learn the mapping from audio to landmark motion, then separately optimize another mapping from the motion representation to a realistic face, suffering from the inaccuracies of the pre-estimated intermediate representation. (c) Our method leverages facial landmarks as an intermediate representation while enabling end-to-end optimization to reduce the error accumulation resulting from pre-estimated landmark inaccuracies.

To achieve faithful talking face generation, two critical issues need to be considered: the high fidelity of the subject appearance details and the synchronization of lip movement with the audio input. To generate lip-synced talking videos, prior methods [8, 9, 10] directly model the mapping between the audio signal and visual content. However, such audio-visual mapping is often uncertain and ambiguous, as one phonetic unit potentially matches various visual forms due to the diversities in illumination, emotion, and appearance. This often leads to flawed results and the loss of details in subject appearance. To ease the ambiguity for improved fidelity of subject appearance, another line of works [11, 1, 12, 13, 14, 15] instead first establishes a correlation between the audio signals and intermediate structural representations such as facial landmarks or the 3D Morphable Model (3DMM) [16], which primarily reflect the motion information and facilitate the less ambiguous mapping from audio to motion. Then, another stage converts these motion representations into realistic facial images. However, a notable drawback of these approaches is the isolated training of different stages, potentially resulting in inaccuracies in lip-audio synchronization stemming from errors in the pre-estimated structural representations. Additionally, prior methods [8, 9, 15, 11] frequently rely on either misaligned reference images or images warped by imprecise optical flow predictions as generation conditions, neglecting the influence of such misalignment on the preservation of appearance details. Moreover, most previous approaches employ generative adversarial networks (GANs) [17] for talking face generation, which often suffer from training instability and the issue of mode collapse.

To tackle the challenges above, we propose an innovative landmark-based diffusion model to learn the audio-visual relation. Figure 1 compares previous methods and ours. Our method utilizes facial landmarks as an intermediate representation to ease the ambiguity of audio-visual mapping while enabling the end-to-end optimization of distinct stages, facilitating the generation of high-fidelity and lip-synced talking face videos. Specifically, our two-stage approach initially converts the input audio signal into a set of landmark representations using a landmark completion module [11]. To enable the end-to-end optimization of distinct stages and align the synthesized motion with the motion represented by landmarks, we devise a novel conditioning module called TalkFormer to integrate landmark representations into the diffusion model using differentiable cross-attention. This end-to-end approach significantly reduces error accumulation resulting from pre-estimated landmark inaccuracies, thereby improving lip-audio synchronization. Besides, to enhance the fidelity of facial appearance, our approach converts a reference facial image into multi-scale features capturing intricate details. Then, our proposed TalkFormer module spatially aligns these features with the target motion using an implicit warping technique. Without using imprecise optical flow, the implicit warping automatically establishes semantic correlations for improved alignment between the reference features and the synthesized content. These aligned reference features are then integrated into the denoising process, enhancing the preservation of subject appearance details from the reference image. Our extensive experiments validate our framework’s effectiveness in producing realistic talking face videos with high-fidelity subject appearance and lip movements accurately synchronized to the input audio. We summarize our contributions as follows:

  • We propose a novel method to learn the audio-visual relationship for generating high-fidelity and lip-synced talking face videos, utilizing facial landmarks as intermediate representation to ease the ambiguity of audio-visual mapping while enabling end-to-end optimization to minimize error accumulation.

  • We introduce a novel conditioning module, TalkFormer, to align the synthesized motion with the motion represented by landmarks in a differentiable manner, enabling the joint optimization of distinct stages. Additionally, TalkFormer aligns the reference image features with the target motion based on semantic correlations, enhancing the preservation of subject appearance details.

  • We conduct comprehensive experiments to demonstrate our method’s effectiveness in producing high-fidelity and lip-synced talking face videos, which can generalize to any unseen subject without additional fine-tuning.

II Related Work

II-A Audio-Driven Talking Face Generation

Audio-driven talking face video generation techniques can be mainly divided into two types: person-specific or person-generic. Many person-specific methods can generate vivid videos [4, 5, 6, 7, 18, 19], but they require videos of the target subject for additional training. On the contrary, person-generic methods [8, 9, 11, 20, 1, 10, 12, 14, 13] enable inference on unseen subjects without any retraining or fine-tuning. However, there is still a gap in achieving high-fidelity and lip-synced talking face generation for person-generic methods.

Wav2Lip [8], PD-FGC [10], and PC-AVS [20] attempt to generate lip-synced videos by directly conditioning the generator on the audio representation. Nevertheless, these approaches exhibit notable flaws and loss of appearance details in the subjects due to the inherent uncertainty and ambiguity in audio-visual mapping. IP-LAP [11] and other methods [1, 12] propose a two-stage framework that utilizes facial landmarks as an intermediate representation. However, the projection of their intermediate landmark representation into the sketch image is non-differentiable. Subsequently, an image-to-image translation network is employed to synthesize realistic faces from the sketch images. Therefore, these methods train distinct stages independently and suffer from the inaccuracies of pre-estimated landmarks. Besides, prior methods [14, 15, 13, 21] including IP-LAP [11] utilize estimated optical flow to align the reference image with the target facial expression and pose, such that more appearance details from the reference image can be preserved during the generation process. However, accurate optical flow estimation is challenging, especially when there is significant variation in head pose, leading to distorted results. To tackle the drawbacks of GANs [17], namely unstable training and mode collapse, DiffTalk [9] crafts a diffusion model to learn the direct audio-to-visual mapping for generalized talking face generation. It directly models the audio-to-lip translation with the landmarks of the upper-half face concatenated as an auxiliary condition. However, the landmarks of the upper-half face are insufficient to alleviate the uncertainty of the direct audio-to-visual mapping. Besides, its usage of a misaligned reference image hinders the preservation of subject appearance details from the reference image. Recently, GAIA [22] proposes to disentangle motion and appearance using a VAE [23] and utilizes diffusion models to predict motion from the speech. However, it cannot achieve end-to-end learning of the framework to reduce error accumulation. Besides, it leverages misaligned reference appearance features as generation conditions, which hinders the preservation of facial details from reference images.

Contrasting with prior approaches, our framework employs facial landmarks as an intermediate representation while enabling end-to-end optimization. Our novel conditioning module, TalkFormer, integrates the landmark representation into the diffusion model in a differentiable way and aligns the reference appearance features based on semantic correlations, facilitating the generation of high-fidelity and lip-synced talking face videos.

II-B Diffusion Models

Diffusion models [24, 25] have recently emerged as a promising type of generative models, exhibiting superior generation power and enhanced training stability when compared to GANs [17]. Denoising diffusion probabilistic models (DDPM) [25] are a class of latent variable models that use a diffusion process to gradually add noise to the data and learn a denoiser network to reverse the diffusion process. Recently, Denoising Diffusion Implicit Models (DDIM) [26] were proposed to accelerate the sampling process of diffusion models through a class of non-Markovian diffusion processes. Latent diffusion models (LDM) [27] apply the diffusion model in the latent space of powerful pre-trained autoencoders to save computational resources while retaining the generation quality. Diffusion models have recently demonstrated remarkable success in various synthesis tasks [28, 29, 30, 31, 32], including text-to-image generation [33, 34, 27], image editing [35, 36], person image synthesis [37, 38], face video editing [28], and face restoration [30, 39]. Stable Diffusion [27] conditions the latent diffusion model on the CLIP [40] text embedding, achieving compelling text-to-image generation. ControlNet [33] proposes a neural network architecture that incorporates spatial conditions (e.g., sketch images) into pre-trained diffusion models.

For talking face video generation with diffusion models, previous methods [9, 41, 42, 43] make an early attempt to learn the direct audio-visual mapping by conditioning the diffusion model on audio features, but often produce flawed results due to the ambiguity of the mapping. To alleviate this problem, a straightforward method is to involve facial landmarks as an intermediate representation and integrate the landmarks predicted from audio into the diffusion model using the ControlNet [33] architecture. However, a notable drawback of this manner is the isolated training of the audio-to-landmark and landmark-to-video stages, resulting in sub-optimal lip synchronization. This is because the projection from landmark coordinates into a sketch image is non-differentiable. In contrast, in this study, we propose an innovative talking face video generation framework based on efficient latent diffusion models, which leverages facial landmarks as an intermediate representation and enables end-to-end optimization of distinct stages. Our method first predicts the lip and jaw landmark coordinates from the audio signal, and a novel conditioning module, TalkFormer, integrates the landmarks into the diffusion model in a differentiable manner.

III Methodology

Refer to caption
Figure 2: An overview of the proposed framework. The diffusion and reverse denoising operations are executed in the encoded latent space of an autoencoder $\mathcal{D}(\mathcal{E}(\cdot))$. (1) Initially, the audio signal drives the completion of lip and jaw landmarks, guided by reference full-face and upper half-face pose landmarks. The completed lip and jaw landmarks are then combined with the input pose landmarks to form the target full-face landmarks. (2) The conditioning module, TalkFormer, aligns the synthesized motion with the motion represented by target landmarks via differentiable cross-attention layers. To capture the intricate appearance details, a reference face image is encoded into multi-scale reference features. TalkFormer then aligns these features with the target motion via an implicit warping mechanism implemented by cross-attention layers. The skip-connections of the U-Net are omitted for clarity.

The framework overview of our method is presented in Figure 2. For talking face video generation, our method takes as input audio and a template video with the lower-half face masked. The framework then inpaints these areas with realistic content synchronized with the audio. Information about the subject appearance and facial contours is derived from a single reference image and reference full-face landmarks from the template video, respectively. More specifically, the audio signal drives the completion of lip and jaw landmarks, guided by reference full-face landmarks and pose landmarks detected from the upper-half face of the template video. The completed landmarks, as well as the reference image, are then fed into the latent diffusion model via TalkFormer, influencing the synthesized motion and appearance. During training, the network diffuses the lower half of the ground-truth face in latent space and learns to remove the added noise. Sections III-A, III-B, and III-C introduce latent diffusion models and provide a comprehensive explanation of our approach.

III-A Preliminaries of Latent Diffusion Models

Latent diffusion models [27] carry out the diffusion and denoising processes in the encoded latent space of an autoencoder $\mathcal{D}(\mathcal{E}(\cdot))$, with $\mathcal{E}(\cdot)$ being the encoder and $\mathcal{D}(\cdot)$ being the decoder. A U-Net-based [44] denoising network $\epsilon_{\theta}(z_{t},t)$ is trained to predict the noise added to the image latent $z_{0}$, where $z_{0}=\mathcal{E}(x)$, $x$ is the input image, $z_{t}$ represents the noisy version of $z_{0}$ at time step $t\in\{1,2,...,T\}$, and $\theta$ refers to the learnable parameters. The optimization objective during training is as follows:

$\mathcal{L}_{ldm}=\mathbb{E}_{z,\epsilon\sim\mathcal{N}(0,1),t}\left[\left\|\epsilon-\epsilon_{\theta}\left(z_{t},t\right)\right\|_{2}^{2}\right]$ (1)

where $\epsilon$ is the ground-truth noise added to the image latent $z_{0}$ and $t$ is uniformly sampled from $\{1,...,T\}$. During the inference phase, these models progressively denoise a normally distributed variable $z_{T}\sim\mathcal{N}(0,1)$ until it reaches a clean latent $\hat{z}_{0}$. This clean latent variable can then be decoded by $\mathcal{D}$ to synthesize realistic images.
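For concreteness, a minimal PyTorch sketch of this training objective is given below; the encoder/denoiser callables and the linear noise schedule are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn.functional as F

T = 1000                                    # number of diffusion steps
betas = torch.linspace(1e-4, 2e-2, T)       # assumed linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def ldm_loss(denoiser, encoder, x, cond=None):
    """Denoising objective of Eq. (1) for a batch of images x."""
    z0 = encoder(x)                                      # image -> latent z_0
    t = torch.randint(0, T, (z0.shape[0],), device=z0.device)
    eps = torch.randn_like(z0)                           # ground-truth noise
    a = alphas_cumprod.to(z0.device)[t].view(-1, 1, 1, 1)
    zt = a.sqrt() * z0 + (1 - a).sqrt() * eps            # diffuse z_0 -> z_t
    eps_pred = denoiser(zt, t, cond)                     # epsilon_theta(z_t, t)
    return F.mse_loss(eps_pred, eps)                     # || eps - eps_theta ||_2^2
```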

In our framework, both the diffusion and denoising processes are performed exclusively on the lower half of the encoded latent, with the remaining upper half also being incorporated into the denoising U-Net to provide more context. During inference, the masked input face is encoded as $z^{m}_{0}$, of which the lower half is diffused to obtain the initial $z_{T}$. During training, the ground-truth face is first encoded to $z_{0}$ and subsequently diffused to $z_{t}$ by the noise.
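The lower-half-only diffusion can be sketched as follows; the assumption that the bottom rows of the latent correspond to the lower half of the face is ours, for illustration only.

```python
import torch

def diffuse_lower_half(z0, eps, alpha_bar_t):
    """z0, eps: (B, C, H, W) latents; alpha_bar_t: cumulative alpha at step t."""
    H = z0.shape[2]
    zt = z0.clone()
    lower = slice(H // 2, H)                        # assumed: bottom rows = lower half of the face
    zt[:, :, lower, :] = (
        alpha_bar_t ** 0.5 * z0[:, :, lower, :]
        + (1 - alpha_bar_t) ** 0.5 * eps[:, :, lower, :]
    )
    return zt                                       # upper half stays clean to provide context
```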

III-B Audio-driven Landmark Completion

Instead of directly conditioning the diffusion model on the audio signal, our framework first establishes the less ambiguous mapping from audio to the landmark motion of the lip and jaw. Following [11], we devise a transformer-based landmark completion module to predict the lip and jaw landmarks from the input audio. Specifically, we first encode the pose landmarks from the upper half of the face into a pose embedding using a 1D convolutional module. For facial contour information, we extract reference full-face landmarks from $N$ video frames within the input video, and encode them into $N$ reference embeddings via another 1D convolutional module. The mel spectrogram of the input audio is encoded into an audio embedding by a 2D convolutional module. These pose, reference, and audio embeddings are subsequently fed into a transformer encoder to predict the lip and jaw landmark coordinates, denoted as $\hat{C}^{\tau}_{lip}\in\mathbb{R}^{2\times n_{l}}$ and $\hat{C}^{\tau}_{jaw}\in\mathbb{R}^{2\times n_{j}}$, respectively, where $\tau$ indicates landmarks for the $\tau$-th frame, and $n_{l}$ and $n_{j}$ are the numbers of landmarks used to represent the lip and jaw, respectively. To ensure temporally stable landmark prediction, we adopt the batched sequential training strategy following the common practice of previous methods [45, 46, 47]. Specifically, the completion module predicts landmarks of $L$ successive frames for each video during training. The training objective for the landmark completion module is defined as follows:

$\mathcal{L}_{1}=\sum_{i=0}^{L-1}\left(\|\hat{C}^{\tau+i}_{lip}-C^{\tau+i}_{lip}\|_{1}+\|\hat{C}^{\tau+i}_{jaw}-C^{\tau+i}_{jaw}\|_{1}\right)$ (2)

where $C^{\tau+i}_{lip}$ and $C^{\tau+i}_{jaw}$ are the ground-truth landmark coordinates of the lip and jaw, respectively. The predicted lip and jaw landmarks are then combined with the input pose landmarks to form the comprehensive target full-face landmarks.
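A hedged sketch of such a completion module is shown below; the embedding dimensions, pooling choices, and token layout are assumptions, and only the interface (pose, reference, and audio inputs mapped to lip and jaw coordinates) follows the description above.

```python
import torch
import torch.nn as nn

class LandmarkCompletion(nn.Module):
    def __init__(self, n_lip=41, n_jaw=16, d=256):
        super().__init__()
        self.pose_enc = nn.Conv1d(2, d, kernel_size=3, padding=1)    # upper-half pose landmarks
        self.ref_enc = nn.Conv1d(2, d, kernel_size=3, padding=1)     # reference full-face landmarks
        self.audio_enc = nn.Sequential(                              # mel spectrogram encoder
            nn.Conv2d(1, d, kernel_size=3, padding=1), nn.AdaptiveAvgPool2d(1))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True), num_layers=4)
        self.head = nn.Linear(d, 2 * (n_lip + n_jaw))
        self.n_lip, self.n_jaw = n_lip, n_jaw

    def forward(self, pose_lmk, ref_lmks, mel):
        # pose_lmk: (B, 2, n_pose); ref_lmks: (B, N, 2, n_full); mel: (B, 1, 80, T_mel)
        B, N = ref_lmks.shape[:2]
        pose = self.pose_enc(pose_lmk).mean(-1, keepdim=True).transpose(1, 2)  # (B, 1, d)
        ref = self.ref_enc(ref_lmks.flatten(0, 1)).mean(-1).view(B, N, -1)     # (B, N, d)
        audio = self.audio_enc(mel).flatten(1).unsqueeze(1)                    # (B, 1, d)
        tokens = self.encoder(torch.cat([pose, ref, audio], dim=1))
        out = self.head(tokens[:, 0])                                          # pooled token -> coords
        lip, jaw = out.split([2 * self.n_lip, 2 * self.n_jaw], dim=-1)
        return lip.view(B, 2, self.n_lip), jaw.view(B, 2, self.n_jaw)
```

The objective of Equation 2 is then simply the summed absolute error between predicted and ground-truth coordinates over the $L$ successive frames.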

However, the objective in Equation 2 is insufficient to synchronize the lip and jaw motion with input audio due to potential inaccuracies inherent in the pre-estimated ground-truth landmarks. Therefore, we expect the predicted landmarks to be integrated into the image generation stage (Section III-C) in a differentiable manner, enabling end-to-end optimization to improve lip synchronization.

III-C Inpainting Lower Half via Latent Diffusion Model

As GANs [17] suffer from training instability and mode collapse, we resort to powerful latent diffusion models [27] to inpaint the lower half of the face, conditioned on the completed landmarks and the reference image. For the end-to-end optimization of the whole framework and improved alignment between the reference image and the synthesized content, we introduce a novel conditioning module called TalkFormer, as demonstrated in the green section of Figure 2. TalkFormer aligns the synthesized motion with the motion represented by facial landmarks via differentiable cross-attention [48], and aligns the reference image features in an implicit warping manner implemented by another cross-attention layer. In our denoiser U-Net, TalkFormer modules exist at all scales except the first, which only contains residual convolution blocks. In the following subsections, we detail the core components of TalkFormer and the reference appearance encoder for encoding the reference facial image.

III-C1 TalkFormer: Align Talking Motion Differentiably

Previous works [11, 1, 12] project the intermediate landmark representation onto the image plane, forming a sketch image as the generation condition in a non-differentiable manner. In contrast, our TalkFormer first uses a 1D-convolution embedding module to encode the target full-face landmarks from the landmark completion module into $n$ landmark embeddings $\{e_{i}, i=1,2,...,n\}$, where $n$ is the number of landmarks used to represent the full face. Then, these landmark embeddings are integrated into cross-attention layers as keys and values, denoted as $K_{1}$ and $V_{1}$, respectively. Simultaneously, the queries $Q_{1}$ are extracted from the hidden features after the ResNet [49] blocks through an MLP layer. The output of cross-attention is computed as $Y$ according to the following equation:

$Y=\operatorname{Softmax}\left(\frac{Q_{1}K_{1}^{\top}}{\sqrt{d_{1}}}\right)V_{1}$ (3)

where $d_{1}$ is the dimension of the queries and keys. Subsequently, the result $Y$ goes through a zero-initialized convolution layer and is added to the hidden features of the U-Net in a residual manner. In this way, the final generated face is ensured to have talking motion aligned with the motion represented by the landmarks, and the diffusion model can be jointly optimized with the landmark completion module for improved lip synchronization.
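A minimal sketch of this motion-alignment cross-attention is given below; the projection dimensions and module names are assumed for illustration, while the zero-initialized convolution and residual addition follow the description above.

```python
import torch
import torch.nn as nn

class MotionAlignAttention(nn.Module):
    def __init__(self, hidden_dim, lmk_dim, d1=64):
        super().__init__()
        self.to_q = nn.Linear(hidden_dim, d1)       # queries Q1 from hidden features
        self.to_k = nn.Linear(lmk_dim, d1)          # keys K1 from landmark embeddings
        self.to_v = nn.Linear(lmk_dim, hidden_dim)  # values V1 from landmark embeddings
        self.zero_conv = nn.Conv2d(hidden_dim, hidden_dim, kernel_size=1)
        nn.init.zeros_(self.zero_conv.weight)
        nn.init.zeros_(self.zero_conv.bias)
        self.d1 = d1

    def forward(self, h, lmk_emb):
        # h: (B, D, H, W) hidden features; lmk_emb: (B, n, lmk_dim) landmark embeddings e_i
        B, D, H, W = h.shape
        q = self.to_q(h.flatten(2).transpose(1, 2))              # (B, HW, d1)
        k, v = self.to_k(lmk_emb), self.to_v(lmk_emb)            # (B, n, d1), (B, n, D)
        attn = torch.softmax(q @ k.transpose(1, 2) / self.d1 ** 0.5, dim=-1)
        y = (attn @ v).transpose(1, 2).reshape(B, D, H, W)       # Y of Eq. (3)
        return h + self.zero_conv(y)                             # zero-init conv + residual
```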

III-C2 Reference Appearance Encoder

To enable generalized talking face generation, a single reference face image is typically utilized as a condition, ensuring that the synthesized appearance remains consistent with the subject's appearance. As illustrated in the pink section of Figure 2, the reference face image is initially encoded into the latent space as $z_{r}$. To retain more fine-grained details from the reference face image, we devise an appearance encoder, similar to the U-Net encoder, consisting of residual convolution blocks. This appearance encoder converts the latent $z_{r}$ into multi-scale reference features, denoted as $F_{a}=\{F_{a}^{i}\mid i=1,2,...,I\}$, where $I$ represents the number of scales in the U-Net. The dimensions of these features are identical to those of the hidden features in the encoder of the U-Net denoiser.
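A rough sketch of such an encoder is shown below; the channel widths and downsampling choices are assumptions, and the residual connections inside each block are omitted for brevity.

```python
import torch
import torch.nn as nn

class AppearanceEncoder(nn.Module):
    """Maps the reference latent z_r to multi-scale features F_a^1..F_a^I."""
    def __init__(self, in_ch=4, chans=(128, 256, 512, 512)):
        super().__init__()
        self.stages = nn.ModuleList()
        prev = in_ch
        for i, c in enumerate(chans):
            layers = [nn.Conv2d(prev, c, 3, padding=1), nn.SiLU(),
                      nn.Conv2d(c, c, 3, padding=1), nn.SiLU()]
            if i > 0:
                layers.insert(0, nn.AvgPool2d(2))   # downsample between scales
            self.stages.append(nn.Sequential(*layers))
            prev = c

    def forward(self, z_r):
        feats, h = [], z_r                          # z_r: (B, 4, 64, 64) reference latent
        for stage in self.stages:
            h = stage(h)
            feats.append(h)                         # shapes mirror the U-Net encoder features
        return feats
```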

III-C3 TalkFormer: Align Reference Appearance Features

To make the denoiser model aware of more appearance details from the reference image, we align the multi-scale reference appearance features in an implicit warping manner through another cross-attention layer. Specifically, we denote the hidden features after talking motion alignment as $F_{h}^{i}\in\mathbb{R}^{D\times H\times W}$, where the scale $i\in\{2,3,\dots,I\}$. To spatially align the reference appearance features $F^{i}_{a}$ with the hidden features $F_{h}^{i}$, $F^{i}_{h}$ is first transformed into the queries $Q_{2}\in\mathbb{R}^{HW\times d_{2}}$ through an MLP layer, while $F^{i}_{a}$ is projected to the keys $K_{2}\in\mathbb{R}^{HW\times d_{2}}$ and values $V_{2}\in\mathbb{R}^{HW\times D}$, where $d_{2}$ is the dimension of the keys. Then, the correlation matrix between $F^{i}_{a}$ and $F^{i}_{h}$ is computed as follows:

$S=\operatorname{Softmax}\left(\frac{Q_{2}K_{2}^{\top}}{\sqrt{d_{2}}}\right)$ (4)

where $S\in\mathbb{R}^{HW\times HW}$, and each element $s_{jk}$ indicates the semantic correspondence between the hidden feature at location $j$ and the reference feature at location $k$, with $j,k\in\{1,2,...,HW\}$. Based on this correlation matrix, we can obtain the aligned reference features by referring to the relevant features in the reference appearance features. Specifically, the reference appearance features $F^{i}_{a}$ are warped implicitly via a weighted sum of the values in $V_{2}$ as follows:

$\bar{F^{i}_{a}}=\operatorname{Reshape}(SV_{2})$ (5)

where $\bar{F^{i}_{a}}\in\mathbb{R}^{D\times H\times W}$, and its semantic contents are spatially aligned with those of the hidden features $F^{i}_{h}$. Eventually, the aligned reference appearance features $\bar{F^{i}_{a}}$ are passed through a zero-initialized convolution layer and added to the hidden features $F_{h}^{i}$ in a residual manner. Consequently, the denoising process can better preserve the subject appearance details from the reference image, facilitating high-fidelity talking face video generation.
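The implicit warping of Equations 4 and 5 can be sketched as a spatial cross-attention between the hidden and reference features; as before, the dimensions and names are assumptions and only the attention structure follows the text.

```python
import torch
import torch.nn as nn

class ReferenceAlignAttention(nn.Module):
    def __init__(self, dim, d2=64):
        super().__init__()
        self.to_q = nn.Linear(dim, d2)   # Q2 from hidden features F_h^i
        self.to_k = nn.Linear(dim, d2)   # K2 from reference features F_a^i
        self.to_v = nn.Linear(dim, dim)  # V2 from reference features F_a^i
        self.zero_conv = nn.Conv2d(dim, dim, kernel_size=1)
        nn.init.zeros_(self.zero_conv.weight)
        nn.init.zeros_(self.zero_conv.bias)
        self.d2 = d2

    def forward(self, f_h, f_a):
        # f_h, f_a: (B, D, H, W) hidden and reference appearance features
        B, D, H, W = f_h.shape
        q = self.to_q(f_h.flatten(2).transpose(1, 2))     # (B, HW, d2)
        k = self.to_k(f_a.flatten(2).transpose(1, 2))     # (B, HW, d2)
        v = self.to_v(f_a.flatten(2).transpose(1, 2))     # (B, HW, D)
        s = torch.softmax(q @ k.transpose(1, 2) / self.d2 ** 0.5, dim=-1)  # S of Eq. (4)
        warped = (s @ v).transpose(1, 2).reshape(B, D, H, W)               # Eq. (5)
        return f_h + self.zero_conv(warped)               # zero-init conv + residual
```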

III-D Optimization

Benefiting from TalkFormer, the joint optimization of the landmark completion module and the latent diffusion model can be achieved by employing the following objective function:

$\mathcal{L}_{total}=\mathcal{L}_{ldm}+\lambda\mathcal{L}_{1}$ (6)

where $\mathcal{L}_{ldm}$ is the denoising objective defined in Equation 1 and $\lambda$ represents the weight assigned to the $\mathcal{L}_{1}$ loss term (Equation 2). In this way, the $\mathcal{L}_{ldm}$ loss guides the landmark completion module to predict more accurate landmarks for better denoising, thus enhancing lip-audio synchronization. Like the landmark completion module, the latent diffusion model adopts the batched sequential training strategy [45, 46, 47], where $L$ successive images are synthesized for each video during training, improving temporal continuity.
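A hedged sketch of one joint training step is given below; the batch keys and the denoising_loss helper are illustrative placeholders, not the released API.

```python
import torch

LAMBDA = 10.0   # weight of the landmark L1 term (see Section IV-A5)

def training_step(batch, completion_module, diffusion, optimizer):
    # 1) predict lip/jaw landmarks from audio and compute the L1 term (Eq. 2)
    lip_pred, jaw_pred = completion_module(batch["pose_lmk"], batch["ref_lmks"], batch["mel"])
    l1 = (lip_pred - batch["lip_gt"]).abs().sum() + (jaw_pred - batch["jaw_gt"]).abs().sum()

    # 2) form the target full-face landmarks and evaluate the denoising loss (Eq. 1);
    #    gradients flow back through TalkFormer into the completion module
    full_lmks = torch.cat([batch["pose_lmk"], jaw_pred, lip_pred], dim=-1)
    l_ldm = diffusion.denoising_loss(batch["frames"], full_lmks, batch["ref_image"])

    loss = l_ldm + LAMBDA * l1               # Eq. (6)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```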

IV Experiments

TABLE I: Quantitative comparisons between the proposed and previous state-of-the-art methods in person-generic talking face video generation. Here \uparrow denotes higher is better, and \downarrow indicates lower is better.
Method | VoxCeleb: PSNR↑ SSIM↑ LPIPS↓ FID↓ CSIM↑ SyncScore↑ | HDTF: PSNR↑ SSIM↑ LPIPS↓ FID↓ CSIM↑ SyncScore↑
Ground Truth | - - - - - 8.59 | - - - - - 8.87
Wav2Lip [8] | 24.17 0.81 0.173 78.20 0.52 9.33 | 22.51 0.78 0.232 87.99 0.60 9.43
PC-AVS [20] | 21.84 0.76 0.132 53.72 0.37 8.31 | 19.18 0.68 0.189 56.33 0.33 8.73
IP-LAP [11] | 24.17 0.83 0.114 44.02 0.55 3.39 | 22.60 0.83 0.118 34.92 0.67 3.62
PD-FGC [10] | 20.15 0.70 0.165 57.54 0.33 6.33 | 17.76 0.65 0.207 62.54 0.31 6.47
DiffTalk [9] | 22.43 0.74 0.119 44.00 0.51 1.42 | 22.03 0.74 0.121 30.16 0.56 2.30
Ours | 25.44 0.85 0.090 37.19 0.57 4.42 | 23.48 0.83 0.096 27.94 0.69 5.03

IV-A Experimental Setups

IV-A1 Dataset

We conduct experiments on two public audio-visual datasets, VoxCeleb [50] and HDTF [14]. VoxCeleb is a collection of over 100,000 utterances from 1,251 celebrities, all extracted from videos uploaded to YouTube. HDTF is a high-resolution audio-visual dataset consisting of approximately 362 distinct videos, spanning over 15.8 hours, in 720P or 1080P resolutions. Compared to HDTF, the large-scale VoxCeleb dataset is a more standard benchmark commonly used in prior work. To ensure a fair comparison, all comparison methods, including ours, are trained on the VoxCeleb dataset and evaluated using the test sets of both VoxCeleb and HDTF.

IV-A2 Evaluation Metric

We quantitatively evaluate all methods regarding visual quality and lip synchronization. Pixel-level visual quality is assessed through the Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity (SSIM) [51], while feature-level visual quality is evaluated using Learned Perceptual Image Patch Similarity (LPIPS) [52] and Fréchet Inception Distance (FID) [53]. Compared to pixel-level measurements, the feature-level measurements are more in line with human perception [52, 54]. Additionally, we employ the cosine similarity (CSIM) of identity vectors extracted by the ArcFace face recognition network [55] to assess the preservation of subject identity. SyncScore [56] is commonly used by prior work to evaluate the lip-audio synchronization quality, despite some limitations.
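For reference, the pixel-level metrics can be computed per frame as in the sketch below, assuming uint8 RGB frames and scikit-image >= 0.19; LPIPS, FID, CSIM, and SyncScore rely on their respective pretrained networks and are omitted here. Restricting the evaluation to the lower half of the crop is an assumption made for illustration, matching the protocol described in Section IV-B.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def frame_metrics(gt: np.ndarray, pred: np.ndarray):
    """gt, pred: (H, W, 3) uint8 frames of the cropped face."""
    h = gt.shape[0]
    gt_low, pred_low = gt[h // 2:], pred[h // 2:]   # evaluate only the lower half
    psnr = peak_signal_noise_ratio(gt_low, pred_low, data_range=255)
    ssim = structural_similarity(gt_low, pred_low, channel_axis=-1, data_range=255)
    return psnr, ssim
```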

IV-A3 Comparison Methods

We compare our approach against several state-of-the-art person-generic audio-driven talking face video generation methods. DiffTalk (CVPR'23) [9] constructs a diffusion-based framework for generalized talking face synthesis by conditioning the latent diffusion model on the audio signal. PD-FGC (CVPR'23) [10] employs a progressive disentangled representation learning strategy to achieve fine-grained controllable talking face synthesis (e.g., eye and pose control). IP-LAP (CVPR'23) [11] is a two-stage landmark-based method that trains different stages separately and utilizes predicted optical flow to align the reference image with the target pose and expression. PC-AVS (CVPR'21) [20] proposes a GAN-based framework to generate pose-controllable talking face videos by modularizing audio-visual representations. Wav2Lip (MM'20) [8] utilizes a lip sync discriminator to guide the generator in generating lip-synced talking face videos.

IV-A4 Comparison Setups

Wav2Lip [8], IP-LAP [11], DiffTalk [9], and our method all generate talking face videos by inpainting the lower half of the face. Therefore, during the quantitative comparison, the lower half of the face in the input video is masked. Then, these methods reconstruct the masked area guided by the input audio and reference image. The original input video serves as the ground truth for metric calculation. We train DiffTalk [9] using the official code until convergence, but it generates temporally unstable results. It relies on additional frame interpolation to smooth the results, affecting comparison fairness. Therefore, the frame interpolation post-processing was not employed for fair comparison. PC-AVS [20] utilizes a pose source video, an audio input, and a reference image to generate a talking face video. In our implementation, we substitute its pose source video with the ground-truth video. PD-FGC [10] requires a pose source video, an expression source video, an eye blink source video, an audio input, and a reference image to generate a talking face video. Our version replaces its pose source, expression source, and eye blink source videos with the ground-truth video.

IV-A5 Implementation Details

In our framework, the input face images are resized to 256×256, and the latent space of the autoencoder $\mathcal{D}(\mathcal{E}(\cdot))$ has a spatial dimension of 64×64. Facial landmarks are extracted from video frames using the mediapipe tool [57]. We represent the lip with $n_{l}=41$ landmarks, the jaw with $n_{j}=16$ landmarks, and the entire face with a total of $n=131$ landmarks. We set $N$ to 5, and the reference full-face landmarks are detected from randomly selected frames of the input videos. The hyper-parameter $\lambda$ is set to 10, and $L$ is set to 5. To generate talking face videos, we employ the DDIM [26] diffusion sampler with 200 steps. We set the number of diffusion steps $T$ to 1000. The number of scales $I$ is 4, but we illustrate the case of $I=3$ in Figure 2 for clarity. The reference face image can be any face image from the input video that reflects as many appearance details of the subject as possible. All comparison methods use the same reference face image to ensure a fair evaluation. Our implementation of the proposed method closely follows the code implementation of latent-diffusion [58], while incorporating the TalkFormer, Appearance Encoder, and Landmark Completion [11] modules as additional components. The Appearance Encoder is designed based on the encoder structure of the U-Net denoiser in latent-diffusion [58], excluding self-attention layers.

We train our framework on 2 NVIDIA A100 (40GB) GPUs for 500 epochs with the Adam optimizer [59]. The batch size is 64, and the learning rate is 4e-5. The pre-trained autoencoder is frozen during training. The Landmark Completion module is jointly trained from scratch with the latent diffusion model. Our method focuses on inpainting the lower half of the face based on the input audio. Hence, to generate talking face videos, we first crop the face area from the input template video as network input. After obtaining the generated face, we employ a post-processing technique following the previous method [11] to seamlessly blend it with the background, producing the final talking face videos. For fair comparison, this post-processing technique was not employed during the quantitative and qualitative comparisons. The input template video has no length requirement and can be looped to match the length of the input audio. We will release our code upon acceptance.
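For convenience, the hyper-parameters stated in this section are collected below as a single configuration dictionary; the dictionary itself is only a summary for the reader, not part of the released code.

```python
CONFIG = {
    "image_size": 256,          # input face resolution (256x256)
    "latent_size": 64,          # spatial size of the autoencoder latent (64x64)
    "n_lip": 41,                # lip landmarks
    "n_jaw": 16,                # jaw landmarks
    "n_full": 131,              # full-face landmarks
    "num_reference": 5,         # N reference full-face landmark sets
    "lambda_l1": 10.0,          # weight of the landmark L1 loss
    "seq_len": 5,               # L successive frames per video during training
    "ddim_steps": 200,          # DDIM sampling steps
    "diffusion_steps": 1000,    # T
    "num_scales": 4,            # I scales in the U-Net
    "batch_size": 64,
    "learning_rate": 4e-5,
    "epochs": 500,
}
```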

IV-B Quantitative Evaluation

We conduct a comprehensive quantitative comparison with state-of-the-art methods regarding visual quality and lip synchronization. The visual quality metrics are calculated solely on the lower half of the generated face, since the upper-half face generated by Wav2Lip [8], IP-LAP [11], DiffTalk [9], and ours is almost entirely inherited from the input video (i.e., the ground truth). The comparison results are reported in Table I.

IV-B1 Visual Quality

Our method outperforms other methods in all visual quality metrics (PSNR, SSIM, LPIPS, FID, CSIM) on both the VoxCeleb [50] and HDTF [14] datasets. Specifically, on the perceptual distance metrics LPIPS and FID, our method significantly improves over other methods. This verifies that our method can produce high-fidelity talking face videos that align with human perception, preserving more appearance details. Besides, the highest CSIM score achieved by ours also indicates that our method can preserve more identity information of the target subject. Although Wav2Lip lags only slightly behind our method in terms of the PSNR and SSIM metrics, its FID and LPIPS values are approximately twice as high as ours, suggesting the presence of artifacts in its results that are not aligned with human perception. The performance of IP-LAP is closely comparable to ours in terms of the PSNR, SSIM, and CSIM metrics, but there remains a certain gap between IP-LAP and ours when assessed on the LPIPS, FID, and SyncScore metrics. While DiffTalk exhibits comparable performance in the FID metric, it still lags behind our approach when assessed on the LPIPS and CSIM metrics.

IV-B2 Lip Synchronization

Due to different speaking styles among individuals, accurate quantitative assessment of lip-audio synchronization remains a persistently challenging task. A common practice is to calculate the SyncScore based on the audio and visual features of SyncNet [56]. Wav2Lip, PC-AVS, and PD-FGC directly model the audio-visual mapping and obtain better SyncScores than ours. However, our approach notably excels in visual quality metrics, particularly in preserving the finer details of subject appearance, an aspect where the others have room for improvement. Wav2Lip utilizes SyncNet [56] as a discriminator during training. Hence, it achieves a very high SyncScore, even higher than that of the ground truth. Besides, PD-FGC and PC-AVS adopt audio-visual contrastive learning similar to SyncNet [56], which contributes to a higher SyncScore but compromises visual quality. IP-LAP leverages facial landmarks as intermediate representations, but its lip synchronization is inferior to ours due to the isolated training of different stages. DiffTalk directly models the audio-visual mapping and generates temporally unstable videos with poor lip synchronization. We suspect this issue might stem from the mapping ambiguity magnified by the multi-step iterations of the diffusion model.

Refer to caption
Figure 3: Several representative visual comparisons. The subject on the left is from the VoxCeleb [50] dataset, while the subject on the right is from the HDTF [14] dataset. Our method achieves high fidelity of the subject appearance details with accurate lip shape. For more qualitative results, please refer to the supplementary video.

IV-C Qualitative Evaluation

IV-C1 Visual Comparison

As shown in Figure 3, we present some representative comparison results on the HDTF [14] and VoxCeleb [50] datasets. It can be observed that our results are visually closer to the ground-truth images than other methods’, with more appearance details (e.g., beard, lip, teeth) preserved. It implies that our TalkFormer module could effectively align the reference image features based on semantic correlation, providing valuable features for the diffusion model. Besides, the lip shapes of ours are also closer to the ground truth. DiffTalk, PD-FGC, PC-AVS, and Wav2Lip directly learn the audio-visual mapping and utilize a misaligned reference image as generation condition. Therefore, their results lose some appearance details of the subject and appear blurry. IP-LAP generates results that are blurrier than ours and exhibit some artifacts, possibly due to the inaccurate optical flow estimation for aligning reference images. For more qualitative comparisons, please refer to the supplementary video as detailed in the following subsection.

IV-C2 Supplementary Video

We have provided a short video as supplementary material. Please download and watch it. If there are any issues downloading the video from the review system, the same file can also be downloaded through this backup link. The time schedule of the video is as follows:

  • 00:00 ~ 00:08: Video title.

  • 00:08 ~ 00:46: Demonstration of Ours. We demonstrate a generated result of our method where the driving audio is sourced from the text-to-speech technique, and the subject is from the HDTF[14] dataset. We also visualize the intermediate landmarks after the landmark completion module.

  • 00:46 ~ 01:42: Method Comparison. We present the results of all methods as well as ground-truth videos for comparison.

  • 01:42 ~ 02:19: Ablation Study. We present the qualitative results of the ablation study, which will be further analyzed in Section IV-D.

  • 02:19 ~ 03:14: More results of our method. We provide the testing results of our method for more cases.

IV-C3 User Study

For a comprehensive evaluation, we conduct a user study where 16 volunteers are invited to assess the generated videos of all comparison methods. We randomly sample 10 videos for testing, 5 from the HDTF [14] dataset and 5 from the VoxCeleb [50] dataset. Volunteers are asked to give their ratings (0-5) for each generated video regarding image quality, lip-audio synchronization, and fidelity of appearance details. The videos are presented to participants in a random order, and the evaluation criteria are explained in detail to the participants. The mean opinion scores (MOS) of each method are presented in Table II. Our method receives better evaluations from participants across all three dimensions than other approaches.

TABLE II: User study results measured by Mean Opinion Scores. Scores range from 0 to 5, where higher scores denote superior performance.
Method | Appearance Fidelity | Lip Synchronization | Image Quality
Wav2Lip [8] | 2.85 | 3.67 | 2.82
PC-AVS [20] | 2.97 | 3.17 | 2.93
IP-LAP [11] | 2.88 | 2.88 | 2.79
PD-FGC [10] | 2.76 | 3.19 | 2.76
DiffTalk [9] | 2.52 | 1.24 | 2.01
Ours | 4.55 | 4.32 | 4.54

IV-D Ablation Study

In this section, we conduct an ablation study on the HDTF [14] dataset to verify the effectiveness of the proposed end-to-end framework and TalkFormer conditioning module. The numerical results are reported in Table III, while the qualitative results are presented in Figure 4, as well as in the supplementary video.

TABLE III: Ablation study results of removing individual components of the proposed approach. The terms “Ours w/o M-Align”, “Ours w/o R-Align” and “Ours w/o End2End” denote variants of our model, indicating the absence of talking motion alignment, reference appearance feature alignment, and end-to-end training, respectively.
Variants PSNR↑ SSIM↑ LPIPS↓ FID↓ CSIM↑ SyncScore↑
Ours w/o M-Align 20.82 0.75 0.159 50.28 0.687 1.12
Ours w/o R-Align 24.28 0.82 0.114 43.36 0.790 1.54
Ours w/o End2End 23.85 0.82 0.096 30.96 0.786 4.93
Ours 24.49 0.83 0.097 29.15 0.795 5.46

IV-D1 Effect of End-to-End Training

Our two-stage landmark-based method achieves the joint optimization of the landmark completion module and the latent diffusion model to improve lip synchronization. To validate the effectiveness of end-to-end optimization, we devise a variant where these two modules are trained separately. Specifically, the landmark completion module is optimized using only the $\mathcal{L}_{1}$ loss (Equation 2). In the denoiser U-Net of the latent diffusion model, TalkFormer accepts the pre-estimated ground-truth landmarks as the condition. The latent diffusion model is then optimized using only the $\mathcal{L}_{ldm}$ loss (Equation 1).

The numerical results of this variant are reported in the “Ours w/o End2End” row of Table III. In terms of visual quality metrics, this variant exhibits similar performance to our full model. However, the SyncScore of this variant drops 9.71% compared to the full model’s, verifying the effectiveness of end-to-end training in improving lip-audio synchronization. Besides, as seen in the “Ours w/o End2End” row of Figure 4, without the end-to-end optimization, the synthesized lip shapes are less accurate, while the visual quality remains similar to the full model’s.

IV-D2 Effect of Talking Motion Alignment in TalkFormer

Our TalkFormer module aligns the synthesized motion with the motion represented by landmarks via a cross-attention layer. To implement a variant without talking motion alignment, we remove the first cross-attention layer in TalkFormer that integrates the landmark embeddings as a condition. Following the practice of DiffTalk [9], we redesign the landmark embedding module, composed of multiple MLP layers, to encode the target full-face landmarks into a single landmark embedding. This landmark embedding is added to all spatial locations of the hidden features in the U-Net.

The numerical results of this variant are reported in the “Ours w/o M-Align” row of Table III. It can be seen that the SyncScore drops significantly compared to the full model’s. This is because the synthesized motion of the lip and jaw could not be accurately controlled through a simple addition operation. Besides, the addition operation may introduce artifacts into the generated results, deteriorating the visual quality metrics. As shown in the “Ours w/o M-Align” row of Figure 4, in the absence of TalkFormer’s talking motion alignment, the mouth remains predominantly closed. Besides, influenced by the simple addition operation, the generated mouths appear blurry and exhibit some artifacts. In the first and fourth images of the “Ours w/o M-Align” row, the generated mouth shapes are somewhat skewed, which implies a loss of subject identity information.

Refer to caption
Figure 4: Ablation study on the effectiveness of end-to-end optimization and TalkFormer module. “Ours w/o End2End” represents the variant without end-to-end training. “Ours w/o M-Align” represents the variant without talking motion alignment in TalkFormer. “Ours w/o R-Align” represents the variant without reference appearance features alignment in TalkFormer.

IV-D3 Effect of Reference Features Alignment in TalkFormer

To develop a variant without reference appearance feature alignment, we remove the second cross-attention layer in TalkFormer that incorporates the reference appearance features, and replace it with a self-attention layer akin to DDPM [25]. Besides, the appearance encoder is removed. Following the common practice of previous works [9, 8], the reference face image is first encoded into the latent space as $z_{r}$. The $z_{r}$ is then concatenated with the noisy latent $z_{t}$ along the channel dimension and fed into the U-Net denoiser network.

The numerical results of this variant are reported in the “Ours w/o R-Align” row of Table III. It can be seen that all the visual quality metrics deteriorate compared to the full model’s. Although the pixel-level metrics (PSNR, SSIM) do not change significantly, the feature-level metrics (LPIPS, FID), which are more in line with human perception, worsen by a large margin. The potential reason is that the diffusion model cannot extract meaningful features from the misaligned reference features, resulting in the loss of subject appearance details. Besides, the CSIM metric based on identity vectors might not be sensitive to subject appearance details; therefore, the CSIM decreases only slightly without reference feature alignment in TalkFormer. Moreover, the lip shape of the misaligned reference image might have a negative impact on the synthesized lip shape. Therefore, the SyncScore decreases without reference feature alignment. Furthermore, as can be seen in the “Ours w/o R-Align” row of Figure 4, in the absence of reference appearance feature alignment, the generated results lose some subject appearance details, resulting in some unrealistic content. Besides, the lip shapes are less accurate, influenced by the misaligned reference face image.

V Conclusion and Discussion

In this paper, we propose a novel landmark-based framework to learn the audio-visual relationship for person-generic talking face video generation. Our framework utilizes facial landmarks as intermediate representations to alleviate the ambiguity of audio-visual mapping, while enabling end-to-end optimization to minimize error accumulation resulting from the inaccuracies of pre-estimated facial landmarks. This accomplishment can be attributed to our innovative conditioning module TalkFormer, which aligns the synthesized talking motion with the motion represented by landmarks using a differentiable cross-attention layer. Besides, TalkFormer implicitly aligns the reference appearance features with the target motion based on semantic correlations, facilitating the preservation of more appearance details for the target subject. Extensive experiments verify the effectiveness of our method in producing high-fidelity and lip-synced talking face videos.

Ethical Discussion. Our method can generate a talking face video for any subject, requiring only a template video of the subject and a segment of audio, without the need for person-specific training. It might therefore be misused for illegal profit. To combat such malicious behaviors, we will watermark the generated results and are willing to contribute our synthetic videos to the deepfake detection community to enhance their detection algorithms.

References

  • [1] T. Xie, L. Liao, C. Bi, B. Tang, X. Yin, J. Yang, M. Wang, J. Yao, Y. Zhang, and Z. Ma, “Towards realistic visual dubbing with heterogeneous sources,” in Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 1739–1747.
  • [2] Z. Zhou, Z. Wang, S. Yao, Y. Yan, C. Yang, G. Zhai, J. Yan, and X. Yang, “Dialoguenerf: Towards realistic avatar face-to-face conversation video generation,” arXiv preprint arXiv:2203.07931, 2022.
  • [3] J. Thies, M. Elgharib, A. Tewari, C. Theobalt, and M. Nießner, “Neural voice puppetry: Audio-driven facial reenactment,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVI 16.   Springer, 2020, pp. 716–731.
  • [4] Y. Guo, K. Chen, S. Liang, Y.-J. Liu, H. Bao, and J. Zhang, “Ad-nerf: Audio driven neural radiance fields for talking head synthesis,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 5784–5794.
  • [5] R. Huang, P. Lai, Y. Qin, and G. Li, “Parametric implicit face representation for audio-driven facial reenactment,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 12 759–12 768.
  • [6] C. Du, Q. Chen, T. He, X. Tan, X. Chen, K. Yu, S. Zhao, and J. Bian, “Dae-talker: High fidelity speech-driven talking face generation with diffusion autoencoder,” in Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 4281–4289.
  • [7] Z. Ye, Z. Jiang, Y. Ren, J. Liu, J. He, and Z. Zhao, “Geneface: Generalized and high-fidelity audio-driven 3d talking face synthesis,” in The Eleventh International Conference on Learning Representations, 2022.
  • [8] K. Prajwal, R. Mukhopadhyay, V. P. Namboodiri, and C. Jawahar, “A lip sync expert is all you need for speech to lip generation in the wild,” in Proceedings of the 28th ACM international conference on multimedia, 2020, pp. 484–492.
  • [9] S. Shen, W. Zhao, Z. Meng, W. Li, Z. Zhu, J. Zhou, and J. Lu, “Difftalk: Crafting diffusion models for generalized audio-driven portraits animation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 1982–1991.
  • [10] D. Wang, Y. Deng, Z. Yin, H.-Y. Shum, and B. Wang, “Progressive disentangled representation learning for fine-grained controllable talking head synthesis,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 17 979–17 989.
  • [11] W. Zhong, C. Fang, Y. Cai, P. Wei, G. Zhao, L. Lin, and G. Li, “Identity-preserving talking face generation with landmark and appearance priors,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 9729–9738.
  • [12] Y. Zhou, X. Han, E. Shechtman, J. Echevarria, E. Kalogerakis, and D. Li, “Makelttalk: speaker-aware talking-head animation,” ACM Transactions on Graphics (TOG), vol. 39, no. 6, pp. 1–15, 2020.
  • [13] S. Gururani, A. Mallya, T.-C. Wang, R. Valle, and M.-Y. Liu, “Space: Speech-driven portrait animation with controllable expression,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 20 914–20 923.
  • [14] Z. Zhang, L. Li, Y. Ding, and C. Fan, “Flow-guided one-shot talking face generation with a high-resolution audio-visual dataset,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 3661–3670.
  • [15] S. Sinha, S. Biswas, R. Yadav, and B. Bhowmick, “Emotion-controllable generalized talking face generation.”
  • [16] V. Blanz and T. Vetter, “A morphable model for the synthesis of 3d faces,” in Proceedings of the 26th annual conference on Computer graphics and interactive techniques, 1999, pp. 187–194.
  • [17] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial networks,” Communications of the ACM, vol. 63, no. 11, pp. 139–144, 2020.
  • [18] S. Shen, W. Li, Z. Zhu, Y. Duan, J. Zhou, and J. Lu, “Learning dynamic facial radiance fields for few-shot talking head synthesis,” in European Conference on Computer Vision.   Springer, 2022, pp. 666–682.
  • [19] C. Zhang, Y. Zhao, Y. Huang, M. Zeng, S. Ni, M. Budagavi, and X. Guo, “Facial: Synthesizing dynamic talking face with implicit attribute learning,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 3867–3876.
  • [20] H. Zhou, Y. Sun, W. Wu, C. C. Loy, X. Wang, and Z. Liu, “Pose-controllable talking face generation by implicitly modularized audio-visual representation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 4176–4186.
  • [21] W. Zhang, X. Cun, X. Wang, Y. Zhang, X. Shen, Y. Guo, Y. Shan, and F. Wang, “Sadtalker: Learning realistic 3d motion coefficients for stylized audio-driven single image talking face animation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 8652–8661.
  • [22] T. He, J. Guo, R. Yu, Y. Wang, J. Zhu, K. An, L. Li, X. Tan, C. Wang, H. Hu, H. Wu, S. Zhao, and J. Bian, “GAIA: Data-driven zero-shot talking avatar generation,” in The Twelfth International Conference on Learning Representations, 2024.
  • [23] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.
  • [24] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli, “Deep unsupervised learning using nonequilibrium thermodynamics,” in International conference on machine learning.   PMLR, 2015, pp. 2256–2265.
  • [25] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in neural information processing systems, vol. 33, pp. 6840–6851, 2020.
  • [26] J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” in International Conference on Learning Representations, 2020.
  • [27] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10 684–10 695.
  • [28] G. Kim, H. Shim, H. Kim, Y. Choi, J. Kim, and E. Yang, “Diffusion video autoencoders: Toward temporally consistent face video editing via disentangled video encoding,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 6091–6100.
  • [29] X. Xing, C. Wang, H. Zhou, J. Zhang, Q. Yu, and D. Xu, “Diffsketcher: Text guided vector sketch synthesis through latent diffusion models,” arXiv preprint arXiv:2306.14685, 2023.
  • [30] Y. Zhao, T. Hou, Y.-C. Su, X. Jia, Y. Li, and M. Grundmann, “Towards authentic face restoration with iterative diffusion models and beyond,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 7312–7322.
  • [31] J. Z. Wu, Y. Ge, X. Wang, S. W. Lei, Y. Gu, Y. Shi, W. Hsu, Y. Shan, X. Qie, and M. Z. Shou, “Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 7623–7633.
  • [32] J. Seo, G. Lee, S. Cho, J. Lee, and S. Kim, “Midms: Matching interleaved diffusion models for exemplar-based image translation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 2, 2023, pp. 2191–2199.
  • [33] L. Zhang, A. Rao, and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 3836–3847.
  • [34] C. Mou, X. Wang, L. Xie, J. Zhang, Z. Qi, Y. Shan, and X. Qie, “T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models,” arXiv preprint arXiv:2302.08453, 2023.
  • [35] A. Hertz, R. Mokady, J. Tenenbaum, K. Aberman, Y. Pritch, and D. Cohen-or, “Prompt-to-prompt image editing with cross-attention control,” in The Eleventh International Conference on Learning Representations, 2022.
  • [36] B. Yang, S. Gu, B. Zhang, T. Zhang, X. Chen, X. Sun, D. Chen, and F. Wen, “Paint by example: Exemplar-based image editing with diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 18 381–18 391.
  • [37] A. K. Bhunia, S. Khan, H. Cholakkal, R. M. Anwer, J. Laaksonen, M. Shah, and F. S. Khan, “Person image synthesis via denoising diffusion model,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 5968–5976.
  • [38] F. Shen, H. Ye, J. Zhang, C. Wang, X. Han, and W. Yang, “Advancing pose-guided image synthesis with progressive conditional diffusion models,” arXiv preprint arXiv:2310.06313, 2023.
  • [39] Z. Yue and C. C. Loy, “Difface: Blind face restoration with diffused error contraction,” arXiv preprint arXiv:2212.06512, 2022.
  • [40] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning.   PMLR, 2021, pp. 8748–8763.
  • [41] M. Stypułkowski, K. Vougioukas, S. He, M. Zięba, S. Petridis, and M. Pantic, “Diffused heads: Diffusion models beat gans on talking-face generation,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 5091–5100.
  • [42] S. Mukhopadhyay, S. Suri, R. T. Gadde, and A. Shrivastava, “Diff2lip: Audio conditioned diffusion models for lip-synchronization,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 5292–5302.
  • [43] D. Bigioi, S. Basak, M. Stypułkowski, M. Zieba, H. Jordan, R. McDonnell, and P. Corcoran, “Speech driven video editing via an audio-conditioned diffusion model,” Image and Vision Computing, vol. 142, p. 104911, 2024.
  • [44] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18.   Springer, 2015, pp. 234–241.
  • [45] S. Wang, L. Li, Y. Ding, and X. Yu, “One-shot talking face generation from single-speaker audio-visual correlation learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 3, 2022, pp. 2531–2539.
  • [46] Y. Ma, S. Wang, Z. Hu, C. Fan, T. Lv, Y. Ding, Z. Deng, and X. Yu, “Styletalk: One-shot talking head generation with controllable speaking styles,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 2, 2023, pp. 1896–1904.
  • [47] S. Wang, Y. Ma, Y. Ding, Z. Hu, C. Fan, T. Lv, Z. Deng, and X. Yu, “Styletalk++: A unified framework for controlling the speaking styles of talking heads,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
  • [48] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
  • [49] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [50] A. Nagrani, J. S. Chung, and A. Zisserman, “Voxceleb: a large-scale speaker identification dataset,” in INTERSPEECH, 2017.
  • [51] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE transactions on image processing, vol. 13, no. 4, pp. 600–612, 2004.
  • [52] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 586–595.
  • [53] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,” Advances in neural information processing systems, vol. 30, 2017.
  • [54] R. Zhen, W. Song, Q. He, J. Cao, L. Shi, and J. Luo, “Human-computer interaction system: A survey of talking-head generation,” Electronics, vol. 12, no. 1, p. 218, 2023.
  • [55] J. Deng, J. Guo, N. Xue, and S. Zafeiriou, “Arcface: Additive angular margin loss for deep face recognition,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 4690–4699.
  • [56] J. S. Chung and A. Zisserman, “Out of time: automated lip sync in the wild,” in Asian conference on computer vision.   Springer, 2016, pp. 251–263.
  • [57] C. Lugaresi, J. Tang, H. Nash, C. McClanahan, E. Uboweja, M. Hays, F. Zhang, C.-L. Chang, M. G. Yong, J. Lee et al., “Mediapipe: A framework for building perception pipelines,” arXiv preprint arXiv:1906.08172, 2019.
  • [58] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “latent-diffusion,” https://github.com/CompVis/latent-diffusion, 2022.
  • [59] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.