Imitating Arbitrary Talking Style for Realistic Audio-Driven Talking Face Synthesis

Haozhe Wu¹, Jia Jia¹*, Haoyu Wang¹, Yishun Dou², Chao Duan², Qingshan Deng² ¹Department of Computer Science and Technology, Tsinghua University, Beijing 100084, ChinaBeijing National Research Center for Information Science and Technology (BNRist)The Institute for Artificial Intelligence, Tsinghua University²HiSilicon Company, Shenzhen, China [email protected], [email protected], [email protected], {douyishun,duanchao15,dengqingshan}@hisilicon.com

(2021)

Abstract.

People talk with diversified styles. For one piece of speech, different talking styles exhibit significant differences in the facial and head pose movements. For example, the ”excited” style usually talks with the mouth wide open, while the ”solemn” style is more standardized and seldomly exhibits exaggerated motions. Due to such huge differences between different styles, it is necessary to incorporate the talking style into audio-driven talking face synthesis framework. In this paper, we propose to inject style into the talking face synthesis framework through imitating arbitrary talking style of the particular reference video. Specifically, we systematically investigate talking styles with our collected Ted-HD dataset and construct style codes as several statistics of 3D morphable model (3DMM) parameters. Afterwards, we devise a latent-style-fusion (LSF) model to synthesize stylized talking faces by imitating talking styles from the style codes. We emphasize the following novel characteristics of our framework: (1) It doesn’t require any annotation of the style, the talking style is learned in an unsupervised manner from talking videos in the wild. (2) It can imitate arbitrary styles from arbitrary videos, and the style codes can also be interpolated to generate new styles. Extensive experiments demonstrate that the proposed framework has the ability to synthesize more natural and expressive talking styles compared with baseline methods.

photorealistic talking face, talking style, style imitation

^†^†journalyear: 2021^†^†copyright: acmcopyright^†^†conference: Proceedings of the 29th ACM International Conference on Multimedia; October 20–24, 2021; Virtual Event, China^†^†booktitle: Proceedings of the 29th ACM International Conference on Multimedia (MM ’21), October 20–24, 2021, Virtual Event, China^†^†price: 15.00^†^†doi: 10.1145/3474085.3475280^†^†isbn: 978-1-4503-8651-7/21/10^†^†submissionid: 609^†^†ccs: Computing methodologies Animation

1. Introduction

Refer to caption — Figure 1. Different talking styles have significant differences in facial and head pose movements when pronouncing the word ”significant”.

The talking face synthesis is an eagerly anticipated technique for several applications, such as filmmaking, teleconference, virtual/mix reality, and human-computer interaction. One of the key essences behind the talking face synthesis is the stylization of facial and head pose movements. Different from the talking emotion which reflects in the short-term facial motions, the talking style is a crucial factor which affects long-term facial and head pose movements. People usually talk with diversified talking styles such as ”excited”, ”solemn”, ”communicational”, ”storytelling”, et al. Given one piece of speech, different talking styles exhibit significant differences in the facial and head pose movements. For example, as shown in Figure 1, people with the ”excited” style usually talk aloud, and thus the facial movement of the mouth wide open occurs frequently. Meanwhile, the ”solemn” talking style usually appears in formal occasions, and thus the exaggerated motion seldom occurs. Considering such huge differences between different styles, in order to synthesize diversified and realistic talking faces with respect to one piece of speech, it is necessary to incorporate talking style into the audio-driven talking face synthesis framework.

Previous efforts (Cudeiro et al., 2019; Yi et al., 2020) have shown the rationality of synthesizing stylized talking faces. Yi et al. (Yi et al., 2020) proposed Memory Augmented GAN model to synthesize stylized talking face with the wild training data. Cudeiro et al. (Cudeiro et al., 2019) proposed Voice Operated Character Animation (VOCA) model to learn the identity-wise talking style. The VOCA model captures the talking style of each identity by injecting the one-hot identity vector into the audio-motion predicting network. Overall, these methods have the following two disadvantages which constrain the expressiveness of the synthesized talking styles: (1) These methods assume that each identity has only one talking style. However, in the real scenario, the talking style is only relatively stable inside one video clip, one person can talk with significantly different styles among different video clips. (2) These methods require substantial labour to collect adequate synchronized audio-visual data for each identity, which is unavailable for the wild scenarios.

To address the aforementioned problems, we propose to synthesize stylized talking faces by imitating styles from arbitrary clips of videos. With such motivation, the talking videos with stable and diversified talking styles are in need. Therefore, we collect the Ted-HD dataset which has 834 clips of videos with 60 identities. Each video clip has an average length of 23.5 seconds. The talking styles of the Ted-HD dataset are diversified, and the talking style inside each video clip is stable.

Based on the constructed dataset, we devise a two stage talking face synthesis framework as shown in Figure 2. The first stage imitates talking styles from arbitrary videos and synthesizes 3D talking faces according to the driven speech. Afterwards, in the second stage, we render the 3D face model photo-realistically from one static portrait of the speaker. Overall, the key idea of our framework is to construct style codes from video clips in the wild, and then synthesize talking faces by imitating the talking styles from the constructed style codes. Specifically, for the style codes construction, we conduct exhaustive observations on the Ted-HD dataset, which on one hand verify that even one identity has multiple talking styles, on the other hand find that the talking style is closely related to the variance of facial and head pose movements inside each video. With our observations, we define the style codes as several interpretable statistics of 3D morphable model (3DMM) (Blanz and Vetter, 1999) parameters. Having obtained the style codes of each talking video, we devise a latent-style-fusion (LSF) model to synthesize stylized 3D talking faces by imitating talking styles from the style codes. Detailedly, the LSF model first dropouts (Srivastava et al., 2014) information from the audio stream to prevent the audio from dominating the synthesis process. Further, the LSF model fuses the style codes with the latent audio representation frame-by-frame to synthesize 3D talking face with corresponding talking style. The overall implementation of the LSF model is simple but effective. Our model not only circumvents the annotation for talking styles and avoids collecting substantial training data for each identity, but also enables new talking style generation.

With our proposed framework, talking faces with diversified styles can thus be synthesized. We perform experiments on the Ted-HD dataset for evaluation. Compared with baseline methods, our framework synthesizes more expressive and diversified talking styles. We conduct extensive user studies to investigate the facial motion naturalness and audio-visual synchronization. With the mean opinion score (MOS) of 20 participants, our framework outperforms baseline methods by 0.67 on average in terms of facial motion naturalness and 0.11 on average in terms of audio-visual synchronization.

To summarize, our contributions are summarized as three-fold:

•

We propose to synthesize stylized talking faces by imitating talking styles of arbitrary videos. The incorporation of style imitation leads to more diversified talking styles.
•

We formalize the style codes of each talking video and devise the latent-style-fusion (LSF) model to synthesize stylized 3D talking faces from style codes and driven audio. Our framework do not require any annotation of talking styles for stylized talking face synthesis.
•

We collect Ted-HD dataset, which contains 834 clips of wild talking videos with stable and diversified talking styles. Based on the Ted-HD dataset, we conduct extensive style observations and synthesize expressive talking styles. Code and dataset are publicly available ¹¹1https://github.com/wuhaozhe/style_avatar.

2. Related Work

Talking face synthesis has received significant attention in previous literatures. Related work in this area can be grouped into two categories: the unimodal talking face synthesis (Thies et al., 2020; Zhou et al., 2019; Chen et al., 2018; Wiles et al., 2018; Prajwal et al., 2020; Jamaludin et al., 2019; Zhou et al., 2021) and the multimodal talking face synthesis (Wang et al., 2020a; Emre Eskimez et al., 2020; Cudeiro et al., 2019; Yi et al., 2020; Zeng et al., 2020; Ji et al., 2021). For one piece of driven speech, the unimodal talking face synthesis generates unique motion, while the multimodal talking face synthesis generates diversified facial and head pose movements.

Most of the prior works focused on the unimodal talking face synthesis. Karras et al. (Karras et al., 2017) proposed to synthesize 3D talking face with driven audio and emotion state. Suwajanakorn et al. (Suwajanakorn et al., 2017) synthesized high quality videos of talking Obama through hours of Obama’s weekly address footage. Since that Suwajanakorn’s method requires hours of data for each identity, several methods (Thies et al., 2020; Zhou et al., 2020; Zhou et al., 2019; Prajwal et al., 2020; Chen et al., 2019, 2018; Yu et al., 2020; Chen et al., 2020) proposed to simultaneously reduce the required duration of training data and guarantee the photorealism of the synthesized video. The ATVG framework proposed by Chen et al. (Chen et al., 2019) and the DAVS framework proposed by Zhou et al. (Zhou et al., 2019) synthesize talking face with only one image. Despite that these unimodal talking face synthesis methods can synthesize photo-realistic videos, the lack of style results in the stiffness of the synthesized results.

To synthesize diversified facial and head pose movements, some recent literatures addressed the multimodal talking face synthesis. Wang et al. (Wang et al., 2020a) and Eskimez et al. (Emre Eskimez et al., 2020) achieved multimodal synthesis through incorporating emotion condition vector, which enables the generation of diversified facial expressions. However, the synthesized results of these methods still lack in personality because of the overlook of talking style. To address this issue, some methods (Yi et al., 2020; Cudeiro et al., 2019) proposed to incorporate talking style into the synthesis framework. Yi et al. (Yi et al., 2020) proposed Memory-Augmented GAN model to synthesize stylized talking face with wild training data. However, only the head pose is synthesized with multiple styles, while the facial movements still lack personality. Further, Cudeiro et al. (Cudeiro et al., 2019) proposed the Voice Operated Character Animation (VOCA) model to learn the identity-wise talking style. The VOCA model injects one-hot identity vector into the audio-motion predicting network, leading to discriminative styles of facial and head pose movements. However, the VOCA model on one hand requires substantial data for each identity, on the other hand forces one identity to have only one talking style, limiting its capacity on synthesizing diversified styles. To resolve this problem, in this work, we propose to imitate talking styles from arbitrary wild talking videos.

3. Problem Formulation

In this paper, we propose a two-stage talking face synthesis framework which synthesizes stylized talking videos with the following three inputs: one static portrait of the speaker, the driven audio and the style reference video. We formalize the first stage of the framework as the 3D talking face synthesis stage and the second stage as the photorealistic render stage. Between two stages, we apply the 3DMM face model (Blanz and Vetter, 1999) as a crucial bridge. Therefore, before formally defining the two stages, we firstly give a brief introduction of the face model we use.

We leverage the 3DMM face model to represent each 3D face. With 3DMM, the face shape $\mathbf{S}$ is represented as an affine model of facial expression and facial identity:

(1)

\mathbf{S}=\mathbf{S(\alpha,\beta)}=\mathbf{\bar{S}}+\mathbf{B}_{id}\mathbf{\alpha}+\mathbf{B}_{exp}\mathbf{\beta},

where $\mathbf{\bar{S}}\in\mathbf{R}^{N\times 3}$ is the average face shape; N is the number of the vertexes in the face model; $\mathbf{B}_{id}$ and $\mathbf{B}_{exp}$ are the PCA bases of identity and expression; $\alpha$ and $\beta$ are the identity parameters and the expression parameters. Following deng et al. (Deng et al., 2019), we adopt the 2009 Basel Face Model (Paysan et al., 2009) for $\mathbf{\bar{S}}$ , $\mathbf{B}_{id}$ , and use expression bases $\mathbf{B}_{exp}$ of Guo et al. (Guo et al., 2018) built from Facewarehouse (Cao et al., 2013), resulting in $\alpha\in\mathbb{R}^{80}$ , $\beta\in\mathbb{R}^{64}$ . Afterwards, the 3D face shape is projected on the 2D plane according to the head pose and translation $p\in\mathbb{R}^{7}$ , where 4 elements represent the pose quaternion and 3 elements represent the translation. Overall, the parameter set $(\alpha,\beta,p)$ controls the appearance of each face. In our framework, the facial movements are the time series of parameter $\beta$ , which we denote as $\beta(t)$ , while the head pose movements are the time series of parameter $p$ , which we denote as $p(t)$ . With $\beta(t)$ and $p(t)$ , we formalize the two stages of our framework as follows.

3D Talking Face Synthesis Stage. In this stage, given driven audio $\mathbf{X}_{a}$ , the facial and head pose movements of style reference video $\beta_{sty}(t),p_{sty}(t)$ , we aim to generate corresponding facial and head pose movements $\beta_{pred}(t)$ , $p_{pred}(t)$ .

Photorealistic Render Stage. In this stage, given the predicted movements $\beta_{pred}(t)$ , $p_{pred}(t)$ and the input portrait $\mathbf{X}_{p}$ , we aim to generate photorealistic videos $\mathbf{Y}$ .

4. Talking Style Observation

In this section, we systematically investigate how different talking styles reflect in the facial and head pose movements $\beta(t),p(t)$ . Afterwards, we formally define interpretable style codes for each video based on our observation.

In order to observe talking styles of each video, we should firstly collect a suitable dataset for observation. The video for the style observation requires the following characteristics: (1) It should be of high resolution, (2) it should contain natural and expressive facial and head pose movements, (3) each clip of video cannot be too short, or the talking style can hardly be observed, (4) the talking style should both be stable inside the clip and be diversified across different clips, (5) the camera pose and location should be static inside each clip, otherwise the head pose parameters will be influenced by the camera movements. Considering the characteristics above, current publicly available wild datasets like VoxCeleb2 (Chung et al., 2018) and LRS3 (Afouras et al., 2018) are much too noisy and thus do not meet the requirements. Meanwhile, several in-lab datasets like MEAD (Wang et al., 2020b) and GRID (Cooke et al., 2006) do not have natural facial expressions and head poses, which dissatisfy the requirements either.

To address this issue, we manually collect a wild dataset Ted-HD which is suitable for style observation and further style synthesis. The Ted-HD dataset selects several speech videos from Ted website. Each video in the dataset has one person giving a speech, which focuses on the facial part of each person and is of high resolution. We cut each video into several clips according to the scene change. Totally, Ted-HD dataset contains 834 clips of videos with 60 identities. The average length of each clip is 23.5 seconds, and the total duration of the dataset is 6 hours. The talking style of these videos are diversified across different clips. Even for the same identity, there could be different talking styles.

Having obtained the dataset, for each video, we reconstruct the facial and head pose movements $\beta(t),p(t)$ . We conduct exhaustive data observations on the correlation between talking style and $\beta(t),p(t)$ , with the aim of answering the following questions:

•

Q1: Whether one identity has multiple talking styles.
•

Q2: How the talking style is reflected in time series $\beta(t),p(t)$ .

To answer the Q1, we verify the talking style diversity of each identity through user study with A/B test. Specifically: we first randomly construct 100 triplets, each triplet contains two talking videos $v_{1},v_{2}$ from the same identity and one talking video $v_{3}$ from the other identity. Next, we reconstruct $\beta(t),p(t)$ of $v_{1},v_{2},v_{3}$ , retarget $\beta(t),p(t)$ to the same identity and render the retargeted faces to videos. The retargeted videos are signified as $v_{1}^{{}^{\prime}},v_{2}^{{}^{\prime}},v_{3}^{{}^{\prime}}$ . Afterwards, we show $v_{1}^{{}^{\prime}},v_{2}^{{}^{\prime}},v_{3}^{{}^{\prime}}$ and their transcripts to users, asking the following question: which pair of $(v_{1}^{{}^{\prime}},v_{2}^{{}^{\prime}})$ and $(v_{1}^{{}^{\prime}},v_{3}^{{}^{\prime}})$ has more similar talking styles. Statistics demonstrate that among 100 triplets, $(v_{1}^{{}^{\prime}},v_{3}^{{}^{\prime}})$ is more similar in 30 triplets, while $(v_{1}^{{}^{\prime}},v_{2}^{{}^{\prime}})$ is more similar in 70 triplets. With the statistics that $30\%$ of videos inside each identity have dissimilar talking styles, we conclude that one identity has multiple talking styles.

Since that the talking style is not consistent for each identity, formulating style codes is necessary. Therefore, in Q2, we conduct experiments to find how talking style is reflected in time series $\beta(t),p(t)$ . Specifically, we first randomly select 300 talking videos and annotate the talking style of these videos to three categories: tedious, solemn, and excited. Afterwards, for the motion series $\beta(t),p(t)$ of each video, we calculate its derivative series with respect to time $t$ : $(\frac{\partial\beta(t)}{\partial t},\frac{\partial p(t)}{\partial t})$ . Next, we calculate the mean value $\mu(\cdot)$ and the standard deviation $\sigma(\cdot)$ of $(\beta(t),p(t),\frac{\partial\beta(t)}{\partial t},\frac{\partial p(t)}{\partial t})$ along time, yielding 8 feature vectors (4 for mean, 4 for standard deviation). To observe the relationship between these feature vectors and talking styles, we leverage t-SNE algorithm (Van der Maaten and Hinton, 2008) to visualize each feature vector and plot points from different style categories with different colors, as shown in Figure 3. Figure 3 demonstrates that the talking style is closely related with $\sigma(\beta(t))$ , $\sigma(\frac{\partial\beta(t)}{\partial t})$ , $\sigma(\frac{\partial p(t)}{\partial t})$ , especially $\sigma(\frac{\partial p(t)}{\partial t})$ . Meanwhile the talking style is less related with $\mu(\cdot)$ , which denotes that the talking style is mostly reflected in the fluctuation of movements, rather than the idle state of movements.

Based on such observation, we define the style codes as the standard deviation of facial and head pose movements. Formally, given arbitrary video with the reconstructed parameter series $\beta(t),p(t)$ , the style codes $\mathbf{sty}$ are defined as:

(2)

\mathbf{sty}=\sigma(\beta(t))\oplus\sigma(\frac{\partial\beta(t)}{\partial t})\oplus\sigma(\frac{\partial p(t)}{\partial t}),

where $\oplus$ signifies the vector concatenation.

To summarize, we make the following two conclusions: (1) one identity have multiple talking styles, (2) the talking style is closely related to the variance of facial and head pose movements inside each video, following which we define the style codes in Equation 2. The style codes are utilized to synthesize diversified talking styles, details will be illustrated in the section 5.

5. Methodology

Following the style codes defined in section 4, we propose a two-stage talking face synthesis framework to imitate arbitrary talking styles, as shown in Figure 2. Our framework synthesizes stylized talking videos with the following three inputs: one static portrait of the speaker, the driven audio and the style reference video. In the first stage of our framework, we devise a latent-style-fusion (LSF) model to synthesize stylized 3D talking faces through imitating arbitrary talking styles. Based on the synthesized 3D talking faces, in the second stage, we leverage the deferred neural render (Thies et al., 2019) and few-shot neural texture generation model to generate video frames photo-realistically. In the next two subsections, we will introduce the two stages respectively.

5.1. Stylized 3D Talking Face Synthesis

In the first stage of our framework, we propose the latent-style-fusion (LSF) model for talking face synthesis. Overall, the input of the LSF model is the driven audio and the reference talking video for style imitation. The LSF model learns motion related information from audio, and then combines latent audio representation with style information to synthesize 3D talking meshes with target talking style. Details will be illustrated as follows.

For the driven input audio $\mathbf{X}_{a}$ of $T$ seconds, we firstly leverage the DeepSpeech (Hannun et al., 2014) model to extract speech features. Deepspeech is a deep neural model for automatic speech recognition (ASR). The extracted features from DeepSpeech not only contains rich speech information but also is robust to background noise and generalizes well to different identities. Feeding the input audio $\mathbf{X}_{a}$ to the DeepSpeech model, yields latent representations $\mathbf{X}_{d}\in\mathbb{R}^{50T\times D_{a}}$ , where $D_{a}$ is the dimension of the DeepSpeech features, and $50T$ denotes that the DeepSpeech features have 50 frames per second. Afterwards, for the reference talking video, we calculate its style codes $\mathbf{sty}\in\mathbb{R}^{D_{s}}$ as illustrated in Section 4 for style imitation, where $D_{s}$ is the dimension of the style codes.

Having obtained the $\mathbf{X}_{d}$ and $\mathbf{sty}$ , we now elaborate the 3D talking face synthesis process. We devise a latent-style-fusion (LSF) model, which takes $\mathbf{X}_{d}\in\mathbb{R}^{50T\times D_{a}}$ and $\mathbf{sty}\in\mathbb{R}^{D_{s}}$ as input, outputs facial movements $\beta_{pred}(t)\in\mathbb{R}^{25T\times 64}$ and head pose movements $p_{pred}(t)\in\mathbb{R}^{25T\times 7}$ with 25 frames per second. Based on $\beta_{pred}(t)$ and $p_{pred}(t)$ , we reconstruct the 3D talking meshes with the 3DMM face model (Blanz and Vetter, 1999).

The LSF model leverages a latent fusion mechanism to both synthesize stylized faces and guarantee the synchronization between audio and motion. Specifically, as shown in Figure 2, the LSF model firstly takes audio $\mathbf{X}_{d}$ as input and encodes $\mathbf{X}_{d}$ with the bottom part of ResNet-50 (He et al., 2016), yielding latent audio representation $\mathbf{X}_{l}$ . Afterwards, the LSF model fuses the latent audio representation $\mathbf{X}_{l}$ and style codes $\mathbf{sty}$ to acquire mixed representation for synthesis. During the fusion process, the LSF model first dropouts the latent audio representation $\mathbf{X}_{l}$ to obtain $\mathbf{X}_{l}^{{}^{\prime}}$ , while the information from $\mathbf{sty}$ remains unchanged. Next, the LSF model concats each frame of $\mathbf{X}_{l}^{{}^{\prime}}$ with style codes $\mathbf{sty}$ , leading to the mixed representation. Further, the LSF model leverages the top part of ResNet-50 to predict facial movements $\beta_{pred}(t)$ and head pose movements $p_{pred}(t)$ from the mixed representation. It’s noteworthy to stress that the fusion between latent audio representation and style codes enables the synthesis of more expressive talking styles. Meanwhile, the dropout of audio information prevents from discarding style information for synthesis. The overall implementation of the LSF model is simple but effective.

For the training stage of the LSF model, we adopt the parameter series $\beta(t),p(t)$ reconstructed from the 3D face reconstruction algorithm (Deng et al., 2019) as ground truth. For each training video, we calculate its style codes $\mathbf{sty}$ and randomly clip the the input audio $\mathbf{X}_{d}$ and the ground truth $\beta(t),p(t)$ to fixed length. Afterwards, we feed $\mathbf{X}_{d}$ and corresponding $\mathbf{sty}$ to the LSF model, yielding the predicted expression parameter series $\beta_{pred}(t)$ and $p_{pred}(t)$ . Based on the predicted $\beta_{pred}(t)$ and $p_{pred}(t)$ , we apply the $\mathrm{L_{1}}$ loss as follows:

(3)

\mathcal{L}_{\mathrm{L_{1}}}=||\beta(t)-\beta_{pred}(t)||_{1}+||p(t)-p_{pred}(t)||_{1}.

It is worth emphasizing that the training of the LSF model doesn’t require any additional annotation on the speaking identity. Only by training on wild videos with stable talking styles, we obtain expressive style space.

During the inference stage, feeding style codes of arbitrary talking videos to LSF model not only yields desired talking styles but also maintains the synchronization between the driven audio and talking faces. Meanwhile, we can interpolate among different talking styles to acquire new talking styles. For the audio representation $\mathbf{X}_{d}$ with arbitrary duration, since that the trained LSF model only digests the audio with fixed length, we apply the slide window strategy to synthesize the corresponding facial movements $\beta_{pred}(t)$ and head pose movements $p_{pred}(t)$ .

So far, we have obtained the untextured talking 3D faces. In the next subsection, we’ll introduce how we render these 3D faces photo-realistically.

5.2. Photorealistic Render

Conventional deferred neural render (Thies et al., 2019) requires substantial training data for each identity. In order to both synthesize photorealistic results and guarantee the few-shot capacity, we devise a few-shot neural texture generation model and combine the generated neural texture with the deferred neural render, which enables synthesizing photorealistic videos with only one source portrait. As shown in Figure 2, the deferred neural render incorporates the generated neural texture, conducts UV texture sampling on the neural texture, and translates the sampled image to photorealistic domain.

Algorithm 1 UV Texture Sampling

\mathbf{X}_{uv}\in\mathbb{R}^{2\times H\times W}

\mathbf{Y}_{t}\in\mathbb{R}^{D_{t}\times H_{t}\times W_{t}}

\mathbf{X}_{s}\in\mathbb{R}^{D_{t}\times H\times W}

4:for

i\leftarrow 1

H

5: for

j\leftarrow 1

W

u=\mathbf{X}_{uv}[0,i,j]

v=\mathbf{X}_{uv}[1,i,j]

\mathbf{X}_{s}[:,i,j]

= BilinearSample

(\mathbf{Y}_{t},u,v)

9:return

\mathbf{X}_{s}

Detailedly, for the input 3D talking faces, we firstly leverage the UVAtlas tool ²²2https://github.com/microsoft/UVAtlas to obtain the UV coordinate of each vertex in the 3D model. Afterwards, we rasterize the 3D face model to 2D image $\mathbf{X}_{uv}\in\mathbb{R}^{2\times H\times W}$ , of which each pixel represents the UV coordinate. Subsequently, for the input portrait $\mathbf{X}_{p}\in\mathbb{R}^{3\times H\times W}$ and the 3D face model, we extract the RGB texture $\mathbf{X}_{t}\in\mathbb{R}^{3\times H_{t}\times W_{t}}$ , where $H_{t},W_{t}$ signifies the height and width of the texture. Based on $\mathbf{X}_{t}$ , we leverage the pix2pix (Isola et al., 2017) model to generate neural texture $\mathbf{Y}_{t}\in\mathbb{R}^{D_{t}\times H_{t}\times W_{t}}$ , where $D_{t}$ signifies the dimension of the neural texture. With the neural texture, we conduct UV texture sampling on $\mathbf{X}_{uv}$ to obtain the sampled image $\mathbf{X}_{s}\in\mathbb{R}^{D_{t}\times H\times W}$ , the details of the sampling algorithm are illustrated in the Algorithm 1. Finally, we translate the sampled image $\mathbf{X}_{s}$ to photorealistic image through the pix2pixHD (Wang et al., 2018) model.

During the training stage, the few-shot texture generation model and the deferred neural render model are trained simultaneously. Given the rasterized input $\mathbf{X}_{uv}$ , we denote the rendered image as $\mathbf{Y}^{{}^{\prime}}$ and the ground truth image as $\mathbf{Y}$ . We combine the perceptual loss (Johnson et al., 2016) and $\mathrm{L_{1}}$ loss together as $\mathcal{L}$ to optimize the neural texture and pix2pixHD model. Formally:

(4)

\mathcal{L}=||\mathbf{Y}-\mathbf{Y}^{{}^{\prime}}||_{1}+||\phi(\mathbf{Y})-\phi(\mathbf{Y}^{{}^{\prime}})||_{1},

where $\phi(\cdot)$ is the first few layers of VGGNet (Simonyan and Zisserman, 2015) pretrained on the ImageNet (Deng et al., 2009). Due to the limitation of the 3DMM face model, in our render, we only synthesize the facial part of the image, without considering the hair and background rendering.

6. Experiments

In this section, we conduct extensive experiments to demonstrate the effectiveness of our framework. We evaluate our framework on the collected Ted-HD dataset. Our method has acquired better synthesis results both qualitatively and quantitatively.

6.1. Dataset

As illustrated in Section 4, the currently available datasets are either collected in lab which have constrained talking styles or collected in the wild of which the talking styles are unstable and noisy. Therefore for the training and testing of the LSF model, we leverage the Ted-HD dataset described in Section 4. Totally, the Ted-HD dataset has 834 clips of videos, we select 799 clips for training, and hold out the remained 35 for testing. The training set and test set have no overlap on identities. Additionally, for the training of the deferred neural render and the few-shot neural texture generation model, we leverage the Lip Reading in the Wild (LRW) (Chung and Zisserman, 2016) dataset.

6.2. Implementation Details

During the training of the LSF model, the input DeepSpeech audio features have 50 frames per second (FPS), while each frame has the dimension $D_{a}$ of 29. The input style codes $\mathbf{sty}$ have the dimension $D_{s}$ of 135 (64 for $\sigma(\beta(t))$ , 64 for $\sigma(\frac{\partial\beta(t)}{\partial t})$ , 7 for $\sigma(\frac{\partial p(t)}{\partial t})$ ). The predicted facial movements $\beta_{pred}(t)$ and the head pose movements $p_{pred}(t)$ have 25 frames per second. For the convenience of training, we randomly clip the input to 80 frames, and clip the output to 32 frames. For the implementation of the LSF model, we apply the ResNet-50 as the backbone. We leverage the first 16 layers of ResNet-50 to encode the DeepSpeech features, combine the encoded features with style codes, and leverage the last 34 layers of ResNet-50 to predict the motion series. When optimizing, we adopt the Adam optimizer (Kingma and Ba, 2015) to train the LSF model with the initial learning rate of $5\times 10^{-4}$ . We train for 50000 iterations with a mini-batch size of 128 samples.

During the training of the deferred neural render and the few-shot neural texture generation model, the input UV image $\mathbf{X}_{uv}$ has a size of $2\times 224\times 224$ , while the neural texture has a size of $16\times 64\times 64$ . The dimension $D_{t}$ of the texture is set to 16, which enables each pixel to contain richer texture information. Meanwhile, the texture size is set to be smaller than the UV image size, which avoids oversampling in the sample process. The output image $\mathbf{X}_{s}$ is conventional RGB image with size of $3\times 224\times 224$ . When optimizing, we adopt the Adam optimizer to simultaneously train the neural render and the texture generation model. The learning rate is set to $2\times 10^{-4}$ . We train for 1000000 iterations with mini-batch size of 6 samples.

6.3. Comparison with VOCA on Style Synthesis

To the best of our knowledge, the VOCA model (Cudeiro et al., 2019) is the only available method that captures diversified talking styles of facial movements. Therefore, in this section, we systematically compare our method with the VOCA model (Cudeiro et al., 2019) to demonstrate the effectiveness of the LSF model. Different from our method, the VOCA model learns identity-level talking styles. Specifically, the VOCA model injects a one-hot identity code to the time convolution network, and directly predicts face model vertexes from the DeepSpeech features and the identity code. By adjusting the one-hot identity code, the VOCA model outputs different talking styles.

Table 1. The mean opinion scores (MOS) of different metrics, higher signifies better. TSE denotes talking style expressiveness, FMN denotes facial movement naturalness, and AVS denotes audio-visual synchronization.

	TSE	FMN	AVS
VOCA (Cudeiro et al., 2019)	3.28	3.13	3.56
LSF	3.41	3.21	3.64

We compare the style space learned from the VOCA model and the style space learned from the LSF model with extensive user studies. Specifically, we randomly select 10 clips of driven audio, each with a duration from 10 to 20 seconds. Afterwards, we randomly sample 5 talking styles from the VOCA style space and 5 talking styles from the style space of our method. With the sampled talking styles and the driven audio, we synthesize corresponding talking faces and retarget the synthesized faces to the same identity. Afterwards, we subsume videos with the same driven audio and synthesis model to the same group. For each group of video, we invite 20 participants to rate (1) the expressiveness of the talking style, (2) the naturalness of facial movements, (3) the audio-visual synchronization between driven audio and talking face. We ask participants to rate the mean opinion score (MOS) (Recommendation, 2006) in the range 1-5 (higher MOS score denotes better results). When showing videos to participants, since that the VOCA model only gives untextured 3D faces, we also just give the 3D talking faces synthesized from the LSF model, without utilizing the deferred neural render to generate photorealistic results for fair comparison.

Table 1 demonstrates the results of user studies. Through the experimental results, we observe that our LSF model has acquired higher talking style expressiveness, which verifies that the style space learned by taking style imitation is more expressive than the identity-level style space. Meanwhile, our LSF model outperforms the VOCA model by 0.08 in terms of the facial movement naturalness and audio-visual synchronization. Such results confirm the effectiveness of style imitation in LSF model further. Additionally, compared with the VOCA model which requires substantial training data for each identity, our method is trained on wild dataset which does not require any annotation on identity or talking style.

6.4. Study on the Style Space

In this section, we extensively investigate the style space learned in our method. We conduct two qualitative experiments, which verify that not only the synthesized talking styles are diversified, but new talking styles can also be generated from the interpolation of different talking styles.

To verify the diversification of the talking styles, we visualize the distance between lower and upper lip as a function of time. Specifically, we randomly select one piece of driven audio and 10 different talking styles, and then synthesize 10 facial movements corresponding to each talking style and driven audio. Afterwards, we calculate the lip distance as shown in Figure 5. Through Figure 5 we observe that the lip distance significantly varies among different talking styles, which verifies that the LSF model is able to synthesize diversified talking styles. Meanwhile, different talking styles demonstrate similar trends of fluctuation, the peak and the valley of the distance curve highly overlap, which confirms that the LSF model also guarantees the synchronization between audio and synthesized motions.

To confirm that the style space in the LSF model is expressive and interpolative, we visualize the interpolation results of different talking styles, as shown in Figure 4. Detailedly, we select two representative talking styles: excited and solemn, and conduct linear interpolation between the two styles to generate new talking styles. From each row of Figure 4, we observe that the facial and head pose movements translate smoothly from excited talking style to solemn style. For the excited talking style, the lip motion is exaggerated and the head frequently shakes. Meanwhile, for the solemn talking style, the lip motion and head pose are stable. We provide more synthesis results in the supplementary materials.

6.5. Comparison with One-Shot Synthesis

In this section, we conduct experiments to demonstrate that our method synthesizes more natural and expressive talking faces compared with several baseline methods. Specifically, we compare our method with the following baseline methods: (1) the ATVG framework (Chen et al., 2019), (2) the MakeItTalk framework (Zhou et al., 2020), (3) the Wav2Lip framework (Prajwal et al., 2020). For Wav2Lip framework which requires few seconds of videos as input, we repeat input portrait as videos for fair comparison. Figure 6 shows some synthesis results We observe that our method has more expressive facial movements and head pose movements while also guarantees the synthesized results to be photorealistic.

Additionally, we conduct both user studies and quantitative evaluations on the Ted-HD dataset to verify the effectiveness of our method. Specifically, for the user studies, we firstly synthesize videos with the randomly selected 20 clips of driven audios and 5 different identities. Afterwards, for each video, we invite 20 participants to rate (1) the naturalness of facial movements, (2) the audio-visual synchronization between driven audio and talking face. The mean opinion score (MOS) is rated in the range 1-5. Besides, we also evaluate the synthesized video quality with the signal-noise-ratio (SNR) metric. We do not leverage the PSNR metric since that the ground truth talking videos with arbitrary talking styles are not available.

Table 2. Comparison with baseline methods on the Ted-HD dataset, where FMN denotes facial movement naturalness, and AVS denotes audio-visual synchronization. The Pre-Fusion method removes latent style fusion in the LSF model, details are illustrated in Section 6.6

	FMN	AVS	SNR (dB)
ATVG (Chen et al., 2019)	2.71	3.51	2.98
MakeItTalk (Zhou et al., 2020)	3.08	3.13	3.01
Wav2Lip (Prajwal et al., 2020)	2.97	4.28	2.78
Pre-Fusion	2.19	2.25	5.70
Ours	3.59	3.75	5.76

Table 2 shows the comparison results. From Table 2 we observe that our method has achieved the most expressive facial motion and best video quality. We also notice that the Wav2Lip method achieves unsatisfying motion naturalness since that it cannot resolve the one-shot synthesis scenario. Meanwhile, we observe that the AVS of our method is slightly lower than Wav2Lip, that is because the talking style synthesis in our LSF model mildly sacrifices the performance of audio-visual synchronization, but our method still has better AVS performance compared with ATVG and MakeItTalk.

6.6. Effectiveness of Latent Style Fusion

To verify the rationality of the latent style fusion mechanism in LSF model, we conduct the following ablation experiments. For comparison, we remove the latent style fusion mechanism, and directly concatenate DeepSpeech representation with style codes as the input of ResNet-50. Afterwards, we compare the synthesized results with LSF model through similar user studies as Section 6.5 does. The experimental results are shown in the last two rows of Table 2. From the results we observe that the motion naturalness and audio-visual synchronization would degrade significantly once we remove the latent style fusion mechanism, which verifies the effectiveness of the latent style fusion. Meanwhile, the video quality remains constant since that the motion synthesis does not influence the photorealistic render stage.

7. Conclusion

In this paper, we propose the concept of style imitation for audio-driven talking face synthesis. To imitate arbitrary talking styles, we firstly formulate the style codes of each talking video as several interpretable statistics of 3DMM parameters. Afterwards, we devise a latent-style-fusion (LSF) model to synthesize stylized talking faces according to the style codes and driven audio. The incorporation of style imitation not only circumvents the annotation for talking style during the training phase, but also endows the capacity of arbitrary style synthesis and new talking style generation. Additionally, to synthesize expressive talking styles, we collect Ted-HD dataset with 834 clips of talking videos, which contains stable and diversified talking styles. We conduct extensive experiments on the constructed dataset and obtain expressive synthesis results with our Ted-HD dataset and LSF model. The constructed Ted-HD dataset will be made publicly available in the future. We hope that the proposal of talking style imitation and the construction of Ted-HD dataset pave a new way for audio-driven talking face synthesis.

8. Acknowledgments

This work is supported by the National Key R&D Program of China under Grant No. 2020AAA0108600, the state key program of the National Natural Science Foundation of China (NSFC) (No.61831022), Beijing Academy of Artificial Intelligence No. BAAI2019QN0302.

References

(1)
Afouras et al. (2018) T. Afouras, J. S. Chung, and A. Zisserman. 2018. LRS3-TED: a large-scale dataset for visual speech recognition. In arXiv preprint arXiv:1809.00496.
Blanz and Vetter (1999) Volker Blanz and Thomas Vetter. 1999. A morphable model for the synthesis of 3D faces. In Proceedings of the 26th annual conference on Computer graphics and interactive techniques. 187–194.
Cao et al. (2013) Chen Cao, Yanlin Weng, Shun Zhou, Yiying Tong, and Kun Zhou. 2013. Facewarehouse: A 3d facial expression database for visual computing. IEEE Transactions on Visualization and Computer Graphics 20, 3 (2013), 413–425.
Chen et al. (2020) Lele Chen, Guofeng Cui, Celong Liu, Zhong Li, Ziyi Kou, Yi Xu, and Chenliang Xu. 2020. Talking-head Generation with Rhythmic Head Motion. In European Conference on Computer Vision. Springer, 35–51.
Chen et al. (2018) Lele Chen, Zhiheng Li, Ross K Maddox, Zhiyao Duan, and Chenliang Xu. 2018. Lip movements generation at a glance. In Proceedings of the European Conference on Computer Vision (ECCV). 520–535.
Chen et al. (2019) Lele Chen, Ross K Maddox, Zhiyao Duan, and Chenliang Xu. 2019. Hierarchical cross-modal talking face generation with dynamic pixel-wise loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7832–7841.
Chung et al. (2018) J. S. Chung, A. Nagrani, and A. Zisserman. 2018. VoxCeleb2: Deep Speaker Recognition. In INTERSPEECH.
Chung and Zisserman (2016) J. S. Chung and A. Zisserman. 2016. Lip Reading in the Wild. In Asian Conference on Computer Vision.
Cooke et al. (2006) Martin Cooke, Jon Barker, Stuart Cunningham, and Xu Shao. 2006. An audio-visual corpus for speech perception and automatic speech recognition. The Journal of the Acoustical Society of America 120, 5 (2006), 2421–2424.
Cudeiro et al. (2019) Daniel Cudeiro, Timo Bolkart, Cassidy Laidlaw, Anurag Ranjan, and Michael J Black. 2019. Capture, learning, and synthesis of 3D speaking styles. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10101–10111.
Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition. Ieee, 248–255.
Deng et al. (2019) Yu Deng, Jiaolong Yang, Sicheng Xu, Dong Chen, Yunde Jia, and Xin Tong. 2019. Accurate 3d face reconstruction with weakly-supervised learning: From single image to image set. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 0–0.
Emre Eskimez et al. (2020) Sefik Emre Eskimez, You Zhang, and Zhiyao Duan. 2020. Speech Driven Talking Face Generation from a Single Image and an Emotion Condition. arXiv e-prints (2020), arXiv–2008.
Guo et al. (2018) Yudong Guo, Jianfei Cai, Boyi Jiang, Jianmin Zheng, et al. 2018. Cnn-based real-time dense face reconstruction with inverse-rendered photo-realistic face images. IEEE transactions on pattern analysis and machine intelligence 41, 6 (2018), 1294–1307.
Hannun et al. (2014) Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, et al. 2014. Deep speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567 (2014).
He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
Isola et al. (2017) Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. 2017. Image-to-Image Translation with Conditional Adversarial Networks. CVPR (2017).
Jamaludin et al. (2019) Amir Jamaludin, Joon Son Chung, and Andrew Zisserman. 2019. You said that?: Synthesising talking faces from audio. International Journal of Computer Vision 127, 11 (2019), 1767–1779.
Ji et al. (2021) Xinya Ji, Hang Zhou, Kaisiyuan Wang, Wayne Wu, Chen Change Loy, Xun Cao, and Feng Xu. 2021. Audio-driven emotional video portraits. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14080–14089.
Johnson et al. (2016) Justin Johnson, Alexandre Alahi, and Li Fei-Fei. 2016. Perceptual losses for real-time style transfer and super-resolution. In European conference on computer vision. Springer, 694–711.
Karras et al. (2017) Tero Karras, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. 2017. Audio-driven facial animation by joint end-to-end learning of pose and emotion. ACM Transactions on Graphics (TOG) 36, 4 (2017), 1–12.
Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Yoshua Bengio and Yann LeCun (Eds.). http://arxiv.org/abs/1412.6980
Paysan et al. (2009) Pascal Paysan, Reinhard Knothe, Brian Amberg, Sami Romdhani, and Thomas Vetter. 2009. A 3D face model for pose and illumination invariant face recognition. In 2009 sixth IEEE international conference on advanced video and signal based surveillance. Ieee, 296–301.
Prajwal et al. (2020) KR Prajwal, Rudrabha Mukhopadhyay, Vinay P Namboodiri, and CV Jawahar. 2020. A lip sync expert is all you need for speech to lip generation in the wild. In Proceedings of the 28th ACM International Conference on Multimedia. 484–492.
Recommendation (2006) ITUT Recommendation. 2006. Vocabulary for performance and quality of service.
Simonyan and Zisserman (2015) Karen Simonyan and Andrew Zisserman. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Yoshua Bengio and Yann LeCun (Eds.). http://arxiv.org/abs/1409.1556
Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15, 1 (2014), 1929–1958.
Suwajanakorn et al. (2017) Supasorn Suwajanakorn, Steven M Seitz, and Ira Kemelmacher-Shlizerman. 2017. Synthesizing obama: learning lip sync from audio. ACM Transactions on Graphics (ToG) 36, 4 (2017), 1–13.
Thies et al. (2020) Justus Thies, Mohamed Elgharib, Ayush Tewari, Christian Theobalt, and Matthias Nießner. 2020. Neural voice puppetry: Audio-driven facial reenactment. In European Conference on Computer Vision. Springer, 716–731.
Thies et al. (2019) Justus Thies, Michael Zollhöfer, and Matthias Nießner. 2019. Deferred neural rendering: Image synthesis using neural textures. ACM Transactions on Graphics (TOG) 38, 4 (2019), 1–12.
Van der Maaten and Hinton (2008) Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of machine learning research 9, 11 (2008).
Wang et al. (2020a) Kaisiyuan Wang, Qianyi Wu, Linsen Song, Zhuoqian Yang, Wayne Wu, Chen Qian, Ran He, Yu Qiao, and Chen Change Loy. 2020a. Mead: A large-scale audio-visual dataset for emotional talking-face generation. In European Conference on Computer Vision. Springer, 700–717.
Wang et al. (2020b) Kaisiyuan Wang, Qianyi Wu, Linsen Song, Zhuoqian Yang, Wayne Wu, Chen Qian, Ran He, Yu Qiao, and Chen Change Loy. 2020b. MEAD: A Large-scale Audio-visual Dataset for Emotional Talking-face Generation. In ECCV.
Wang et al. (2018) Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. 2018. High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE conference on computer vision and pattern recognition. 8798–8807.
Wiles et al. (2018) Olivia Wiles, A Koepke, and Andrew Zisserman. 2018. X2face: A network for controlling face generation using images, audio, and pose codes. In Proceedings of the European conference on computer vision (ECCV). 670–686.
Yi et al. (2020) Ran Yi, Zipeng Ye, Juyong Zhang, Hujun Bao, and Yong-Jin Liu. 2020. Audio-driven talking face video generation with learning-based personalized head pose. arXiv e-prints (2020), arXiv–2002.
Yu et al. (2020) Lingyun Yu, Jun Yu, Mengyan Li, and Qiang Ling. 2020. Multimodal inputs driven talking face generation with spatial-temporal dependency. IEEE Transactions on Circuits and Systems for Video Technology (2020).
Zeng et al. (2020) Dan Zeng, Han Liu, Hui Lin, and Shiming Ge. 2020. Talking Face Generation with Expression-Tailored Generative Adversarial Network. In Proceedings of the 28th ACM International Conference on Multimedia. 1716–1724.
Zhou et al. (2019) Hang Zhou, Yu Liu, Ziwei Liu, Ping Luo, and Xiaogang Wang. 2019. Talking face generation by adversarially disentangled audio-visual representation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 9299–9306.
Zhou et al. (2021) Hang Zhou, Yasheng Sun, Wayne Wu, Chen Change Loy, Xiaogang Wang, and Ziwei Liu. 2021. Pose-controllable talking face generation by implicitly modularized audio-visual representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4176–4186.
Zhou et al. (2020) Yang Zhou, Xintong Han, Eli Shechtman, Jose Echevarria, Evangelos Kalogerakis, and Dingzeyu Li. 2020. MakeltTalk: speaker-aware talking-head animation. ACM Transactions on Graphics (TOG) 39, 6 (2020), 1–15.