One Shot Audio to Animated Video Generation
Abstract
We consider the challenging problem of audio to animated video generation. We propose a novel method, OneShotAu2AV, to generate an animated video of arbitrary length from an audio clip and a single unseen image of a person. The proposed method consists of two stages. In the first stage, OneShotAu2AV generates a talking-head video in the human domain given an audio clip and a person's image. In the second stage, the talking-head video is converted from the human domain to the animated domain. The model architecture of the first stage consists of a spatially adaptive normalization based multi-level generator and multiple multi-level discriminators, along with multiple adversarial and non-adversarial losses. The second stage leverages an attention-based normalization driven GAN architecture along with temporal predictor based recycle loss and blink loss coupled with lip-sync loss, for unsupervised generation of animated video. In our approach, the input audio clip is not restricted to any specific language, which gives the method multilingual applicability. OneShotAu2AV can generate animated videos that have lip movements in sync with the audio, natural facial expressions such as blinks and eyebrow movements, and head movements. Experimental evaluation demonstrates superior performance of OneShotAu2AV compared to U-GAT-IT and RecycleGan on multiple quantitative metrics including KID (Kernel Inception Distance), word error rate and blinks/sec.
1 Introduction
Audio to video generation has numerous applications across industry verticals including film making, multimedia, marketing, education and others. In the film industry, it can help by automatically generating video from voice acting and by reconstructing occluded parts of the face. Additionally, it can help in limited-bandwidth visual communication by using audio to auto-generate the entire visual content or to fill in dropped frames. High-quality video generation with expressive facial movements is a challenging problem that involves complex learning steps. Most of the work in this field has centered on mapping audio features (MFCCs, phonemes) to visual features (facial landmarks, visemes, etc.) (Aneja and Li, 2019; Tian et al., 2019; Cappelletta and Harte, 2012; Lee and Yook, 2002), after which computer graphics techniques select frames of a specific person from a database to generate expressive faces. The few techniques that attempt to generate video directly from raw audio focus on reconstructing the mouth area only (Chung et al., 2017). Due to this complete focus on lip-syncing, the aim of capturing human expression is ignored; further, such methods lack smooth transitions between frames, which keeps the final video from looking natural. Regardless of which approach is used, the methods described above are either subject dependent (Suwajanakorn et al., 2017; Kim and Ganapath, 2016), generate unnatural videos (Wiles et al., 2018) due to the lack of smooth transitions, or require high compute time to generate video for a new unseen speaker image in order to ensure high-quality output (Zakharov et al., 2019). Recent unsupervised approaches such as RecycleGAN (Bansal et al., 2018) and U-GAT-IT (Kim et al., 2019) either generate low-quality videos or have low expressiveness due to the lack of eye blinks, eyebrow movement and head movement.
We propose a novel approach based on two stages that convert an audio clip and a single image of a person into an animated video. The first stage generates a speaker-independent and language-independent, high-quality, natural-looking talking-head video from a single unseen image and an audio clip. It captures word embeddings from the audio clip using a pretrained deepspeech2 model (Amodei et al., 2015) trained on the Librispeech corpus (Panayotov et al., 2015). These embeddings and the image are then fed to a multi-level generator network based on the Spatially-Adaptive Normalization architecture (Park et al., 2019). Multiple multi-level discriminators (Wang et al., 2018) are used to ensure synchronized and realistic human video generation. A multi-level temporal discriminator is modeled to ensure temporal smoothing along with spatial consistency. Finally, to ensure lip synchronization, we use a SyncNet-architecture-based discriminator (Assael et al., 2017) applied to the lower half of the image. To make the generator independent of the input time instant, a sliding-window approach is used. Since the generator needs to learn multiple facial component movements along with high video quality, multiple loss functions, both adversarial and non-adversarial, are used in a curriculum learning fashion. For fast, low-cost adaptation to an unseen image, a few output update epochs suffice, providing one-shot learning capability to our approach.
The second stage couples an attention-based normalization driven GAN architecture with a temporal predictor based recycle loss, a blink loss and a lip-sync loss to generate a high-quality animated video from the human-domain video obtained in the first stage.
Specifically, we make the following contributions:
We present a novel approach, OneShotAu2AV, that uses two independently trained stages to convert an audio clip and a single image into an animated video.
The first stage takes an audio clip and a single image of a person as input and leverages curriculum learning to simultaneously learn movements of expressive facial components and generate a high-quality talking-head video of the given person. This stage feeds the features generated from the audio input directly into a generative adversarial network and adapts to any given unseen selfie by applying one-shot learning with only a few output update epochs.
The second stage leverages attention based normalization driven GAN architecture along with temporal predictor based recycle loss and blink loss coupled with lip-sync loss, for unsupervised generation of animated video that demonstrates eye blinks, eyebrow movements and lip-synchronization with audio.
Experimental evaluation demonstrates superior performance of OneShotAu2AV compared to U-GAT-IT and RecycleGan on multiple quantitative metrics including KID (Kernel Inception Distance), word error rate and blinks/sec.
2 Related Work
A lot of work has been done in synthesizing realistic videos in the human domain or animated domain from an audio clip and an image as an input. Speech is a combination of content and expression and there is a perceptual variability of speech that exists in the form of various languages, dialects, and accents.
Audio to Human Domain Video Generation
The earliest methods for generating videos relied on Hidden Markov Models, which captured the dynamics of audio and video sequences. Simons and Cox (1990) used the Viterbi algorithm to calculate the most likely sequence of mouth shapes given particular utterances. Such methods are not capable of generating high-quality videos and lack emotions.
CNN-based models have been used to generate realistic videos given audio and a single image as input. The Audio2Face model (Tian et al., 2019) uses a CNN to generate an image from audio signals. Speech2Vid (Chung et al., 2017) uses an encoder-decoder based approach for generating realistic videos. Other approaches, such as Synthesizing Obama: learning lip sync from audio (Suwajanakorn et al., 2017), are trained for a single subject. LumiereNet (Kim and Ganapath, 2016) uses an LSTM, DensePose (Guler et al., 2018) and Pix2Pix (Isola et al., 2017) for generating videos. These approaches, however, either have limitations in terms of expressions such as lip-sync, eye blinks and emotions, or they are trained for a single subject and are not generalizable. We propose a spatially adaptive generator along with multiple discriminators that generates high-quality, lip-synchronized video with expressions such as eye blinks.
(Zakharov et al., 2019) uses meta-learning to create videos of unseen images. Few-shot Video-to-Video Synthesis (Wang et al., 2019) is able to generate videos of unseen images given a video as input by using a network weight generation module to extract the pattern. Such methods are computationally expensive compared to our one-shot approach for video generation. Realistic Speech-Driven Facial Animation with GANs (RSDGAN) (Vougioukas et al., 2004) uses a GAN-based approach to produce quality videos. It uses an identity encoder, a context encoder and a frame decoder to generate images, together with several discriminators that handle different aspects of video generation: a frame discriminator to distinguish real and fake images, a sequence discriminator to distinguish real and fake videos, and a synchronization discriminator for better lip synchronization. We introduce spatially adaptive normalization together with a one-shot approach and curriculum learning to produce better results, as explained in Sections 3 and 4.
Video to Animated Video generation
Initially, phoneme- and viseme-based methods were used to animate stylized characters. (Aneja and Li, 2019) uses an LSTM-based approach to generate live lip synchronization on a 2D animated character. Some of these methods target rigged 3D characters or meshes with predefined mouth blend shapes that correspond to speech sounds (Stef et al., 2018; Karras et al., 2017; Taylor et al., 2012; Edwards et al., 2016; Mattheyses and Verhelst, 2014; Suwajanakorn et al., 2017), while others generate 2D motion trajectories that can be used to deform facial images to produce continuous mouth motions (Brand, 1999; Cao et al., 2005). These methods focus primarily on mouth motions and do not show expressions such as eye blinks and eyebrow movements.
Several works classify expressions and facial action units and map them to an animated version of a person, such as (Leone et al., 2012; Santos Pérez et al., 2011; Gilbert et al., 2018; Poggi et al., 2005). They cover only a finite space of expressions and movements and are not personalized to specific people. The proposed model is able to capture these various aspects, such as facial expressions, lip-syncing and eye blinks, due to its attention-based generator and discriminator.
(Aneja et al., 2019) uses the facial action coding system (Dailey et al., 2002), ensures lip syncing using a phoneme classifier (Huggins Daines et al., 2006), and controls expressions using (Kring and Sloan, 2007) and bone control units to create an avatar rendered with the Unreal Engine (Games, 2007). This classification-based approach covers only a finite space of facial details in a video. The proposed method, on the other hand, uses an unsupervised generative approach to create animated videos with varied facial details.
The recent introduction of GANs (Ian J. Goodfellow, 2014) has shifted the focus of the machine learning community towards generative models, and considerable work has been done on image-to-image as well as video-to-video translation. Techniques such as Pix2Pix (Isola et al., 2017), Pix2PixHD (Wang et al., 2017) and SPADE (Park et al., 2019) perform image-to-image translation but require paired training data. CycleGan (Creswell et al., 2017), which uses a cycle consistency loss, handles unpaired training data but does not preserve temporal information when generating animated videos.
For video-to-video style transfer, RecycleGan (Bansal et al., 2018) uses unpaired but ordered streams of data for both domains. This method uses a recycle loss, in addition to adversarial and cycle losses, to handle temporal information. Due to its Unet (Ronneberger et al., 2015) generator and the lack of an attention-based architecture, it is not able to generate high-quality animated video. The proposed method uses adaptive layer-instance normalization (AdaLIN) and attention-based networks along with a temporal discriminator, which yield animated videos of superior quality compared to RecycleGan.
U-GAT-IT (Kim et al., 2019) uses AdaLIN and attention maps for translating an image from one domain to another. However, this architecture does not capture temporal information and lacks lip synchronization and expressions such as eye blinks, eyebrow movement and head movement. We leverage an AdaLIN and attention-map based architecture along with a temporal predictor using recycle loss, blink loss and lip-sync loss for better expression capture in animated videos (refer to Section 5) in an unsupervised fashion. OneShotAu2AV is able to synthesize a personalized animated video from an audio clip and a single image of the person.
3 Architectural Design
OneShotAu2AV consists of two stages: Stage 1 generates realistic human-domain videos given an audio clip and a single unseen image as input, and Stage 2 generates animated videos from the realistic human-domain videos produced by Stage 1.


3.1 Stage 1
It consists of a single generator and 3 discriminators as shown in Figure 1(a).
3.1.1 Generator
Spatially Adaptive Generator:
The initial layers of the generator G use deepspeech2 (Amodei et al., 2015) layers, followed by spatially-adaptive normalization similar to the SPADE architecture (Park et al., 2019). Instead of using a semantic map as the input, we use the real image as the input to the SPADE generator. This helps minimize the loss of information due to normalization.
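To make this normalization step concrete, the following is a minimal PyTorch sketch of a SPADE-style block conditioned on the input image rather than a semantic map; the module name, hidden width and layer count are our own assumptions and not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPADEFromImage(nn.Module):
    """Spatially-adaptive normalization conditioned on the input image
    (a sketch of conditioning SPADE on the real image instead of a semantic map)."""
    def __init__(self, feature_channels, image_channels=3, nhidden=128):
        super().__init__()
        # Parameter-free normalization of the incoming feature map.
        self.norm = nn.InstanceNorm2d(feature_channels, affine=False)
        # Shared conv tower over the conditioning image.
        self.shared = nn.Sequential(
            nn.Conv2d(image_channels, nhidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        # Per-pixel scale (gamma) and shift (beta) predicted from the image.
        self.gamma = nn.Conv2d(nhidden, feature_channels, kernel_size=3, padding=1)
        self.beta = nn.Conv2d(nhidden, feature_channels, kernel_size=3, padding=1)

    def forward(self, features, image):
        normalized = self.norm(features)
        # Resize the conditioning image to the feature-map resolution.
        image = F.interpolate(image, size=features.shape[2:], mode="bilinear",
                              align_corners=False)
        actv = self.shared(image)
        return normalized * (1 + self.gamma(actv)) + self.beta(actv)

# Example: modulate a 256-channel feature map with a 128x128 face image.
feats = torch.randn(1, 256, 32, 32)
face = torch.randn(1, 3, 128, 128)
print(SPADEFromImage(256)(feats, face).shape)  # torch.Size([1, 256, 32, 32])
```

Because the scale and shift are predicted per pixel from the conditioning image, identity information survives normalization rather than being washed out.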
Audio features using deepspeech2 model:
The MFCC coefficients of the audio signal are fed to the deepspeech2 model to extract the content-related information of the audio. The outputs of the initial few layers are fed to the generator, which helps achieve better lip synchronization in the video and a lower word error rate.
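As a rough illustration of this audio pathway, the sketch below computes MFCCs and passes them through a small convolutional-recurrent stack standing in for the first deepspeech2 layers; the use of librosa, the layer sizes and the file name in the usage comment are assumptions, and the actual pretrained deepspeech2 weights are not shown.

```python
import librosa
import torch
import torch.nn as nn

def mfcc_features(wav_path, sr=16000, n_mfcc=13, hop_s=0.04):
    """Compute MFCCs for an audio file sampled at 16 kHz (hop of 40 ms)."""
    audio, _ = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc,
                                hop_length=int(hop_s * sr))
    return torch.from_numpy(mfcc).float()          # shape: (n_mfcc, T)

class SpeechContentEncoder(nn.Module):
    """Stand-in for the first deepspeech2 layers: 2D convolutions over the
    time-frequency plane followed by a bidirectional GRU."""
    def __init__(self, hidden=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, stride=(2, 1), padding=2),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, kernel_size=5, stride=(2, 1), padding=2),
            nn.ReLU(inplace=True),
        )
        self.rnn = nn.GRU(input_size=32 * 4, hidden_size=hidden,
                          batch_first=True, bidirectional=True)

    def forward(self, mfcc):                        # mfcc: (B, 13, T)
        x = self.conv(mfcc.unsqueeze(1))            # (B, 32, 4, T)
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)
        out, _ = self.rnn(x)                        # (B, T, 2 * hidden)
        return out                                  # per-time-step content features

# Hypothetical usage (file name is illustrative only):
# feats = SpeechContentEncoder()(mfcc_features("clip.wav").unsqueeze(0))
```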
3.1.2 Discriminator
We have used 3 discriminators namely a multi-scale frame discriminator, a multi-scale temporal discriminator and a synchronization discriminator.
Multi-scale Frame Discriminator
A multi-scale discriminator (Wang et al., 2018), D, is used in the proposed model to distinguish both the coarser and the finer details of real and fake images. Adversarial training against this discriminator helps in generating realistic frames. Generating high-resolution frames requires an architecture with a large receptive field, but a deeper network can overfit; to avoid this, multi-scale discriminators are used.
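A minimal sketch of such a multi-scale frame discriminator is shown below, applying the same patch-based discriminator to progressively downsampled copies of a frame; the channel widths and the number of scales are assumptions.

```python
import torch
import torch.nn as nn

def patch_discriminator(in_ch=3, ndf=64):
    """A PatchGAN-style discriminator over image patches (layer sizes assumed)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, ndf, 4, stride=2, padding=1), nn.LeakyReLU(0.2, True),
        nn.Conv2d(ndf, ndf * 2, 4, stride=2, padding=1),
        nn.InstanceNorm2d(ndf * 2), nn.LeakyReLU(0.2, True),
        nn.Conv2d(ndf * 2, ndf * 4, 4, stride=2, padding=1),
        nn.InstanceNorm2d(ndf * 4), nn.LeakyReLU(0.2, True),
        nn.Conv2d(ndf * 4, 1, 4, stride=1, padding=1),
    )

class MultiScaleFrameDiscriminator(nn.Module):
    """Applies the same patch discriminator at several image scales so that
    coarse structure and fine detail are judged separately."""
    def __init__(self, num_scales=3):
        super().__init__()
        self.nets = nn.ModuleList(patch_discriminator() for _ in range(num_scales))
        self.downsample = nn.AvgPool2d(3, stride=2, padding=1)

    def forward(self, frame):
        outputs = []
        for net in self.nets:
            outputs.append(net(frame))
            frame = self.downsample(frame)   # next discriminator sees a coarser view
        return outputs

scores = MultiScaleFrameDiscriminator()(torch.randn(1, 3, 256, 256))
print([s.shape for s in scores])
```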
Multi-scale Temporal Discriminator
Every frame in a video depends on its previous frames. To capture temporal as well as spatial properties, we use a multi-scale temporal discriminator (Kim and Ganapath, 2016). This discriminator is modeled to ensure a smooth transition between consecutive frames and achieve a natural-looking video sequence. The multi-scale temporal discriminator is described as
$$\mathcal{L}_{T}(G, D_T) = \mathbb{E}\big[\log D_T(v_{t:t+L})\big] + \mathbb{E}\big[\log\big(1 - D_T(G(a_{t:t+L}))\big)\big] \tag{1}$$
where $t$ is the time instance of the audio, $v_{t:t+L}$ and $a_{t:t+L}$ denote the video frames and audio segments in the corresponding window, and $L$ is the length of the time interval over which the adversarial loss is computed.
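The sketch below illustrates one plausible realization of this discriminator, scoring a window of L consecutive frames with 3D convolutions at two spatial scales; the layer configuration is an assumption, not the paper's exact network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalDiscriminator(nn.Module):
    """Scores a window of L consecutive frames with 3D convolutions, so that
    both per-frame content and frame-to-frame transitions are judged."""
    def __init__(self, in_ch=3, ndf=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, ndf, kernel_size=(3, 4, 4), stride=(1, 2, 2), padding=1),
            nn.LeakyReLU(0.2, True),
            nn.Conv3d(ndf, ndf * 2, kernel_size=(3, 4, 4), stride=(1, 2, 2), padding=1),
            nn.LeakyReLU(0.2, True),
            nn.Conv3d(ndf * 2, 1, kernel_size=(3, 4, 4), stride=(1, 2, 2), padding=1),
        )

    def forward(self, frames):                     # frames: (B, C, L, H, W)
        return self.net(frames)

# Two scales: the full-resolution clip and a spatially downsampled copy.
clip = torch.randn(1, 3, 5, 128, 128)              # L = 5 consecutive frames
coarse = F.avg_pool3d(clip, kernel_size=(1, 2, 2))
d_fine, d_coarse = TemporalDiscriminator(), TemporalDiscriminator()
print(d_fine(clip).shape, d_coarse(coarse).shape)
```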
Synchronization Discriminator
To ensure coherent lip synchronization, the proposed model uses the SyncNet architecture proposed in Lip Sync in the Wild (Chung and Zisserman, 2016). The input to the discriminator is an audio segment spanning a 200 ms interval (5 audio segments of 40 ms each) and 5 frames of the video. The lower half of each frame, resized to (224, 224, 3), is fed as input.
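The following is a schematic two-stream version of such a synchronization discriminator: one branch embeds the five lower-half frames, the other embeds the 200 ms of audio features, and their cosine distance measures how well audio and mouth motion are synchronized. The layer sizes and the use of MFCC inputs for the audio branch are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SyncDiscriminator(nn.Module):
    """SyncNet-style two-stream network: one branch embeds 5 lower-half
    frames, the other embeds the corresponding 0.2 s of audio features;
    their cosine distance measures audio-visual sync (sizes are assumptions)."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.video_stream = nn.Sequential(      # input: 5 RGB frames stacked -> 15 channels
            nn.Conv2d(15, 64, 5, stride=2, padding=2), nn.ReLU(True),
            nn.Conv2d(64, 128, 5, stride=2, padding=2), nn.ReLU(True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, embed_dim),
        )
        self.audio_stream = nn.Sequential(      # input: MFCC "image" (1, 13, T)
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(True),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, embed_dim),
        )

    def forward(self, mouth_frames, mfcc):
        v = F.normalize(self.video_stream(mouth_frames), dim=1)
        a = F.normalize(self.audio_stream(mfcc), dim=1)
        # Distance close to 0 for in-sync pairs, larger for off-sync pairs.
        return 1.0 - F.cosine_similarity(v, a)

# Lower halves of five 224x224 frames (height 112) plus 0.2 s of MFCCs.
mouths = torch.randn(1, 15, 112, 224)
mfcc = torch.randn(1, 1, 13, 20)
print(SyncDiscriminator()(mouths, mfcc))
```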
3.1.3 Losses
We use adversarial loss $\mathcal{L}_{GAN}$, temporal adversarial loss $\mathcal{L}_{T}(G,D_T)$, feature loss $\mathcal{L}_{FL}$, reconstruction loss $\mathcal{L}_{RL}$, perceptual loss $\mathcal{L}_{PL}$, contrastive loss $\mathcal{L}_{CL}$ and blink loss $\mathcal{L}_{BL}$ to generate high-quality output. The objective function is given below; a detailed explanation of each loss is provided in the supplementary material.
$$\min_{G}\max_{D,D_T}\; \mathcal{L}_{GAN} + \lambda_{T}\,\mathcal{L}_{T}(G,D_T) + \lambda_{FL}\,\mathcal{L}_{FL} + \lambda_{RL}\,\mathcal{L}_{RL} + \lambda_{PL}\,\mathcal{L}_{PL} + \lambda_{CL}\,\mathcal{L}_{CL} + \lambda_{BL}\,\mathcal{L}_{BL}$$
where $\lambda_{T}$, $\lambda_{FL}$, $\lambda_{RL}$, $\lambda_{PL}$, $\lambda_{CL}$ and $\lambda_{BL}$ are the hyperparameters that control the importance of the various loss terms in the above objective function.
Curriculum Learning
We train the model in multiple phases so that it produces better results. In the first phase, we use the multi-scale frame discriminator and apply the adversarial loss, feature matching loss and perceptual loss to learn the higher-level features of the image. When these losses stabilize, we move to the second phase, in which we add the multi-scale temporal discriminator and the synchronization discriminator and use the reconstruction loss, contrastive loss and temporal adversarial loss to obtain better image quality near the mouth region and coherent, lip-synchronized, high-quality videos. After the above losses stabilize, we add the blink loss in the third phase to generate more realistic frames that capture expressions such as eye movement and eye blinks.
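Schematically, the curriculum can be thought of as a phase-dependent set of active loss terms, as in the sketch below; the loss names, the unit weights and the criterion used to decide that a phase has stabilized are assumptions.

```python
# Curriculum phases: which loss terms are active in each training phase.
PHASES = {
    1: ["adversarial", "feature_matching", "perceptual"],
    2: ["adversarial", "feature_matching", "perceptual",
        "reconstruction", "contrastive", "temporal_adversarial"],
    3: ["adversarial", "feature_matching", "perceptual",
        "reconstruction", "contrastive", "temporal_adversarial", "blink"],
}

def total_loss(loss_dict, phase, weights):
    """Sum only the losses that are active in the current curriculum phase."""
    return sum(weights[name] * loss_dict[name] for name in PHASES[phase])

# Illustrative call with dummy scalar losses and unit weights.
dummy = {name: 0.5 for name in PHASES[3]}
print(total_loss(dummy, phase=2, weights={name: 1.0 for name in PHASES[3]}))
```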
One-shot learning
To achieve sharper and better image quality for an unseen subject, we use a one-shot approach based on the perceptual loss at inference time. Our approach is computationally less expensive than (Zakharov et al., 2019; Wang et al., 2019), described in Section 2, and because of the spatially adaptive nature of the generator architecture we are able to achieve high-quality video. We run the model for 5 update epochs at inference time to obtain high-quality video frames.
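A minimal sketch of this inference-time adaptation is given below, fine-tuning the generator for a few epochs on the single unseen image with a VGG-feature perceptual loss; the generator call signature, optimizer settings and choice of VGG layer are assumptions rather than the paper's exact setup.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19

def perceptual_loss(fake, real, feat_net):
    """VGG-feature (perceptual) distance between generated and target images."""
    return F.l1_loss(feat_net(fake), feat_net(real))

def one_shot_adapt(generator, audio_feats, target_image, epochs=5, lr=1e-4):
    """Fine-tune the generator for a few epochs on a single unseen image
    (a sketch; the generator signature and hyperparameters are assumptions)."""
    feat_net = vgg19(weights="IMAGENET1K_V1").features[:16].eval()
    for p in feat_net.parameters():
        p.requires_grad_(False)
    opt = torch.optim.Adam(generator.parameters(), lr=lr)
    for _ in range(epochs):
        fake = generator(audio_feats, target_image)   # hypothetical call signature
        loss = perceptual_loss(fake, target_image, feat_net)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return generator
```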
3.2 Stage 2
Stage 2, as shown in Figure 1(b), consists of a generator, a temporal predictor and a discriminator for generating high-quality animated videos.
3.2.1 Generator
The generator consists of an encoder and an attention-based normalization-driven decoder. The attention-based adaptive instance and layer normalization (AdaLIN) is inspired by the Class Activation Map (CAM) (Zhou et al., 2015), which is trained to learn weights of the feature maps of the source domain using global average pooling and global max pooling. This helps the generator focus on the source-image regions that are more discriminative from the target domain, such as the eyes and mouth. AdaLIN adjusts the ratio of instance normalization (IN) and layer normalization (LN) in the decoder according to the source and target domain distributions, so that the output retains the features of the source domain as well as the style of the target domain.
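The sketch below shows the core of an AdaLIN layer: a learnable ratio rho mixes instance-norm and layer-norm statistics, while gamma and beta would be predicted from the CAM-weighted features of the source image. The surrounding encoder/decoder and the fully connected layers that produce gamma and beta are omitted here.

```python
import torch
import torch.nn as nn

class AdaLIN(nn.Module):
    """Adaptive Layer-Instance Normalization: a learnable ratio `rho` mixes
    instance-norm and layer-norm statistics; gamma/beta are supplied
    externally (e.g. predicted from CAM-weighted source features)."""
    def __init__(self, num_features, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.rho = nn.Parameter(torch.full((1, num_features, 1, 1), 0.9))

    def forward(self, x, gamma, beta):
        # Instance-norm statistics: per sample, per channel.
        in_mean = x.mean(dim=(2, 3), keepdim=True)
        in_var = x.var(dim=(2, 3), keepdim=True, unbiased=False)
        x_in = (x - in_mean) / torch.sqrt(in_var + self.eps)
        # Layer-norm statistics: per sample, across all channels and pixels.
        ln_mean = x.mean(dim=(1, 2, 3), keepdim=True)
        ln_var = x.var(dim=(1, 2, 3), keepdim=True, unbiased=False)
        x_ln = (x - ln_mean) / torch.sqrt(ln_var + self.eps)
        rho = self.rho.clamp(0, 1)
        x_hat = rho * x_in + (1 - rho) * x_ln
        return x_hat * gamma.view(x.size(0), -1, 1, 1) + beta.view(x.size(0), -1, 1, 1)

# gamma/beta would come from fully connected layers over CAM-weighted features.
x = torch.randn(2, 64, 32, 32)
gamma, beta = torch.ones(2, 64), torch.zeros(2, 64)
print(AdaLIN(64)(x, gamma, beta).shape)
```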
3.2.2 Discriminator
The discriminator consists of an encoder and an auxiliary classifier. The discriminator concentrates its attention on determining whether the target image is real or fake by visualizing local and global attention maps, which helps the generator capture the global structure (e.g., the face area and the region near the eyes) as well as the local regions.
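A sketch of this CAM-style attention head is shown below: global average- and max-pooled features are classified as real or fake, and the classifier weights re-weight the feature maps into an attention map; the channel counts are assumptions.

```python
import torch
import torch.nn as nn

class CAMAttention(nn.Module):
    """Auxiliary-classifier attention: global average- and max-pooled features
    are classified, and the classifier weights re-weight the feature maps,
    producing an attention heatmap over discriminative regions."""
    def __init__(self, channels):
        super().__init__()
        self.gap_fc = nn.Linear(channels, 1, bias=False)
        self.gmp_fc = nn.Linear(channels, 1, bias=False)
        self.conv1x1 = nn.Conv2d(channels * 2, channels, kernel_size=1)
        self.act = nn.LeakyReLU(0.2, True)

    def forward(self, feat):                                  # feat: (B, C, H, W)
        gap_logit = self.gap_fc(feat.mean(dim=(2, 3)))        # (B, 1)
        gmp_logit = self.gmp_fc(feat.amax(dim=(2, 3)))        # (B, 1)
        gap = feat * self.gap_fc.weight.view(1, -1, 1, 1)     # channel re-weighting
        gmp = feat * self.gmp_fc.weight.view(1, -1, 1, 1)
        feat = self.act(self.conv1x1(torch.cat([gap, gmp], dim=1)))
        cam_logit = torch.cat([gap_logit, gmp_logit], dim=1)  # used by the CAM loss
        heatmap = feat.sum(dim=1, keepdim=True)               # attention map
        return feat, cam_logit, heatmap

feat, cam_logit, attn = CAMAttention(256)(torch.randn(1, 256, 16, 16))
print(feat.shape, cam_logit.shape, attn.shape)
```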
3.2.3 Temporal Predictor
We use unpaired data but ordered streams of frames for the source and target domains. To learn a better mapping from source to target domain, we focus on learning the temporal information. We introduce a temporal predictor, whose architecture is the same as UNet (Ronneberger et al., 2015), which predicts future frames given past frames as input. It is trained with an L2 loss.
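The sketch below shows how the temporal predictor and the recycle loss might fit together, in the spirit of RecycleGAN: the predictor is trained with an L2 loss to forecast the next frame, and the recycle loss translates past source-domain frames to the target domain, predicts the next target-domain frame and translates it back. The function names and the assumption that the generators operate on stacked frames are ours.

```python
import torch.nn.functional as F

def predictor_l2_loss(predictor, past_frames, next_frame):
    """L2 loss training the UNet-style temporal predictor to forecast the
    next frame from a stack of past frames."""
    return F.mse_loss(predictor(past_frames), next_frame)

def recycle_loss(G_xy, G_yx, predictor_y, past_x, next_x):
    """Recycle loss: translate past source-domain frames to the target domain,
    predict the next target-domain frame, translate it back, and compare it
    with the true next source-domain frame."""
    past_y_fake = G_xy(past_x)            # assumed to map a stack of frames
    next_y_fake = predictor_y(past_y_fake)
    return F.l1_loss(G_yx(next_y_fake), next_x)
```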
3.2.4 Losses
Adversarial loss $\mathcal{L}_{adv}$, identity loss $\mathcal{L}_{identity}$ and CAM loss $\mathcal{L}_{cam}$ are used for the domain transfer, while recycle loss $\mathcal{L}_{recycle}$, lip-sync loss $\mathcal{L}_{lip}$ and blink loss $\mathcal{L}_{blink}$ extract the spatial and temporal information from a video, which helps in generating high-quality, expressive animated videos. A detailed explanation is given in the supplementary material. The objective function is given below:
$$\min_{G,P}\max_{D}\; \mathcal{L}_{adv} + \lambda_{id}\,\mathcal{L}_{identity} + \lambda_{cam}\,\mathcal{L}_{cam} + \lambda_{rec}\,\mathcal{L}_{recycle} + \lambda_{lip}\,\mathcal{L}_{lip} + \lambda_{blink}\,\mathcal{L}_{blink}$$
where $\lambda_{id}$, $\lambda_{cam}$, $\lambda_{rec}$, $\lambda_{lip}$ and $\lambda_{blink}$ are the hyperparameters used to control the importance of the various loss terms in the above objective function.
4 Experiments and Results
4.1 Datasets & Training
Our model is implemented in PyTorch and takes approximately days to train on Nvidia V100 GPUs. Around and videos of the GRID dataset are used for training and testing, respectively. We take and videos of the LOMBARD GRID and CREMA-D datasets for training and testing, respectively. Frames are extracted at fps. We use a 16 kHz sampling frequency for the audio signals and MFCC coefficients over 0.2 s of overlapping audio for experimentation.
We have used the GRID dataset (Cooke, 2006), LOMBARD GRID (Najwa Alghamdi and Brown, 2018) and CREMA-D (Cao et al., 2014) for the experimentation and evaluation of different metrics for stage 1.
We use the GRID dataset (Cooke, 2006) and hikemoji animated videos for the experimentation and evaluation of Stage 2, with around 100 videos for training and 30 videos for testing in each domain.
4.2 Metrics
To quantify the quality of the final generated video, we use the following metrics: PSNR (Peak Signal-to-Noise Ratio), SSIM (Structural Similarity Index), CPBD (Cumulative Probability Blur Detection), ACD (Average Content Distance) and KID (Kernel Inception Distance). KID (Bińkowski et al., 2018) computes the squared Maximum Mean Discrepancy between the feature representations of real and generated images. PSNR, SSIM and CPBD measure the quality of the generated image in terms of the presence of noise, perceptual degradation and blurriness, respectively. ACD (Tulyakov et al., 2018) is used for identification of the speaker from the generated frames using OpenPose (Cao et al., 2018). Along with image-quality metrics, we also compute WER (Word Error Rate) using the pretrained LipNet architecture (Assael et al., 2017) and blinks/sec using (Soukupova and Cech, 2016) to evaluate speech recognition and eye-blink reconstruction performance, respectively.
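For reference, the sketch below computes KID as the unbiased squared MMD with the polynomial kernel of Bińkowski et al. (2018) over pre-extracted Inception features; the feature extraction itself is assumed to be done separately (e.g. with the 2048-d Inception-v3 pooling layer).

```python
import torch

def polynomial_kernel(x, y, degree=3, coef0=1.0):
    """k(x, y) = (x.y / d + coef0)^degree, the kernel used by KID."""
    d = x.shape[1]
    return (x @ y.t() / d + coef0) ** degree

def kid(real_feats, fake_feats):
    """Unbiased squared MMD between Inception features of real and generated
    frames (features are assumed to be pre-extracted)."""
    m, n = real_feats.shape[0], fake_feats.shape[0]
    k_rr = polynomial_kernel(real_feats, real_feats)
    k_ff = polynomial_kernel(fake_feats, fake_feats)
    k_rf = polynomial_kernel(real_feats, fake_feats)
    # Exclude diagonal terms for the unbiased estimate.
    mmd2 = ((k_rr.sum() - k_rr.diag().sum()) / (m * (m - 1))
            + (k_ff.sum() - k_ff.diag().sum()) / (n * (n - 1))
            - 2 * k_rf.mean())
    return mmd2

print(kid(torch.randn(100, 2048), torch.randn(100, 2048)))
```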
4.3 Qualitative Results
OneShotAu2AV produces natural-looking, high-quality animated videos from an unseen input image and audio signals. The videos are lip-synchronized to the sentences provided as well as enriched with natural expressions such as head movements, eye blinks and eyebrow movements. Videos were generated for different languages, confirming that the proposed method is language independent and can generate videos for any linguistic community.
Figure 2 and Figure 5 display examples of generated lip-synchronized video for male and female test cases in the human and animated domains. As observed, the opening and closing of the mouth is in sync with the audio signal. Figure 5 also displays a slight head movement of the animated person (between frames 3 and 4). Figure 3 and Figure 6 display eye blinks and facial expressions such as frowns in the videos of both domains. Figure 4 displays examples of generated lip-synchronized video for people uttering Hindi and Bengali words, such as 'Modi' and 'aache' respectively. Figure 7 displays the same for the animated domain.
[Qualitative comparison (images omitted): frames while speaking 'now' for the input video and the corresponding outputs of OneShotAu2AV, U-GAT-IT and RecycleGAN.]
4.4 Quantitative Results
Stage 1
The proposed model performs better on image reconstruction metrics, including PSNR and SSIM, for both the GRID and CREMA-D datasets compared to Realistic Speech-Driven Facial Animation with GANs (RSDGAN) (Vougioukas et al., 2004) and Speech2Vid (Chung et al., 2017), as shown in Table 2. The table also displays the performance of OneShotAu2AV trained on the LOMBARD GRID dataset (Najwa Alghamdi and Brown, 2018). The improved performance of the proposed method is achieved through the use of spatially adaptive normalization in the generator architecture and training of the model in a curriculum learning fashion with appropriate adversarial and non-adversarial losses.
Method | SSIM | PSNR | CPBD | WER | ACD-C | ACD-E |
---|---|---|---|---|---|---|
OneShotAu2AV (GRID) | 0.881 | 28.571 | 0.262 | 27.5 | 0.005 | 0.09 |
RSDGAN (GRID) | 0.818 | 27.100 | 0.268 | 23.1 | - | 1.47×10⁻⁴ |
Speech2Vid (GRID) | 0.720 | 22.662 | 0.255 | 58.2 | 0.007 | 1.48×10⁻⁴ |
OneShotAu2AV (CREMA-D) | 0.773 | 24.057 | 0.184 | NA | 0.006 | 0.96 |
RSDGAN (CREMA-D) | 0.700 | 23.565 | 0.216 | NA | - | 1.40×10⁻⁴ |
Speech2Vid (CREMA-D) | 0.700 | 22.190 | 0.217 | NA | 0.008 | 1.73×10⁻⁴ |
OneShotAu2AV (LOMBARD GRID) | 0.922 | 28.978 | 0.453 | 26.1 | 0.002 | 0.064 |
Speech2Vid (LOMBARD GRID) | 0.782 | 26.784 | 0.406 | 53.1 | 0.004 | 0.069 |
Stage 2:
The proposed model shows better results on the image-translation metric: its KID is 2x and 8x better than that of U-GAT-IT and RecycleGan, respectively, as displayed in Table 3. This is achieved by adding the temporal predictor based recycle loss and the lip-sync loss to the networks, which help in reconstructing high-quality animated output. OneShotAu2AV also performs better on the lip synchronization metric (WER), being 2x and 8x better than U-GAT-IT and RecycleGan, respectively.
Method | KID ×100 (± std. ×100) | WER | Blinks/sec |
---|---|---|---|
OneShotAu2AV | 5.02 ± 0.03 | 31.97 | 0.546 |
U-GAT-IT | 10.37 ± 0.17 | 68.85 | 0.046 |
RecycleGan | 42.54 ± 0.68 | 240.51 | 0.0 |
4.5 Ablation Study
We conducted detailed ablation studies. In Stage 1, the addition of the contrastive loss and the multi-scale temporal adversarial loss improves the SSIM (0.867 to 0.873), PSNR (27.996 to 28.327) and CPBD (0.213 to 0.259) scores measured on the GRID dataset. Adding the blink loss leads to further improvements in SSIM (0.873 to 0.881), PSNR (28.327 to 28.571) and CPBD (0.259 to 0.262). Similar improvements are observed on the LOMBARD GRID dataset.
In Stage 2, the addition of the temporal predictor based recycle loss and the lip-sync loss improves KID (10.37 to 5.02) and WER (68.85 to 31.97). For further details, kindly refer to the supplementary material.
Psychophysical assessment:
For video attachments and Psychophysical assessment(results of Turing test and user ratings), kindly refer to the supplementary material.
5 Conclusion and Future Work
In this paper, we have presented a novel approach, OneShotAu2AV, to convert an audio clip and a single image of a person into an animated video. Using two stages built on multi-level generators and discriminators with appropriate adversarial and non-adversarial losses, we achieve synced lip movements, blinks and eyebrow movements in the output. Experimental evaluation demonstrates superior performance of OneShotAu2AV compared to U-GAT-IT and RecycleGan on multiple quantitative metrics including KID (Kernel Inception Distance), word error rate and blinks/sec. In future work, we will look at techniques to further enhance the expressiveness of the generated animated videos.
References
- Amodei et al. (2015) Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, Jie Chen, Jingdong Chen, Zhijie Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, Ke Ding, Niandong Du, Erich Elsen, and Zhenyao Zhu. Deep speech 2: End-to-end speech recognition in english and mandarin. 12 2015.
- Aneja and Li (2019) Deepali Aneja and Wilmot Li. Real-time lip sync for live 2d animation. 2019.
- Aneja et al. (2019) Deepali Aneja, Daniel McDuff, and Shital Shah. A high-fidelity open embodied avatar with lip syncing and expression capabilities. pages 69–73, 10 2019. ISBN 978-1-4503-6860-5. doi: 10.1145/3340555.3353744.
- Assael et al. (2017) Yannis M Assael, Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. Lipnet: End-to-end sentence-level lipreading. GPU Technology Conference, 2017. URL https://github.com/Fengdalu/LipNet-PyTorch.
- Bansal et al. (2018) Aayush Bansal, Shugao Ma, Deva Ramanan, and Yaser Sheikh. Recycle-gan: Unsupervised video retargeting. In ECCV, 2018.
- Bińkowski et al. (2018) Mikołaj Bińkowski, DJ Sutherland, M Arbel, and A Gretton. Demystifying mmd gans. 01 2018.
- Brand (1999) Matthew Brand. Voice puppetry. In Proceedings of the 26th annual conference on Computer graphics and interactive techniques. ACM Press/Addison-Wesley Publishing Co, page 21–28, 1999.
- Cao et al. (2014) Houwei Cao, David Cooper, Michael Keutmann, Ruben Gur, Ani Nenkova, and Ragini Verma. Crema-d: Crowd-sourced emotional multimodal actors dataset. IEEE transactions on affective computing, 5:377–390, 10 2014. doi: 10.1109/TAFFC.2014.2336244.
- Cao et al. (2005) Yong Cao, Wen Tien, Petros Faloutsos, and Frederic Pighin. Expressive speech-driven facial animation. ACM Trans. Graph., 24:1283–1302, 10 2005. doi: 10.1145/1095878.1095881.
- Cao et al. (2018) Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh. OpenPose: realtime multi-person 2D pose estimation using Part Affinity Fields. In arXiv preprint arXiv:1812.08008, 2018.
- Cappelletta and Harte (2012) Luca Cappelletta and Naomi Harte. Phoneme-to-viseme mapping for visual speech recognition. Proceedings of the International Conference on Pattern Recognition Applications and Methods (ICPRAM 2012), 2, 05 2012.
- Chung and Zisserman (2016) J. S. Chung and A. Zisserman. Out of time: automated lip sync in the wild. In Workshop on Multi-view Lip-reading, ACCV, 2016.
- Chung et al. (2017) Joon Son Chung, Amir Jamaludin, and Andrew Zisserman. You said that? In British Machine Vision Conference, 2017.
- Cooke (2006) M. Cooke, J. Barker, S. Cunningham, and X. Shao. An audio-visual corpus for speech perception and automatic speech recognition. The Journal of the Acoustical Society of America, 120(5):2421–2424, 2006.
- Creswell et al. (2017) Antonia Creswell, Tom White, Vincent Dumoulin, Kai Arulkumaran, Biswa Sengupta, and Anil Bharath. Generative adversarial networks: An overview. IEEE Signal Processing Magazine, 35, 10 2017. doi: 10.1109/MSP.2017.2765202.
- Dailey et al. (2002) Matthew Dailey, Michael Lyons, Miyuki Kamachi, H Ishi, Jiro Gyoba, and Garrison Cottrell. Cultural differences in facial expression classification. Proc. Cognitive Neuroscience Society, 9th Annual Meeting, San Francisco CA, page 153, 06 2002.
- Edwards et al. (2016) Pif Edwards, Chris Landreth, Eugene Fiume, and Karan Singh. Jali: an animator-centric viseme model for expressive lip synchronization. ACM Transactions on Graphics, 35:1–11, 07 2016. doi: 10.1145/2897824.2925984.
- Games (2007) Epic Games. Unreal Engine. Online: https://www.unrealengine.com, 2007.
- Gilbert et al. (2018) Michaël Gilbert, Samuel Demarchi, and Isabel Urdapilleta. Facshuman a software to create experimental material by modeling 3d facial expression. pages 333–334, 11 2018. doi: 10.1145/3267851.3267865.
- Guler et al. (2018) Riza Guler, Natalia Neverova, and Iasonas Kokkinos. Densepose: Dense human pose estimation in the wild. pages 7297–7306, 06 2018. doi: 10.1109/CVPR.2018.00762.
- Huggins Daines et al. (2006) David Huggins Daines, M. Kumar, A. Chan, A.W. Black, M. Ravishankar, and Alexander Rudnicky. Pocketsphinx: A free, real-time continuous speech recognition system for hand-held devices. volume 1, pages I – I, 06 2006. doi: 10.1109/ICASSP.2006.1659988.
- Ian J. Goodfellow (2014) Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets, 2014.
- Isola et al. (2017) Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei Efros. Image-to-image translation with conditional adversarial networks. pages 5967–5976, 07 2017. doi: 10.1109/CVPR.2017.632.
- Karras et al. (2017) Tero Karras, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. Audio-driven facial animation by joint end-to-end learning of pose and emotion. ACM Transactions on Graphics, 36:1–12, 07 2017. doi: 10.1145/3072959.3073658.
- Kim and Ganapath (2016) Byung-Hak Kim and Varun Ganapath. Lumièrenet: Lecture video synthesis from audio. 2016.
- Kim et al. (2019) Junho Kim, Minjae Kim, Hyeon-Woo Kang, and Kwanghee Lee. U-gat-it: Unsupervised generative attentional networks with adaptive layer-instance normalization for image-to-image translation, 07 2019.
- Kring and Sloan (2007) Ann Kring and Denise Sloan. The facial expression coding system (faces): Development, validation, and utility. Psychological assessment, 19:210–24, 06 2007. doi: 10.1037/1040-3590.19.2.210.
- Lee and Yook (2002) Soonkyu Lee and Dongsuk Yook. Audio-to-visual conversion using hidden markov models. pages 563–570, 08 2002. doi: 10.1007/3-540-45683-X_60.
- Leone et al. (2012) Giuseppe Riccardo Leone, Giulio Paci, and Piero Cosi. Lucia: An open source 3d expressive avatar for multimodal h.m.i. volume 78, pages 193–202, 01 2012. doi: 10.1007/978-3-642-30214-5_21.
- Mattheyses and Verhelst (2014) Wesley Mattheyses and Werner Verhelst. Audiovisual speech synthesis: An overview of the state-of-the-art. Speech Communication, 66, 11 2014. doi: 10.1016/j.specom.2014.11.001.
- Najwa Alghamdi and Brown (2018) Najwa Alghamdi, Steve Maddock, Ricard Marxer, Jon Barker, and Guy J. Brown. A corpus of audio-visual Lombard speech with frontal and profile views. The Journal of the Acoustical Society of America, 143:EL523, 2018. https://doi.org/10.1121/1.5042758.
- Panayotov et al. (2015) Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: An asr corpus based on public domain audio books. pages 5206–5210, 04 2015. doi: 10.1109/ICASSP.2015.7178964.
- Park et al. (2019) Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
- Poggi et al. (2005) Isabella Poggi, Catherine Pelachaud, F. Rosis, Valeria Carofiglio, and Berardina Carolis. Greta. A Believable Embodied Conversational Agent, pages 3–25. 01 2005. doi: 10.1007/1-4020-3051-7_1.
- Ronneberger et al. (2015) Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. volume 9351, pages 234–241, 10 2015. ISBN 978-3-319-24573-7. doi: 10.1007/978-3-319-24574-4_28.
- Santos Pérez et al. (2011) Marcos Santos Pérez, Eva González-Parada, and Jose Manuel Cano-Garcia. Avatar: An open source architecture for embodied conversational agents in smart environments. volume 6693, pages 109–115, 06 2011. doi: 10.1007/978-3-642-21303-8_15.
- Simons and Cox (1990) A. Simons and Stephen Cox. Generation of mouthshapes for a synthetic talking head. Proceedings of the Institute of Acoustics, Autumn Meeting, 01 1990.
- Soukupova and Cech (2016) Tereza Soukupova and Jan Cech. Real-time eye blink detection using facial landmarks, 2016.
- Stef et al. (2018) Andreea Stef, Kaveen Perera, Hubert Shum, and Edmond Ho. Synthesizing expressive facial and speech animation by text-to-ipa translation with emotion control. pages 1–8, 12 2018. doi: 10.1109/SKIMA.2018.8631536.
- Suwajanakorn et al. (2017) Supasorn Suwajanakorn, Steven Seitz, and Ira Kemelmacher. Synthesizing obama: learning lip sync from audio. ACM Transactions on Graphics, 36:1–13, 07 2017. doi: 10.1145/3072959.3073640.
- Taylor et al. (2012) Sarah Taylor, Moshe Mahler, Barry-John Theobald, and Iain Matthews. Dynamic units of visual speech. pages 275–284, 07 2012.
- Tian et al. (2019) Guanzhong Tian, Yi Yuan, and Yong Liu. Audio2face: Generating speech/face animation from single audio with attention-based bidirectional lstm networks. 05 2019.
- Tulyakov et al. (2018) Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz. MoCoGAN: Decomposing motion and content for video generation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1526–1535, 2018.
- Vougioukas et al. (2004) Konstantinos Vougioukas, Stavros Petridi, and Maja Pantic. End-to-end speech-driven facial animation with temporal gans. Journal of Foo, 14(1):234–778, 2004.
- Wang et al. (2017) Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional gans. 11 2017.
- Wang et al. (2018) Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
- Wang et al. (2019) Ting-Chun Wang, Ming-Yu Liu, Andrew Tao, Guilin Liu, Jan Kautz, and Bryan Catanzaro. Few-shot video-to-video synthesis. In Conference on Neural Information Processing Systems (NeurIPS), 2019.
- Wiles et al. (2018) O. Wiles, A.S. Koepke, and A. Zisserman. X2face: A network for controlling face generation by using images, audio, and pose codes. In European Conference on Computer Vision, 2018.
- Zakharov et al. (2019) Egor Zakharov, Aliaksandra Shysheya, Egor Burkov, and Victor Lempitsky. Few-shot adversarial learning of realistic neural talking head models, 05 2019.
- Zhou et al. (2015) Bolei Zhou, Aditya Khosla, Àgata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. 12 2015.