
A Review of Human Emotion Synthesis Based on Generative Technology

Fei Ma*, Yukan Li*, Yifan Xie*, Ying He, Yi Zhang, Hongwei Ren, Zhou Liu, Wei Yao, Fuji Ren, Fei Richard Yu, Shiguang Ni (Corresponding Author: Shiguang Ni. *: Equal Contribution.) Fei Ma, Yifan Xie, Yi Zhang, and Zhou Liu are with Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, China. E-mail: [email protected], [email protected], [email protected], [email protected]. Yukan Li, Wei Yao, and Shiguang Ni are with Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China. E-mail: [email protected], [email protected], [email protected]. Ying He is with College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China. E-mail: [email protected]. Hongwei Ren is with MICS Thrust, The Hong Kong University of Science and Technology (GZ), Guangzhou, China. E-mail: [email protected]. Fuji Ren is with School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China, and also with Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China, Shenzhen, China. E-mail: [email protected]. Fei Richard Yu is with College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China, and also with School of Information Technology, Carleton University, Canada. E-mail: [email protected]. Manuscript received April 19, 2005; revised August 26, 2015.
Abstract

Human emotion synthesis is a crucial aspect of affective computing. It involves using computational methods to mimic and convey human emotions through various modalities, with the goal of enabling more natural and effective human-computer interactions. Recent advancements in generative models, such as Autoencoders, Generative Adversarial Networks, Diffusion Models, Large Language Models, and Sequence-to-Sequence Models, have significantly contributed to the development of this field. However, there is a notable lack of comprehensive reviews in this area. This paper aims to address this gap by providing a thorough and systematic overview of recent advancements in human emotion synthesis based on generative models. Specifically, this review first presents the review methodology, the emotion models involved, the mathematical principles of generative models, and the datasets used. Then, the review covers the application of different generative models to emotion synthesis based on a variety of modalities, including facial images, speech, and text. It also examines mainstream evaluation metrics. Additionally, the review presents major findings and suggests future research directions, providing a comprehensive understanding of the role of generative technology in the nuanced domain of emotion synthesis.

Index Terms:
Emotion Synthesis, Generative Technology, Autoencoder, Generative Adversarial Network, Diffusion Model, Large Language Model, Sequence-to-Sequence Model

1 Introduction

Affective computing is an interdisciplinary research field that aims to endow computers with the ability to recognize, understand, express, and respond to human emotions [1, 2]. It integrates theories and methods from multiple disciplines such as computer science, psychology, and cognitive science, attempting to reveal the essence of human emotions and apply it to human-computer interaction and intelligent systems. The core goal of affective computing is to enable computers to perceive, understand, and express emotions like humans, thereby achieving more natural and friendly human-computer interaction [3, 4, 5, 6].

Emotion synthesis [7] is an important branch of affective computing, which aims to enable computers to generate emotional expressions similar to human emotions. This ability can be realized through various common modalities, such as facial images, speech, and text. To achieve emotion synthesis, researchers have proposed a series of traditional methods by analyzing the characteristics of human emotional expressions and establishing mathematical models. These models are then used to generate speech and facial expressions with specific emotions using computers [8, 9].

Artificial intelligence has made remarkable advances in synthesizing human emotions, marking a significant breakthrough in the field. In particular, generative technology has greatly improved the effect and application scope of emotion synthesis [10, 11, 12]. Compared with traditional methods, these new models can automatically learn the characteristics of emotional expressions from massive data without relying on manually designed rules and models [13, 14, 15, 16]. With their powerful generation capabilities, generative models can generate emotional samples that are highly similar to real data and more flexible, greatly expanding the research boundaries in the field of emotion synthesis. For example, some researchers use Autoencoders (AEs) [17] to generate speech with emotions. By modifying this structure, they can extract speaker embeddings, isolate timbre information, and control the flow of emotional attributes [18]. Other researchers use Generative Adversarial Networks (GANs) [19] to generate facial images with specific emotions. By controlling the input of the generative model, they can generate faces expressing different emotions such as happiness, sadness, and anger [20]. In recent years, Diffusion Models (DMs) [21] and Large Language Models (LLMs) [22] have also been widely used in emotion synthesis tasks. Some researchers use DMs to enhance image and audio processing by employing a reconstruction module, which leverages noising and denoising in latent spaces [23]. Other researchers use LLMs to generate empathic conversational texts by training language models on empathic conversations and injecting emotional information into response generation [24]. However, to the best of our knowledge, there is a conspicuous absence of a systematic review that specifically focuses on generative technology for human emotion synthesis within this burgeoning field.

This study examines how generative AI models synthesize human emotions, addressing current gaps in research through a systematic analysis. The overall schematic diagram is illustrated in Fig. 1. Specifically, this paper first introduces several generative models and their underlying mathematical principles to help readers better understand the technical background of this field, including AEs, GANs, DMs, LLMs, and Sequence-to-Sequence (Seq2Seq) models [25]. AEs learn compressed data representations through encoder-decoder reconstruction, enabling dimensionality reduction and feature extraction. GANs use adversarial training between a generator and discriminator to produce realistic data samples. DMs generate data by learning to reverse a noise diffusion process, often avoiding vanishing gradients. LLMs leverage transformer architectures and massive text datasets for human language processing and generation. Seq2Seq models use encoder-decoder architectures to map input sequences to output sequences, facilitating tasks like translation and synthesis. Then, this review summarizes the commonly used human emotion synthesis datasets from unimodal to cross-modal. Each dataset has annotated emotion labels according to its purpose, ranging from the simplest positive and negative labels [26] to a complex set of 32 emotion categories [27].

Subsequently, we categorize human emotion synthesis into three subfields: facial emotion synthesis, speech emotion synthesis, and textual emotion synthesis. The taxonomy of this survey is shown in Fig. 2. Facial emotion synthesis involves modifying facial information in computer-generated faces to create more realistic and diverse emotions. This area of research intersects with face reenactment [28, 29], face manipulation [30], and talking head generation [31, 32]. Speech emotion synthesis involves altering the emotional attributes of speech segments or generating new speech that conveys specific emotions by manipulating acoustic features. This section will cover various studies on Voice Conversion [33, 34], Text-to-Speech (TTS) [35], and speech manipulation tasks [36]. Textual emotion synthesis refers to using computational techniques to infuse textual content with different emotions or sentiments, thereby enhancing its expressiveness. This section will focus on text emotion transfer and empathetic dialogue generation.

Finally, this review summarizes the prevalent evaluation metrics employed in the task, presents key findings from multiple perspectives, and offers insights into future research directions based on the aforementioned systematic analysis. The main findings include: (i) Generative models have made significant progress in emotion synthesis across multiple modalities, including facial images, speech, and text. (ii) In the past, GAN-based methods have demonstrated strong capability in facial emotion synthesis, excelling at capturing subtle nuances in expressions. However, DMs have now emerged as a more promising alternative, offering superior control and stronger adaptability across different modalities. (iii) Speech emotion synthesis has benefited from the adaptation of GANs and Seq2Seq models, with further improvements through AEs and DMs to enhance emotional depth and prosodic control. (iv) Textual emotion synthesis has increasingly leveraged LLMs and Seq2Seq architectures, using sentiment control and emotional valence modulation to produce emotionally resonant content, although challenges remain in balancing emotional expressiveness with conversational coherence. (v) Both subjective and objective metrics are essential for evaluating emotion synthesis models, with future research focusing on refining both to better capture emotional subtleties and align with human judgments. Looking ahead, there are still many directions worth exploring, which include: (i) Combining different generative models like GANs, Seq2Seq models, AEs, DMs, and LLMs can enhance emotion synthesis by leveraging the strengths of each model for more accurate and realistic outputs. (ii) Exploring new modalities like gestures, electroencephalogram (EEG) [37], and electrocardiogram (ECG) [38], as well as cross-modal models, can expand the potential for immersive and interactive emotional experiences. (iii) Real-time emotion generation on edge devices like smartphones and wearables can enable personalized, adaptive emotional interactions, with applications in healthcare, retail, and more. (iv) Emotion synthesis can transform digital entertainment and filmmaking by enabling more authentic emotional expressions in virtual characters, enhancing storytelling, and allowing real-time emotional adjustments in films based on audience feedback. This research provides a framework for understanding how generative models replicate human emotions, offering insights to guide future developments in the field. Overall, the main contributions of this paper include:

  • To the best of our knowledge, this review provides the first systematic overview of human emotion synthesis based on generative technology.

  • By analyzing more than 230 related papers, this review gives a taxonomy of generative technology-based human emotion synthesis in different modalities.

  • We summarize commonly used datasets and evaluation metrics for human emotion synthesis across different modalities.

  • Finally, we discuss the current research status of human emotion synthesis based on generation technology and present a future outlook.

The remainder of this paper is structured as follows: Section 2 describes the review methodology; Sections 3 - 6 present the mainstream emotion models, the gaps in existing review research, the mathematical principles of generative models, and the commonly used datasets; Sections 7 - 9 introduce the latest human emotion synthesis works in the three modalities of facial images, speech, and text; Sections 10 and 11 summarize the common evaluation metrics in the field and discuss the current state of research and development trends; and Section 12 provides the conclusion. A list of abbreviations is given in Table I.

Refer to caption

Figure 1: Schematic Diagram of Generation Technology for Human Emotion Synthesis.
TABLE I: Main Acronyms.
Acronym Full Form
AAE Adversarial Autoencoder
ACC Accuracy
AE Autoencoder
AIGC Artificial Intelligence Generated Content
AU Action Unit
AUD AU-intensity Discriminator
BLEU Bilingual Evaluation Understudy
CAAE Conditional Adversarial Autoencoder
CGAN Conditional GAN
CWT Continuous Wavelet Transform
DM Diffusion Model
FID Fréchet Inception Distance
F0 RMSE F0 Root Mean Square Error
GAN Generative Adversarial Network
GPT Generative Pre-trained Transformer
MCD Mel Cepstral Distortion
MFCC Mel-Frequency Cepstral Coefficient
MOS Mean Opinion Score
PPL Perplexity
PSNR Peak Signal-to-Noise Ratio
RNN Recurrent Neural Network
Seq2Seq Sequence-to-Sequence
SMOS Similarity Mean Opinion Score
SSIM Structural Similarity Index
TTS Text-to-Speech
T5 Text-to-Text Transfer Transformer
VAE Variational Autoencoder
VC Voice Conversion

Refer to caption

Figure 2: Taxonomy of This Survey.

Refer to caption

Figure 3: A Comprehensive Review Methodology.

2 Review Methodology

To compile this extensive review on human emotion synthesis based on generative models, a systematic and rigorous approach was followed to ensure the comprehensiveness and relevance of the literature. The overall screening process is shown in Fig. 3. Initially, we conducted comprehensive searches across key academic databases, including IEEE Xplore, ScienceDirect, and Google Scholar. The search strategy combined general and modality-specific keywords related to generative models and emotion synthesis. Examples of search terms included “facial emotion synthesis” + “generative models,” “speech emotion synthesis” + “generative models,” “emotional face reenactment,” and “emotional voice conversion,” etc., aiming to cover a wide array of studies in these domains. To ensure the inclusion of the most recent advancements, the search focused on papers published between 2017 and 2024, providing a comprehensive overview of developments in this timeframe. To further broaden the scope, we included terms referencing specific generative model architectures, such as AE, GAN, DM, LLM, and Seq2Seq. The initial search yielded more than 270 papers.

Furthermore, we established strict inclusion criteria for the selection process: (1) Peer-reviewed papers published up to November 2024, including both journal and conference papers; (2) Research focusing on the application of generative models to human emotion synthesis in various modalities; (3) Studies in which extensive experiments were conducted and fully evaluated.

After the initial retrieval, a two-step filtering process was applied to ensure the focus remained on emotion synthesis. Studies whose primary aim was not related to emotional generation or did not involve the use of generative models were excluded. Furthermore, we eliminated papers with incremental contributions to maintain the diversity of the sources reviewed. The final set of selected works provides a detailed overview of the current and impactful advancements in the field.

By following this structured methodology, we ensured the thoroughness and relevance of the selected studies, offering a timely and well-rounded perspective on the role of generative technology in human emotion synthesis across multiple modalities. This process ultimately enabled a focused analysis of both seminal works and the latest advancements, contributing to a deepened understanding of generative models in affective computing.

3 Emotion Model

Emotion is commonly understood as a complex and ever-changing state of mind and body, which can be triggered by various interactions[39], perceptions, or thoughts[40]. It encompasses a wide range of experiences, cognitive evaluations, behavioral responses, physiological reactions [41], and communicative expressions. In the realm of human cognition, emotions play a crucial role in decision-making [42], shaping our perceptions, and guiding our interactions with others [43].

As illustrated in Fig. 4, the study of emotions has resulted in the development of different theoretical models, which can be primarily categorized into discrete emotion theory and multidimensional emotion theory [44]. In the most basic discrete emotion framework, emotions are simply categorized as positive or negative, also known as polarity [45, 46]. Within this framework, the term "emotion" is often replaced with "sentiment," which sometimes includes a neutral category as well. However, this sentiment categorization is considered too simplistic for certain contexts. Therefore, the more detailed discrete emotion theory categorizes basic emotions that are universally recognized across cultures into six or eight types [47, 48, 49]. On the other hand, the multidimensional emotion theory suggests that emotions can be viewed along a continuous spectrum, often defined by dimensions such as 2D (valence and arousal) [50] or 3D (valence, arousal, and dominance) [51]. These theoretical perspectives provide valuable insights into the complex nature of human emotions and serve as foundational principles for emotion synthesis.
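To make these two families of emotion representations concrete, the following minimal Python sketch contrasts a discrete label inventory with a continuous 2D valence-arousal point; the coordinates shown are rough illustrative placements, not values taken from any dataset or from the works cited above.

from dataclasses import dataclass

# Discrete view: an emotion is one label drawn from a fixed inventory.
BASIC_EMOTIONS = ["happiness", "sadness", "anger", "fear", "disgust", "surprise"]

# Dimensional view: an emotion is a point in a continuous space.
@dataclass
class VAEmotion:
    valence: float  # unpleasant (-1.0) to pleasant (+1.0)
    arousal: float  # calm (-1.0) to excited (+1.0)

# Illustrative (not dataset-derived) placements of two discrete labels
# in the 2D valence-arousal plane.
APPROXIMATE_COORDINATES = {
    "happiness": VAEmotion(valence=0.8, arousal=0.5),
    "sadness": VAEmotion(valence=-0.7, arousal=-0.4),
}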

By considering emotions in generative tasks, machines not only understand and process information, but also become attuned to the emotional dimensions of human experience. This enriches the human-machine interaction landscape and opens up new avenues for the development of empathetic and intuitive AI.

Refer to caption

Figure 4: Plutchik Wheel (left) and 2D Emotion Model (right).

(1) Plutchik Wheel is sourced from https://en.wikipedia.org/wiki/Robert_Plutchik#/media/File:Plutchik-wheel.svg.
(2) Diagram of 2D emotion model is modified from [44].

4 Differences

TABLE II: Differences between Our Work and Other Reviews.
Reference
Related to generative models
Related to emotion synthesis
Focus on multiple modalities
Our work * * *
Kammoun et al. [52] ✓*
Liu et al. [53] ✓*
Triantafyllopoulos et al. [54] * *
Wali et al. [55] ✓* *
Zhang et al. [56] * *
De et al. [57] ✓*
Hajarolasvadi et al. [9] ✓* * *

*: The article only relates to GAN.

The current state of research in affective computing primarily focuses on areas such as human sentiment analysis [10], emotion detection, and emotion recognition [12]. These tasks have a long-standing tradition and are considered significant when viewed in a broader context. For instance, Saxena et al. [58] conducted a survey on emotion recognition methods, examining facial, physiological, speech, and text approaches. They highlighted key techniques like Stationary Wavelet Transform and Particle Swarm Optimization. In another study, Nandwani et al. [59] analyzed sentiment analysis and emotion detection methods, discussing the transition from lexicon-based to deep learning techniques. They addressed challenges and emphasized the need for advanced and versatile models to improve accuracy and adaptability across different domains and languages. Canal et al. [60] presented a systematic literature review on facial emotion recognition from images. They categorized the techniques into classical and neural network-based approaches, highlighting the slightly higher accuracy of classical methods compared to neural networks, despite the latter’s generalization capabilities.

Generative models currently constitute one of the mainstream directions in artificial intelligence research. However, most of the existing reviews only focus on the synthesis of a single modality, such as facial images [52, 53], speech [54, 55], and text [56, 57], and many of them ignore the emotional aspects of the synthesis process. The most relevant work to ours is [9]. This paper provides a thorough review of GANs in synthesizing human emotions, with a focus on facial expressions, speech, and cross-modal synthesis. It details various GAN architectures, their applications in emotion synthesis, challenges faced, and future directions. By evaluating numerous studies, it highlights how GANs enhance emotion recognition accuracy, offer data augmentation, and create realistic, diverse emotional samples across modalities. However, the key distinction is that our survey focuses exclusively on emotion synthesis. In addition, this review introduces a series of generative models beyond GANs, such as the Seq2Seq model in the field of emotional speech synthesis and LLMs in text emotion synthesis.

In summary, the differences are shown in Table II. Our survey is based on a clear definition of human emotion synthesis, focusing on the application of generative models in this emerging field, as well as the latest research developments in various sub-fields.

5 Generative Model

A generative model [61] is a type of probabilistic model that describes how data are generated. In machine learning, it can model data directly or establish conditional probability distributions between variables via Bayes’ theorem, allowing the generation of new data not present in the training set. In this section, we explain the mathematics of different generative models, including AE, GAN, DM, LLM, and Seq2Seq.

5.1 Auto-Encoder (AE)

AEs [17] are neural network models used in generative tasks to efficiently learn data representations. These models compress input data into simplified patterns, then reconstruct it by preserving key features while minimizing errors in the reproduction process. As a variant of AE, Variational Auto-Encoders (VAEs) introduced by Kingma and Welling in 2013 [62], offer a principled approach to learning latent data representations to account for data uncertainty and variability, making them well-suited for generative tasks. The training of VAEs is guided by the maximization of the Evidence Lower BOund (ELBO), which can be expressed as follows:

\text{ELBO}=\mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)]-D_{KL}[q_{\phi}(z|x)\|p(z)], (1)

where the encoder, q_{\phi}(z|x), maps the input data x to a distribution over the latent space characterized by parameters \phi. Typically, this distribution is assumed to be Gaussian, with the encoder outputting the mean and variance of the distribution. The latent variable z, sampled from this distribution, is then fed into the decoder, p_{\theta}(x|z), which attempts to reconstruct the input data, where \theta denotes the parameters of the decoder. In Equation (1), the first term is the expected log-likelihood of the data given the latent variables, encouraging accurate reconstruction of the data. The second term measures how closely the encoded distribution matches the prior p(z), using the Kullback-Leibler divergence. By optimizing the ELBO, VAEs learn to balance the trade-off between fidelity in data reconstruction and adherence to a structured latent space.
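As a minimal sketch of how the negative of Equation (1) is typically minimized in practice (assuming a diagonal Gaussian encoder, a standard normal prior, and a mean-squared-error reconstruction term; all names here are illustrative), the loss can be computed as follows:

import torch
import torch.nn.functional as F

def reparameterize(mu, logvar):
    # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I).
    std = torch.exp(0.5 * logvar)
    return mu + std * torch.randn_like(std)

def negative_elbo(x, x_recon, mu, logvar):
    # Reconstruction term: approximates -E_q[log p_theta(x|z)] (here via MSE).
    recon = F.mse_loss(x_recon, x, reduction="sum")
    # KL[q_phi(z|x) || N(0, I)] in closed form for a diagonal Gaussian encoder.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl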

5.2 Generative Adversarial Network (GAN)

GANs [19] are distinguished by their unique training methodology, which leverages the concept of adversarial learning, setting up a dynamic competition between two distinct neural networks: the generator and the discriminator. The generator network, G, aims to map latent space vectors, drawn from a prior distribution p_{z}(z), to data space, effectively generating new data samples that mimic the distribution of real data, p_{data}(x). In contrast, the discriminator network, D, is trained to distinguish between samples drawn from the real data distribution and those produced by the generator. The competition between networks steadily improves their performance, ultimately enabling the generator to create lifelike outputs. The training of GANs is formulated as a min-max game, which can be formally represented by the following value function V(G,D):

\min_{G}\max_{D}V(D,G)=\mathbb{E}_{x\sim p_{data}(x)}[\log D(x)]+\mathbb{E}_{z\sim p_{z}(z)}[\log(1-D(G(z)))], (2)

where the first term represents the expected log-probability that the discriminator correctly identifies real data samples as real. The second term represents the expected log-probability that the discriminator correctly identifies fake samples (generated by G) as fake. Training GANs involves alternating between optimizing D to maximize V(D,G) for fixed G (improving D’s accuracy in distinguishing real from fake samples) and optimizing G to minimize V(D,G) for fixed D (improving G’s ability to generate realistic samples). This adversarial training process continues until a state of equilibrium is reached where G generates samples indistinguishable from real data by D. The adversarial training mechanism of GANs has proven to be highly effective for generating complex, high-dimensional data.
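A minimal sketch of one alternating update of Equation (2) is shown below (the generator G, discriminator D, and their optimizers are assumed to be defined elsewhere, with D outputting probabilities in (0, 1); the widely used non-saturating generator loss replaces the direct minimization of \log(1-D(G(z)))):

import torch
import torch.nn.functional as F

def gan_training_step(G, D, opt_G, opt_D, real, latent_dim):
    batch = real.size(0)
    ones = torch.ones(batch, 1)
    zeros = torch.zeros(batch, 1)

    # Discriminator update: maximize log D(x) + log(1 - D(G(z))).
    z = torch.randn(batch, latent_dim)
    fake = G(z).detach()  # stop gradients into G during the D update
    d_loss = F.binary_cross_entropy(D(real), ones) + \
             F.binary_cross_entropy(D(fake), zeros)
    opt_D.zero_grad()
    d_loss.backward()
    opt_D.step()

    # Generator update (non-saturating variant): maximize log D(G(z)).
    z = torch.randn(batch, latent_dim)
    g_loss = F.binary_cross_entropy(D(G(z)), ones)
    opt_G.zero_grad()
    g_loss.backward()
    opt_G.step()
    return d_loss.item(), g_loss.item()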

5.3 Diffusion Model (DM)

DMs [21] have emerged as a class of powerful generative models that synthesize data by gradually refining random noise into structured patterns. The fundamental principle behind DMs involves two phases: a forward (noising) phase and a reverse (denoising) phase. In the forward phase, step by step, the model gradually adds random noise to the data until the original information becomes completely obscured. This process is described by a Markov chain that gradually corrupts the original data distribution, x_{0}, into a tractable noise distribution, x_{T}, over T steps. Mathematically, this can be expressed through a sequence of conditional probabilities:

q(x_{t}|x_{t-1})=\mathcal{N}(x_{t};\sqrt{1-\beta_{t}}x_{t-1},\beta_{t}I), (3)

where \beta_{t} represents the variance of the noise added at each step, and I is the identity matrix. The sequence of \beta_{t} values is predefined to control the noise level at each step, ensuring a smooth transition from data to noise. The reverse phase aims to learn the reverse process, modeling the conditional distribution of x_{t-1} given x_{t}, effectively denoising the data. The model, typically parameterized by a neural network, is trained to approximate the reverse conditional probabilities:

p_{\theta}(x_{t-1}|x_{t})=\mathcal{N}(x_{t-1};\mu_{\theta}(x_{t},t),\Sigma_{\theta}(x_{t},t)), (4)

where the parameters \mu_{\theta} and \Sigma_{\theta} are functions learned during training, with \theta representing the model’s parameters. The model learns by minimizing the gap between real and generated data distributions, using either statistical bounds or direct probability optimization. Despite their computational intensity, DMs have emerged as a cornerstone in generative modeling due to their impressive performance in generating high-quality samples.
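The sketch below (assuming a standard DDPM-style setup in which a network \epsilon_{\theta}(x_t, t) predicts the injected noise; the linear schedule and the model interface are illustrative assumptions) shows the closed-form forward noising implied by Equation (3) and the simplified noise-prediction objective commonly used to train the reverse process:

import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # noise schedule beta_t
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative product of (1 - beta_s)

def forward_noising(x0, t):
    # Sample x_t ~ q(x_t | x_0) in closed form (Eq. (3) composed over t steps).
    eps = torch.randn_like(x0)
    a_bar = alphas_bar[t].view(-1, *([1] * (x0.dim() - 1)))
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps, eps

def diffusion_loss(model, x0):
    # Simplified DDPM objective: predict the noise added at a random step t.
    t = torch.randint(0, T, (x0.size(0),))
    x_t, eps = forward_noising(x0, t)
    eps_pred = model(x_t, t)  # epsilon_theta(x_t, t); model interface is assumed
    return torch.mean((eps - eps_pred) ** 2)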

5.4 Large Language Model (LLM)

Large Language Models are AI systems trained on massive text datasets, using billions of parameters to learn language patterns. Their deep understanding of language enables human-like text comprehension and generation, transforming how computers process natural language. The advent of Transformer architecture, developed by Vaswani et al. in 2017 [22], marked a significant breakthrough in language modeling.

The training objective of LLMs can generally be described as learning a conditional probability distribution over sequences. Given an input sequence X=(x_{1},x_{2},\dots,x_{T}), the model maximizes the likelihood of generating each token based on prior tokens, represented as:

p(X)=\prod_{t=1}^{T}p(x_{t}|x_{<t}) (5)

where p(x_{t}|x_{<t}) is the probability of token x_{t} given its context x_{<t}. A key feature of this process is the attention mechanism, which allows the model to dynamically focus on different parts of the context at each step. In self-attention, the probability of generating each token is influenced by a weighted sum of the context, with attention weights \alpha_{tj} computed as:

\alpha_{tj}=\frac{\exp(\frac{q_{t}\cdot k_{j}}{\sqrt{d_{k}}})}{\sum_{k=1}^{T}\exp(\frac{q_{t}\cdot k_{k}}{\sqrt{d_{k}}})} (6)

and the new token representation z_{t} is given by:

z_{t}=\sum_{j=1}^{T}\alpha_{tj}v_{j} (7)

By analyzing relationships between tokens, the attention system helps the model understand context and produce coherent, relevant text.
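Equations (6) and (7) correspond to standard scaled dot-product attention; a compact single-head sketch (without the causal mask that autoregressive LLMs add to restrict attention to x_{<t}) is given below:

import torch
import torch.nn.functional as F

def self_attention(Q, K, V):
    # Q, K, V: tensors of shape (sequence_length, d_k).
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / (d_k ** 0.5)  # q_t . k_j / sqrt(d_k)
    alpha = F.softmax(scores, dim=-1)                # attention weights alpha_tj (Eq. (6))
    return alpha @ V                                 # z_t = sum_j alpha_tj v_j (Eq. (7))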

So far, several LLMs have gained prominence, including the Generative Pre-trained Transformer (GPT) [63] series, Bidirectional Encoder Representations from Transformers (BERT) [64], eXtreme Language Understanding Network (XLNet) [65], and Text-to-Text Transfer Transformer (T5) [66]. These models have been pre-trained on vast amounts of text data and can be effectively fine-tuned for specific tasks, such as language translation, sentiment analysis, and text generation [67, 68].

5.5 Sequence-to-Sequence (Seq2Seq) Model

Seq2Seq model [25] is a neural network architecture designed for tasks involving sequential input-output pairs. It follows an encoder-decoder structure, where each component is typically implemented with recurrent neural networks (RNNs) or Transformers in later models.

One defining feature of Seq2Seq models is their focus on sequential data, which differentiates them from models like GANs or VAEs that do not inherently account for sequential dependencies. Unlike general language models, Seq2Seq models specialize in transforming one sequence into another, effectively capturing both immediate and distant patterns in the data.

In the Seq2Seq model, the encoder processes the input sequence (x_{1},x_{2},\dots,x_{T}) to produce a context vector c, often represented by the encoder’s final hidden state:

c=h_{T} (8)

where h_{T} encapsulates the input sequence’s essential information. The decoder then uses this context vector to generate the output sequence (y_{1},y_{2},\dots,y_{T^{\prime}}) one element at a time, conditioned on c and previously generated outputs:

s_{t}=g(y_{t-1},s_{t-1},c) (9)
p(y_{t}|y_{<t},x)=\text{softmax}(Ws_{t}) (10)

where s_{t} is the hidden state at time t, and W is a weight matrix for calculating the output distribution.

Seq2Seq models are effective for handling variable-length input and output sequences, making them well-suited for applications like translation, summarization, and question answering, where coherent sequence transformation is required. The differences between Seq2Seq models and other generative models are shown in Table III.
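A minimal GRU-based encoder-decoder sketch of Equations (8)-(10) is shown below (no attention; layer sizes, names, and the use of teacher forcing are illustrative assumptions):

import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)  # plays the role of W in Eq. (10)

    def forward(self, src, tgt):
        # Encoder: the final hidden state h_T serves as the context vector c (Eq. (8)).
        _, c = self.encoder(self.embed(src))
        # Decoder: s_t = g(y_{t-1}, s_{t-1}, c); here c only initializes the decoder
        # state, a common simplification of Eq. (9) (teacher forcing on tgt).
        dec_states, _ = self.decoder(self.embed(tgt), c)
        # Output distribution p(y_t | y_<t, x) = softmax(W s_t) (Eq. (10)).
        return torch.log_softmax(self.out(dec_states), dim=-1)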

TABLE III: Relationship Between Seq2Seq and Other Generative Models.
Model Type Belongs to Seq2Seq? Description
LLMs Partially belongs Some LLMs (e.g., T5, BART) are implementations of Seq2Seq, but LLMs cover a broader range.
AEs Does not belong Similar to Seq2Seq in architecture, but with different goals and tasks.
GANs Does not belong Completely different model type with unrelated goals and structures.
DMs Does not belong Entirely different generative models, unrelated to Seq2Seq.

6 Databases

The performance of human emotion synthesis tasks based on generative models is closely tied to the quality and richness of the utilized datasets. To be specific, the diversity and scope of the datasets play a crucial role in the model’s ability to generalize across various emotional states, cultural contexts, and individual differences. The structure, annotation scheme, and inherent biases of emotion databases directly shape how researchers design emotion synthesis models, influencing the choice of model architecture, loss functions, and training strategies. Furthermore, the size and quality of the datasets affect the choice between end-to-end learning approaches and modular architectures: large, high-quality datasets enable end-to-end learning of emotion synthesis, while smaller or noisier datasets might necessitate the use of pre-trained components or transfer learning techniques. Based on these emotional datasets of different modalities, such as facial images, speech, and text, the designed models can imitate human emotional expressions with high precision from different aspects. Table IV summarizes the common datasets used in the field of human emotion synthesis, providing a comprehensive overview of the available resources for researchers and practitioners in this domain.

TABLE IV: Common Datasets for Human Emotion Synthesis with Generative Models.
Database Year Modalities Samples Subjects Category
Oulu-CASIA [69] 2011 visual 480 sequences 80 happiness, surprise, sadness, anger, fear, disgust
RaFD [70] 2010 visual 8040 images 49 sad, neutral, angry, contemptuous, disgusted, surprised, fearful, happy
CK+ [71] 2010 visual 593 images 123 happiness, sadness, surprise, fear, anger, disgust, neutral, contempt
CFEE [72] 2014 visual 229 images 230 happiness, surprise, sadness, anger, fear, disgust
AffectNet [73] 2017 visual 450,000 images / happiness, sadness, surprise, fear, anger, disgust, neutral, contempt
DISFA [74] 2013 visual 130,788 images 27 continuous annotation of graded changes in spontaneous facial expression of emotion
EmotioNet [75] 2016 visual 1,000,000 images / 23 basic or compound emotion categories (happy, sad, fearful, angrily surprised, sadly angry, etc.)
ESD [76] 2021 audio 350 utterances 20 happy, sad, neutral, angry, surprise
EmoV-DB [77] 2018 audio 7590 utterances 5 neutral, amused, angry, sleepy, disgust
Emo-DB [78] 2005 audio 800 sentences 10 neutral, anger, fear, joy, sadness, disgust, boredom
MEmoSD [79] 2019 audio / 4 angry, happy, neutral, sad
CaFE [80] 2018 audio 936 samples 12 neutral, sadness, happiness, anger, fear, disgust, surprise
KES [81] 2019 audio 21,000 speeches 1 neutral, happy, sad, angry, surprised, fearful, disgusted
ETOD [81] 2019 audio 6000 speeches 13 neutral, happy, sad, angry
YELP review [26] / text 6,990,280 reviews / positive, negative
AMAZON review [82] / text / / positive, negative
EmpatheticDialogue [27] 2019 text 24,850 conversations 810 32 emotion labels (surprised, excited, angry, proud, sad, annoyed, grateful, etc.)
MojiTalk [83] 2018 text 662,159 conversations / 64 emoji labels
MEAD [84] 2020 visual + audio 281,400 clips 60 angry, disgust, contempt, fear, happy, sad, surprise, neutral
CREMA-D [85] 2014 visual + audio 7442 utterances 91 happiness, surprise, sadness, anger, fear, disgust
EmoVoxCeleb [86] 2018 visual + audio 153,500 tracks 1251 neutral, happiness, surprise, sadness, anger, disgust, fear, contempt
SAVEE [87] 2008 visual + audio 480 utterances 4 neutral, anger, disgust, fear, happiness, sadness, surprise
RAVDESS [88] 2018 visual + audio 7356 videos 24 happiness, sadness, surprise, fear, anger, disgust, neutral, contempt
IEMOCAP [89] 2008 visual + audio + text 10,039 samples 10 categorical and continuous annotations

7 Facial Emotion Synthesis

Facial emotion synthesis is a crucial research field within human emotion synthesis, aiming to generate faces that express specified emotions. This technology holds significant academic value in computer graphics and computer vision, while also demonstrating great promise for applications in virtual reality (VR), gaming, and interactive computer systems. Based on existing works, we can broadly categorize facial emotion synthesis into three main approaches: face reenactment (Section 7.1), face manipulation (Section 7.2), and talking head generation (Section 7.3). The related works are illustrated in Table V.

7.1 Face Reenactment

Face reenactment focuses on transferring facial expressions from a source actor to a target face, preserving the identity of the target while adopting the emotional expressions of the source. This technique is particularly useful in applications like film dubbing, virtual avatars, and privacy-preserving video conferencing.

In facial reenactment, there are some tasks that emphasize emotional attributes in the face. For example, in [90], Tripathy et al. introduced ICface, a GAN-based face animator that manipulated facial expressions in a given image. The animation process was guided by interpretable control signals, such as head pose angles and Action Units (AU) values, which were derived from various sources, allowing for selective emotion transfer. Zeng et al. [91] proposed DAE-GAN, which employed two deforming autoencoders to separate identity and pose in unlabeled videos, reducing the need for manual annotation. It realized emotional transfer between different identities with varied poses using conditional generation and disentangled features. Strizhkova et al. [92] proposed a novel method for emotion editing in head reenactment videos by manipulating the latent space of a pre-trained GAN. This technique disentangled emotion, identity, and pose within the latent space, allowing for the direct modification of emotions in the reenactment videos without affecting the person’s identity or the speech-related facial expressions. Groth et al. [93] designed a new method to achieve emotion mapping by generating correctly recognized expressions, using video reenactments to influence the intensity of the emotion. In [94], Ali et al. utilized two encoders to separately capture expression from a source and identity from a target image, merging these features to create expressive images, enhanced by innovative consistency losses for both expression and identity features. In Fig. 5, Xue et al. presented LSGAN [95], employing a transformative generator that combined target expression labels with specific facial region features to produce clear and distinct facial expressions in images. Shao et al. [96] utilized dual parallel generators and wavelet-based discriminators for facial expression translation, enhancing realism by focusing on key areas with an attention mechanism and capturing expression details across scales without the bidirectional translation interference seen in single-generator models.

Refer to caption

Figure 5: A mask-based GAN [95] for face reenactment. The system included four main components. A Semantic Mask Generator (SMG) produced masks for specific facial regions (eyes, mouth, cheeks). Then these masks were encoded into latent codes through an Adversarial Autoencoder (AAE). A Transformative Generator (TG) used these codes along with target expression labels to generate new facial expressions, with an AU-intensity Discriminator (AUD) that assessed their quality and intensity.

7.2 Face Manipulation

Face manipulation involves editing specific attributes of a face, such as changing expressions, age, hairstyles, etc., to generate different versions of the same person. It can be seen that this has a completely different goal compared to face reenactment, which transfers the expressions and movements of a source face to a target face. Researchers in this field focus on the alteration of specific facial attributes while preserving the remaining attributes unchanged, under the condition of explicit predefined facial information, thereby effecting changes in emotional expression. Works in this field are mainly based on GAN and its variants [97, 98, 99, 100, 101, 102, 103, 104, 30, 105, 106, 20]. For instance, in [30], Xie et al. proposed a novel approach, using GAN equipped with consistency preservation and feature entropy regularization techniques. This innovation achieved improved results in attribute translation, particularly in preserving consistency and reducing feature entropy. Song et al. [107] utilized facial geometry to guide expression creation but required a neutral face image, making the process more complex. Their method consisted of two GANs for changing and removing expressions. In [108], Kong et al. proposed a dual-path GAN for emotion synthesis and introduced a learning strategy based on a separation discriminator to train it more efficiently. In [109], Tang et al. introduced structured latent space and perceptual loss to achieve fine-grained expression manipulation while preserving identity and global facial shape. Patashnik et al. [110] innovatively utilized text-driven manipulation techniques, in conjunction with the extraordinary visual concept encoding abilities of CLIP and StyleGAN, to specify desired attributes like different emotional facial expressions. Liu et al. [111] first proposed a novel general framework, EvoGAN, that combined an evolutionary algorithm (EA) and GAN to work as a whole, which generated face images with more compound expressions. In [112], Tang et al. proposed a novel ECGAN for generating faces with different emotions based on the input expression attribute vector. In [113], Sola et al. used ECGAN, a novel expression-conditioned GAN, to incorporate expression information into unmasking processes, resulting in improved naturalness and expression fidelity of generated faces. In [114], Lindt et al. updated the Conditional Adversarial Autoencoder (CAAE) [115] framework to manipulate facial expressions in images based on continuous two-dimensional emotion labels, representing valence and arousal.
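Many of the works above share a common pattern of conditioning generation on an expression or emotion label. The following generic sketch (not the architecture of any specific paper cited here; layer sizes, names, and the emotion index are illustrative) shows how a one-hot emotion label can be concatenated with a latent vector to condition a GAN generator:

import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    # Generic label-conditioned generator: a latent vector z and a one-hot
    # emotion label are concatenated and mapped to an image-shaped output.
    def __init__(self, latent_dim=100, num_emotions=8, img_pixels=64 * 64 * 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + num_emotions, 512),
            nn.ReLU(),
            nn.Linear(512, img_pixels),
            nn.Tanh(),  # outputs in [-1, 1], matching normalized images
        )

    def forward(self, z, emotion_onehot):
        return self.net(torch.cat([z, emotion_onehot], dim=1))

# Usage: generate a batch of 4 faces conditioned on emotion index 3 (illustrative).
G = ConditionalGenerator()
z = torch.randn(4, 100)
labels = torch.zeros(4, 8)
labels[:, 3] = 1.0
fake_images = G(z, labels)  # shape: (4, 64*64*3), reshaped to images downstream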

Refer to caption

Figure 6: A DM-based talking head generation model from [23]. This architecture started by extracting conditions from multiple modalities: audio, visual, and textual inputs. Specifically, the emotion intensity block used a pretrained CLIP text encoder to convert text prompts into embeddings that represented the underlying emotional content. These embeddings, reflecting nuanced emotional states and their strengths, were then integrated into the DM’s denoising process.

7.3 Talking Head Generation

Talking head generation aims to create realistic, animated facial models that can speak and emote based on input audio or text. This approach is particularly valuable in creating virtual assistants, digital newscasters, and personalized content for educational or entertainment purposes.

A series of works that incorporate emotion synthesis into talking head generation have been explored. For example, Eskimez et al. [116] presented a novel system for generating talking faces, achieving independent control of emotional expressions by disregarding the emotions expressed in the input speech audio and instead conditioning the face generation on an independent categorical emotion variable. In [117], Vougioukas et al. developed a specialized GAN that uses three discriminators to create detailed facial expressions that match a speaker’s emotional state. As illustrated in Fig. 6, Zhang et al. presented EmoTalker [23], a framework for emotionally editable talking face generation, utilizing a diffusion model and a novel Emotion Intensity Block, and integrating a custom dataset to enhance emotion interpretation. In [118], Zeng et al. presented ET-GAN, an end-to-end system for generating talking faces with tailored expressions from guiding videos, identity images, and arbitrary audio, utilizing multiple encoders for identity, expression, and audio-lip synchronization, alongside advanced frame and spatial-temporal discriminators. Gan et al. [119] generated synchronized emotion coefficients and emotion-driven facial images through flow-based and vector-quantized models, while using lightweight emotion prompts or CLIP supervision for model control and adaptation. In [120], Tan et al. used the regularized flow model to generate emotion-related expression coefficients, mapped the input emotion coefficients to the latent space, and generated diverse expression dynamics based on audio and emotion labels, while achieving high-fidelity restoration of emotional details through a vector quantization image generator. In another work [121], they used orthogonal basis decomposition to decompose facial dynamics into independent latent spaces of lip shape, head posture, and emotional expression, allowing the model to adjust the relevant expression coefficients in each independent latent space.

TABLE V: Literature on Generative Models for Face Reenactment, Face Manipulation, and Talking Head Generation, in Facial Emotion Synthesis.

Reference Year Model Dataset Performance
Ali et al. [94] 2019 TER-GAN Oulu-CASIA User study
Shao et al. [96] 2021 WP2-GAN RaFD/CFEE Accuracy: 89.47, 87.97/FID: 41.74, 24.91/SSIM: 0.6818, 0.6659
Tripathy et al. [90] 2020 GAN VoxCeleb Image Quality Assessment scores: 25.02, 33.08
Zeng et al. [91] 2020 DAE-GAN VoxCeleb1/RaFD SSIM: 0.65, 0.73/FID: 60.8, 13.8/User Study: 0.61
Strizhkova et al. [92] 2021 GAN MUG/MEAD ECS: 0.88, 0.58/FID: 19.7, 25.5/ACD: 0.10, 0.13
Groth et al. [93] 2020 Encoder-Decoder MoCap Ratings figure for perceived intensity and sincerity
Xue et al. [95] 2024 LSGAN RaFD/DISFA FID: 31.121/PSNR: 23.417/SSIM: 0.858
Hu et al. [122] 2023 2CET-GAN CFEE/RaFD FID: 31.1/IS: 1.55/Expression Transfer Score: 3.37
Choi et al. [123] 2018 StarGAN CelebA/RaFD Classification error: 2.12%
Liu et al. [111] 2022 EvoGAN EmotioNet/RaFD User study
Xie et al. [30] 2023 GAN FFHQ/RaFD/LFW FID: 8.32-16.05/Accuracy: 97.42-98.61
Tang et al. [109] 2020 EGGAN RaFD/MMI MSE: 0.24/PCC: 0.66
Patashnik et al. [110] 2021 StyleGAN FFHQ User study
Sola et al. [113] 2023 ECGAN RAFDB Accuracy: 78.54/SSIM: 0.52-0.64/FID: 244.75-668.68/Perceptual Loss: 1.21-1.46
Zhu et al. [124] 2019 UGAN CFEE FID: 44.8
Ding et al. [125] 2018 ExprGAN Oulu-CASIA User study
Tesei et al. [126] 2019 CCycleGAN FER2013 FID figure
Tang et al. [112] 2019 ECGAN AR/Yale/JAFFE/FERG/3DFE AMT Score: 35.32/VGG Score: 78.13, 80.32
Wang et al. [127] 2019 Comp-GAN CelebA/F^{2}ED Accuracy figure
Lindt et al. [114] 2019 CAAE AffectNet RMSE: 0.528, 0.607/SAGR: 0.567, 0.483/CCC: 0.312, 0.210
Kong et al. [108] 2021 DualPathGAN EmoVoxCeleb/VoxCeleb2 PSNR: 32.56/SSIM: 0.953/FID: 9.20
Doukas et al. [97] 2021 HeadGAN VoxCeleb FID: 50.9/FCD: 334/CSIM: 0.716
Zhao et al. [98] 2021 PAttGAN DISFA/DISFACat ICC: 0.89, 0.88/MAE: 0.24, 0.28/MSE: 0.22, 0.29/FID: 27.32, 21.77
Xia et al. [99] 2021 LGP-GAN RaFD IS: 1.31/FID: 11.88
Wang et al. [100] 2019 GAN EmotioNet/RaFD User study
Akram et al. [101] 2023 SARGAN KDEF/CFEE/RaFD ACD: 0.306/FVS: 94.18±1.11/FID: 53.95
Zhang et al. [102] 2021 SwitchGAN RaFD FID: 17.4/Accuracy: 98.32
Akram et al. [103] 2024 US-GAN KDEF/RaFD/CFEE ACD: 0.2832/FVS: 94.19±1.11/User study: 43% improvement (expression plausibility)
Azari et al. [104] 2024 StyleGAN CelebA/FFHQ LPIPS: 0.07/FID: 7.86/ID: 0.88
Apolito et al. [105] 2021 GAN AffectNet Smoothness score: 0.33-0.38/ERE: 0.020, 0.018/FED: 0.71
Peng et al. [106] 2019 ApprGAN Bosphorus/CK+/MUG Correlation coefficients: 0.972, 0.953, 0.975/Normalised distances: 2.061, 1.879, 1.835
Pumarola et al. [20] 2018 GAN EmotioNet/RaFD User study
Song et al. [107] 2018 G2-GAN CK+/Oulu-CASIA User study
Xu et al. [128] 2024 StyleGAN MEAD/RAVDESS Realism: 0.40/Emotion similarity: 0.36/Mouth shape similarity: 0.43
Eskimez et al. [116] 2021 GAN CREMA-D PSNR: 30.91/SSIM: 0.85/Accuracy: 55.3
Zhang et al. [23] 2024 Diffusion model MEAD/CREMA-D/FED CSIMD: 0.67, 0.51/Accuracy: 84.76, 75.13
Vougioukas et al. [117] 2020 GAN CREMA-D PSNR: 23.565/SSIM: 0.70/ACD: 1.40\cdot 10^{-4}
Tan et al. [129] 2023 Encoder-Decoder CFD/MEAD/CREMA-D Accuracy: 65.20/PSNR: 29.38, 30.03/SSIM: 0.66, 0.68/M-LMD: 2.78, 3.03/F-LMD: 2.87, 3.16
Tan et al. [130] 2024 Encoder-Decoder MEAD SSIM: 0.795/FID: 23.207/M-LMD: 3.317/F-LMD: 2.696
Zhai et al. [131] 2023 Seq2Seq MEAD/CREMA-D SSIM: 0.82, 0.79/PSNR: 30.29, 30.55/M-LMD: 2.14, 1.16/LMD: 2.44, 1.46/FID: 19.59, 42.53/EP: 80.52, 85.64
Sheng et al. [132] 2023 DVAE MEAD/CREMA-D SSIM: 0.79, 0.92/LPIPS: 0.07, 0.08/LMD: 1.83, 1.34/LVD: 1.56, 0.84/CSIM: 0.88, 0.87/Emoacc: 0.87, 0.84
Gan et al. [119] 2023 Encoder-Decoder VoxCeleb2/MEAD PSNR: 21.75/SSIM: 0.68/FID: 19.69/M-LMD: 2.25/F-LMD: 2.47/Accuracy: 75.43
Tan et al. [120] 2024 Diffusion model HDTF/MEAD SSIM: 0.708, 0.689/FID: 15.165, 16.553/M-LMD: 1.643, 1.939/F-LMD: 1.958, 2.061/Accuracy: 71.53
Tan et al. [121] 2024 Encoder-Decoder HDTF/MEAD PSNR: 26.504, 22.771/SSIM: 0.845, 0.769/FID: 13.172, 15.548/M-LMD: 1.197, 1.102/F-LMD: 1.111, 1.060/Accuracy: 68.85
Ma et al. [133] 2023 Diffusion model MEAD/HDTF/Voxceleb2 SSIM: 0.86, 0.85, 0.69/CPBD: 0.16, 0.31, 0.30/F-LMD: 1.93, 1.80, 2.69/M-LMD: 2.91, 2.15, 2.72
Zhang et al. [134] 2023 Diffusion model MEAD/HDTF LPIPS: 0.169, 0.176/CPBD: 0.299, 0.280/F-LMD: 3.845, 3.948
Zeng et al. [118] 2020 ET-GAN CREMA-D/GRID PSNR: 23.981, 25.771/SSIM: 0.733, 0.810/FID: 76.92, 61.33
Sun et al. [135] 2023 StyleGAN MEAD/RAVDESS LIE: 0.739/CPBD: 0.247, 0.166/FED: 9.40/FID: 30.36/ID: 0.931/LSE-C: 7.11/LSE-D: 8.03
Wang et al. [136] 2024 Diffusion model MEAD/CREMA-D PSNR: 32.6131, 34.3401/CPBD: 0.3803, 0.5180/Emo-Acc: 74.57

8 Speech Emotion Synthesis

Speech emotion synthesis is a critical research field within human emotion synthesis, focusing on the manipulation and generation of acoustic features in speech signals to alter or create specific emotional states. This area of study aims to modify key vocal parameters such as volume, intonation, pitch, speaking rate, and timbre to effectively convey a desired emotional state in synthesized speech. Based on existing works, we divide it into voice conversion (Section 8.1), text-to-speech (Section 8.2), and speech manipulation (Section 8.3). Table VI summarizes the surveyed papers on speech emotion synthesis.

8.1 Voice Conversion

Voice conversion technology modifies a speaker’s voice to express different emotions while keeping the original words intact. This enables emotional dubbing in films and games, and helps create more natural-sounding AI assistants.

For instance, in [76], Zhou et al. leveraged VAW-GAN and deep emotional features from speech emotion recognition to describe emotional prosody. By using adjustable emotional features to guide the decoder, this approach could transform speech to express both familiar and new emotions. In [137], they combined VAW-GAN and Continuous Wavelet Transform (CWT) for advanced spectrum and prosody conversion. By incorporating CWT-based F0 modeling, the system uniquely enhanced the granularity of prosody representation. In [138], they focused on disentangling and re-composing emotional elements in speech, innovatively employing CWT for detailed prosody analysis and integrating F0 conditioning in the decoder to enhance emotion conversion performance. Similarly, in [139], they proposed a CycleGAN-based model [140] that did not require parallel data. It utilized CWT to analyze F0 on multiple scales, enabling detailed prosody modification. In [141], Fu et al. also presented an improved CycleGAN-based model that incorporated a transformer to augment temporal dependencies and integrated curriculum learning and a fine-grained level discriminator, enhancing the model’s ability to capture and convert emotional nuances in speech more effectively. Rizos et al. [142] utilized class-conditional GAN and an auxiliary domain classifier to generate emotional speech samples. In [143], Elgaar et al. introduced a Factorized Hierarchical Variational Autoencoder (FHVAE) for multi-speaker and multi-domain emotional voice conversion, enhancing disentangled representation and emotion conversion quality through novel algorithms and loss functions. Gao et al. [144] used style transfer autoencoders for emotional voice conversion without parallel data, leveraging disentangled representation learning to modify emotion-related characteristics. In Fig. 7, Chen et al. introduced a Tacotron2-based framework using emotion disentangling modules [145] to achieve cross-speaker emotion transfer by separating speaker identity from emotion. In [146], Oh et al. proposed the DurFlex-EVC model, integrating a style autoencoder and unit aligner for advanced control and flexibility. It leveraged HuBERT features, denoising diffusion models, and self-supervised learning to modify emotional tones while maintaining linguistic content and unique vocal traits. Moreover, there were also some works related to emotional voice conversion based on StarGAN [147, 148, 149], AE [150], and Seq2Seq [151, 152, 153].

Refer to caption

Figure 7: A two-stage model [145] for voice conversion. In the first stage, the method separated emotional and content features from speech using an attention-based mechanism, incorporating inter-speech relation to model emotional strength. The second stage employed a conversion adaptation strategy, leveraging a multi-view consistency mechanism to ensure that the emotional nuances were accurately transformed while the core speech content remained intact. This framework facilitated precise control over the emotional output of synthesized speech.

8.2 Text-to-Speech (TTS)

TTS aims to generate natural-sounding speech based on the semantic content and context of the input text. This technology is particularly crucial for enabling more engaging and human-like interactions with virtual assistants, audiobook narrations, and personalized content delivery. Some researchers have attempted to incorporate emotional factors into TTS systems to synthesize speech with emotional expressiveness.

For example, in [154], Lei et al. introduced a unified model for fine-grained emotional speech synthesis, obtaining fine-grained emotion expressions with emotion descriptors or phoneme-level manual labels. Li et al. [155] developed DiCLET-TTS, which improves emotional speech synthesis by combining emotion separation techniques with advanced probability-based decoding. In [156], Tits et al. explored adapting a Deep Convolutional TTS (DCTTS) model to various emotions using minimal emotional data. In [157], Schnell et al. leveraged WaveNet and emotion intensity extraction using attention LSTM and transformer models, demonstrating increased perceived emotion accuracy. Wu et al. [158] synthesized emotional speech with limited labeled data, achieving comparable performance to fully supervised models. In [159], Um et al. employed an inter-to-intra distance ratio algorithm and an effective interpolation technique to achieve nuanced emotion intensity control. In [81], Im et al. proposed EmoQ-TTS, a system that synthesized expressive emotional speech by conditioning phoneme-wise emotion information with fine-grained emotion intensity, using intensity pseudo-labels generated via distance-based intensity quantization. Hortal et al. [160] combined Tacotron 2 with GANs to modulate prosody, allowing customization of inferred speech with specified emotions. Guo et al. [161] presented "EmoDiff," a DM-based model that enabled intensity-controllable emotional speech synthesis using a soft-label guidance technique. Li et al. [162] utilized the Tacotron framework, enhanced with emotion classifiers and style loss, to generate expressive, controllable emotional speech efficiently. In [163], Lei et al. introduced a novel method integrating global-level, utterance-level, and local-level modules to achieve precise emotion modeling and transfer, allowing for versatile emotional expression in synthesized speech. In Fig. 8, Li et al. created a system [164] that extracts emotion patterns independent of the speaker’s voice, allowing emotions to be transferred between speakers and adjusted in intensity. In [165], Li et al. developed a technique that analyzes speech style at different scales, capturing both broad and fine details to produce more controllable and expressive synthetic voices. In [166], Tang et al. introduced mix methods, enabling the manual combination of noise at runtime to produce diverse emotional mixtures, which was validated through evaluations demonstrating its capability to generate speech with various mixed emotions.

Refer to caption

Figure 8: A Tacotron2-based TTS model from [164]. This system integrated an Emotion Disentangling Module (EDM) that utilized dual encoders to separate emotion-related features from speaker identity, refining the emotion embedding for clearer expression without speaker leakage. An identity controller maintained the target speaker’s identity by incorporating speaker-specific information, enabling the synthesis of emotionally expressive speech consistent with the target speaker’s vocal characteristics.

8.3 Speech Manipulation

Speech manipulation focuses on modifying various aspects of speech signals, such as the speaker’s identity or linguistic content. It shares similarities with face manipulation in Section 7.2, as both aim to alter or control specific attributes of human-generated data, be it facial features or speech characteristics. Recently, some researchers have begun exploring how to manipulate the emotional attributes of speech.

For instance, in [79], Jia et al. presented ET-GAN, an innovative cross-language emotion transfer system for speech synthesis that built on CycleGAN and did not require parallel training data, achieving significant improvements in the emotional accuracy and naturalness of synthetic speech. In [167], Matsumoto et al. utilized WaveNet and auxiliary features such as voiced, unvoiced, and silence flags to generate speech-like emotional sounds without linguistic content, enhancing control over emotional expressiveness. In [168], Wang et al. presented Emo-CampNet, a text-based speech editing model that integrated emotion attributes and a context-aware mask prediction network, employing generative adversarial training and data augmentation to enhance emotional expressiveness and speaker variability in edited speech. In [36], Inoue et al. introduced a novel emotion editing technique in speech synthesis utilizing a hierarchical emotion distribution extractor within the FastSpeech2 framework [169], enabling fine-grained, quantitative emotion control at the phoneme, word, and utterance levels for dynamic and nuanced speech generation.
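To illustrate the general idea of a hierarchical emotion representation at phoneme, word, and utterance levels, the following minimal sketch average-pools frame-level emotion probabilities over given spans. The span format and the mean pooling are illustrative assumptions, not the extractor used in [36].

```python
import numpy as np

def hierarchical_emotion_distribution(frame_probs, phoneme_spans, word_spans):
    """Average-pool frame-level emotion probabilities into phoneme-, word-, and
    utterance-level distributions (spans are (start, end) frame indices; this
    pooling scheme is an assumption for illustration)."""
    pool = lambda spans: np.stack([frame_probs[s:e].mean(axis=0) for s, e in spans])
    phoneme_level = pool(phoneme_spans)
    word_level = pool(word_spans)
    utterance_level = frame_probs.mean(axis=0)
    return phoneme_level, word_level, utterance_level

# Toy usage: 20 frames, 4 emotion classes, 5 phonemes grouped into 2 words.
rng = np.random.default_rng(1)
probs = rng.dirichlet(np.ones(4), size=20)
ph, wd, utt = hierarchical_emotion_distribution(
    probs, [(0, 4), (4, 8), (8, 12), (12, 16), (16, 20)], [(0, 8), (8, 20)])
print(ph.shape, wd.shape, utt.shape)  # (5, 4) (2, 4) (4,)
```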

TABLE VI: Literature on Generative Models for Voice Conversion, Text-to-Speech, and Speech Manipulation in Speech Emotion Synthesis.

Reference Year Model Dataset Performance
Zhou et al. [76] 2021 VAW-GAN ESD/IEMOCAP MCD: 4.127-4.916/MOS: 3.24, 2.94, 3.15/Preference test
Fu et al. [141] 2022 CycleGAN Japanese emotional speech dataset/ESD MOS figures/MCD: 5.65, 16.38/F0 RMSE: 44.91, 105.99
Zhou et al. [138] 2021 VAW-GAN EmoV-DB MCD: 4.085, 4.278/LSD: 5.681, 6.106/F0 RMSE: 61.712, 52.348/PCC: 0.893, 0.867/Preference test
Zhou et al. [139] 2020 CycleGAN Emotional speech corpus MCD: 10.23, 8.71/F0 RMSE: 65.05, 63.03/PCC: 0.76, 0.76/Preference test
Rizos et al. [142] 2020 StarGAN IEMOCAP Spectrogram representation/Human evaluation score
Zhou et al. [137] 2020 VAW-GAN EmoV-DB/English emotional speech corpus/JL-Corpus MCD: 4.439, 4.683/LSD: 6.161, 6.275/PCC: 0.776, 0.691/MOS: 2.808
Gao et al. [144] 2018 AE IEMOCAP Emotion conversion MOS: 48%/Speaker similarity MOS: 3.55
Zhou et al. [151] 2022 Seq2Seq VCTK/ESD MOS figures/MCD: 4.13, 4.15, 4.25/DDUR: 0.24, 0.17, 0.31
Oh et al. [146] 2024 AE Diffusion model ESD nMOS: 3.70/sMOS: 3.63/eMOC: 72.97/UTMOS: 3.58/ECA: 91.58/SECS: 74.83
Elgaar et al. [143] 2020 VAE Collected dataset/IEMOCAP Subjective evaluation figures
Kreuk et al. [152] 2022 Seq2Seq VCTK/EmoV MOS figures/eMOC figures
Du et al. [150] 2021 VAE ESD MCD figures/F0 RMSE figures/SV accuracy figures/MOS: 3.53, 3.58, 3.74/Preference test
Du et al. [147] 2021 StarGAN ESD MCD figures/MOS: 3.044/Preference test
Chen et al. [145] 2023 AE ESD MCD: 4.596/ACC: 0.830/RMSE: 0.117/Emotion similarity: 76.12%
Meftah et al. [148] 2023 StarGAN ESD MCD, F0 RMSE figures/Waveforms figures/Confusion matrices/Spectrograms figures
Shah et al. [149] 2023 StarGAN Hindi Emotional Speech Database MOS: 4.21±0.01/Similarity: 0.76/EmoAcc: 41.50/Preference test
Choi et al. [153] 2021 Seq2Seq Plain-to-emotional dataset MOS: 4.14/MCD: 5.778, 16.055/Emotion confusion matrices/Emotion similarity figures
Qi et al. [170] 2024 CVAE ESD MCD: 4.06/RMSE: 38.28/DDUR: 0.21/MSD: 19.61, 20.55, 21.54, 22.24
Zhang et al. [18] 2023 AE Multi-S60-E3/Child-S1-E6/VA-S2/Read-S40 MOS: 4.09±0.03, 3.76±0.05, 4.10±0.03 (Emotion Similarity, Speaker Similarity, Voice Quality)
Lei et al. [154] 2021 Seq2Seq Internal corpus MCD: 4.91, 5.03/F0 trajectories
Li et al. [155] 2023 Diffusion model 5 female monolingual speakers Emotion similarity DMOS: 3.90-4.04/Cosine Similarity: 0.25, 0.73
Tits et al. [156] 2020 Seq2Seq EmoV-DB MOS: 2.00, 2.10, 2.27, 3.59, 3.29 (amused, angry, disgusted, neutral, sleepy)
Schnell et al. [157] 2021 Seq2Seq SAVEE/IEMOCAP/WSJCAM0 Accuracy: 35.5, 28.9 (total, emo)
Wu et al. [158] 2019 Seq2Seq Emotional speech corpus MCD: 2.64/F0 RMSE: 64.4/V/UV: 8.26/FFE: 21.81
Um et al. [159] 2020 Seq2Seq Korean male voice database MOS: 3.90±0.54/Recognition accuracy/Preference test
Im et al. [81] 2022 Seq2Seq KES/ETOD MOS: 3.72, 3.95/MCD: 4.81, 2.94/F0 RMSE: 53.15, 30.61/EmoAcc: 99.39, 86.85
Hortal et al. [160] 2021 Seq2Seq GAN LJ Speech/VESUS/CREMA-D/RAVDESS Accuracy
Guo et al. [161] 2023 Diffusion model ESD MOS: 4.13±0.10/MCD: 5.94/Classification accuracy/Preference test
Kang et al. [171] 2023 Diffusion model Multi-emotional dataset/ESD/LibriTTS Test ECA: 51.59, 38.89, 39.86, 32.57/MOS: 3.44, 3.31/SMOS: 3.22
Zhu et al. [172] 2019 Seq2Seq Emotional speech corpus Mel spectrograms/Pitch/PCA ordination diagram trajectories
Tang et al. [166] 2023 Diffusion model IEMOCAP/ESD MOS: 4.10, 3.92/SMOS: 4.02, 3.82/MCD: 5.29, 5.65
Lei et al. [173] 2022 GAN Internal corpus Emotion MOS: 4.04, 4.01, 3.86/Pearson Correlation: 0.776, 0.790, 0.759, 0.794/F0 curves
Lei et al. [163] 2022 Seq2Seq Emotional speech corpus MCD: 3.63/MOS: 4.02±0.119/CMOS: 0.520, 0.342/Preference: 58.1, 51.2/F0 curves
Li et al. [162] 2021 Seq2Seq Emotional speech corpus Preference test/Strength confusion matrices/Pitch trajectories
Li et al. [164] 2022 Seq2Seq DB_1/AIC/DB_6 Emotion similarity DMOS: 3.71±0.066/Cosine similarity: 0.28, 0.60
Li et al. [165] 2021 Seq2Seq Internal corpus MOS: 4.155±0.552, 4.136±0.701
Cai et al. [174] 2021 Seq2Seq IEMOCAP/BC2013-English/RECOLA MOS: 3.66/Accuracy: 78.75, 91.0
Wang et al. [175] 2023 Seq2Seq EmoV-DB MCD: 4.66/MOS: 3.76±0.03/Accuracy: 0.67, 0.67, 0.77/Preference test
Zhou et al. [176] 2022 Seq2Seq VCTK/ESD MCD figures/PCC figures/MOS: 3.21-3.81/BWS test
Diatlova et al. [177] 2023 Seq2Seq ESD MOS: 4.37/NISQA: 4.1
Guan et al. [178] 2024 Seq2Seq MEAD-TTS MOS: 4.36, 3.95, 4.11, 3.77, 4.19, 3.93/SMOS: 4.61, 4.03/MCD: 3.17, 6.69/EmoAcc: 0.659, 0.261, 0.636, 0.318
Zhou et al. [179] 2024 Encoder-Decoder ESD/LibriTTS MOS: 3.67/CMOS: 0.64, 0.71, 0.15, 0.92/Preference test
Tang et al. [180] 2024 Diffusion model IEMOCAP MOS: 4.12/SMOS: 4.10/ERA: 0.749/EDER: 27.8
Jing et al. [181] 2024 Diffusion model ESD/MSP-Podcast MOS-Q: 3.88/MOS-S: 3.36, 3.22, 3.05, 3.34/Preference test
Jia et al. [79] 2019 ET-GAN IEMOCAP/Emo-DB/CaFE/MEmoSD FAD, naturalness MOS figures/Preference test
Inoue et al. [36] 2024 Seq2Seq Blizzard/ESD MOS: 3.596±0.141/MCD: 4.348/Pitch Distortion: 1.151/Energy Distortion: 4.018/FD: 6.881/BWS test
Wang et al. [168] 2024 CampNet VCTK/ESD F0 curve/MCD: 3.078, 3.495, 3.528, 3.425, 3.332/Preference test/MOS figures
Matsumoto et al. [167] 2020 WaveNet JSUT corpus MOS figures/F0 distribution/Confusion matrices (Subject-perceived emotions)
Shi et al. [182] 2024 Encoder-Decoder ESD MOS: 3.98/SMOS: 4.15/EmoAcc: 99.31/MCD: 4.81

9 Textual Emotion Synthesis

Textual emotion synthesis is a vital research field within human emotion synthesis, focusing on the generation of texts that possess specific emotional, sentiment, or empathetic attributes. This area of study aims to create or modify written content to convey desired emotional states, sentiments, or empathetic responses, thereby enhancing the expressiveness and impact of textual communication. Based on existing works, we consider the following two types of tasks: text emotion transfer (Section 9.1) and empathetic dialogue generation (Section 9.2). Table VII shows the literature about textual emotion synthesis.

9.1 Text Emotion Transfer

Text emotion transfer focuses on transforming the emotional tone or sentiment of an existing text while preserving its core semantic content. It allows for the modification of neutral text into emotionally charged content, or the alteration of one emotional state to another.


Figure 9: A text emotion transfer model from [183]. During training, the model restored a corrupted input by conditioning the reconstruction on a stable "style vector" derived from the preceding sentence. At inference, a new style vector enabled precise, targeted style modifications ("targeted restyling") by adjusting the direction and magnitude of the style shift relative to a baseline, with a few exemplar sentences defining the desired output style.
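The "targeted restyling" idea in Fig. 9 can be illustrated with simple vector arithmetic: shift a source style vector along the direction from source exemplars to target exemplars. The sketch below is a hedged reading under that assumption; the additive update and the scale parameter are illustrative choices, not the exact procedure of [183].

```python
import numpy as np

def targeted_restyle_vector(source_style, target_exemplars, source_exemplars, scale=1.0):
    """Form a new conditioning vector by shifting the source style in the
    direction from source exemplars to target exemplars (additive update and
    scale are illustrative assumptions)."""
    direction = np.mean(target_exemplars, axis=0) - np.mean(source_exemplars, axis=0)
    return source_style + scale * direction

rng = np.random.default_rng(3)
style = rng.normal(size=64)                  # style vector from the preceding sentence
negative = rng.normal(-0.5, 1.0, (4, 64))    # a few exemplars of the source style
positive = rng.normal(0.5, 1.0, (4, 64))     # a few exemplars of the desired style
new_style = targeted_restyle_vector(style, positive, negative, scale=1.5)
print(new_style.shape)  # (64,)
```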

For example, in [184], Mohammadibaghmolaei et al. proposed a text emotion transfer technique based on masked language modeling and transfer learning, in which a GPT-2 model was trained to reconstruct the initial sentence from its altered sequences, allowing the model to perform well even with limited emotion-annotated data. Li et al. [26] proposed a novel text sentiment transfer methodology that employed a three-step process of Delete, Retrieve, and Generate. This approach, powered by unsupervised learning and neural sequence-to-sequence models, effectively altered sentiment while retaining content. In [185], Jin et al. introduced the IMaT model, which constructed a pseudo-parallel corpus through semantic alignment, then applied a sequence-to-sequence model for attribute translation, refining this alignment across iterations. Wu et al. [186] proposed a two-stage "Mask and Infill" methodology that significantly enhanced the performance of non-parallel text sentiment transfer. Following the "mask and infill" idea, Malmi et al. [187] introduced an unsupervised method using padded masked language models (MLMs) for sentiment transfer, where the padded MLM variant avoids having to predetermine the number of inserted tokens. In [188], Yang et al. presented a technique leveraging language models as discriminators for unsupervised sentiment manipulation, enhancing stability and content fidelity in generated text. In [189], Shen et al. developed a technique for text sentiment transfer without parallel data that refined the alignment of latent representations to separate content from style, allowing sentiment modification by mapping a sentence to a style-independent content vector and then decoding this vector into another style. Zhang et al. [190] employed a GAN for sentiment transfer across different text domains, combining adversarial and reinforcement learning with a cross-domain sentiment transfer model, enhancing the ability to generate emotionally nuanced text while maintaining domain-specific content. As shown in Fig. 9, Riley et al. [183] leveraged T5 to extract a style vector from preceding sentences and used it to provide extra conditioning for the decoder. Huang et al. [191] introduced a method based on Cycle-consistent Adversarial AutoEncoders, comprising LSTM autoencoders, adversarial style transfer networks, and cycle-consistent constraints, advancing unsupervised text style transfer. In [192], Luo et al. proposed the Seq2SentiSeq model, which incorporated sentiment intensity via a Gaussian kernel in the decoder to enhance sentiment control; they trained the model with cycle reinforcement learning, which maintains the original message while changing the emotional tone without requiring matched data pairs.
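For readers unfamiliar with the mask-and-infill idea, the following minimal sketch masks one sentiment-bearing word and keeps the infill whose predicted sentiment matches the target label, using off-the-shelf Hugging Face pipelines. The chosen models, the single-word masking, and the selection rule are illustrative assumptions and not the trained pipeline of [186].

```python
from transformers import pipeline

# Off-the-shelf pipelines; the model choices here are illustrative assumptions.
fill = pipeline("fill-mask", model="bert-base-uncased")
clf = pipeline("sentiment-analysis")

def mask_and_infill(sentence, sentiment_word, target_label="POSITIVE", top_k=20):
    """Mask one sentiment-bearing word, then keep the candidate infill whose
    overall sentiment best matches the target label (a simplified, retrieval-free
    stand-in for a learned infilling model)."""
    masked = sentence.replace(sentiment_word, fill.tokenizer.mask_token, 1)
    best_text, best_score = masked, -1.0
    for cand in fill(masked, top_k=top_k):
        pred = clf(cand["sequence"])[0]
        if pred["label"] == target_label and pred["score"] > best_score:
            best_text, best_score = cand["sequence"], pred["score"]
    return best_text

print(mask_and_infill("The food was terrible and overpriced.", "terrible"))
```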

9.2 Empathetic Dialogue Generation

Empathetic dialogue generation is a crucial aspect of creating more human-like and emotionally intelligent conversational agents. It goes beyond simply generating contextually relevant responses and focuses on incorporating emotional understanding and support into the generated dialogue.

Many researchers have focused on generating responses with specific emotional tendencies or greater empathy using LLMs. For example, in [193], Li et al. employed the Emotional Chain-of-Thought (ECoT) technique, enhancing Large Language Models' capability for nuanced emotional text generation, focusing on alignment with human emotional intelligence. Yang et al. [194] provided a Hybrid Empathetic Framework (HEF) that used SEMs as flexible enhancements to LLMs, implementing a two-stage emotion prediction strategy and an emotion cause perception strategy. In [195], Sun et al. utilized the Chain of Emotion-aware Empathetic prompting (CoNECT) for better context understanding and emotional engagement. In [196], Lee et al. investigated GPT-3's capacity for empathetic dialogue generation, employing in-context learning in zero-shot and few-shot settings. Casas et al. [24] introduced an empathic chatbot framework utilizing transformer-based language models for generating responses that recognized and adapted to the user's emotional state. In [197], Chen et al. enhanced the empathetic responses of ChatGLM-6B by fine-tuning it with a specialized dataset comprising over 2 million multi-turn empathy conversation samples.
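A hedged sketch of the two-step, emotion-aware prompting style discussed above is given below; the prompt wording is an illustrative assumption rather than the exact ECoT prompts of [193], and call_llm is a hypothetical placeholder for whatever LLM client is available.

```python
def build_emotion_aware_prompt(dialogue_history, user_utterance):
    """Compose a two-step, chain-of-thought style prompt that first asks the model
    to infer the user's emotion and then to respond empathetically (wording and
    step structure are illustrative assumptions)."""
    history = "\n".join(dialogue_history)
    return (
        "You are an empathetic assistant.\n"
        f"Conversation so far:\n{history}\n"
        f"User: {user_utterance}\n\n"
        "Step 1: Briefly describe the emotion the user is expressing and why.\n"
        "Step 2: Write a short reply that acknowledges that emotion and offers support.\n"
        "Reply with Step 1 and Step 2 clearly labeled."
    )

# call_llm is a hypothetical placeholder for whatever LLM API is available.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

prompt = build_emotion_aware_prompt(
    ["User: I failed my driving test again.", "Assistant: I'm sorry to hear that."],
    "I'm starting to think I'll never pass.")
print(prompt)
```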

Apart from LLMs, there are some studies on emotional or empathetic dialogue generation based on models such as the Seq2Seq model and the conditional variational autoencoder (CVAE). For example, as shown in Fig. 10, Song et al. proposed an emotional dialogue system, EmoDS [198], that enhanced a Seq2Seq framework with a lexicon-based attention mechanism and an emotion classifier, generating dialogue responses that expressed specified emotions, either explicitly or implicitly. In [199], Zhou et al. introduced an Emotional Chatting Machine (ECM) that integrated internal and external memory mechanisms along with emotion category embeddings. In [200], Kong et al. proposed a model that effectively combined CGANs with either standard Seq2Seq or CVAE models to produce dialogue responses with the specified sentiment. Li et al. [201] introduced a novel Dual-View CVAE model that synthesized emotional dialogue, changing the emotional expression of responses with higher content relevance. In [202], Xu et al. proposed a framework that incorporated multi-task learning and dual attention mechanisms, effectively decoupling and processing content and emotional information from the input. Asghar et al. [203] suggested an enhanced Seq2Seq model that incorporated three emotional strategies for the input, training, and inference processes, which was based on a designed dictionary with Valence, Arousal, and Dominance (VAD) scores. In [204], Colombo et al. employed a Seq2Seq architecture augmented by emotion embeddings and a VAD lexicon for word- and sequence-level emotion modeling. It utilized an affect regularizer to favor emotionally charged words and an affect sampling method for generating emotionally relevant, diverse responses. Huang et al. [205] introduced a novel approach to integrate emotions into dialogue generation by appending an emotion token to the dialogue input or injecting the emotion directly into the decoder. In [206], Lin et al. proposed a model integrating Transformer and CVAE, with an emotion perception encoder and a BERT-based emotion classification model to embed emotional intelligence, enabling the generation of nuanced and contextually relevant empathetic responses.
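The emotion-token conditioning described above can be sketched as follows: a target emotion label is prepended to the tokenized input so that the encoder states carry the desired emotion. The toy vocabulary, the GRU encoder, and the dimensions are illustrative assumptions, not the architectures of [205] or [199].

```python
import torch
import torch.nn as nn

EMOTIONS = ["<neutral>", "<joy>", "<sadness>", "<anger>"]
# Toy vocabulary; a real system would use its own tokenizer (assumption).
VOCAB = {tok: i for i, tok in enumerate(EMOTIONS + ["<pad>", "i", "miss", "my", "dog"])}

class EmotionConditionedEncoder(nn.Module):
    """Encode a dialogue turn whose first token is the target emotion label,
    so every hidden state can carry the desired emotion."""
    def __init__(self, vocab_size, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, token_ids):
        hidden, _ = self.rnn(self.embed(token_ids))
        return hidden  # (batch, seq_len, dim), fed to a response decoder

def encode_with_emotion(text, emotion):
    ids = [VOCAB[emotion]] + [VOCAB[w] for w in text.split()]
    return torch.tensor([ids])

enc = EmotionConditionedEncoder(len(VOCAB))
states = enc(encode_with_emotion("i miss my dog", "<sadness>"))
print(states.shape)  # torch.Size([1, 5, 32])
```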


Figure 10: An empathetic dialogue generation system from [198]. The system employed a bidirectional LSTM encoder to process the input text into a vector representation, which initialized a decoder. This decoder, guided by an emotion classifier and enriched with a lexicon-based attention mechanism, integrated emotional lexicon words seamlessly into the responses.

In addition to the aforementioned methods, there have been several research efforts exploring the use of GANs to achieve empathetic dialogue generation. For example, in [207, 208], Wang et al. developed SentiGAN, an innovative structure featuring numerous generators alongside a singular multi-class discriminator. This setup encouraged each generator to concentrate on crafting text samples that distinctly exhibited a designated sentiment label. Chen et al. [209] utilized GAN with multiple classifiers to enhance emotional dialogue production, with an emotion discriminative model to align the generated dialogue’s emotion with the intended one. Bi et al. [210] utilized a diffusion model-based approach to generate empathetic responses, distinctive for its use of multi-grained control signals, incorporating communication mechanism, intent, and semantic frame as control levels, enabling nuanced guidance over the generated responses. In [211], Chen et al. proposed CTGAN for emotional text generation by incorporating emotion labels, allowing for the production of text that aligned with specific emotional tones within variable-length outputs.

TABLE VII: Literature on Generative Models for Text Emotion Transfer and Empathetic Dialogue Generation in Textual Emotion Synthesis.

Reference Year Model Dataset Performance
Mohammadibaghmolaei et al. [184] 2023 LLM ISEAR/TEC Transfer strength figures/Content preservation: 0.8403-0.8967/Fluency: 164.9118-220.4884
Li et al. [26] 2018 Seq2Seq YELP/CAPTIONS/AMAZON Human evaluation/BLEU: 11.8, 17.1, 27.1/Classifier Accuracy: 95.4, 96.8, 70.3
Jin et al. [185] 2019 Seq2Seq YELP/FORMALITY Human evaluation/Accuracy: 95.90, 72.07/BLEU: 22.46, 38.16/PPL: 14.89, 32.63
Wu et al. [186] 2019 LLM YELP/AMAZON Human evaluation/ACC: 97.3, 84.5/BLEU: 15.9, 32.1
Malmi et al. [187] 2020 LaserTagger YELP BLEU: 14.5, 15.3/ACC: 40.9, 49.6
Yang et al. [188] 2018 Encoder-Decoder LM YELP ACC: 90.0, 85.4/BLEU: 22.3, 13.4/PPL_X: 48.4, 32.8/PPL_Y: 61.6, 40.5
Shen et al. [189] 2017 AE YELP Accuracy: 78.4/Human evaluation (sentiment: 62.6, fluency: 2.8, overall transfer quality: 41.5)
Zhang et al. [190] 2019 Seq2Seq GAN YELP/AMAZON Sentiment Transfer Strength: 81.7, 69.2/Cosine Similarity: 95.0, 92.2/Word Overlap: 83.9, 75.4
Luo et al. [192] 2019 Seq2Seq YELP BLEU: 32.5, 10.3/MAE: 0.13/MRRR: 0.78/PPL: 35.1/Avg human evaluation: 3.96
Reif et al. [212] 2022 LLM YELP/GYAFC Human evaluation figures/ACC: 90.6/BLEU: 10.4/PPL: 79
Yi et al. [213] 2021 Encoder-Decoder YELP ACC: 90.8/BLEU: 26.3/Cos: 96/PPL: 109/GM: 14.83
Li et al. [214] 2020 Dual-Generator Network YELP/IMDb ACC: 88.0, 70.1/BLEU: 54.5, 70.2
Huang et al. [191] 2020 AAE YELP Transfer: 86.9%/BLEU: 22.51/PPL: 21.6/RPPL: 57.0
Riley et al. [183] 2021 LLM AMAZON ACC: 73.7, 54.9/Content preservation: 34.7, 55.8/Sentiment: 2.5/Preservation: 2.6/Fluency: 4.0
Sancheti et al. [215] 2020 Seq2Seq YELP BLEU: 0.153, 0.088/Accuracy: 0.922, 0.744/Human evaluation
Chawla et al. [216] 2020 Encoder-Decoder LM GYAFC/YELP/AMAZON ACC: 86.2, 68.9/BLEU: 14.1, 28.6
Li et al. [193] 2024 LLM IEMOCAP/DailyDialog/Empathetic-Dialogues/ESConv/PENS EGS: 36.02, 36.42, 36.21, 36.70, 33.48
Yang et al. [194] 2024 LLM EmpatheticDialogue dataset Acc: 45.63/Distinct-1: 42.23/Distinct-2: 80.08
Sun et al. [195] 2023 LLM EmpatheticDialogue dataset Human evaluation/PPL: 18.86/ACC: 53.44
Lee et al. [196] 2022 LLM EmpatheticDialogue dataset Human evaluation/EMOACC: 0.1683/Explorations: 0.4970/Interpretations: 0.2780
Casas et al. [24] 2021 LLM DD/ED dataset Emotion Reflection: 0.465/Emotional: 0.487/Empathy Score: 0.443/PPL: 149.8
Chen et al. [197] 2023 LLM SoulChat-Corpus/SMILECHAT BLEU: 33.78, 20.07, 12.86, 8.52/ROUGE: 31.47, 8.92, 26.57/Emp: 1.84/Human evaluation
Liu et al. [217] 2022 LLM EmpatheticDialogues dataset Emotion accuracy: 0.5262/Perplexity: 13.57/Dist-1: 2.04/Dist-2: 11.68
Qian et al. [218] 2023 LLM EmpatheticDialogues dataset Dist-1: 2.96/Dist-2: 18.29/BERTscore: 0.8803, 0.8816, 0.8774/BLEU-2: 9.37/BLEU-4: 3.26/Human evaluation
Song et al. [198] 2019 Seq2Seq STC/NLPCC Embedding score: 0.634, 0.451, 0.435/BLEU: 1.73/Dist-1: 0.0113/Dist-2: 0.0867/emotion-a: 0.810/emotion-w: 0.687
Zhou et al. [199] 2018 Seq2Seq STC/NLPCC Perplexity: 61.8/Accuracy: 0.773/Human evaluation
Kong et al. [200] 2019 CGAN Seq2Seq CVAE MojiTalk dataset PPL: 69.54/Sentiment Acc: 78.8, 78.9/Quality: 3.9
Li et al. [201] 2021 CVAE NLPCC2017/Weibo dataset/MojiTalk dataset PPL: 25.70, 23.60, 33.7/Dist-1: 0.109, 0.017, 0.105/Dist-2: 0.400, 0.225, 0.517/Human evaluation
Xu et al. [202] 2019 CVAE Seq2Seq NLPCC2017 PPL: 34.6/Accuracy: 0.637/Dist-1: 0.3315/Dist-2: 0.7900/Dist-3: 0.9023/Human evaluation
Asghar et al. [203] 2018 Seq2Seq Cornell Movie Dialogs Corpus Syntactic Coherence: 1.76/Naturalness: 1.09/Emotional Appropriateness: 1.10
Colombo et al. [204] 2019 Seq2Seq OpenSubtitles2018/Cornell Dist-1: 0.0406, 0.0305/Dist-2: 0.2030, 0.1431/BLEU: 0.0140, 0.110/Hyper-parameter optimization
Huang et al. [205] 2018 Seq2Seq CBET/OpenSubtitles dataset Average Acc: 76.69, 75.91, 78.46
Lin et al. [206] 2022 CVAE EmpatheticDialogues dataset PPL: 19.6/Diversity: 0.0208, 0.1404, 0.3976/Human evaluation
Firdaus et al. [219] 2020 CVAE SEMD PPL: 34.8/Sentiment Acc: 0.85/Emotion Acc: 0.80/Dist-1: 0.0203/Dist-2: 0.0520/Human evaluation
Majumder et al. [220] 2020 Encoder-Decoder EmpatheticDialogues dataset BLEU: 2.98/Human evaluation/Preference test
Sabour et al. [221] 2022 Encoder-Decoder EmpatheticDialogues dataset PPL: 35.60/Dist-1: 0.66/Dist-2: 2.99/Acc: 39.11/Human evaluation
Li et al. [222] 2019 Encoder-Decoder NLPCC2017 PPL: 62.2, 61, 61.4/Accuracy: 0.871, 0.870, 0.869/Human evaluation
Wang et al. [208] 2019 GAN MR/BR/CR/Emotional tweet conversation Sentiment Acc: 0.885, 0.841, 0.803/Novelty: 0.395, 0.427, 0.549/Diversity: 0.741, 0.713, 0.708/Human evaluation
Chen et al. [209] 2021 GAN NLPW/XHJ Acc: 0.701-0.870/Fluency score: 6.300-11.33/Human evaluation
Bi et al. [210] 2023 Diffusion model EmpatheticDialogues dataset BERTScore: 0.5205/MIScore: 626.92/Acc: 92.36, 84.24/F1-SF: 52.79/Dist-1, 2, 4: 2.84, 29.25, 73.45/Self-BLEU: 1.09
Li et al. [223] 2020 GAN Seq2Seq NLPCC2017 PPL: 62.8/Acc: 0.735/Distinct: 0.0287, 0.3062/Human evaluation
Chen et al. [211] 2020 GAN Yelp restaurant reviews/Amazon reviews/Film review/Obama speech Similarity: 0.1-0.2/Sentiment figures/Average Acc: 0.56
Gu et al. [224] 2024 LLM EmpatheticDialogues dataset Dist-1: 2.98/Dist-2: 18.38/B-2: 7.64/B-4: 2.57/PBERT: 87.29/RBERT: 88.04/FBERT: 87.63/Human evaluation (Empathy): 4.17
Chen et al. [225] 2024 LLM EmpatheticDialogues dataset Acc: 52.73/Dist-1: 2.96/Dist-2: 19.52/BLEU-2: 10.54/BLEU-4: 5.17

10 Evaluation Metric

In the field of human emotion synthesis, several commonly used evaluation metrics are employed to assess the quality of generated content across the three modalities of facial images, speech, and text. These metrics provide a comprehensive assessment of the synthesized emotions from various perspectives. While some evaluation indicators are universal and applicable to all modalities, others are exclusive to specific modalities, as illustrated in Table VIII.

Common evaluation indicators usually take two forms: user studies and accuracy. User studies are a qualitative evaluation method that assesses the quality of generated content in terms of clarity, naturalness, and emotional authenticity. These studies involve gathering feedback and opinions from human participants to gauge their perception and experience of the synthesized emotions. For example, in tasks like face reenactment, where emotional authenticity is critical, researchers conduct a "Real vs. Fake" perceptual study on platforms like Amazon Mechanical Turk (AMT) to evaluate the outputs. In the field of TTS and voice conversion, researchers [166, 159, 146] use subjective evaluation metrics such as Mean Opinion Score (MOS) and Similarity Mean Opinion Score (SMOS) to assess naturalness and emotional similarity. Accuracy is another important evaluation metric for human emotion synthesis. For example, in face reenactment, researchers use classifiers to assess generated facial images; higher accuracy in expression recognition reflects higher accuracy in expression translation by the models [96]. Similarly, [23] utilizes an emotion classifier network from EVP [226] to measure the emotion accuracy of face manipulation.
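As a concrete, hedged illustration of how these two common indicators are typically aggregated, the sketch below computes a MOS with a 95% confidence interval from 1-5 ratings and an emotion accuracy from a pretrained classifier's predictions; the rating scale, the normal-approximation interval, and the availability of such a classifier are assumptions for illustration.

```python
import numpy as np

def mean_opinion_score(ratings):
    """MOS with a normal-approximation 95% confidence interval over 1-5 ratings."""
    r = np.asarray(ratings, dtype=float)
    mos = r.mean()
    ci95 = 1.96 * r.std(ddof=1) / np.sqrt(len(r))
    return mos, ci95

def emotion_accuracy(predicted_labels, target_labels):
    """Fraction of generated samples whose predicted emotion (from a pretrained
    classifier, assumed available) matches the intended target emotion."""
    p, t = np.asarray(predicted_labels), np.asarray(target_labels)
    return (p == t).mean()

print(mean_opinion_score([4, 5, 3, 4, 4, 5]))   # roughly (4.17, 0.60)
print(emotion_accuracy(["joy", "anger", "joy"],
                       ["joy", "anger", "sadness"]))  # roughly 0.67
```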

In addition to the aforementioned universal evaluation metrics, different modalities in the field of emotion synthesis have their own specific and widely used evaluation indicators. These indicators target the unique attributes and synthesis goals of each modality, providing a more fine-grained and specialized assessment perspective. For example, in facial emotion synthesis, PSNR (Peak Signal-to-Noise Ratio) [107, 227] quantifies image quality by comparing generated images to their references. It is calculated using the logarithm of the ratio between the maximum pixel value and the mean squared error, with higher values indicating better quality. SSIM (Structural Similarity Index) [96] improves upon PSNR by evaluating image similarity based on luminance, contrast, and structure, focusing on perceived quality and structural integrity. FID (Fréchet Inception Distance) [91] measures the similarity between sets of images by comparing feature vectors, with lower scores indicating higher quality and diversity in generated images. In speech emotion synthesis, MCD (Mel Cepstral Distortion) [166, 228] objectively measures spectral similarity between reference and generated mel-spectrograms, providing quantitative feedback on emotional voice synthesis accuracy. F0 RMSE (F0 Root Mean Square Error) [158, 81] evaluates the accuracy of the fundamental frequency contour in synthesized speech compared to the reference. Lower values indicate higher pitch accuracy, contributing to perceived naturalness and expressiveness. In textual emotion synthesis, the BLEU (Bilingual Evaluation Understudy) score [26, 186], borrowed from machine translation evaluation, can measure lexical similarity between generated and reference texts, indicating how well the generated text maintains desired linguistic properties while altering emotional tone. PPL (Perplexity) [185] measures a language model's prediction accuracy, reflecting its ability to produce coherent and fluent text. Lower PPL suggests the generated text more closely mirrors human language patterns.
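A minimal sketch of three of these objective metrics (PSNR, F0 RMSE, and MCD) is given below, following their standard definitions; the exclusion of the 0th cepstral coefficient and the assumption of time-aligned, equal-length sequences are common conventions rather than requirements of the cited works.

```python
import numpy as np

def psnr(reference, generated, max_val=255.0):
    """Peak Signal-to-Noise Ratio between a reference and a generated image."""
    mse = np.mean((reference.astype(float) - generated.astype(float)) ** 2)
    return float("inf") if mse == 0 else 20 * np.log10(max_val) - 10 * np.log10(mse)

def f0_rmse(f0_ref, f0_gen):
    """Root mean square error between reference and generated F0 contours
    (assumed already time-aligned and of equal length)."""
    return np.sqrt(np.mean((np.asarray(f0_ref) - np.asarray(f0_gen)) ** 2))

def mcd(mcep_ref, mcep_gen):
    """Mel cepstral distortion in dB over aligned mel-cepstral frames,
    conventionally excluding the 0th (energy) coefficient."""
    diff = np.asarray(mcep_ref)[:, 1:] - np.asarray(mcep_gen)[:, 1:]
    return (10.0 / np.log(10)) * np.mean(np.sqrt(2.0 * np.sum(diff ** 2, axis=1)))

rng = np.random.default_rng(2)
print(psnr(rng.integers(0, 256, (64, 64)), rng.integers(0, 256, (64, 64))))
print(f0_rmse(rng.uniform(80, 300, 100), rng.uniform(80, 300, 100)))
print(mcd(rng.normal(size=(100, 25)), rng.normal(size=(100, 25))))
```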

TABLE VIII: Evaluation Metrics in Human Emotion Synthesis.
Modalities Metrics Description
Common User study Assesses the quality and realism of synthesized emotion contents based on scoring by human participants.
Accuracy Measures the percentage of correct predictions in emotion classification tasks.
Facial Emotion Synthesis PSNR Quantifies the similarity between the synthesized and original images, where higher values indicate better quality.
SSIM Assesses the perceived visual quality by comparing structural information between the original and generated images.
FID Evaluates the distribution difference between real and synthesized images by comparing deep feature representations.
Speech Emotion Synthesis MCD Measures the distortion between the synthesized and reference speech in the cepstral domain.
F0 RMSE Quantifies the difference in pitch between synthesized and real speech.
Textual Emotion Synthesis BLEU Computes the lexical similarity between synthesized and reference text.
PPL Measures the fluency of generated text, with lower values indicating more predictable and coherent sentences.

11 Discussion

11.1 Major Findings

Existing methods in generative technology for human emotion synthesis have made substantial progress across multiple modalities, including facial images, speech, and text. Each modality benefits from distinct approaches, and these developments not only enhance the perceived emotional intelligence of systems but also push the boundaries of how machines can generate and interpret nuanced emotional states.

In facial emotion synthesis, GAN-based methods used to be the mainstream approach. However, in recent years, DM-based methods have also gained considerable attention and made significant progress. Specifically, classical GAN-based models such as StarGAN and StyleGAN excel at altering the emotional expressions of faces but face challenges when attempting to generate more subtle or complex emotional states. Moreover, generating mixed emotions, that is, facial expressions that convey a strong contrast of emotions like happiness and sadness, still lacks finer control. By comparison, DMs generate images through progressive denoising, which avoids the mode collapse that GANs may suffer when generating facial expressions. They also show strong adaptability in generating mixed emotions, better capturing subtle expression changes and emotional nuances, and can handle multiple combinations of emotions in facial expressions.

Speech emotion synthesis has achieved similar progress through the adaptation of GANs and Seq2Seq models like Tacotron [229], where emotional intonation is introduced into synthesized speech. These models produce more natural and expressive speech by adjusting vocal elements such as pitch, rhythm, and tone. However, recent research has also incorporated AEs and DMs to further improve the emotional depth and expressiveness of synthesized speech. Specifically, AEs are used to disentangle emotional features from speaker identity, enabling more flexible emotion transfer while preserving the naturalness of speech. DMs, with their capacity for modeling complex data distributions, offer promising results in generating emotional speech with high fidelity and more controlled variations in prosody.

Textual emotion synthesis has increasingly leveraged LLMs and Seq2Seq architectures, utilizing mechanisms like sentiment control and emotional valence modulation to produce emotionally resonant content [230]. These systems are often employed in applications such as empathetic chatbots and emotionally responsive dialogue systems. Despite their effectiveness, generating responses with emotional depth that appropriately reflect varying levels of empathy, sympathy, or other complex emotional tones based on user inputs remains a challenge. Current models still struggle with maintaining a balance between emotional expressiveness and conversational coherence, especially in response to ambiguous or contextually nuanced inputs. Moreover, understanding the contextual triggers for emotional responses and integrating them effectively into generative models will be an ongoing research area.

In terms of evaluation metrics, quality assessment remains a complex, multidimensional task that integrates both subjective and objective indicators. Subjective metrics typically involve human evaluation, which focuses on capturing the emotional authenticity, resonance, and overall impression of the generated content. However, subjective evaluations can be time-consuming and costly, and they are influenced by individual biases and cultural differences. On the other hand, objective metrics aim to quantify various aspects of emotion synthesis using automated computational methods, such as accuracy and PSNR. They provide a scalable and reproducible way of assessment but may not fully capture the emotional nuances and human-perceived quality of the generated content. Future research directions include developing more fine-grained subjective evaluation methods to better capture the subtle differences in emotional content and the complexity of human responses. At the same time, improving objective metrics to more accurately quantify various aspects of emotional expression and correlating them with human judgments is also an important area of research.

11.2 Future Perspectives

Advances in generative AI have revolutionized how we simulate human emotions, creating more authentic and nuanced emotional expressions. Looking ahead, several promising avenues for further research can be explored:

Firstly, combining the capabilities of different generative models such as GANs, Seq2Seqs, AEs, DMs, and LLMs holds promise for further enhancing the quality of generated outputs. Each model has its unique strengths and limitations, and intelligently integrating them can compensate for the shortcomings of individual models, enabling more accurate and realistic emotion synthesis. By designing innovative hybrid architectures that leverage the strengths of each model, more powerful and comprehensive emotion synthesis systems can be developed. These hybrid models can seamlessly transition between different modalities, generating emotionally consistent and complementary outputs. For instance, a model combining DMs with Seq2Seq architectures could generate high-quality emotional speech together with matching facial expressions and lip synchronization.

Secondly, the horizon of emotion synthesis is not limited to common modalities like facial images, speech, and text. As generative models continue to evolve, we may see multimodal human emotion synthesis results [231, 232] in the future, including modalities beyond those mentioned in this paper, such as gesture [233] and physiological signals like EEG and ECG. In addition, emerging cross-modal generative models, such as text-to-image (T2I), text-to-video (T2V), and even text-to-3D, are poised to expand the creative and interactive potential of emotion synthesis. T2I models can generate imagery that resonates emotionally with written narratives, producing visuals that reflect subtle emotional undertones, while T2V models can bring stories to life by translating emotional content into animated, visually expressive scenes that engage audiences on a deeper emotional level. Moreover, as the technology matures, the potential for converting image-based emotions into sound (e.g., generating soundscapes that mirror the mood of a visual scene) opens up new dimensions for immersive experiences in fields like VR [234] and interactive entertainment.

Thirdly, the integration of generative models with edge devices—from server-based processing to smart terminals—marks a pivotal shift in the accessibility and application of emotion synthesis [235]. As the computational power of edge devices continues to grow, there is a growing potential for real-time emotion generation and synthesis directly on devices such as smartphones, wearables, and VR headsets. This transition from centralized server processing to edge computing opens up a wide range of applications, enabling personalized, on-the-fly emotional interactions. For instance, smart devices can analyze users’ facial expressions, voice tone, or even physiological signals, generating responsive emotional content that adapts to the user’s immediate emotional state, location, or context. Additionally, as AI models become more efficient, smaller-scale devices like wearables or even IoT sensors [236] can incorporate emotion-aware interactions, enhancing user experience in a range of industries—from healthcare, where emotion synthesis can assist in mental health monitoring and intervention, to retail, where it can personalize consumer experiences in-store or online. In this new paradigm, generative models are poised to provide immediate, adaptive emotional content, offering users a deeper connection to the digital world.

Fourthly, human emotion synthesis holds the potential to drive profound transformations across a variety of industries and can revolutionize domains such as digital entertainment and filmmaking [237, 238]. In the realm of digital entertainment, emotion synthesis lays the foundation for highly immersive experiences, enabling virtual characters and environments to express emotions with a level of authenticity rivaling that of human actors. By precisely generating emotional nuances in facial expressions, vocal tones, and body language, these technologies can elevate interactive media to new heights. In film production, AI-driven tools are already being employed to infuse emotional depth into character performances, achieving more dynamic and expressive storytelling that transcends the limitations of traditional physical acting. Moreover, the combination of AIGC video and emotion synthesis opens up new avenues for creation, empowering filmmakers to craft content with emotionally evocative visual and auditory cues. This presents opportunities for real-time emotional adjustments within films, where characters’ emotional arcs can be dynamically altered based on audience feedback or narrative shifts, further immersing viewers in the experience [239]. Through these advancements, the industry is entering a new chapter where the integration of emotion synthesis will unleash vast creative potential.

12 Conclusion

This review presents a detailed investigation of current generative technology for human emotion synthesis across various modalities, including facial images, speech, and text. It reveals how different generative models, ranging from well-established approaches like AEs and GANs to emerging techniques such as DMs and LLMs, are capable of generating complex emotional expressions with remarkable depth and subtle nuances.

In Section 1, we introduce the background of human emotion synthesis, a pivotal area of research within affective computing. With the growing sophistication of generative models, which possess advanced data modeling and multi-modal generation capabilities, new avenues for emotion synthesis are emerging. However, there is currently a lack of comprehensive reviews on this subject. To address this gap, we present the first survey of generative technology for human emotion synthesis, building upon existing literature and filling the void in current research. In Section 2, we outline our rigorous literature screening strategy, which allowed us to systematically collect and classify key research in the field of generative models for emotion synthesis. In Sections 3 to 6, we highlight the gaps between previous reviews and our survey, demonstrating the novelty and significance of our contribution. We also provide an overview of emotion models and mathematics of generative models, as well as commonly used datasets, offering a deeper understanding of the latest advancements in this interdisciplinary field.

Sections 7 to 9 provide a detailed discussion of the latest research on human emotion synthesis based on facial images, speech, and text. For each modality, we categorize the specific tasks, analyze the distribution of different generative models across these tasks, discuss their strengths and limitations in various contexts, and summarize the models and performance of existing works in comprehensive tables. Finally, in Sections 10 and 11, we summarize the commonly used evaluation metrics for emotion synthesis and discuss the current state and future development trends in this field, providing insights into the challenges and opportunities faced by emotion synthesis.

In summary, this review highlights the transformative potential of existing generative models in shaping the future of human emotion synthesis. These advancements will not only enhance the granularity and authenticity of synthesized emotions but also usher in a new era where machines can resonate with the subtleties of human emotions, fostering deeper and more empathetic connections.

References

  • [1] R. W. Picard, Affective computing.   MIT press, 2000.
  • [2] F. Ding, X. Kang, and F. Ren, “Neuro or symbolic? fine-tuned transformer with unsupervised lda topic clustering for text sentiment analysis,” IEEE Transactions on Affective Computing, vol. 15, no. 2, pp. 493–507, 2024.
  • [3] E. Yadegaridehkordi, N. F. B. M. Noor, M. N. B. Ayub, H. B. Affal, and N. B. Hussin, “Affective computing in education: A systematic review and future research,” Computers & education, vol. 142, p. 103649, 2019.
  • [4] F. Ren, Z. Liu, and X. Kang, “An efficient framework for constructing speech emotion corpus based on integrated active learning strategies,” IEEE Transactions on Affective Computing, vol. 13, no. 4, pp. 1929–1940, 2022.
  • [5] J. Deng and F. Ren, “A survey of textual emotion recognition and its challenges,” IEEE Transactions on Affective Computing, vol. 14, no. 1, pp. 49–67, 2023.
  • [6] Y. Zhou, X. Kang, and F. Ren, “Prompt consistency for multi-label textual emotion detection,” IEEE Transactions on Affective Computing, vol. 15, no. 1, pp. 121–129, 2024.
  • [7] M. Schröder, “Emotional speech synthesis: A review,” in Seventh European Conference on Speech Communication and Technology, 2001.
  • [8] K. Tokuda, Y. Nankaku, T. Toda, H. Zen, J. Yamagishi, and K. Oura, “Speech synthesis based on hidden markov models,” Proceedings of the IEEE, vol. 101, no. 5, pp. 1234–1252, 2013.
  • [9] N. Hajarolasvadi, M. A. Ramirez, W. Beccaro, and H. Demirel, “Generative adversarial networks in human emotion synthesis: A review,” IEEE Access, vol. 8, pp. 218 499–218 529, 2020.
  • [10] S. Poria, E. Cambria, R. Bajpai, and A. Hussain, “A review of affective computing: From unimodal analysis to multimodal fusion,” Information fusion, vol. 37, pp. 98–125, 2017.
  • [11] Y. Cao, S. Li, Y. Liu, Z. Yan, Y. Dai, P. S. Yu, and L. Sun, “A comprehensive survey of ai-generated content (aigc): A history of generative ai from gan to chatgpt,” arXiv preprint arXiv:2303.04226, 2023.
  • [12] F. Ma, Y. Yuan, Y. Xie, H. Ren, I. Liu, Y. He, F. Ren, F. R. Yu, and S. Ni, “Generative technology for human emotion recognition: A scoping review,” Information Fusion, p. 102753, 2024.
  • [13] L. Ruthotto and E. Haber, “An introduction to deep generative modeling,” GAMM-Mitteilungen, vol. 44, no. 2, p. e202100008, 2021.
  • [14] A. Gillioz, J. Casas, E. Mugellini, and O. A. Khaled, “Overview of the transformer-based models for nlp tasks,” in 2020 15th Conference on Computer Science and Information Systems (FedCSIS).   IEEE, 2020, pp. 179–183.
  • [15] S. Feuerriegel, J. Hartmann, C. Janiesch, and P. Zschech, “Generative ai,” Business & Information Systems Engineering, vol. 66, no. 1, pp. 111–126, 2024.
  • [16] Y. Xie, J. Wang, T. Feng, F. Ma, and Y. Li, “Ccis-diff: A generative model with stable diffusion prior for controlled colonoscopy image synthesis,” arXiv preprint arXiv:2411.12198, 2024.
  • [17] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.
  • [18] G. Zhang, Y. Qin, W. Zhang, J. Wu, M. Li, Y. Gai, F. Jiang, and T. Lee, “iemotts: Toward robust cross-speaker emotion transfer and control for speech synthesis based on disentanglement between prosody and timbre,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023.
  • [19] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, vol. 27, 2014.
  • [20] A. Pumarola, A. Agudo, A. M. Martinez, A. Sanfeliu, and F. Moreno-Noguer, “Ganimation: Anatomically-aware facial animation from a single image,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 818–833.
  • [21] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” in Advances in neural information processing systems, vol. 33, 2020, pp. 6840–6851.
  • [22] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, vol. 30, 2017.
  • [23] B. Zhang, X. Zhang, N. Cheng, J. Yu, J. Xiao, and J. Wang, “Emotalker: Emotionally editable talking face generation via diffusion model,” arXiv preprint arXiv:2401.08049, 2024.
  • [24] J. Casas, T. Spring, K. Daher, E. Mugellini, O. A. Khaled, and P. Cudré-Mauroux, “Enhancing conversational agents with empathic abilities,” in Proceedings of the 21st ACM International Conference on Intelligent Virtual Agents, 2021, pp. 41–47.
  • [25] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Advances in neural information processing systems, vol. 27, 2014.
  • [26] J. Li, R. Jia, H. He, and P. Liang, “Delete, retrieve, generate: A simple approach to sentiment and style transfer,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018, pp. 1865–1874.
  • [27] H. Rashkin, E. M. Smith, M. Li, and Y.-L. Boureau, “Towards empathetic open-domain conversation models: A new benchmark and dataset,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 5370–5381.
  • [28] S. Dhere, S. B. Rathod, S. Aarankalle, Y. Lad, and M. Gandhi, “A review on face reenactment techniques,” in 2020 International Conference on Industry 4.0 Technology (I4Tech).   IEEE, 2020, pp. 191–194.
  • [29] X. Luo, X. Zhang, Y. Xie, X. Tong, W. Yu, H. Chang, F. Ma, and F. R. Yu, “Codeswap: Symmetrically face swapping based on prior codebook,” in Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 6910–6919.
  • [30] W. Xie, W. Lu, Z. Peng, and L. Shen, “Consistency preservation and feature entropy regularization for gan-based face editing,” IEEE Transactions on Multimedia, 2023.
  • [31] R. Zhen, W. Song, Q. He, J. Cao, L. Shi, and J. Luo, “Human-computer interaction system: A survey of talking-head generation,” Electronics, vol. 12, no. 1, p. 218, 2023.
  • [32] M. Toshpulatov, W. Lee, and S. Lee, “Talking human face generation: A survey,” Expert Systems with Applications, vol. 219, p. 119678, 2023.
  • [33] B. Sisman, J. Yamagishi, S. King, and H. Li, “An overview of voice conversion and its challenges: From statistical modeling to deep learning,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 132–157, 2020.
  • [34] K. Zhou, B. Sisman, R. Liu, and H. Li, “Emotional voice conversion: Theory, databases and esd,” Speech Communication, vol. 137, pp. 1–18, 2022.
  • [35] N. Kaur and P. Singh, “Conventional and contemporary approaches used in text to speech synthesis: a review,” Artificial Intelligence Review, vol. 56, pp. 5837–5880, 2023.
  • [36] S. Inoue, K. Zhou, S. Wang, and H. Li, “Fine-grained quantitative emotion editing for speech generation,” arXiv preprint arXiv:2403.02002, 2024.
  • [37] W.-L. Zheng and B.-L. Lu, “Investigating critical frequency bands and channels for eeg-based emotion recognition with deep neural networks,” IEEE Transactions on Autonomous Mental Development, vol. 7, no. 3, pp. 162–175, 2015.
  • [38] D. B. Geselowitz, “On the theory of the electrocardiogram,” Proceedings of the IEEE, vol. 77, no. 6, pp. 857–876, 1989.
  • [39] S. Brave and C. Nass, “Emotion in human-computer interaction,” in The human-computer interaction handbook.   CRC Press, 2007, pp. 103–118.
  • [40] D. Keltner and J. J. Gross, “Functional accounts of emotions,” Cognition & Emotion, vol. 13, no. 5, pp. 467–480, 1999.
  • [41] I. B. Mauss and M. D. Robinson, “Measures of emotion: A reviews,” Cognition and emotion, pp. 109–137, 2010.
  • [42] J. S. Lerner, Y. Li, P. Valdesolo, and K. S. Kassam, “Emotion and decision making,” Annual review of psychology, vol. 66, pp. 799–823, 2015.
  • [43] B. Parkinson, A. H. Fischer, and A. S. Manstead, Emotion in social relations: Cultural, group, and interpersonal processes.   Psychology press, 2005.
  • [44] S. K. Khare, V. Blanes-Vidal, E. S. Nadimi, and U. R. Acharya, “Emotion recognition and artificial intelligence: A systematic review (2014–2023) and research recommendations,” Information Fusion, vol. 102019, 2023.
  • [45] S. Zhao, G. Ding, J. Han, and Y. Gao, “Personality-aware personalized emotion recognition from physiological signals,” in IJCAI, 2018, pp. 1660–1667.
  • [46] S. Zhao, A. Gholaminejad, G. Ding, Y. Gao, J. Han, and K. Keutzer, “Personalized emotion recognition by personality-aware high-order learning of physiological signals,” ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), vol. 15, no. 1s, pp. 1–18, 2019.
  • [47] P. Ekman, “An argument for basic emotions,” Cognition & emotion, vol. 6, no. 3-4, pp. 169–200, 1992.
  • [48] R. Plutchik and H. Kellerman, Theories of emotion.   Academic press, 2013, vol. 1.
  • [49] F. Ma, Y. Li, S. Ni, S.-L. Huang, and L. Zhang, “Data augmentation for audio-visual emotion recognition with an efficient multimodal conditional gan,” Applied Sciences, vol. 12, no. 1, p. 527, 2022.
  • [50] G. F. Wilson and C. A. Russell, “Real-time assessment of mental workload using psychophysiological measures and artificial neural networks,” Human factors, vol. 45, no. 4, pp. 635–644, 2003.
  • [51] A. Mehrabian, “Pleasure-arousal-dominance: A general framework for describing and measuring individual differences in temperament,” Current Psychology, vol. 14, pp. 261–292, 1996.
  • [52] A. Kammoun, R. Slama, H. Tabia, T. Ouni, and M. Abid, “Generative adversarial networks for face generation: A survey,” ACM Computing Surveys, vol. 55, no. 5, pp. 1–37, 2022.
  • [53] Y. Liu, Q. Li, Q. Deng, Z. Sun, and M.-H. Yang, “Gan-based facial attribute manipulation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
  • [54] A. Triantafyllopoulos, B. W. Schuller, G. İymen, M. Sezgin, X. He, Z. Yang, P. Tzirakis, S. Liu, S. Mertes, E. André et al., “An overview of affective speech synthesis and conversion in the deep learning era,” Proceedings of the IEEE, 2023.
  • [55] A. Wali, Z. Alamgir, S. Karim, A. Fawaz, M. B. Ali, M. Adan, and M. Mujtaba, “Generative adversarial networks for speech processing: A review,” Computer Speech & Language, vol. 72, p. 101308, 2022.
  • [56] H. Zhang, H. Song, S. Li, M. Zhou, and D. Song, “A survey of controllable text generation using transformer-based pre-trained language models,” ACM Computing Surveys, vol. 56, no. 3, pp. 1–37, 2023.
  • [57] G. H. D. Rosa and J. P. Papa, “A survey on text generation using generative adversarial networks,” Pattern Recognition, vol. 119, p. 108098, 2021.
  • [58] A. Saxena, A. Khanna, and D. Gupta, “Emotion recognition and detection methods: A comprehensive survey,” Journal of Artificial Intelligence and Systems, vol. 2, no. 1, pp. 53–79, 2020.
  • [59] P. Nandwani and R. Verma, “A review on sentiment analysis and emotion detection from text,” Social network analysis and mining, vol. 11, no. 1, p. 81, 2021.
  • [60] F. Z. Canal, T. R. Müller, J. C. Matias, G. G. Scotton, A. R. de Sa Junior, E. Pozzebon, and A. C. Sobieranski, “A survey on facial emotion recognition techniques: A state-of-the-art literature review,” Information Sciences, vol. 582, pp. 593–617, 2022.
  • [61] A. Oussidi and A. Elhassouny, “Deep generative models: Survey,” in 2018 International conference on intelligent systems and computer vision (ISCV).   IEEE, 2018, pp. 1–8.
  • [62] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.
  • [63] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever et al., “Improving language understanding by generative pre-training,” 2018.
  • [64] J. D. M.-W. C. Kenton and L. K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of NAACL-HLT, 2019, pp. 4171–4186.
  • [65] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le, “Xlnet: Generalized autoregressive pretraining for language understanding,” in Advances in neural information processing systems, vol. 32, 2019.
  • [66] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” Journal of machine learning research, vol. 21, no. 140, pp. 1–67, 2020.
  • [67] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019.
  • [68] Y. Liu, H. Hou, F. Ma, S. Ni, and F. R. Yu, “Mllm-ta: Leveraging multimodal large language models for precise temporal video grounding,” IEEE Signal Processing Letters, 2024.
  • [69] G. Zhao, X. Huang, M. Taini, S. Z. Li, and M. PietikäInen, “Facial expression recognition from near-infrared videos,” Image and vision computing, vol. 29, no. 9, pp. 607–619, 2011.
  • [70] O. Langner, R. Dotsch, G. Bijlstra, D. H. Wigboldus, S. T. Hawk, and A. V. Knippenberg, “Presentation and validation of the radboud faces database,” Cognition and emotion, vol. 24, no. 8, pp. 1377–1388, 2010.
  • [71] P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and I. Matthews, “The extended cohn-kanade dataset (ck+): A complete dataset for action unit and emotion-specified expression,” in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).   IEEE, 2010, pp. 94–101.
  • [72] S. Du, Y. Tao, and A. M. Martinez, “Compound facial expressions of emotion,” Proceedings of the National Academy of Sciences, vol. 111, no. 15, pp. E1454–E1462, 2014.
  • [73] A. Mollahosseini, B. Hasani, and M. H. Mahoor, “Affectnet: A database for facial expression, valence, and arousal computing in the wild,” IEEE Transactions on Affective Computing, vol. 10, no. 1, pp. 18–31, 2017.
  • [74] S. M. Mavadati, M. H. Mahoor, K. Bartlett, P. Trinh, and J. F. Cohn, “Disfa: A spontaneous facial action intensity database,” IEEE Transactions on Affective Computing, vol. 4, no. 2, pp. 151–160, 2013.
  • [75] C. F. Benitez-Quiroz, R. Srinivasan, and A. M. Martinez, “Emotionet: An accurate, real-time algorithm for the automatic annotation of a million facial expressions in the wild,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5562–5570.
  • [76] K. Zhou, B. Sisman, R. Liu, and H. Li, “Seen and unseen emotional style transfer for voice conversion with a new emotional speech dataset,” in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2021, pp. 920–924.
  • [77] A. Adigwe, N. Tits, K. E. Haddad, S. Ostadabbas, and T. Dutoit, “The emotional voices database: Towards controlling the emotion dimension in voice generation systems,” arXiv preprint arXiv:1806.09514, 2018.
  • [78] F. Burkhardt, A. Paeschke, M. Rolfes, W. F. Sendlmeier, and B. Weiss, “A database of german emotional speech,” in Interspeech, vol. 5, 2005, pp. 1517–1520.
  • [79] X. Jia, J. Tai, H. Zhou, Y. Li, W. Zhang, H. Du, and Q. Huang, “Et-gan: Cross-language emotion transfer based on cycle-consistent generative adversarial networks,” arXiv preprint arXiv:1905.11173, 2019.
  • [80] P. Gournay, O. Lahaie, and R. Lefebvre, “A canadian french emotional speech dataset,” in Proceedings of the 9th ACM Multimedia Systems Conference, 2018, pp. 399–402.
  • [81] C.-B. Im, S.-H. Lee, S.-B. Kim, and S.-W. Lee, “Emoq-tts: Emotion intensity quantization for fine-grained controllable emotional text-to-speech,” in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2022, pp. 6317–6321.
  • [82] R. He and J. McAuley, “Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering,” in Proceedings of the 25th International Conference on World Wide Web, 2016, pp. 507–517.
  • [83] X. Zhou and W. Y. Wang, “Mojitalk: Generating emotional responses at scale,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 1128–1137.
  • [84] K. Wang, Q. Wu, L. Song, Z. Yang, W. Wu, C. Qian, R. He, Y. Qiao, and C. C. Loy, “Mead: A large-scale audio-visual dataset for emotional talking-face generation,” in European Conference on Computer Vision.   Springer, 2020, pp. 700–717.
  • [85] H. Cao, D. G. Cooper, M. K. Keutmann, R. C. Gur, A. Nenkova, and R. Verma, “Crema-d: Crowd-sourced emotional multimodal actors dataset,” IEEE Transactions on Affective Computing, vol. 5, no. 4, pp. 377–390, 2014.
  • [86] S. Albanie, A. Nagrani, A. Vedaldi, and A. Zisserman, “Emotion recognition in speech using cross-modal transfer in the wild,” in Proceedings of the 26th ACM International Conference on Multimedia, 2018, pp. 292–301.
  • [87] S. Haq, P. J. Jackson, and J. Edge, “Audio-visual feature selection and reduction for emotion classification,” in Proc. Int. Conf. on Auditory-Visual Speech Processing (AVSP’08), Tangalooma, Australia, 2008.
  • [88] S. R. Livingstone and F. A. Russo, “The ryerson audio-visual database of emotional speech and song (ravdess): A dynamic, multimodal set of facial and vocal expressions in north american english,” PloS One, vol. 13, no. 5, p. e0196391, 2018.
  • [89] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan, “Iemocap: Interactive emotional dyadic motion capture database,” Language Resources and Evaluation, vol. 42, pp. 335–359, 2008.
  • [90] S. Tripathy, J. Kannala, and E. Rahtu, “Icface: Interpretable and controllable face reenactment using gans,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2020, pp. 3385–3394.
  • [91] X. Zeng, Y. Pan, M. Wang, J. Zhang, and Y. Liu, “Realistic face reenactment via self-supervised disentangling of identity and pose,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, 2020, pp. 12 757–12 764.
  • [92] V. Strizhkova, Y. Wang, D. Anghelone, D. Yang, A. Dantcheva, and F. Brémond, “Emotion editing in head reenactment videos using latent space manipulation,” in 2021 16th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2021).   IEEE, 2021, pp. 1–8.
  • [93] C. Groth, J.-P. Tauscher, S. Castillo, and M. Magnor, “Altering the conveyed facial emotion through automatic reenactment of video portraits,” in International Conference on Computer Animation and Social Agents.   Springer, 2020, pp. 128–135.
  • [94] K. Ali and C. E. Hughes, “All-in-one: Facial expression transfer, editing and recognition using a single network,” arXiv preprint arXiv:1911.07050, 2019.
  • [95] T. Xue, J. Yan, D. Zheng, and Y. Liu, “Semantic prior guided fine-grained facial expression manipulation,” Complex & Intelligent Systems, pp. 1–16, 2024.
  • [96] J. Shao and T. Bui, “Wp2-gan: Wavelet-based multi-level gan for progressive facial expression translation with parallel generators,” in Proc. British Mach. Vis. Conf. (BMVC), 2021.
  • [97] M. C. Doukas, S. Zafeiriou, and V. Sharmanska, “Headgan: One-shot neural head synthesis and editing,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 14 398–14 407.
  • [98] Y. Zhao, L. Yang, E. Pei, M. C. Oveneke, M. Alioscha-Perez, L. Li, D. Jiang, and H. Sahli, “Action unit driven facial expression synthesis from a single image with patch attentive gan,” in Computer Graphics Forum, vol. 40.   Wiley Online Library, 2021, pp. 47–61.
  • [99] Y. Xia, W. Zheng, Y. Wang, H. Yu, J. Dong, and F.-Y. Wang, “Local and global perception generative adversarial network for facial expression synthesis,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 3, pp. 1443–1452, 2021.
  • [100] J. Wang, J. Zhang, Z. Lu, and S. Shan, “Dft-net: Disentanglement of face deformation and texture synthesis for expression editing,” in 2019 IEEE International Conference on Image Processing (ICIP).   IEEE, 2019, pp. 3881–3885.
  • [101] A. Akram and N. Khan, “Sargan: Spatial attention-based residuals for facial expression manipulation,” IEEE Transactions on Circuits and Systems for Video Technology, 2023.
  • [102] X. Zhang, Y. Zhu, W. Chen, W. Liu, and L. Shen, “Gated switchgan for multi-domain facial image translation,” IEEE Transactions on Multimedia, vol. 24, pp. 1990–2003, 2021.
  • [103] A. Akram and N. Khan, “Us-gan: On the importance of ultimate skip connection for facial expression synthesis,” Multimedia Tools and Applications, vol. 83, no. 3, pp. 7231–7247, 2024.
  • [104] B. Azari and A. Lim, “Emostyle: One-shot facial expression editing using continuous emotion parameters,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 6385–6394.
  • [105] S. d’Apolito, D. P. Paudel, Z. Huang, A. Romero, and L. Van Gool, “Ganmut: Learning interpretable conditional space for gamut of emotions,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 568–577.
  • [106] Y. Peng and H. Yin, “Apprgan: Appearance-based gan for facial expression synthesis,” IET Image Processing, vol. 13, no. 14, pp. 2706–2715, 2019.
  • [107] L. Song, Z. Lu, R. He, Z. Sun, and T. Tan, “Geometry guided adversarial facial expression synthesis,” in Proceedings of the 26th ACM International Conference on Multimedia, 2018, pp. 627–635.
  • [108] J. Kong, H. Shen, and K. Huang, “Dualpathgan: Facial reenacted emotion synthesis,” IET Computer Vision, vol. 15, no. 7, pp. 501–513, 2021.
  • [109] J. Tang, Z. Shao, and L. Ma, “Fine-grained expression manipulation via structured latent space,” in 2020 IEEE International Conference on Multimedia and Expo (ICME).   IEEE, 2020, pp. 1–6.
  • [110] O. Patashnik, Z. Wu, E. Shechtman, D. Cohen-Or, and D. Lischinski, “Styleclip: Text-driven manipulation of stylegan imagery,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 2085–2094.
  • [111] F. Liu, H. Wang, J. Zhang, Z. Fu, A. Zhou, J. Qi, and Z. Li, “Evogan: An evolutionary computation assisted gan,” Neurocomputing, vol. 469, pp. 81–90, 2022.
  • [112] H. Tang, W. Wang, S. Wu, X. Chen, D. Xu, N. Sebe, and Y. Yan, “Expression conditional gan for facial expression-to-expression translation,” in 2019 IEEE International Conference on Image Processing (ICIP).   IEEE, 2019, pp. 4449–4453.
  • [113] S. Sola and D. Gera, “Unmasking your expression: Expression-conditioned gan for masked face inpainting,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 5907–5915.
  • [114] A. Lindt, P. Barros, H. Siqueira, and S. Wermter, “Facial expression editing with continuous emotion labels,” in 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019).   IEEE, 2019, pp. 1–8.
  • [115] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey, “Adversarial autoencoders,” arXiv preprint arXiv:1511.05644, 2015.
  • [116] S. E. Eskimez, Y. Zhang, and Z. Duan, “Speech driven talking face generation from a single image and an emotion condition,” IEEE Transactions on Multimedia, vol. 24, pp. 3480–3490, 2021.
  • [117] K. Vougioukas, S. Petridis, and M. Pantic, “Realistic speech-driven facial animation with gans,” International Journal of Computer Vision, vol. 128, no. 5, pp. 1398–1413, 2020.
  • [118] D. Zeng, H. Liu, H. Lin, and S. Ge, “Talking face generation with expression-tailored generative adversarial network,” in Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 1716–1724.
  • [119] Y. Gan, Z. Yang, X. Yue, L. Sun, and Y. Yang, “Efficient emotional adaptation for audio-driven talking-head generation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 22 634–22 645.
  • [120] S. Tan, B. Ji, and Y. Pan, “Flowvqtalker: High-quality emotional talking face generation through normalizing flow and quantization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 26 317–26 327.
  • [121] S. Tan, B. Ji, M. Bi et al., “Edtalk: Efficient disentanglement for emotional talking head synthesis,” arXiv preprint arXiv:2404.01647, 2024.
  • [122] X. Hu, N. Aldausari, and G. Mohammadi, “2cet-gan: Pixel-level gan model for human facial expression transfer,” in Proceedings of the 1st International Workshop on Multimedia Content Generation and Evaluation: New Methods and Practice, 2023, pp. 49–56.
  • [123] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo, “Stargan: Unified generative adversarial networks for multi-domain image-to-image translation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8789–8797.
  • [124] D. Zhu, S. Liu, W. Jiang, C. Gao, T. Wu, Q. Wang, and G. Guo, “Ugan: Untraceable gan for multi-domain face translation,” arXiv preprint arXiv:1907.11418, 2019.
  • [125] H. Ding, K. Sricharan, and R. Chellappa, “Exprgan: Facial expression editing with controllable expression intensity,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, 2018.
  • [126] G. Tesei, “Generating realistic facial expressions through conditional cycle-consistent generative adversarial networks (ccyclegan),” 2019.
  • [127] W. Wang, Q. Sun, Y. Fu, T. Chen, C. Cao, Z. Zheng, G. Xu, H. Qiu, Y.-G. Jiang, and X. Xue, “Comp-gan: Compositional generative adversarial network in synthesizing and recognizing facial expression,” in Proceedings of the 27th ACM International Conference on Multimedia, 2019, pp. 211–219.
  • [128] Z. Xu, T. Chen, Z. Yang et al., “Self-supervised emotion representation disentanglement for speech-preserving facial expression manipulation,” in Proceedings of the 32nd ACM International Conference on Multimedia, 2024.
  • [129] S. Tan, B. Ji, and Y. Pan, “Emmn: Emotional motion memory network for audio-driven emotional talking face generation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 22 146–22 156.
  • [130] ——, “Style2talker: High-resolution talking head generation with emotion style and art style,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, 2024, pp. 5079–5087.
  • [131] S. Zhai, M. Liu, Y. Li, Z. Gao, L. Zhu, and L. Nie, “Talking face generation with audio-deduced emotional landmarks,” IEEE Transactions on Neural Networks and Learning Systems, 2023.
  • [132] Z. Sheng, L. Nie, M. Zhang, X. Chang, and Y. Yan, “Stochastic latent talking face generation towards emotional expressions and head poses,” IEEE Transactions on Circuits and Systems for Video Technology, 2023.
  • [133] Y. Ma, S. Zhang, J. Wang, X. Wang, Y. Zhang, and Z. Deng, “Dreamtalk: When expressive talking head generation meets diffusion probabilistic models,” arXiv preprint arXiv:2312.09767, 2023.
  • [134] C. Zhang, C. Wang, J. Zhang, H. Xu, G. Song, Y. Xie, L. Luo, Y. Tian, X. Guo, and J. Feng, “Dream-talk: Diffusion-based realistic emotional audio-driven method for single image talking face generation,” arXiv preprint arXiv:2312.13578, 2023.
  • [135] Z. Sun, Y. H. Wen, T. Lv et al., “Continuously controllable facial expression editing in talking face videos,” IEEE Transactions on Affective Computing, 2023.
  • [136] H. Wang, X. Jia, and X. Cao, “Eat-face: Emotion-controllable audio-driven talking face generation via diffusion model,” in Proceedings of the 2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition (FG).   IEEE, 2024, pp. 1–10.
  • [137] K. Zhou, B. Sisman, M. Zhang, and H. Li, “Converting anyone’s emotion: Towards speaker-independent emotional voice conversion,” arXiv preprint arXiv:2005.07025, 2020.
  • [138] K. Zhou, B. Sisman, and H. Li, “Vaw-gan for disentanglement and recomposition of emotional elements in speech,” in 2021 IEEE Spoken Language Technology Workshop (SLT).   IEEE, 2021, pp. 415–422.
  • [139] ——, “Transforming spectrum and prosody for emotional voice conversion with non-parallel training data,” arXiv preprint arXiv:2002.00198, 2020.
  • [140] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2223–2232.
  • [141] C. Fu, C. Liu, C. T. Ishi, and H. Ishiguro, “An improved cyclegan-based emotional voice conversion model by augmenting temporal dependency with a transformer,” Speech Communication, vol. 144, pp. 110–121, 2022.
  • [142] G. Rizos, A. Baird, M. Elliott, and B. Schuller, “Stargan for emotional speech conversion: Validated by data augmentation of end-to-end emotion recognition,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2020, pp. 3502–3506.
  • [143] M. Elgaar, J. Park, and S. W. Lee, “Multi-speaker and multi-domain emotional voice conversion using factorized hierarchical variational autoencoder,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2020, pp. 7769–7773.
  • [144] J. Gao, D. Chakraborty, H. Tembine, and O. Olaleye, “Nonparallel emotional speech conversion,” arXiv preprint arXiv:1811.01174, 2018.
  • [145] Y. Chen, L. Yang, Q. Chen, J.-H. Lai, and X. Xie, “Attention-based interactive disentangling network for instance-level emotional voice conversion,” arXiv preprint arXiv:2312.17508, 2023.
  • [146] H.-S. Oh, S.-H. Lee, D.-H. Cho, and S.-W. Lee, “Durflex-evc: Duration-flexible emotional voice conversion with parallel generation,” arXiv preprint arXiv:2401.08095, 2024.
  • [147] Z. Du, B. Sisman, K. Zhou, and H. Li, “Expressive voice conversion: A joint framework for speaker identity and emotional style transfer,” in 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).   IEEE, 2021, pp. 594–601.
  • [148] A. H. Meftah, A. A. Alashban, Y. A. Alotaibi, and S.-A. Selouani, “English emotional voice conversion using stargan model,” IEEE Access, 2023.
  • [149] N. Shah, M. Singh, N. Takahashi, and N. Onoe, “Nonparallel emotional voice conversion for unseen speaker-emotion pairs using dual domain adversarial network & virtual domain pairing,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2023, pp. 1–5.
  • [150] Z. Du, B. Sisman, K. Zhou, and H. Li, “Disentanglement of emotional style and speaker identity for expressive voice conversion,” arXiv preprint arXiv:2110.10326, 2021.
  • [151] K. Zhou, B. Sisman, R. Rana, B. W. Schuller, and H. Li, “Emotion intensity and its control for emotional voice conversion,” IEEE Transactions on Affective Computing, vol. 14, no. 1, pp. 31–48, 2022.
  • [152] F. Kreuk, A. Polyak, J. Copet, E. Kharitonov, T. A. Nguyen, M. Rivière, W.-N. Hsu, A. Mohamed, E. Dupoux, and Y. Adi, “Textless speech emotion conversion using discrete & decomposed representations,” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022, pp. 11 200–11 214.
  • [153] H. Choi and M. Hahn, “Sequence-to-sequence emotional voice conversion with strength control,” IEEE Access, vol. 9, pp. 42 674–42 687, 2021.
  • [154] Y. Lei, S. Yang, and L. Xie, “Fine-grained emotion strength transfer, control and prediction for emotional speech synthesis,” in 2021 IEEE Spoken Language Technology Workshop (SLT).   IEEE, 2021, pp. 423–430.
  • [155] T. Li, C. Hu, J. Cong, X. Zhu, J. Li, Q. Tian, Y. Wang, and L. Xie, “Diclet-tts: Diffusion model based cross-lingual emotion transfer for text-to-speech—a study between english and mandarin,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023.
  • [156] N. Tits, K. E. Haddad, and T. Dutoit, “Exploring transfer learning for low resource emotional tts,” in Intelligent Systems and Applications: Proceedings of the 2019 Intelligent Systems Conference (IntelliSys) Volume 1.   Springer, 2020, pp. 52–60.
  • [157] B. Schnell and P. N. Garner, “Improving emotional tts with an emotion intensity input from unsupervised extraction,” in Proc. 11th ISCA Speech Synth. Workshop, 2021, pp. 60–65.
  • [158] P. Wu, Z. Ling, L. Liu, Y. Jiang, H. Wu, and L. Dai, “End-to-end emotional speech synthesis using style tokens and semi-supervised training,” in 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC).   IEEE, 2019, pp. 623–627.
  • [159] S.-Y. Um, S. Oh, K. Byun, I. Jang, C. Ahn, and H.-G. Kang, “Emotional speech synthesis with rich and granularized control,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2020, pp. 7254–7258.
  • [160] E. Hortal and R. B. Alarcia, “Gantron: Emotional speech synthesis with generative adversarial networks,” arXiv preprint arXiv:2110.03390, 2021.
  • [161] Y. Guo, C. Du, X. Chen, and K. Yu, “Emodiff: Intensity controllable emotional text-to-speech with soft-label guidance,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2023, pp. 1–5.
  • [162] T. Li, S. Yang, L. Xue, and L. Xie, “Controllable emotion transfer for end-to-end speech synthesis,” in 2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP).   IEEE, 2021, pp. 1–5.
  • [163] Y. Lei, S. Yang, X. Wang, and L. Xie, “Msemotts: Multi-scale emotion transfer, prediction, and control for emotional speech synthesis,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 853–864, 2022.
  • [164] T. Li, X. Wang, Q. Xie, Z. Wang, and L. Xie, “Cross-speaker emotion disentangling and transfer for end-to-end speech synthesis,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 1448–1460, 2022.
  • [165] X. Li, C. Song, J. Li, Z. Wu, J. Jia, and H. Meng, “Towards multi-scale style control for expressive speech synthesis,” arXiv preprint arXiv:2104.03521, 2021.
  • [166] H. Tang, X. Zhang, J. Wang, N. Cheng, and J. Xiao, “Emomix: Emotion mixing via diffusion models for emotional speech synthesis,” arXiv preprint arXiv:2306.00648, 2023.
  • [167] K. Matsumoto, S. Hara, and M. Abe, “Controlling the strength of emotions in speech-like emotional sound generated by wavenet,” in INTERSPEECH, 2020, pp. 3421–3425.
  • [168] T. Wang, J. Yi, R. Fu, J. Tao, Z. Wen, and C. Y. Zhang, “Emotion selectable end-to-end text-based speech editing,” Artificial Intelligence, vol. 329, p. 104076, 2024.
  • [169] Y. Ren, Y. Ruan, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, “Fastspeech: Fast, robust and controllable text to speech,” in Advances in Neural Information Processing Systems, vol. 32, 2019.
  • [170] T. Qi, S. Wang, C. Lu et al., “Towards realistic emotional voice conversion using controllable emotional intensity,” arXiv preprint arXiv:2407.14800, 2024.
  • [171] M. Kang, W. Han, S. J. Hwang, and E. Yang, “Zet-speech: Zero-shot adaptive emotion-controllable text-to-speech synthesis with diffusion and style-based models,” arXiv preprint arXiv:2305.13831, 2023.
  • [172] X. Zhu, S. Yang, G. Yang, and L. Xie, “Controlling emotion strength with relative attribute for end-to-end speech synthesis,” in 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).   IEEE, 2019, pp. 192–199.
  • [173] Y. Lei, S. Yang, X. Zhu, L. Xie, and D. Su, “Cross-speaker emotion transfer through information perturbation in emotional speech synthesis,” IEEE Signal Processing Letters, vol. 29, pp. 1948–1952, 2022.
  • [174] X. Cai, D. Dai, Z. Wu, X. Li, J. Li, and H. Meng, “Emotion controllable speech synthesis using emotion-unlabeled dataset with the assistance of cross-domain speech emotion recognition,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2021, pp. 5734–5738.
  • [175] S. Wang, J. Guðnason, and D. Borth, “Fine-grained emotional control of text-to-speech: Learning to rank inter-and intra-class emotion intensities,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2023, pp. 1–5.
  • [176] K. Zhou, B. Sisman, R. Rana, B. W. Schuller, and H. Li, “Speech synthesis with mixed emotions,” IEEE Transactions on Affective Computing, 2022.
  • [177] D. Diatlova and V. Shutov, “Emospeech: Guiding fastspeech2 towards emotional text to speech,” arXiv preprint arXiv:2307.00024, 2023.
  • [178] W. Guan, Y. Li, T. Li et al., “Mm-tts: Multi-modal prompt based style transfer for expressive text-to-speech synthesis,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 16, 2024, pp. 18 117–18 125.
  • [179] K. Zhou, Y. Zhang, S. Zhao et al., “Emotional dimension control in language model-based text-to-speech: Spanning a broad spectrum of human emotions,” arXiv preprint arXiv:2409.16681, 2024.
  • [180] H. Tang, X. Zhang, N. Cheng et al., “Ed-tts: Multi-scale emotion modeling using cross-domain emotion diarization for emotional speech synthesis,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2024, pp. 12 146–12 150.
  • [181] X. Jing, K. Zhou, A. Triantafyllopoulos et al., “Enhancing emotional text-to-speech controllability with natural language guidance through contrastive learning and diffusion models,” arXiv preprint arXiv:2409.06451, 2024.
  • [182] H. Shi, J. Wang, X. Zhang et al., “Rset: Remapping-based sorting method for emotion transfer speech synthesis,” in Proceedings of Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM) Joint International Conference on Web and Big Data.   Springer Nature Singapore, 2024, pp. 90–104.
  • [183] P. Riley, N. Constant, M. Guo, G. Kumar, D. C. Uthus, and Z. Parekh, “Textsettr: Few-shot text style extraction and tunable targeted restyling,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021, pp. 3786–3800.
  • [184] R. MohammadiBaghmolaei and A. Ahmadi, “Tet: Text emotion transfer,” Knowledge-Based Systems, vol. 262, p. 110236, 2023.
  • [185] Z. Jin, D. Jin, J. Mueller, N. Matthews, and E. Santus, “Imat: Unsupervised text attribute transfer via iterative matching and translation,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 3097–3109.
  • [186] X. Wu, T. Zhang, L. Zang, J. Han, and S. Hu, “Mask and infill: Applying masked language model to sentiment transfer,” in IJCAI, 2019.
  • [187] E. Malmi, A. Severyn, and S. Rothe, “Unsupervised text style transfer with padded masked language models,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 8671–8680.
  • [188] Z. Yang, Z. Hu, C. Dyer, E. P. Xing, and T. Berg-Kirkpatrick, “Unsupervised text style transfer using language models as discriminators,” Advances in Neural Information Processing Systems, vol. 31, 2018.
  • [189] T. Shen, T. Lei, R. Barzilay, and T. Jaakkola, “Style transfer from non-parallel text by cross-alignment,” Advances in Neural Information Processing Systems, vol. 30, 2017.
  • [190] R. Zhang, Z. Wang, K. Yin, and Z. Huang, “Emotional text generation based on cross-domain sentiment transfer,” IEEE Access, vol. 7, pp. 100 081–100 089, 2019.
  • [191] Y. Huang, W. Zhu, D. Xiong, Y. Zhang, C. Hu, and F. Xu, “Cycle-consistent adversarial autoencoders for unsupervised text style transfer,” in Proceedings of the 28th International Conference on Computational Linguistics, 2020, pp. 2213–2223.
  • [192] F. Luo, P. Li, P. Yang, J. Zhou, Y. Tan, B. Chang, Z. Sui, and X. Sun, “Towards fine-grained text sentiment transfer,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 2013–2022.
  • [193] Z. Li, G. Chen, R. Shao, D. Jiang, and L. Nie, “Enhancing the emotional generation capability of large language models via emotional chain-of-thought,” arXiv preprint arXiv:2401.06836, 2024.
  • [194] Z. Yang, Z. Ren, W. Yufeng, S. Peng, H. Sun, X. Zhu, and X. Liao, “Enhancing empathetic response generation by augmenting llms with small-scale empathetic models,” arXiv preprint arXiv:2402.11801, 2024.
  • [195] L. Sun, N. Xu, J. Wei, B. Yu, L. Bu, and Y. Luo, “Rational sensibility: Llm enhanced empathetic response generation guided by self-presentation theory,” arXiv preprint arXiv:2312.08702, 2023.
  • [196] Y.-J. Lee, C.-G. Lim, and H.-J. Choi, “Does gpt-3 generate empathetic dialogues? a novel in-context example selection method and automatic evaluation metric for empathetic dialogue generation,” in Proceedings of the 29th International Conference on Computational Linguistics, 2022, pp. 669–683.
  • [197] Y. Chen, X. Xing, J. Lin, H. Zheng, Z. Wang, Q. Liu, and X. Xu, “Soulchat: Improving llms’ empathy, listening, and comfort abilities through fine-tuning with multi-turn empathy conversations,” in Findings of the Association for Computational Linguistics: EMNLP 2023, 2023, pp. 1170–1183.
  • [198] Z. Song, X. Zheng, L. Liu, M. Xu, and X.-J. Huang, “Generating responses with a specific emotion in dialog,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 3685–3695.
  • [199] H. Zhou, M. Huang, T. Zhang, X. Zhu, and B. Liu, “Emotional chatting machine: Emotional conversation generation with internal and external memory,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, 2018.
  • [200] X. Kong, B. Li, G. Neubig, E. Hovy, and Y. Yang, “An adversarial approach to high-quality, sentiment-controlled neural dialogue generation,” arXiv preprint arXiv:1901.07129, 2019.
  • [201] M. Li, J. Zhang, X. Lu, and C. Zong, “Dual-view conditional variational auto-encoder for emotional dialogue generation,” ACM Transactions on Asian and Low-Resource Language Information Processing, vol. 21, no. 3, pp. 1–18, 2021.
  • [202] W. Xu, X. Gu, and G. Chen, “Generating emotional controllable response based on multi-task and dual attention framework,” IEEE Access, vol. 7, pp. 93 734–93 741, 2019.
  • [203] N. Asghar, P. Poupart, J. Hoey, X. Jiang, and L. Mou, “Affective neural response generation,” in Advances in Information Retrieval: 40th European Conference on IR Research, ECIR 2018.   Springer, 2018, pp. 154–166.
  • [204] P. Colombo, W. Witon, A. Modi, J. Kennedy, and M. Kapadia, “Affect-driven dialog generation,” arXiv preprint arXiv:1904.02793, 2019.
  • [205] C. Huang, O. R. Zaiane, A. Trabelsi, and N. Dziri, “Automatic dialogue generation with expressed emotions,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 2, 2018, pp. 49–54.
  • [206] H. Lin and Z. Deng, “Emotional dialogue generation based on transformer and conditional variational autoencoder,” in 2022 IEEE 21st International Conference on Ubiquitous Computing and Communications (IUCC/CIT/DSCI/SmartCNS).   IEEE, 2022, pp. 386–393.
  • [207] K. Wang and X. Wan, “Sentigan: Generating sentimental texts via mixture adversarial networks,” in IJCAI, 2018, pp. 4446–4452.
  • [208] ——, “Automatic generation of sentimental texts via mixture adversarial networks,” Artificial Intelligence, vol. 275, pp. 540–558, 2019.
  • [209] W. Chen, X. Chen, and X. Sun, “Emotional dialog generation via multiple classifiers based on a generative adversarial network,” Virtual Reality & Intelligent Hardware, vol. 3, no. 1, pp. 18–32, 2021.
  • [210] G. Bi, L. Shen, Y. Cao, M. Chen, Y. Xie, Z. Lin, and X. He, “Diffusemp: A diffusion model-based framework with multi-grained control for empathetic response generation,” arXiv preprint arXiv:2306.01657, 2023.
  • [211] J. Chen, Y. Wu, C. Jia, H. Zheng, and G. Huang, “Customizable text generation via conditional text generative adversarial network,” Neurocomputing, vol. 416, pp. 125–135, 2020.
  • [212] E. Reif, D. Ippolito, A. Yuan, A. Coenen, C. Callison-Burch, and J. Wei, “A recipe for arbitrary text style transfer with large language models,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2022, pp. 837–848.
  • [213] X. Yi, Z. Liu, W. Li, and M. Sun, “Text style transfer via learning style instance supported latent space,” in Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI), 2021, pp. 3801–3807.
  • [214] X. Li, G. Chen, C. Lin, and R. Li, “Dgst: A dual-generator network for text style transfer,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 7131–7136.
  • [215] A. Sancheti, K. Krishna, B. V. Srinivasan, and A. Natarajan, “Reinforced rewards framework for text style transfer,” in Advances in Information Retrieval: 42nd European Conference on IR Research, ECIR 2020, Lisbon, Portugal, April 14–17, 2020, Proceedings, Part I, vol. 42.   Springer, 2020, pp. 545–560.
  • [216] K. Chawla and D. Yang, “Semi-supervised formality style transfer using language model discriminator and mutual information maximization,” in Findings of the Association for Computational Linguistics: EMNLP 2020, 2020, pp. 2340–2354.
  • [217] Y. Liu, W. Maier, W. Minker, and S. Ultes, “Empathetic dialogue generation with pre-trained roberta-gpt2 and external knowledge,” in Conversational AI for Natural Human-Centric Interaction: 12th International Workshop on Spoken Dialogue System Technology, IWSDS 2021, Singapore.   Springer, 2022, pp. 67–81.
  • [218] Y. Qian, W. Zhang, and T. Liu, “Harnessing the power of large language models for empathetic response generation: Empirical investigations and improvements,” in The 2023 Conference on Empirical Methods in Natural Language Processing, 2023.
  • [219] M. Firdaus, H. Chauhan, A. Ekbal, and P. Bhattacharyya, “Emosen: Generating sentiment and emotion controlled responses in a multimodal dialogue system,” IEEE Transactions on Affective Computing, vol. 13, no. 3, pp. 1555–1566, 2020.
  • [220] N. Majumder, P. Hong, S. Peng, J. Lu, D. Ghosal, A. Gelbukh, R. Mihalcea, and S. Poria, “Mime: Mimicking emotions for empathetic response generation,” arXiv preprint arXiv:2010.01454, 2020.
  • [221] S. Sabour, C. Zheng, and M. Huang, “Cem: Commonsense-aware empathetic response generation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, 2022, pp. 11 229–11 237.
  • [222] J. Li, X. Sun, X. Wei, C. Li, and J. Tao, “Reinforcement learning based emotional editing constraint conversation generation,” arXiv preprint arXiv:1904.08061, 2019.
  • [223] Y. Li and B. Wu, “Emotional dialogue generation with generative adversarial networks,” in 2020 IEEE 4th Information Technology, Networking, Electronic and Automation Control Conference (ITNEC), vol. 1.   IEEE, 2020, pp. 868–873.
  • [224] Z. Gu, Q. Zhu, H. He et al., “Multi-level knowledge-enhanced prompting for empathetic dialogue generation,” in Proceedings of the 27th International Conference on Computer Supported Cooperative Work in Design (CSCWD).   IEEE, 2024, pp. 3170–3175.
  • [225] X. Chen, C. Yang, M. Lan et al., “Cause-aware empathetic response generation via chain-of-thought fine-tuning,” arXiv preprint arXiv:2408.11599, 2024.
  • [226] X. Ji, H. Zhou, K. Wang, W. Wu, C. C. Loy, X. Cao, and F. Xu, “Audio-driven emotional video portraits,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 14 080–14 089.
  • [227] Z. Guo, Y. Xie, W. Xie, P. Huang, F. Ma, and F. R. Yu, “Gaussianpu: A hybrid 2d-3d upsampling framework for enhancing color point clouds via 3d gaussian splatting,” arXiv preprint arXiv:2409.01581, 2024.
  • [228] T. Cornille, F. Wang, and J. Bekker, “Interactive multi-level prosody control for expressive speech synthesis,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2022, pp. 8312–8316.
  • [229] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio et al., “Tacotron: Towards end-to-end speech synthesis,” arXiv preprint arXiv:1703.10135, 2017.
  • [230] V. Sorin, D. Brin, Y. Barash, E. Konen, A. Charney, G. Nadkarni, and E. Klang, “Large language models (llms) and empathy—a systematic review,” medRxiv preprint, 2023.
  • [231] T. Baltrušaitis, C. Ahuja, and L.-P. Morency, “Multimodal machine learning: A survey and taxonomy,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 2, pp. 423–443, 2018.
  • [232] H. Hou, P. Zeng, F. Ma, and F. R. Yu, “Visualrwkv: Exploring recurrent neural networks for visual language models,” arXiv preprint arXiv:2406.13362, 2024.
  • [233] S. Nyatsanga, T. Kucherenko, C. Ahuja, G. E. Henter, and M. Neff, “A comprehensive review of data-driven co-speech gesture generation,” Computer Graphics Forum, vol. 42, pp. 569–596, 2023.
  • [234] C. Papoutsi, A. Drigas, and C. Skianis, “Virtual and augmented reality for developing emotional intelligence skills,” Int. J. Recent Contrib. Eng. Sci. IT (IJES), vol. 9, no. 3, pp. 35–53, 2021.
  • [235] K. Cao, Y. Liu, G. Meng, and Q. Sun, “An overview on edge computing research,” IEEE Access, vol. 8, pp. 85 714–85 728, 2020.
  • [236] F. J. Dian, R. Vahidnia, and A. Rahmati, “Wearables and the internet of things (iot), applications, opportunities, and challenges: A survey,” IEEE Access, vol. 8, pp. 69 200–69 211, 2020.
  • [237] M. Izani, A. Razak, D. Rehad, and M. Rosli, “The impact of artificial intelligence on animation filmmaking: Tools, trends, and future implications,” in 2024 International Visualization, Informatics and Technology Conference (IVIT).   IEEE, 2024, pp. 57–62.
  • [238] A. Channa, A. Sharma, M. Singh, P. Malhotra, A. Bajpai, and P. Whig, “Revolutionizing filmmaking: A comparative analysis of conventional and ai-generated film production in the era of virtual reality,” Journal of Autonomous Intelligence, vol. 7, no. 4, 2024.
  • [239] H. Sun and Y. He, “The application and user acceptance of aigc in network audiovisual field: Based on the perspective of social cognitive theory,” in Proceedings of the 7th International Conference on Computer Science and Application Engineering, 2023, pp. 1–5.