
A Comprehensive Survey on Multi-modal Conversational Emotion Recognition with Deep Learning

Yuntao Shou, Tao Meng, and Wei Ai (Central South University of Forestry and Technology, Changsha, Hunan, China, 410004); Guinan Guo (Sun Yat-sen University, Guangzhou, Guangdong, China, 510275); Nan Yin (Mohamed bin Zayed University of Artificial Intelligence, Masdar City, Abu Dhabi, UAE, 100701); and Keqin Li (State University of New York, New Paltz, New York, USA, 12561)
Abstract.

Multi-modal conversation emotion recognition (MCER) aims to recognize and track the speaker’s emotional state using text, speech, and visual information in a conversation scene. Analyzing and studying MCER problems is of great significance to affective computing, intelligent recommendation, and human-computer interaction. Unlike traditional single-utterance multi-modal emotion recognition or single-modal conversation emotion recognition, MCER is a more challenging problem that must deal with more complex emotional interaction relationships. The critical issue is learning consistent and complementary semantics for multi-modal feature fusion based on these emotional interaction relationships. To address this problem, extensive research has been conducted on MCER based on deep learning technology, but there is still a lack of a systematic review of the modeling methods. Therefore, a timely and comprehensive overview of MCER’s recent advances in deep learning is of great significance to academia and industry. In this survey, we provide a comprehensive overview of MCER modeling methods and roughly divide MCER methods into four categories, i.e., context-free modeling, sequential context modeling, distinguishing-speaker modeling, and speaker-relationship modeling. In addition, we further discuss MCER’s publicly available popular datasets, multi-modal feature extraction methods, application areas, existing challenges, and future development directions. We hope that our review can help MCER researchers understand the current research status in emotion recognition, provide some inspiration, and develop more efficient models.

Multi-modal conversational emotion recognition, Deep Learning, Multi-modal datasets, Multi-modal feature fusion, Multi-modal feature extraction
journal: JACM; journal volume: 37; journal number: 4; article: 111; publication month: 8; ccs: General and reference, Surveys and overviews; ccs: Human-centered computing, Natural language interfaces

1. Introduction

With the development of the mobile Internet, social media has become the main platform for people to communicate with each other (Park et al., 2016). Users can fully express their emotions through multi-modal data such as text, voice, image, and video. Building a multi-modal conversational emotion recognition model from multi-modal data is of vital practical significance for understanding users’ true emotional intentions (Ghosh and Anwar, 2021). Therefore, researchers have been trying to give machines the ability to understand emotions in recent years (Li et al., 2022a), (Zhu et al., 2021).

Before the rise of multi-modal conversation emotion recognition research, early conversational emotion recognition methods (Kim, 2014; Tzinis et al., 2018; Zhong et al., 2019; Lotfian and Busso, 2019) mainly focused on single-modal text or speech, and they only considered the context dependencies between texts (or speech segments) and the semantic information of the words (or audio) themselves to analyze and recognize emotions (Meng et al., 2021), (Ying et al., 2021), (Shou et al., 2022b). However, only extracting the emotional information contained in the text data may not be enough for the model to understand the emotion expressed by the speaker, because speakers may be reserved when expressing their opinions (Zhu et al., 2023; Ghosal et al., 2019). For example, a speaker may express anger in a veiled way, which can result in a seemingly neutral utterance. In response to the above problems, multi-modal conversational emotion recognition (MCER) technology was proposed to address the insufficient expression of textual semantic information (Shou et al., 2023b), (Shou et al., 2023a), (Ying et al., 2021). As shown in Fig. 1, MCER aims to extract complementary semantic information within and between modalities and identify the emotions expressed by speakers from text, audio, and video. The advantage of MCER is that when the emotional polarity of the text is insufficiently expressed (Lian et al., 2023), the model can use visual information (e.g., facial expressions) and audio information (e.g., tone) to enhance its emotional understanding of the text (Yin et al., 2022), (Yin et al., 2023a), (Yin et al., 5555), (Yin et al., 2023b).

Figure 1. An example of a multimodal conversation emotion recognition dataset which contains three modal features: video, audio, and text. The task of MCER is to identify the emotion label of each speaker at the current moment based on the utterance content (e.g., neutral, angry, surprised, etc.).

However, unlike traditional single-utterance multi-modal emotion recognition or single-modal conversation emotion recognition, MCER is a more challenging issue that requires consideration of factors such as multi-modal context, dialogue scenarios, speaker’s own emotional inertia, and interlocutor stimulation (Chudasama et al., 2022; Zhang et al., 2020). Powerful deep learning technology (Shou et al., 2022a) enables MCER to recognize emotion by fusing semantic features with complex emotional interactions. Feature fusion in MCER mainly considers intra-modal contextual semantic information fusion and inter-modal complementary semantic information fusion (Huang et al., 2020). On the one hand, intra-modal contextual semantic information fusion refers to extracting the temporal and spatial dependencies of speaker feature representations in each modality. On the other hand, complementary semantic information fusion between modalities refers to using the interactive information between different modalities to enhance the emotional understanding ability of the model. MCER synergistically improves the effect of emotion recognition by fusing the characteristics of various modal data, which has important theoretical significance for processing and understanding multi-modal data (Yang et al., 2022; Hou et al., 2023; Zhang et al., 2023a).

Although more and more researchers are devoted to developing new models and methods for multi-modal conversation emotion recognition (Ghosal et al., 2018; Chudasama et al., 2022; Liu et al., 2022), there is still a lack of understanding of the theoretical and methodological classification of multi-modal conversational emotion recognition, especially methods based on deep learning. To the best of our knowledge, this survey is the first comprehensive survey focusing on deep learning for multi-modal conversation emotion recognition. Existing surveys mainly focus on single-utterance multi-modal emotion recognition (Zhu et al., 2023) or single-modal conversation emotion recognition (Deng and Ren, 2021). Therefore, with the rapid penetration of deep learning into various fields, multi-modal conversation emotion recognition methods based on deep learning have become a research hotspot and require a timely and comprehensive investigation.

Based on the above analysis, our survey summarizes the research work on multi-modal conversational emotion recognition. We first list some publicly available and popular datasets in the field of MCER and the commonly used feature extraction methods for each modality. Second, as shown in Fig. 2, we roughly divide the MCER methods into four categories, i.e., context-free modeling, sequential context modeling, distinguishing-speaker modeling, and speaker-relationship modeling. Third, we provide evaluation metrics for MCER experiments. Fourth, we discuss the applications and open problems of MCER. Finally, we illustrate future research directions.

The contributions made in this paper are summarized as follows:

  • New Taxonomy: We provide a new taxonomy for multi-modal conversational emotion recognition. Specifically, we classify existing MCER methods into four groups: context-free modeling, sequential context modeling, distinguishing-speaker modeling, and speaker-relationship modeling.

  • Comprehensive Review: This paper provides the most comprehensive review of deep learning and machine learning algorithms for MCER. For each modeling approach, we provide representative models and make corresponding comparisons.

  • Abundant Resources: We collect relevant resources about MCER, including state-of-the-art models and publicly available datasets. This paper can serve as a practical guide for learning and developing different emotion recognition algorithms.

  • Future Directions: We analyze the limitations of existing MCER methods and propose possible future research directions in many aspects, such as the collaborative generation of multi-modal data, the deep fusion of multi-modal features, and the unbiased learning of multi-modal emotions.

The paper is organized as follows: Section 2 summarizes the publicly available and popular datasets in the field of MCER. Section 3 illustrates the background, definitions, and commonly used feature extraction techniques for MCER. Section 4 broadly divides MCER methods into four categories and analyzes their advantages and disadvantages. Section 5 summarizes some commonly used evaluation metrics for MCER tasks. Section 6 gives the performance of different algorithms on the IEMOCAP and MELD datasets. Section 7 discusses the real-life applications of MCER. Section 8 illustrates the problems of existing research, and Section 9 gives directions for future research. Finally, we conclude the work of this paper.

Figure 2. A taxonomy of modeling approaches for multi-modal conversational emotion recognition. We categorize existing MCER methods into four categories, i.e., context-free modeling, sequential context modeling, distinguishing-speaker modeling, and speaker-relationship modeling.

2. Popular Benchmark Datasets

Table 1 presents seven publicly available emotion recognition benchmark datasets. For each dataset, we report the release year, the modalities, and the download URL. As shown in Table 2, we also report the distribution of each dataset over the different emotion labels; the label distributions exhibit a long-tail pattern.

Table 1. Publicly available benchmark datasets in multi-modal conversational emotion recognition.
Datasets Year Modality Available at
IEMOCAP (Busso et al., 2008) 2008 Text,Video,Audio https://sail.usc.edu/iemocap/
MELD (Poria et al., 2019) 2019 Text,Video,Audio https://web.eecs.umich.edu/~mihalcea/downloads/MELD.Raw.tar.gz
DailyDialog (Li et al., 2017) 2017 Text https://huggingface.co/datasets/daily_dialog
EmoryNLP (Zahiri and Choi, 2017) 2017 Text https://github.com/emorynlp/character-mining
SEMAINE (McKeown et al., 2011) 2012 Text,Video,Audio https://semaine-db.eu/
EmotionLines (Hsu et al., 2018) 2018 Text https://doraemon.iis.sinica.edu.tw/emotionlines/index.html
EmoContext (Chatterjee et al., 2019) 2019 Text https://www.humanizing-ai.com/emocontext.html
Table 2. Distribution of seven conversational emotion recognition datasets on different emotion labels.
Labels IEMOCAP MELD EmoContext EmotionLines EmoryNLP DailyDialog SEMAINE
Neutral 1,708 6,436 - 6,530 15,104 85,572 -
Happiness/Joy 648 2,308 4,669 1,710 11,020 12,885 93
Surprise/Powerful - 1,636 - 1,658 4,252 1,823 -
Sadness 1,084 1,002 5,838 498 3,376 1,150 58
Anger/Mad 1,103 1,607 5,954 772 5,328 1,022 41
Disgust - 361 - 338 - 354 7
Fear/Scared - 358 - 255 6,584 74 3
Frustrated 1,849 - - - - - -
Excited 1,041 - - - - - -
Other - - 21,960 - 4,760 - 197

2.1. IEMOCAP

The IEMOCAP dataset (Busso et al., 2008) was released in 2008 and contains 12.46 hours of conversations. The IEMOCAP dataset contains three modalities, i.e., video, audio, and text, and it is the first multi-modal dataset for MCER. In the IEMOCAP dataset, ten theater actors express specific emotion categories (i.e., sad, neutral, frustrated, angry, happy, excited) through dyadic dialogues. To ensure the consistency and accuracy of annotation, each sentence is annotated by multiple experts.

2.2. MELD

The MELD dataset (Poria et al., 2019) is derived from the classic TV series Friends and contains text, video, and audio data. The MELD dataset contains a total of 13,709 video clips, and each sentence is labeled with a specific emotion (i.e., anger, neutral, fear, disgust, surprise, joy, sadness). In addition, the MELD dataset is also annotated with three sentiment categories: neutral, negative, and positive. To ensure the consistency and accuracy of annotation, each sentence is annotated by multiple experts.

2.3. DailyDialog

The DailyDialog dataset (Li et al., 2017) is a multi-turn dialogue dataset about daily chat scenarios, and it only contains the text modality. The DailyDialog dataset contains 13,000 dialogues, and each sentence is labeled with an intention (i.e., inform, question, directive, commissive) and an emotion (surprise, sadness, fear, happiness, disgust, anger). Each sentence is annotated jointly by three experts.

2.4. EmoryNLP

EmoryNLP (Zahiri and Choi, 2017) is a unimodal dataset containing only the text modality. The EmoryNLP dataset contains 12,606 utterances, and each utterance is annotated with one of seven emotions: peaceful, scared, mad, powerful, sad, joyful, and neutral. The EmoryNLP dataset is divided into training, validation, and test sets.

2.5. SEMAINE

SEMAINE (McKeown et al., 2011) is a multi-modal conversation dataset that contains dyadic conversations between humans and four artificial agents. The SEMAINE dataset has 95 dialogues with a total of 5,798 utterances. Four emotional dimensions are annotated: Valence, Arousal, Expectancy, and Power. Valence, Arousal, and Expectancy are continuous values in the range [-1, 1], and the size of the SEMAINE dataset is small.

2.6. EmotionLines

The EmotionLines dataset (Hsu et al., 2018) comes from dyadic conversations in the TV series Friends and from Facebook messenger chats, and it only contains text data. The EmotionLines dataset contains 1,000 dialogues with a total of 29,245 sentences. Seven categories of emotions are annotated: neutral, fear, surprise, sadness, anger, happiness, and disgust. The EmotionLines dataset is rarely used in conversational emotion recognition.

2.7. EmoContext

The EmoContext dataset (Chatterjee et al., 2019) only contains text data. It has a total of 38,421 dialogues and a total of 115,263 sentences. Three types of emotions are annotated: happiness, sadness, and anger. Although the EmoContext dataset is large, it is rarely used in conversational emotion recognition because it only contains text data.

Figure 3. Timeline of multimodal conversational emotion recognition algorithms.

3. Background, Definition, and Feature Extraction

3.1. Background

As shown in Fig. 3, we summarize multi-modal conversational emotion recognition algorithms from 2000 to 2023. As can be seen from the figure, before 2018 traditional machine learning algorithms were dominant, and deep learning algorithms have gradually become the mainstream since then. Next, we briefly review the main development history of MCER algorithms.

3.1.1. Brief History of Conversational Emotion Recognition

Dictionary-based emotion recognition methods were the earliest approaches to emotion recognition (Hardeniya and Borikar, 2016), which motivated early work on the Naive Bayes method (Frank et al., 2000). With the widespread application of machine learning algorithms to classification tasks, representative machine learning classifiers (e.g., SVM (Rozgić et al., 2012), (Hu et al., 2007) and binary decision trees (Cichosz and Slot, 2007), (Lee et al., 2011), (Liu et al., 2018b)) also began to shine in the field of emotion recognition. These methods determine the emotion category by learning the polarity and occurrence frequency of emotional words in the text, which makes it difficult to extract semantic and contextual information.

Encouraged by the success of CNNs in computer vision (CV), CNNs began to be migrated to text classification tasks and received extensive research attention (Khare and Bajaj, 2020), (Kwon et al., 2021), (Kollias and Zafeiriou, 2020). In 2017, Poria et al. (Poria et al., 2017) used LSTM for the first time to model dependencies between contextual utterances. Since then, improvements, extensions, and applications of LSTMs and GRUs have increased (Hazarika et al., 2018b), (Majumder et al., 2019), (Hazarika et al., 2018a), (Rajamani et al., 2021). More recently, many GNN-based methods (e.g., (Ghosal et al., 2019), (Ishiwatari et al., 2020), (Shen et al., 2021), (Li et al., 2023), (Zadeh et al., 2018b)) have emerged. Apart from CNNs, RNNs, and GNNs, many Transformer-based methods (e.g., (Zhong et al., 2019), (Li et al., 2020), (Zhu et al., 2021)) have also been developed in recent years. We detail the categories to which these algorithms belong in Section 4.

3.1.2. Multi-modal Conversational Emotion Recognition Versus Traditional Machine Learning

MCER methods based on traditional machine learning (Rozgić et al., 2012), (Cichosz and Slot, 2007), (Lee et al., 2011), (Liu et al., 2018b), (Lin and Wei, 2005), (Hu et al., 2007) rely heavily on hand-crafted features and have attracted increasing attention from the data mining and emotion recognition communities. These methods aim to learn feature embeddings of the raw data for subsequent downstream tasks such as classification and clustering. A classic machine-learning-based conversational emotion recognition method uses a support vector machine to map emotional features into a feature space and separate them with a hyperplane (Rozgić et al., 2012), (Bhavan et al., 2019). However, these methods require a large amount of high-quality labeled data.

3.1.3. Conversational Emotion Recognition Versus Convolutional Neural Network

CNN-based emotion recognition methods (Lu et al., 2023), (Kwon et al., 2021), (Kollias and Zafeiriou, 2020) were historically the first deep learning methods to address the emotion classification problem (Kim, 2014). These methods employ convolutional filters to extract semantic features of text so that the model can learn the meaning of the text in a supervised manner. Similar to machine learning algorithms, a CNN also maps emotional features into a vector space through mapping functions; the difference is that this mapping function is learned in an end-to-end manner. Since convolution kernels only extract local receptive field information, CNNs cannot capture contextual semantic information.

3.1.4. Multi-modal Conversational Emotion Recognition Versus Recurrent Neural Network

RNN-based emotion recognition methods (Ma et al., 2019), (Tao and Liu, 2018), (Majumder et al., 2019), (Kollias and Zafeiriou, 2020) were developed on the basis of CNNs, but they assume that contextual utterances influence and depend on each other (Ma et al., 2019), (Tao and Liu, 2018). These RNN-based methods usually use LSTM or GRU units (to mitigate vanishing or exploding gradients) to extract context-aware semantic features. Similar to CNNs, RNNs also map emotional features into a vector space through mapping functions learned in an end-to-end manner.

3.1.5. Multi-modal Conversational Emotion Recognition Versus Transformer

Similar to RNN-based methods, Transformer-based emotion recognition methods (Huang et al., 2020), (Lian et al., 2021), (Tsai et al., 2019), (Rahman et al., 2020) also extract context-aware semantic information and perform emotion classification on top of it (Lian et al., 2021). However, the Transformer’s sequential context modeling ability is stronger than that of RNNs, so the accuracy of Transformer-based emotion recognition methods is generally significantly better than that of RNN-based methods.

3.1.6. Multi-modal Conversational Emotion Recognition Versus Graph Neural Network

GNN-based emotion recognition methods (Lin et al., 2022), (Zhang et al., 2023a), (Li et al., 2023), (Ishiwatari et al., 2020), (Shen et al., 2021) inherit the idea of RNN methods, i.e., that contextual utterances should interact with and depend on each other (Ghosal et al., 2019). On this basis, GNN methods further assume that speakers also influence each other. Therefore, GNNs model the dialogue relationships between speakers through the inherent properties of the graph structure.

Table 3. Some symbols commonly used in the paper.
Notations Descriptions
$|\cdot|$ The length of a set.
$\odot$ Element-wise product.
$\mathcal{G}$ A graph.
$\mathcal{V}$ The set of nodes in a graph.
$v$ A node $v\in\mathcal{V}$.
$\mathcal{E}$ The set of edges in a graph.
$e_{ij}$ An edge $e_{ij}\in\mathcal{E}$.
$N(v)$ The neighbors of a node $v$.
$S$ A speaker.
$u$ An utterance.
$K$ The context window size.
$M$ The number of speakers.
$L$ The number of utterances in a dialogue.
$U$ The set of contextual utterances.
$\mathcal{R}$ The type of edge.
$\mathcal{W}$ Learnable parameters.
$\mathbf{A}$ The adjacency matrix of a graph.
$m$ The node properties of the graph.
$x^{t}\in\mathbb{R}^{d}$ $d$-dimensional text feature vector.
$x^{a}\in\mathbb{R}^{k}$ $k$-dimensional audio feature vector.
$x^{v}\in\mathbb{R}^{h}$ $h$-dimensional video feature vector.
$x$ Concatenated video, audio, and text feature vector.

3.2. Definitions and Preliminaries

The symbols used in this paper are listed in Table 3. Now, we define the sets needed to understand this paper. In particular, we use uppercase letters for matrices and lowercase letters for vectors.

Definition 1 (Utterance context) The multi-modal conversational emotion recognition task aims to recognize the emotional changes (e.g., happiness, sadness, etc.) of the speakers $\{S_{1},S_{2},\ldots,S_{M}\}$ at the current moment $t$ in a dialogue. $L$ represents the number of utterances in a dialogue, and $U=\{u_{1},u_{2},\ldots,u_{L}\}$ represents the set of contextual utterances.

The MCER task aims to correctly classify each utterance by incorporating contextual information. At the current moment $t$, the model needs to infer the speaker’s emotion based on the context information $\{u_{1},u_{2},\ldots,u_{t-1}\}$. We assume that the context window size is set to $K$. The set of contextual utterances is defined as follows:

(1) $C_{\lambda}=\left\{u_{i}\mid i\in[t-K,t-1],\ u_{i}\in U_{\lambda},\ |C_{\lambda}|\leq K\right\}$

When the context window size is 6, the speaker’s contextual utterances and predicted utterances are shown in Table 4.
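To make the context-window definition in Eq. (1) concrete, the following minimal Python sketch (the toy dialogue and function name are illustrative, not taken from any dataset) collects the at most $K$ preceding utterances used as context:

```python
def context_window(utterances, t, K):
    """Return the context for the utterance at index t (0-based), with size at most K (Eq. 1)."""
    start = max(0, t - K)
    return utterances[start:t]

dialogue = ["u1", "u2", "u3", "u4", "u5", "u6", "u7"]   # a toy dialogue
print(context_window(dialogue, t=6, K=6))               # context for u7: ['u1', ..., 'u6']
```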

Definition 2 (Dialogue graph) A dialogue graph is represented as $\mathcal{G}=\{\mathcal{V},\mathcal{E},\mathcal{R},\mathcal{W}\}$, where $\mathcal{V}$ is the set of nodes in the graph and $\mathcal{E}$ is the set of edges. $v_{i}\in\mathcal{V}$ represents the $i$-th node, $e_{ij}=(v_{i},v_{j})\in\mathcal{E}$ represents a directed edge from $v_{i}$ to $v_{j}$, and the relation $r_{ij}\in\mathcal{R}$ indicates that there is a dialogue relationship between nodes $v_{i}$ and $v_{j}$. The neighbor nodes of node $v$ are represented as $N(v)=\{u\in\mathcal{V}\mid(v,u)\in\mathcal{E}\}$. $\mathbf{A}\in\mathbb{R}^{n\times n}$ is the adjacency matrix with $\mathbf{A}_{ij}=1$ if $e_{ij}\in\mathcal{E}$ and $\mathbf{A}_{ij}=0$ otherwise. $\mathbf{X}\in\mathbb{R}^{m\times m}$ represents the node properties of the graph. For the GNN-based MCER task, each speaker utterance is regarded as a node of the graph, and the dialogue relationship between speakers is regarded as an edge of the graph.

Definition 3 (Problem definition) For a given multi-modal utterance sequence $U$, the MCER task uses the utterance context information to learn a deep neural network $F(u_{i})$ such that the predicted emotion label $\hat{y}_{i}$ is as close as possible to the real emotion label $y_{i}$, $i\in\{1,\ldots,L\}$. Deep neural networks solve for the optimal parameters by minimizing the following loss:

(2) $\min_{F}\frac{1}{L}\sum_{i=1}^{L}\mathcal{L}\left(\hat{y}_{i}=F\left(u_{i}\right),y_{i}\right)$

where $L$ represents the number of utterances in the dialogue, and $\mathcal{L}$ is an indicator function.

Table 4. We assume that there are three speakers in a dialogue and that the context window size $K$ is set to 6. The dialogue process is as follows:
         Speaker          Utterances          Description
         $C_{a}$, $C_{b}$, $C_{c}$          $u_{1}^{a},u_{3}^{a}$, $u_{2}^{b},u_{5}^{b}$, $u_{4}^{c},u_{6}^{c}$          Contextual utterances
         $S_{b}$          $u_{7}^{b}$          Predicted utterance

From the development history and related preliminary definitions of MCER, it can be seen that the process of multi-modal conversation emotion recognition mainly includes three aspects: multi-modal feature extraction, multi-modal feature fusion representation, and emotion classification. The overall process is shown in Fig. 4, and we will provide a comprehensive overview of these three aspects below.

Figure 4. The general MCER pipeline mainly includes multi-modal feature extraction, multi-modal emotion representation, and an emotion classifier.

3.3. Multi-modal Feature Extraction

Multi-modal feature extraction (e.g., from text, video, and audio) is one of the key techniques for emotion analysis. In this section, we introduce how feature extraction methods preprocess text, video, and audio data, and list some commonly used feature extraction methods. As shown in Table 5, we summarize the multi-modal feature extraction techniques used by many deep learning methods.

Table 5. Feature extraction methods for text, video and audio features used by different emotion recognition techniques.
Methods Text Video Audio Methods Text Video Audio
THMM (Morency et al., 2011) Polarized words OKAO Vision OpenEAR CMN (Hazarika et al., 2018b) TEXT-CNN 3D-CNN openSMILE
SVM (Pérez-Rosas et al., 2013) Bag-of-words CERT OpenEAR Att-BiLSTM (Poria et al., 2017) TEXT-CNN 3D-CNN openSMILE
MKL (Poria et al., 2015) Word2vec CLM-Z openSMILE ICON (Hazarika et al., 2018a) TEXT-CNN 3D-CNN openSMILE
SAL-CNN (Wang et al., 2017) Word2vec CLM-Z COVAREP DialogueRNN (Majumder et al., 2019) TEXT-CNN 3D-CNN openSMILE
TFN (Zadeh et al., 2017) GLOVE Facet COVAREP DialogueGCN (Ghosal et al., 2019) TEXT-CNN 3D-CNN openSMILE
LMF (Liu et al., 2018a) GLOVE Facet COVAREP COIN (Zhang and Chai, 2021) TEXT-CNN 3D-CNN openSMILE
HFFN (Mai et al., 2019a) GLOVE Facet COVAREP CESTa (Wang et al., 2020b) TEXT-CNN 3D-CNN openSMILE
LMFN (Mai et al., 2019b) GLOVE Facet COVAREP EmoCaps (Li et al., 2022b) BERT 3D-CNN openSMILE
GME-LSTM (Mai et al., 2019b) GLOVE Facet COVAREP MM-DFN (Hu et al., 2022a) TEXT-CNN 3D-CNN openSMILE
MARN (Zadeh et al., 2018c) GLOVE Facet COVAREP M2FNet (Chudasama et al., 2022) RoBERTa MTCNN Mel Spectrograms
MFN (Zadeh et al., 2018a) GLOVE Facet COVAREP GraphCFC (Li et al., 2023) TEXT-CNN 3D-CNN openSMILE
RAVEN (Wang et al., 2019) GLOVE Facet COVAREP UniMSE (Hu et al., 2022b) T5 3D-CNN openSMILE
SWRM (Wu et al., 2022) BERT Facet COVAREP EmotionIC (Yingjian et al., 2023) TEXT-CNN 3D-CNN openSMILE
MCTN (Pham et al., 2019) GLOVE Facet COVAREP SACL-LSTM (Hu et al., 2023) RoBERTa 3D-CNN openSMILE
MulT (Tsai et al., 2019) GLOVE Facet COVAREP HyCon (Mai et al., 2022) BERT Facet COVAREP
MAG (Rahman et al., 2020) BERT Facet COVAREP HGraph-CL (Lin et al., 2022) BERT Facet COVAREP
ICDN (Zhang et al., 2023b) GLOVE Facet COVAREP bc-LSTM (Poria et al., 2017) TEXT-CNN 3D-CNN openSMILE
AMOA (Li et al., 2022c) BERT OpenFace 2.0 openSMILE MMMU-BA (Ghosal et al., 2018) GLOVE Facet COVAREP
ICCN (Sun et al., 2020) BERT Facet COVAREP MISA (Hazarika et al., 2020) BERT Facet COVAREP

3.3.1. Text Feature Extraction

With the rapid development of deep learning technology, word embedding techniques have been widely used to extract text features. Word embedding techniques use a shallow neural network to learn the semantic information of words and measure the similarity between words by their distance in the embedding space. Unlike the traditional one-hot encoding method, word embeddings map high-dimensional sparse feature vectors to low-dimensional dense feature vectors, which saves a large amount of computing resources and avoids the problem that one-hot encoding cannot capture the semantic relationships between words. A commonly used word embedding method is Word2Vec (Chen et al., 2019), which has two variants: CBOW (Ghosh and Anwar, 2021) and Skip-gram (Du et al., 2020). CBOW predicts the central word from the surrounding words, while Skip-gram predicts the surrounding words from the central word. Although these methods can capture the semantic similarity between words, they require large datasets for training.

Some recent studies use simple TextCNN (Kim, 2014) and GloVe (Gan et al., 2022) embeddings to extract text features. In addition, large-scale pre-trained language models such as BERT (Ma et al., 2021) and RoBERTa (Kim et al., 2021) are often used, which capture contextual information through attention mechanisms.
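As a hedged illustration of how such pre-trained models are typically used for utterance-level text features, the sketch below extracts a BERT [CLS] embedding per utterance with the Hugging Face transformers library; the model name and pooling choice are assumptions rather than the setting of any specific MCER paper.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

utterances = ["I can't believe you did that!", "Sorry, I didn't mean to."]
batch = tokenizer(utterances, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    out = model(**batch)

# Use the [CLS] hidden state as the utterance-level text feature x^t (one common choice).
text_features = out.last_hidden_state[:, 0, :]   # shape: (num_utterances, 768)
```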

3.3.2. Video Feature Extraction

Visual feature extraction mainly extracts information that conveys the speaker’s emotions, such as facial expressions and gestures, from the video. In recent years, deep neural networks have been able to extract deep features from images in an end-to-end manner, avoiding tedious manual feature engineering. For example, Tran et al. (Tran et al., 2015) proposed an effective and efficient 3D-CNN to process video frames containing spatio-temporal features.

Nowadays, most visual features are extracted by advanced neural networks (e.g., CNNs (Kattenborn et al., 2021; Wang et al., 2020a) and Transformers (Han et al., 2021; Han et al., 2022)) or by open-source toolkits (e.g., OKAO Vision (Zhu et al., 2023), OpenFace (Baltrusaitis et al., 2018), CERT (Boopathy et al., 2019), and Facet (Barreiro and Treglown, 2020)). OKAO Vision extracts the speaker’s facial expression in each video frame and outputs the speaker’s smile intensity (0-100) and eye gaze direction. CERT can adaptively extract visual features such as face and head pose. OpenFace 2.0 is capable of detecting facial landmarks, recognizing facial movements, and estimating eye gaze direction. Facet extracts visual features including facial movements, head poses, and HOG features.

3.3.3. Audio Feature Extraction

Deep learning has also received increasing attention in audio feature extraction research. For example, LSTMs (Xie et al., 2019) have been widely used to automatically extract acoustic features. Poria et al. (Poria et al., 2017) use a CNN to extract features from the audio and then feed the extracted audio features into a classifier for emotion classification.

Recently, more and more emotion analysis models (Tzinis et al., 2018; Majumder et al., 2019; Zhong et al., 2019) have started to use open-source toolkits such as COVAREP (Dumpala et al., 2023), openSMILE (Kumar et al., 2022), LibROSA (Suman et al., 2022), and OpenEAR (Schepker et al., 2020) to extract audio features. Specifically, OpenEAR adaptively extracts a set of acoustic features (e.g., prosodic, spectral, and cepstral features) and uses Z-normalization to normalize the audio features. The features extracted by openSMILE consist of MFCCs, pitch, and voice intensity. The LibROSA speech toolkit extracts 33 frame-level acoustic features (i.e., 20-dimensional MFCCs and CQT) that capture variations in speaker intonation. Similar to other audio feature extraction methods, COVAREP can also be used to extract features such as 12-dimensional MFCCs, the maxima dispersion quotient (MDQ), and Liljencrants-Fant (LF) parameters.
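For intuition, the sketch below extracts 20-dimensional frame-level MFCCs with LibROSA and mean-pools them into an utterance-level audio feature; the file path, sampling rate, and pooling strategy are illustrative assumptions.

```python
import librosa

# Load the audio track of one utterance (the path is a placeholder).
signal, sr = librosa.load("utterance_001.wav", sr=16000)

# 20-dimensional frame-level MFCCs, one common acoustic feature mentioned above.
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=20)   # shape: (20, num_frames)

# A simple utterance-level audio feature x^a: mean-pool over frames.
x_a = mfcc.mean(axis=1)                                    # shape: (20,)
```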

4. Taxonomy of Multi-modal Conversational Emotion Recognition Algorithms

In this section, we present a taxonomy of MCER modeling approaches. We categorize existing work into context-free modeling, sequential context modeling, distinguishing-speaker modeling, and speaker-relationship modeling. We briefly introduce each category in the following.

4.1. Context-free Modeling

These are mostly pioneering works on conversational emotion recognition. Context-free modeling methods aim to learn a feature representation for each sentence without exploiting the contextual information of the sentence (Zhang et al., 2020), (Seng et al., 2016), (Lotfian and Busso, 2019). For example, traditional machine learning methods (e.g., SVMs (Rozgić et al., 2012; Lin and Wei, 2005) and decision trees (Cichosz and Slot, 2007; Lee et al., 2011)) are used to extract the feature representation of each sentence and utilize the extracted sentence features to complete emotion classification. This process assumes that sentences are independent and do not influence each other. We introduce several common context-free modeling methods based on feature fusion below.

4.1.1. Add

The early fusion method based on the addition operation obtains the final emotional feature representation by a weighted summation of different modality features (Deng and Ren, 2021). This fusion method is simple to implement and requires only a small amount of computation. However, its shortcomings are also obvious: it cannot model the context information in a fine-grained manner, and the information that can be utilized is limited. The formula for implementing the context-free modeling method using the additive approach is defined as follows:

(3) $h_{e}=x^{t}+x^{a}+x^{v}$

where $h_{e}$ represents the fused emotion vector, and $x^{t},x^{a},x^{v}$ represent the text, audio, and video feature vectors, respectively.

4.1.2. Concatenation

The early fusion method based on the concatenation operation obtains the final emotion feature representation by concatenating and merging different modal features (Cambria et al., 2018). Although this fusion method does not introduce additional computation, it leads to very high-dimensional data, which makes subsequent calculations difficult. Furthermore, it also fails to capture complementary intra-modal and inter-modal semantic information. The formula is defined as follows:

(4) $h_{e}=\mathrm{Concat}\left([x^{t},x^{a},x^{v}]\right)$

where $\mathrm{Concat}(\cdot)$ represents the concatenation operation.
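The following PyTorch sketch contrasts the additive fusion of Eq. (3) with the concatenation fusion of Eq. (4); the feature dimensions are toy values chosen only for illustration.

```python
import torch

# Toy unimodal utterance features (a batch of 2 utterances).
x_t = torch.randn(2, 100)   # text features
x_a = torch.randn(2, 100)   # audio features (projected to the same size for addition)
x_v = torch.randn(2, 100)   # video features

# Eq. (3): additive fusion -- requires all modalities to share one dimension.
h_add = x_t + x_a + x_v                      # shape: (2, 100)

# Eq. (4): concatenation fusion -- dimensionality grows with the number of modalities.
h_cat = torch.cat([x_t, x_a, x_v], dim=-1)   # shape: (2, 300)
```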

4.1.3. Tri-modal Hidden Markov Model

Morency et al. used text, video, and audio features for the task of tri-modal emotion analysis and designed a model to extract useful information from the different modal features (Morency et al., 2011). After extracting the multi-modal features, the three modal features are concatenated and input into a Hidden Markov Model (HMM) classifier (Morency et al., 2011) to learn the emotional state of the input signal. The HMM assumes that the state at the current moment is only related to the information of the previous moment, which prevents the model from using the context information of the utterance. The formula of the HMM is defined as follows:

(5) $P(w\mid x^{a},x^{v},x^{t})=\sum_{i=1}^{C}\sum_{j=1}^{D}\sum_{k=1}^{M}P(w,\lambda_{i}^{a},\lambda_{j}^{v},\lambda_{k}^{t}\mid x^{a},x^{v},x^{t})$
$=\sum_{i=1}^{C}\sum_{j=1}^{D}\sum_{k=1}^{M}P(w\mid\lambda_{i}^{a},\lambda_{j}^{v},\lambda_{k}^{t},x^{a},x^{v},x^{t})\times P(\lambda_{i}^{a},\lambda_{j}^{v},\lambda_{k}^{t}\mid x^{a},x^{v},x^{t})$

where $C,D,M$ represent the feature vector dimensions for audio, video, and text, respectively, $w$ represents the emotion class, and $P(\lambda_{i}^{a},\lambda_{j}^{v},\lambda_{k}^{t}\mid x^{a},x^{v},x^{t})$ represents the confidence of the emotion classification.

Since the true class label is conditioned on the predicted class label $\hat{w}_{b}$, the HMM formula can be expanded as follows:

(6) $P(w\mid\lambda_{i}^{a},\lambda_{j}^{v},\lambda_{k}^{t},x^{a},x^{v},x^{t})=\sum_{b=1}^{B}P(w,\hat{w}_{b}\mid\lambda_{i}^{a},\lambda_{j}^{v},\lambda_{k}^{t},x^{a},x^{v},x^{t})$
$=\sum_{b=1}^{B}P(w\mid\hat{w}_{b},\lambda_{i}^{a},\lambda_{j}^{v},\lambda_{k}^{t},x^{a},x^{v},x^{t})\times P(\hat{w}_{b}\mid\lambda_{i}^{a},\lambda_{j}^{v},\lambda_{k}^{t},x^{a},x^{v},x^{t})$

where $P(\hat{w}_{b}\mid\lambda_{i}^{a},\lambda_{j}^{v},\lambda_{k}^{t},x^{a},x^{v},x^{t})$ represents the probability of the predicted label.

4.1.4. SVM

SVM is a machine learning algorithm for classification and regression whose optimization goal is to find a hyperplane (a straight line in two-dimensional space and a hyperplane in higher-dimensional space) that separates samples of different classes. Building on this idea, Perez-Rosas et al. (Pérez-Rosas et al., 2013) concatenate multi-modal features as input vectors and use an SVM to classify utterances by emotion. SVM works well for binary classification problems but is less effective for multi-class problems, and it is mainly suitable for training on small-scale datasets. The decision function of an SVM with an RBF kernel is defined as follows:

(7) $f\left(x\right)=\operatorname{sign}\left(\sum_{i=1}^{N}\alpha_{i}^{*}y_{i}\exp\left(-\frac{\|x-z\|^{2}}{2\sigma^{2}}\right)+b^{*}\right)$

where $\operatorname{sign}(x)=1$ for $x>0$, $0$ for $x=0$, and $-1$ for $x<0$; $\alpha_{i}^{*}$ and $b^{*}$ are learnable parameters; $\exp\left(-\frac{\|x-z\|^{2}}{2\sigma^{2}}\right)$ is the kernel function; and $N$ is the number of samples.
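A minimal sketch of this pipeline with scikit-learn is shown below; the concatenated features, label set, and hyperparameters are placeholders rather than the settings of Perez-Rosas et al.

```python
import numpy as np
from sklearn.svm import SVC

# Placeholder data: each row is a concatenated multi-modal feature x = [x^t; x^a; x^v],
# and each label is an emotion class id.
X_train = np.random.randn(200, 300)
y_train = np.random.randint(0, 6, size=200)

# RBF-kernel SVM, matching the kernel form in Eq. (7); gamma plays the role of 1/(2*sigma^2).
clf = SVC(kernel="rbf", gamma="scale", C=1.0)
clf.fit(X_train, y_train)

X_test = np.random.randn(5, 300)
pred = clf.predict(X_test)   # predicted emotion classes
```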

4.1.5. Multiple Kernel Learning

After preprocessing the features of the three different modalities, Poria et al. (Poria et al., 2015) constructed two different feature selectors to achieve feature dimensionality reduction. One feature selector is based on cyclic correlation-based feature subset selection (CFS), and the other is based on principal component analysis (PCA). These two feature selectors not only eliminate redundant and noisy information but also improve the running speed of the model. After feature selection and dimensionality reduction, the researchers concatenated the processed feature vectors and trained a classifier using a multiple kernel learning (MKL) algorithm (Poria et al., 2015). Based on this work, the authors further proposed the Convolutional Recurrent Multiple Kernel Learning (CRMKL) model (Poria et al., 2016). CRMKL uses a convolutional recurrent neural network for emotion detection, which can extract contextual information. The formula of MKL is defined as follows:

(8) $\begin{gathered}\max_{\alpha,\beta}\left[\sum_{i=1}^{N}\alpha_{i}-\sum_{i,j=1}^{N}\alpha_{i}\alpha_{j}y_{i}y_{j}\mathrm{K}_{mkl}(x_{i},x_{j})\right]\\ \sum_{i}^{N}\alpha_{i}y_{i}=0,\quad 0\leq\alpha_{i}\leq C,\quad\mathrm{K}_{mkl}=\sum_{k}^{M}\beta_{k}K_{k}>0\end{gathered}$

where $y_{i}$ is the true label, $\alpha$ and $\beta$ are learnable parameters, and $M$ is the number of base kernels.

Figure 5. The flowchart of the TextCNN method. Specifically, given text features, TextCNN employs convolution filters of different sizes to generate feature maps, uses 1D max pooling to expand the receptive field of the feature maps, and further utilizes a multi-layer perceptron (MLP) to complete the emotion prediction.

4.1.6. Select-Additive Learning CNN

CNN is a classic neural network for visual tasks and cannot be directly used for emotion recognition. To solve this problem, Kim (Kim, 2014) proposed the TextCNN model, whose overall process is shown in Fig. 5. To perform multi-modal emotion recognition, Wang et al. (Wang et al., 2017) proposed the SAL-CNN model, which first fully trains the CNN with multi-modal data and then uses Select-Additive Learning (SAL) to improve its generality and prevent overfitting during training. The SAL method consists of two phases (i.e., selection and addition). In the selection phase, SAL preserves important features and removes noisy information from the latent feature representations learned by the neurons. In the addition phase, SAL improves the model’s noise immunity by adding Gaussian noise to the feature representation. The SAL method improves the generalization performance of deep fusion models.

The formula for extracting text features by CNN is defined as follows:

(9) $x_{1:n}^{t}=x_{1}\oplus x_{2}\oplus\ldots\oplus x_{n}$
$c_{i}=f(\omega\cdot x_{p:p+q-1}+b)$

where $\oplus$ represents the concatenation operator, $\omega$ represents a convolution filter, $c_{i}$ represents the feature representation within a window, and $f(\cdot)$ represents the activation function. Convolutional filters are used to extract features from all sentences to generate feature maps:

(10) $\mathbf{c}=\mathrm{maxpooling}[c_{1},c_{2},\ldots,c_{n-h+1}]$

The max pooling operation is used to capture the most critical semantic information in the sentence.

It can be seen from the processing flow of the convolutional neural network that using a CNN to extract text features does not incorporate contextual information, i.e., each sentence is assumed to be independent of the others.
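A minimal PyTorch sketch of a TextCNN-style encoder following Eqs. (9)-(10) is given below; the vocabulary size, filter sizes, and class count are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    """A minimal TextCNN-style sentence encoder (hyperparameters are illustrative)."""
    def __init__(self, vocab_size=10000, emb_dim=100, num_classes=6,
                 kernel_sizes=(3, 4, 5), num_filters=100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # One 1D convolution per window size q, as in Eq. (9).
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, num_filters, k) for k in kernel_sizes])
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, token_ids):                      # (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)      # (batch, emb_dim, seq_len)
        # Max-over-time pooling per filter, as in Eq. (10).
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))       # per-sentence emotion logits

logits = TextCNN()(torch.randint(0, 10000, (2, 30)))   # two sentences of 30 tokens each
```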

4.1.7. Tensor Fusion Network

The tensor-based feature fusion method mainly calculates the tensor product of different modal feature representations through Cartesian product to obtain the fused tensor representation (Pandey et al., 2022). Therefore, the above methods need to first map the input multi-modal feature representation into a high-dimensional space, and then map it back to a low-dimensional tensor space for emotion representation. Tensor-based methods are able to capture important high-order interaction information across time, space, and modality. However, the computational complexity of tensor methods is very high and grows exponentially, and there is no fine-grained semantic information interaction between modalities. Zadeh et al. (Zadeh et al., 2017) proposed the multi-modal tensor fusion network TFN. TFN adopts the method of tensor fusion, which can simulate the interaction process between the three modalities of text, audio and video, and effectively fuse multi-modal features. Although TFN can effectively model information interaction within and between modalities, the model complexity of the TFN method is related to the dimensionality of multi-modal features and grows exponentially. The formula of TFN is defined as follows:

(11) $\left\{(x^{t},x^{v},x^{a})\mid x^{t}\in\begin{bmatrix}\mathbf{x}^{l}\\ 1\end{bmatrix},\ x^{v}\in\begin{bmatrix}\mathbf{x}^{v}\\ 1\end{bmatrix},\ x^{a}\in\begin{bmatrix}\mathbf{x}^{a}\\ 1\end{bmatrix}\right\}$

where the extra dimension with 1 is used to perform modal interaction. The Cartesian product is then used to fuse the three modal features as follows:

(12) $\mathbf{x}^{m}=\begin{bmatrix}\mathbf{x}^{l}\\ 1\end{bmatrix}\otimes\begin{bmatrix}\mathbf{x}^{v}\\ 1\end{bmatrix}\otimes\begin{bmatrix}\mathbf{x}^{a}\\ 1\end{bmatrix}$

where $\otimes$ represents the outer product and $\mathbf{x}^{m}$ represents the fused vector.
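The sketch below reproduces the outer-product fusion of Eqs. (11)-(12) in PyTorch; the feature sizes are toy values, meant only to illustrate how quickly the fused tensor grows.

```python
import torch

# Toy unimodal utterance features.
x_t = torch.randn(32)   # text
x_a = torch.randn(16)   # audio
x_v = torch.randn(16)   # video

def with_one(x):
    # Append the constant 1 so unimodal and bimodal terms survive in the product (Eq. 11).
    return torch.cat([x, torch.ones(1)])

# Eq. (12): three-way outer product; the result has 33 x 17 x 17 entries,
# illustrating the exponential growth in dimensionality discussed above.
x_m = torch.einsum("i,j,k->ijk", with_one(x_t), with_one(x_a), with_one(x_v))
fused = x_m.flatten()   # fed to a downstream emotion classifier
```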

4.1.8. Low-rank Tensor Fusion Network

On the basis of TFN, in order to fuse multi-modal data more efficiently, Liu et al. (Liu et al., 2018a) proposed the low-rank multi-modal fusion method LMF to reduce the dimensionality of multi-modal features and thereby improve the fusion efficiency, as shown in Fig. 6. LMF has achieved high performance on many different tasks.

(13) $\mathbf{x}^{m}=\left(\sum_{i=1}^{r}\mathbf{w}_{a}^{(i)}\otimes\mathbf{w}_{v}^{(i)}\otimes\mathbf{w}_{t}^{(i)}\right)\cdot\mathbf{x}=\left(\sum_{i=1}^{r}\mathbf{w}_{a}^{(i)}\cdot x_{a}\right)\circ\left(\sum_{i=1}^{r}\mathbf{w}_{v}^{(i)}\cdot x_{v}\right)\circ\left(\sum_{i=1}^{r}\mathbf{w}_{t}^{(i)}\cdot x_{t}\right)$

where $\mathbf{w}_{a},\mathbf{w}_{v},\mathbf{w}_{t}$ represent the decomposed low-rank learnable factors.

Figure 6. The overall flowchart of LMF. LMF mainly performs a low-rank decomposition of the modality-specific learnable factors.

4.1.9. Data Augmentation with Generative Adversarial Networks

Multi-modal emotion recognition based on adversarial learning is an active direction in this field, which uses the principles of adversarial learning to improve the accuracy and robustness of emotion recognition (Ren et al., 2023), (Yuan et al., 2023). Next, we introduce the overall process of data augmentation based on generative adversarial networks.

1) Conditional GANs. The Conditional Generative Adversarial Network (cGAN) (Sun et al., 2023) is a variant of GAN that introduces conditional information to more precisely control the output of the generator. The core idea of cGAN is to pass additional condition information to the generator and discriminator during the generation process, thereby generating specific types of data based on given conditions. The main advantage of cGAN is its ability to precisely control the generation process so as to generate data that satisfies the conditional information. The optimization goal of cGAN is defined as follows:

(14) $\min_{G}\max_{D}V(D,G)=\mathbb{E}_{\mathbf{x}\sim p_{\mathrm{data}}(\mathbf{x})}\{\log D([\mathbf{x},\mathbf{y}])\}+\mathbb{E}_{\mathbf{z}\sim p_{z}(\mathbf{z})}\{\log(1-D([G([\mathbf{z},\mathbf{y}]),\mathbf{y}]))\}$

where $\mathbf{x}$ represents real data and $\mathbf{y}$ represents the extra conditional information. The corresponding discriminator and generator losses are defined as:

(15) $\mathcal{L}_{D}^{(cGAN)}=-\mathbb{E}_{\mathbf{x}\sim p_{\mathrm{data}}(\mathbf{x})}\{\log D([\mathbf{x},\mathbf{y}])\}-\mathbb{E}_{\mathbf{z}\sim p_{z}(\mathbf{z})}\{\log(1-D([G([\mathbf{z},\mathbf{y}]),\mathbf{y}]))\}$
$\mathcal{L}_{G}^{(cGAN)}=-\mathbb{E}_{\mathbf{z}\sim p_{z}(\mathbf{z})}\{\log(D([G([\mathbf{z},\mathbf{y}]),\mathbf{y}]))\}$
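A compact PyTorch sketch of the cGAN objective in Eqs. (14)-(15), applied to conditional augmentation of fused emotion features, is shown below; the network architectures, feature sizes, and label encoding are illustrative assumptions.

```python
import torch
import torch.nn as nn

feat_dim, label_dim, noise_dim = 300, 6, 64
G = nn.Sequential(nn.Linear(noise_dim + label_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim))
D = nn.Sequential(nn.Linear(feat_dim + label_dim, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid())
bce = nn.BCELoss()

x = torch.randn(8, feat_dim)                                              # real fused features (placeholder)
y = nn.functional.one_hot(torch.randint(0, 6, (8,)), label_dim).float()   # emotion condition
z = torch.randn(8, noise_dim)

x_fake = G(torch.cat([z, y], dim=1))
d_real = D(torch.cat([x, y], dim=1))
d_fake = D(torch.cat([x_fake.detach(), y], dim=1))

# Eq. (15): discriminator and generator losses.
loss_D = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
loss_G = bce(D(torch.cat([x_fake, y], dim=1)), torch.ones_like(d_real))
```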

2) Adversarial Autoencoders. The Adversarial Autoencoder (AAE) (Latif et al., 2020) combines the ideas of the Autoencoder and the GAN. The main goal of AAE is to make the encoding space more continuous and endow it with better data generation capabilities while learning a compressed representation of the data. The training objective of AAE usually includes two parts: the reconstruction error of the autoencoder, which ensures the quality of the encoding, and the GAN loss, which makes the encoding distribution more continuous and closer to the real distribution. The formula is defined as follows:

(16) $\mathcal{L}_{D}^{(AAE)}=-\mathbb{E}_{\mathbf{z}\sim p_{z}(\mathbf{z})}\{\log D(\mathbf{z})\}-\mathbb{E}_{\mathbf{x}\sim p_{\mathrm{data}}(\mathbf{x})}\{\log(1-D(E(\mathbf{x})))\}$
$\mathcal{L}_{E}^{(AAE)}=-\mathbb{E}_{\mathbf{x}\sim p_{\mathrm{data}}(\mathbf{x})}\{\log(D(E(\mathbf{x})))\}$
$\mathcal{L}_{R}^{(AAE)}=\mathbb{E}_{\mathbf{x}\sim p_{\mathrm{data}}(\mathbf{x})}\{\|\mathbf{x}-R(E(\mathbf{x}))\|^{2}\}$

where $p_{z}(\mathbf{z})$ represents the prior distribution, $E(\cdot)$ is the encoder, $R(\cdot)$ is the decoder (reconstruction network), and $D(\cdot)$ is the discriminator.

3) Adversarial Data Augmentation Network. The Adversarial Data Augmentation Network (ADAN) (Wang et al., 2022) includes the following components: an autoencoder $R(E(x))$, an auxiliary classifier $C(E(x))$, a generator $G(z,y)$, and a discriminator $D(h)$. First, ADAN aims to learn a latent representation of the input data $x$ that preserves its emotional information. Second, it attempts to ensure that the generated latent representation is consistent with the emotional information of the input data by matching the posterior distribution $p(h|z,y)$ with $p(h|x)$. Third, ADAN simultaneously strives to minimize the reconstruction error between the input data $x$ and its reconstructed version $\hat{x}$ to ensure high-quality data reconstruction. The generator $G(z,y)$ takes a sample $z$ drawn from an M-dimensional Gaussian distribution and a one-hot encoding of the emotion label $y$ as input, and its goal is to generate samples in the latent space that are indistinguishable from real samples. The discriminator $D(h)$ is optimized to distinguish whether the latent vector $h$ comes from real data or from the generator.

(17) $\mathcal{L}_{D}^{(\mathrm{ADAN})}=-\mathbb{E}_{\mathbf{x}\sim p_{\mathrm{data}}(\mathbf{x})}\{\log D(E(\mathbf{x}))\}-\mathbb{E}_{\mathbf{z}\sim p_{z}(\mathbf{z})}\{\log(1-D(G(\mathbf{z},\mathbf{y})))\}$
$\mathcal{L}_{C}^{(\mathrm{ADAN})}=-\mathbb{E}_{\mathbf{x}\sim p_{\mathrm{data}}(\mathbf{x})}\left\{\sum_{k=1}^{K}y_{\mathrm{emo}}^{(k)}\log C(E(\mathbf{x}))_{k}\right\}$
$\mathcal{L}_{R}^{(\mathrm{ADAN})}=\mathbb{E}_{\mathbf{x}\sim p_{\mathrm{data}}(\mathbf{x})}\{\|\mathbf{x}-R(E(\mathbf{x}))\|^{2}\}$
$\mathcal{L}_{E}^{(\mathrm{ADAN})}=\mathbb{E}_{\mathbf{x}\sim p_{\mathrm{data}}(\mathbf{x})}\left\{\|\mathbf{x}-R(E(\mathbf{x}))\|^{2}-\sum_{k=1}^{K}y_{\mathrm{emo}}^{(k)}\log C(E(\mathbf{x}))_{k}\right\}$
$\mathcal{L}_{G}^{(\mathrm{ADAN})}=\mathbb{E}_{\mathbf{z}\sim p_{z}(\mathbf{z})}\left\{\log(1-D(G(\mathbf{z},\mathbf{y})))-\alpha\sum_{k=1}^{K}y_{\mathrm{emo}}^{(k)}\log C(G(\mathbf{z},\mathbf{y}))_{k}\right\}$

where $\alpha$ determines the contribution of the classification error to model optimization.

4.2. Sequential Context Modeling

Context-free modeling is conceptually important and has inspired later research on sequential context modeling (Tu et al., 2022). Sequential context modeling approaches (Ma et al., 2019; Tao and Liu, 2018; Xie et al., 2019) consider each sentence to be influenced by its surrounding utterances. The main idea is to generate a feature representation with rich contextual semantic information by combining the utterance's own representation $x_{i}$ with the surrounding contextual sentence representations $\{x_{i-k},\cdots,x_{i-1},x_{i+1},\cdots,x_{i+k}\}$, where $k$ represents the context window size. Different from context-free modeling, sequential context modeling obtains a better feature representation by using a memory network to preserve the contextual information of the sentence. Taking Fig. 7 as an example, an LSTM or Transformer is used to extract contextual information from the three modalities of video, audio, and text. The sequential context modeling approach also plays an important role in many other MCER modeling approaches.

LSTM is a variant of RNN that can remember contextual information. Specifically, LSTM models long-distance contextual dependencies through its cell state and alleviates the vanishing gradient problem. Each LSTM cell consists of an input gate $j_{t}$, an output gate $O_{t}$, a cell state $C_{t}$, and a forget gate $f_{t}$.

(18) $\begin{bmatrix}\widetilde{C}_{t}\\ O_{t}\\ j_{t}\\ f_{t}\end{bmatrix}=\begin{bmatrix}\tanh\\ \sigma\\ \sigma\\ \sigma\end{bmatrix}W_{T}\begin{bmatrix}x_{t}\\ h_{i}^{t-1}\end{bmatrix}$
$C_{t}=\widetilde{C}_{t}\odot j_{t}+C_{t-1}\odot f_{t}$
$h_{i}^{t}=O_{t}\odot\tanh(C_{t})$

where $\sigma$ represents the sigmoid activation function.
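For illustration, the sketch below runs a bidirectional LSTM over a dialogue of fused utterance features to produce context-aware representations; the dialogue length, feature sizes, and class count are assumptions.

```python
import torch
import torch.nn as nn

# A dialogue of 10 utterances, each represented by a 300-d fused feature x.
utterance_feats = torch.randn(1, 10, 300)          # (batch, dialogue_len, feat_dim)

context_lstm = nn.LSTM(input_size=300, hidden_size=128,
                       batch_first=True, bidirectional=True)
context_feats, _ = context_lstm(utterance_feats)   # (1, 10, 256): context-aware features

classifier = nn.Linear(256, 6)                      # 6 emotion classes (assumed)
emotion_logits = classifier(context_feats)          # per-utterance emotion logits
```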

After LSTM was used in multi-modal conversational emotion recognition, many other works were proposed to extract contextual emotional information. Lu et al. (Lu et al., 2023) proposed a multi-scale LSTM multi-modal emotion recognition model, which uses LSTM to extract low-level and high-level local emotional features in multi-modal features. This method can capture subtle changes in complex expressions in a more fine-grained manner and implement an information feedback mechanism. However, it cannot capture the status information of the utterance and the status information of the speaker.

Figure 7. The flowchart of the sequential context modeling approach. Sequential context modeling methods use LSTMs or Transformers to capture high-level features with rich contextual semantic information from the different modal features.

Existing models ignore modal alignment and directly fuse information on different modal features. Modal alignment can eliminate the heterogeneity of single-modal features and obtain accurate emotional representations of different modal features. Based on this current situation, Hou et al. (Hou et al., 2023) proposed a semantic alignment network based on multi-space learning, which uses LSTM to extract emotional features of different modalities and obtains high-level emotional representations as supervisory signals for modal alignment. This method can capture the global correlation between different modalities and achieve feature fusion between modalities.

Transformers are another way of modeling sequential context (Li et al., 2020; Zhu et al., 2021; Huang et al., 2020). The Transformer's long-distance modeling capability is far superior to that of recurrent neural networks such as Bi-LSTM, and the Transformer supports parallel computation. Therefore, existing research on multi-modal emotion recognition based on sequential context modeling often regards the Transformer as an important technology. The implementation details of the Transformer are as follows.

First, the video, audio, and text features (i.e., $x^{t}$, $x^{a}$, $x^{v}$) are concatenated into a fused vector. The formula is defined as follows:

(19) $Q,K,V=\mathrm{Concat}(x^{t},x^{a},x^{v})$

where $Q$, $K$, and $V$ represent the query, key, and value vectors of the multi-modal features, respectively.

Second, we use a feedforward neural network to perform multiple linear transformations on $Q$, $K$, and $V$. The formula is defined as follows:

(20) $\tilde{Q}=\mathrm{Concat}(QW_{1}^{Q},\ldots,QW_{i}^{Q},\ldots,QW_{m}^{Q})$
$\tilde{K}=\mathrm{Concat}(KW_{1}^{K},\ldots,KW_{i}^{K},\ldots,KW_{m}^{K})$
$\tilde{V}=\mathrm{Concat}(VW_{1}^{V},\ldots,VW_{i}^{V},\ldots,VW_{m}^{V})$

where $m$ represents the number of linear transformations.

We then perform multi-head attention in parallel to obtain emotion feature representation:

(21) $head_{i}=\mathrm{softmax}\left((QW_{i}^{Q})(KW_{i}^{K})^{T}\right)VW_{i}^{V}$
$H_{head}=\mathrm{Concat}(head_{1},\ldots,head_{m})$

where $H_{head}$ represents the emotion feature vectors.

Finally, we use sine and cosine position encoding to obtain the position information of the emotion sequence:

(22) $PE_{(pos,2i)}=\sin\left(\frac{pos}{10000^{2i/d}}\right)$
$PE_{(pos,2i+1)}=\cos\left(\frac{pos}{10000^{2i/d}}\right)$

where $pos$ is the position index of the $i$-th sentence; the position encoding information is fused into $Q$, $K$, and $V$.
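The sketch below strings together Eqs. (19)-(22): modality concatenation, sinusoidal position encoding, and multi-head self-attention via PyTorch's built-in module; all sizes are illustrative.

```python
import torch
import torch.nn as nn

x_t, x_a, x_v = torch.randn(10, 128), torch.randn(10, 64), torch.randn(10, 64)
x = torch.cat([x_t, x_a, x_v], dim=-1)               # Eq. (19): (seq_len=10, d=256)

def sinusoidal_pe(seq_len, d):
    pos = torch.arange(seq_len).unsqueeze(1).float()
    i = torch.arange(0, d, 2).float()
    angles = pos / (10000 ** (i / d))
    pe = torch.zeros(seq_len, d)
    pe[:, 0::2] = torch.sin(angles)                   # Eq. (22), even dimensions
    pe[:, 1::2] = torch.cos(angles)                   # Eq. (22), odd dimensions
    return pe

x = (x + sinusoidal_pe(10, 256)).unsqueeze(0)         # add positions; (batch=1, seq_len, d)
mha = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
h_head, _ = mha(x, x, x)                              # Eqs. (20)-(21): multi-head self-attention
```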

After Transformer was proposed, many Transformer-based multi-modal conversational emotion recognition research methods were proposed to model long-distance context dependencies (Gerczuk et al., 2021; Hazarika et al., 2021). In view of the inability of previous works to model long-distance dependencies between different modal features, Yang et al. (Yang et al., 2022) proposed a multi-modal speech emotion recognition method using context Transformer, which improves the emotional representation of the current utterance by embedding contextual information. This method can adaptively learn feature fusion between modalities.

Given that existing methods cannot dynamically identify subtle emotional changes or the multi-scale features within different modalities, Liu et al. (Liu et al., 2022) proposed a multi-scale self-attention fusion method for emotion recognition, which uses the self-attention mechanism to extract context-related dependencies in multi-modal features. This method combines bc-LSTM with a multi-head attention mechanism to achieve fine-grained emotional information mining, and experiments with both feature-level and decision-level cross-modal fusion. These works demonstrate the potential of Transformers to model long-distance context dependencies.

Figure 8. The flowchart of the proposed distinguishing speakers modeling approach. The distinguishing speakers modeling approach designs three GRU states, i.e., a global GRU, an emotion GRU and a speaker GRU, which are used to update global context information, emotion category information and speaker information, respectively.

4.3. Distinguishing speaker modeling

The distinguishing speakers modeling method considers that the speaker's emotion is related not only to the global context but also to the speaker's own emotional state. Taking Fig. 8 as an example, there are three GRUs (i.e., a global GRU, an emotion GRU and a speaker GRU). The global GRU is utilized to extract global multi-modal information and the speaker's emotional state information. The speaker GRU fuses the contextual semantic information captured by the attention mechanism with the speaker's emotional state information. The emotion GRU combines the speaker's emotional state information and global context information to complete the final emotion classification.

Global GRU captures the contextual semantic information of an utterance by modeling the utterance and speaker states. Each speaker state is used to memorize a speaker-specific representation of an utterance. By distinguishing the subordination relationship between speakers and utterances, it is beneficial to model the dependency relationship between speakers and utterances, thereby enhancing the semantic representation ability of context. The formula of Global GRU is defined as follows:

(23) $g_{t}=GRU_{\mathcal{G}}(g_{t-1},(x_{t}\oplus q_{s(x_{t}),t-1}))$

where $g_{t}$ represents the latent feature representation of the global state, $q_{s(x_{t}),t-1}$ represents the state of the speaker of the current utterance $x_{t}$ at time $t-1$, and $\oplus$ denotes concatenation.

Speakers typically reply to conversations based on contextual information from other speakers. Therefore, speaker GRU extracts the context ctc_{t} related to the utterance xtx_{t}. The formula is defined as follows:

(24) $\beta=\operatorname{softmax}(x_{t}^{T}W_{\beta}[g_{1},g_{2},\ldots,g_{t-1}]),$
$c_{t}=\beta[g_{1},g_{2},\ldots,g_{t-1}]^{T}$

where $W_{\beta}$ is a learnable parameter matrix. We first compute the attention scores over the global states of the previous $t-1$ time steps; the attention scores give higher weights to utterances related to the current utterance $x_{t}$. The final context vector $c_{t}$ is obtained as the weighted sum of the attention scores $\beta$ and the previous global states $g_{1},\ldots,g_{t-1}$.

(25) $q_{s(u_{t}),t}=GRU_{\mathcal{P}}(q_{s(u_{t}),t-1},(u_{t}\oplus c_{t}))$

The emotional representation $e_{t}$ of the utterance $u_{t}$ is obtained by combining the speaker state $q_{s(u_{t}),t}$ with the emotion representation $e_{t-1}$ at time $t-1$. The underlying intuition is that the context has a strong impact on utterance $u_{t}$, and $e_{t-1}$ integrates emotional contextual information from the other parties' states into the current emotion representation $e_{t}$. Therefore, we use the emotion GRU unit to model $e_{t}$, and the formula is defined as follows:

(26) $e_{t}=GRU_{E}(e_{t-1},q_{s(u_{t}),t})$

The emotion representation $e_{t}$, which combines contextual information and speaker state information, is used for the final emotion classification.
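As a concrete illustration of the recurrence in Eqs. (23)-(26), the following is a minimal PyTorch sketch of a DialogueRNN-style model with a global GRU, a speaker (party) GRU, and an emotion GRU. The class name, dimensions, and the zero initialization of the states are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class SpeakerAwareRecurrence(nn.Module):
    """Minimal sketch of Eqs. (23)-(26): a global GRU, a speaker (party) GRU,
    and an emotion GRU, in the spirit of DialogueRNN-style models."""

    def __init__(self, d_utt: int = 100, d_state: int = 100):
        super().__init__()
        self.global_gru = nn.GRUCell(d_utt + d_state, d_state)   # Eq. (23)
        self.speaker_gru = nn.GRUCell(d_utt + d_state, d_state)  # Eq. (25)
        self.emotion_gru = nn.GRUCell(d_state, d_state)          # Eq. (26)
        self.w_beta = nn.Linear(d_state, d_utt, bias=False)      # W_beta in Eq. (24)

    def forward(self, utterances, speaker_ids, num_speakers):
        # utterances: (T, d_utt); speaker_ids: speaker index of each turn.
        T, _ = utterances.shape
        d_state = self.w_beta.in_features
        g = torch.zeros(1, d_state)                    # running global state
        q = torch.zeros(num_speakers, d_state)         # per-speaker states
        e = torch.zeros(1, d_state)                    # emotion state
        g_history, emotions = [], []
        for t in range(T):
            x_t = utterances[t:t + 1]                  # (1, d_utt)
            s = speaker_ids[t]
            # Eq. (23): update the global state with utterance + speaker state.
            g = self.global_gru(torch.cat([x_t, q[s:s + 1]], dim=-1), g)
            # Eq. (24): attend over previous global states to build context c_t.
            if g_history:
                G = torch.cat(g_history, dim=0)                          # (t-1, d_state)
                beta = torch.softmax(x_t @ self.w_beta(G).t(), dim=-1)   # (1, t-1)
                c_t = beta @ G                                           # (1, d_state)
            else:
                c_t = torch.zeros(1, d_state)
            g_history.append(g)
            # Eq. (25): update the state of the current speaker.
            q = q.clone()
            q[s:s + 1] = self.speaker_gru(torch.cat([x_t, c_t], dim=-1), q[s:s + 1])
            # Eq. (26): update the emotion representation e_t.
            e = self.emotion_gru(q[s:s + 1], e)
            emotions.append(e)
        return torch.cat(emotions, dim=0)              # (T, d_state)
```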

In modeling methods based on distinguishing between speakers, Ghosal et al. (Ghosal et al., 2020) proposed COSMIC, which clarifies the relationship between the speaker and the utterance, and introduces common sense knowledge to enhance the emotional understanding of the model. COSMIC can learn a variety of different prior knowledge (e.g., event relationships and causal relationships, etc.), and can distinguish speaker information and dynamically detect the speaker’s emotional changes.

Given that existing methods fail to capture the correlation between utterances and speakers and lack modeling of the interaction between speakers, Zhang et al. (Zhang and Chai, 2021) proposed a conversational interaction model, which extracts contextual semantic information and state-interaction information of utterances through stacked global interaction modules. In addition, this method introduces noise information to learn adversarial feature representations. Experimental results show that adversarial learning can improve the performance of emotion recognition.

Figure 9. The flowchart of the proposed speaker relationship modeling approach. The speaker relationship modeling method aggregates dialogue relationship information between speakers by using a graph convolutional neural network.

4.4. Speaker Relationship Modeling

4.4.1. GNN for speaker relationship modeling

The speaker relationship modeling method introduces graph neural networks to capture the speakers' dialogue relationship information while extracting sequential context information. Taking Fig. 9 as an example, it extracts the dialogue relationships and dependencies between speakers by constructing a speaker relationship graph.

GCN extends the convolution operation to graph-structured data to extract structural information. It aggregates first-order neighbor information and can be viewed as a first-order approximation of spectral graph convolution. The formula of GCN is defined as follows:

(27) $\boldsymbol{H}^{(l+1)}=ReLU\left(\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}\boldsymbol{H}^{(l)}\boldsymbol{W}^{(l)}\right)$

where $\boldsymbol{W}^{(l)}$ is a learnable parameter matrix, $\tilde{A}=A+I_{n}$, $I_{n}$ is the identity matrix, and $\tilde{\boldsymbol{D}}_{ii}=\sum_{j}\tilde{a}_{ij}$. $\boldsymbol{H}^{(l+1)}$ represents the latent feature representations of layer $l+1$.
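A short sketch of the propagation rule in Eq. (27) on a dense adjacency matrix is given below; the graph size and feature dimension are illustrative.

```python
import torch
import torch.nn as nn

def gcn_layer(H: torch.Tensor, A: torch.Tensor, W: nn.Linear) -> torch.Tensor:
    """One GCN layer following Eq. (27): symmetric normalization of the
    adjacency with self-loops, then a linear transform and ReLU."""
    A_tilde = A + torch.eye(A.size(0))            # add self-loops
    d = A_tilde.sum(dim=1)                        # degree vector
    D_inv_sqrt = torch.diag(d.pow(-0.5))
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt     # \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}
    return torch.relu(W(A_hat @ H))

# Illustrative usage: 5 utterance nodes with 16-dimensional features.
H = torch.randn(5, 16)
A = (torch.rand(5, 5) > 0.5).float()
A = ((A + A.t()) > 0).float()                     # make the graph undirected
out = gcn_layer(H, A, nn.Linear(16, 16, bias=False))
```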

The steps to apply GCN to multi-modal emotion recognition are as follows. First, each utterance is represented as a node in the graph, and edges are constructed based on the context between utterances (i.e., if two utterances interact in the dialogue, an edge is added between them). We then apply GCN to the constructed dialogue graph to extract speaker-level information. Through this process, the model can dynamically learn the correlations between utterances. Following Equation 27, the formula for aggregating information from surrounding contextual utterances can be adapted as follows:

(28) $H_{i}^{(l+1)}=ReLU\left(\sum_{r\in\mathcal{R}}\sum_{j\in\mathcal{N}_{i}^{r}}\frac{1}{|\mathcal{N}_{i}^{r}|}\left(W_{\theta_{1}}^{(l)}H_{j}^{(l)}+W_{\theta_{2}}^{(l)}H_{i}^{(l)}\right)\right)$

where $W_{\theta_{1}}^{(l)}$ and $W_{\theta_{2}}^{(l)}$ are learnable parameters, and $\mathcal{N}_{i}^{r}$ denotes the neighbors of node $i$ under the relation $r\in\mathcal{R}$.
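The relation-aware aggregation of Eq. (28) can be sketched as follows. The way the dialogue graph is built here (relation 0 for same-speaker edges, relation 1 for different-speaker edges within a small context window) is an illustrative assumption; actual methods differ in how relations and context windows are defined.

```python
import torch
import torch.nn as nn

class RelationalGraphLayer(nn.Module):
    """Minimal sketch of Eq. (28): mean aggregation of neighbour messages per
    relation type plus a self-transform, summed over relations."""

    def __init__(self, dim: int, num_relations: int):
        super().__init__()
        self.w_rel = nn.ModuleList([nn.Linear(dim, dim, bias=False)
                                    for _ in range(num_relations)])  # W_{theta_1} per relation
        self.w_self = nn.Linear(dim, dim, bias=False)                # W_{theta_2}

    def forward(self, H: torch.Tensor, neighbours: dict) -> torch.Tensor:
        # neighbours[r][i] lists the neighbour indices of node i under relation r.
        out = torch.zeros_like(H)
        for i in range(H.size(0)):
            agg = torch.zeros(H.size(1))
            for r, w_r in enumerate(self.w_rel):
                idx = neighbours[r].get(i, [])
                if idx:
                    msgs = w_r(H[idx]) + self.w_self(H[i])   # W1 H_j + W2 H_i per neighbour
                    agg = agg + msgs.mean(dim=0)             # 1/|N_i^r| normalization
            out[i] = torch.relu(agg)
        return out

# Illustrative dialogue graph: 4 utterances, relation 0 = same speaker,
# relation 1 = different speaker, within a context window of 1.
H = torch.randn(4, 32)
neighbours = {0: {0: [2], 2: [0]}, 1: {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}}
layer = RelationalGraphLayer(dim=32, num_relations=2)
H_next = layer(H, neighbours)
```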

GAT is a variant of GCN that aggregates the features of neighboring nodes through learnable attention weights. GAT captures the more important node features in the graph by computing the similarity between nodes. The formula for GAT is defined as follows:

(29) $\boldsymbol{H}_{i}^{(l+1)}=ReLU\left(\sum_{j\in N(w_{i})}\alpha_{ij}^{(l+1)}\boldsymbol{W}^{(l+1)}\boldsymbol{h}_{j}^{(l)}\right)$

where $\alpha_{ij}$ is the attention weight of the edge between node $i$ and node $j$.

Similarly, the formula for using GAT to extract conversational relationships between speakers is defined as follows:

(30) $H_{i}^{+(l+1)}=ReLU\left(\sum_{r\in\mathcal{R}}\sum_{j\in\mathcal{N}_{i}^{r}}\frac{1}{|\mathcal{N}_{i}^{r}|}\left(\alpha_{ij}^{(l)}W_{\theta_{1}}^{(l)}H_{j}^{+(l)}+\alpha_{ii}^{(l)}W_{\theta_{2}}^{(l)}H_{i}^{+(l)}\right)\right)$

GNN-based multi-modal methods are the current mainstream line of research, since they can consider contextual information and speaker relationship information simultaneously (Ghosal et al., 2019). To jointly learn sequential context modeling, multi-modal information interaction, and multi-task representation, Zhang et al. (Zhang et al., 2023a) designed a multi-modal, multi-task interactive graph attention network (M3GAT) that simultaneously models context dependencies, multi-modal emotional interactions, and speaker dependencies. M3GAT achieves cross-modal feature interaction, captures sequential contextual semantic information, and exploits the correlation between tasks.

Figure 10. The flowchart of RGAT. RGAT mainly includes a dialogue relation dependency graph, a speaker dependency graph, and position encoding information.

Given that existing graph fusion methods cause the model to lose important semantic information and fail to eliminate redundant information, Li et al. (Li et al., 2023) proposed a graph network based on cross-modal feature complementarity, which effectively extracts the speaker's context and interaction information through multiple hypothesis spaces of the graph. By performing different message aggregation over different nodes and edge relations in the graph, this method eliminates the heterogeneity between modalities, fuses modal information, and thereby extracts contextual information and speaker relationship information.

Existing MCER methods use GCN to model the conversational relationships between speakers; in particular, the most competitive methods use relational graph attention networks to model both the dependencies among conversational relations and their relative importance. However, existing GCN-based multi-modal conversational emotion recognition methods do not consider the sequential information of the conversational context. To address this problem, Ishiwatari et al. (Ishiwatari et al., 2020) introduced relational position encodings in RGAT to provide sequence information. The overall flowchart of RGAT is shown in Fig. 10.

The position encoding formula used by RGAT is defined as follows:

(31) $PE_{ijr}=\begin{cases}\max(-p,\min(p,j-i))&r=1,\ j\in\mathcal{N}^{1}(i)\\ \max(-p,\min(p,j-i))&r=2,\ j\in\mathcal{N}^{2}(i)\\ \max(-f,\min(f,j-i))&r=3,\ j\in\mathcal{N}^{3}(i)\\ \max(-f,\min(f,j-i))&r=4,\ j\in\mathcal{N}^{4}(i)\end{cases}$

where $PE_{ijr}$ represents the relative position distance between node $i$ and its neighbor node $j$ under relation type $r$. The relative distance is clipped to $p$ or $f$, which denote the past and future context window sizes, respectively. $\mathcal{N}^{r}(i)$ represents the neighborhood of node $i$ under relation type $r$. To make the position encoding learnable, a feed-forward network (FFN) is used to obtain the position embeddings.
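A small sketch of the clipped relative-position computation in Eq. (31) is shown below; treating $p$ and $f$ as the past and future context window sizes, and their default values, are assumptions based on the surrounding description.

```python
def relative_position(i: int, j: int, relation: int, p: int = 10, f: int = 10) -> int:
    """Clipped relative distance used as a positional feature in Eq. (31).
    Relations 1-2 (past context) clip to [-p, p]; relations 3-4 (future
    context) clip to [-f, f]. p and f are assumed window sizes."""
    window = p if relation in (1, 2) else f
    return max(-window, min(window, j - i))

# Illustrative usage.
print(relative_position(7, 3, relation=1, p=10))   # -4
print(relative_position(0, 25, relation=3, f=10))  # 10 (clipped to the window)
```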

4.5. Emotion Classification

After obtaining the multi-modal emotion feature representation, the MCER task uses a multi-layer perceptron and a softmax layer to achieve the final emotion classification. The probability distribution of emotion categories is as follows:

(32) $l_{t}=\operatorname{ReLU}(W_{l}e_{t}+b_{l})$
$\mathcal{P}_{t}=\operatorname{softmax}(Wl_{t}+b)$
$\hat{y}_{t}=\underset{i}{\operatorname{argmax}}(\mathcal{P}_{t}[i])$

where $W_{l}$, $W$, $b_{l}$, and $b$ are learnable parameters, $\mathcal{P}_{t}$ is the probability distribution over emotion categories, and $\hat{y}_{t}$ is the predicted label.
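A minimal sketch of the classification head in Eq. (32) is given below; the hidden size and the number of emotion classes are illustrative.

```python
import torch
import torch.nn as nn

class EmotionClassifier(nn.Module):
    """Minimal sketch of Eq. (32): an MLP + softmax over the fused
    emotion representation e_t."""

    def __init__(self, d_emotion: int = 100, d_hidden: int = 64, num_classes: int = 6):
        super().__init__()
        self.fc = nn.Linear(d_emotion, d_hidden)      # l_t = ReLU(W_l e_t + b_l)
        self.out = nn.Linear(d_hidden, num_classes)   # P_t = softmax(W l_t + b)

    def forward(self, e_t: torch.Tensor) -> torch.Tensor:
        l_t = torch.relu(self.fc(e_t))
        return torch.softmax(self.out(l_t), dim=-1)

# Illustrative usage: the predicted label is the argmax of the class distribution.
probs = EmotionClassifier()(torch.randn(8, 100))      # (batch, num_classes)
y_hat = probs.argmax(dim=-1)
```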

5. Evaluation Metrics

For MCER tasks, there are four commonly used evaluation metrics, i.e., accuracy, weighted average accuracy (WA), F1 score, and weighted average F1 (WF1). These four metrics are defined as follows:

We assume that $N$ is the number of emotion labels in the conversational emotion dataset and that $E_{j}$ represents the total number of samples with the $j$-th emotion label, $j\in[1,N]$.

1) Accuracy represents the emotion recognition accuracy of the model, and the formula is defined as follows:

(33) $\operatorname{Accuracy}_{j}=\frac{\sum_{i=1}^{\vartheta_{2}}E_{j}^{i}}{\sum_{m=1}^{\vartheta_{1}}S_{j}^{m}}$

where $\vartheta_{1}$ is the number of ground-truth samples of the $j$-th emotion and $\vartheta_{2}$ is the number of samples the model predicts as the $j$-th emotion. $E_{j}^{i}\in\{0,1\}$ indicates whether the $i$-th sample of the $j$-th emotion is predicted correctly, and $S_{j}^{m}$ represents the $m$-th sample of the $j$-th emotion. The larger the value of $\operatorname{Accuracy}_{j}$, the better the model recognizes the $j$-th emotion.

2) The F1 value is the F1-score of each emotion, and the formula is defined as follows:

(34) $F1_{j}=\frac{2\times\operatorname{Precision}\left(E_{TP}^{j},E_{FP}^{j}\right)\times\operatorname{Recall}\left(E_{TP}^{j},E_{FN}^{j}\right)}{\operatorname{Precision}\left(E_{TP}^{j},E_{FP}^{j}\right)+\operatorname{Recall}\left(E_{TP}^{j},E_{FN}^{j}\right)}$

and

(35) $\operatorname{Precision}\left(E_{TP}^{j},E_{FP}^{j}\right)=\frac{\left|E_{TP}^{j}\right|}{\left|E_{TP}^{j}\cup E_{FP}^{j}\right|}$
$\operatorname{Recall}\left(E_{TP}^{j},E_{FN}^{j}\right)=\frac{\left|E_{TP}^{j}\right|}{\left|E_{TP}^{j}\cup E_{FN}^{j}\right|}$

where $E_{TP}^{j}$ is the set of samples that the model correctly predicts as the $j$-th emotion, $E_{FN}^{j}$ is the set of samples of the $j$-th emotion that the model predicts as other emotions, and $E_{FP}^{j}$ is the set of samples from other emotions that the model predicts as the $j$-th emotion. $\operatorname{Precision}(E_{TP}^{j},E_{FP}^{j})$ is the model's precision on the $j$-th emotion, and $\operatorname{Recall}(E_{TP}^{j},E_{FN}^{j})$ is its recall on the $j$-th emotion. The F1 value combines the effects of both precision and recall; usually, the larger the F1, the better the prediction of the model.

3) Weighted accuracy (WA) is the weighted average of the classification accuracy over all emotion categories, where each category is weighted by its number of samples. The formula is defined as follows:

(36) $WA=\frac{\sum_{j=1}^{N}\left(\operatorname{Accuracy}_{j}\sum_{m=1}^{\vartheta_{1}}S_{j}^{m}\right)}{\sum_{j=1}^{N}\sum_{m=1}^{\vartheta_{1}}S_{j}^{m}}$

WA is the classification accuracy of the model combining all emotions. The larger the WA, the better the model performs on average across all classes.

4) Weighted F1 (WF1) is the weighted average of the F1 values over all emotion categories, where each category is weighted by its number of samples. The formula is defined as follows:

(37) $WF1=\frac{\sum_{j=1}^{N}\left(F1_{j}\sum_{m=1}^{\vartheta_{1}}S_{j}^{m}\right)}{\sum_{j=1}^{N}\sum_{m=1}^{\vartheta_{1}}S_{j}^{m}}$

WF1 is the F1 value of the model aggregated over all emotions and is another effective metric for evaluating model performance. In general, the larger the WF1, the better the average performance of the model across all classes.
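To make the metric definitions concrete, the following is a small NumPy sketch that computes per-class accuracy and F1 and their support-weighted averages (WA and WF1) in the spirit of Eqs. (33)-(37); the function name and the dummy labels are illustrative.

```python
import numpy as np

def weighted_metrics(y_true: np.ndarray, y_pred: np.ndarray):
    """Per-class accuracy/F1 and their support-weighted averages (WA, WF1),
    written out by hand to mirror Eqs. (33)-(37)."""
    classes = np.unique(y_true)
    n_total = len(y_true)
    wa, wf1 = 0.0, 0.0
    for c in classes:
        support = np.sum(y_true == c)                       # samples of class c
        tp = np.sum((y_true == c) & (y_pred == c))
        fp = np.sum((y_true != c) & (y_pred == c))
        fn = np.sum((y_true == c) & (y_pred != c))
        acc_c = tp / support if support else 0.0            # Eq. (33)
        prec = tp / (tp + fp) if (tp + fp) else 0.0          # Eq. (35)
        rec = tp / (tp + fn) if (tp + fn) else 0.0
        f1_c = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0  # Eq. (34)
        wa += support / n_total * acc_c                     # Eq. (36)
        wf1 += support / n_total * f1_c                     # Eq. (37)
    return wa, wf1

# Illustrative usage on dummy labels.
y_true = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0, 2])
print(weighted_metrics(y_true, y_pred))
```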

Table 6. Performance of different types of emotion recognition algorithms on publicly available datasets. The weighted F1 score is chosen as the evaluation metric.
Approaches Category Inputs Database Performance (%)
SAL (Wang et al., 2017) Context free T+A+V IEMOCAP/MELD 49.2/58.8
SVM (Rozgić et al., 2012) Context free T+A+V IEMOCAP/MELD 48.7/56.4
TFN (Zadeh et al., 2017) Context free T+A+V IEMOCAP/MELD 54.2/56.7
LFM (Zadeh et al., 2017) Context free T+A+V IEMOCAP/MELD 55.3/56.7
UniMSE (Hu et al., 2022b) Sequential context T+V+A IEMOCAP/MELD 70.7/65.5
bc-LSTM+Att (Poria et al., 2017) Sequential context T+V+A IEMOCAP/MELD 55.0/56.4
M2FNet (Chudasama et al., 2022) Sequential context T+V+A IEMOCAP/MELD 69.9/66.7
CESTa (Wang et al., 2020b) Sequential context T+V+A IEMOCAP/DailyDialog/MELD 67.1/63.1/58.4
CMN (Hazarika et al., 2018b) Sequential context T+V+A IEMOCAP 56.2
SACL-LSTM (Hu et al., 2023) Sequential context T+A+V IEMOCAP/MELD/EmoryNLP 69.2/66.5/39.7
Att-BiLSTM (Tzinis et al., 2018) Sequential context T+V+A IEMOCAP 62.9
DialogueCRN (Hu et al., 2021) Sequential context T+A+V IEMOCAP/MELD 66.2/58.39
EmoCaps (Li et al., 2022b) Sequential context T+V+A IEMOCAP/MELD 71.8/64.0
ICON (Hazarika et al., 2018a) Sequential context T+V+A IEMOCAP 63.5
DialogueRNN (Majumder et al., 2019) Distinguishing speakers T+V+A IEMOCAP/MELD 62.8/56.8
EmotionIC (Yingjian et al., 2023) Distinguishing speakers T+V+A IEMOCAP/DailyDialog/MELD/EmoryNLP 69.5/59.8/66.4/40.0
COIN (Zhang and Chai, 2021) Distinguishing speakers T+V+A IEMOCAP 65.4
COSMIC (Ghosal et al., 2020) Distinguishing speakers T+A+V IEMOCAP/DailyDialog/MELD/EmoryNLP 65.3/58.5/65.2/38.1
RGAT (Ishiwatari et al., 2020) Speaker relationship T+A+V IEMOCAP/DailyDialog/MELD/EmoryNLP 65.2/54.3/60.9/34.4
DialogueGCN (Ghosal et al., 2019) Speaker relationship T+V+A IEMOCAP/MELD 64.2/58.1
DAG-ERC (Shen et al., 2021) Speaker relationship T+A+V IEMOCAP/DailyDialog/MELD/EmoryNLP 68.0/59.3/63.7/39.0
MM-DFN (Hu et al., 2022a) Speaker relationship T+V+A IEMOCAP/MELD 68.2/59.5
GraphCFC (Li et al., 2023) Speaker relationship T+V+A IEMOCAP/MELD 68.9/58.9

6. Experimental Results

As shown in Table 6, we present the emotion recognition performance of different algorithms on multiple datasets (i.e., IEMOCAP, MELD, EmoryNLP, and DailyDialog). Each algorithm uses multi-modal data, and we group the MCER algorithms according to our taxonomy. The results show that context-free algorithms perform worst because they exploit the least semantic information and cannot obtain good emotional feature representations. The algorithms based on sequential context achieve a significant improvement over the context-free algorithms, which can be attributed to their ability to model the dependencies between contexts and to use contextual information to improve the emotional feature representations. The methods that distinguish speakers perform similarly to the sequential context modeling methods, and both are better than the context-free methods; this improvement can be attributed to their ability to dynamically capture the speaker state of each utterance and integrate it into the emotion representation. The speaker relationship modeling methods perform best and are currently the most popular. They construct the dialogue relationships between speakers through the inherent properties of the graph structure and extract the representation of these relationships with GCNs; in addition, they can simultaneously consider the dependency information of the sequential context.

Table 7. Emotion recognition performance of different MCER algorithms on each emotion category of the IEMOCAP dataset. The best result in each column is in bold.
Methods IEMOCAP
Happy Sad Neutral Angry Excited Frustrated
Acc. F1 Acc. F1 Acc. F1 Acc. F1 Acc. F1 Acc. F1
TextCNN (Kim, 2014) 27.73 29.81 57.14 53.83 34.36 40.13 61.12 52.47 46.11 50.09 62.94 55.78
bc-LSTM (Poria et al., 2017) 29.16 34.49 57.14 60.81 54.19 51.80 57.03 56.75 51.17 57.98 67.12 58.97
bc-LSTM+Att (Poria et al., 2017) 30.56 35.63 56.73 62.09 57.55 53.00 59.41 59.24 52.84 58.85 65.88 59.41
CMN (Hazarika et al., 2018b) 25.01 30.34 55.96 62.45 52.81 52.36 61.77 59.88 55.59 60.24 71.16 60.67
LFM (Liu et al., 2018a) 25.63 33.14 75.71 78.83 58.52 59.21 64.77 65.26 80.21 71.85 61.14 58.97
A-DMN (Xing et al., 2020) 43.15 50.64 69.47 76.88 63.05 62.92 63.53 56.56 88.34 77.91 53.34 55.72
DialogueRNN (Majumder et al., 2019) 25.63 33.11 75.14 78.85 58.56 59.24 64.76 65.23 80.27 71.85 61.16 58.97
DialogueGCN (Ghosal et al., 2019) 40.63 42.71 89.14 84.45 61.97 63.54 67.51 64.14 65.46 63.08 64.13 66.90
DialogueCRN (Hu et al., 2021) 71.47 51.93 75.82 78.25 66.17 59.86 78.53 64.16 68.95 77.72 54.91 60.18
SumAggGIN (Sheng et al., 2020) 56.74 54.22 86.85 79.17 62.95 65.32 64.64 62.28 76.21 78.43 63.42 61.67
DisGCN (Sun et al., 2021) 71.17 56.92 68.65 76.47 66.63 57.41 74.26 54.35 74.54 76.47 51.14 59.28
MM-DFN (Hu et al., 2022a) 40.17 42.22 74.27 78.98 69.13 66.42 70.25 69.97 76.99 75.56 68.58 66.33
M2FNet (Chudasama et al., 2022) 65.92 60.00 79.18 82.11 65.80 65.88 75.37 68.21 74.84 72.60 66.87 68.31
EmoCaps (Li et al., 2022b) 70.34 72.86 77.39 82.45 64.27 65.10 71.79 69.14 84.50 73.90 63.94 63.41
CT-Net (Lian et al., 2021) 47.97 51.36 78.01 79.94 69.08 65.82 72.98 67.21 85.35 78.74 52.27 58.83
LR-GCN (Ren et al., 2021) 54.24 55.51 81.67 79.14 59.13 63.84 69.47 69.02 76.37 74.05 68.26 68.91
Table 8. Emotion recognition performance of different MCER algorithms on each emotion category of the MELD dataset. The best result in each column is in bold.
Methods MELD
Neutral Surprise Fear Sadness Joy Disgust Anger
Acc. F1 Acc. F1 Acc. F1 Acc. F1 Acc. F1 Acc. F1 Acc. F1
TextCNN (Kim, 2014) 76.23 74.91 43.35 45.51 4.63 3.71 18.25 21.17 46.14 49.47 8.91 8.36 35.33 34.51
bc-LSTM (Poria et al., 2017) 78.45 73.84 46.82 47.71 3.84 5.46 22.47 25.19 51.61 51.34 4.31 5.23 36.71 38.44
bc-LSTM+Att (Poria et al., 2017) 70.45 75.55 46.43 46.35 0.00 0.00 21.77 16.27 49.30 50.72 0.00 0.00 41.77 40.71
A-DMN (Xing et al., 2020) 76.54 78.92 56.24 55.35 8.22 8.61 22.14 24.94 59.81 57.45 1.23 3.45 41.31 40.96
DialogueRNN (Majumder et al., 2019) 72.12 73.54 54.42 49.47 1.61 1.23 23.97 23.83 52.01 50.74 1.52 1.73 41.01 41.54
CT-Net (Lian et al., 2021) 75.61 77.45 51.32 52.76 5.14 10.09 30.91 32.56 54.31 56.08 11.62 11.27 42.51 44.65
DialogueCRN (Hu et al., 2021) 70.91 75.73 47.32 47.18 0.00 0.00 34.06 13.29 41.95 49.72 0.00 0.00 41.66 35.69
SumAggGIN (Sheng et al., 2020) 78.19 77.82 52.27 54.11 2.17 2.31 35.79 36.43 54.15 55.07 4.05 2.12 48.31 47.22
DisGCN (Sun et al., 2021) 70.84 76.67 42.71 46.13 1.17 1.55 32.08 16.97 50.03 50.17 2.35 1.99 38.25 39.97
MM-DFN (Hu et al., 2022a) 78.17 77.76 52.15 50.69 0.00 0.00 25.77 22.93 56.19 54.78 0.00 0.00 48.31 47.82
M2FNet (Chudasama et al., 2022) 72.88 67.98 72.76 58.66 5.57 3.45 50.09 47.03 68.49 65.50 17.69 25.24 57.33 55.25
EmoCaps (Li et al., 2022b) 75.24 77.12 63.57 63.19 3.45 3.03 43.78 42.52 58.34 57.05 7.01 7.69 58.79 57.54
LR-GCN (Ren et al., 2021) 81.51 80.83 55.42 57.11 0.00 0.00 36.36 36.96 62.21 65.84 7.32 11.07 52.63 54.74

In addition, we also report the emotion recognition performance of different MCER algorithms on each emotion category. As shown in Table 7, on the IEMOCAP dataset, the performance of each algorithm on individual emotions is consistent with the overall results introduced above. The context-free methods perform worst, with recognition accuracy on the “happy” emotion below 50%, while most algorithms from the other three categories exceed 60%, and some emotion categories exceed 80%. The performance of each algorithm on the MELD dataset is shown in Table 8; the results on most emotion categories are similar to those on IEMOCAP. It is worth noting that all methods perform poorly in recognizing the “fear” and “disgust” emotions, and the accuracy of some algorithms is even 0%. Observing the label distribution reveals that the MELD dataset suffers from a serious data imbalance problem, which leads to very poor emotion recognition performance on minority classes.

7. Applications of Multi-modal Conversational Emotion Analysis

Emotion recognition applies natural language processing, machine learning, and deep learning techniques to multi-modal data such as text, video, and audio to identify and analyze the emotional states expressed in that data (Deng and Ren, 2021). Therefore, analyzing and studying the emotion recognition problem has broad application value in many practical scenarios.

7.1. Social Media Analysis

Multi-modal conversational emotion recognition has many broad applications in social media analysis (Zhang et al., 2021). The most typical application is product improvement and innovation: by analyzing user comments and feedback on social media, companies can understand users' preferences and dissatisfaction with their products. This helps companies adjust product designs, improve functionality, and develop products that better meet user needs, so businesses can employ emotion analysis techniques to improve their products. In addition, emotion analysis can help advertisers understand users' emotional attitudes towards advertisements, thereby optimizing advertising content and strategies and improving advertising effectiveness.

7.2. Public Opinion Analysis

Multi-modal conversational emotion analysis also has wide application value in the field of opinion mining, as it can help mine and analyze people's opinions and emotions expressed in text, video, and audio (Tan et al., 2021). For example, in market research, researchers can use emotion analysis techniques to analyze users' opinions on different products as well as their purchase intentions, which has great potential value for developing marketing strategies.

7.3. Recommendation Systems

Multi-modal conversational emotion analysis in recommender systems can help make personalized recommendations more consistent with users' emotions and preferences (Ding et al., 2022). For example, a recommendation system can recommend products that users are more interested in according to consumers' emotional changes, and it can perform emotion analysis on multi-modal user reviews to enable real-time warning and handling of negative product evaluations.

7.4. Medical Care

Multi-modal conversational emotion analysis plays an important role in many aspects in the field of health care (Saganowski et al., 2022). It can help medical institutions and doctors better understand the current emotional state of patients, so as to give better treatment plans. For example, doctors can use emotion analysis technology to analyze patients’ medical records and symptom descriptions, so as to better understand the patient’s emotional state and help make more accurate diagnosis and treatment plans. Furthermore, in diagnosis and treatment decisions, understanding a patient’s emotional state is important for developing an appropriate medical regimen. Emotion analysis can help doctors make more targeted decisions.

7.5. Financial Field Analysis

In the field of financial analysis, emotion analysis can help financial practitioners and investors better understand the emotional state of the market and predict market trends, thereby helping investors make correct investment decisions (Gerczuk et al., 2021). For example, some financial institutions and financial practitioners use emotion analysis technology to calculate market emotion index to measure market emotion, which can provide investors with an objective reference value.

7.6. Social Robot

Multimodal conversational emotion recognition has many potential applications on social robots, which can enhance the capabilities of social robots and make them more intelligent and humane (Lee et al., 2022). Social robots can use multimodal emotion recognition to sense the emotional state of the users they interact with. This includes identifying users’ facial expressions, voice emotions, text emotions, and other modal emotional signals (Laban et al., 2022). The robot can then adjust its interaction to better meet the user’s emotional needs, providing support, comfort or entertainment. In addition, social robots can use MCER to better understand users’ needs and emotional states to provide personalized suggestions and assistance. For example, in the field of mental health, robots can provide targeted psychological support suggestions based on the user’s emotional state.

8. Research Challenges

Although deep learning technology has driven rapid progress on MCER tasks and many scholars have proposed state-of-the-art algorithms, building an accurate MCER model still faces several challenges.

8.1. Scarcity of Training Data

Multi-modal conversation emotion recognition models require sufficient and comprehensive emotional samples as a basis to achieve accurate prediction or classification of emotions. The existing multi-modal benchmark datasets IEMOCAP, MELD, and SEMAINE have only 11098, 5810, and 394 utterances, respectively. Unfortunately, although we can easily collect large amounts of multi-modal conversation data from channels such as social media, the emotion labeling process is often expensive and time-consuming. In addition, the collected multi-modal data inevitably suffer from problems such as ambiguous or multiple labels, making it very challenging to obtain sufficient labeled multi-modal data and in turn leading to the scarcity of multi-modal training data. Therefore, the scarcity of training data limits the effectiveness of current multi-modal conversational emotion recognition models.

8.2. Data is Heterogeneous and Noisy

Multi-modal conversation emotion recognition models need to fully eliminate heterogeneity and noise between modalities to achieve accurate prediction or classification of emotions. Multi-modal data is naturally heterogeneous, and features of different modalities differ greatly in their processing methods and representation forms. In addition, multi-modal conversation data contains a large amount of redundant or noisy information, and its emotion is usually determined by only a small amount of consistent key information, such as certain words in a sentence, a specific frequency band in speech, or a particular expression in a video. In some extreme cases, part of the modal information is essentially unavailable under noise interference, such as ambiguous sentence expressions, noisy speech, or occluded facial expressions. Therefore, the heterogeneity and noise of the data limit the effectiveness of current multi-modal conversational emotion recognition models.

8.3. Unbalanced Data Distribution

Multi-modal dialogue data suffers from serious sample imbalance, which strongly interferes with unbiased learning of the model. Multi-modal conversation emotion recognition models are built on cross-modal feature fusion, are driven by the samples of each emotion category, and are therefore easily affected by the number of samples per category. However, multi-modal conversation emotion data naturally suffers from class imbalance: a few emotion categories account for a large proportion of the samples, while most emotion categories account for only a small proportion. For example, in the MELD dataset, the “fear” emotion accounts for only 1.91% of the total samples and the “disgust” emotion only 2.61%. A similar sample distribution also exists in the SEMAINE benchmark. Small-sample categories are difficult to drive unbiased learning of the model, which seriously affects the model's prediction accuracy for these emotion categories. Therefore, the unbalanced sample distribution limits the effectiveness of current multi-modal conversational emotion recognition models.

8.4. Consistent Semantic Association

Multi-modal conversation emotion recognition requires the model to fully learn the consistent semantics between modalities to filter noise information and eliminate heterogeneity between modalities, which is the basis for building an accurate multimodal dialogue emotion recognition model. However, the consistent semantic association of multi-modal conversation is more complicated and is not only related to the multi-modal context but also to factors such as the conversation scene, the speaker’s own emotional inertia, and the speaker’s stimulation. In addition, multi-modal data are heterogeneous, each modality has differentiated representation and distribution characteristics in space, and some consistent semantic associations are hidden in the feature distribution space between modalities. Therefore, efficiently performing consistent semantic association is the primary issue that needs to be considered at the model level.

8.5. Complementary Semantic Capture

Multi-modal conversation emotion recognition models need to establish accurate and consistent semantic associations and capture complementary semantic features between modalities, which can expand the emotional representation capabilities of a single modality. However, unlike consistent semantics, complementary semantics represent the differences between modalities, and these differences may contain noise components. Therefore, consistency semantics and complementarity semantics are in a trade-off relationship, and how to balance them is another issue that needs to be considered at the model level.

8.6. Multi-model Collaboration

Multi-model collaboration is the third challenge at the model level in building accurate multi-modal conversation emotion recognition models. Multi-modal conversation emotion recognition often requires the collaboration of multiple models to complete the task, such as feature extraction models and feature fusion models. However, existing methods often perform task collaboration only at the data level and ignore the collaborative relationship between models. Therefore, to achieve ideal synergistic results, not only the characteristics of each modality and their interrelationships need to be considered, but also the collaborative relationships between models.

9. Future Work

9.1. Multi-modal Conversation Data Generation

Multi-modal conversational emotion recognition models require sufficient and comprehensive emotional samples as a basis. When sample data is scarce, training multi-modal conversation emotion recognition models without causing overfitting or underfitting problems is extremely challenging. However, the sample size of existing benchmark data sets is relatively small, and there is a common problem of data scarcity. Multimodal dialogue data generation can effectively alleviate this problem. However, the distribution of multi-modal conversation data is more complex, and traditional single-modal data generation or cross-modal data generation models cannot meet the requirements. Therefore, there is an urgent need to solve the problem of collaborative generation of multi-modal conversation data.

9.2. Multi-modal Feature Deep Fusion

Multi-modal feature fusion is crucial to the MCER task. The fused feature vector can represent the consistent semantics and complementary information between modalities. However, many different information interactions exist between multi-modalities, and many consistent or complementary features are hidden in multiple time series or local spatial correlations. Since multi-modal conversation data is heterogeneous and contains noise, there are significant differences in the temporal period and spatial distribution of different modal features, and the spatiotemporal importance between modalities is dynamic. Currently, few works consider this difference, and more efforts are still needed for deep fusion of multi-modal features.

9.3. Unbiased Emotional Learning

Many benchmark datasets in the field of multi-modal conversational emotion recognition suffer from serious class imbalance, that is, the majority emotion categories contain a large amount of data, while the minority emotion categories contain only a small amount. Under such unbalanced data, existing models tend to be biased towards fitting the majority emotions with abundant data and learn insufficiently on the minority emotions with few samples, which leads to poor recognition accuracy on small-sample emotion categories. Thus, the small-sample problem in multi-modal conversational emotion recognition urgently requires further research.

9.4. Incomplete Multi-modal Conversation Emotion Recognition

Each modality is not always available in real-world scenarios, which leads to the problem of incomplete modalities. For example, the speech may contain heavy noise, facial expressions may be occluded, or the lighting may be dim; in such cases, some modal information becomes unavailable due to noise interference. Requiring modal completeness reduces the applicability of multi-modal conversation emotion recognition methods. Therefore, deep learning-based cross-modal content recovery methods should continue to be developed to achieve multi-modal conversation emotion recognition with missing modalities.

9.5. Zero-shot Multi-modal Conversation Emotion Recognition

Affected by factors such as the complexity of emotions and the high cost of labeling, it is difficult to fully label some emotional samples. Furthermore, with the rapidly growing personal emotion annotation space, real-world emotion recognition systems may frequently encounter unseen emotion labels. Therefore, improving the generalization performance of emotion recognition models is an issue that needs to be considered. Deep methods utilizing zero-shot learning are expected to achieve better multi-modal dialogue emotion recognition.

9.6. Multi-modal Conversation Multi-label Emotion Recognition

In multi-modal conversation scenarios, existing emotion recognition models usually adopt single-label supervised learning. However, due to the ambiguity of emotions, emotion recognition in real life is often a multi-label task, and the single-label assumption greatly limits the application scenarios of multi-modal conversation emotion recognition. Therefore, the multi-label emotion recognition problem in multi-modal conversation scenarios should be considered in future work.

10. Conclusion

This paper reviews the latest research results in the field of multi-modal conversational emotion recognition. To allow readers to implement emotion recognition tasks better, we have collected popular data sets in this field and given relevant download links. Since text, video, and audio are unstructured data that cannot be directly input into a computer for computation, we summarize some publicly available feature extraction methods. We divide emotion recognition methods into four categories, i.e., context-free modeling, sequential context modeling, distinguishing speaker modeling, and speaker relationship modeling. This paper further discusses the challenges faced by existing methods and future research directions. According to the review of existing work, it is found that multi-modal emotion recognition mainly improves the effect of emotion recognition by modeling intra-modal and inter-modal complementary semantic information. We hope this review can shed some light on developments in this field.

11. Acknowledgments

This work is supported by National Natural Science Foundation of China (Grant No. 62372478, No. 61802444); the Research Foundation of Education Bureau of Hunan Province of China (Grant No. 22B0275, No.20B625).

References

  • Baltrusaitis et al. (2018) Tadas Baltrusaitis, Amir Zadeh, Yao Chong Lim, and Louis-Philippe Morency. 2018. Openface 2.0: Facial behavior analysis toolkit. In 2018 13th IEEE international conference on automatic face & gesture recognition (FG 2018). IEEE, 59–66.
  • Barreiro and Treglown (2020) Carmen Amador Barreiro and Luke Treglown. 2020. What makes an engaged employee? A facet-level approach to trait emotional intelligence as a predictor of employee engagement. Personality and Individual Differences 159 (2020), 109892.
  • Bhavan et al. (2019) Anjali Bhavan, Pankaj Chauhan, Rajiv Ratn Shah, et al. 2019. Bagged support vector machines for emotion recognition from speech. Knowledge-Based Systems 184 (2019), 104886.
  • Boopathy et al. (2019) Akhilan Boopathy, Tsui-Wei Weng, Pin-Yu Chen, Sijia Liu, and Luca Daniel. 2019. Cnn-cert: An efficient framework for certifying robustness of convolutional neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 3240–3247.
  • Busso et al. (2008) Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N Chang, Sungbok Lee, and Shrikanth S Narayanan. 2008. IEMOCAP: Interactive emotional dyadic motion capture database. Language resources and evaluation 42 (2008), 335–359.
  • Cambria et al. (2018) Erik Cambria, Devamanyu Hazarika, Soujanya Poria, Amir Hussain, and RBV Subramanyam. 2018. Benchmarking multimodal sentiment analysis. In Computational Linguistics and Intelligent Text Processing: 18th International Conference, CICLing 2017, Budapest, Hungary, April 17–23, 2017, Revised Selected Papers, Part II 18. Springer, 166–179.
  • Chatterjee et al. (2019) Ankush Chatterjee, Umang Gupta, Manoj Kumar Chinnakotla, Radhakrishnan Srikanth, Michel Galley, and Puneet Agrawal. 2019. Understanding emotions in text using deep learning and big data. Computers in Human Behavior 93 (2019), 309–317.
  • Chen et al. (2019) Yi-Chen Chen, Sung-Feng Huang, Hung-yi Lee, Yu-Hsuan Wang, and Chia-Hao Shen. 2019. Audio word2vec: Sequence-to-sequence autoencoding for unsupervised learning of audio segmentation and representation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 27, 9 (2019), 1481–1493.
  • Chudasama et al. (2022) Vishal Chudasama, Purbayan Kar, Ashish Gudmalwar, Nirmesh Shah, Pankaj Wasnik, and Naoyuki Onoe. 2022. M2fnet: Multi-modal fusion network for emotion recognition in conversation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4652–4661.
  • Cichosz and Slot (2007) Jarosław Cichosz and Krzysztof Slot. 2007. Emotion recognition in speech signal using emotion-extracting binary decision trees. Proceedings of Affective Computing and Intelligent Interaction (2007), 1–10.
  • Deng and Ren (2021) Jiawen Deng and Fuji Ren. 2021. A survey of textual emotion recognition and its challenges. IEEE Transactions on Affective Computing (2021).
  • Ding et al. (2022) Yi Ding, Neethu Robinson, Su Zhang, Qiuhao Zeng, and Cuntai Guan. 2022. Tsception: Capturing temporal dynamics and spatial asymmetry from EEG for emotion recognition. IEEE Transactions on Affective Computing (2022).
  • Du et al. (2020) Xingbo Du, Junchi Yan, Rui Zhang, and Hongyuan Zha. 2020. Cross-network skip-gram embedding for joint network alignment and link prediction. IEEE Transactions on Knowledge and Data Engineering 34, 3 (2020), 1080–1095.
  • Dumpala et al. (2023) Sri Harsha Dumpala, Katerina Dikaios, Sebastian Rodriguez, Ross Langley, Sheri Rempel, Rudolf Uher, and Sageev Oore. 2023. Manifestation of depression in speech overlaps with characteristics used to represent and recognize speaker identity. Scientific Reports 13, 1 (2023), 11155.
  • Frank et al. (2000) Eibe Frank, Leonard Trigg, Geoffrey Holmes, and Ian H Witten. 2000. Naive Bayes for regression. Machine Learning 41 (2000), 5–25.
  • Gan et al. (2022) Leilei Gan, Zhiyang Teng, Yue Zhang, Linchao Zhu, Fei Wu, and Yi Yang. 2022. Semglove: Semantic co-occurrences for glove from bert. IEEE/ACM Transactions on Audio, Speech, and Language Processing 30 (2022), 2696–2704.
  • Gerczuk et al. (2021) Maurice Gerczuk, Shahin Amiriparian, Sandra Ottl, and Bjorn W Schuller. 2021. Emonet: A transfer learning framework for multi-corpus speech emotion recognition. IEEE Transactions on Affective Computing (2021).
  • Ghosal et al. (2018) Deepanway Ghosal, Md Shad Akhtar, Dushyant Chauhan, Soujanya Poria, Asif Ekbal, and Pushpak Bhattacharyya. 2018. Contextual inter-modal attention for multi-modal sentiment analysis. In proceedings of the 2018 conference on empirical methods in natural language processing. 3454–3466.
  • Ghosal et al. (2020) Deepanway Ghosal, Navonil Majumder, Alexander Gelbukh, Rada Mihalcea, and Soujanya Poria. 2020. COSMIC: COmmonSense knowledge for eMotion Identification in Conversations. In Findings of the Association for Computational Linguistics: EMNLP 2020. 2470–2481.
  • Ghosal et al. (2019) Deepanway Ghosal, Navonil Majumder, Soujanya Poria, Niyati Chhaya, and Alexander Gelbukh. 2019. DialogueGCN: A Graph Convolutional Neural Network for Emotion Recognition in Conversation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
  • Ghosh and Anwar (2021) Shreya Ghosh and Tarique Anwar. 2021. Depression intensity estimation via social media: a deep learning approach. IEEE Transactions on Computational Social Systems 8, 6 (2021), 1465–1474.
  • Han et al. (2022) Kai Han, Yunhe Wang, Hanting Chen, Xinghao Chen, Jianyuan Guo, Zhenhua Liu, Yehui Tang, An Xiao, Chunjing Xu, Yixing Xu, et al. 2022. A survey on vision transformer. IEEE Transactions on Pattern Analysis and Machine Intelligence 45, 1 (2022), 87–110.
  • Han et al. (2021) Kai Han, An Xiao, Enhua Wu, Jianyuan Guo, Chunjing Xu, and Yunhe Wang. 2021. Transformer in transformer. Advances in Neural Information Processing Systems 34 (2021), 15908–15919.
  • Hardeniya and Borikar (2016) Tanvi Hardeniya and Dilipkumar A Borikar. 2016. Dictionary based approach to sentiment analysis-a review. International Journal of Advanced Engineering, Management and Science 2, 5 (2016), 239438.
  • Hazarika et al. (2018a) Devamanyu Hazarika, Soujanya Poria, Rada Mihalcea, Erik Cambria, and Roger Zimmermann. 2018a. Icon: Interactive conversational memory network for multimodal emotion detection. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2594–2604.
  • Hazarika et al. (2018b) Devamanyu Hazarika, Soujanya Poria, Amir Zadeh, Erik Cambria, Louis-Philippe Morency, and Roger Zimmermann. 2018b. Conversational memory network for emotion recognition in dyadic dialogue videos. In Proceedings of the conference. Association for Computational Linguistics. North American Chapter. Meeting, Vol. 2018. NIH Public Access, 2122.
  • Hazarika et al. (2021) Devamanyu Hazarika, Soujanya Poria, Roger Zimmermann, and Rada Mihalcea. 2021. Conversational transfer learning for emotion recognition. Information Fusion 65 (2021), 1–12.
  • Hazarika et al. (2020) Devamanyu Hazarika, Roger Zimmermann, and Soujanya Poria. 2020. Misa: Modality-invariant and-specific representations for multimodal sentiment analysis. In Proceedings of the 28th ACM international conference on multimedia. 1122–1131.
  • Hou et al. (2023) Mixiao Hou, Zheng Zhang, Chang Liu, and Guangming Lu. 2023. Semantic Alignment Network for Multi-Modal Emotion Recognition. IEEE Transactions on Circuits and Systems for Video Technology 33, 9 (2023), 5318–5329.
  • Hsu et al. (2018) Chao-Chun Hsu, Sheng-Yeh Chen, Chuan-Chun Kuo, Ting-Hao Huang, and Lun-Wei Ku. 2018. EmotionLines: An Emotion Corpus of Multi-Party Conversations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).
  • Hu et al. (2023) Dou Hu, Yinan Bao, Lingwei Wei, Wei Zhou, and Songlin Hu. 2023. Supervised Adversarial Contrastive Learning for Emotion Recognition in Conversations. arXiv preprint arXiv:2306.01505 (2023).
  • Hu et al. (2022a) Dou Hu, Xiaolong Hou, Lingwei Wei, Lianxin Jiang, and Yang Mo. 2022a. MM-DFN: Multimodal dynamic fusion network for emotion recognition in conversations. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 7037–7041.
  • Hu et al. (2021) Dou Hu, Lingwei Wei, and Xiaoyong Huai. 2021. DialogueCRN: Contextual Reasoning Networks for Emotion Recognition in Conversations. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics. 7042–7052.
  • Hu et al. (2022b) Guimin Hu, Ting-En Lin, Yi Zhao, Guangming Lu, Yuchuan Wu, and Yongbin Li. 2022b. UniMSE: Towards Unified Multimodal Sentiment Analysis and Emotion Recognition. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 7837–7851.
  • Hu et al. (2007) Hao Hu, Ming-Xing Xu, and Wei Wu. 2007. GMM supervector based SVM with spectral features for speech emotion recognition. In 2007 IEEE International Conference on Acoustics, Speech and Signal Processing-ICASSP’07, Vol. 4. IEEE, IV–413.
  • Huang et al. (2020) Jian Huang, Jianhua Tao, Bin Liu, Zheng Lian, and Mingyue Niu. 2020. Multimodal transformer fusion for continuous emotion recognition. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 3507–3511.
  • Ishiwatari et al. (2020) Taichi Ishiwatari, Yuki Yasuda, Taro Miyazaki, and Jun Goto. 2020. Relation-aware graph attention networks with relational position encodings for emotion recognition in conversations. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 7360–7370.
  • Kattenborn et al. (2021) Teja Kattenborn, Jens Leitloff, Felix Schiefer, and Stefan Hinz. 2021. Review on Convolutional Neural Networks (CNN) in vegetation remote sensing. ISPRS journal of photogrammetry and remote sensing 173 (2021), 24–49.
  • Khare and Bajaj (2020) Smith K Khare and Varun Bajaj. 2020. Time–frequency representation and convolutional neural network-based emotion recognition. IEEE transactions on neural networks and learning systems 32, 7 (2020), 2901–2909.
  • Kim et al. (2021) Byoungjae Kim, Jungyun Seo, and Myoung-Wan Koo. 2021. Randomly wired network based on RoBERTa and dialog history attention for response selection. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021), 2437–2442.
  • Kim (2014) Yoon Kim. 2014. Convolutional Neural Networks for Sentence Classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics.
  • Kollias and Zafeiriou (2020) Dimitrios Kollias and Stefanos Zafeiriou. 2020. Exploiting multi-cnn features in cnn-rnn based dimensional emotion recognition on the omg in-the-wild dataset. IEEE Transactions on Affective Computing 12, 3 (2020), 595–606.
  • Kumar et al. (2022) Devesh Kumar, Pavan Kumar V Patil, Ayush Agarwal, and SR Mahadeva Prasanna. 2022. Fake Speech Detection Using OpenSMILE Features. In International Conference on Speech and Computer. Springer, 404–415.
  • Kwon et al. (2021) Soonil Kwon et al. 2021. MLT-DNet: Speech emotion recognition using 1D dilated CNN based on multi-learning trick approach. Expert Systems with Applications 167 (2021), 114177.
  • Laban et al. (2022) Guy Laban, Val Morrison, Arvid Kappas, and Emily S Cross. 2022. Informal caregivers disclose increasingly more to a social robot over time. In Chi Conference on Human Factors in Computing Systems Extended Abstracts. 1–7.
  • Latif et al. (2020) Siddique Latif, Rajib Rana, Sara Khalifa, Raja Jurdak, Julien Epps, and Björn W Schuller. 2020. Multi-task semi-supervised adversarial autoencoding for speech emotion recognition. IEEE Transactions on Affective computing 13, 2 (2020), 992–1004.
  • Lee et al. (2011) Chi-Chun Lee, Emily Mower, Carlos Busso, Sungbok Lee, and Shrikanth Narayanan. 2011. Emotion recognition using a hierarchical binary decision tree approach. Speech Communication 53, 9-10 (2011), 1162–1171.
  • Lee et al. (2022) Christine P Lee, Bengisu Cagiltay, and Bilge Mutlu. 2022. The unboxing experience: Exploration and design of initial interactions between children and social robots. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems. 1–14.
  • Li et al. (2020) Jingye Li, Donghong Ji, Fei Li, Meishan Zhang, and Yijiang Liu. 2020. Hitrans: A transformer-based context-and speaker-sensitive model for emotion detection in conversations. In Proceedings of the 28th International Conference on Computational Linguistics. 4190–4200.
  • Li et al. (2023) Jiang Li, Xiaoping Wang, Guoqing Lv, and Zhigang Zeng. 2023. Graphcfc: A directed graph based cross-modal feature complementation approach for multimodal conversational emotion recognition. IEEE Transactions on Multimedia (2023).
  • Li et al. (2022a) Wei Li, Wei Shao, Shaoxiong Ji, and Erik Cambria. 2022a. BiERU: Bidirectional emotional recurrent unit for conversational sentiment analysis. Neurocomputing 467 (2022), 73–82.
  • Li et al. (2017) Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. 2017. DailyDialog: A Manually Labelled Multi-turn Dialogue Dataset. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 986–995.
  • Li et al. (2022b) Zaijing Li, Fengxiao Tang, Ming Zhao, and Yusen Zhu. 2022b. EmoCaps: Emotion Capsule based Model for Conversational Emotion Recognition. In Findings of the Association for Computational Linguistics: ACL 2022. 1610–1618.
  • Li et al. (2022c) Ziming Li, Yan Zhou, Weibo Zhang, Yaxin Liu, Chuanpeng Yang, Zheng Lian, and Songlin Hu. 2022c. AMOA: Global acoustic feature enhanced modal-order-aware network for multimodal sentiment analysis. In Proceedings of the 29th International Conference on Computational Linguistics. 7136–7146.
  • Lian et al. (2023) Zheng Lian, Lan Chen, Licai Sun, Bin Liu, and Jianhua Tao. 2023. GCNet: graph completion network for incomplete multimodal learning in conversation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023).
  • Lian et al. (2021) Zheng Lian, Bin Liu, and Jianhua Tao. 2021. CTNet: Conversational transformer network for emotion recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021), 985–1000.
  • Lin and Wei (2005) Yi-Lin Lin and Gang Wei. 2005. Speech emotion recognition based on HMM and SVM. In 2005 International Conference on Machine Learning and Cybernetics, Vol. 8. IEEE, 4898–4901.
  • Lin et al. (2022) Zijie Lin, Bin Liang, Yunfei Long, Yixue Dang, Min Yang, Min Zhang, and Ruifeng Xu. 2022. Modeling intra-and inter-modal relations: Hierarchical graph contrastive learning for multimodal sentiment analysis. In Proceedings of the 29th International Conference on Computational Linguistics. 7124–7135.
  • Liu et al. (2022) Yang Liu, Haoqin Sun, Wenbo Guan, Yuqi Xia, and Zhen Zhao. 2022. Multi-modal speech emotion recognition using self-attention mechanism and multi-scale fusion framework. Speech Communication 139 (2022), 1–9.
  • Liu et al. (2018a) Zhun Liu, Ying Shen, Varun Bharadhwaj Lakshminarasimhan, Paul Pu Liang, AmirAli Bagher Zadeh, and Louis-Philippe Morency. 2018a. Efficient Low-rank Multimodal Fusion With Modality-Specific Factors. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2247–2256.
  • Liu et al. (2018b) Zhen-Tao Liu, Min Wu, Wei-Hua Cao, Jun-Wei Mao, Jian-Ping Xu, and Guan-Zheng Tan. 2018b. Speech emotion recognition based on feature selection and extreme learning machine decision tree. Neurocomputing 273 (2018), 271–280.
  • Lotfian and Busso (2019) Reza Lotfian and Carlos Busso. 2019. Curriculum learning for speech emotion recognition from crowdsourced labels. IEEE/ACM Transactions on Audio, Speech, and Language Processing 27, 4 (2019), 815–826.
  • Lu et al. (2023) Yuanyuan Lu, Zengzhao Chen, Qiuyu Zheng, Yanhui Zhu, and Mengke Wang. 2023. Exploring multimodal data analysis for emotion recognition in teachers’ teaching behavior based on LSTM and MSCNN. Soft Computing (2023), 1–8.
  • Ma et al. (2019) Jiaxin Ma, Hao Tang, Wei-Long Zheng, and Bao-Liang Lu. 2019. Emotion recognition using multimodal residual LSTM network. In Proceedings of the 27th ACM International Conference on Multimedia. 176–183.
  • Ma et al. (2021) Tinghuai Ma, Qian Pan, Huan Rong, Yurong Qian, Yuan Tian, and Najla Al-Nabhan. 2021. T-BERTSum: Topic-aware text summarization based on BERT. IEEE Transactions on Computational Social Systems 9, 3 (2021), 879–890.
  • Mai et al. (2019a) Sijie Mai, Haifeng Hu, and Songlong Xing. 2019a. Divide, conquer and combine: Hierarchical feature fusion network with local and global perspectives for multimodal affective computing. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 481–492.
  • Mai et al. (2019b) Sijie Mai, Songlong Xing, and Haifeng Hu. 2019b. Locally confined modality fusion network with a global perspective for multimodal human affective computing. IEEE Transactions on Multimedia 22, 1 (2019), 122–137.
  • Mai et al. (2022) Sijie Mai, Ying Zeng, Shuangjia Zheng, and Haifeng Hu. 2022. Hybrid contrastive learning of tri-modal representation for multimodal sentiment analysis. IEEE Transactions on Affective Computing (2022).
  • Majumder et al. (2019) Navonil Majumder, Soujanya Poria, Devamanyu Hazarika, Rada Mihalcea, Alexander Gelbukh, and Erik Cambria. 2019. DialogueRNN: An attentive RNN for emotion detection in conversations. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 6818–6825.
  • McKeown et al. (2011) Gary McKeown, Michel Valstar, Roddy Cowie, Maja Pantic, and Marc Schroder. 2011. The SEMAINE database: Annotated multimodal records of emotionally colored conversations between a person and a limited agent. IEEE Transactions on Affective Computing 3, 1 (2011), 5–17.
  • Meng et al. (2021) Tao Meng, Yuntao Shou, Wei Ai, Jiayi Du, Haiyan Liu, and Keqin Li. 2021. A Multi-Message Passing Framework Based on Heterogeneous Graphs in Conversational Emotion Recognition. Available at SSRN 4353605 (2021).
  • Morency et al. (2011) Louis-Philippe Morency, Rada Mihalcea, and Payal Doshi. 2011. Towards multimodal sentiment analysis: Harvesting opinions from the web. In Proceedings of the 13th International Conference on Multimodal Interfaces. 169–176.
  • Pandey et al. (2022) Sandeep Kumar Pandey, Hanumant Singh Shekhawat, and SRM Prasanna. 2022. Attention gated tensor neural network architectures for speech emotion recognition. Biomedical Signal Processing and Control 71 (2022), 103173.
  • Park et al. (2016) Sunghyun Park, Han Suk Shim, Moitreya Chatterjee, Kenji Sagae, and Louis-Philippe Morency. 2016. Multimodal analysis and prediction of persuasiveness in online social multimedia. ACM Transactions on Interactive Intelligent Systems (TiiS) 6, 3 (2016), 1–25.
  • Pérez-Rosas et al. (2013) Verónica Pérez-Rosas, Rada Mihalcea, and Louis-Philippe Morency. 2013. Utterance-level multimodal sentiment analysis. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 973–982.
  • Pham et al. (2019) Hai Pham, Paul Pu Liang, Thomas Manzini, Louis-Philippe Morency, and Barnabás Póczos. 2019. Found in translation: Learning robust joint representations by cyclic translations between modalities. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 6892–6899.
  • Poria et al. (2015) Soujanya Poria, Erik Cambria, and Alexander Gelbukh. 2015. Deep convolutional neural network textual features and multiple kernel learning for utterance-level multimodal sentiment analysis. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. 2539–2544.
  • Poria et al. (2017) Soujanya Poria, Erik Cambria, Devamanyu Hazarika, Navonil Majumder, Amir Zadeh, and Louis-Philippe Morency. 2017. Context-dependent sentiment analysis in user-generated videos. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 873–883.
  • Poria et al. (2016) Soujanya Poria, Iti Chaturvedi, Erik Cambria, and Amir Hussain. 2016. Convolutional MKL based multimodal emotion recognition and sentiment analysis. In 2016 IEEE 16th International Conference on Data Mining (ICDM). IEEE, 439–448.
  • Poria et al. (2019) Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, and Rada Mihalcea. 2019. MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.
  • Rahman et al. (2020) Wasifur Rahman, Md Kamrul Hasan, Sangwu Lee, Amir Zadeh, Chengfeng Mao, Louis-Philippe Morency, and Ehsan Hoque. 2020. Integrating multimodal information in large pretrained transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2359–2369.
  • Rajamani et al. (2021) Srividya Tirunellai Rajamani, Kumar T Rajamani, Adria Mallol-Ragolta, Shuo Liu, and Björn Schuller. 2021. A novel attention-based gated recurrent unit and its efficacy in speech emotion recognition. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 6294–6298.
  • Ren et al. (2021) Minjie Ren, Xiangdong Huang, Wenhui Li, Dan Song, and Weizhi Nie. 2021. LR-GCN: Latent relation-aware graph convolutional network for conversational emotion recognition. IEEE Transactions on Multimedia 24 (2021), 4422–4432.
  • Ren et al. (2023) Minjie Ren, Xiangdong Huang, Jing Liu, Ming Liu, Xuanya Li, and An-An Liu. 2023. MALN: Multimodal Adversarial Learning Network for Conversational Emotion Recognition. IEEE Transactions on Circuits and Systems for Video Technology (2023).
  • Rozgić et al. (2012) Viktor Rozgić, Sankaranarayanan Ananthakrishnan, Shirin Saleem, Rohit Kumar, and Rohit Prasad. 2012. Ensemble of SVM trees for multimodal emotion recognition. In Proceedings of the 2012 Asia Pacific Signal and Information Processing Association Annual Summit and Conference. IEEE, 1–4.
  • Saganowski et al. (2022) Stanislaw Saganowski, Bartosz Perz, Adam Polak, and Przemyslaw Kazienko. 2022. Emotion recognition for everyday life using physiological signals from wearables: A systematic literature review. IEEE Transactions on Affective Computing (2022).
  • Schepker et al. (2020) Henning Schepker, Florian Denk, Birger Kollmeier, and Simon Doclo. 2020. Acoustic transparency in hearables—Perceptual sound quality evaluations. Journal of the Audio Engineering Society 68, 7/8 (2020), 495–507.
  • Seng et al. (2016) Kah Phooi Seng, Li-Minn Ang, and Chien Shing Ooi. 2016. A combined rule-based & machine learning audio-visual emotion recognition approach. IEEE Transactions on Affective Computing 9, 1 (2016), 3–13.
  • Shen et al. (2021) Weizhou Shen, Siyue Wu, Yunyi Yang, and Xiaojun Quan. 2021. Directed Acyclic Graph Network for Conversational Emotion Recognition. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics. 1551–1560.
  • Sheng et al. (2020) Dongming Sheng, Dong Wang, Ying Shen, Haitao Zheng, and Haozhuang Liu. 2020. Summarize before aggregate: a global-to-local heterogeneous graph inference network for conversational emotion recognition. In Proceedings of the 28th International Conference on Computational Linguistics. 4153–4163.
  • Shou et al. (2023a) Yuntao Shou, Wei Ai, and Tao Meng. 2023a. Graph Information Bottleneck for Remote Sensing Segmentation. arXiv preprint arXiv:2312.02545 (2023).
  • Shou et al. (2023b) Yuntao Shou, Wei Ai, Tao Meng, and Keqin Li. 2023b. CZL-CIAE: CLIP-driven Zero-shot Learning for Correcting Inverse Age Estimation. arXiv preprint arXiv:2312.01758 (2023).
  • Shou et al. (2022a) Yuntao Shou, Tao Meng, Wei Ai, Canhao Xie, Haiyan Liu, and Yina Wang. 2022a. Object Detection in Medical Images Based on Hierarchical Transformer and Mask Mechanism. Computational Intelligence and Neuroscience 2022 (2022).
  • Shou et al. (2022b) Yuntao Shou, Tao Meng, Wei Ai, Sihan Yang, and Keqin Li. 2022b. Conversational emotion recognition studies based on graph convolutional neural networks and a dependent syntactic analysis. Neurocomputing 501 (2022), 629–639.
  • Suman et al. (2022) Shubham Suman, Kshira Sagar Sahoo, Chandramouli Das, NZ Jhanjhi, and Ambik Mitra. 2022. Visualization of audio files using librosa. In Proceedings of 2nd International Conference on Mathematical Modeling and Computational Science: ICMMCS 2021. Springer, 409–418.
  • Sun et al. (2021) Yang Sun, Nan Yu, and Guohong Fu. 2021. A discourse-aware graph neural network for emotion recognition in multi-party conversation. In Findings of the Association for Computational Linguistics: EMNLP 2021. 2949–2958.
  • Sun et al. (2020) Zhongkai Sun, Prathusha Sarma, William Sethares, and Yingyu Liang. 2020. Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 8992–8999.
  • Sun et al. (2023) Zhe Sun, Hehao Zhang, Jiatong Bai, Mingyang Liu, and Zhengping Hu. 2023. A discriminatively deep fusion approach with improved conditional GAN (im-cGAN) for facial expression recognition. Pattern Recognition 135 (2023), 109157.
  • Tan et al. (2021) Liang Tan, Keping Yu, Long Lin, Xiaofan Cheng, Gautam Srivastava, Jerry Chun-Wei Lin, and Wei Wei. 2021. Speech emotion recognition enhanced traffic efficiency solution for autonomous vehicles in a 5G-enabled space–air–ground integrated intelligent transportation system. IEEE Transactions on Intelligent Transportation Systems 23, 3 (2021), 2830–2842.
  • Tao and Liu (2018) Fei Tao and Gang Liu. 2018. Advanced LSTM: A study about better time dependency modeling in emotion recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2906–2910.
  • Tran et al. (2015) Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. 2015. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision. 4489–4497.
  • Tsai et al. (2019) Yao-Hung Hubert Tsai, Shaojie Bai, Paul Pu Liang, J Zico Kolter, Louis-Philippe Morency, and Ruslan Salakhutdinov. 2019. Multimodal transformer for unaligned multimodal language sequences. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 6558–6569.
  • Tu et al. (2022) Geng Tu, Jintao Wen, Cheng Liu, Dazhi Jiang, and Erik Cambria. 2022. Context- and sentiment-aware networks for emotion recognition in conversation. IEEE Transactions on Artificial Intelligence 3, 5 (2022), 699–708.
  • Tzinis et al. (2018) Efthymios Tzinis, Georgios Paraskevopoulos, Christos Baziotis, and Alexandros Potamianos. 2018. Integrating Recurrence Dynamics for Speech Emotion Recognition. Proc. Interspeech 2018 (2018), 927–931.
  • Wang et al. (2017) Haohan Wang, Aaksha Meghawat, Louis-Philippe Morency, and Eric P Xing. 2017. Select-additive learning: Improving generalization in multimodal sentiment analysis. In 2017 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 949–954.
  • Wang et al. (2022) Ning Wang, Hui Cao, Jun Zhao, Ruilin Chen, Dapeng Yan, and Jie Zhang. 2022. M2R2: Missing-modality robust emotion recognition framework with iterative data augmentation. IEEE Transactions on Artificial Intelligence (2022).
  • Wang et al. (2020a) Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A Efros. 2020a. CNN-generated images are surprisingly easy to spot… for now. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8695–8704.
  • Wang et al. (2019) Yansen Wang, Ying Shen, Zhun Liu, Paul Pu Liang, Amir Zadeh, and Louis-Philippe Morency. 2019. Words can shift: Dynamically adjusting word representations using nonverbal behaviors. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 7216–7223.
  • Wang et al. (2020b) Yan Wang, Jiayu Zhang, Jun Ma, Shaojun Wang, and Jing Xiao. 2020b. Contextualized emotion recognition in conversation as sequence tagging. In Proceedings of the 21st Annual Meeting of the Special Interest Group on Discourse and Dialogue. 186–195.
  • Wu et al. (2022) Yang Wu, Yanyan Zhao, Hao Yang, Song Chen, Bing Qin, Xiaohuan Cao, and Wenting Zhao. 2022. Sentiment Word Aware Multimodal Refinement for Multimodal Sentiment Analysis with ASR Errors. In Findings of the Association for Computational Linguistics: ACL 2022. 1397–1406.
  • Xie et al. (2019) Yue Xie, Ruiyu Liang, Zhenlin Liang, Chengwei Huang, Cairong Zou, and Björn Schuller. 2019. Speech emotion classification using attention-based LSTM. IEEE/ACM Transactions on Audio, Speech, and Language Processing 27, 11 (2019), 1675–1685.
  • Xing et al. (2020) Songlong Xing, Sijie Mai, and Haifeng Hu. 2020. Adapted dynamic memory network for emotion recognition in conversation. IEEE Transactions on Affective Computing 13, 3 (2020), 1426–1439.
  • Yang et al. (2022) Dingkang Yang, Shuai Huang, Yang Liu, and Lihua Zhang. 2022. Contextual and Cross-Modal Interaction for Multi-Modal Speech Emotion Recognition. IEEE Signal Processing Letters 29 (2022), 2093–2097.
  • Yin et al. (2022) Nan Yin, Li Shen, Baopu Li, Mengzhu Wang, Xiao Luo, Chong Chen, Zhigang Luo, and Xian-Sheng Hua. 2022. DEAL: An Unsupervised Domain Adaptive Framework for Graph-Level Classification. In Proceedings of the 30th ACM International Conference on Multimedia (MM ’22). Association for Computing Machinery, New York, NY, USA, 3470–3479. https://doi.org/10.1145/3503161.3548012
  • Yin et al. (2023a) Nan Yin, Li Shen, Mengzhu Wang, Long Lan, Zeyu Ma, Chong Chen, Xian-Sheng Hua, and Xiao Luo. 2023a. CoCo: A Coupled Contrastive Framework for Unsupervised Domain Adaptive Graph Classification. arXiv preprint arXiv:2306.04979 (2023).
  • Yin et al. (2023b) Nan Yin, Li Shen, Mengzhu Wang, Xiao Luo, Zhigang Luo, and Dacheng Tao. 2023b. OMG: Towards Effective Graph Classification Against Label Noise. IEEE Transactions on Knowledge and Data Engineering 35, 12 (2023), 12873–12886. https://doi.org/10.1109/TKDE.2023.3271677
  • Yin et al. (5555) N. Yin, L. Shen, H. Xiong, B. Gu, C. Chen, X. Hua, S. Liu, and X. Luo. 5555. Messages are Never Propagated Alone: Collaborative Hypergraph Neural Network for Time-Series Forecasting. IEEE Transactions on Pattern Analysis and Machine Intelligence 01 (nov 5555), 1–15. https://doi.org/10.1109/TPAMI.2023.3331389
  • Ying et al. (2021) RunKai Ying, Yuntao Shou, and Chang Liu. 2021. Prediction Model of Dow Jones Index Based on LSTM-Adaboost. In 2021 International Conference on Communications, Information System and Computer Engineering (CISCE). IEEE, 808–812.
  • Yingjian et al. (2023) Yingjian Liu, Jiang Li, Xiaoping Wang, and Zhigang Zeng. 2023. EmotionIC: Emotional Inertia and Contagion-driven Dependency Modelling for Emotion Recognition in Conversation. arXiv preprint arXiv:2303.11117 (2023).
  • Yuan et al. (2023) Ziqi Yuan, Yihe Liu, Hua Xu, and Kai Gao. 2023. Noise Imitation Based Adversarial Training for Robust Multimodal Sentiment Analysis. IEEE Transactions on Multimedia (2023).
  • Zadeh et al. (2017) Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2017. Tensor Fusion Network for Multimodal Sentiment Analysis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 1103–1114.
  • Zadeh et al. (2018a) Amir Zadeh, Paul Pu Liang, Navonil Mazumder, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2018a. Memory fusion network for multi-view sequential learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32.
  • Zadeh et al. (2018c) Amir Zadeh, Paul Pu Liang, Soujanya Poria, Prateek Vij, Erik Cambria, and Louis-Philippe Morency. 2018c. Multi-attention recurrent network for human communication comprehension. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32.
  • Zadeh et al. (2018b) AmirAli Bagher Zadeh, Paul Pu Liang, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2018b. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. 2236–2246.
  • Zahiri and Choi (2017) Sayyed M Zahiri and Jinho D Choi. 2017. Emotion detection on TV show transcripts with sequence-based convolutional neural networks. arXiv preprint arXiv:1708.04299 (2017).
  • Zhang and Chai (2021) Haidong Zhang and Yekun Chai. 2021. COIN: Conversational interactive networks for emotion recognition in conversation. In Proceedings of the Third Workshop on Multimodal Artificial Intelligence. 12–18.
  • Zhang et al. (2020) Jianhua Zhang, Zhong Yin, Peng Chen, and Stefano Nichele. 2020. Emotion recognition using multi-modal data and machine learning techniques: A tutorial and review. Information Fusion 59 (2020), 103–126.
  • Zhang et al. (2021) Ke Zhang, Yuanqing Li, Jingyu Wang, Erik Cambria, and Xuelong Li. 2021. Real-time video emotion recognition based on reinforcement learning and domain knowledge. IEEE Transactions on Circuits and Systems for Video Technology 32, 3 (2021), 1034–1047.
  • Zhang et al. (2023b) Qiongan Zhang, Lei Shi, Peiyu Liu, Zhenfang Zhu, and Liancheng Xu. 2023b. ICDN: integrating consistency and difference networks by transformer for multimodal sentiment analysis. Applied Intelligence 53, 12 (2023), 16332–16345.
  • Zhang et al. (2023a) Yazhou Zhang, Ao Jia, Bo Wang, Peng Zhang, Dongming Zhao, Pu Li, Yuexian Hou, Xiaojia Jin, Dawei Song, and Jing Qin. 2023a. M3GAT: A Multi-Modal Multi-Task Interactive Graph Attention Network for Conversational Sentiment Analysis and Emotion Recognition. ACM Transactions on Information Systems (2023).
  • Zhong et al. (2019) Peixiang Zhong, Di Wang, and Chunyan Miao. 2019. Knowledge-Enriched Transformer for Emotion Detection in Textual Conversations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. 165–176.
  • Zhu et al. (2021) Lixing Zhu, Gabriele Pergola, Lin Gui, Deyu Zhou, and Yulan He. 2021. Topic-driven and knowledge-aware transformer for dialogue emotion detection. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics. 1571–1582.
  • Zhu et al. (2023) Linan Zhu, Zhechao Zhu, Chenwei Zhang, Yifei Xu, and Xiangjie Kong. 2023. Multimodal sentiment analysis based on fusion methods: A survey. Information Fusion 95 (2023), 306–325.