ESIHGNN: Event-State Interactions Infused Heterogeneous Graph Neural Network for Conversational Emotion Recognition
Abstract
Conversational Emotion Recognition (CER) aims to predict the emotion expressed by an utterance (referred to as an “event”) during a conversation. Existing graph-based methods mainly focus on event interactions to comprehend the conversational context, while overlooking the direct influence of the speaker’s emotional state on the events. In addition, real-time modeling of the conversation is crucial for real-world applications but is rarely considered. Toward this end, we propose a novel graph-based approach, namely Event-State Interactions infused Heterogeneous Graph Neural Network (ESIHGNN), which incorporates the speaker’s emotional state and constructs a heterogeneous event-state interaction graph to model the conversation. Specifically, a heterogeneous directed acyclic graph neural network is employed to dynamically update and enhance the representations of events and emotional states at each turn, thereby improving conversational coherence and consistency. Furthermore, we enrich the graph’s edges with external knowledge to further improve CER performance. Experimental results on four publicly available CER datasets show the superiority of our approach and the effectiveness of the introduced heterogeneous event-state interaction graph.
Index Terms— Conversational Emotion Recognition, Event-State Interactions, Heterogeneous Knowledge Graph
1 Introduction
Evidence from psychology suggests that human actions are influenced by numerous factors, including environmental events and emotional ambiance [1, 2]. Understanding the decision-making process underlying human actions is crucial, as it can reflect individuals’ subjective evaluations of these factors. Recognizing emotions in conversations is an important foundation for this endeavor [3, 4].
Conversational Emotion Recognition (CER) is a crucial cognitive task that aims to identify the emotions conveyed by speakers through their utterances (referred to as “events”) during a conversation. CER has attracted significant attention in multiple research fields, as it serves as the foundation for developing conversational agents with emotional intelligence. These agents find applications in domains such as virtual reality therapy [5], social robotics [6], and smart home systems [7]. In reality, the flow of a conversation is influenced by both previous events and the emotional states of the participants, which can trigger new events and update their emotions. Therefore, accurately modeling the real-time conversation driven by previous events and emotional states is crucial for successful CER.
Recent studies have focused on integrating speaker information into conversation modeling. Two lines of approaches have been explored: recurrence-based methods and graph-based methods. Recurrence-based methods encode an event flow and a parallel state flow, and build interactions between them to dynamically model the conversation. For example, DialogueRNN [8] tracks the state of each speaker along the event flow using different Gated Recurrent Units (GRUs). However, it has limitations in terms of scalability and capacity to efficiently capture the interactions between events and speakers’ states. To tackle scalability, DialogueCRN [9], CoMPM [10], and DialogueINAB [11] propose a unified speaker-level emotion tracking module. To enhance natural interactions between events and states, COSMIC [12] extends DialogueRNN with external knowledge. However, it remains challenging to balance long- and short-term memory within each utterance. Moreover, these methods often require bidirectional context modeling, which sacrifices real-time conversational abilities while improving performance.
On the other hand, graph-based methods construct a conversation structure graph constrained by speaker identity to model the conversation using Graph Neural Networks (GNNs). DialogueGCN [13], for example, represents events as nodes and connections between speakers as edges to gather contextual information from neighboring nodes within a specific window. RGAT [14] expands upon DialogueGCN by including position encodings to consider sequential information. Meanwhile, DAG-ERC [15] constructs a directed acyclic graph inspired by the Directed Acyclic Graph Neural Network (DAGNN) [16] to sequentially model context. Similar to COSMIC, SKAIG [17] enhances event interactions with external knowledge, while knowledge also plays an essential role in both KET [18] and KI-Net [19]. In summary, graph-based methods effectively model conversations by constructing interaction networks. However, these methods overlook the emotional states of speakers during a conversation and focus only on event interactions.
According to the aforementioned observations, it is important to develop an event-state interaction graph-based approach that can capture effective interactions between events and the emotions of participants for real-time conversation modeling. This paper proposes a solution named ESIHGNN (Event-State Interactions infused Heterogeneous Graph Neural Network) to meet these requirements. Specifically, we construct a heterogeneous event-state graph, where the event node is initialized with the semantic feature of the utterance and the accompanying state node representing the speaker’s emotional state is initialized with the speaker identity. To support real-time conversation modeling and enhance natural interactions between the event nodes and state nodes, we define eight preliminary logical edge relations based on speaker identity, represented by external knowledge or trainable vectors. The resulting graph is then fed into the proposed HDAGNN (Heterogeneous Directed Acyclic Graph Neural Network), which recurrently updates each event and state node in a single layer by gathering information from previous events and emotional states. Additionally, we use GRUs to enhance the representations of event and state nodes at each turn, in order to improve conversational consistency. Our contributions are summarized as follows: (1) We propose ESIHGNN, a novel approach that, for the first time, treats a conversation as a heterogeneous event-state interaction graph for the CER task. (2) We introduce HDAGNN, a heterogeneous directed acyclic graph neural network that dynamically models interactions between events and speakers’ emotional states during real-time conversations. (3) Extensive experiments conducted on four benchmark datasets validate the effectiveness and superiority of the proposed ESIHGNN and confirm the advantages of the heterogeneous event-state interaction graph.

2 Methodology
Our ESIHGNN comprises three primary modules: graph construction, HDAGNN for feature transformation, and emotion prediction. Figure 1 illustrates the ESIHGNN framework.
2.1 Task Definition
In a conversation, a series of sequential utterances is encountered, denoted as $\{(u_1, p_1), (u_2, p_2), \dots, (u_N, p_N)\}$. Here, $u_i$ represents the utterance of the $i$-th turn, and $p_i$ indicates the speaker identity for $u_i$. Note that $u_i$ and $u_j$ may belong to the same speaker, but there must be at least two speakers participating in the conversation. The task of CER is to predict the emotion category for each utterance. For a target utterance $u_t$, its prediction relies only on the real-time sequence of utterance-speaker pairs observed so far: $\{(u_1, p_1), \dots, (u_t, p_t)\}$.
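To make the real-time constraint concrete, the following minimal Python sketch (ours, purely illustrative; the `Turn` and `realtime_contexts` names are not from the original implementation) enumerates, for each target utterance, the context a CER model is allowed to observe:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Turn:
    utterance: str  # u_i: the utterance of the i-th turn
    speaker: str    # p_i: the speaker identity of u_i

def realtime_contexts(conversation: List[Turn]):
    """Yield, for each target utterance u_t, the real-time context
    (u_1, p_1), ..., (u_t, p_t) that the model is allowed to observe."""
    for t in range(len(conversation)):
        yield conversation[: t + 1]  # no access to future turns

# Toy two-speaker conversation (illustrative only).
dialog = [Turn("I finally got the offer.", "A"),
          Turn("That is wonderful news!", "B"),
          Turn("I can hardly believe it.", "A")]
for context in realtime_contexts(dialog):
    print([turn.speaker for turn in context])
```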
2.2 Graph Construction
A new methodology for modeling conversation structure is presented, emphasizing the interactions between events and speakers’ emotional states. This methodology constructs a heterogeneous event-state interaction graph, denoted as $\mathcal{G} = (\mathcal{V}, \mathcal{E}, \mathcal{R})$, where $\mathcal{V}$ represents the node set consisting of event nodes $\{e_i\}_{i=1}^{N}$ and state nodes $\{s_i\}_{i=1}^{N}$. In this graph, an edge $(v_j, r, v_i) \in \mathcal{E}$ signifies a connection from a node $v_j$ to its neighboring node $v_i$ under relation type $r \in \mathcal{R}$, with its representation denoted by $r_{ji}$.
Nodes: For sequential utterances in a conversation, each utterance $u_i$ is regarded as an event node $e_i$, while the emotional state of $u_i$’s speaker is represented as a state node $s_i$. Clearly, both $e_i$ and $s_i$ correspond to the same speaker and turn. Consistent with COSMIC [12] and SKAIG [17], we employ a fine-tuned RoBERTa model [20] to encode the utterance $u_i$ as the initial event node feature $h_{e_i}^{(0)}$. The state node feature $h_{s_i}^{(0)}$ is initialized with a one-hot vector that indicates the speaker identity $p_i$. The context independence of these node features is essential for modeling real-time conversations.
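The sketch below illustrates, under stated assumptions, how the context-independent node features could be initialized. The paper uses a RoBERTa model fine-tuned as in COSMIC/SKAIG, which we approximate here with the public `roberta-large` checkpoint, and we assume the `<s>`-token embedding as the utterance pooling:

```python
import torch
from transformers import RobertaModel, RobertaTokenizer

# Stand-in for the fine-tuned RoBERTa encoder used in the paper.
tokenizer = RobertaTokenizer.from_pretrained("roberta-large")
encoder = RobertaModel.from_pretrained("roberta-large")

def init_event_node(utterance: str) -> torch.Tensor:
    """Context-independent event-node feature (1024-d for roberta-large)."""
    inputs = tokenizer(utterance, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state
    return hidden[:, 0]  # <s>-token embedding as the utterance feature (assumed pooling)

def init_state_node(speaker_idx: int, num_speakers: int) -> torch.Tensor:
    """Context-independent state-node feature: one-hot speaker identity."""
    return torch.nn.functional.one_hot(
        torch.tensor(speaker_idx), num_classes=num_speakers
    ).float()
```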
Relations: To simulate a real-time conversation driven by previous events and speakers’ emotional states, we suggest eight preliminary event-state interaction relations to capture the historical context.
Event Node Updating:
- xWant (intra-speaker interaction): Event $e_j$ passes the speaker’s action guidance to their subsequent event $e_i$.
- oWant (inter-speaker interaction): Event $e_j$ can coordinate and motivate the subsequent event $e_i$ triggered by another speaker.
- xDrive (intra-speaker interaction): The speaker’s emotional state $s_j$ internally influences the trend of their subsequent event $e_i$.
- oDrive (inter-speaker interaction): The speaker’s emotional state $s_j$, through contagion and empathy, externally influences the trend of the subsequent event $e_i$ triggered by another speaker.

State Node Updating:
- xReact (intra-speaker interaction): When event $e_j$ is executed by the speaker, their subsequent state $s_i$ is influenced by their reaction to this event.
- oReact (inter-speaker interaction): The reaction of another speaker to event $e_j$ influences that speaker’s subsequent state $s_i$.
- xDepend (intra-speaker interaction): The speaker’s emotional state $s_j$ at a given moment influences their subsequent emotional state $s_i$, indicating self-dependency.
- oDepend (inter-speaker interaction): The emotional states of different speakers influence and interact with each other, indicating inter-speaker dependency.
Edges: Given the set of relation types $\mathcal{R}$ = {xWant, oWant, xDrive, oDrive, xReact, oReact, xDepend, oDepend}, information can only flow from previous event/state nodes to the current ones along the edges in $\mathcal{E}$, and not in the opposite direction. To study the effect of the node connection range on the performance of CER, we introduce a window parameter $\omega$ that restricts the connections between the target node and the previous nodes of each speaker to a window of size $\omega$. With $\omega = 1$, only the most recent event and state nodes are considered predecessors for the target node.
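A minimal edge-construction sketch is given below (ours, not the original implementation). It assumes a per-speaker reading of the window: each target node is connected to at most the $\omega$ most recent prior turns of every speaker; the exact windowing used in the original implementation may differ.

```python
from typing import List, Tuple

def build_edges(speakers: List[str], omega: int = 1) -> List[Tuple[str, int, str, str, int]]:
    """Return directed edges (src_type, src_turn, relation, dst_type, dst_turn).

    Information flows only forward in time; for every speaker, each target
    node is connected to that speaker's `omega` most recent predecessor turns.
    """
    edges = []
    for i, p_i in enumerate(speakers):            # target turn i
        kept = {}                                  # per-speaker count of kept predecessors
        for j in range(i - 1, -1, -1):             # scan earlier turns, most recent first
            p_j = speakers[j]
            if kept.get(p_j, 0) >= omega:
                continue
            kept[p_j] = kept.get(p_j, 0) + 1
            same = (p_j == p_i)
            # event_j / state_j -> event_i  (event-node updating)
            edges.append(("event", j, "xWant" if same else "oWant", "event", i))
            edges.append(("state", j, "xDrive" if same else "oDrive", "event", i))
            # event_j / state_j -> state_i  (state-node updating)
            edges.append(("event", j, "xReact" if same else "oReact", "state", i))
            edges.append(("state", j, "xDepend" if same else "oDepend", "state", i))
    return edges

print(build_edges(["A", "B", "A"], omega=1))
```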
Edge Representations: External knowledge plays a crucial role in modeling conversations with fluency, coherence, and emotional contagion. To harness this potential, we employ COMET [21], an inferential, knowledge-based transformer model, to generate edge representations following the input format of COSMIC [12] and SKAIG [17]. For each edge $(v_j, r, v_i)$, COMET takes the concatenation of the predecessor utterance and the relation $r$ as input, and the hidden state of the relation token is extracted to serve as the edge representation $r_{ji}$. Note that COMET was trained on commonsense knowledge data consisting of explicit event prompts, enabling it to generate edge representations for predecessor event nodes. However, COMET cannot generate edge representations for edges whose predecessors are state nodes, i.e., relations with implicit head entities (xDrive, oDrive, xDepend, and oDepend). To address this limitation, we use 300-dimensional trainable random vectors for these edges, which allows the model to learn the implicit relations during training.
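The following sketch outlines how the two kinds of edge representations could be combined. The `comet_encode` callable stands in for the COMET forward pass (its implementation depends on the checkpoint used and is assumed here), and the projection of the 768-d knowledge features to the 300-d hidden size is also our assumption, made only to keep all edge features in one space:

```python
import torch
import torch.nn as nn

KNOWLEDGE_RELS = {"xWant", "oWant", "xReact", "oReact"}      # predecessor is an event node
IMPLICIT_RELS  = {"xDrive", "oDrive", "xDepend", "oDepend"}  # predecessor is a state node

class EdgeRepresentations(nn.Module):
    """Sketch of edge-feature construction under the assumptions stated above."""

    def __init__(self, comet_encode, knowledge_dim: int = 768, hidden_dim: int = 300):
        super().__init__()
        self.comet_encode = comet_encode                     # returns a (knowledge_dim,) tensor
        self.project = nn.Linear(knowledge_dim, hidden_dim)  # assumed projection to hidden size
        # One trainable 300-d vector per implicit (state-headed) relation.
        self.implicit = nn.ParameterDict(
            {rel: nn.Parameter(torch.randn(hidden_dim)) for rel in IMPLICIT_RELS}
        )

    def forward(self, predecessor_utterance: str, rel: str) -> torch.Tensor:
        if rel in KNOWLEDGE_RELS:                            # knowledge-based edge
            return self.project(self.comet_encode(predecessor_utterance, rel))
        return self.implicit[rel]                            # learned during training
```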
2.3 HDAGNN Layers
We will now describe the methodology for feature transformation using HDAGNN. In addition to establishing event-state interactions across different turns for conversational coherence (see Subsection 2.2), it is equally crucial to consider the event-state interactions within the same turn to improve conversational consistency. To achieve this, we propose establishing two implicit, bidirectional paths connecting the event and its accompanying emotional state.
Inter-Turn Event-State Interactions: To model real-time conversations, we use a dynamic and forward feature propagation strategy that enables current nodes to only receive information from previous nodes. For each node $v_i$, the weight coefficients $\alpha_{ij}^{(l)}$ are calculated by normalizing the attention scores between the hidden feature of $v_i$ at the $(l{-}1)$-th layer and those of its neighborhood $\mathcal{N}_i$ at the $l$-th layer:
$$\alpha_{ij}^{(l)} = \operatorname{softmax}_{j \in \mathcal{N}_i}\Big( \mathbf{W}_{\alpha}\big[\, \mathbf{W}_{1} h_{i}^{(l-1)} \,\|\, \mathbf{W}_{2} h_{j}^{(l)} \,\|\, \mathbf{W}_{3}\,[\, r_{ji} \,\|\, \mathrm{pos}_{j}\,] \,\big] \Big) \qquad (1)$$
where $\mathbf{W}_{\alpha}$, $\mathbf{W}_{1}$, $\mathbf{W}_{2}$, and $\mathbf{W}_{3}$ are trainable parameters, and $\mathrm{pos}_{j}$ signifies the positional information of node $v_j$ using two dimensions. The first dimension of $\mathrm{pos}_{j}$ indicates the absolute position, providing global sequential information throughout a conversation, while the second dimension denotes the relative position within the relation $r$, giving local sequential information within that relation.
Once obtained, the normalized coefficients are used to compute a linear combination of the features of $v_i$’s neighborhood $\mathcal{N}_i$, producing the aggregated information $m_i^{(l)}$ (referred to as the “message”):
$$m_{i}^{(l)} = \sum_{j \in \mathcal{N}_i} \alpha_{ij}^{(l)}\, h_{j}^{(l)} \qquad (2)$$
In HDAGNN, two GRUs, $\mathrm{GRU}_E$ and $\mathrm{GRU}_S$, are used to address the node heterogeneity and capture the sequential dependencies among nodes in the event-state interaction graph. Each GRU takes the past feature of the node as its input and the message as its hidden state:
$$h_{e_i}^{(l)} = \mathrm{GRU}_{E}\big( h_{e_i}^{(l-1)},\, m_{e_i}^{(l)} \big) \qquad (3)$$
$$h_{s_i}^{(l)} = \mathrm{GRU}_{S}\big( h_{s_i}^{(l-1)},\, m_{s_i}^{(l)} \big) \qquad (4)$$
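To make the propagation step concrete, the following sketch (ours) implements one inter-turn update for a single target node. The exact parameterization of the attention in Eq. (1) is only partially recoverable from the text, so the layout below, including how the edge representation and positional information enter the score, is an assumption; what the sketch preserves is the structure of Eqs. (1)-(4): attention of $h_i^{(l-1)}$ over the already-updated predecessors $h_j^{(l)}$, a weighted message, and a node-type-specific GRU that takes the past feature as input and the message as hidden state.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InterTurnLayer(nn.Module):
    """Sketch of one HDAGNN inter-turn step for a single target node (Eqs. 1-4)."""

    def __init__(self, dim: int = 300, edge_dim: int = 300, pos_dim: int = 2):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)            # transforms h_i^{(l-1)}
        self.w_k = nn.Linear(dim, dim, bias=False)            # transforms predecessor h_j^{(l)}
        self.w_e = nn.Linear(edge_dim + pos_dim, dim, bias=False)  # edge + positional info
        self.score = nn.Linear(3 * dim, 1, bias=False)        # scalar attention score
        self.gru_event = nn.GRUCell(dim, dim)                 # heterogeneity: one GRU per node type
        self.gru_state = nn.GRUCell(dim, dim)

    def forward(self, h_prev, neigh_h, neigh_edge, neigh_pos, node_type):
        # h_prev: (dim,) feature of v_i at layer l-1
        # neigh_h: (N, dim) predecessor features at layer l
        # neigh_edge: (N, edge_dim) edge representations; neigh_pos: (N, pos_dim)
        q = self.w_q(h_prev).expand(neigh_h.size(0), -1)
        k = self.w_k(neigh_h)
        e = self.w_e(torch.cat([neigh_edge, neigh_pos], dim=-1))
        alpha = F.softmax(self.score(torch.cat([q, k, e], dim=-1)).squeeze(-1), dim=0)  # Eq. 1
        message = (alpha.unsqueeze(-1) * neigh_h).sum(dim=0)                            # Eq. 2
        gru = self.gru_event if node_type == "event" else self.gru_state
        # Past feature as GRU input, aggregated message as hidden state (Eqs. 3-4).
        return gru(h_prev.unsqueeze(0), message.unsqueeze(0)).squeeze(0)
```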
Intra-Turn Event-State Interactions: Within a turn, there are two event-state interaction paths. The first path involves the speaker being influenced by the emotional context when executing events. As the emotional context of the $i$-th turn, we use the aggregated state message $m_{s_i}^{(l)}$ rather than the updated state feature $h_{s_i}^{(l)}$, which may have forgotten some emotional information:
$$\tilde{h}_{e_i}^{(l)} = \mathrm{GRU}_{SE}\big( h_{e_i}^{(l)},\, m_{s_i}^{(l)} \big) \qquad (5)$$
where $h_{e_i}^{(l)}$, $m_{s_i}^{(l)}$, and $\tilde{h}_{e_i}^{(l)}$ are the input, hidden state, and output of the GRU, respectively.
The second path involves the emotional state of the speaker at the $i$-th turn being influenced by the event context $m_{e_i}^{(l)}$:
$$\tilde{h}_{s_i}^{(l)} = \mathrm{GRU}_{ES}\big( h_{s_i}^{(l)},\, m_{e_i}^{(l)} \big) \qquad (6)$$
To ensure conversational coherence and consistency, the final hidden feature of node $v_i$ at the $l$-th layer is the addition of $h_{v_i}^{(l)}$ and $\tilde{h}_{v_i}^{(l)}$:
$$h_{v_i}^{(l)} \leftarrow h_{v_i}^{(l)} + \tilde{h}_{v_i}^{(l)} \qquad (7)$$
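The intra-turn interactions of Eqs. (5)-(7) can be sketched as follows. The roles of input and hidden state follow our reading of the text, with the node feature as the GRU input and the other node's aggregated context as the hidden state; this is a sketch under that assumption, not the definitive implementation.

```python
import torch
import torch.nn as nn

class IntraTurnInteraction(nn.Module):
    """Sketch of the intra-turn event-state interactions (Eqs. 5-7)."""

    def __init__(self, dim: int = 300):
        super().__init__()
        self.gru_state_to_event = nn.GRUCell(dim, dim)  # emotional context -> event
        self.gru_event_to_state = nn.GRUCell(dim, dim)  # event context -> state

    def forward(self, h_event, h_state, m_event, m_state):
        # h_event / h_state: (batch, dim) features after the inter-turn update at layer l
        # m_event / m_state: (batch, dim) pre-GRU aggregated messages ("contexts") at layer l
        h_event_tilde = self.gru_state_to_event(h_event, m_state)  # Eq. 5
        h_state_tilde = self.gru_event_to_state(h_state, m_event)  # Eq. 6
        # Residual addition of the two features (Eq. 7).
        return h_event + h_event_tilde, h_state + h_state_tilde
```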
2.4 Emotion Prediction
To preserve the original semantics of each node $v_i$, we employ a concatenation operation to combine its hidden features from each layer of HDAGNN, resulting in the final node representation $z_{v_i} = \big[\, h_{v_i}^{(0)} \,\|\, \cdots \,\|\, h_{v_i}^{(L)} \,\big]$. Additionally, considering node consistency within a conversation turn $i$, we take the sum of the event representation $z_{e_i}$ and the state representation $z_{s_i}$ as the emotion representation of the utterance $u_i$, and feed it into a fully connected layer to predict the emotion label:
$$P_i = \operatorname{softmax}\big( \mathbf{W}_{o}\,( z_{e_i} + z_{s_i} ) + \mathbf{b}_{o} \big) \qquad (8)$$
$$\hat{y}_i = \arg\max_{k}\, P_i[k] \qquad (9)$$
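A sketch of the prediction head in Eqs. (8)-(9) follows. Whether the layer-0 (initial) features are included in the concatenation is our assumption, motivated by the stated goal of preserving the original node semantics; the default class count is illustrative (e.g., six emotion classes for IEMOCAP).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmotionHead(nn.Module):
    """Sketch of Eqs. (8)-(9): concatenate per-layer features, sum event and
    state representations, then classify with a fully connected layer."""

    def __init__(self, dim: int = 300, num_layers: int = 1, num_classes: int = 6):
        super().__init__()
        # +1 assumes the initial (layer-0) features are part of the concatenation.
        self.fc = nn.Linear((num_layers + 1) * dim, num_classes)

    def forward(self, event_layers, state_layers):
        # event_layers / state_layers: lists of (batch, dim) features, one per layer.
        z_event = torch.cat(event_layers, dim=-1)
        z_state = torch.cat(state_layers, dim=-1)
        probs = F.softmax(self.fc(z_event + z_state), dim=-1)  # Eq. 8
        return probs.argmax(dim=-1), probs                      # Eq. 9
```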
3 Experiments
3.1 Setup
Datasets: Our ESIHGNN is evaluated on four widely used benchmark datasets: IEMOCAP [22], MELD [23], EmoryNLP [24], and DailyDialog [25]. The dataset statistics are presented in Table 1. To assess performance, we adopt the evaluation metrics used in previous studies [14, 26]: micro-averaged F1 for DailyDialog and weighted-average F1 for the other datasets.
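For reference, the metrics can be computed as in the sketch below; excluding the dominant neutral class from DailyDialog's micro-F1 follows the convention of the cited prior work and is an assumption about the exact protocol.

```python
from sklearn.metrics import f1_score

def cer_score(y_true, y_pred, dataset: str) -> float:
    """Micro-averaged F1 for DailyDialog (conventionally excluding the
    neutral class) and weighted-average F1 for the other datasets."""
    if dataset == "DailyDialog":
        labels = sorted(set(y_true) - {"neutral"})  # assumption: neutral excluded, as in [14, 26]
        return f1_score(y_true, y_pred, labels=labels, average="micro")
    return f1_score(y_true, y_pred, average="weighted")

print(cer_score(["joy", "anger", "neutral"], ["joy", "neutral", "neutral"], "DailyDialog"))
```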
Baselines: We compare our proposed ESIHGNN with state-of-the-art baseline methods, including recurrence-based methods: DialogueRNN [8], COSMIC [12], DialogueCRN [9], BiERU [27], MVN [28], CoMPM [10], and DialogueINAB [11]; and graph-based methods: DialogueGCN [13], KET [18], KI-Net [19], RGAT [14], SKAIG [17], DAG-ERC [15], CoG-BART [29], and MM-DFN [30].
Table 1: Dataset statistics (A. L. = average conversation length; A. S. = average number of speakers).

| Datasets | # Dialogues (train / val / test) | # Utterances (train / val / test) | A. L. / A. S. |
|---|---|---|---|
| IEMOCAP | 100 / 20 / 31 | 4810 / 1000 / 1623 | 50 / 2 |
| MELD | 1038 / 114 / 280 | 9989 / 1109 / 2610 | 10 / 2.7 |
| EmoryNLP | 713 / 99 / 85 | 9934 / 1344 / 1328 | 14 / 3.5 |
| DailyDialog | 11118 / 1000 / 1000 | 87170 / 8069 / 7740 | 8 / 2 |
Implementation Details: We implement our approach with PyTorch and train it with the AdamW optimizer on two RTX 3090 GPUs. We perform hyperparameter searches over the learning rate, dropout rate, batch size, and number of HDAGNN layers. We set $\omega = 1$ as the default setting for the overall performance comparisons; ablation results for $\omega$ ranging from 1 to 3 are presented in Subsection 3.3. The initial dimensions of the node features and edge features are 1024 and 768, respectively, and the dimension of all hidden layers is 300. The reported results of our approach are the average of five test runs.
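The hyperparameter search can be organized as in the following sketch; the concrete grid values are illustrative assumptions, not the ranges used in our experiments.

```python
import itertools

# Illustrative search grid for the hyperparameters mentioned above (assumed values).
search_space = {
    "lr": [1e-5, 5e-5, 1e-4],
    "dropout": [0.1, 0.3, 0.5],
    "batch_size": [16, 32],
    "num_layers": [1, 2, 3],  # number of HDAGNN layers
}

def configurations(space):
    """Enumerate every combination of hyperparameter values in the grid."""
    keys = list(space)
    for values in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

for cfg in configurations(search_space):
    # model = ESIHGNN(hidden_dim=300, num_layers=cfg["num_layers"], dropout=cfg["dropout"])
    # optimizer = torch.optim.AdamW(model.parameters(), lr=cfg["lr"])
    pass  # train and evaluate; keep the configuration with the best validation F1
```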
3.2 Results and Analysis
The results on the four datasets are summarized in Table 2. It can be observed that: (1) Our proposed ESIHGNN achieves state-of-the-art performance on all datasets except MELD, where its results remain competitive. This is probably attributable to the complex and ambiguous contexts of the short, multi-speaker conversations in MELD, which limit ESIHGNN’s ability to effectively capture event-state interactions. (2) Among recurrence-based methods, COSMIC outperforms DialogueRNN by incorporating external knowledge, while our method surpasses DAG-ERC among graph-based methods. These findings highlight the importance of incorporating external knowledge. (3) Our ESIHGNN surpasses the graph-based methods that do not consider future utterances on all datasets, confirming the superiority of the event-state connection structure and the effectiveness of ESIHGNN. Although considering future utterances can enhance a model’s understanding of the current context [27, 17], it is not practical for real-time conversations. (4) Graph-based methods consistently outperform recurrence-based methods on most CER datasets, indicating their superior capability to model conversational context. Our method achieves better results than existing graph-based methods on IEMOCAP and EmoryNLP, showing its strength in modeling conversations, particularly long ones (average length 50 in IEMOCAP).
Table 2: Overall results (%) of all methods on the four datasets.

| Method | IEMOCAP | MELD | EmoryNLP | DailyDialog |
|---|---|---|---|---|
| *Recurrence-based methods* | | | | |
| DialogueRNN | 64.76 | 63.61 | 37.44 | 57.32 |
| COSMIC | 65.28 | 65.21 | 38.11 | 58.48 |
| DialogueCRN | 66.20 | 58.39 | - | - |
| BiERU | 65.22 | 60.84 | - | - |
| MVN | 65.44 | 59.03 | - | - |
| CoMPM | 65.79 | 64.62 | 37.44 | 59.63 |
| DialogueINAB | 67.22 | 57.78 | - | - |
| *Graph-based methods* | | | | |
| DialogueGCN | 64.18 | 58.10 | - | - |
| KET | 59.56 | 58.18 | 34.39 | 53.37 |
| KI-Net | 66.98 | 63.24 | - | 57.30 |
| RGAT | 65.22 | 60.91 | 34.42 | 54.31 |
| SKAIG | 66.96 | 65.18 | 38.88 | 59.75 |
| DAG-ERC | 68.03 | 63.56 | 39.02 | 59.33 |
| CoG-BART | 66.18 | 64.81 | 39.04 | 56.29 |
| MM-DFN | 68.18 | 59.46 | - | - |
| ESIHGNN (ours) | 68.53 | 63.92 | 39.56 | 59.78 |
3.3 Ablation Studies
In our ablation study, we analyze the contributions of different components of ESIHGNN: edge construction, edge representation, IntraESI (intra-turn event-state interactions), and the window size $\omega$. The edge construction module distinguishes our work from previous approaches such as DAG-ERC [15] and SKAIG [17], as it allows the analysis of relations at a coarser scale; for example, “- Want” denotes the removal of the xWant and oWant relations. To evaluate the effect of different edge representations, we use “trainable”, which replaces all edge representations with trainable embeddings, and “0/1”, which limits the relation types to the binary set {0, 1} that only distinguishes intra- from inter-speaker edges. Table 3 presents the ablation results. It can be seen that: (1) Removing any of the coarse-grained relations in ESIHGNN leads to a decrease in overall performance, most notably for participants’ reactions to events (i.e., “- React”). This suggests that emotion recognition is influenced by both previous events and speakers’ emotional states. (2) Removing knowledge-based edges results in a more significant performance decline than removing trainable embedding-based edges. Additionally, both “trainable” and “0/1” perform poorly without external knowledge encoded in the edge representations. These findings suggest that external knowledge enhances the information interaction between nodes. (3) Ablating the IntraESI module noticeably decreases the performance of ESIHGNN, suggesting that there is an interplay between the speaker’s events and emotional states. (4) Increasing $\omega$ does not significantly affect the performance of ESIHGNN, indicating that event-state interactions are localized.
Table 3: Results (%) of the ablation studies.

| Method | IEMOCAP | MELD | EmoryNLP | DailyDialog |
|---|---|---|---|---|
| ESIHGNN | 68.53 | 63.92 | 39.56 | 59.78 |
| ESIHGNN ($\omega = 2$) | 68.22 | 63.88 | 39.42 | 59.56 |
| ESIHGNN ($\omega = 3$) | 68.17 | 63.91 | 39.51 | 59.69 |
| - Want | 67.70 | 63.80 | 39.38 | 59.52 |
| - Drive | 68.03 | 63.85 | 39.50 | 59.68 |
| - React | 67.56 | 63.76 | 39.18 | 59.22 |
| - Depend | 67.60 | 63.81 | 39.30 | 59.47 |
| trainable | 67.92 | 63.78 | 39.40 | 59.48 |
| 0/1 | 67.68 | 63.81 | 39.37 | 59.50 |
| - IntraESI | 67.29 | 63.78 | 39.05 | 59.03 |
4 Conclusion
This paper proposes ESIHGNN, a novel approach that incorporates the speaker’s emotional state into conversational context modeling for conversational emotion recognition, based on a heterogeneous directed acyclic graph neural network. The experimental results on four benchmark datasets demonstrate that our approach achieves competitive or state-of-the-art performance when compared with baselines. Moreover, our ablation studies confirm the effectiveness of event-state interactions and emphasize the superiority of knowledge-enriched edge representations. In future work, we plan to contribute a knowledge graph where emotion serves as the head entity, filling gaps in both state-to-state and state-to-event relations.
References
- [1] Joseph Paul Forgas, “Affect in social judgments and decisions: A multiprocess model,” in Advances in experimental social psychology, vol. 25, pp. 227–275. 1992.
- [2] Adam M Grant and Susan J Ashford, “The dynamics of proactivity at work,” Research in organizational behavior, vol. 28, pp. 3–34, 2008.
- [3] Gene Ball and Jack Breese, “Emotion and personality in a conversational agent,” Embodied conversational agents, vol. 189, pp. 189–219, 2000.
- [4] Yu-Ping Ruan, Shu-Kai Zheng, Taihao Li, Fen Wang, and Guanxiong Pei, “Hierarchical and multi-view dependency modelling network for conversational emotion recognition,” in Proc. ICASSP, 2022, pp. 7032–7036.
- [5] Paul MG Emmelkamp and Katharina Meyerbröker, “Virtual reality therapy in mental health,” Annual review of clinical psychology, vol. 17, pp. 495–519, 2021.
- [6] Alessandra Maria Sabelli, Takayuki Kanda, and Norihiro Hagita, “A conversational robot in an elderly care center: An ethnographic study,” in Proc. HRI, 2011, pp. 37–44.
- [7] Tom Young, Erik Cambria, Iti Chaturvedi, Hao Zhou, Subham Biswas, and Minlie Huang, “Augmenting end-to-end dialogue systems with commonsense knowledge,” in Proc. AAAI, 2018, pp. 4970–4977.
- [8] Navonil Majumder, Soujanya Poria, Devamanyu Hazarika, Rada Mihalcea, Alexander Gelbukh, and Erik Cambria, “Dialoguernn: An attentive rnn for emotion detection in conversations,” in Proc. AAAI, 2019, pp. 6818–6825.
- [9] Dou Hu, Lingwei Wei, and Xiaoyong Huai, “Dialoguecrn: Contextual reasoning networks for emotion recognition in conversations,” in Proc. ACL, 2021, pp. 7042–7052.
- [10] Joosung Lee and Wooin Lee, “Compm: Context modeling with speaker’s pre-trained memory tracking for emotion recognition in conversation,” in Proc. NAACL, 2022, pp. 5669–5679.
- [11] Junyuan Ding, Xiaoliang Chen, Peng Lu, Zaiyan Yang, Xianyong Li, and Yajun Du, “Dialogueinab: an interaction neural network based on attitudes and behaviors of interlocutors for dialogue emotion recognition,” The Journal of Supercomputing, pp. 1–34, 2023.
- [12] Deepanway Ghosal, Navonil Majumder, Alexander F. Gelbukh, Rada Mihalcea, and Soujanya Poria, “COSMIC: commonsense knowledge for emotion identification in conversations,” in Proc. EMNLP, 2020, pp. 2470–2481.
- [13] Deepanway Ghosal, Navonil Majumder, Soujanya Poria, Niyati Chhaya, and Alexander F. Gelbukh, “Dialoguegcn: A graph convolutional neural network for emotion recognition in conversation,” in Proc. EMNLP, 2019, pp. 154–164.
- [14] Taichi Ishiwatari, Yuki Yasuda, Taro Miyazaki, and Jun Goto, “Relation-aware graph attention networks with relational position encodings for emotion recognition in conversations,” in Proc. EMNLP, 2020, pp. 7360–7370.
- [15] Weizhou Shen, Siyue Wu, Yunyi Yang, and Xiaojun Quan, “Directed acyclic graph network for conversational emotion recognition,” in Proc. ACL, 2021, pp. 1551–1560.
- [16] Veronika Thost and Jie Chen, “Directed acyclic graph neural networks,” in ICLR, 2021.
- [17] Jiangnan Li, Zheng Lin, Peng Fu, and Weiping Wang, “Past, present, and future: Conversational emotion recognition through structural modeling of psychological knowledge,” in Proc. EMNLP, 2021, pp. 1204–1214.
- [18] Peixiang Zhong, Di Wang, and Chunyan Miao, “Knowledge-enriched transformer for emotion detection in textual conversations,” in Proc. EMNLP, 2019, pp. 165–176.
- [19] Yunhe Xie, Kailai Yang, Chengjie Sun, Bingquan Liu, and Zhenzhou Ji, “Knowledge-interactive network with sentiment polarity intensity-aware multi-task learning for emotion recognition in conversations,” in Proc. EMNLP, 2021, pp. 2879–2889.
- [20] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov, “Roberta: A robustly optimized bert pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.
- [21] Antoine Bosselut, Hannah Rashkin, Maarten Sap, Chaitanya Malaviya, Asli Celikyilmaz, and Yejin Choi, “COMET: commonsense transformers for automatic knowledge graph construction,” in Proc. ACL, 2019, pp. 4762–4779.
- [22] Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Ebrahim (Abe) Kazemzadeh, Emily Mower Provost, Samuel Kim, Jeannette N. Chang, Sungbok Lee, and Shrikanth S. Narayanan, “Iemocap: interactive emotional dyadic motion capture database,” Language Resources and Evaluation, vol. 42, pp. 335–359, 2008.
- [23] Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, and Rada Mihalcea, “MELD: A multimodal multi-party dataset for emotion recognition in conversations,” in Proc. ACL, 2019, pp. 527–536.
- [24] Sayyed M. Zahiri and Jinho D. Choi, “Emotion detection on TV show transcripts with sequence-based convolutional neural networks,” in Proc. AAAI, 2018, pp. 44–52.
- [25] Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu, “Dailydialog: A manually labelled multi-turn dialogue dataset,” in Proc. IJCNLP, 2017, pp. 986–995.
- [26] Weizhou Shen, Junqing Chen, Xiaojun Quan, and Zhixian Xie, “Dialogxl: All-in-one xlnet for multi-party conversation emotion recognition,” in Proc. AAAI, 2021, pp. 13789–13797.
- [27] Wei Li, Wei Shao, Shaoxiong Ji, and Erik Cambria, “Bieru: Bidirectional emotional recurrent unit for conversational sentiment analysis,” Neurocomputing, vol. 467, pp. 73–82, 2022.
- [28] Hui Ma, Jian Wang, Hongfei Lin, Xuejun Pan, Yijia Zhang, and Zhihao Yang, “A multi-view network for real-time emotion recognition in conversations,” Knowledge-Based Systems, vol. 236, pp. 107751, 2022.
- [29] Shimin Li, Hang Yan, and Xipeng Qiu, “Contrast and generation make bart a good dialogue emotion recognizer,” in Proc. AAAI, 2022, pp. 11002–11010.
- [30] Dou Hu, Xiaolong Hou, Lingwei Wei, Lianxin Jiang, and Yang Mo, “Mm-dfn: Multimodal dynamic fusion network for emotion recognition in conversations,” in Proc. ICASSP, 2022, pp. 7037–7041.