Adaptive speech emotion representation learning based on dynamic graph
Abstract
Graph representation learning has become a hot research topic due to its powerful nonlinear fitting capability in extracting representative node embeddings. However, for sequential data such as speech signals, most traditional methods merely focus on a static graph created within a sequence and largely overlook the intrinsic evolving patterns of these data. This may reduce the efficiency of graph representation learning for sequential data. For this reason, we propose an adaptive graph representation learning method based on dynamically evolved graphs, which are consecutively constructed on a series of subsequences segmented by a sliding window. In doing so, local and global context information within a long sequence is better captured. Moreover, we introduce a weighted approach to update the node representations rather than the conventional averaging one, where the weights are calculated by a novel matrix computation based on the degree of neighboring nodes. Finally, we construct a learnable graph convolutional layer that combines the graph structure loss and the classification loss to optimize the graph structure. To verify the effectiveness of the proposed method, we conduct speech emotion recognition experiments on the IEMOCAP and RAVDESS datasets. Experimental results show that the proposed method outperforms the latest graph-based and non-graph-based models.
Index Terms— Speech emotion recognition, Dynamic graph, Node similarity matrix, Adaptive graph representation learning
1 Introduction
Graph representation learning has demonstrated tremendous promise due to its powerful capability of mining graph structure information and data relationships [1]. Graph convolutional network (GCN), a concrete and popular implementation in graph representation learning, has been widely used since it fully considers the relationships between target nodes and neighboring nodes to learn efficient node representations. For example, Compact-GCN [2] constructs a lightweight GCN architecture, which can perform accurate graph convolution for speech emotion recognition (SER). To model dynamic data, L-GrIN [3] proposes a learnable graph structure, which is designed to adapt across modalities.
Despite the great progress made in graph representation learning for SER, existing methods primarily focus on a static graph constructed over an entire utterance. This may fail to capture subtle variations of emotion within a small region. Moreover, in most previous methods, the dominant paradigm for updating node representations is to average the information of neighboring nodes, without considering the importance of different neighbors. Although some efforts have been made to explore weighted averaging via attention mechanisms [4, 5], the calculation is time-consuming and computationally complex.
To address these shortcomings, this paper proposes an adaptive graph representation learning model based on dynamically evolved graphs. Specifically, we consecutively construct the graphs on a set of subsequences segmented by a sliding window, where each node of the graph corresponds to a frame (frame-to-node), and extract the feature vector of the frame as its node representation. Then, the node representation is updated by our proposed matrix calculation method. Finally, we construct a learnable graph convolutional layer to optimize the graph structure. The contributions of this paper are summarized as follows:
• We introduce a weighted method to update the node representations rather than the traditional approach of averaging the information of neighboring nodes, where the weights are calculated by the proposed matrix computation based on the degree of neighboring nodes.
• We construct a learnable graph convolutional layer that combines the graph structure loss and the classification loss to jointly optimize the graph structure during training.
• Experimental results show that our SER model outperforms the state-of-the-art (SOTA) graph-based networks and several widely used non-graph-based models on the IEMOCAP and RAVDESS datasets.
2 Related work
Graph representation learning methods can be divided into two kinds, dealing with static graphs and dynamic graphs, respectively. For the former, these methods focus on mining the connectivity of graphs. A well-known connectivity scheme is to connect first-order or second-order neighboring nodes [3], which provides structural information of a graph at different levels. Commonly used approaches include random walks (DeepWalk [6] and Node2Vec [7]), graph convolution (GCNs [8]), sampling (GraphSAGE [9] and GraphSAINT [10]), non-negative matrix factorization (M-NMF [11]), and attention mechanisms (GAT [12]). However, these methods largely ignore the evolution of graph structures and the temporal properties of graphs.
Recently, several dynamic representation learning approaches have been proposed. Specifically, to dig deeper into the local structure of the graph, StudentLSP [13] utilizes a knowledge distillation method to learn the node representations. DySAT [14] learns node representations with the help of a self-attention mechanism by modeling both neighboring nodes and temporal attributes. FADGC [5] captures the temporal properties of dynamic graphs by using a fine-grained attention mechanism to focus on node changes. In addition, EvolveGCN [15] learns dynamic representations of a set of nodes by evolving the GCN parameters through a long short-term memory network. However, like most GCN-based methods, the node aggregation of EvolveGCN is realized by averaging the information of neighboring nodes, which may not effectively account for the importance of different neighboring nodes. To this end, we propose a new matrix computation method to update node representations based on the degree of neighboring nodes.
3 Proposed Approach
This section elaborates on our proposed architecture, which consists of three parts: the graph construction, the computation of relations between nodes, and the learnable graph convolutional layer. The overall framework of our model is shown in Fig. 1.

3.1 Graph construction
Let $\mathcal{G} = \{G^1, G^2, \ldots, G^T\}$ denote a dynamic graph, different from the traditional methods that only construct a static graph for an entire sequence. $G^t$ is an observed graph specific to audio segment $t$, and the structure of $G^t$ varies with different audio segments. Let $G^t = (V^t, E^t, X^t)$ denote the structure of $G^t$. More specifically, $V^t = \{v_1^t, \ldots, v_N^t\}$ indicates that each audio segment has $N$ nodes, $E^t$ is the edge set at audio segment $t$, and $X^t \in \mathbb{R}^{N \times d}$ is the feature matrix for all nodes at audio segment $t$. Given a dynamic graph $\mathcal{G}$, the goal of graph representation learning is to learn a function $f$ for each $G^t$ in $\mathcal{G}$. Specifically, based on the function $f$, the given $G^t$ can be mapped to low-dimensional representations, e.g., $Z^t = f(G^t)$, where $Z^t \in \mathbb{R}^{N \times d'}$ with $d' \ll d$.
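To make the segmentation concrete, the sketch below splits a frame-level feature sequence into overlapping subsequences, each becoming the node-feature matrix of one graph $G^t$. The function name and the window and stride values are illustrative assumptions, not settings fixed by the paper.

```python
import numpy as np

def segment_sequence(features, win_len, step):
    """Split a (T, d) frame-feature matrix into overlapping subsequences.

    Each subsequence of `win_len` frames becomes the node-feature matrix
    X^t of one graph G^t in the dynamic graph {G^1, ..., G^T}.
    Hypothetical helper: window/stride values are illustrative.
    """
    segments = []
    for start in range(0, features.shape[0] - win_len + 1, step):
        segments.append(features[start:start + win_len])  # (win_len, d)
    return segments

# Toy example: 120 frames of 40-dim features, 30-frame windows, stride 15.
frames = np.random.randn(120, 40).astype(np.float32)
graphs = segment_sequence(frames, win_len=30, step=15)
print(len(graphs), graphs[0].shape)  # -> 7 (30, 40)
```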
Our graph construction follows the frame-to-node transformation, as shown in Fig. 2.

A frame represents 25 ms of audio. To encode the temporal information, neighboring nodes (frames) are connected. Meanwhile, to aggregate the information of distant nodes, we also randomly connect nodes. $A_{ij}$ represents the weight of the edge between nodes $v_i$ and $v_j$. Note that the graph structure is not naturally defined here, i.e., the elements in $A$ are unknown. Common methods to determine $A$ include the cosine similarity function [1], manual definition [2], and distance functions [16]. However, these may lead to a sub-optimal graph [17]. Therefore, we propose a new matrix method to initialize $A$ and optimize the graph structure during training by combining the graph structure loss and the classification loss. This loss function is presented in Subsection 3.3.
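The edge pattern can be sketched as follows, under stated assumptions: consecutive frames are linked to encode temporal order, and a few random long-range edges are added; the number of random edges (`num_random`) is a hypothetical parameter that the paper does not specify.

```python
import numpy as np

def build_edges(num_nodes, num_random=10, seed=0):
    """Binary connectivity: temporal neighbors plus random distant links.

    Only the *pattern* of edges is built here; the edge weights A_ij are
    left to the matrix computation of Eq. (4) / the learnable adjacency.
    """
    rng = np.random.default_rng(seed)
    E = np.zeros((num_nodes, num_nodes), dtype=np.float32)
    for i in range(num_nodes - 1):      # connect neighboring frames
        E[i, i + 1] = E[i + 1, i] = 1.0
    for _ in range(num_random):         # random long-range connections
        i, j = rng.integers(0, num_nodes, size=2)
        if i != j:
            E[i, j] = E[j, i] = 1.0
    return E
```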
3.2 A novel matrix computation
Calculating the similarity between each node and its neighboring nodes is crucial in graph-based analysis. Dice similarity is a well-known node similarity measure [18]. Since Dice similarity is computed directly from the network topology, it is relatively interpretable and saves computing resources. Given two nodes $v_i$ and $v_j$, the Dice similarity score is calculated as follows:

$$D(v_i, v_j) = \frac{2\,|\Gamma(v_i) \cap \Gamma(v_j)|}{d_i + d_j} \qquad (1)$$

where $|\Gamma(v_i) \cap \Gamma(v_j)|$ indicates the number of common neighbors, and $d_i$ and $d_j$ denote the numbers of neighbors of $v_i$ and $v_j$ (not containing the node itself), respectively. As can be seen from Eq. (1), when calculating the similarity between nodes, a larger degree of the neighboring node has no positive effect on the final result.
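For concreteness, a minimal NumPy sketch of Eq. (1) over a binary adjacency matrix might look as follows; the function name is ours.

```python
import numpy as np

def dice_similarity(A):
    """Classic Dice score between all node pairs of a binary adjacency A.

    D_ij = 2 * |common neighbors of i and j| / (d_i + d_j), where the
    degrees exclude the node itself (A has no self-loops).
    """
    common = A @ A.T                    # (A A^T)_ij = # common neighbors
    deg = A.sum(axis=1)                 # d_i = |Gamma(v_i)|
    denom = deg[:, None] + deg[None, :]
    return np.where(denom > 0, 2.0 * common / np.maximum(denom, 1e-12), 0.0)
```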
However, nodes with higher degrees are more likely to have more influence and higher weight values [19]. For example, in the social network of NBA players, players with more followers may bring more attention to their teams than players with similar physical conditions and skills. Take Fig. 3 as an example: node $v_j$ and node $v_k$ share the same number of common neighbors with node $v_i$, but $v_k$ has a larger degree than $v_j$. Under Eq. (1), the influence of the smaller-degree node $v_j$ is larger than that of the larger-degree node $v_k$, which is not expected in the real world. This is because graph-based methods usually update node representations by transferring and aggregating information from neighboring nodes; thus, higher-degree nodes may aggregate more information.
Therefore, we propose a new Dice matrix calculation method to address the problem that the degree of a neighboring node has no positive effect on the target node. The proposed method is defined as follows:
$$C_{ij} = |\Gamma(v_i) \cap \Gamma(v_j)| \qquad (2)$$

$$\hat{d}_j = |\Gamma(v_j)| + 1 \qquad (3)$$

$$\hat{D}_{ij} = \frac{2\,C_{ij}}{\hat{d}_i + \hat{d}_j}\,\hat{d}_j \qquad (4)$$

where $C_{ij}$ represents the number of neighbors shared by node $v_i$ and node $v_j$, and $\hat{d}_j$ indicates the degree of the neighboring node $v_j$ (the node itself is contained). According to Eq. (4), as expected, the influence of the larger-degree node $v_k$ on $v_i$ in Fig. 3 now exceeds the influence of the smaller-degree node $v_j$ on $v_i$. That is to say, nodes with relatively large degrees have higher node weight values.
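The computation can be sketched as below, assuming the reconstruction of Eqs. (2)-(4) given above; the exact degree-scaling form is our assumption based on the surrounding text, not a confirmed formula.

```python
import numpy as np

def weighted_dice(A):
    """Degree-aware Dice weights, a sketch of the reconstructed Eqs. (2)-(4).

    Scaling the Dice score by the neighbor's self-inclusive degree d_hat_j
    lets higher-degree neighbors contribute larger weights, which the
    classic Dice similarity cannot do. Note the result is asymmetric:
    row i, column j holds the influence of neighbor v_j on v_i.
    """
    common = A @ A.T                        # Eq. (2): common-neighbor counts
    d_hat = A.sum(axis=1) + 1.0             # Eq. (3): degree incl. the node
    denom = d_hat[:, None] + d_hat[None, :]
    return 2.0 * common * d_hat[None, :] / denom   # Eq. (4)
```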
3.3 Learnable graph convolutional network
A traditional GCN layer usually takes the node feature matrix $X$ and the graph adjacency matrix $A$ as inputs to generate the node-level representation matrix $H$. The GCN layer can be described as:

$$H^{(l+1)} = \sigma\!\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right) \qquad (5)$$

where $\sigma(\cdot)$ is an activation function that implements nonlinearity, $\tilde{A} = A + I$, $\tilde{D}$ is the degree matrix of $\tilde{A}$, and $I$ is the identity matrix. $H^{(0)} = X$, and $W^{(l)}$ is the weight matrix of the $l$-th layer.
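A minimal PyTorch sketch of the standard layer in Eq. (5):

```python
import torch

def gcn_layer(X, A, W):
    """One vanilla GCN layer: sigma(D~^{-1/2} A~ D~^{-1/2} X W)."""
    A_tilde = A + torch.eye(A.size(0))        # add self-loops: A~ = A + I
    d = A_tilde.sum(dim=1)                    # degrees of A~
    D_inv_sqrt = torch.diag(d.pow(-0.5))      # D~^{-1/2}
    return torch.relu(D_inv_sqrt @ A_tilde @ D_inv_sqrt @ X @ W)
```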
We present a new graph convolution structure for SER. It consists of the following two novel components:
Segment-specific graph convolution. For each audio segment $t$, we generate a node representation matrix $Z^t$. The graph convolution layer is represented as:

$$Z^t = \sigma\!\left((\hat{D}^t)^{-\frac{1}{2}} \hat{A}^t (\hat{D}^t)^{-\frac{1}{2}} X^t W^t\right), \quad \hat{A}^t = \tilde{A}^t + \lambda S^t \qquad (6)$$

where $\hat{D}^t$ is the degree matrix of $\hat{A}^t$ and $\tilde{A}^t = A^t + I$. $S^t$ is the novel matrix calculated by the proposed Eq. (4), and $I$ is an identity matrix. $X^t$ is the feature matrix and $W^t$ is a weight matrix obtained by random initialization. Different from the traditional GCN, which employs only $\tilde{A}^t$ to guide the aggregation of node representations, we additionally add $S^t$ to further guide the aggregation and use the parameter $\lambda$ to control its contribution. Under the guidance of this new aggregation strategy, higher-quality node representations can be obtained.
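A sketch of the segment-specific layer, assuming the reconstructed form of Eq. (6) above; $\lambda$ is exposed as `lam`, and the precise way $S^t$ enters the propagation rule is our assumption.

```python
import torch

def segment_gcn_layer(X_t, A_t, S_t, W_t, lam=0.5):
    """Segment-specific convolution: the (learnable) adjacency A^t is
    augmented with the degree-aware Dice matrix S^t from Eq. (4), with
    `lam` controlling the contribution of S^t."""
    A_hat = A_t + torch.eye(A_t.size(0)) + lam * S_t
    d = A_hat.sum(dim=1).clamp_min(1e-12)     # guard against zero degrees
    D_inv_sqrt = torch.diag(d.pow(-0.5))
    return torch.relu(D_inv_sqrt @ A_hat @ D_inv_sqrt @ X_t @ W_t)
```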
Learnable adjacency matrix ($\hat{A}$). We combine the graph classification loss ($\mathcal{L}_{cls}$) and the graph structure loss ($\mathcal{L}_{str}$) to jointly optimize the graph structure during training. The $\mathcal{L}_{cls}$ is defined using the cross-entropy loss:

$$\mathcal{L}_{cls} = -\sum_{i} y_i \log \hat{y}_i \qquad (7)$$

where $\hat{y}_i$ represents the prediction label for the $i$-th sample. The $\mathcal{L}_{str}$ is implemented as follows:

$$\mathcal{L}_{str} = \alpha\,\|\hat{A} - S\|_F^2 - \beta\,\mathbf{1}^{\top} \log(\hat{A}\mathbf{1}) \qquad (8)$$

where $\alpha$ and $\beta$ control the proportions of these items, respectively, $\mathbf{1}$ is an all-ones vector, $\|\cdot\|_F$ indicates the Frobenius norm, and $S$ refers to the matrix of Eq. (4). The overall loss function is as follows:

$$\mathcal{L}(\Theta) = \mathcal{L}_{cls} + \mathcal{L}_{str} \qquad (9)$$

where $\Theta$ represents all learnable parameters of all graph convolution layers. Each term in the overall loss function is differentiable, thus allowing end-to-end optimization.
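A sketch of the joint objective under the reconstructed Eqs. (7)-(9); the two structure-loss terms are assumptions consistent with the description above (a Frobenius term tying $\hat{A}$ to $S$, plus a log-barrier keeping nodes connected).

```python
import torch
import torch.nn.functional as F

def total_loss(logits, labels, A_hat, S, alpha=1.0, beta=1.0):
    """Joint objective: cross-entropy classification loss (Eq. (7)) plus a
    graph structure loss (Eq. (8)) on the learnable adjacency A_hat."""
    l_cls = F.cross_entropy(logits, labels)               # Eq. (7)
    degrees = A_hat.abs().sum(dim=1).clamp_min(1e-12)     # A_hat @ 1
    l_str = alpha * (A_hat - S).pow(2).sum() \
            - beta * torch.log(degrees).sum()             # Eq. (8)
    return l_cls + l_str                                  # Eq. (9)
```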
4 Experiments
Table 1(a): Results on IEMOCAP.

| Models | Acc (%) | F1 (%) |
| --- | --- | --- |
| Graph-based | | |
| GCN (2017) [8] | 56.14 | - |
| PATCHY-Diff (2018) [20] | 63.23 | - |
| HSGCF (2023) [21] | 65.13 | 65.18 |
| DialogueGCN (2019) [22] | 65.25 | 64.18 |
| L-GrIN (2022) [3] | 65.50 | - |
| Non-graph-based | | |
| SpecMAE-12 (2023) [23] | 46.70 | 45.90 |
| CNN-LSTM (2019) [24] | 50.17 | - |
| DialogueRNN (2019) [25] | 63.40 | 62.75 |
| DualTransformer (2023) [26] | 64.80 | 64.90 |
| Adjacency matrix | | |
| Ours (Binary) | 53.46 | 53.02 |
| Ours (Weighted) | 58.69 | 58.41 |
| Ours (Learnable) | 63.58 | 63.04 |
| Ours (Matrix) | 65.64 | 65.28 |
| Ours | 66.94 | 66.54 |
Table 1(b): Results on RAVDESS.

| Models | Acc (%) | F1 (%) |
| --- | --- | --- |
| Graph-based | | |
| Synch-Graph (2020) [27] | 42.60 | - |
| Mansouri-Benssassi et al. (2019) [28] | 45.10 | - |
| GCN (2023) | 51.67 | 50.69 |
| TSP-INCA (2021) [29] | 53.00 | 57.00 |
| Franceschini et al. (2022) [30] | 58.50 | 57.00 |
| Non-graph-based | | |
| SpecMAE-12 (2023) [23] | 52.20 | 52.00 |
| CNN-LSTM (2019) [24] | 53.08 | - |
| VGG-Transformer (2023) [31] | 61.60 | - |
| GResNet (2019) [32] | 64.48 | 63.11 |
| Adjacency matrix | | |
| Ours (Binary) | 51.67 | 50.69 |
| Ours (Weighted) | 51.46 | 50.59 |
| Ours (Learnable) | 64.10 | 63.72 |
| Ours (Matrix) | 65.69 | 65.34 |
| Ours | 67.50 | 67.05 |
4.1 Database
The IEMOCAP database contains 12 hours of data. To be consistent with previous studies, only four emotions are used. We utilize the OpenSMILE toolkit to extract features from the audio. For each sample, we use a sliding window of length 25 ms (with a step of 10 ms) to locally extract low-level descriptors (LLDs), such as signal intensity, power spectral density, and fundamental frequency. Each speech sample contains 120 nodes, where each node represents an (overlapping) 25 ms audio frame. The RAVDESS database includes 1,500 speech samples performed by 24 professional actors (12 female and 12 male). Each actor simulated eight different emotional states. We use the Fourier transform to convert the speech signal into frequency-domain representations. Then, we extract frequency-domain features, such as Mel-frequency cepstral coefficients (MFCCs), where the sampling rate is 22,050 Hz and the number of MFCCs is 40. Each audio sample contains 40 nodes, where each node corresponds to a 25 ms audio frame. We extract MFCC features from the RAVDESS dataset because they provide representative spectral information, and appropriately increasing the number of MFCCs helps capture changes of the sound signal at different frequencies; these detailed features may play a positive role in emotional sensitivity.
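A sketch of the RAVDESS feature extraction with librosa: the sampling rate and the 40 MFCCs follow the text, while the 25 ms window and 10 ms hop are assumptions carried over from the IEMOCAP description, and the selection of exactly 40 frames (nodes) per sample is not shown.

```python
import librosa

def ravdess_node_features(wav_path, n_mfcc=40, sr=22050):
    """Frame-level MFCCs as node features (one row per node/frame)."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=n_mfcc,
        n_fft=int(0.025 * sr),        # 25 ms analysis window (assumed)
        hop_length=int(0.010 * sr))   # 10 ms hop (assumed)
    return mfcc.T                     # shape: (num_frames, 40)
```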
4.2 Implementation Details
We trained the model for up to 1000 epochs, with 150 iterations per epoch, and used an early stopping strategy. The batch size was set to 64. Meanwhile, the RAdam optimizer with a learning rate of 0.001 was employed, and the learning rate was decayed by a factor of 0.5 after every 150 epochs. We conducted experiments on a GeForce RTX 3090 Ti GPU (NVIDIA-SMI 460.39, CUDA version 11.2), and obtained the final results by averaging over 10-fold cross-validation.
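The optimizer and schedule can be sketched as follows; `model` is a placeholder for the actual graph network, and stepping the scheduler once per epoch reproduces the decay of 0.5 after every 150 epochs.

```python
import torch

model = torch.nn.Linear(40, 4)   # placeholder for the real graph model
optimizer = torch.optim.RAdam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=150, gamma=0.5)

for epoch in range(1000):        # up to 1000 epochs, early stopping omitted
    # ... 150 training iterations (batch size 64) per epoch ...
    scheduler.step()
```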
4.3 Results and Analysis
Comparison with SOTA methods. Table 1 presents all results on the two datasets. On IEMOCAP, compared with graph-based SOTA approaches, our model is only slightly better (1.44% absolute accuracy) than L-GrIN [3]; however, the adjacency matrix of L-GrIN is obtained by a distance function, which tends to yield locally optimal graphs during training. When compared with SOTA non-graph methods, our method outperforms popular Transformer-based models. This is mainly attributed to the fact that we connect neighboring nodes while randomly connecting distant nodes for information transfer, and the relationships between nodes are calculated by the proposed matrix computation method to guide the aggregation of node information, thus capturing long-distance information. Table 1(b) also shows that our model performs better than the graph-based and non-graph-based approaches on the RAVDESS database. For example, compared with the classic GCN model, our model has a clear performance advantage (15.83% absolute accuracy). This indicates that the proposed matrix computation method can alleviate the sub-optimal graphs caused in traditional GCNs by averaging the information of neighboring nodes.
Ablation experiment. We perform ablation studies on both datasets with the following variations of our model. Binary: the structure of the graph is defined manually, and the matrix $A$ only contains 0 and 1; Weighted: the relationships between nodes are determined based on distance; Learnable: the structure of the graph is learned during training; Matrix: the graph structure is obtained solely from the proposed matrix computation method (Eq. (4)). The results are displayed in the "Adjacency matrix" rows of Table 1. We have the following observations.
Firstly, both Binary and Weighted determine the relationships between nodes in a hand-crafted form, which makes them prone to forming locally optimal graphs during training, so their performance is the worst. Secondly, the proposed matrix calculation method (Matrix) outperforms the learnable variant (Learnable), which shows that our method can effectively distinguish the importance of different neighboring nodes. Finally, we optimize the structure of the graph by jointly using the graph structure loss and the classification loss, achieving the best performance. This indicates that the proposed matrix computation method can update the node representations and that the structure of the graph can be continuously optimized during training.
5 Conclusion
We propose an adaptive graph representation learning method based on dynamically evolved graphs rather than the traditional methods that only construct a static graph for an entire sequence. Our graphs are consecutively constructed on a series of subsequences segmented by a sliding window. To compute the edge weights, we propose a new matrix calculation method that updates the node representations based on the degree of neighboring nodes. Moreover, we combine the graph structure loss and the classification loss to jointly optimize the graph structure during training. In the future, we will combine structural similarity with feature similarity to jointly measure node similarity, and extend our work to multimodal data with graph structures.
References
- [1] Dou Hu, Xiaolong Hou, Lingwei Wei, Lian-Xin Jiang, and Yang Mo, “MM-DFN: multimodal dynamic fusion network for emotion recognition in conversations,” in ICASSP, 2022, pp. 7037–7041.
- [2] Amir Shirian and Tanaya Guha, “Compact graph architecture for speech emotion recognition,” in ICASSP, 2021, pp. 6284–6288.
- [3] Amir Shirian, Subarna Tripathi, and Tanaya Guha, “Dynamic emotion modeling with learnable graphs and graph inception network,” IEEE Trans. Multim., vol. 24, pp. 780–790, 2022.
- [4] Huan Zhao, Zhengwei Li, Zhu-Hong You, Ru Nie, and Tangbo Zhong, “Predicting mirna-disease associations based on neighbor selection graph attention networks,” IEEE ACM Trans. Comput. Biol. Bioinform., vol. 20, no. 2, pp. 1298–1307, 2023.
- [5] Bo Wu, Xun Liang, Xiangping Zheng, Yuhui Guo, and Hui Tang, “Improving dynamic graph convolutional network with fine-grained attention mechanism,” in ICASSP, 2022, pp. 3938–3942.
- [6] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena, “Deepwalk: online learning of social representations,” in KDD, 2014, pp. 701–710.
- [7] Aditya Grover and Jure Leskovec, “node2vec: Scalable feature learning for networks,” in KDD, 2016, pp. 855–864.
- [8] Thomas N. Kipf and Max Welling, “Semi-supervised classification with graph convolutional networks,” in ICLR, 2017.
- [9] William L. Hamilton, Zhitao Ying, and Jure Leskovec, “Inductive representation learning on large graphs,” in NeurIPS, 2017, pp. 1024–1034.
- [10] Hanqing Zeng, Hongkuan Zhou, Ajitesh Srivastava, Rajgopal Kannan, and Viktor K. Prasanna, “Graphsaint: Graph sampling based inductive learning method,” in ICLR, 2020.
- [11] Xiao Wang, Peng Cui, Jing Wang, Jian Pei, Wenwu Zhu, and Shiqiang Yang, “Community preserving network embedding,” in AAAI, 2017, pp. 203–209.
- [12] Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio, “Graph attention networks,” in ICLR, 2018.
- [13] Yiding Yang, Jiayan Qiu, Mingli Song, Dacheng Tao, and Xinchao Wang, “Distilling knowledge from graph convolutional networks,” in CVPR, 2020, pp. 7072–7081.
- [14] Aravind Sankar, Yanhong Wu, Liang Gou, Wei Zhang, and Hao Yang, “Dysat: Deep neural representation learning on dynamic graphs via self-attention networks,” in WSDM, 2020, pp. 519–527.
- [15] Aldo Pareja, Giacomo Domeniconi, Jie Chen, Tengfei Ma, Toyotaro Suzumura, Hiroki Kanezashi, Tim Kaler, Tao B. Schardl, and Charles E. Leiserson, “Evolvegcn: Evolving graph convolutional networks for dynamic graphs,” in AAAI, 2020, pp. 5363–5370.
- [16] Yonghua Zhu, Junbo Ma, Changan Yuan, and Xiaofeng Zhu, “Interpretable learning based dynamic graph convolutional networks for alzheimer’s disease analysis,” Inf. Fusion, vol. 77, pp. 53–61, 2022.
- [17] Amir Shirian, Mona Ahmadian, Krishna Somandepalli, and Tanaya Guha, “Heterogeneous graph learning for acoustic event classification,” in ICASSP, 2023, pp. 1–5.
- [18] Yu Xie, Maoguo Gong, Shanfeng Wang, and Bin Yu, “Community discovery in networks with deep sparse filtering,” Pattern Recognit., vol. 81, pp. 50–59, 2018.
- [19] Zemin Liu, Trung-Kien Nguyen, and Yuan Fang, “On generalized degree fairness in graph neural networks,” in AAAI, 2023, pp. 4525–4533.
- [20] Zhitao Ying, Jiaxuan You, Christopher Morris, Xiang Ren, William L. Hamilton, and Jure Leskovec, “Hierarchical graph representation learning with differentiable pooling,” in NeurIPS, 2018, pp. 4805–4815.
- [21] Binqiang Wang, Gang Dong, Yaqian Zhao, Rengang Li, Qichun Cao, Kekun Hu, and Dongdong Jiang, “Hierarchically stacked graph convolution for emotion recognition in conversation,” Knowl. Based Syst., vol. 263, pp. 110285, 2023.
- [22] Deepanway Ghosal, Navonil Majumder, Soujanya Poria, Niyati Chhaya, and Alexander F. Gelbukh, “Dialoguegcn: A graph convolutional neural network for emotion recognition in conversation,” in EMNLP, 2019, pp. 154–164.
- [23] Samir Sadok, Simon Leglaive, and Renaud Séguier, “A vector quantized masked autoencoder for speech emotion recognition,” in ICASSP, 2023, pp. 1–5.
- [24] Jack Parry, Dimitri Palaz, Georgia Clarke, Pauline Lecomte, Rebecca Mead, Michael Berger, and Gregor Hofer, “Analysis of deep learning architectures for cross-corpus speech emotion recognition,” in INTERSPEECH, 2019, pp. 1656–1660.
- [25] Navonil Majumder, Soujanya Poria, Devamanyu Hazarika, Rada Mihalcea, Alexander F. Gelbukh, and Erik Cambria, “Dialoguernn: An attentive RNN for emotion detection in conversations,” in AAAI, 2019, pp. 6818–6825.
- [26] Zheng Liu, Xin Kang, and Fuji Ren, “Dual-tbnet: Improving the robustness of speech features via dual-transformer-bilstm for speech emotion recognition,” IEEE ACM Trans. Audio Speech Lang. Process., vol. 31, pp. 2193–2203, 2023.
- [27] Esma Mansouri-Benssassi and Juan Ye, “Synch-graph: Multisensory emotion recognition through neural synchrony via graph convolutional networks,” in AAAI, 2020, pp. 1351–1358.
- [28] Esma Mansouri-Benssassi and Juan Ye, “Speech emotion recognition with early visual cross-modal enhancement using spiking neural networks,” in IJCNN, 2019, pp. 1–8.
- [29] Türker Tuncer, Sengül Dogan, and U. Rajendra Acharya, “Automated accurate speech emotion recognition system using twine shuffle pattern and iterative neighborhood component analysis techniques,” Knowl. Based Syst., vol. 211, pp. 106547, 2021.
- [30] Riccardo Franceschini, Enrico Fini, Cigdem Beyan, Alessandro Conti, Federica Arrigoni, and Elisa Ricci, “Multimodal emotion recognition with modality-pairwise unsupervised contrastive loss,” in ICPR, 2022, pp. 2589–2596.
- [31] Esam Ghaleb, Jan Niehues, and Stylianos Asteriadis, “Joint modelling of audio-visual cues using attention mechanisms for emotion recognition,” Multim. Tools Appl., vol. 82, no. 8, pp. 11239–11264, 2023.
- [32] Yuni Zeng, Hua Mao, Dezhong Peng, and Zhang Yi, “Spectrogram based multi-task audio classification,” Multim. Tools Appl., vol. 78, no. 3, pp. 3705–3722, 2019.