A Graph Isomorphism Network with Weighted Multiple Aggregators for Speech Emotion Recognition
Abstract
Speech emotion recognition (SER) is an essential part of human-computer interaction. In this paper, we propose an SER network based on a Graph Isomorphism Network with Weighted Multiple Aggregators (WMA-GIN), which can effectively handle the problem of information confusion when neighbour nodes' features are aggregated together in the GIN structure. Moreover, a Fully-Adjacent (FA) layer is adopted to alleviate the over-squashing problem, which exists in all Graph Neural Network (GNN) structures, including GIN. Furthermore, a multi-phase attention mechanism and a multi-stage loss training strategy are employed to avoid missing useful emotional information in the stacked WMA-GIN layers. We evaluated the performance of the proposed WMA-GIN on the popular IEMOCAP dataset. The experimental results show that WMA-GIN outperforms other GNN-based methods and is comparable to some advanced non-graph-based methods, achieving 72.48% weighted accuracy (WA) and 67.72% unweighted accuracy (UA).
Index Terms: Speech Emotion Recognition, Weighted Multiple Aggregators, Graph Isomorphism Network, Fully-Adjacent layer
1 Introduction
Speech Emotion Recognition (SER) is a well-studied task in the domain of affective computing and is essential for computers to understand human emotional states [1, 2]. SER has attracted much research attention and has been applied in many fields, such as automated customer service systems, smart voice assistants, and human psychotherapy [3, 4]. However, SER is an extraordinarily complicated task, even for human beings, because speech emotion is elusive.
Graph Neural Networks (GNNs) have been an active research field over the last decade and have brought significant advances in graph representation learning [5, 6, 7]. GNNs broadly follow a recursive neighborhood aggregation (or message passing) scheme, in which each node aggregates the feature vectors of its neighbors to compute its new feature vector [8, 9]. In recent years, GNN-based methods have been applied in audio processing, including few-shot audio classification [10], anti-spoofing [11], and so on. Su et al. [12] proposed a framework that imposes a graph attention mechanism on a gated recurrent unit network (GA-GRU) to improve utterance-based SER. Shirian et al. [13] constructed a compact Graph Convolutional Network architecture that, for the first time, utilized only a GNN for the SER task. Liu et al. [14] proposed a novel LSTM-GIN model, which applies a Graph Isomorphism Network (GIN) [15] to LSTM outputs for global emotion modeling in non-Euclidean space. These GNN-based SER methods all model the time-sequence features of speech signals as graphs and treat SER as a graph classification task.
There are many GNN variants with different neighborhood aggregation schemes, such as GCN [16], GAT [17], and PATCHY-Diff [18]. In this paper, we choose GIN as the basic GNN structure because it possesses greater discriminative power than other GNNs [15]. Nevertheless, the GNNs in the abovementioned papers all adopt a single aggregator per GNN layer, which does not extract enough information from the nodes' neighbourhoods and thus limits their expressive power and learning ability. Corso et al. [19] have mathematically proved the necessity of adopting multiple aggregators. We adopt weighted multiple aggregations in the GIN structure and choose three kinds of aggregators: sum, mean, and softmax. We refer to our proposed GNN-based SER structure as the Weighted Multiple Aggregators GIN (WMA-GIN).
Many research fields utilizing GNNs, including SER, require interaction between nodes that are not directly connected, which is achieved by stacking multiple GNN layers for long-range sequence problems. However, as the number of layers increases, the number of nodes in each node's receptive field grows exponentially. This phenomenon is named over-squashing by Alon and Yahav [20]. To alleviate this problem in our network, we transform the neighbourhood relationship to fully adjacent (FA) at one layer of the WMA-GIN. Furthermore, we conduct ablation experiments to determine at which layer of the WMA-GIN the FA should be introduced for the SER task.
A Bi-GRU layer extracts global context sequence information before the stacked WMA-GIN layers. Moreover, multi-phase attention (MPA) is adopted to exploit underutilized emotional information, and a multi-stage loss training strategy is employed to handle the output information from the different stages of our network. The experimental results show the effectiveness of our proposed method.
The contributions of this paper are summarized as follows:
(1) We propose a Weighted Multiple Aggregators GIN (WMA-GIN) for the SER task, and the experimental results show its effectiveness in obtaining more accurate information from neighbour nodes.
(2) We introduce a Fully-Adjacent (FA) layer into the WMA-GIN to relieve the over-squashing problem, and we explore which layer of the WMA-GIN is most suitable for adopting the FA in the SER task.
(3) Our proposed WMA-GIN achieves 72.48% WA and 67.72% UA on the IEMOCAP dataset, outperforming other state-of-the-art GNN-based SER methods.
2 Proposed Method
In this section, we introduce the proposed WMA-GIN network architecture, the graph construction from audio utterances, the GIN, and the weighted multiple aggregators. Finally, we discuss the Fully-Adjacent layer and the multi-stage loss training strategy.
2.1 WMA-GIN Architecture
The architecture of WMA-GIN is illustrated in Fig. 1. A Bi-GRU layer is used to extract the global context information of the features over long-range time series; the hidden dimension of each direction is 128. The output of the Bi-GRU is then fed into stacked WMA-GIN layers to extract higher-resolution features and learn information with richer emotional characteristics. The number of WMA-GIN layers is set to 4, with residual connections between every two adjacent WMA-GIN layers. Moreover, multi-phase attention (MPA) is employed to extract emotional features that might otherwise be omitted from different phases of the network. Similar to self-attention [21], three linear projections are applied to the input features, the output of the Bi-GRU, and the output of the last WMA-GIN layer, respectively, transforming them into the query $Q$, key $K$, and value $V$. The output dimension of each linear projection is equal to the hidden size of the WMA-GIN layer.
Finally, the outputs of the Bi-GRU layer, WMA-GIN1, WMA-GIN2, WMA-GIN3, and MPA are fed into corresponding linear layers to produce the predicted results $\hat{y}_1, \hat{y}_2, \hat{y}_3, \hat{y}_4, \hat{y}_5$, respectively. As shown in Fig. 1, all predicted results are summed as $\hat{y} = \sum_{i=1}^{5} \hat{y}_i$. The predicted emotion label is then obtained via a softmax layer.
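For concreteness, the following PyTorch sketch shows one plausible realization of this forward pass. The class and variable names are ours, `WMAGINLayer` is sketched in Sec. 2.2, and both the $Q$/$K$/$V$ assignment in the MPA and the mean-over-nodes readout before each prediction head are our own assumptions; Fig. 1 remains the authoritative description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WMAGINNet(nn.Module):
    """Sketch: Bi-GRU -> 4 residual WMA-GIN layers -> MPA; one linear
    prediction head per phase, with all head logits summed (cf. Fig. 1)."""

    def __init__(self, in_dim=78, hidden=256, n_layers=4, n_classes=4):
        super().__init__()
        # Bi-GRU with 128 hidden units per direction -> 256-dim output.
        self.bigru = nn.GRU(in_dim, 128, bidirectional=True, batch_first=True)
        # WMAGINLayer is sketched in Sec. 2.2.
        self.gins = nn.ModuleList(WMAGINLayer(hidden) for _ in range(n_layers))
        # MPA projections; which stream becomes Q, K, or V is our assumption.
        self.q = nn.Linear(in_dim, hidden)    # from the raw input features
        self.k = nn.Linear(hidden, hidden)    # from the Bi-GRU output
        self.v = nn.Linear(hidden, hidden)    # from the last WMA-GIN output
        self.heads = nn.ModuleList(nn.Linear(hidden, n_classes) for _ in range(5))

    def forward(self, x, adj):                # x: (B, 120, 78); adj: (120, 120)
        g, _ = self.bigru(x)                  # (B, 120, 256)
        outs, h = [], g
        for gin in self.gins:
            h = gin(h, adj) + h               # residual connection
            outs.append(h)
        # Multi-phase attention, computed as scaled dot-product attention.
        q, k, v = self.q(x), self.k(g), self.v(h)
        att = F.softmax(q @ k.transpose(1, 2) / q.size(-1) ** 0.5, dim=-1) @ v
        # Heads on the Bi-GRU, WMA-GIN1-3, and MPA outputs, after a
        # mean-over-nodes readout (the readout choice is also an assumption).
        phases = [g, outs[0], outs[1], outs[2], att]
        logits = sum(head(p.mean(dim=1)) for head, p in zip(self.heads, phases))
        return logits                         # emotion label via softmax outside
```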
Graph Construction
Following the prior works [13, 14], we adopt a frame-to-node transformation for our graph construction. Each node feature is obtained from the corresponding audio frame and has feature dimension $d$. A graph is defined as $G = (V, E)$, where $V$ is the set of nodes and $E$ is the set of edges. The feature matrix of each graph is thus denoted as $X \in \mathbb{R}^{N \times d}$, where $N = |V|$. The adjacency matrix of $G$ is denoted by $A \in \mathbb{R}^{N \times N}$, where an element $a_{ij}$ denotes the weight of the edge connecting nodes $i$ and $j$. Here, each node has two neighboring nodes, corresponding to the previous and following frames. The node neighborhood relationship is defined as a cycle construction, in which the node of the first time frame is connected with that of the last time frame.
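As a quick illustration, the cycle adjacency described above can be built directly; the unit edge weights below are an assumption, since the text only states that edge weights exist.

```python
import numpy as np

def cycle_adjacency(num_nodes=120):
    """Adjacency of the cycle graph used here: each frame-node is connected
    to the previous and next frame, and the first and last frames are
    connected to close the cycle."""
    A = np.zeros((num_nodes, num_nodes), dtype=np.float32)
    idx = np.arange(num_nodes)
    A[idx, (idx + 1) % num_nodes] = 1.0   # edge to the following frame
    A[idx, (idx - 1) % num_nodes] = 1.0   # edge to the previous frame
    return A
```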
Graph Isomorphism Network (GIN)
The basic GIN is a GNN architecture proposed by Xu et al. [15] that is provably the most expressive within the class of GNNs and is as powerful as the Weisfeiler-Lehman graph isomorphism test. GIN has achieved excellent performance on graph classification and node classification tasks. GIN also follows a neighborhood aggregation scheme, in which the representation vector of a node is computed by recursively aggregating and transforming the representation vectors of its neighboring nodes [15]. The original GIN adopts a sum aggregation strategy for aggregating representations from the neighboring nodes. The sum aggregation is calculated as follows:
$$a_{v,\mathrm{sum}}^{(k)} = \sum_{u \in \mathcal{N}(v)} h_u^{(k-1)} \qquad (1)$$

where $h_u^{(k-1)}$ denotes the features of node $u$ at layer $k-1$, and $\mathcal{N}(v)$ denotes the neighbourhood of node $v$.
2.2 Weighted Multiple Aggregators (WMA)
A single aggregator, however, is not enough to differentiate between neighbourhood messages. For example, suppose a node of graph $G_1$ receives the two messages $\{2, 2\}$ from its two neighbour nodes, while the corresponding node of graph $G_2$ receives $\{1, 3\}$: both multisets sum to 4, so the sum aggregator cannot distinguish graph $G_1$ from graph $G_2$. To alleviate this information confusion, we introduce two other aggregators, the mean aggregator and the softmax aggregator, calculated as follows:
$$a_{v,\mathrm{mean}}^{(k)} = \frac{1}{d_v} \sum_{u \in \mathcal{N}(v)} h_u^{(k-1)} \qquad (2)$$

$$a_{v,\mathrm{softmax}}^{(k)} = \sum_{u \in \mathcal{N}(v)} \frac{\exp\!\big(h_u^{(k-1)}\big)}{\sum_{w \in \mathcal{N}(v)} \exp\!\big(h_w^{(k-1)}\big)} \odot h_u^{(k-1)} \qquad (3)$$

where $d_v = |\mathcal{N}(v)|$ denotes the number of neighbour nodes of $v$. Through the softmax aggregation, the messages from different nodes can be assigned different weights, so that during model training the information constantly flows toward the parts with salient emotional features. As the graph structure in our proposed SER method is relatively simple, with each node having two neighbour nodes, we do not scale the results of the aggregations as [19] does. Finally, the node representation in WMA-GIN is updated as:
$$h_v^{(k)} = \mathrm{MLP}^{(k)}\!\Big( \big(1 + \epsilon^{(k)}\big)\, h_v^{(k-1)} + \lambda_1\, a_{v,\mathrm{softmax}}^{(k)} + \lambda_2\, a_{v,\mathrm{sum}}^{(k)} + \lambda_3\, a_{v,\mathrm{mean}}^{(k)} \Big) \qquad (4)$$

where MLP denotes a multilayer perceptron consisting of a single linear layer, $h_v^{(k)}$ denotes the feature representation of node $v$ at hidden layer $k$, $h_v^{(0)}$ is initialized as the output of the Bi-GRU layer, $\epsilon^{(k)}$ is a learnable parameter, and $\lambda_1$, $\lambda_2$, $\lambda_3$ are three hyperparameters weighting the three aggregators.
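A minimal PyTorch sketch of Eq. (4) follows, assuming a dense binary adjacency without self-loops; the class name is ours, and in practice the naive exponential in the softmax aggregator would need a max-subtraction trick for numerical stability.

```python
import torch
import torch.nn as nn

class WMAGINLayer(nn.Module):
    """One WMA-GIN layer (Eq. 4): a GIN update whose neighbourhood
    aggregation is a weighted mix of softmax, sum, and mean aggregators."""

    def __init__(self, dim, lambdas=(1/3, 1/3, 1/3)):
        super().__init__()
        self.mlp = nn.Linear(dim, dim)            # the MLP is a single linear layer
        self.eps = nn.Parameter(torch.zeros(1))   # learnable epsilon in Eq. (4)
        self.lambdas = lambdas                    # (softmax, sum, mean) weights

    def forward(self, h, adj):
        # h: (B, N, dim) node features; adj: (N, N) adjacency, no self-loops.
        deg = adj.sum(-1, keepdim=True).clamp(min=1.0)     # |N(v)| per node
        agg_sum = adj @ h                                  # Eq. (1)
        agg_mean = agg_sum / deg                           # Eq. (2)
        # Eq. (3): feature-wise softmax over each node's neighbours.
        e = h.exp()
        denom = (adj @ e).clamp(min=1e-9)
        agg_soft = (adj @ (e * h)) / denom
        lam1, lam2, lam3 = self.lambdas
        mix = lam1 * agg_soft + lam2 * agg_sum + lam3 * agg_mean
        return self.mlp((1.0 + self.eps) * h + mix)
```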
2.3 Fully-Adjacent layer
To ease the over-squashing problem present in all GNNs, we modify the second WMA-GIN layer to be a Fully-Adjacent (FA) layer. The FA layer is a WMA-GIN layer in which every pair of nodes is connected by an edge. This neither changes the type of the WMA-GIN layer nor adds trainable parameters; it only changes the adjacency relationship of the nodes in a single layer. After adding the FA layer, only the second WMA-GIN layer allows the topology-aware node representations to interact directly and to consider nodes far beyond their original neighbors [20].
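Under the same dense-adjacency convention as the sketches above, the FA layer amounts to swapping in an all-ones (minus diagonal) adjacency at one layer:

```python
import numpy as np

def fully_adjacent(num_nodes=120):
    """FA adjacency: every pair of distinct nodes is connected. Swapping
    this in at a single WMA-GIN layer adds no trainable parameters."""
    A = np.ones((num_nodes, num_nodes), dtype=np.float32)
    np.fill_diagonal(A, 0.0)   # the self term is handled by (1 + eps) in Eq. (4)
    return A
```

For instance, passing `fully_adjacent()` to the second WMA-GIN layer while the other layers keep `cycle_adjacency()` reproduces the configuration described here.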
2.4 Multi-stage Loss
A multi-stage loss training strategy is adopted in our proposed SER method. The outputs of the four WMA-GIN layers and the final output are defined as five different stages of the SER network, denoted $S_1, S_2, S_3, S_4, S_5$ in Fig. 1. Generally, the deeper the stage, the more salient the obtained emotional information, so the output of a deeper stage is assigned a larger weight. Cross-entropy (CE) loss is used as the loss function. Note that the multi-stage loss is only used during the training phase. The final loss is calculated as follows:
$$\mathcal{L} = \sum_{i=1}^{N} w_i\, \mathcal{L}_{\mathrm{CE}}^{(i)} \qquad (5)$$

where $w_i$ is the weight of the loss at stage $i$, $\mathcal{L}_{\mathrm{CE}}^{(i)}$ is the CE loss at stage $i$, and $N$ is the total number of stages.
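A sketch of Eq. (5), assuming per-stage logits are available; the actual stage weights $w_i$ are hyperparameters the text does not report, so the increasing defaults below are placeholders only.

```python
import torch.nn.functional as F

def multi_stage_loss(stage_logits, labels, weights=(0.1, 0.15, 0.2, 0.25, 0.3)):
    """Eq. (5): weighted sum of per-stage cross-entropy losses, with larger
    weights assigned to deeper stages; used only during training."""
    assert len(stage_logits) == len(weights)
    return sum(w * F.cross_entropy(logits, labels)
               for w, logits in zip(weights, stage_logits))
```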
3 Experiments
3.1 Dataset
We performed the experiments on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset [22], a popular benchmark for emotion recognition. The corpus contains approximately 12 hours of data over five dyadic sessions with ten subjects. Each conversation is around 5 minutes long and is segmented into multiple sentences. A single categorical emotion label was assigned to each utterance when at least two of three annotators agreed on the label. For a fair comparison with other methods, we performed four-class emotion classification. The samples comprise 5,531 utterances (1,103 angry, 1,636 happy merged with excited, 1,084 sad, 1,708 neutral).
3.2 Node features
Following two prior works [14, 12], we extract 78-dimensional frame-level Low-Level Descriptors (LLDs) from [23] using the openSMILE toolkit [24]. We also conduct experiments using 128-dimensional log-Mel spectrograms, as [30] does. For each sample, we use a sliding window of length 25 ms (with a stride of 10 ms) to extract the LLDs locally. We set each graph length to 120, as [13] and [14] do, which means each graph contains 120 nodes. The graph label is the same as that of its original utterance. An utterance may be cut into several graphs, because some utterances are much longer than 120 frames. Padding is used to make the samples of equal length.
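To make the segmentation concrete, here is a sketch of splitting one utterance's frame-level LLDs into 120-node graphs; zero-padding is our assumption, as the text only states that padding is used.

```python
import numpy as np

def utterance_to_graphs(frames, graph_len=120):
    """Split a (T, 78) LLD sequence into fixed-length 120-node graphs,
    zero-padding the tail; each graph inherits the utterance label."""
    T, d = frames.shape
    n_graphs = int(np.ceil(T / graph_len))
    padded = np.zeros((n_graphs * graph_len, d), dtype=frames.dtype)
    padded[:T] = frames
    return padded.reshape(n_graphs, graph_len, d)
```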
3.3 Experimental Settings
In the experiments, we performed 5-fold cross-validation in a speaker-independent setting, with the training, validation, and test sets split in the proportion 8:1:1. The Adam optimizer [25] was used with an initial learning rate of 1e-4 and a weight decay of 1e-8. Early stopping was applied during training, and the batch size was set to 128. Following previous studies, we employ unweighted accuracy (UA) and weighted accuracy (WA) as evaluation metrics.
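The reported optimizer settings translate directly into code; the early-stopping criterion is not specified in the text, so none is shown.

```python
import torch

model = WMAGINNet()   # as sketched in Sec. 2.1
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-8)
```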
3.4 Discussion of the FA layer
Fig. 2 shows the UA and WA scores of our proposed method with different numbers of WMA-GIN layers and different hidden sizes. Alon and Yahav [20] provide the minimal hidden size needed for different numbers of GIN layers to fit the training data perfectly. We therefore set the lower bounds of the hidden size to 64, 128, and 256 for four, five, and six WMA-GIN layers, respectively. The WA results are all higher than the UA results, and both first increase and then decrease as the hidden size grows. Generally, the more WMA-GIN layers we use, the worse the WA and UA results become: the model suffers from over-squashing, which results in underfitting and prevents the model from distinguishing between different examples. As a result, in the following experiments we employ the Fully-Adjacent layer once in our proposed model, and adopt 4 WMA-GIN layers with a hidden size of 256 according to the results in Fig. 2.
We further explore which WMA-GIN layer should adopt the FA. Tab. 1 shows three configurations, with the FA applied at the last, penultimate, and antepenultimate layer. All three placements of the FA layer can effectively ease information flow and mitigate the over-squashing problem, but the model with the FA applied at the antepenultimate layer performs best: it achieves a better balance between information interaction and emotional feature extraction. In the following experiments, the FA was therefore always applied at the antepenultimate layer of our proposed WMA-GIN model.
Table 1: Results of applying the FA at different WMA-GIN layers.

| FA layer | WA (%) | UA (%) |
|---|---|---|
| Last layer | 70.67 | 65.80 |
| Penultimate layer | 69.68 | 66.41 |
| Antepenultimate layer | 72.48 | 67.72 |
3.5 Ablation study of the Weighted Multiple Aggregators
Table 2: Results with different aggregator weights; the first three columns give the weights of the softmax, sum, and mean aggregators.

| softmax | sum | mean | WA (%) | UA (%) |
|---|---|---|---|---|
| 1/2 | 1/4 | 1/4 | 68.46 | 66.52 |
| 1/4 | 1/2 | 1/4 | 67.67 | 65.31 |
| 1/4 | 1/4 | 1/2 | 67.26 | 65.30 |
| 3/5 | 1/5 | 1/5 | 67.88 | 65.78 |
| 1/5 | 1/5 | 3/5 | 70.76 | 66.24 |
| 1/5 | 3/5 | 1/5 | 69.31 | 65.15 |
| 1/3 | 1/3 | 1/3 | 72.48 | 67.72 |
| 1 | 0 | 0 | 71.42 | 66.59 |
| 0 | 1 | 0 | 71.26 | 66.38 |
| 0 | 0 | 1 | 69.86 | 65.89 |
In this section, we explore different weights for the three aggregators to aggregate node information more precisely from different graphs. We also performed experiments with each of the three aggregators used alone. As shown in Tab. 2, the model performs best when the three aggregators' weights are equal, which suggests that the three aggregators are equally important in alleviating information confusion when aggregating node information. Among the single aggregators, the model with the softmax aggregator performs best, as the softmax aggregator allows asymmetric message passing in the direction of the strongest signal [19].
3.6 Comparison with other methods
As shown in Tab. 3, the results of GCN [16] and GAT [17] are reported by [14], and we reproduced the result of PATCHY-Diff [18]; the results of the other compared methods are reported in their published papers. Our proposed WMA-GIN outperforms all compared graph-based methods, especially on WA. Compared with LSTM-GIN [14], which also adopts the GIN structure, WMA-GIN achieves relative improvements of 12.1% and 3.3% on WA and UA, respectively. Without applying the FA, the performance decreases, which demonstrates that the FA layer can effectively alleviate the information compression problem for the SER task. The results in Tab. 3 also indicate that adopting the multi-stage loss, which assigns different weights to the features at different stages of the network, has a positive influence, and that the log-Mel spectrogram features are inferior to the LLDs. Furthermore, we compare against six recent advanced methods that do not adopt any GNN structure, as shown in Tab. 4. WMA-GIN with LLD features outperforms all listed methods on WA.
Table 3: Comparison with graph-based methods on IEMOCAP.

| Method | Feature set | Params | WA (%) | UA (%) |
|---|---|---|---|---|
| GCN [16] | 78-LLDs | 78K | 61.16 | 62.21 |
| GAT [17] | 78-LLDs | - | 60.93 | 62.09 |
| PATCHY-Diff [18] | 78-LLDs | 68K | 63.23 | 58.71 |
| GA-GRU [12] | 78-LLDs | - | 62.27 | 63.80 |
| Shirian et al. [13] | 35-LLDs | 30K | 63.69 | 59.87 |
| L-GrIN [26] | 35-LLDs | 92K | 65.50 | N/A |
| LSTM-GIN [14] | 78-LLDs | 0.89M | 64.65 | 65.53 |
| WMA-GIN | 78-LLDs | 0.98M | 72.48 | 67.72 |
| w/o FA | 78-LLDs | - | 70.19 | 65.55 |
| w/o Multi-stage Loss | 78-LLDs | - | 70.25 | 66.47 |
| WMA-GIN | 128-Log-Mel | 0.98M | 65.61 | 60.26 |
4 Conclusions
In this paper, we propose a network based on a stacked Weighted Multiple Aggregators Graph Isomorphism Network (WMA-GIN) for SER. The experimental results demonstrate that the WMA can effectively improve the performance of the GIN structure. Moreover, the Fully-Adjacent (FA) layer is shown to help alleviate the over-squashing problem in the SER task. Finally, with the assistance of multi-phase attention (MPA) and the multi-stage loss training strategy, WMA-GIN surpasses other graph-based methods and achieves performance comparable to some advanced non-graph-based methods. In the future, we will focus on exploring different graph constructions for modeling longer speech utterances in a single graph.
5 Acknowledgements
This work was supported by the National Natural Science Foundation of China (NSFC) (U1903213) and the Tianshan Innovation Team Plan Project of Xinjiang (202101642).
References
- [1] C.-N. Anagnostopoulos, T. Iliou, and I. Giannoukos, “Features and classifiers for emotion recognition from speech: a survey from 2000 to 2011,” Artificial Intelligence Review, vol. 43, no. 2, pp. 155–177, 2015.
- [2] M. B. Akçay and K. Oğuz, “Speech emotion recognition: Emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers,” Speech Communication, vol. 116, pp. 56–76, 2020.
- [3] S. Poria, E. Cambria, R. Bajpai, and A. Hussain, “A review of affective computing: From unimodal analysis to multimodal fusion,” Information Fusion, vol. 37, pp. 98–125, 2017.
- [4] S.-L. Yeh, Y.-S. Lin, and C.-C. Lee, “An interaction-aware attention network for speech emotion recognition in spoken dialogs,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6685–6689.
- [5] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, “The graph neural network model,” IEEE transactions on neural networks, vol. 20, no. 1, pp. 61–80, 2008.
- [6] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst, “Geometric deep learning: going beyond euclidean data,” IEEE Signal Processing Magazine, vol. 34, no. 4, pp. 18–42, 2017.
- [7] P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner et al., “Relational inductive biases, deep learning, and graph networks,” arXiv preprint arXiv:1806.01261, 2018.
- [8] K. Xu, C. Li, Y. Tian, T. Sonobe, K.-i. Kawarabayashi, and S. Jegelka, “Representation learning on graphs with jumping knowledge networks,” in International Conference on Machine Learning, ICML, 2018, pp. 5453–5462.
- [9] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, “Neural message passing for quantum chemistry,” in International conference on machine learning, ICML, 2017, pp. 1263–1272.
- [10] S. Zhang, Y. Qin, K. Sun, and Y. Lin, “Few-shot audio classification with attentional graph neural networks.” in INTERSPEECH, 2019, pp. 3649–3653.
- [11] H. Tak, J.-w. Jung, J. Patino, M. Todisco, and N. Evans, “Graph attention networks for anti-spoofing,” in INTERSPEECH, 2021.
- [12] B.-H. Su, C.-M. Chang, Y.-S. Lin, and C.-C. Lee, “Improving speech emotion recognition using graph attentive bi-directional gated recurrent unit network.” in INTERSPEECH, 2020, pp. 506–510.
- [13] A. Shirian and T. Guha, “Compact graph architecture for speech emotion recognition,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 6284–6288.
- [14] J. Liu and H. Wang, “Graph isomorphism network for speech emotion recognition,” INTERSPEECH, pp. 3405–3409, 2021.
- [15] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, “How powerful are graph neural networks?” in International Conference on Learning Representations, ICLR, 2018.
- [16] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in International Conference on Learning Representations, ICLR, 2017.
- [17] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, “Graph attention networks,” in International Conference on Learning Representations, ICLR, 2018.
- [18] Z. Ying, J. You, C. Morris, X. Ren, W. Hamilton, and J. Leskovec, “Hierarchical graph representation learning with differentiable pooling,” Advances in neural information processing systems, vol. 31, 2018.
- [19] G. Corso, L. Cavalleri, D. Beaini, P. Liò, and P. Veličković, “Principal neighbourhood aggregation for graph nets,” Advances in Neural Information Processing Systems, vol. 33, pp. 13260–13271, 2020.
- [20] U. Alon and E. Yahav, “On the bottleneck of graph neural networks and its practical implications,” in International Conference on Learning Representations, ICLR, 2020.
- [21] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, 2017, pp. 5998–6008.
- [22] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan, “IEMOCAP: interactive emotional dyadic motion capture database,” Language Resources and Evaluation, vol. 42, no. 4, p. 335, 2008.
- [23] B. Schuller, S. Steidl, A. Batliner, F. Burkhardt, L. Devillers, C. Müller, and S. Narayanan, “The INTERSPEECH 2010 paralinguistic challenge,” in INTERSPEECH, 2010, pp. 2794–2797.
- [24] F. Eyben, M. Wöllmer, and B. Schuller, “openSMILE: the Munich versatile and fast open-source audio feature extractor,” in Proceedings of the 18th ACM International Conference on Multimedia, ser. MM ’10. New York, NY, USA: Association for Computing Machinery, 2010, pp. 1459–1462.
- [25] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in International Conference on Learning Representations, ICLR, Y. Bengio and Y. LeCun, Eds., 2015.
- [26] A. Shirian, S. Tripathi, and T. Guha, “Dynamic emotion modeling with learnable graphs and graph inception network,” IEEE Transactions on Multimedia, 2021.
- [27] S. Mao, P. Ching, and T. Lee, “Deep learning of segment-level feature representation with multiple instance learning for utterance-level speech emotion recognition.” in INTERSPEECH, 2019, pp. 1686–1690.
- [28] Z. Zhao, Z. Bao, Z. Zhang, N. Cummins, H. Wang, and B. W. Schuller, “Attention-enhanced connectionist temporal classification for discrete speech emotion recognition,” INTERSPEECH, pp. 206–210, 2019.
- [29] Y. Xia, L.-W. Chen, A. Rudnicky, and R. M. Stern, “Temporal context in speech emotion recognition,” in INTERSPEECH, vol. 2021, 2021, pp. 3370–3374.
- [30] Y. Zhong, Y. Hu, H. Huang, and W. Silamu, “A lightweight model based on separable convolution for speech emotion recognition.” in INTERSPEECH, 2020, pp. 3331–3335.
- [31] Q. Cao, M. Hou, B. Chen, Z. Zhang, and G. Lu, “Hierarchical network based on the fusion of static and dynamic features for speech emotion recognition,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 6334–6338.
- [32] Y. Gao, J. Liu, L. Wang, and J. Dang, “Domain-adversarial autoencoder with attention based feature level fusion for speech emotion recognition,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 6314–6318.