Recursive Joint Attention for Audio-Visual Fusion
in Regression based Emotion Recognition
Abstract
In video-based emotion recognition (ER), it is important to effectively leverage the complementary relationship between audio (A) and visual (V) modalities, while retaining the intra-modal characteristics of the individual modalities. In this paper, a recursive joint attention model is proposed along with long short-term memory (LSTM) modules for the fusion of vocal and facial expressions in regression-based ER. Specifically, we investigate the possibility of exploiting the complementary nature of the A and V modalities using a joint cross-attention model in a recursive fashion, combined with LSTMs that capture the intra-modal temporal dependencies within the individual modalities as well as within the A-V feature representations. By integrating LSTMs with recursive joint cross-attention, our model can efficiently leverage both intra- and inter-modal relationships for the fusion of the A and V modalities. The results of extensive experiments performed on the challenging Affwild2 and Fatigue (private) datasets indicate that the proposed A-V fusion model can significantly outperform state-of-the-art methods. The code is available on GitHub: https://github.com/praveena2j/RecurrentJointAttentionwithLSTMs.
Index Terms— Emotion Recognition, Audio-Visual Fusion, Attention Mechanisms, Long Short-Term Memory.
1 Introduction
Automatic emotion recognition (ER) is a challenging problem due to the complex and extremely diverse nature of expressions across individuals and cultures. In most real-world applications, emotions are exhibited over a wide range of emotional states beyond the six basic categorical expressions: anger, disgust, fear, happiness, sadness, and surprise. For instance, emotional states can be expressed as intensities of fatigue, stress, and pain over discrete levels. Similarly, the wide range of continuous emotional states is often formulated as dimensional ER, where diverse and complex human emotions are represented along the dimensions of valence and arousal. Valence denotes the range of continuous emotional states pertinent to pleasantness, spanning from very sad (negative) to very happy (positive). Similarly, arousal spans the range of emotional states related to intensity, from very passive (sleepiness) to extremely active (high excitement). In this paper, we focus on developing a robust model for regression-based ER in the valence-arousal space, as well as for fatigue.
The A and V modalities often carry complementary information, which must be exploited in order to build an efficient A-V fusion system for regression-based ER. In addition to the inter-modal relationships across the A and V modalities, the temporal dynamics in videos carry significant information pertinent to the evolution of facial and vocal expressions over time. Therefore, effectively leveraging both the inter-modal associations across the A and V modalities and the (intra-modal) temporal dynamics within each modality plays a major role in building a robust A-V recognition system. In this paper, we investigate the prospect of leveraging these inter- and intra-modal characteristics of the A and V modalities in a unified framework. In most existing approaches for regression-based ER, LSTMs have been used to model the intra-modal temporal dynamics in videos [1, 2] due to their efficiency in capturing long-term temporal dependencies [3]. On the other hand, cross-attention models [4] have been explored to model the inter-modal characteristics of the A and V modalities for dimensional ER.
In this work, we propose a unified framework for A-V fusion, which effectively leverages both the intra- and inter-modal information in videos using LSTMs and joint cross-attention, respectively. In order to further improve the A-V feature representations of the joint cross-attention model, we also explore a recursive attention mechanism. Training the joint cross-attention model recursively allows the A and V feature representations to be refined, thereby improving system performance. The main contributions of the paper are as follows. (1) A recursive joint cross-attentional model for A-V fusion is introduced to effectively exploit the complementary relationship across modalities, while deploying a recursive mechanism to further refine the A-V feature representations. (2) LSTMs are integrated to effectively capture the temporal dynamics within the individual modalities, as well as within the A-V feature representations. (3) An extensive set of experiments is conducted on the challenging Affwild2 and Fatigue (private) datasets, showing that the proposed A-V fusion model outperforms the related state-of-the-art models for regression-based ER.
2 Related Work
An early DL approach for A-V fusion-based dimensional ER was proposed by Tzirakis et al. [5], where deep features (obtained with a ResNet-50 for V and a 1D CNN for A) are concatenated and fed to an LSTM. Recently, Vincent et al. [3] investigated the effectiveness of attention models and compared them with recurrent networks, showing that LSTMs are quite efficient in capturing temporal dependencies compared to attention models for dimensional ER. Kuhnke et al. [2] proposed a two-stream A-V network, where deep models are used to extract A and V features, which are then concatenated for dimensional ER. Most of these approaches fail to effectively capture the inter-modal semantics across the A and V modalities. In [6] and [7], the authors focused on cross-modal attention using transformers to exploit the inter-modal relationships of the A and V modalities for dimensional ER. Rajasekhar et al. [4] explored cross-attention models to leverage the inter-modal characteristics based on the cross-correlation across the A and V features. They improved their approach by introducing a joint feature representation into the cross-attention model to retain the intra-modal characteristics [8, 9]. However, most of these approaches cannot effectively leverage intra-modal relationships. Chen et al. [10] modeled A and V features using LSTMs, and combined the unimodal predictions using attention weights obtained from LSTM-based conditional attention. Priyasad et al. [11] also explored LSTMs for V features, and used DNN-based attention on the concatenated A and V features for the final predictions. Beard et al. [12] proposed a recursive recurrent attention model, where LSTMs are augmented with an additional shared memory state to capture multi-modal relationships in a recursive manner. In contrast with these approaches, we focus on modeling the A-V relationships by allowing the A and V features to interact and measure the semantic relevance across and within the modalities in a recursive fashion before feature concatenation. LSTMs are employed for the temporal modeling of both the unimodal and multimodal features to further enhance the proposed framework. Therefore, our proposed model effectively leverages the intra-modal and complementary inter-modal relationships, resulting in a higher level of performance.
3 Proposed Approach
A) Problem Formulation: Given an input video sub-sequence $S$, $L$ non-overlapping video clips are uniformly sampled, and deep feature vectors $\mathbf{X}_a$ and $\mathbf{X}_v$ are extracted for the individual A and V modalities, respectively, from pre-trained networks. Let $\mathbf{X}_a = \{\mathbf{x}_a^1, \mathbf{x}_a^2, \ldots, \mathbf{x}_a^L\} \in \mathbb{R}^{d_a \times L}$ and $\mathbf{X}_v = \{\mathbf{x}_v^1, \mathbf{x}_v^2, \ldots, \mathbf{x}_v^L\} \in \mathbb{R}^{d_v \times L}$, where $d_a$ and $d_v$ represent the dimensions of the A and V feature representations, respectively, and $\mathbf{x}_a^l$ and $\mathbf{x}_v^l$ denote the A and V feature vectors of the $l$-th video clip, respectively, for $l = 1, 2, \ldots, L$ clips. The objective is to estimate the regression model $f: \mathbf{X} \rightarrow \mathbf{Y}$ from the training data $\{\mathbf{X}, \mathbf{Y}\}$, where $\mathbf{X}$ denotes the set of A and V feature vectors of the input video clips and $\mathbf{Y}$ represents the regression labels of the corresponding video clips.
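For concreteness, the following tiny PyTorch sketch shows the assumed layout of the clip-level features and labels for one sub-sequence; the concrete values of $L$, $d_a$, and $d_v$ are illustrative placeholders, not specified by the paper at this point.

```python
import torch

L, d_a, d_v = 8, 512, 512      # illustrative number of clips and feature dimensions

X_a = torch.randn(d_a, L)      # audio feature matrix: one column per video clip
X_v = torch.randn(d_v, L)      # visual feature matrix: one column per video clip
Y = torch.rand(2, L) * 2 - 1   # e.g. valence/arousal labels per clip, in [-1, 1]

# The goal is a regression model f such that f(X_a, X_v) approximates Y.
```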
B) Audio and Visual Networks: Spectrograms have been found to be promising with various 2D-CNNs (e.g., ResNet-18 [13]) for ER [14, 15]. Therefore, we use spectrograms with a ResNet-18 in the proposed framework. In order to effectively leverage the temporal dynamics within the A modality, we also apply LSTMs across the temporal segments of the A sequences. Finally, the A feature vectors of the video clips are denoted as $\mathbf{X}_a$.
Facial expressions exhibit significant information pertinent to both visual appearance and temporal dynamics in videos. LSTMs are found to be efficient in capturing long-term temporal dynamics, while 3D-CNNs are effective in capturing short-term temporal dynamics [16]. Therefore, we use LSTMs with 3D-CNNs (R3D [17]) to obtain the V features for the fusion model. In most existing approaches, the output of the last convolution layer is 512 x 7 x 7, which is passed through a pooling operation to reduce the spatial dimensions from 7 to 1. This abrupt reduction in spatial dimension was found to discard useful information, as the stride is large. Therefore, inspired by [18], we use the A feature representation to smoothly reduce the spatial dimensions of the raw V features of each video clip. Finally, we obtain the matrix of V feature vectors of the video clips as $\mathbf{X}_v$.
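To illustrate these backbones, the sketch below shows how clip-level A and V features could be extracted, assuming a recent torchvision ResNet-18 over one-channel spectrograms and an R3D-18 over short video clips, each followed by a BLSTM across the clip sequence. The layer sizes, class names, and the use of default global pooling (in place of the audio-guided spatial reduction described above) are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18
from torchvision.models.video import r3d_18

class AudioBackbone(nn.Module):
    """Sketch: ResNet-18 over 1-channel spectrograms + BLSTM across clips."""
    def __init__(self, feat_dim=512):
        super().__init__()
        cnn = resnet18(weights=None)
        cnn.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
        cnn.fc = nn.Identity()                        # keep the 512-d pooled features
        self.cnn = cnn
        self.blstm = nn.LSTM(512, feat_dim // 2, batch_first=True, bidirectional=True)

    def forward(self, spec):                          # spec: (B, L, 1, F, T)
        B, L = spec.shape[:2]
        feats = self.cnn(spec.flatten(0, 1))          # (B*L, 512)
        feats, _ = self.blstm(feats.reshape(B, L, -1))  # temporal modeling across clips
        return feats                                  # (B, L, feat_dim)

class VisualBackbone(nn.Module):
    """Sketch: R3D-18 over video clips + BLSTM across clips.
    The audio-guided spatial reduction is simplified here to the default pooling."""
    def __init__(self, feat_dim=512):
        super().__init__()
        cnn = r3d_18(weights=None)
        cnn.fc = nn.Identity()
        self.cnn = cnn
        self.blstm = nn.LSTM(512, feat_dim // 2, batch_first=True, bidirectional=True)

    def forward(self, clips):                         # clips: (B, L, 3, T, H, W)
        B, L = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1))         # (B*L, 512)
        feats, _ = self.blstm(feats.reshape(B, L, -1))
        return feats                                  # (B, L, feat_dim)
```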
C) Recursive Joint Attention Model: Given the A and V features $\mathbf{X}_a$ and $\mathbf{X}_v$, the joint feature representation $\mathbf{J}$ is obtained by concatenating the A and V feature vectors, $\mathbf{J} = [\mathbf{X}_a; \mathbf{X}_v] \in \mathbb{R}^{d \times L}$, where $d = d_a + d_v$ denotes the dimensionality of the concatenated features. The concatenated A-V feature representation ($\mathbf{J}$) of a video sub-sequence ($S$) is now used to attend to the unimodal feature representations $\mathbf{X}_a$ and $\mathbf{X}_v$. The joint correlation matrix $\mathbf{C}_a$ across the A features $\mathbf{X}_a$ and the combined A-V features $\mathbf{J}$ is given by:
$$\mathbf{C}_a = \tanh\left(\frac{\mathbf{X}_a^{\top}\,\mathbf{W}_{ja}\,\mathbf{J}}{\sqrt{d}}\right) \qquad (1)$$
where $\mathbf{W}_{ja} \in \mathbb{R}^{d_a \times d}$ represents a learnable weight matrix across the A and combined A-V features, and $\top$ denotes the transpose operation. Similarly, the joint correlation matrix $\mathbf{C}_v$ for the V features is given by:
$$\mathbf{C}_v = \tanh\left(\frac{\mathbf{X}_v^{\top}\,\mathbf{W}_{jv}\,\mathbf{J}}{\sqrt{d}}\right) \qquad (2)$$
where $\mathbf{W}_{jv} \in \mathbb{R}^{d_v \times d}$ is the corresponding learnable weight matrix. The joint correlation matrices capture the semantic relevance across the A and V modalities, as well as within the same modality among consecutive video clips, which helps to effectively leverage both intra- and inter-modal relationships. After computing the joint correlation matrices, the attention weights of the A and V modalities are estimated. For the A modality, the joint correlation matrix $\mathbf{C}_a$ and the corresponding A features $\mathbf{X}_a$ are combined using the learnable weight matrices $\mathbf{W}_{ca}$ and $\mathbf{W}_a$ to compute the attention maps of the A modality, given by $\mathbf{H}_a = \mathrm{ReLU}(\mathbf{W}_a \mathbf{X}_a + \mathbf{W}_{ca} \mathbf{C}_a)$, where $\mathbf{W}_a \in \mathbb{R}^{k \times d_a}$, $\mathbf{W}_{ca} \in \mathbb{R}^{k \times L}$, and $\mathbf{H}_a \in \mathbb{R}^{k \times L}$ represents the attention maps of the A modality. Similarly, the attention maps ($\mathbf{H}_v$) of the V modality are obtained as $\mathbf{H}_v = \mathrm{ReLU}(\mathbf{W}_v \mathbf{X}_v + \mathbf{W}_{cv} \mathbf{C}_v)$, where $\mathbf{W}_v \in \mathbb{R}^{k \times d_v}$ and $\mathbf{W}_{cv} \in \mathbb{R}^{k \times L}$. Then, the attention maps are used to compute the attended features of the A and V modalities as:
$$\mathbf{X}_{att,a} = \mathbf{W}_{ha}\,\mathbf{H}_a + \mathbf{X}_a \qquad (3)$$
$$\mathbf{X}_{att,v} = \mathbf{W}_{hv}\,\mathbf{H}_v + \mathbf{X}_v \qquad (4)$$
where $\mathbf{W}_{ha} \in \mathbb{R}^{d_a \times k}$ and $\mathbf{W}_{hv} \in \mathbb{R}^{d_v \times k}$ denote the learnable weight matrices for A and V, respectively. After obtaining the attended features, they are fed again to the joint cross-attentional model to compute the new A and V feature representations at iteration $t$ as:
$$\mathbf{X}_{att,a}^{(t)} = \mathbf{W}_{ha}^{(t)}\,\mathbf{H}_a^{(t)} + \mathbf{X}_{att,a}^{(t-1)} \qquad (5)$$
$$\mathbf{X}_{att,v}^{(t)} = \mathbf{W}_{hv}^{(t)}\,\mathbf{H}_v^{(t)} + \mathbf{X}_{att,v}^{(t-1)} \qquad (6)$$
where $\mathbf{W}_{ha}^{(t)}$ and $\mathbf{W}_{hv}^{(t)}$ denote the learnable weight matrices of iteration $t$ for A and V, respectively. Finally, the attended A and V features obtained after $t$ iterations are concatenated and fed to a BLSTM to capture the temporal dependencies within the refined A-V feature representations, which are then fed to fully connected layers for the final prediction.
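The following is a minimal PyTorch sketch of the fusion described above, implementing the reconstructed Eqs. (1)-(6) followed by a joint BLSTM and a fully connected head. The dimensions ($d_a$, $d_v$, the attention size $k$, the sub-sequence length $L$), the choice of giving every recursive iteration its own weight matrices, and the initialization are assumptions for illustration; the authors' actual implementation is available at the GitHub link given in the abstract.

```python
import torch
import torch.nn as nn

class RecursiveJointAttention(nn.Module):
    """Sketch of recursive joint cross-attention fusion (cf. Eqs. (1)-(6)),
    followed by a joint BLSTM and fully connected layers for prediction."""

    def __init__(self, d_a=512, d_v=512, k=256, seq_len=8, num_iters=2, out_dim=2):
        super().__init__()
        d = d_a + d_v
        self.num_iters, self.scale = num_iters, d ** 0.5

        def params(rows, cols):  # one weight matrix per recursive iteration (assumption)
            return nn.ParameterList(
                nn.Parameter(torch.randn(rows, cols) * 0.01) for _ in range(num_iters)
            )

        self.W_ja, self.W_jv = params(d_a, d), params(d_v, d)      # Eqs. (1), (2)
        self.W_a, self.W_v = params(k, d_a), params(k, d_v)        # attention maps
        self.W_ca, self.W_cv = params(k, seq_len), params(k, seq_len)
        self.W_ha, self.W_hv = params(d_a, k), params(d_v, k)      # Eqs. (3)-(6)
        self.blstm = nn.LSTM(d, d // 2, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(d, out_dim)

    def forward(self, X_a, X_v):
        # X_a: (B, L, d_a), X_v: (B, L, d_v); transpose to (B, d, L) to match the equations.
        X_a, X_v = X_a.transpose(1, 2), X_v.transpose(1, 2)
        for t in range(self.num_iters):
            J = torch.cat([X_a, X_v], dim=1)                                       # (B, d, L)
            C_a = torch.tanh(X_a.transpose(1, 2) @ self.W_ja[t] @ J / self.scale)  # (B, L, L)
            C_v = torch.tanh(X_v.transpose(1, 2) @ self.W_jv[t] @ J / self.scale)
            H_a = torch.relu(self.W_a[t] @ X_a + self.W_ca[t] @ C_a)               # (B, k, L)
            H_v = torch.relu(self.W_v[t] @ X_v + self.W_cv[t] @ C_v)
            X_a = self.W_ha[t] @ H_a + X_a                                         # Eq. (3)/(5)
            X_v = self.W_hv[t] @ H_v + X_v                                         # Eq. (4)/(6)
        fused = torch.cat([X_a, X_v], dim=1).transpose(1, 2)                       # (B, L, d)
        out, _ = self.blstm(fused)                                                 # joint BLSTM
        return self.fc(out)                                                        # (B, L, out_dim)
```

For example, with the backbone sketches of Sec. 3B producing features of shape (B, L, 512), `RecursiveJointAttention(num_iters=2)(X_a, X_v)` would yield clip-level valence and arousal predictions; `num_iters=2` mirrors the best setting ($t=2$) found in the ablation of Sec. 4B.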
4 Results and Discussion
A) Datasets: Affwild2 is among the largest public datasets in affective computing, consisting of videos collected from YouTube, all captured in-the-wild [19]. The annotations are provided by four experts using a joystick, and the final labels are obtained as the average of the four raters. The dataset covers a large number of frames and subjects of both genders. The annotations for valence and arousal are provided continuously in the range of [-1, 1]. The dataset is partitioned in a subject-independent manner, so that every subject's data is present in only one subset. The partitioning produces 341, 71, and 152 videos for the training, validation, and test sets, respectively.
The Fatigue dataset (private) is obtained from participants in a rehabilitation center who suffer from degenerative diseases inducing fatigue. A total of 27 video sessions were captured, each lasting 40 to 45 minutes and labeled at the sequence level on a scale of 0 to 10 for every 10 to 15 minutes. We use approximately 70% of the data for training (50,845 samples) and the remaining 30% for validation (21,792 samples).
B) Ablation Study: Table 1 presents the results of the ablation experiments conducted on the validation set. Performance is evaluated using the Concordance Correlation Coefficient (CCC); a minimal sketch of the CCC computation is given after Table 1. In this section, we analyze the contribution of BLSTMs by performing experiments with and without them. First, we conducted experiments without using BLSTMs, for both the individual A and V representations and the A-V feature representations. Then, we included BLSTMs only for the individual A and V modalities before feeding them to the joint attention fusion model, i.e., only unimodal BLSTMs (U-BLSTMs). By including U-BLSTMs to capture the temporal dependencies within the individual modalities, we observe an improvement in performance. Therefore, BLSTMs appear to capture the intra-modal temporal dynamics better than the correlation-based intra-modal modeling of the joint attention model alone. We then also included a joint BLSTM (J-BLSTM) to capture the temporal dynamics across the joint A-V feature representations, which further improved the performance of the system. Note that in the above experiments, we did not perform recursive attention. In addition to the impact of the U-BLSTMs and J-BLSTM, we conducted further experiments to investigate the impact of the recursive behavior of the joint attention model. First, we applied recursion without LSTMs and found some improvement due to recursion alone. Then we included the LSTMs and conducted several experiments by varying the number of recursions (iterations) in the fusion model. As we increase the number of recursive iterations, the model performance improves, and then starts to decrease beyond a certain number of iterations. A similar trend is also observed on the test set. This can be attributed to the fact that recursion also acts as a regularizer, which improves the generalization ability of the model. In our experiments, we found that $t=2$ gives the best performance, i.e., the best performance of our model is achieved with two recursive iterations.
Table 1: Ablation study of the proposed approach on the Affwild2 validation set (CCC).

| Method | Valence | Arousal |
|---|---|---|
| JA Fusion w/o Recursion | | |
| Fusion w/o U-BLSTM | 0.670 | 0.590 |
| Fusion w/o J-BLSTM | 0.691 | 0.646 |
| Fusion w/ U-BLSTM and J-BLSTM | 0.715 | 0.688 |
| JA Fusion w/ Recursion | | |
| JA Fusion w/o BLSTMs, t = 2 | 0.703 | 0.623 |
| JA Fusion with BLSTMs, t = 2 | 0.721 | 0.694 |
| JA Fusion with BLSTMs, t = 3 | 0.706 | 0.652 |
| JA Fusion with BLSTMs, t = 4 | 0.685 | 0.601 |
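As a reference for the metric reported in Tables 1-3, the snippet below is a minimal sketch of the standard Concordance Correlation Coefficient computation; it follows the textbook definition and is not taken from the authors' evaluation code.

```python
import torch

def concordance_cc(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Concordance Correlation Coefficient (standard definition) for 1-D tensors."""
    pred_mean, target_mean = pred.mean(), target.mean()
    covariance = ((pred - pred_mean) * (target - target_mean)).mean()
    return 2 * covariance / (
        pred.var(unbiased=False) + target.var(unbiased=False)
        + (pred_mean - target_mean) ** 2
    )

# CCC is computed separately for valence and arousal over all predicted clips, e.g.:
# ccc_valence = concordance_cc(predictions[:, 0], labels[:, 0])
```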
C) Comparison to State-of-the-Art: Table 2 shows our results against relevant state-of-the-art A-V fusion models on the Affwild2 dataset. The Affwild2 dataset has recently been widely used in the Affective Behavior Analysis in-the-Wild (ABAW) challenges [20, 21]. Therefore, we compare our approach with the relevant state-of-the-art approaches from the ABAW challenges. Kuhnke et al. [2] used simple feature concatenation with a ResNet-18 for the A and an R3D for the V modality, showing better performance for arousal than valence. Zhang et al. [22] proposed a leader-follower attention model for fusion, and improved the arousal performance over the model of Kuhnke et al. [2]. Rajasekhar et al. [4] explored a cross-attention model leveraging only the inter-modal relationships of A and V, and showed improvement for valence but not for arousal. Rajasekhar et al. [8] further improved the performance by introducing a joint feature representation into the cross-attention model. The proposed model performs even better than the vanilla JCA [8] by introducing LSTMs as well as a recursive attention mechanism.
Table 2: CCC performance of the proposed approach compared to state-of-the-art methods for A-V fusion on the Affwild2 dataset.

| Method | Type of Fusion | Valence | Arousal |
|---|---|---|---|
| Validation Set | | | |
| Kuhnke et al. [2] | Feature Concatenation | 0.493 | 0.613 |
| Zhang et al. [22] | Leader-Follower Attention | 0.469 | 0.649 |
| Rajasekhar et al. [4] | Cross Attention | 0.541 | 0.517 |
| Rajasekhar et al. [8] | Joint Cross Attention | 0.657 | 0.580 |
| Ours | LSTM + Transformers | 0.628 | 0.654 |
| Ours | Recursive JA + BLSTM | 0.721 | 0.694 |
| Test Set | | | |
| Meng et al. [23] | LSTM + Transformers | 0.606 | 0.596 |
| Vincent et al. [3] | LSTM + Transformers | 0.418 | 0.407 |
| Rajasekhar et al. [8] | Joint Cross Attention | 0.451 | 0.389 |
| Ours | Recursive JA + BLSTM | 0.467 | 0.405 |
For the test set results, the winner of the latest ABAW challenge [23] showed improvement using A-V fusion, albeit relying on three external datasets and multiple backbones. We also compare the performance of our A and V backbones with the ensembling of LSTMs and transformers of [23] on the validation set. Vincent et al. [3] used LSTMs to capture intra-modal dependencies and explored transformers for cross-modal attention; however, they fail to effectively capture the inter-modal relationships across consecutive video clips. Rajasekhar et al. [8] further improved the valence performance using joint cross-attention. The proposed model outperforms both [3] and [8].
Table 3 shows the performance of the proposed approach on the Fatigue dataset. We report the performance of the individual modalities, along with feature concatenation and cross-attention [4]. The proposed approach outperforms both cross-attention [4] and the baseline feature concatenation.
Table 3: Performance of the proposed approach on the Fatigue (private) dataset.

| Method | Fatigue Level |
|---|---|
| Audio only (2D-CNN: ResNet-18) | 0.312 |
| Visual only (3D-CNN: R3D) | 0.415 |
| Feature Concatenation | 0.378 |
| Cross Attention [4] | 0.421 |
| Recursive JA + BLSTM (Ours) | 0.447 |
5 Conclusion
This paper introduces a recursive joint attention model, along with BLSTMs, that allows effective spatiotemporal A-V fusion for regression-based ER. In particular, the joint attention model is trained in a recursive fashion, allowing the A-V feature representations to be refined. We further investigated the impact of BLSTMs for capturing the intra-modal temporal dynamics of the individual A and V modalities, as well as of the A-V features, for regression-based ER. By effectively capturing the intra-modal relationships using BLSTMs and the inter-modal relationships using recursive joint attention, the proposed approach outperforms the related state-of-the-art approaches.
References
- [1] Liam Schoneveld, Alice Othmani, and Hazem Abdelkawy, “Leveraging recent advances in deep learning for audio-visual emotion recognition,” Pattern Recognition Letters, vol. 146, 2021.
- [2] Felix Kuhnke, Lars Rumberg, and Jörn Ostermann, “Two-stream aural-visual affect analysis in the wild,” in FG Workshop, 2020.
- [3] Vincent Karas, Mani Kumar Tellamekala, Adria Mallol-Ragolta, Michel Valstar, and Björn W. Schuller, “Time-continuous audiovisual fusion with recurrence vs attention for in-the-wild affect recognition,” in CVPRW, 2022.
- [4] Gnana Praveen Rajasekhar, Eric Granger, and Patrick Cardinal, “Cross attentional audio-visual fusion for dimensional emotion recognition,” in FG, 2021.
- [5] Panagiotis Tzirakis, George Trigeorgis, Mihalis A. Nicolaou, Björn W. Schuller, and Stefanos Zafeiriou, “End-to-end multimodal emotion recognition using deep neural networks,” IEEE J. of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1301–1309, 2017.
- [6] Panagiotis Tzirakis, Jiaxin Chen, Stefanos Zafeiriou, and Björn Schuller, “End-to-end multimodal affect recognition in real-world environments,” Information Fusion, vol. 68, 2021.
- [7] Srinivas Parthasarathy and Shiva Sundaram, “Detecting expressions with multimodal transformers,” in 2021 IEEE Spoken Language Technology Workshop (SLT), 2021, pp. 636–643.
- [8] R Gnana Praveen, Patrick Cardinal, and Eric Granger, “Audio-visual fusion for emotion recognition in the valence-arousal space using joint cross-attention,” IEEE Transactions on Biometrics, Behavior, and Identity Science, 2023.
- [9] R Gnana Praveen, Wheidima Carneiro de Melo, Nasib Ullah, Haseeb Aslam, Osama Zeeshan, Théo Denorme, Marco Pedersoli, Alessandro L. Koerich, Simon Bacon, Patrick Cardinal, and Eric Granger, “A joint cross-attention model for audio-visual fusion in dimensional emotion recognition,” in CVPRW, 2022.
- [10] S Chen and Q Jin, “Multi-modal conditional attention fusion for dimensional emotion prediction,” in Proc. of ACM Multimedia, 2016.
- [11] Priyasad Darshana, F Tharindu, S Simon, and F Clinton, “Learning salient features for multimodal emotion recognition with recurrent neural networks and attention based fusion,” in Proc. of AVSP, 2019.
- [12] R Beard, R Das, R. W. M. Ng, P. G. K Gopalakrishnan, L Eerens, P Swietojanski, and O Miksik, “Multi-modal sequence fusion via recursive attention for emotion recognition,” in CoNLL, 2018.
- [13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in CVPR, 2016, pp. 770–778.
- [14] Anwer Slimi, Mohamed Hamroun, Mounir Zrigui, and Henri Nicolas, “Emotion recognition from speech using spectrograms and shallow neural networks,” in ICAMCM, 2020, pp. 35–39.
- [15] Samuel Albanie, Arsha Nagrani, Andrea Vedaldi, and Andrew Zisserman, “Emotion recognition in speech using cross-modal transfer in the wild,” in Proc. of the 26th ACM Multimedia, 2018, pp. 292–301.
- [16] Yin Fan, Xiangju Lu, Dian Li, and Yuanliu Liu, “Video-based emotion recognition using cnn-rnn and c3d hybrid networks,” in ACM ICMI, 2016, pp. 445–450.
- [17] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri, “A closer look at spatiotemporal convolutions for action recognition,” in CVPR, 2018, pp. 6450–6459.
- [18] Bin Duan, Hao Tang, Wei Wang, Ziliang Zong, Guowei Yang, and Yan Yan, “Audio-visual event localization via recursive fusion by joint co-attention,” in WACV, 2021.
- [19] Dimitrios Kollias, Panagiotis Tzirakis, Mihalis A. Nicolaou, Athanasios Papaioannou, Guoying Zhao, Bjorn Schuller, Irene Kotsia, and Stefanos Zafeiriou, “Deep affect prediction in-the-wild: Aff-wild database and challenge, deep architectures, and beyond,” IJCV, vol. 127, pp. 907–929, 2019.
- [20] D Kollias, A Schulc, E Hajiyev, and S Zafeiriou, “Analysing affective behavior in the first ABAW competition,” in FG, 2020.
- [21] Dimitrios Kollias and Stefanos Zafeiriou, “Analysing affective behavior in the second abaw2 competition,” in ICCVW, 2021, pp. 3652–3660.
- [22] Su Zhang, Yi Ding, Ziquan Wei, and Cuntai Guan, “Continuous emotion recognition with audio-visual leader-follower attentive fusion,” in ICCVW, 2021.
- [23] Liyu Meng, Yuchen Liu, Xiaolong Liu, Zhaopei Huang, Wenqiang Jiang, Tenggan Zhang, Chuanhe Liu, and Qin Jin, “Valence and arousal estimation based on multimodal temporal-aware features for videos in the wild,” in CVPRW, 2022, pp. 2344–2351.