
A Pre-trained Audio-visual Transformer for Emotion Recognition

Abstract

In this paper, we introduce a pretrained audio-visual Transformer trained on more than 500k utterances from nearly 4000 celebrities in the VoxCeleb2 dataset for human behavior understanding. The model aims to capture and extract useful information from the interactions between human facial and auditory behaviors, with application to emotion recognition. We evaluate the model performance on two datasets, namely CREMA-D (emotion classification) and MSP-IMPROV (continuous emotion regression). Experimental results show that fine-tuning the pre-trained model helps improve emotion classification accuracy by 5-7% and the Concordance Correlation Coefficient (CCC) in continuous emotion recognition by 0.03-0.09 compared to the same model trained from scratch. We also demonstrate the robustness of fine-tuning the pre-trained model in a low-resource setting. With only 10% of the original training set provided, fine-tuning the pre-trained model leads to at least 10% better emotion recognition accuracy and a CCC improvement of at least 0.1 for continuous emotion recognition.

Index Terms—  Emotion recognition, Transformer, multimodal fusion

1 Introduction

Recent advances in machine learning and signal processing enable an unprecedented opportunity to computationally analyze and predict social behaviors. A better understanding of how people behave and express themselves could have wide applicability. Much human interaction research (e.g. emotion recognition) is task-oriented, which often requires time-consuming and expensive data collection processes, and hence suffers from small sample sizes that prevent ML models from generalizing well. Despite the scarcity of labeled data, there is an abundance of data on human communication that is unlabeled, multi-modal, and easily accessible [1]. This opens an opportunity to address the challenge by creating self-supervised pretrained models that are trained on unlabeled data and can be finetuned for downstream tasks. Similar approaches have been very successful in NLP [2] and speech processing [3] tasks.

Most research extending the standard Transformer [4] to a multimodal context focuses on the visual-and-language domain. Existing work generally utilizes the language-pretrained BERT [2] and trains only the visual components, through either a single-stream framework (image and text are jointly processed by a single encoder) [5, 6] or a dual-stream framework (with separate visual and text encoders) [7, 8]. To the best of our knowledge, Lee et al. present the only pretrained Transformer-based model for the audio-and-visual domain [9]. Their end-to-end model contains two Transformers that encode audio and visual inputs independently, followed by another Transformer that processes the encoded audio and video signals sequentially. Lee et al. pretrain their model on Kinetics-700 (containing 700 human action classes) [10] and AudioSet (containing 632 audio event classes) [11]. Since both datasets contain little information on human interactions, the pretrained models would not be appropriate for downstream tasks such as emotion recognition.

In this study, we present the first pretrained audio-visual Transformer-based model that learns from human communicative behaviors. We then validate the pretrained model on the downstream task of emotion recognition on the CREMA-D dataset [12] and the MSP-IMPROV dataset [13].

Fig. 1: The architecture of the $i^{th}$ layer in a $b \rightarrow a$ Cross-modal Transformer (Cross-modal Attention Block) is shown on the left. The architecture of the Multimodal Transformer is shown on the right. Re-illustration based on [14].

2 Method

2.1 Multimodal Transformer architecture

We adapt the Multimodal Transformer (MulT) architecture [14] for the pretraining task. At a high level, the architecture consists of 4 main components: temporal convolutions that project features of different modalities to the same dimension, a sinusoidal positional encoding that captures temporal information, the Cross-modal Transformers that allow one modality to pass information to another, and the standard Self-Attention Transformers that process the fused information produced by the Cross-modal Transformers. An overview of the MulT architecture is given in Figure 1 (right).

At the core of MulT is the Cross-modal Attention Block (Fig. 1, left), which differs from the standard Transformer encoder layer in two ways. First, the module merges audio-visual temporal information through the Multihead Cross-modal Attention layer, with the Queries drawn from one modality while the Keys and Values come from another. Second, each Cross-modal Attention Block learns directly from the low-level feature sequences (i.e., $X^{[0]}_{b}$ is passed to the $b \rightarrow a$ Cross-modal Attention Block regardless of the layer position), whereas the standard Transformer takes intermediate-level features as input. Tsai et al. [14] empirically show that attending to low-level features in the Cross-modal Attention Block is beneficial for MulT. With only 2 modalities in consideration, we have 2 types of Cross-modal Attention Blocks ($V \rightarrow A$ and $A \rightarrow V$). Eventually, we concatenate the outputs of the Cross-modal Transformers and pass them through a standard Self-attention Transformer [4]. The outputs of the Self-attention Transformer are finally converted back to the original audio and visual feature dimensions for prediction using two independent fully-connected layers.
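As a concrete illustration, below is a minimal PyTorch sketch of one $b \rightarrow a$ Cross-modal Attention Block; class and variable names, and the exact placement of layer normalization, are our own simplifications of [14], not the authors' implementation.

```python
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    """One b -> a cross-modal attention block (illustrative names)."""

    def __init__(self, dim: int, num_heads: int, ff_dim: int, dropout: float = 0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, ff_dim), nn.ReLU(), nn.Linear(ff_dim, dim))
        self.norm_a = nn.LayerNorm(dim)
        self.norm_b = nn.LayerNorm(dim)
        self.norm_ff = nn.LayerNorm(dim)

    def forward(self, z_a: torch.Tensor, x_b0: torch.Tensor) -> torch.Tensor:
        # z_a:  intermediate representation of modality a, shape (B, T_a, dim)
        # x_b0: low-level (layer-0) features of modality b, shape (B, T_b, dim);
        #       the same x_b0 is fed to every layer, unlike a standard encoder.
        q = self.norm_a(z_a)
        kv = self.norm_b(x_b0)
        attn_out, _ = self.attn(query=q, key=kv, value=kv)  # Queries from a, Keys/Values from b
        z = z_a + attn_out                                  # residual connection
        return z + self.ff(self.norm_ff(z))                 # position-wise feed-forward
```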

2.2 Pretraining procedure

Following prior work on pretrained Transformers [2, 3, 9], we use the masked frame prediction task to train our model. Specifically, we randomly select 15% of the frames, mask them for both the audio and visual inputs, and train the model to reconstruct the masked frames. Following [2, 3], a selected frame is set entirely to zero with a probability of 0.8, replaced with a randomly selected frame with a probability of 0.1, and kept untouched with a probability of 0.1. Similar to [3], we use the L1 loss to measure the reconstruction error; the loss for each prediction is the sum of the L1 loss for the audio modality and the L1 loss for the visual modality. We adopt the strategy of masking consecutive frames from [3] to prevent the model from exploiting local smoothness. We use dynamic masking for the training set (the masked frames of each input sequence are selected independently every time the sequence is drawn) and static masking for the validation set (the masked frames for each input sequence are pre-computed) to make the comparison between models' performances fair.
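For clarity, the following is a minimal sketch of the masking scheme just described, assuming frame-aligned audio and visual feature tensors of shapes (T, D_audio) and (T, D_visual); function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def mask_frames(audio, visual, mask_ratio=0.15, span=3):
    """Select ~15% of frames in spans of `span` consecutive frames and corrupt them."""
    T = audio.size(0)
    num_spans = max(1, int(T * mask_ratio / span))
    mask = torch.zeros(T, dtype=torch.bool)
    for s in torch.randperm(T - span + 1)[:num_spans]:
        mask[s:s + span] = True                   # mask `span` consecutive frames

    a, v = audio.clone(), visual.clone()
    for t in torch.nonzero(mask).flatten():
        r = torch.rand(1).item()
        if r < 0.8:                               # 80%: zero out the frame
            a[t], v[t] = 0.0, 0.0
        elif r < 0.9:                             # 10%: replace with a random frame
            j = torch.randint(0, T, (1,)).item()
            a[t], v[t] = audio[j], visual[j]
        # remaining 10%: keep the frame untouched
    return a, v, mask

def reconstruction_loss(pred_a, pred_v, audio, visual, mask):
    # L1 reconstruction loss, summed over the two modalities, on masked frames only
    return F.l1_loss(pred_a[mask], audio[mask]) + F.l1_loss(pred_v[mask], visual[mask])
```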

2.3 Feature selection

The Multimodal Transformer is not end-to-end, so we need to extract acoustic and visual features as inputs to the model. Since the usefulness of a feature set depends on the context of its usage, and the target of our study is emotion recognition, we compare different baseline features on the CREMA-D and MSP-IMPROV datasets [12, 13]. Because the motivation for pretraining MulT is to capture and model temporal dependencies, we also want the base features to be extracted frame by frame, without already encoding temporal context. Thus, even though features extracted from pretrained Speech Transformers such as [3, 15, 16] are powerful, they are not suitable as base features for MulT.

With these considerations, we select and compare the features extracted from pretrained FaceNet [17], pretrained ResNet [18] and OpenFace Action Unit intensities [19] for the visual modality. For the acoustic modality, we compare the Mel-scale spectrogram, the linear-scale spectrogram and features extracted by TRILL [20]. To extract features from FaceNet and ResNet for a given video, we extract frames from the video at a constant rate and crop the face regions before feeding them into the pretrained models. Because TRILL only provides one vector representation for each input acoustic sequence, we split the input sequence into segments that match a specified frame rate and extract the representations of the segments. To make the comparison fair, we feed all extracted features to the same model (a single-layer Gated Recurrent Unit with a hidden size of 512 and a dropout ratio of 0.2, with fixed initialization) for emotion classification (CREMA-D dataset) and continuous emotion estimation (MSP-IMPROV dataset). We find that the OpenFace and TRILL features outperform the other baseline features by a considerable margin on both datasets. On the CREMA-D dataset, OpenFace shows a gain of 17% in accuracy over the ResNet representations, and TRILL shows a gain of more than 11% in accuracy over the linear-scale spectrogram. On the MSP-IMPROV dataset, TRILL outperforms the linear-scale spectrogram by a CCC margin of at least 0.03, while OpenFace outperforms the second-best baseline by a CCC margin of at least 0.14.
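The fixed probe used for this comparison could be sketched as follows; the GRU hyperparameters match those stated above, while the output head layout is an assumption for illustration.

```python
import torch
import torch.nn as nn

class GRUProbe(nn.Module):
    """Single-layer GRU probe (hidden size 512, dropout 0.2) for feature comparison."""

    def __init__(self, input_dim: int, num_outputs: int, hidden: int = 512):
        super().__init__()
        self.gru = nn.GRU(input_dim, hidden, num_layers=1, batch_first=True)
        self.dropout = nn.Dropout(0.2)
        self.head = nn.Linear(hidden, num_outputs)  # e.g. 6 classes (CREMA-D) or
                                                    # 2 regression targets (MSP-IMPROV)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, input_dim) sequence of frame-level features
        _, h = self.gru(x)                          # h: (1, B, hidden), last hidden state
        return self.head(self.dropout(h[-1]))
```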

3 Experiments

3.1 Data

VoxCeleb2 We use the VoxCeleb2 dataset for pretraining [1]. It contains more than 1M utterances from more than 6,000 celebrities, collected from around 150K YouTube videos. The dataset is fairly gender balanced (61% of the speakers are men). For the acoustic modality, we first segment the audio into 200 ms segments before feeding them into TRILL [20] for feature extraction. Because TRILL is designed to provide a single embedding for an audio input as a whole, we do not extract features at a smaller segment duration. For the visual modality, we use OpenFace 2.0 [19] to track 17 Facial Action Unit (AU) intensities from the videos at 30 FPS. Since the video quality of the VoxCeleb2 dataset varies widely, we remove frames with a detection confidence below 80%. We then downsample the OpenFace outputs to 5 FPS to match the frame rate of the acoustic modality. We remove utterance samples whose audio and video features are misaligned by more than 1 second (a difference of more than 5 frames). Although MulT can handle unaligned multimodal sequences, the model achieves better performance with aligned sequences. We end up with a training dataset of 524K utterances from about 4K speakers (the average utterance duration is 5 s with a standard deviation of 0.7 s).
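A minimal sketch of this preprocessing, with the thresholds taken from the text (80% confidence, 30 FPS to 5 FPS downsampling, 1-second misalignment check); function and variable names are our own and the exact ordering of steps is an assumption.

```python
import numpy as np

VIDEO_FPS_RAW = 30
TARGET_FPS = 5      # matches the 200 ms TRILL segments (5 feature frames per second)

def preprocess_utterance(au_intensities, au_confidence, trill_features):
    # au_intensities: (T_video, 17) OpenFace AU intensities at 30 FPS
    # au_confidence:  (T_video,) face detection confidence per frame
    # trill_features: (T_audio, D) TRILL embeddings, one per 200 ms segment
    keep = au_confidence >= 0.8                     # drop low-confidence frames
    visual = au_intensities[keep]
    visual = visual[::VIDEO_FPS_RAW // TARGET_FPS]  # downsample 30 FPS -> 5 FPS

    # discard the utterance if the streams disagree by more than 1 s (5 frames)
    if abs(len(visual) - len(trill_features)) > 5:
        return None
    T = min(len(visual), len(trill_features))
    return visual[:T], trill_features[:T]
```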

CREMA-D The CREMA-D dataset is an acted audiovisual database covering 6 basic emotional states (happy, sad, anger, fear, disgust and neutral). It includes 7,442 video clips from 91 actors speaking 12 sentences with different emotions. The emotion labels are collected through crowd-sourcing from 2,443 raters, and the human recognition accuracy of the intended emotion is 63.6%. The emotion classes in the dataset are balanced. In this study, we perform a speaker-independent split of the CREMA-D dataset into train, validation and test sets with a ratio of 60%-20%-20%.

MSP-IMPROV The MSP-IMPROV dataset is an acted audiovisual database that includes emotional interactions between people in a dyadic conversational setting. The conversation scenarios are designed to invoke realistic emotions. The dataset consists of 8,450 video recordings collected during 6 dyadic sessions with 12 actors. The annotations are collected via crowd-sourcing, and each video is annotated by at least 5 evaluators. The annotation includes the emotional content and five-point Likert-like scales for valence (1-negative to 5-positive), arousal (1-excited to 5-calm) and dominance (1-weak to 5-strong). In this study, we focus on the continuous emotion regression task of estimating the values of Valence and Arousal for a given video. We use Sessions 1-4 as the training set, Session 5 as the validation set and Session 6 as the test set.

Model        | CREMA-D Accu. ↑ | Arousal MAE ↓ | Arousal CCC ↑ | Valence MAE ↓ | Valence CCC ↑
TFN [21]     | 63.09           | 0.466         | 0.581         | 0.596         | 0.592
EF-GRU       | 57.06           | 0.676         | 0.399         | 0.774         | 0.478
LF-GRU       | 58.53           | 0.496         | 0.546         | 0.619         | 0.579
MulT WOP     | 63.93           | 0.466         | 0.665         | 0.580         | 0.607
MulT BASE    | 68.87           | 0.456         | 0.697         | 0.576         | 0.658
MulT LARGE   | 70.22           | 0.431         | 0.693         | 0.563         | 0.692
Table 1: Comparison between the performances of different models on CREMA-D (accuracy) and MSP-IMPROV (Arousal and Valence regression). WOP stands for without pretraining.
Fig. 2: Performance of the models with restricted data.

3.2 Pretraining implementation details

Following prior work on pretraining Transformers [2, 3], we implement the pretraining task with two model settings: BASE and LARGE. For both configurations, we set the number of attention heads to 12, the number of consecutive frames for masking to 3 (∼0.6 sec) and the length of each processed sequence to 50 frames (∼10 sec).

The hidden size for each of the audio and visual modalities is 288 (BASE) or 576 (LARGE). The size of the feed-forward layer in each cross-modal attention block is 1152 (BASE) or 1536 (LARGE), and the size of the feed-forward layer in each self-attention block is 2304 (BASE) or 3072 (LARGE). The BASE configuration has 6 $A \rightarrow V$ cross-modal attention blocks, 6 $V \rightarrow A$ cross-modal attention blocks and 6 self-attention blocks, which sums up to 38.3M parameters. The LARGE configuration has 8 $A \rightarrow V$ cross-modal attention blocks, 8 $V \rightarrow A$ cross-modal attention blocks and 8 self-attention blocks, totaling 89.2M parameters. We train both models with the Adam optimizer [22]. The learning rate is set to 5e-4, with a linear learning rate scheduler and a warmup portion of 0.1. Both models are trained with a batch size of 64 for 30 epochs.
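For reference, the two configurations and the training hyperparameters can be summarized as plain Python dictionaries (the values follow the text; the key names are our own):

```python
# Model configurations (per modality hidden sizes; layer counts are per block type)
MULT_CONFIGS = {
    "BASE": dict(hidden_per_modality=288, crossmodal_ff=1152, selfattn_ff=2304,
                 crossmodal_layers=6, selfattn_layers=6, heads=12, params="38.3M"),
    "LARGE": dict(hidden_per_modality=576, crossmodal_ff=1536, selfattn_ff=3072,
                  crossmodal_layers=8, selfattn_layers=8, heads=12, params="89.2M"),
}

# Shared pretraining hyperparameters
PRETRAINING = dict(optimizer="Adam", lr=5e-4, schedule="linear", warmup=0.1,
                   batch_size=64, epochs=30, mask_span=3, seq_len=50)
```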

3.3 Application on downstream task

For fine-tuning, the last elements of the outputs of the pretrained MulT are passed to a residual block followed by a fully-connected layer to make the final predictions. We compare the performance of the fine-tuned MulT to 4 baseline models: an Early Fusion GRU (EF-GRU), a Late Fusion GRU (LF-GRU), the Tensor Fusion Network (TFN) [21] and the Multimodal Transformer without pretrained weight initialization. It is important to note that TFN only processes static inputs, i.e., each modality of a sample is represented by a single vector. We nevertheless include it as a baseline because TRILL was originally developed to represent an audio input as a whole with one vector [20]. Hence, for each video, we use the vector representation extracted from TRILL for the acoustic modality, along with the average of the 17 OpenFace AU intensities for the visual modality, as inputs to TFN.
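A minimal sketch of this fine-tuning head is shown below; the internal layout of the residual block is an assumption, since it is not specified above.

```python
import torch
import torch.nn as nn

class FineTuneHead(nn.Module):
    """Residual block plus fully-connected layer on the last MulT output element."""

    def __init__(self, dim: int, num_outputs: int):
        super().__init__()
        self.residual = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.out = nn.Linear(dim, num_outputs)

    def forward(self, mult_outputs: torch.Tensor) -> torch.Tensor:
        # mult_outputs: (B, T, dim) sequence produced by the pretrained MulT
        h = mult_outputs[:, -1, :]      # last element of the output sequence
        h = h + self.residual(h)        # residual block
        return self.out(h)              # class logits or (arousal, valence) estimates
```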

To make the comparisons fair for EF-GRU and LF-GRU, we make them bidirectional and choose the hidden size and number of layers such that their number of parameters approximately matches that of the BASE configuration of MulT. For MulT without pretrained weights, we run experiments with both the BASE and LARGE configurations and report the better-performing configuration based on the validation set. We train all of the models until early stopping occurs on the validation set.

Table 1 shows the performance of different models on the CREMA-D and MSP-IMPROV datasets. Since CREMA-D’s classes are balanced, we use accuracy as our evaluation metric. Following [23, 24, 25], we report the Mean Absolute Error (MAE) and the Concordance Correlation Coefficients (CCC) to assess the quality of the regression models on the MSP-IMPROV dataset.
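For completeness, the CCC reported here follows its standard definition; a small sketch of the computation, not tied to any particular toolkit, is given below.

```python
import numpy as np

def ccc(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Concordance Correlation Coefficient between ground truth and predictions."""
    mu_t, mu_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    cov = np.mean((y_true - mu_t) * (y_pred - mu_p))
    # CCC = 2*cov / (var_t + var_p + (mu_t - mu_p)^2)
    return 2 * cov / (var_t + var_p + (mu_t - mu_p) ** 2)
```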

The fine-tuned models outperform the baseline models by a considerable margin. For emotion recognition accuracy, we see a 5% improvement for the BASE model and a 7% improvement for the LARGE model in comparison with the baselines. On the MSP-IMPROV dataset, the fine-tuned models also show improvements over the baselines for both Arousal and Valence regression. Specifically, fine-tuning the BASE model achieves gains of 3.2% and 5.1% in CCC for Arousal and Valence regression, respectively. Although there might be discrepancies between train-validation-test splits, we find our best results (an accuracy of 70.22% on CREMA-D, and CCCs of 0.697 and 0.692 for MSP-IMPROV Arousal and Valence regression) competitive with existing benchmarks on CREMA-D [26, 27, 28] and MSP-IMPROV [29, 24].

3.4 Limited resource setting

Since the ultimate motivation of transfer learning is to reduce the requirement for labeled data, we are interested in exploring the capability of the pretrained MulT in a limited-resource setting. Figure 2 shows the performance of the models when only N% of the original training set is used for training. The performance drop curves of the pretrained models are less steep than those of MulT trained from scratch and of TFN. With only 10% of the original training set (fewer than 500 training samples on both datasets), fine-tuning the pretrained models outperforms training from scratch by at least 10% in emotion recognition accuracy, and by more than 20% and 10% in CCC for Arousal and Valence regression, respectively. This further suggests that the pretrained weights help prevent overfitting with limited data. We also note that TFN tends to perform better than MulT trained from scratch with small training sets, which is expected because lighter models tend to be less susceptible to overfitting than complex ones when data are limited.

4 Conclusion

In this study, we demonstrate the potential of pretraining the Multimodal Transformer architecture [14] to model human communicative behaviors. We validate the usefulness of the pretrained model for the task of emotion recognition on two datasets, and show its robustness in a low-resource setting. In the future, we will explore the performance of the model in other domains related to communication, such as mental health assessment.

Acknowledgment

Research was sponsored by the Army Research Office and was accomplished under Cooperative Agreement Number W911NF-20-2-0053. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Office or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.

References

  • [1] Joon Son Chung et al., “Voxceleb2: Deep speaker recognition,” arXiv preprint arXiv:1806.05622, 2018.
  • [2] Jacob Devlin et al., “Bert: Pre-training of deep bidirectional transformers for language understanding,” in NAACL, 2019, pp. 4171–4186.
  • [3] Andy T Liu et al., “Mockingjay: Unsupervised speech representation learning with deep bidirectional transformer encoders,” in ICASSP, 2020, pp. 6419–6423.
  • [4] Ashish Vaswani et al., “Attention is all you need,” in NeurIPS, 2017, pp. 5998–6008.
  • [5] Luowei Zhou et al., “Unified vision-language pre-training for image captioning and vqa,” in AAAI, 2020, vol. 34, pp. 13041–13049.
  • [6] Linchao Zhu and Yi Yang, “Actbert: Learning global-local video-text representations,” in CVPR, 2020, pp. 8746–8755.
  • [7] Hao Tan and Mohit Bansal, “Lxmert: Learning cross-modality encoder representations from transformers,” arXiv preprint arXiv:1908.07490, 2019.
  • [8] Jiasen Lu et al., “Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks,” NeurIPS, vol. 32, pp. 13–23, 2019.
  • [9] Sangho Lee et al., “Parameter efficient multimodal transformers for video representation learning,” in ICLR, 2020.
  • [10] Joao Carreira et al., “A short note on the kinetics-700 human action dataset,” arXiv preprint arXiv:1907.06987, 2019.
  • [11] Jort F Gemmeke et al., “Audio set: An ontology and human-labeled dataset for audio events,” in ICASSP, 2017, pp. 776–780.
  • [12] Houwei Cao et al., “Crema-d: Crowd-sourced emotional multimodal actors dataset,” IEEE Trans. Affective Computing, 2014.
  • [13] Carlos Busso et al., “Msp-improv: An acted corpus of dyadic interactions to study emotion perception,” IEEE Trans. Affective Computing, 2016.
  • [14] Yao-Hung Hubert Tsai et al., “Multimodal transformer for unaligned multimodal language sequences,” in ACL. NIH Public Access, 2019, vol. 2019, p. 6558.
  • [15] Andy T Liu et al., “Tera: Self-supervised learning of transformer encoder representation for speech,” IEEE/ACM Trans. Audio, Speech, Language Process., vol. 29, pp. 2351–2366, 2021.
  • [16] Po-Han Chi et al., “Audio albert: A lite bert for self-supervised learning of audio representation,” in 2021 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2021, pp. 344–350.
  • [17] Florian Schroff, Dmitry Kalenichenko, and James Philbin, “Facenet: A unified embedding for face recognition and clustering,” in CVPR, 2015, pp. 815–823.
  • [18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in CVPR, 2016, pp. 770–778.
  • [19] Tadas Baltrusaitis et al., “Openface 2.0: Facial behavior analysis toolkit,” in IEEE FG, 2018, pp. 59–66.
  • [20] Joel Shor et al., “Towards learning a universal non-semantic representation of speech,” arXiv preprint arXiv:2002.12764, 2020.
  • [21] Amir Zadeh et al., “Tensor fusion network for multimodal sentiment analysis,” in EMNLP, 2017, pp. 1103–1114.
  • [22] Diederik P Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [23] Gao-Yi Chao et al., “Enforcing semantic consistency for cross corpus valence regression from speech using adversarial discrepancy learning.,” in INTERSPEECH, 2019, pp. 1681–1685.
  • [24] Bagus Tris Atmaja and Masato Akagi, “Deep multilayer perceptrons for dimensional speech emotion recognition,” in APSIPA ASC. IEEE, 2020, pp. 325–331.
  • [25] Srinivas Parthasarathy and Carlos Busso, “Jointly predicting arousal, valence and dominance with multi-task learning.,” in Interspeech, 2017, vol. 2017, pp. 1103–1107.
  • [26] Esam Ghaleb et al., “Metric learning-based multimodal audio-visual emotion recognition,” IEEE Multimedia, vol. 27, pp. 37–48, 2019.
  • [27] Andreea Birhala et al., “Temporal aggregation of audio-visual modalities for emotion recognition,” in Int’l Conf. Telecommunications and Signal Processing (TSP), 2020, pp. 305–308.
  • [28] Esam Ghaleb et al., “Multimodal attention-mechanism for temporal emotion recognition,” in ICIP. IEEE, 2020, pp. 251–255.
  • [29] Bagus Tris Atmaja and Masato Akagi, “Multitask learning and multistage fusion for dimensional audiovisual emotion recognition,” in ICASSP, 2020, pp. 4482–4486.