

Chonnam National University

An Ensemble Approach for Multiple Emotion Descriptors Estimation Using Multi-task Learning

Irfan Haider    Minh-Trieu Tran    Soo-Hyung Kim    Hyung-Jeong Yang    Guee-Sang Lee
Abstract

This paper describes our submission to the fourth Affective Behavior Analysis in-the-Wild (ABAW) Competition, specifically the Multi-Task Learning (MTL) Challenge. Instead of using face information alone, we employ the full information in the provided dataset, namely the face and the context around it. We use the Inception V3 model to extract deep features, refine them with an attention mechanism, and then feed them into a transformer block and multi-layer perceptron networks to obtain the final emotion descriptors. Our model simultaneously predicts arousal and valence, classifies the emotional expression, and estimates the action units. The proposed system achieves a performance of 0.917 on the MTL Challenge validation dataset.

Keywords:
Multi-Task Learning, Emotion Recognition, Deep Learning, Context Information, Spatial Attention

1 Introduction

Emotion recognition plays an important role in many research areas because of its applications in fields such as autonomous driving and medicine. It is also significant for the study of human affective behaviour and has been studied extensively alongside the development of big data and deep learning technologies. Emotion can be detected in several ways, for example from images, videos, speech, or bio-signals. Facial expressions alone are often not sufficient to recognise emotion from images or videos; scene context is also effective beyond the facial features alone[9]. Likewise, people recognize human emotions not only from facial features but also from the surroundings, such as movement, communication with others, and location[2][1].

Traditional recognition studies focused only on facial features and ignored other information, i.e., the contextual information that is very effective for recognising human emotion accurately. Previously, due to a lack of large amounts of data collected in real scenarios, research focused on controlled settings[12]. However, digital networks and social media have since become broadly used, and a huge amount of data has become available.

Human emotion has also been a major focus of psychology. The most common type of emotion representation is the categorical one, which includes the seven basic categories of anger, disgust, fear, happiness, sadness, surprise, and neutrality[6].

Emotion recognition research is booming in computer vision. With continuous research in this field and advances in deep learning, it is gaining more and more attention, as reflected in the growing number of datasets such as Aff-Wild[29][15], s-Aff-Wild2[21][14][20][19][17][13], and Aff-Wild2[18].

Several problems related to emotion recognition were addressed in the 3rd Affective Behavior Analysis in-the-Wild (ABAW) Competition, held in conjunction with the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022. The 3rd ABAW comprised the Valence-Arousal (VA) Estimation Challenge, the Expression (Expr) Classification Challenge, the Action Unit (AU) Detection Challenge, and the Multi-Task Learning (MTL) Challenge. Many teams participate in every edition of the ABAW[11] challenge to improve the performance of human emotion models in the real world.

2 Related Work

For many years, researchers have studied human emotion recognition using various representations of human emotion, such as basic facial expressions, action units, and valence-arousal. Most researchers have focused on facial features to recognize human emotion[9][24]. Some methods are based on the Facial Action Coding System[7][8], which uses a set of local facial postures to encode facial expression. The majority of studies have depended on analyzing human faces, which restricts their ability to exploit semantic features for emotion recognition in the wild[16].

Many researchers have investigated how to improve emotion recognition performance using different inputs: some used only images[4][27][28], while others combined audio data[5][23] with images. Different studies have used different approaches[10] for classifying and recognising human emotions. Deep convolutional neural networks (CNNs) are used in many fields of research[22][25] and have made a particularly large contribution to computer vision. Following their strong performance[30] in computer vision, CNNs have also been investigated for classifying facial emotions[26].

In our study we use a CNN because of its strong performance; specifically, we use the Inception-V3[31] model, a deep CNN architecture proposed by Szegedy et al. in the context of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). Its goal was to improve computational efficiency and reduce the number of parameters. Inception-V3 is faster than VGG and builds on the original GoogLeNet (Inception) architecture. The first-ranked team of last year's ABAW challenge also used the same model in their work and achieved good results[3].

Figure 1: Illustration of our proposed architecture.

3 Proposed Method

Our proposed architecture is shown in Figure 1. We employ two kinds of input from the provided dataset: cropped and cropped-aligned images. The cropped images contain only the face, while the cropped-aligned images contain the face together with the surrounding context. Building on the first-ranked method of last year[3], we take that model and add some modifications: instead of using only the cropped-aligned images, we exploit the full information in the provided dataset. Our approach consists of three main parts: feature extraction, an attention mechanism, and the classifiers.

In the first part, we use Inception V3 to extract features from the input image. Inception V3 is a deep network for image recognition; it contains symmetric and asymmetric building blocks with convolutions and pooling layers. The model achieves more than 78% accuracy on the ImageNet dataset. The feature extractor produces a feature map of size 768 × 17 × 17.
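As a rough sketch of this step (not the authors' exact code), the 768 × 17 × 17 feature map can be read out of a standard torchvision Inception V3 by hooking the Mixed_6e block; the pretrained weights, the hook placement, and the 299 × 299 input size are assumptions based on the common torchvision implementation rather than details stated in the paper.

```python
import torch
from torchvision.models import inception_v3, Inception_V3_Weights

# ImageNet-pretrained Inception V3 backbone (pretrained weights assumed;
# the paper only states that Inception V3 is the feature extractor).
backbone = inception_v3(weights=Inception_V3_Weights.IMAGENET1K_V1).eval()

# Capture the output of the Mixed_6e block with a forward hook; for a
# 299 x 299 input this block produces the 768 x 17 x 17 map cited above.
captured = {}
backbone.Mixed_6e.register_forward_hook(
    lambda module, inputs, output: captured.update(feat=output)
)

x = torch.randn(1, 3, 299, 299)  # one dummy RGB frame
with torch.no_grad():
    backbone(x)
print(captured["feat"].shape)    # torch.Size([1, 768, 17, 17])
```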

The second part consists of the attention modules that help the model focus on meaningful regions. After obtaining the feature maps, we pass them through the attention module to learn spatial attention for each region. By combining the attention maps with the features, we obtain an embedding feature vector for each region.
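The paper does not specify the exact attention formulation, so the following is only a minimal sketch of one common spatial-attention variant under that assumption: a 1 × 1 convolution scores every spatial position for each attention map, and the softmax-weighted sum over positions yields one embedding vector per region. The layer sizes and the number of maps are hypothetical.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Sketch of a spatial attention module producing per-region embeddings.

    Hypothetical design: a 1x1 convolution predicts `num_maps` attention maps
    over the 17x17 grid, and each map pools the backbone features into one
    768-dimensional region embedding.
    """

    def __init__(self, in_channels: int = 768, num_maps: int = 12):
        super().__init__()
        self.score = nn.Conv2d(in_channels, num_maps, kernel_size=1)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, 768, 17, 17) from the Inception V3 extractor.
        attn = self.score(feat).flatten(2).softmax(dim=-1)  # (B, M, H*W)
        regions = feat.flatten(2)                            # (B, C, H*W)
        # Attention-weighted sum over spatial positions -> one vector per map.
        return torch.einsum("bmk,bck->bmc", attn, regions)   # (B, M, C)

emb = SpatialAttention()(torch.randn(2, 768, 17, 17))
print(emb.shape)  # torch.Size([2, 12, 768])
```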

The final part includes a transformer block and multi-layer perceptron networks. We apply the transformer to learn the correlations between the different regions related to the different action units. The output features from the transformer block are fed into the action unit classifiers; note that each action unit has a separate classifier. In addition to the action unit classifiers, we propose a valence/arousal predictor and an expression classifier. The valence/arousal predictor uses two fully connected layers, and the expression classifier is a multi-layer perceptron with three fully connected layers.
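A skeleton of these prediction heads might look as follows. Only the overall structure follows the text (a transformer block, one classifier per action unit, a two-layer valence/arousal predictor, and a three-layer expression MLP); the hidden sizes, the number of transformer layers, the pooling, and the class counts are assumptions.

```python
import torch
import torch.nn as nn

class MultiTaskHeads(nn.Module):
    """Sketch of the transformer block and the three task-specific heads."""

    def __init__(self, dim: int = 768, num_aus: int = 12, num_expr: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=1)
        # One separate binary classifier per action unit.
        self.au_heads = nn.ModuleList(nn.Linear(dim, 1) for _ in range(num_aus))
        # Valence/arousal predictor: two fully connected layers.
        self.va_head = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 2))
        # Expression classifier: MLP with three fully connected layers.
        self.expr_head = nn.Sequential(
            nn.Linear(dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, num_expr),
        )

    def forward(self, regions: torch.Tensor):
        # regions: (B, num_aus, dim) embeddings from the attention module.
        tokens = self.transformer(regions)
        au_logits = torch.cat(
            [head(tokens[:, i]) for i, head in enumerate(self.au_heads)], dim=1)
        pooled = tokens.mean(dim=1)  # assumed global summary for VA / EXPR
        return au_logits, self.va_head(pooled), self.expr_head(pooled)

au, va, expr = MultiTaskHeads()(torch.randn(2, 12, 768))
print(au.shape, va.shape, expr.shape)  # (2, 12) (2, 2) (2, 8)
```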

During training, we tried multiple types of loss functions. In particular, for action unit prediction and emotional expression classification we use cross-entropy losses, while for the arousal/valence prediction task a negative Concordance Correlation Coefficient (CCC) is employed as the loss. The total loss is the sum of the three elemental losses.
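A minimal sketch of this loss combination, assuming binary cross-entropy for the per-AU classifiers and the common 1 − CCC form of the negative-CCC loss, is given below; the exact weighting and implementation details are not specified in the paper.

```python
import torch
import torch.nn.functional as F

def ccc(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Concordance Correlation Coefficient of two 1-D batches."""
    pm, tm = pred.mean(), target.mean()
    pv, tv = pred.var(unbiased=False), target.var(unbiased=False)
    cov = ((pred - pm) * (target - tm)).mean()
    return 2 * cov / (pv + tv + (pm - tm) ** 2 + eps)

def total_loss(au_logits, va_pred, expr_logits, au_labels, va_labels, expr_labels):
    # Cross-entropy terms: binary per-AU decisions and 8-way expression classes.
    loss_au = F.binary_cross_entropy_with_logits(au_logits, au_labels.float())
    loss_expr = F.cross_entropy(expr_logits, expr_labels)
    # Negative CCC (here the common 1 - CCC form) for valence and arousal.
    loss_va = (1 - ccc(va_pred[:, 0], va_labels[:, 0])) + \
              (1 - ccc(va_pred[:, 1], va_labels[:, 1]))
    return loss_au + loss_expr + loss_va

# Smoke test with random tensors: batch of 4, 12 AUs, 8 expression classes.
au_logits, va_pred, expr_logits = torch.randn(4, 12), torch.randn(4, 2), torch.randn(4, 8)
au_labels = torch.randint(0, 2, (4, 12))
va_labels, expr_labels = torch.rand(4, 2) * 2 - 1, torch.randint(0, 8, (4,))
print(total_loss(au_logits, va_pred, expr_logits, au_labels, va_labels, expr_labels))
```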

We use several evaluation metrics in this paper. First, the F1 score is used to measure the performance of the action unit prediction and emotional expression classification tasks. Second, the performance of valence/arousal prediction is measured by the CCC metric. The total metric for multi-task prediction is a combination of the F1 scores and the CCC values.
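The text only says that the total metric combines F1 and CCC; summing the mean valence/arousal CCC with the macro F1 of expressions and the mean per-AU F1 reproduces the totals reported later (e.g. (0.1597 + 0.0754)/2 + 0.3055 + 0.4939 ≈ 0.917), so the sketch below assumes that combination.

```python
import numpy as np
from sklearn.metrics import f1_score

def ccc_np(pred: np.ndarray, target: np.ndarray) -> float:
    cov = np.mean((pred - pred.mean()) * (target - target.mean()))
    return 2 * cov / (pred.var() + target.var() + (pred.mean() - target.mean()) ** 2)

def mtl_metric(va_pred, va_true, expr_pred, expr_true, au_pred, au_true) -> float:
    """Assumed combined score: mean VA CCC + macro F1 (EXPR) + mean per-AU F1."""
    ccc_va = 0.5 * (ccc_np(va_pred[:, 0], va_true[:, 0]) +
                    ccc_np(va_pred[:, 1], va_true[:, 1]))
    f1_expr = f1_score(expr_true, expr_pred, average="macro")
    f1_au = np.mean([f1_score(au_true[:, i], au_pred[:, i])   # binary F1 per AU
                     for i in range(au_true.shape[1])])
    return ccc_va + f1_expr + f1_au
```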

4 Experimental Results

The proposed method is trained on a system with a single Nvidia RTX 3090 GPU (24 GB of memory) using the SGD optimizer. We train the model for 50 epochs with a batch size of 24 and a learning rate of 0.001. Note that we train for the same number of epochs as last year's first-ranked method for a fair comparison.
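For illustration, the stated hyper-parameters map onto a standard PyTorch training setup as sketched below; the momentum value, the placeholder model, and the dummy data are assumptions, since the paper only specifies the optimizer, epoch count, batch size, and learning rate.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Hyper-parameters reported in the paper; the momentum value is an assumption.
EPOCHS, BATCH_SIZE, LR = 50, 24, 1e-3

model = nn.Linear(768, 2)  # placeholder for the full network described above
optimizer = torch.optim.SGD(model.parameters(), lr=LR, momentum=0.9)

# Dummy data standing in for the MTL training split.
data = TensorDataset(torch.randn(240, 768), torch.randn(240, 2))
loader = DataLoader(data, batch_size=BATCH_SIZE, shuffle=True)

for epoch in range(EPOCHS):
    for x, y in loader:
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)  # stand-in for the total loss
        loss.backward()
        optimizer.step()
```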

Table 1 shows the comparison between our proposed method and the top-ranking ABAW3 method on the EXPR classification task. The results demonstrate two things.

First, the context surrounding the face carries information that is useful for the EXPR classification task beyond the face alone. The same holds for the AU classification and Valence prediction tasks, shown in Tables 2 and 3, respectively. However, for the Arousal prediction task (Table 4), information coming from the face only appears more meaningful than the combination of face and surrounding context.

Second, the results show that the ensemble approach gives better results than using either separate model alone.
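The paper does not describe how the two models are combined, so the sketch below is purely illustrative: it assumes the ensemble simply averages the per-task outputs of the face-only and face-plus-context models.

```python
import torch

def ensemble_outputs(out_cropped: dict, out_aligned: dict) -> dict:
    """Average the per-task predictions of the two single-input models.

    Hypothetical combination rule; the paper only reports that the ensemble
    outperforms either model on its own.
    """
    return {task: 0.5 * (out_cropped[task] + out_aligned[task])
            for task in out_cropped}

out_face = {"va": torch.randn(4, 2), "expr": torch.randn(4, 8), "au": torch.randn(4, 12)}
out_ctx = {"va": torch.randn(4, 2), "expr": torch.randn(4, 8), "au": torch.randn(4, 12)}
fused = ensemble_outputs(out_face, out_ctx)
print({k: v.shape for k, v in fused.items()})
```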

The training procedure is presented in Figures 2-6. Figure 2 illustrates the accuracy of action unit prediction during training. Figure 3 illustrates the F1 score of expression classification during training. Figures 4 and 5 illustrate the Concordance Correlation Coefficient for arousal and valence prediction during training. Figure 6 presents the loss reduction during training. In these figures, the orange line indicates training on the face only (cropped images), the blue line indicates training on the face and surrounding context (cropped-aligned images), and the grey line indicates training with the source code released by last year's first-ranked team on the face and surrounding context.

Figure 2: Accuracy of action unit prediction during training. The orange line indicates training on the face only (cropped images), the blue line indicates training on the face and surrounding context (cropped-aligned images), and the grey line indicates training with the source code released by last year's first-ranked team on the face and surrounding context.
Figure 3: F1 score of expression classification during training. The orange line indicates training on the face only (cropped images), the blue line indicates training on the face and surrounding context (cropped-aligned images), and the grey line indicates training with the source code released by last year's first-ranked team on the face and surrounding context.
Figure 4: Concordance Correlation Coefficient of arousal prediction during training. The orange line indicates training on the face only (cropped images), the blue line indicates training on the face and surrounding context (cropped-aligned images), and the grey line indicates training with the source code released by last year's first-ranked team on the face and surrounding context.
Figure 5: Concordance Correlation Coefficient of valence prediction during training. The orange line indicates training on the face only (cropped images), the blue line indicates training on the face and surrounding context (cropped-aligned images), and the grey line indicates training with the source code released by last year's first-ranked team on the face and surrounding context.
Figure 6: Loss reduction during training. The orange line indicates training on the face only (cropped images), the blue line indicates training on the face and surrounding context (cropped-aligned images), and the grey line indicates training with the source code released by last year's first-ranked team on the face and surrounding context.
Table 1: Comparison between our proposed method and the top-ranking ABAW3 method on the EXPR classification task.

Method                 | F1 Score
Last year 1st rank     | 0.1023
Ours (Cropped)         | 0.1692
Ours (Cropped Aligned) | 0.2194
Ours (Ensemble)        | 0.3055
Table 2: Comparison between our proposed method and the top-ranking ABAW3 method on the AU classification task.

Method                 | F1 Score
Last year 1st rank     | 0.4714
Ours (Cropped)         | 0.4856
Ours (Cropped Aligned) | 0.4860
Ours (Ensemble)        | 0.4939
Table 3: Comparison between our proposed method and the top-ranking ABAW3 method on the Valence prediction task.

Method                 | CCC
Last year 1st rank     | 0.0534
Ours (Cropped)         | 0.0520
Ours (Cropped Aligned) | 0.1091
Ours (Ensemble)        | 0.1597
Table 4: Comparison between our proposed method and the top-ranking ABAW3 method on the Arousal prediction task.

Method                 | CCC
Last year 1st rank     | 0.0504
Ours (Cropped)         | 0.0735
Ours (Cropped Aligned) | 0.0446
Ours (Ensemble)        | 0.0754
Table 5: Comparison between our proposed method, the baseline, and the top-ranking ABAW3 method on Multi-Task prediction.

Method             | Total Metric
Baseline           | 0.3000
Last year 1st rank | 0.6256
Ours               | 0.9170

5 Conclusion

This paper presents an ensemble approach for estimating multiple emotion descriptors using multi-task learning. Context is one of the important factors in improving the performance of emotion recognition tasks. Additionally, instead of using a single separate framework, the ensemble approach brings further benefits to the recognition tasks. Our code implementation is available at https://github.com/tmtvaa/abaw4.

6 Acknowledgement

This research was supported by the Bio & Medical Technology Development Program of the National Research Foundation (NRF) funded by the Korean government (MSIT) (NRF-2019M3E5D1A02067961) and by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2018R1D1A3B05049058 & NRF-2020R1A4A1019191).

References

  • [1] Aminoff, E.M., Kveraga, K., Bar, M.: The role of the parahippocampal cortex in cognition. Trends in cognitive sciences 17(8), 379–390 (2013)
  • [2] Barrett, L.F., Mesquita, B., Gendron, M.: Context in emotion perception. Current Directions in Psychological Science 20(5), 286–290 (2011)
  • [3] Deng, D.: Multiple emotion descriptors estimation at the abaw3 challenge. arXiv preprint arXiv:2203.12845 (2022)
  • [4] Deng, D., Chen, Z., Shi, B.E.: Multitask emotion recognition with incomplete labels. In: 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020). pp. 592–599. IEEE (2020)
  • [5] Deng, D., Wu, L., Shi, B.E.: Iterative distillation for better uncertainty estimates in multitask emotion recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3557–3566 (2021)
  • [6] Ekman, P.: Darwin, deception, and facial expression. Annals of the New York Academy of Sciences 1000(1), 205–221 (2003)
  • [7] Ekman, P., Friesen, W.V.: Facial action coding system. Environmental Psychology & Nonverbal Behavior (1978)
  • [8] Eleftheriadis, S., Rudovic, O., Pantic, M.: Discriminative shared gaussian processes for multiview and view-invariant facial expression recognition. IEEE transactions on image processing 24(1), 189–204 (2014)
  • [9] Fabian Benitez-Quiroz, C., Srinivasan, R., Martinez, A.M.: Emotionet: An accurate, real-time algorithm for the automatic annotation of a million facial expressions in the wild. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 5562–5570 (2016)
  • [10] Hasani, B., Mahoor, M.H.: Facial expression recognition using enhanced deep 3d convolutional neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops. pp. 30–40 (2017)
  • [11] Kollias, D.: Abaw: Learning from synthetic data & multi-task learning challenges. arXiv preprint arXiv:2207.01138 (2022)
  • [12] Kollias, D.: Abaw: Valence-arousal estimation, expression recognition, action unit detection & multi-task learning challenges. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2328–2336 (2022)
  • [13] Kollias, D., Cheng, S., Pantic, M., Zafeiriou, S.: Photorealistic facial synthesis in the dimensional affect space. In: Proceedings of the European Conference on Computer Vision (ECCV) Workshops. pp. 0–0 (2018)
  • [14] Kollias, D., Cheng, S., Ververas, E., Kotsia, I., Zafeiriou, S.: Deep neural network augmentation: Generating faces for affect analysis. International Journal of Computer Vision 128(5), 1455–1484 (2020)
  • [15] Kollias, D., Nicolaou, M.A., Kotsia, I., Zhao, G., Zafeiriou, S.: Recognition of affect in the wild using deep neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. pp. 26–33 (2017)
  • [16] Kollias, D., Sharmanska, V., Zafeiriou, S.: Distribution matching for heterogeneous multi-task learning: a large-scale face study. arXiv preprint arXiv:2105.03790 (2021)
  • [17] Kollias, D., Tzirakis, P., Nicolaou, M.A., Papaioannou, A., Zhao, G., Schuller, B., Kotsia, I., Zafeiriou, S.: Deep affect prediction in-the-wild: Aff-wild database and challenge, deep architectures, and beyond. International Journal of Computer Vision 127(6), 907–929 (2019)
  • [18] Kollias, D., Zafeiriou, S.: Aff-wild2: Extending the aff-wild database for affect recognition. arXiv preprint arXiv:1811.07770 (2018)
  • [19] Kollias, D., Zafeiriou, S.: Expression, affect, action unit recognition: Aff-wild2, multi-task learning and arcface. arXiv preprint arXiv:1910.04855 (2019)
  • [20] Kollias, D., Zafeiriou, S.: Va-stargan: Continuous affect generation. In: International Conference on Advanced Concepts for Intelligent Vision Systems. pp. 227–238. Springer (2020)
  • [21] Kollias, D., Zafeiriou, S.: Affect analysis in-the-wild: Valence-arousal, expressions, action units and a unified framework. arXiv preprint arXiv:2103.15792 (2021)
  • [22] Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems 25 (2012)
  • [23] Kuhnke, F., Rumberg, L., Ostermann, J.: Two-stream aural-visual affect analysis in the wild. In: 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020). pp. 600–605. IEEE (2020)
  • [24] Li, Y., Zeng, J., Shan, S., Chen, X.: Occlusion aware facial expression recognition using cnn with attention mechanism. IEEE Transactions on Image Processing 28(5), 2439–2450 (2018)
  • [25] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
  • [26] Singh, K.K., Xiao, F., Lee, Y.J.: Track and transfer: Watching videos to simulate strong human supervision for weakly-supervised object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3548–3556 (2016)
  • [27] Youoku, S., Toyoda, Y., Yamamoto, T., Saito, J., Kawamura, R., Mi, X., Murase, K.: A multi-term and multi-task analyzing framework for affective analysis in-the-wild. arXiv preprint arXiv:2009.13885 (2020)
  • [28] Youoku, S., Yamamoto, T., Saito, J., Uchida, A., Mi, X., Shi, Z., Liu, L., Liu, Z., Nakayama, O., Murase, K.: Multi-modal affect analysis using standardized data within subjects in the wild. arXiv preprint arXiv:2107.03009 (2021)
  • [29] Zafeiriou, S., Kollias, D., Nicolaou, M.A., Papaioannou, A., Zhao, G., Kotsia, I.: Aff-wild: valence and arousal 'in-the-wild' challenge. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops. pp. 34–41 (2017)
  • [30] Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep features for discriminative localization. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2921–2929 (2016)
  • [31] Zou, Q., Cao, Y., Li, Q., Huang, C., Wang, S.: Chronological classification of ancient paintings using appearance and shape features. Pattern Recognition Letters 49, 146–154 (2014)