2School of Artificial Intelligence, University of Chinese Academy of Sciences
3CAS Center for Excellence in Brain Science and Intelligence Technology
{sunhaiyang2021, lianzheng2016}@ia.ac.cn, {liubin, jhtao}@nlpr.ia.ac.cn
Two-Aspect Information Fusion Model For ABAW4 Multi-task Challenge
Abstract
In this paper, we present our solution to the Multi-Task Learning (MTL) Challenge of the 4th Affective Behavior Analysis in-the-wild (ABAW) competition. The MTL task is to predict frame-level emotion descriptors from videos: the discrete emotional state (expression), valence and arousal, and action units. Although researchers have proposed several approaches and achieved promising results in previous ABAW challenges, existing work rarely considers the interactions between different emotion descriptors. To this end, we propose a novel end-to-end architecture that fully integrates the different types of information. Experimental results demonstrate the effectiveness of our proposed solution.
Keywords:
ABAW4 Multi-task Challenge, information fusion, emotion recognition, action unit detection
1 Introduction
Automatic emotion recognition has long been an important task in affective computing. Ekman et al. [2] divided facial emotion descriptors into two categories: sign vehicles and messages. Sign vehicles are determined by facial movements and therefore align better with action unit (AU) detection. Messages are more concerned with the person observing the face or receiving the message, and therefore align better with facial expression (EXPR) and valence-arousal (VA) prediction.
Deng [1] assumed that estimates of EXPR and VA vary across observers, whereas AU annotations are easier to agree on. This suggests that sign vehicles and messages express two different aspects of emotion, so the author used separate modules to predict them. However, consider a smiling but frowning face: the intensity of frowning and smiling influences the activation of the action units, while the amplitude of the eyebrow and lip movements influences whether the expression is judged positive and how intense the emotional state is. We therefore assume that the predictions of sign vehicles and messages mutually influence each other.
In this paper, we present our solution to the ABAW competition. Specifically, we leverage the ROI Feature Extraction module of [1, 3] to capture facial emotion information. We then exploit the interactions between sign vehicles and messages to achieve better performance in the ABAW competition. Our main contributions are as follows:
1. We make the model compute two aspects of information through a two-way computation, representing the information of sign vehicles and messages.
2. We increase the information interoperability between these two aspects to better integrate them across multiple tasks.
3. Our method shows superior performance compared to the baseline model.
2 Methodology
Our model has three main components: ROI Feature Extraction, Interaction Module, and Temporal Smoothing Module. The overall framework is shown in Figure 1.

2.1 ROI Feature Extraction
The main function of ROI Feature Extraction is to encode the important information in each image. We adopt the same module as [1]. The feature extractor is an Inception V3 model: images are fed into it to generate feature maps. The feature maps then pass through three convolutional layers and a sigmoid activation to produce spatial attention maps over the model's regions of interest. The attention maps are fused with the feature maps, and separate encoding modules produce feature vectors for the different regions (shown as red tensors in Figure 1). More details about ROI Feature Extraction can be found in [3].
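To make the description concrete, here is a minimal PyTorch-style sketch of the region-of-interest attention idea, not the exact implementation of [1, 3]; the backbone channel count, intermediate channel sizes, number of regions U, and feature dimension D are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ROIAttention(nn.Module):
    """Illustrative sketch: per-region spatial attention over backbone feature maps."""
    def __init__(self, in_channels=768, num_regions=24, feat_dim=24):
        super().__init__()
        # Three conv layers + sigmoid produce one attention map per region of interest.
        self.attn = nn.Sequential(
            nn.Conv2d(in_channels, 128, kernel_size=1), nn.ReLU(),
            nn.Conv2d(128, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, num_regions, kernel_size=1), nn.Sigmoid(),
        )
        # A separate linear encoder per region turns pooled features into a D-dim vector.
        self.encoders = nn.ModuleList(
            [nn.Linear(in_channels, feat_dim) for _ in range(num_regions)]
        )

    def forward(self, fmap):                      # fmap: (B, C, H, W) from Inception V3
        maps = self.attn(fmap)                    # (B, U, H, W) attention maps
        feats = []
        for u, enc in enumerate(self.encoders):
            w = maps[:, u:u + 1]                  # (B, 1, H, W) attention map of region u
            pooled = (fmap * w).mean(dim=(2, 3))  # attention-weighted spatial pooling
            feats.append(enc(pooled))             # (B, D) region feature
        return torch.stack(feats, dim=1)          # (B, U, D): the "red tensors" in Fig. 1
```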
2.2 Interaction Module
The main function of the Interaction Module is to extract information from different semantic spaces and then, through information interaction operations, provide a more comprehensive representation of the emotion descriptors for each task. After encoding by the ROI Feature Extraction module, each image is transformed into feature vectors representing the information of the entire face. These representations are combined with positional encoding vectors and fed into two Transformer blocks to learn information from two perspectives. The outputs of the two blocks are concatenated into an overall representation (shown as blue tensors in Figure 1). We assign a separate trainable query vector to each task; each task performs a query operation on the overall representation and thereby integrates information from both perspectives.
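The following is a minimal sketch of how the two-perspective Transformer blocks and the per-task query operation could be wired together. The layer count, dimensions, and the use of a single shared cross-attention layer are assumptions for illustration, not the exact architecture.

```python
import torch
import torch.nn as nn

class InteractionModule(nn.Module):
    """Sketch: two Transformer 'perspectives' plus one learnable query per task."""
    def __init__(self, dim=24, num_regions=24, n_heads=4, ff_dim=1024,
                 tasks=("va", "expr", "au")):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(1, num_regions, dim))   # positional encoding
        make_layer = lambda: nn.TransformerEncoderLayer(
            d_model=dim, nhead=n_heads, dim_feedforward=ff_dim, batch_first=True)
        self.perspective_a = nn.TransformerEncoder(make_layer(), num_layers=1)
        self.perspective_b = nn.TransformerEncoder(make_layer(), num_layers=1)
        # One trainable query vector per task over the concatenated representation.
        self.queries = nn.ParameterDict(
            {t: nn.Parameter(torch.randn(1, 1, 2 * dim)) for t in tasks})
        self.cross_attn = nn.MultiheadAttention(2 * dim, num_heads=n_heads, batch_first=True)

    def forward(self, region_feats):              # (B, U, D) from ROI extraction
        x = region_feats + self.pos
        a = self.perspective_a(x)                 # first perspective
        b = self.perspective_b(x)                 # second perspective
        overall = torch.cat([a, b], dim=-1)       # (B, U, 2D): the "blue tensors" in Fig. 1
        out = {}
        for task, q in self.queries.items():
            q = q.expand(overall.size(0), -1, -1)
            fused, _ = self.cross_attn(q, overall, overall)   # task-specific query operation
            out[task] = fused.squeeze(1)          # (B, 2D) fused feature for this task
        return out
```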
2.3 Temporal Smoothing and Classifier Module
The main function of the Temporal Smoothing Module is to add temporal information to the feature vectors. Since the previous modules operate on single images, they cannot take the temporal structure of the video data into account. Therefore, we perform a temporal smoothing operation on the encoded features. After extracting the feature vectors of a video clip, the feature vector $f_t$ at time step $t$ is smoothed as:
$\tilde{f}_t = \alpha\,\tilde{f}_{t-1} + (1-\alpha)\,f_t$    (1)
where $\tilde{f}_t$ is the feature fed to the classifier at time $t$. Unlike [1], we apply the temporal smoothing operation during the training phase and assign a trainable $\alpha$ to each task. It is important to emphasize that we also set a trainable initial state for each task to handle the first time step. To better train this initial state, we discard videos with fewer than 10 frames. Finally, we feed the smoothed features into simple FC layers for classification or regression.
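A minimal sketch of Eq. (1) follows, assuming the smoothing coefficient is kept in (0, 1) via a sigmoid and that the trainable initial state stands in for the feature before the first frame; one such module per task would be instantiated.

```python
import torch
import torch.nn as nn

class TemporalSmoothing(nn.Module):
    """Sketch of Eq. (1): exponential smoothing with a trainable alpha and initial state."""
    def __init__(self, dim):
        super().__init__()
        self.alpha_logit = nn.Parameter(torch.zeros(1))   # sigmoid keeps alpha in (0, 1)
        self.init_state = nn.Parameter(torch.zeros(dim))  # trainable state before frame 1

    def forward(self, feats):                  # feats: (T, dim) features of one video clip
        alpha = torch.sigmoid(self.alpha_logit)
        prev = self.init_state
        smoothed = []
        for t in range(feats.size(0)):
            prev = alpha * prev + (1 - alpha) * feats[t]   # Eq. (1)
            smoothed.append(prev)
        return torch.stack(smoothed, dim=0)    # fed to the task-specific classifier
```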
2.4 Losses
We use the binary cross-entropy (BCE) loss for the AU detection task, computed as:
$\mathcal{L}_{AU}^{(i)} = -\left[\,y_i \log \hat{y}_i + (1-y_i)\log(1-\hat{y}_i)\,\right]$    (2)

$\mathcal{L}_{AU} = \frac{1}{12}\sum_{i=1}^{12} \mathcal{L}_{AU}^{(i)}$    (3)
where $\hat{y}_i$ denotes the model's prediction for the AU of class $i$, and $y_i$ denotes the corresponding ground-truth label.
For the EXPR task, we use the cross-entropy loss, with the following equations:
$\hat{z}_i = \frac{\exp(o_i)}{\sum_{j=1}^{8}\exp(o_j)}$    (4)

$\mathcal{L}_{EXPR} = -\sum_{i=1}^{8} z_i \log \hat{z}_i$    (5)
where $\hat{z}_i$ denotes the model's predicted probability for expression class $i$, and $z_i$ denotes the corresponding ground-truth label.
Finally, we use the negative Concordance Correlation Coefficient (CCC) as the loss for VA prediction:
$\mathcal{L}_{VA} = 1 - \frac{CCC_{V} + CCC_{A}}{2}$    (6)
The sum of the three loss values is used as the overall training loss of the multi-task model:
$\mathcal{L} = \mathcal{L}_{AU} + \mathcal{L}_{EXPR} + \mathcal{L}_{VA}$    (7)
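For clarity, here is a sketch of how the losses in Eqs. (2)-(7) could be computed in PyTorch; the exact reductions (e.g., averaging over AUs and batch) are assumptions consistent with the formulas above.

```python
import torch
import torch.nn.functional as F

def ccc(pred, gold, eps=1e-8):
    """Concordance Correlation Coefficient between two 1-D tensors."""
    mp, mg = pred.mean(), gold.mean()
    vp, vg = pred.var(unbiased=False), gold.var(unbiased=False)
    cov = ((pred - mp) * (gold - mg)).mean()
    return 2 * cov / (vp + vg + (mp - mg) ** 2 + eps)

def multitask_loss(au_logits, au_labels, expr_logits, expr_labels, va_pred, va_labels):
    # Eqs. (2)-(3): binary cross-entropy averaged over the 12 AUs (and the batch).
    l_au = F.binary_cross_entropy_with_logits(au_logits, au_labels.float())
    # Eqs. (4)-(5): softmax cross-entropy over the 8 expression classes.
    l_expr = F.cross_entropy(expr_logits, expr_labels)
    # Eq. (6): negative CCC averaged over valence and arousal.
    l_va = 1 - 0.5 * (ccc(va_pred[:, 0], va_labels[:, 0]) + ccc(va_pred[:, 1], va_labels[:, 1]))
    # Eq. (7): the overall loss is the sum of the three task losses.
    return l_au + l_expr + l_va
```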
3 Experiments
3.1 Datasets
The static dataset s-Aff-Wild2 [5, 6, 7, 8, 9, 10, 11, 12, 13, 14] for multi-task learning (MTL) [4] is provided by the ABAW4 competition. It contains only a subset of the frames of the Aff-Wild2 dataset, and each frame has corresponding AU, EXPR, and VA labels.
However, the dataset contains frames with invalid annotations for the EXPR, VA, or AU labels. In our experiments, to balance the number of samples across tasks, we did not use data with invalid labels during the training and validation phases.
3.2 Training Details
The aligned face images are provided by the Challenge; we resize them before feeding them into the model. In the ROI Feature Extraction module, we try different settings and choose an implementation with U = 24 and D = 24. The two Transformer blocks share the same architecture and contain 4 attention heads each; the hidden size of the feed-forward network in these blocks is set to 1024. We use the Adam optimizer and train for 100 epochs.
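As an illustration, the training setup could be assembled roughly as follows, reusing the module sketches above; the learning rate and the omitted classifier heads are assumptions, since the paper only specifies Adam and 100 epochs.

```python
import torch

# Illustrative assembly of the sketched modules (classifier heads omitted for brevity).
roi = ROIAttention(num_regions=24, feat_dim=24)          # from the ROI sketch above
interact = InteractionModule(dim=24, num_regions=24, n_heads=4, ff_dim=1024)
smoothers = {t: TemporalSmoothing(dim=48) for t in ("va", "expr", "au")}

params = (list(roi.parameters()) + list(interact.parameters())
          + [p for m in smoothers.values() for p in m.parameters()])
optimizer = torch.optim.Adam(params, lr=1e-4)            # assumed lr; 100 training epochs
```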
3.3 Evaluation Metric
We use the averaged F1 score of 12 AUs as the evaluation score for AU detection, the averaged macro F1 score as the evaluation score for EXPR prediction, and the CCC value as the evaluation score for the VA task.
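A sketch of the evaluation score follows, assuming (as in the ABAW4 protocol [4]) that the overall metric is the sum of the mean AU F1, the macro EXPR F1, and the mean CCC of valence and arousal.

```python
import numpy as np
from sklearn.metrics import f1_score

def ccc_np(x, y):
    """NumPy CCC, mirroring the training-loss definition."""
    vx, vy = x.var(), y.var()
    return 2 * np.cov(x, y, bias=True)[0, 1] / (vx + vy + (x.mean() - y.mean()) ** 2)

def overall_metric(au_pred, au_true, expr_pred, expr_true, va_pred, va_true):
    """Mean AU F1 + macro EXPR F1 + mean VA CCC (assumed summation)."""
    p_au = np.mean([f1_score(au_true[:, i], au_pred[:, i]) for i in range(12)])
    p_expr = f1_score(expr_true, expr_pred, average="macro")
    p_va = 0.5 * (ccc_np(va_pred[:, 0], va_true[:, 0]) + ccc_np(va_pred[:, 1], va_true[:, 1]))
    return p_au + p_expr + p_va
```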
4 Results
4.1 Comparison With Baseline Models
In the ABAW4 challenge, the official baseline model uses a pre-trained VGG16 network to extract features, followed by 22 simple output units for the multiple tasks (2 linear units for VA prediction, 8 softmax units for EXPR prediction, and 12 sigmoid units for AU prediction). The experimental results in Table 1 demonstrate the superiority of our model over the baseline.
4.2 Different Temporal Smoothing Methods
We also try adding a Bidirectional Temporal Smoothing operation (BTS) to each task, assigning a trainable initial state vector only to the forward pass, i.e., the reverse pass always starts from the penultimate time step. As shown in Table 1, BTS is not as effective as the Temporal Smoothing operation (TS).
4.3 Different Backbones
We try using the same structure as the Sign Vehicle Space and Message Space of SMM-EmotionNet [1] (SMM) to extract feature vectors separately before feeding them into the Temporal Smoothing and Classifier Module, but the results are not satisfactory.
We also try using only one Transformer block (Single Perspective) in the Interaction Module, and the performance degrades significantly compared to two Transformer blocks (Double Perspective). This indicates that the facial emotion descriptor information extracted by two Transformer blocks is more complete. The experimental results are shown in Table 1.
| Model | TS | BTS | SMM | Single Perspective | Double Perspective | Overall Metric |
|---|---|---|---|---|---|---|
| baseline | - | - | - | - | - | 0.30 |
| ours (U=17, D=16) | ✓ | | ✓ | | | 0.759 |
| ours (U=17, D=16) | | ✓ | | | ✓ | 0.815 |
| ours (U=17, D=16) | ✓ | | | ✓ | | 0.79 |
| ours (U=17, D=16) | ✓ | | | | ✓ | 0.825 |
| ours (U=24, D=24) | ✓ | | | | ✓ | 0.85 |
4.4 Different Semantic Information
To verify the effectiveness of the two Transformer blocks, we select some samples from the validation set for inference and use the t-SNE algorithm to reduce the dimensionality of the outputs of the two blocks for visualization. The results are shown in Figure 2.
The sample points of the two colors are clearly separated, indicating that the two Transformer blocks extract different types of perspective information.
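A sketch of this visualization procedure is given below, assuming the pooled outputs of the two Transformer blocks are available as arrays; the placeholder arrays, sample counts, and t-SNE settings are assumptions.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Placeholders: replace with the pooled outputs of the two Transformer blocks
# on the selected validation frames.
feats_a = np.random.randn(500, 24)
feats_b = np.random.randn(500, 24)

emb = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(
    np.concatenate([feats_a, feats_b], axis=0))
n = feats_a.shape[0]
plt.scatter(emb[:n, 0], emb[:n, 1], s=5, label="perspective A")
plt.scatter(emb[n:, 0], emb[n:, 1], s=5, label="perspective B")
plt.legend()
plt.show()
```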

4.5 Additional Results
The performance of our model on each task is shown in Table 2.
| Task | Valence | Arousal | Expression | Action Units |
|---|---|---|---|---|
| ours (U=24, D=24) | 0.41 | 0.62 | 0.207 | 0.385 |
5 Conclusions
In this paper, we introduce our method for the Multi-Task Learning Challenge of the ABAW4 competition. We extract different perspective information for the Sign Vehicle space and the Message space, and enhance the model's utilization of this information by enabling multi-task information interaction through an attention mechanism. The results show that our method achieves a performance of 0.85 on the validation set.
References
- [1] Deng, D.: Multiple emotion descriptors estimation at the ABAW3 challenge. CoRR abs/2203.12845 (2022). https://doi.org/10.48550/arXiv.2203.12845
- [2] Ekman, P., Friesen, W.V.: The repertoire of nonverbal behavior: Categories, origins, usage, and coding. Semiotica 1(1), 49–98 (1969)
- [3] Jacob, G.M., Stenger, B.: Facial action unit detection with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7680–7689 (2021)
- [4] Kollias, D.: Abaw: Learning from synthetic data & multi-task learning challenges. arXiv preprint arXiv:2207.01138 (2022)
- [5] Kollias, D.: ABAW: valence-arousal estimation, expression recognition, action unit detection & multi-task learning challenges. CoRR abs/2202.10659 (2022), https://arxiv.org/abs/2202.10659
- [6] Kollias, D., Cheng, S., Pantic, M., Zafeiriou, S.: Photorealistic facial synthesis in the dimensional affect space. In: Computer Vision - ECCV 2018 Workshops - Munich, Germany, September 8-14, 2018, Proceedings, Part II. pp. 475–491 (2018). https://doi.org/10.1007/978-3-030-11012-3_36
- [7] Kollias, D., Cheng, S., Ververas, E., Kotsia, I., Zafeiriou, S.: Deep neural network augmentation: Generating faces for affect analysis. Int. J. Comput. Vis. 128(5), 1455–1484 (2020). https://doi.org/10.1007/s11263-020-01304-3
- [8] Kollias, D., Nicolaou, M.A., Kotsia, I., Zhao, G., Zafeiriou, S.: Recognition of affect in the wild using deep neural networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2017, Honolulu, HI, USA, July 21-26, 2017. pp. 1972–1979 (2017). https://doi.org/10.1109/CVPRW.2017.247
- [9] Kollias, D., Sharmanska, V., Zafeiriou, S.: Distribution matching for heterogeneous multi-task learning: a large-scale face study. CoRR abs/2105.03790 (2021), https://arxiv.org/abs/2105.03790
- [10] Kollias, D., Tzirakis, P., Nicolaou, M.A., Papaioannou, A., Zhao, G., Schuller, B.W., Kotsia, I., Zafeiriou, S.: Deep affect prediction in-the-wild: Aff-wild database and challenge, deep architectures, and beyond. Int. J. Comput. Vis. 127(6-7), 907–929 (2019). https://doi.org/10.1007/s11263-019-01158-4
- [11] Kollias, D., Zafeiriou, S.: Expression, affect, action unit recognition: Aff-wild2, multi-task learning and arcface. In: 30th British Machine Vision Conference 2019, BMVC 2019, Cardiff, UK, September 9-12, 2019. p. 297 (2019), https://bmvc2019.org/wp-content/uploads/papers/0399-paper.pdf
- [12] Kollias, D., Zafeiriou, S.: Va-stargan: Continuous affect generation. In: Advanced Concepts for Intelligent Vision Systems - 20th International Conference, ACIVS 2020, Auckland, New Zealand, February 10-14, 2020, Proceedings. pp. 227–238 (2020). https://doi.org/10.1007/978-3-030-40605-9_20
- [13] Kollias, D., Zafeiriou, S.: Affect analysis in-the-wild: Valence-arousal, expressions, action units and a unified framework. CoRR abs/2103.15792 (2021), https://arxiv.org/abs/2103.15792
- [14] Zafeiriou, S., Kollias, D., Nicolaou, M.A., Papaioannou, A., Zhao, G., Kotsia, I.: Aff-wild: Valence and arousal 'in-the-wild' challenge. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2017, Honolulu, HI, USA, July 21-26, 2017. pp. 1980–1987 (2017). https://doi.org/10.1109/CVPRW.2017.248