¹¹institutetext: ¹National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
²School of Artificial Intelligence, University of Chinese Academy of Sciences
³CAS Center for Excellence in Brain Science and Intelligence Technology ¹¹email: {sunhaiyang2021, lianzheng2016}@ia.ac.cn
{liubin, jhtao}@nlpr.ia.ac.cn

Two-Aspect Information Fusion Model For ABAW4 Multi-task Challenge

Haiyang Sun^1,2 Zheng Lian¹ Bin Liu

{}^{1^{\dagger}}

Jianhua Tao

{}^{1,2,3^{\dagger}}

Licai Sun^1,2 Cong Cai^1,2

Abstract

In this paper, we propose the solution to the Multi-Task Learning (MTL) Challenge of the 4th Affective Behavior Analysis in-the-wild (ABAW) competition. The task of ABAW is to predict frame-level emotion descriptors from videos: discrete emotional state; valence and arousal; and action units. Although researchers have proposed several approaches and achieved promising results in ABAW, current works in this task rarely consider interactions between different emotion descriptors. To this end, we propose a novel end to end architecture to achieve full integration of different types of information. Experimental results demonstrate the effectiveness of our proposed solution.

Keywords:

ABAW4 Multi-task Challenge, information fusion, emotion recognition, action unit detection

1 Introduction

Automatic emotion recognition techniques have been an important task in affective computing. Previously, Ekman et al. [2] divided facial emotion descriptors into two categories: sign vehicles and messages. The sign vehicles are determined by facial movements, which are better fit with action units (AU) detection. The messages are more concerned with the person observing the face or getting the message, which are better fit with facial expressions (EXPR), valence and arousal (VA) prediction.

Deng [1] assumed that estimates of EXPR and VA vary across different observers, but AU is easier to reach consensus. This phenomenon indicates that the sign vehicles and messages express two aspects of emotion. Therefore, the author use different modules to predict them separately. However, observing a smiling but frowning face, the intensity of frowning and smiling influences the change of action units to some extent; while the amplitude of the eyebrow and lip movements influences whether the expressions are positive, and the intensity of emotional states. Therefore, we assume that the results of sign vehicles and messages are are mutually influenced.

In this paper, we present our solution to the ABAW competition. Specifically, we leverage the ROI Feature Extraction module in [1, 3] to capture the facial emotion information. Afterwards, we leverage interactions between sign vehicles and messages to achieve better performance in the ABAW competition.Our main contributions are as follows:

1.

We make the model compute two aspects of the information by a two-way computation, to represent the information of sign vehicles and messages.
2.

We increase the information interoperability between these two aspects to better integrate them into multiple tasks.
3.

Our method showed superior performance than baseline model.

2 Methodology

Our model has three main components: ROI Feature Extraction, Interaction Module, and Temporal Smoothing Module. The overall framework is shown in Figure 1.

Refer to caption — Figure 1: ROI Feature Extraction and Interaction Module.

2.1 ROI Feature Extraction

The main function of ROI Feature Extraction is to encode important information in the images. We adopt the same module in [1]. The feature extractor is an Inception V3 model, images are fed into it and generate feature maps. After that, feature maps generate spatial attention maps for the model’s regions of interest by using three convolutional layers and a sigmoid calculation. The attention maps are fused with feature maps to generate feature vectors of different regions by different encoding modules shown by red tensors. More details about ROI Feature Extraction are in [3].

2.2 Interaction Module

The main function of Interaction Module is to extract information from different semantic spaces, then provide a more comprehensive representation of emotion descriptors for different tasks through information interaction operations. Coding through the ROI Feature Extraction module, each image is transformed into $U$ feature vectors to represent the information of the entire face. These representations are combined with positional encoding vectors and fed into two Transformer blocks to learn information from two perspectives. The outputs of these two blocks are concatenated together as an overall information representation shown by blue tensors. We assign a separate trainable query vector to each task. Each task separately performs query operation on the overall representation and integrates information from different perspectives.

2.3 Temporal Smoothing and Classifier Module

The main function of Temporal Smoothing Module is to add the time domain information to the feature vector. Since the previous modules are calculated on a single image, the temporal information of the video data cannot be taken into account. Therefore, we perform a temporal smoothing operation on the encoded features. After extracting feature vectors from a piece of video data, for the feature vector $v_{t}$ at the time step $t$ , we smoothed it with this function:

\displaystyle f_{t}=\frac{1}{1+\mu}(v_{t}+\mu f_{t-1})

(1)

where $f_{t}$ is the feature that is fed to the classifier at time $t$ . Unlike [1], we add the temporal smoothing operation in the training phase and assign a trainable $\mu$ to each task. It is important to emphasize that we set a trainable $f_{-1}$ for each task for the case $t=0$ . To better train $\mu$ , we discard videos with less than 10 frames of data. Finally, we feed the smoothed features into simple FC layers for classification or prediction.

2.4 Losses

We use BCEloss as the inference loss for the AU detection task, which formula is:

\displaystyle\hat{y}^{AU}=\frac{1}{1+e^{-y_{pre}^{AU}}}

(2)

\displaystyle\mathcal{L}^{AU}=\frac{1}{N}\sum\limits_{i}-(y_{i}^{AU}\log(\hat{y}_{i}^{AU})+(1-y_{i}^{AU})\log(1-\hat{y}_{i}^{AU}))

(3)

where $y_{pre}^{AU}$ denotes the AU prediction result of the model, $y_{i}^{AU}$ denotes the true label for AU of class $i$ .

For the EXPR task, we use the cross-entropy loss as the inference loss, with the following equation:

\displaystyle\hat{y}_{i}^{EXPR}=\frac{e^{y_{pre(i)}^{EXPR}}}{\sum\limits_{k=1}\limits^{K}e^{y_{pre(k)}^{EXPR}}}

(4)

\displaystyle\mathcal{L}^{EXPR}=-\sum\limits^{K}\limits_{i=1}y_{i}^{EXPR}\ln{\hat{y}_{i}^{EXPR}}

(5)

where $y_{pre(i)}^{EXPR}$ denotes the EXPR prediction result of the model of class $i$ , $y_{i}^{EXPR}$ denotes the true label for EXPR of class $i$ .

Finally, we use the negative Concordance Correlation Coefficient (CCC) as the inference loss for the VA prediction, with the following equation:

\displaystyle\mathcal{L}^{VA}=1-CCC^{V}+1-CCC^{A}

(6)

The sum of the three loss values is used as the overall evaluation inference loss of the multitask model:

\displaystyle\mathcal{L}=\mathcal{L}^{AU}+\mathcal{L}^{EXPR}+\mathcal{L}^{VA}

(7)

3 Experiments

3.1 Datasets

Static dataset s-Aff-Wild2 [5, 6, 7, 8, 9, 10, 11, 12, 13, 14], for multi-task learning (MTL) [4], is provided by the ABAW4 competition. It contains only a subset of frames from the Aff-Wild2 dataset. Each frame has a corresponding AU, EXPR and VA labels.

However, there are abnormal labels in this dataset, the cases of $-1$ for EXPR label, $-5.0$ for VA label, and $-1$ for both AU labels, respectively. In our experiments, to balance the number of samples between multiple tasks, we did not use the data with abnormal labels in the training and validation phase.

3.2 Training Details

The aligned faces are provided by this Challenge. Each image has a resolution of $112\times 112$ . We resize it to $224\times 224$ and feed it into the model. In the ROI Feature Extraction module, we try different settings and choose an implementation with U of 24 and D of 24. Transformer blocks share the same architecture, containing 4 attention head. The hidden units of the feed-forward neural network in these blocks are set to be 1024. The optimizer we used is Adam, and the total number of training epochs is 100.

3.3 Evaluation Metric

We use the averaged F1 score of 12 AUs as the evaluation score for AU detection, the averaged macro F1 score as the evaluation score for EXPR prediction, and the CCC value as the evaluation score for the VA task.

4 Results

4.1 Comparison With Baseline Models

In the ABAW4 challenge, the official baseline model is provided which use pre-trained VGG16 network to extract features, and 22 simple classifiers for multitask(2 linear units for VA predictions, 8 softmax activation function units for EXPR predictions and 12 sigmoid activation function units for AU predictions). The experiment results are shown in table 1, demonstrate the superiority of our model over the baseline model.

4.2 Different Temporal Smoothing Methods

We try to add a Bidirectional Temporal Smoothing operation (BTS) to each task, but only assign a trainable initial state vector to the forward operation, i.e. the reverse operation all starts from the penultimate time step. The experiment results are shown in table 1, it is not as effective as Temporal Smoothing operation (TS).

4.3 Different Backbones

We try to use the same structure as Sign Vehicle Space and Message Space of the SMM-EmotionNet [1] (SMM) to extract feature vectors separately before feeding them into the Temporal Smoothing and Classifier Module, but the results are not satisfactory.

We try to use only one Transformer block (Single Perspective) in the Interaction Module to extract information, and the performance degradation is significant compares to two Transformer blocks (Double Perspective). This also proves that the facial emotion descriptors information extracted by two Transformer blocks is more complete. The experiment results are shown in table 1.

Model

BTS

SMM

Single

Perspective

Double

Perspective

Overall Metric

baseline

0.3

ours(U=17,D=16)

✓

0.759

✓

0.815

✓

0.79

✓

0.825

ours(U=24,D=24)

✓

0.85

Table 1: Results of all experiments on the validation set.

4.4 Different semantic information

To verify the effectiveness of these two Transformer blocks, we extract some samples on the validation set for inference and use the t-SNE algorithm for dimensionality reduction to visualize the outputs of these two blocks. The results are shown in Figure 2.

The results show that the sample points of the two colors are clearly demarcated, indicating that there are different types of perspective information extracted from the two Transformer blocks.

4.5 Additional

The performance of our model on each task is shown in Table 2.

Task	Valence	Arousal	Expression	Action Units
ours(U=24,D=24)	0.41	0.62	0.207	0.385

Table 2: Results of all tasks.

5 Conclusions

In this paper, we introduce our method for the Multi-Task Learning Challenge ABAW4 competition. We extracted different perspective information for Sign Vehicle space and Message space, and enhanced the model’s utilization of these information by enabling multi-task information to interact through an attention mechanism. The results show that our method achieves a performance of 0.85 on the validation set.

References

[1] Deng, D.: Multiple emotion descriptors estimation at the ABAW3 challenge. CoRR abs/2203.12845 (2022). https://doi.org/10.48550/arXiv.2203.12845, https://doi.org/10.48550/arXiv.2203.12845
[2] Ekman, P., Friesen, W.V.: The repertoire of nonverbal behavior: Categories, origins, usage, and coding. semiotica 1(1), 49–98 (1969)
[3] Jacob, G.M., Stenger, B.: Facial action unit detection with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7680–7689 (2021)
[4] Kollias, D.: Abaw: Learning from synthetic data & multi-task learning challenges. arXiv preprint arXiv:2207.01138 (2022)
[5] Kollias, D.: ABAW: valence-arousal estimation, expression recognition, action unit detection & multi-task learning challenges. CoRR abs/2202.10659 (2022), https://arxiv.org/abs/2202.10659
[6] Kollias, D., Cheng, S., Pantic, M., Zafeiriou, S.: Photorealistic facial synthesis in the dimensional affect space. In: Computer Vision - ECCV 2018 Workshops - Munich, Germany, September 8-14, 2018, Proceedings, Part II. pp. 475–491 (2018). https://doi.org/10.1007/978-3-030-11012-3_36, https://doi.org/10.1007/978-3-030-11012-3_36
[7] Kollias, D., Cheng, S., Ververas, E., Kotsia, I., Zafeiriou, S.: Deep neural network augmentation: Generating faces for affect analysis. Int. J. Comput. Vis. 128(5), 1455–1484 (2020). https://doi.org/10.1007/s11263-020-01304-3, https://doi.org/10.1007/s11263-020-01304-3
[8] Kollias, D., Nicolaou, M.A., Kotsia, I., Zhao, G., Zafeiriou, S.: Recognition of affect in the wild using deep neural networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2017, Honolulu, HI, USA, July 21-26, 2017. pp. 1972–1979 (2017). https://doi.org/10.1109/CVPRW.2017.247, https://doi.org/10.1109/CVPRW.2017.247
[9] Kollias, D., Sharmanska, V., Zafeiriou, S.: Distribution matching for heterogeneous multi-task learning: a large-scale face study. CoRR abs/2105.03790 (2021), https://arxiv.org/abs/2105.03790
[10] Kollias, D., Tzirakis, P., Nicolaou, M.A., Papaioannou, A., Zhao, G., Schuller, B.W., Kotsia, I., Zafeiriou, S.: Deep affect prediction in-the-wild: Aff-wild database and challenge, deep architectures, and beyond. Int. J. Comput. Vis. 127(6-7), 907–929 (2019). https://doi.org/10.1007/s11263-019-01158-4, https://doi.org/10.1007/s11263-019-01158-4
[11] Kollias, D., Zafeiriou, S.: Expression, affect, action unit recognition: Aff-wild2, multi-task learning and arcface. In: 30th British Machine Vision Conference 2019, BMVC 2019, Cardiff, UK, September 9-12, 2019. p. 297 (2019), https://bmvc2019.org/wp-content/uploads/papers/0399-paper.pdf
[12] Kollias, D., Zafeiriou, S.: Va-stargan: Continuous affect generation. In: Advanced Concepts for Intelligent Vision Systems - 20th International Conference, ACIVS 2020, Auckland, New Zealand, February 10-14, 2020, Proceedings. pp. 227–238 (2020). https://doi.org/10.1007/978-3-030-40605-9_20, https://doi.org/10.1007/978-3-030-40605-9_20
[13] Kollias, D., Zafeiriou, S.: Affect analysis in-the-wild: Valence-arousal, expressions, action units and a unified framework. CoRR abs/2103.15792 (2021), https://arxiv.org/abs/2103.15792
[14] Zafeiriou, S., Kollias, D., Nicolaou, M.A., Papaioannou, A., Zhao, G., Kotsia, I.: Aff-wild: Valence and arousal ’in-the-wild’ challenge. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2017, Honolulu, HI, USA, July 21-26, 2017. pp. 1980–1987 (2017). https://doi.org/10.1109/CVPRW.2017.248, https://doi.org/10.1109/CVPRW.2017.248