
Multi-modal Facial Action Unit Detection with Large Pre-trained Models for
the 5th Competition on Affective Behavior Analysis in-the-wild

Yufeng Yin, Minh Tran, Di Chang, Xinrui Wang, Mohammad Soleymani
University of Southern California
Los Angeles, CA, USA
{yufengy, minhntra, dichang, xinruiw, msoleyma}@usc.edu
Abstract

Facial action unit detection has emerged as an important task within facial expression analysis, aimed at detecting specific pre-defined, objective facial actions, such as lip tightening and cheek raising. This paper presents our submission to the Affective Behavior Analysis in-the-wild (ABAW) 2023 Competition for AU detection. We propose a multi-modal method for facial action unit detection with visual, acoustic, and lexical features extracted from large pre-trained models. To provide high-quality details for visual feature extraction, we apply super-resolution and face alignment to the training data. Our approach achieves an F1 score of 52.3% on the official validation set.

* Equal contributions.

1 Introduction

Facial expressions play a crucial role in social interactions. Action Units (AUs) are muscular activations that anatomically describe the mechanics of facial expressions [3]. Accurate detection of AUs enables unbiased computational descriptions of human faces, thereby improving face analysis applications such as emotion recognition or mental health diagnosis.

From a machine learning perspective, AU detection in the wild presents many technical challenges. Most notably, in-the-wild datasets such as Aff-Wild2 [32, 18, 15, 19, 14, 20, 21, 16, 12, 13, 17] contain huge variations in cameras (resulting in blurred video frames), environments (illumination conditions), and subjects (large variance in expressions, scale, and head poses). Ertugrul et al. [4, 5] demonstrate that deep-learning-based AU detectors have limited generalization ability under such variations. In this work, we investigate three main questions using the Aff-Wild2 dataset [12] to address these challenges. (i) Do multi-modal signals help improve the robustness of a pre-trained model with respect to the varying conditions? (ii) Are super-resolution techniques useful for mitigating the problem of low-quality video frames? (iii) Do large-scale pre-trained models produce reliable feature representations for in-the-wild generalization?

Figure 1: Multi-modal method for facial action unit detection. The inputs from different modalities are passed through different large pre-trained models to extract high-level representations, in this case, Swin Transformer [23] and GH-Feat encoder [31] for vision, HuBERT [8] for audio, and RoBERTa [22] for text. Then the static and temporal features are passed through the linear projectors and GRUs to get the hidden states. Finally, the hidden states are concatenated together and the action units are detected.

To answer these questions, we propose a multi-modal method for facial action unit detection utilizing large pre-trained model features (see Figure 1). We first encode the visual, acoustic, and lexical modalities to obtain high-level feature representations. In particular, we extract visual features with Swin Transformer [23] and GH-Feat [31], and we use HuBERT [8] and RoBERTa [22] to extract the acoustic and lexical features, respectively. These large pre-trained model features have been shown to be discriminative and generalizable representations for downstream tasks. To provide high-quality image details for GH-Feat extraction, we apply super-resolution and face alignment to the training data to reduce the domain gap between its pre-training dataset, FFHQ [10], and Aff-Wild2 [12].

Then, given the features of different modalities, we fuse the multi-modal features in an early fusion manner and train an MLP to output the AU labels. The main contributions of this work include:

  • We propose a multi-modal method for facial action unit detection with visual, acoustic, and lexical features extracted from the large pre-trained models.

  • We apply super-resolution and face alignment to the training data to provide high-quality details for visual feature extraction.

  • Experimental results demonstrate the effectiveness of our proposed framework, achieving 52.3% in terms of F1 score on the official validation set of Aff-Wild2.

2 Related Work

A facial action unit (AU) is a facial behavior indicative of activation of an individual or a group of muscles, e.g., cheek raiser (AU 6). AUs were formalized by Ekman in the Facial Action Coding System (FACS) [3]. Recent studies have proposed novel methods for improving AU detection accuracy using deep learning-based techniques. For instance, Zhao et al. [35] propose the Deep Region and Multi-label Learning (DRML) method, which leverages region learning (RL) and multi-label learning (ML) to identify specific regions for different AUs. Shao et al. [27] propose the Joint AU Detection and Face Alignment Network (JÂA-Net), which employs an adaptive attention learning module to refine the attention map for each AU. Other recent approaches, such as Zhang et al. [34]’s heatmap regression-based method and Song et al. [28]’s hybrid message-passing neural network with performance-driven structures (HMP-PS), utilize graph neural networks to refine features and generate graph structures. Finally, Luo et al. [25] propose an approach that models the relationship between each pair of AUs in a target facial display using a unique graph. These recent advancements in AU detection hold significant promise for improving our understanding of facial expressions and their underlying emotional states.

3 Methods

3.1 Problem Formulation

Facial Action Unit Detection. Given a video set $S$, for each frame $x \in S$, the goal is to detect the occurrence of each AU $a_i$ $(i = 1, 2, \dots, n)$ using a function $\text{F}(\cdot)$:

$a_1, a_2, \dots, a_n = \text{F}(x),$  (1)

where $n$ is the number of AUs to be detected, and $a_i = 1$ if the AU is active and $a_i = 0$ otherwise.

3.2 Overview

In this study, we propose a multi-modal framework for AU detection with large pre-trained model features (see Figure 1). We first encode the visual, acoustic, and lexical modalities to get the high-level feature representations. In particular, we extract features from Swin Transformer [23] and GH-Feat [31] and we use HuBERT [8] and RoBERTa [22] to extract the acoustic and lexical features respectively. Then, given the features of different modalities, we fuse the multi-modal features in an early fusion manner and train an MLP to output the AU labels.

Table 1: Details and statistics of our splits for Aff-Wild2 dataset. We discard the frames labeled with -1.
Split Train Validation Test
# frames 1,155,618 204,068 445,841
# videos 265 30 105

3.3 Feature Extraction

Vision. We extract two types of visual features, i.e., Swin Transformer [23] and GH-Feat [31] features. The Swin Transformer takes as input the original cropped and aligned frames provided by the challenge, while the GH-Feat encoder takes as input the super-resolved images described later.

Swin Transformer is a hierarchical Transformer [29] in which the representation is computed with shifted windows, and it can serve as a general-purpose backbone for computer vision [23]. In particular, we choose Swin Transformer tiny for efficient training. We denote the features extracted from the Swin Transformer as $f_v$.
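For illustration, the following is a minimal sketch of per-frame Swin-T feature extraction. The use of torchvision's ImageNet-pretrained Swin-T weights, the center-crop preprocessing, and keeping the backbone frozen are our assumptions, not necessarily the exact setup used in this work.

```python
# Minimal sketch: extract a 768-d Swin-T feature f_v per cropped/aligned frame.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Swin-T pre-trained on ImageNet; replacing the classification head with
# Identity exposes the pooled backbone representation.
swin = models.swin_t(weights=models.Swin_T_Weights.IMAGENET1K_V1)
swin.head = torch.nn.Identity()
swin.eval()

preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_swin_feature(frame_path: str) -> torch.Tensor:
    img = preprocess(Image.open(frame_path).convert("RGB")).unsqueeze(0)
    return swin(img).squeeze(0)  # f_v, shape (768,)
```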

Xu et al. [31] consider the pre-trained StyleGAN generator [10] as a learned loss function and train a hierarchical ResNet encoder [7] to get visual representations, namely Generative Hierarchical-Features (GH-Feat), for input images. GH-Feat has strong transferability to both generative and discriminative tasks. The encoder of GH-Feat is trained on the FFHQ dataset [10] which contains large-scale and high-quality face images.

(a) Aff-Wild2. (b) FFHQ.
Figure 2: Examples of Aff-Wild2 and FFHQ images. There is a huge domain gap between them in image quality and face alignment.

However, there is a huge domain gap between the two datasets, which may result in inferior performance (see Figure 2). We observe that the face alignments are different and that the image quality of Aff-Wild2 ($112 \times 112$ resolution) is much lower than that of FFHQ ($1024 \times 1024$ resolution). To address these two problems, we first utilize Real-ESRGAN [30] for super-resolution to improve the image quality, and then we use dlib [11] to detect the 68 facial landmarks of the input frame and apply FFHQ-alignment to align the faces (see Figure 3). Finally, we denote the extracted GH-Feat as $f_g$.
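A simplified sketch of this pre-processing is given below. It assumes Real-ESRGAN has already been run offline (e.g., via its inference script) to upscale the $112 \times 112$ crops, and the eye/mouth-based affine alignment only approximates the official FFHQ alignment procedure; the canonical target positions used here are illustrative.

```python
# Simplified sketch of the pre-processing in Figure 3 (landmarks + alignment).
# Input images are assumed to be already super-resolved by Real-ESRGAN.
import dlib
import numpy as np
import cv2

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def align_face(image_path: str, out_size: int = 1024) -> np.ndarray:
    img = dlib.load_rgb_image(image_path)      # super-resolved frame
    det = detector(img, 1)[0]                  # assume one face per crop
    shape = predictor(img, det)
    pts = np.array([(shape.part(i).x, shape.part(i).y) for i in range(68)],
                   dtype=np.float32)
    left_eye, right_eye = pts[36:42].mean(0), pts[42:48].mean(0)
    mouth = pts[48:68].mean(0)
    # Map eye centers and mouth center to FFHQ-like canonical positions
    # (illustrative values, not the official FFHQ template).
    src = np.stack([left_eye, right_eye, mouth]).astype(np.float32)
    dst = np.float32([[0.38, 0.42], [0.62, 0.42], [0.50, 0.72]]) * out_size
    M = cv2.getAffineTransform(src, dst)
    return cv2.warpAffine(img, M, (out_size, out_size))
```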

To obtain the temporal visual information, we also extract the GH-Feat from the previous and next four frames, and we denote the temporal GH-Feat as $f_{tg}$.
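A small sketch of assembling the nine-frame temporal GH-Feat input follows; clamping at video boundaries is our assumption, as the boundary handling is not specified.

```python
# Sketch: build the temporal GH-Feat input f_tg for frame t from frames
# t-4, ..., t+4, clamping indices at the start/end of the video (assumption).
import numpy as np

def temporal_window(ghfeat: np.ndarray, t: int, radius: int = 4) -> np.ndarray:
    """ghfeat: (num_frames, feat_dim) array of per-frame GH-Feat vectors."""
    idx = np.clip(np.arange(t - radius, t + radius + 1), 0, len(ghfeat) - 1)
    return ghfeat[idx]  # shape (2 * radius + 1, feat_dim)
```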

Audio. We use the pre-trained Hidden Unit BERT (HuBERT) [8] to extract the audio features. HuBERT is one of the state-of-the-art representation learning models for speech. It is trained in a self-supervised manner similar to BERT [2], in which a certain portion of the input frames is masked and the model is optimized to reconstruct the masked positions. Following Gururangan et al. [6], who demonstrate the effectiveness of adaptive pre-training (continually pre-training the model on task-specific data), we further train HuBERT on the Aff-Wild2 dataset (in a self-supervised manner with the architecture’s self-supervised loss). We use HuBERT x-large to extract our features (the last hidden state produced by the Transformer encoder). Specifically, for every input frame, we identify the corresponding timestamp, i.e., $\frac{\text{frame\_index}}{\text{video\_fps}}$, and extract the representations of the speech segment spanning two seconds before and after the timestamp. The extracted audio features are denoted as $f_a$.
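The snippet below sketches this extraction under stated assumptions: we load the public facebook/hubert-xlarge-ll60k checkpoint from HuggingFace (the continued self-supervised pre-training on Aff-Wild2 is omitted) and keep the full sequence of last-hidden-state vectors in the $\pm 2$ s window as $f_a$.

```python
# Hedged sketch: HuBERT features for the +/- 2 s speech window around a frame.
import torch
import torchaudio
from transformers import HubertModel, Wav2Vec2FeatureExtractor

CKPT = "facebook/hubert-xlarge-ll60k"  # public checkpoint (assumption)
extractor = Wav2Vec2FeatureExtractor.from_pretrained(CKPT)
hubert = HubertModel.from_pretrained(CKPT).eval()

@torch.no_grad()
def audio_feature(wav_path: str, frame_index: int, video_fps: float) -> torch.Tensor:
    wav, sr = torchaudio.load(wav_path)
    wav = torchaudio.functional.resample(wav, sr, 16000).mean(0)  # mono, 16 kHz
    t = frame_index / video_fps                                   # frame timestamp
    lo, hi = max(0, int((t - 2) * 16000)), int((t + 2) * 16000)
    inputs = extractor(wav[lo:hi].numpy(), sampling_rate=16000, return_tensors="pt")
    hidden = hubert(inputs.input_values).last_hidden_state        # (1, T, 1280)
    return hidden.squeeze(0)  # f_a: sequence of HuBERT states for the window
```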

Text. We use the Google Cloud ASR service***https://cloud.google.com/speech-to-text to extract transcripts for the provided videos. Then, we use RoBERTa Large [22] to extract text embeddings for each input frame. Similar to the audio extraction process, we first identify the timestamp of the given input frame, then collect all words uttered during the two seconds before and after the computed timestamp, and use the pre-trained language model to extract the corresponding textual representation. For frames for which we cannot find any words within this four-second window, we use a zero vector as the textual representation. The lexical features are denoted as $f_t$.
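A hedged sketch of the lexical feature extraction follows; using the ASR word-level start times to select the $\pm 2$ s window and keeping the full token-embedding sequence (rather than pooling it) are our assumptions.

```python
# Hedged sketch: RoBERTa-large embeddings for words spoken near a frame.
import torch
from transformers import RobertaModel, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-large")
roberta = RobertaModel.from_pretrained("roberta-large").eval()

@torch.no_grad()
def text_feature(words, frame_index: int, video_fps: float) -> torch.Tensor:
    """words: list of (word, start_sec, end_sec) tuples from the ASR transcript."""
    t = frame_index / video_fps
    window = [w for w, s, e in words if t - 2.0 <= s <= t + 2.0]
    if not window:
        # No speech within the four-second window: zero vector (as in the paper).
        return torch.zeros(1, roberta.config.hidden_size)
    inputs = tokenizer(" ".join(window), return_tensors="pt", truncation=True)
    hidden = roberta(**inputs).last_hidden_state  # (1, L, 1024)
    return hidden.squeeze(0)                      # f_t: token-level sequence
```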

In summary, we extract five types of features for each input frame: Swin Transformer $f_v$, GH-Feat $f_g$, and temporal GH-Feat $f_{tg}$ for vision, HuBERT $f_a$ for audio, and RoBERTa $f_t$ for text.

Figure 3: Pipeline of data pre-processing. We utilize Real-ESRGAN [30] for super-resolution and we use dlib [11] and FFHQ-alignment for face alignment.

3.4 Multi-modal Fusion

Among the five types of extracted features, $f_v$ and $f_g$ are static features, while $f_{tg}$, $f_a$, and $f_t$ are temporal.

For the static features, we feed them to separate linear projectors to reduce the feature dimensions and obtain the hidden representations $h_v$ and $h_g$. For the temporal features, after reducing the feature dimensions with linear projectors, we feed them to separate bi-directional GRUs [1] and take the last hidden states, i.e., $h_{tg}$, $h_a$, and $h_t$, as the vector representations.

We fuse the multi-modal features in an early fusion manner. Given the encoded hidden representations $h_v$, $h_g$, $h_{tg}$, $h_a$, and $h_t$, we concatenate them and feed the concatenated features into an MLP (two fully-connected layers) for the final prediction.

$h = \text{Concat}(h_v, h_g, h_{tg}, h_a, h_t),$  (2)
$\hat{y} = \text{MLP}(h).$  (3)
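A minimal sketch of the fusion architecture in Eqs. (2)–(3) is shown below; the hidden sizes and the use of the concatenated forward/backward last states of each bi-directional GRU are illustrative assumptions rather than the exact configuration used here.

```python
# Minimal sketch of the early-fusion model: linear projectors, bi-GRUs for the
# temporal features, concatenation (Eq. 2), and a two-layer MLP head (Eq. 3).
import torch
import torch.nn as nn

class MultiModalAUDetector(nn.Module):
    def __init__(self, dims, hidden=256, num_aus=12):
        super().__init__()
        # dims: dict of input feature dims for keys 'v', 'g', 'tg', 'a', 't'.
        self.proj = nn.ModuleDict({k: nn.Linear(d, hidden) for k, d in dims.items()})
        self.gru = nn.ModuleDict({
            k: nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)
            for k in ("tg", "a", "t")
        })
        # h_v, h_g (hidden each) + three bi-GRU states (2 * hidden each).
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden + 3 * 2 * hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_aus),
        )

    def forward(self, feats):
        h_v = self.proj["v"](feats["v"])          # static, (B, hidden)
        h_g = self.proj["g"](feats["g"])
        temporal = []
        for k in ("tg", "a", "t"):
            x = self.proj[k](feats[k])            # (B, T_k, hidden)
            _, h_n = self.gru[k](x)               # (2, B, hidden)
            temporal.append(torch.cat([h_n[0], h_n[1]], dim=-1))
        h = torch.cat([h_v, h_g] + temporal, dim=-1)  # Eq. (2)
        return self.mlp(h)                            # AU logits, Eq. (3)
```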
Table 2: Performance of AU detection on our validation set with the F1-score metric (% $\uparrow$). Base is the multi-modal model without any post-processing. With label smoothing, threshold fine-tuning, and AU correlation, our model’s performance improves.
Method AU1 AU2 AU4 AU6 AU7 AU10 AU12 AU15 AU23 AU24 AU25 AU26 Avg.
Base 43.2 39.9 42.5 57.7 69.8 68.4 69.2 17.5 6.5 15.0 85.3 25.6 45.0
+ Smooth 44.1 40.9 43.0 58.2 70.2 68.8 70.0 17.92 5.7 15.0 86.0 26.0 45.5
+ Smooth + Threshold 44.1 42.6 43.0 61.6 71.0 69.2 70.0 17.9 5.7 13.7 87.3 26.0 46.0
+ Smooth + Threshold + AUcorr 44.1 42.6 43.0 61.6 71.0 69.2 70.0 17.9 5.7 26.7 87.3 31.2 47.5
Table 3: Performance of AU detection on the Aff-Wild2 official validation set in terms of F1-score (% $\uparrow$). Base is the multi-modal model without any post-processing. With label smoothing and threshold fine-tuning, our model’s performance improves.
Method AU1 AU2 AU4 AU6 AU7 AU10 AU12 AU15 AU23 AU24 AU25 AU26 Avg.
Baseline [12] - - - - - - - - - - - - 39
Zhang et al. [33] 55.3 48.9 56.7 62.8 74.4 75.5 73.6 28.1 10.5 20.8 83.9 39.1 52.5
Base 51.6 45.6 49.1 61.1 73.9 75.4 73.4 31.1 14.4 8.3 81.6 33.4 49.9
+ Smooth 52.5 47.6 50.2 61.9 74.6 76.2 74.4 32.7 15.1 8.2 82.2 34.1 50.8
+ Smooth + Threshold 52.6 49.5 50.5 63.2 74.9 76.1 75.0 32.9 14.6 9.0 84.2 34.1 51.4
+ Smooth + Threshold + AUcorr 52.6 49.5 50.5 63.2 74.9 76.1 75.0 32.9 14.6 15.3 84.2 38.5 52.3

3.5 Training and Inference

Training. The learning objective is the sum of the weighted BCE loss and the weighted multi-label loss***https://pytorch.org/docs/stable/generated/torch.nn.MultiLabelSoftMarginLoss.html proposed by Jiang et al. [9], which placed second in the 3rd ABAW challenge [12].

$\mathcal{L} = \mathcal{L}_{\text{BCE}}(y, \hat{y}, W_1) + \mathcal{L}_{\text{Multi-label}}(y, \hat{y}, W_2),$  (4)

where $W_1$ and $W_2$ are the loss weights for each AU.
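The objective in Eq. (4) can be sketched as follows, using the per-AU weights listed in Section 4.2; whether the BCE weights enter as positive-class weights (as below) or as per-element rescaling weights is our assumption.

```python
# Hedged sketch of Eq. (4): weighted BCE + MultiLabelSoftMarginLoss.
import torch
import torch.nn as nn

w_bce   = torch.tensor([1., 2., 1., 1., 1., 1., 1., 6., 6., 5., 1., 5.])  # W_1
w_multi = torch.tensor([1., 2., 1., 1., 1., 1., 1., 6., 6., 6., 1., 2.])  # W_2

bce_loss   = nn.BCEWithLogitsLoss(pos_weight=w_bce)   # weighting choice is our reading
multi_loss = nn.MultiLabelSoftMarginLoss(weight=w_multi)

def au_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """logits, labels: (batch, 12) tensors; labels are 0/1 floats."""
    return bce_loss(logits, labels) + multi_loss(logits, labels)
```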

Inference. Our model outputs the probability $p$ of AU occurrence (ranging from 0 to 1). If $p > \tau$, the AU is predicted as active and as inactive otherwise, where the threshold $\tau$ is a hyper-parameter fine-tuned in the post-processing stage.

Post-processing. While our models are trained with respect to frame-level Action Unit labels, facial expressions should be temporally consistent. Therefore, we apply smoothing with a sliding window of size 6 (frames) to the prediction logits produced by our trained model. Figure 4 shows the average F1 score with smoothing window sizes ranging from 2 to 32 on both our validation and test sets. We pick $k = 6$ as the optimal window size as it avoids over-fitting on both sets.
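A small sketch of this step; an (approximately) centered moving average over the per-frame logits is our reading of the sliding-window smoothing.

```python
# Sketch: moving-average smoothing of per-frame AU logits with window k.
import numpy as np

def smooth_logits(logits: np.ndarray, k: int = 6) -> np.ndarray:
    """logits: (num_frames, num_aus); returns smoothed logits of the same shape."""
    pad = k // 2
    # Edge-pad so the output keeps the original length (boundary handling assumed).
    padded = np.pad(logits, ((pad, k - 1 - pad), (0, 0)), mode="edge")
    kernel = np.ones(k) / k
    return np.stack(
        [np.convolve(padded[:, j], kernel, mode="valid") for j in range(logits.shape[1])],
        axis=1,
    )
```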

Figure 4: Performance with varying smoothing window size $k$.

We further adjust the positive thresholds with the logits produced by our model on our validation set (10% of the official training set) and apply the optimized thresholds on our test set (the official validation set). We further investigate correlations between pairs of different Action Units and use the prediction logits of a better-performing AU to improve the performance of correlated AUs. We provide further details of the technique in the discussion section.
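A hedged sketch of the threshold fine-tuning: for each AU, sweep a grid of candidate thresholds on our validation split and keep the value that maximizes F1. The grid itself is an illustrative assumption.

```python
# Sketch: per-AU threshold selection by F1 on our validation split.
import numpy as np
from sklearn.metrics import f1_score

def tune_thresholds(probs: np.ndarray, labels: np.ndarray,
                    grid=np.arange(0.30, 0.71, 0.05)) -> np.ndarray:
    """probs, labels: (num_frames, num_aus) arrays from our validation split."""
    thresholds = np.zeros(probs.shape[1])
    for j in range(probs.shape[1]):
        scores = [f1_score(labels[:, j], (probs[:, j] > t).astype(int)) for t in grid]
        thresholds[j] = grid[int(np.argmax(scores))]
    return thresholds
```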

4 Experiments

4.1 Dataset

There are 594 videos in Aff-Wild2 for AU detection. Each frame is labeled with 12 AUs (1, 2, 4, 6, 7, 10, 12, 15, 23, 24, 25, and 26). The dataset provides cropped and aligned face images.

We use the official validation set as our test set and randomly divide the official training set into our training and validation sets with a 90/10 split. The split is video-independent, meaning that videos in our training set do not appear in our validation set. Table 1 provides the details and statistics of our splits.
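A small sketch of the video-independent split; the shuffling procedure and seed are illustrative assumptions.

```python
# Sketch: split whole videos (not frames) 90/10 so no video crosses splits.
import random

def split_videos(video_ids, train_ratio=0.9, seed=0):
    ids = sorted(video_ids)
    random.Random(seed).shuffle(ids)
    cut = int(train_ratio * len(ids))
    return ids[:cut], ids[cut:]  # our training videos, our validation videos
```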

Figure 5: Pairwise PCC between different pairs of AUs.

4.2 Implementation and Training Details

All methods are implemented in PyTorch [26]. Code and model weights are available for the sake of reproducibility***https://github.com/intelligent-human-perception-laboratory/ABAW5. We use a machine with two Intel(R) Xeon(R) Gold 5218 (2.30 GHz) CPUs and eight NVIDIA Quadro RTX 8000 GPUs for all experiments. The input image is first resized to $256 \times 256$. For data augmentation, each face is randomly cropped to $224 \times 224$ and randomly flipped horizontally. We train the model with the AdamW optimizer [24] for 15 epochs with a batch size of 256. The learning rate is 1e-5, the weight decay is 1e-5, and the gradient clipping threshold is 1.0. The maximum number of epochs is set to 20, and we train the models with early stopping (patience of 5). The model is evaluated on our validation set at the end of every epoch. Following Jiang et al. [9], the loss weights of the BCE loss are $W_{bce} = [1,2,1,1,1,1,1,6,6,5,1,5]$ and the weights of the multi-label loss are $W_{multi} = [1,2,1,1,1,1,1,6,6,6,1,2]$. The thresholds for the 12 AUs are $\tau = [0.5,0.55,0.5,0.4,0.45,0.45,0.45,0.5,0.5,0.55,0.4,0.5]$.

4.3 Results

We show the experimental results in Tables 2 and 3; the F1 score ($\uparrow$) is the evaluation metric. We also compare with the AU detection results of Zhang et al. [33], the winner of the ABAW challenge at CVPR 2022. The results show that our method outperforms the baseline by more than 10% in terms of F1 score. In addition, we find that with label smoothing and threshold fine-tuning, our model’s performance improves further and becomes comparable to Zhang et al.’s method on the official validation set (52.3% vs. 52.5%), showing the effectiveness of our proposed method.

4.4 Discussions

4.4.1 AU Correlation (AUcorr)

Action units, which represent the activation of facial muscles, are not independent. For example, AU6 and AU12 are usually activated together to represent the action of smiling. Although our model predicts all AUs jointly rather than with independent classifiers, we do not introduce any module that explicitly models the dependencies between different AU predictions. Therefore, we propose a post-processing step, namely AUcorr, to exploit the dependencies between different AUs. The technique is especially useful for a pair of correlated AUs when the trained AU estimation model performs much better on one AU than the other, which introduces the possibility of using the predictions of the easier-to-detect AU to compensate for the performance of the harder-to-detect AU. To do this, we first need to identify pairs of AUs that are strongly correlated with each other.

Figure 5 shows the Pearson Correlation Coefficient (PCC) computed between different pairs of AUs, using the labels provided by the official training set of the competition. The “redness” of a square indicates a more positively correlated pair of AUs (two AUs that usually occur together), while the “blueness” indicates a more negatively correlated pair (two AUs that do not occur together). As a sanity check, we can see that AU6 and AU12 have the strongest positive correlation of all AU pairs, as these two AUs are usually activated together when a person smiles. From the figure, we can see two notable correlated groups of AUs, namely (AU4+AU24 with $\rho = 0.32$) and (AU26+AU1+AU2 with $\rho \geq 0.48$). More importantly, our model detects AU4 and AU1/AU2 more reliably than AU24 (56.7% vs. 9.0%) and AU26 (52.6/49.5% vs. 34.1%). Therefore, we update the predicted activation probabilities of AU24 and AU26 as follows:

$p_{24} \leftarrow \frac{p_{24} + p_4}{2},$  (5)
$p_{26} \leftarrow \frac{p_{26} + p_1 + p_2}{3},$  (6)
Table 4: Performance boost with AUcorr.
AU24 AU26
Val Set (ours) 13.7 26.0
Val Set w/ AUcorr 26.7 31.2
Test Set (ours) 9.0 34.1
Test Set w/ AUcorr 15.3 38.5

where $p_i$ denotes the probability that AU-$i$ is activated, as originally predicted by our trained models. Table 4 shows the performance of our proposed method on both our validation and test sets for AU24 and AU26. We can see a boost of 13% on AU24 and 5.2% on AU26 on the validation set, and a boost of 6.3% on AU24 and 4.4% on AU26 on the test set (ABAW’s official validation set). Such large boosts validate the usefulness of AUcorr. It is important to note that we use a rather crude way to utilize the mined AU correlations (averaging the computed AU probabilities), leaving ample room for improvement with a better information fusion method.
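The AUcorr step can be sketched as follows, combining the PCC computation used for Figure 5 with the probability updates of Eqs. (5)–(6); the column ordering of the AU predictions is an assumption.

```python
# Sketch of AUcorr: pairwise PCC from training labels, then probability
# averaging for the harder AUs (AU24, AU26) as in Eqs. (5)-(6).
import numpy as np

AU_NAMES = [1, 2, 4, 6, 7, 10, 12, 15, 23, 24, 25, 26]  # assumed column order

def pairwise_pcc(labels: np.ndarray) -> np.ndarray:
    """labels: (num_frames, 12) binary AU annotations -> (12, 12) PCC matrix."""
    return np.corrcoef(labels, rowvar=False)

def apply_aucorr(probs: np.ndarray) -> np.ndarray:
    """probs: (num_frames, 12) predicted AU probabilities."""
    out = probs.copy()
    i24, i4 = AU_NAMES.index(24), AU_NAMES.index(4)
    i26, i1, i2 = AU_NAMES.index(26), AU_NAMES.index(1), AU_NAMES.index(2)
    out[:, i24] = (probs[:, i24] + probs[:, i4]) / 2                 # Eq. (5)
    out[:, i26] = (probs[:, i26] + probs[:, i1] + probs[:, i2]) / 3  # Eq. (6)
    return out
```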

Table 5: Ablation Study of different features for our model.
Val Set Test Set
Base 45.0 49.9
- Swin Transformer 42.6 45.1
- GH-Feat Encoder 42.5 48.9
- GH-Feat Encoder (Static) 45.0 49.3
- GH-Feat Encoder (Temporal) 44.7 49.2
- Audio 44.9 47.8
- Text 44.9 49.1

4.4.2 Ablation Study

Table 5 shows the ablation study results for our model. In particular, we report the performance of our model when one or more of the input features are zeroed out.

Figure 6: A sample frame for which Real-ESRGAN [30] fails to produce a satisfactory result.

Effect of Visual Features. We can see that super-resolution is important for robust AU detection, as the average F1 score drops by 2.5% on the validation set and 1.0% on the test set when the GH-Feat encoder is removed (-GH-Feat Encoder). However, we should not rely on super-resolved images alone, as removing the information from the original images (-Swin Transformer) results in the most significant loss in performance (a 2.4% drop on the validation set and a 4.8% drop on the test set). One potential cause is the failure of Real-ESRGAN [30] to produce satisfactory results for some images with challenging conditions, such as a hand over the face (see Figure 6) or a partially occluded face.

Effect of Audio and Text Modalities. We can see that the audio and lexical features contribute to the robustness of the prediction results, especially on the test set. Although it shows only a marginal effect on the validation set, HuBERT is one of the most important features for achieving good performance on the test set (removing the audio features results in a drop of 2.1%). The lexical features are not as relevant as audio for the task of AU detection but still contribute to the overall prediction performance, with a gain of 0.1% on the validation set and 0.8% on the test set.

Effect of Temporal Features. Finally, we can see that it is important to add temporal visual information to the model as removing the temporal GH-Feat Encoder results in an average F1 loss of 0.3% on the validation set and 0.7% on the test set. Such behavior is expected as AUs are temporally consistent and hence, facial movements before and after the current frame contain relevant information to classify the current AUs.

4.4.3 Improvement via Facial Expression Recognition (Future Work)

Figure 7: Pairwise Pearson Correlation Coefficient between different pairs of Action Units and Facial Expressions.

Although detected AUs are usually used for facial expression analysis, we wonder whether the reverse relationship also holds. Using the AU and facial expression annotations provided with the Aff-Wild2 dataset, we find a common set of frames (around 560K) in the released training and validation sets of both the EXPR and AU challenges. Figure 7 shows the PCC computed between each pair of the 12 AUs and the eight facial expression classes. We can see some strongly correlated pairs of facial expressions and AUs, demonstrating the possibility of using a robust in-the-wild facial expression classification model to enhance the performance of AU detection. For example, some notable pairs are (AU2 & Surprise with $\rho = 0.62$), (AU4 & Sadness with $\rho = 0.73$), (AU24 & Anger with $\rho = 0.73$), and (AU26 & Surprise with $\rho = 0.52$). It is important to note that AU2, AU4, AU24, and AU26 are among the more challenging AUs, on which our trained model shows limited performance compared to the other AUs.

5 Conclusion

In this work, we propose a multi-modal framework for AU detection with large pre-trained model features. We apply super-resolution and face alignment to the training data to provide high-quality details for visual feature extraction. The experimental results show the effectiveness of our proposed framework, achieving 52.3% in terms of F1 score on the official validation set of Aff-Wild2.

6 Acknowledgement

The project or effort depicted was or is sponsored by the U.S. Army Research Laboratory (ARL) under contract number W911NF-14-D-0005. The content of the information does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

References

  • [1] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
  • [2] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • [3] Paul Ekman. Facial action coding system, 1977.
  • [4] Itir Onal Ertugrul, Jeffrey F Cohn, László A Jeni, Zheng Zhang, Lijun Yin, and Qiang Ji. Cross-domain au detection: Domains, learning approaches, and measures. In 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), pages 1–8. IEEE, 2019.
  • [5] Itir Onal Ertugrul, Jeffrey F Cohn, László A Jeni, Zheng Zhang, Lijun Yin, and Qiang Ji. Crossing domains for au coding: Perspectives, approaches, and measures. IEEE transactions on biometrics, behavior, and identity science, 2(2):158–171, 2020.
  • [6] Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A Smith. Don’t stop pretraining: Adapt language models to domains and tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8342–8360, 2020.
  • [7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [8] Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3451–3460, 2021.
  • [9] Wenqiang Jiang, Yannan Wu, Fengsheng Qiao, Liyu Meng, Yuanyuan Deng, and Chuanhe Liu. Facial action unit recognition with multi-models ensembling. arXiv preprint arXiv:2203.13046, 2022.
  • [10] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4401–4410, 2019.
  • [11] Davis E. King. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, 10:1755–1758, 2009.
  • [12] Dimitrios Kollias. Abaw: Valence-arousal estimation, expression recognition, action unit detection & multi-task learning challenges. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2328–2336, 2022.
  • [13] Dimitrios Kollias. Abaw: learning from synthetic data & multi-task learning challenges. In European Conference on Computer Vision, pages 157–172. Springer, 2023.
  • [14] D Kollias, A Schulc, E Hajiyev, and S Zafeiriou. Analysing affective behavior in the first abaw 2020 competition. In 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020)(FG), pages 794–800.
  • [15] Dimitrios Kollias, Viktoriia Sharmanska, and Stefanos Zafeiriou. Face behavior a la carte: Expressions, affect and action units in a single network. arXiv preprint arXiv:1910.11111, 2019.
  • [16] Dimitrios Kollias, Viktoriia Sharmanska, and Stefanos Zafeiriou. Distribution matching for heterogeneous multi-task learning: a large-scale face study. arXiv preprint arXiv:2105.03790, 2021.
  • [17] Dimitrios Kollias, Panagiotis Tzirakis, Alice Baird, Alan Cowen, and Stefanos Zafeiriou. Abaw: Valence-arousal estimation, expression recognition, action unit detection & emotional reaction intensity estimation challenges. arXiv preprint arXiv:2303.01498, 2023.
  • [18] Dimitrios Kollias, Panagiotis Tzirakis, Mihalis A Nicolaou, Athanasios Papaioannou, Guoying Zhao, Björn Schuller, Irene Kotsia, and Stefanos Zafeiriou. Deep affect prediction in-the-wild: Aff-wild database and challenge, deep architectures, and beyond. International Journal of Computer Vision, pages 1–23, 2019.
  • [19] Dimitrios Kollias and Stefanos Zafeiriou. Expression, affect, action unit recognition: Aff-wild2, multi-task learning and arcface. arXiv preprint arXiv:1910.04855, 2019.
  • [20] Dimitrios Kollias and Stefanos Zafeiriou. Affect analysis in-the-wild: Valence-arousal, expressions, action units and a unified framework. arXiv preprint arXiv:2103.15792, 2021.
  • [21] Dimitrios Kollias and Stefanos Zafeiriou. Analysing affective behavior in the second abaw2 competition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3652–3660, 2021.
  • [22] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
  • [23] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021.
  • [24] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  • [25] Cheng Luo, Siyang Song, Weicheng Xie, Linlin Shen, and Hatice Gunes. Learning multi-dimensional edge feature-based au relation graph for facial action unit recognition. arXiv preprint arXiv:2205.01782, 2022.
  • [26] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NeurIPS Autodiff Workshop, 2017.
  • [27] Zhiwen Shao, Zhilei Liu, Jianfei Cai, and Lizhuang Ma. Jaa-net: Joint facial action unit detection and face alignment via adaptive attention. International Journal of Computer Vision, 129(2):321–340, 2021.
  • [28] Tengfei Song, Zijun Cui, Wenming Zheng, and Qiang Ji. Hybrid message passing with performance-driven structures for facial action unit detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6267–6276, 2021.
  • [29] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • [30] Xintao Wang, Liangbin Xie, Chao Dong, and Ying Shan. Real-esrgan: Training real-world blind super-resolution with pure synthetic data. In International Conference on Computer Vision Workshops (ICCVW), 2021.
  • [31] Yinghao Xu, Yujun Shen, Jiapeng Zhu, Ceyuan Yang, and Bolei Zhou. Generative hierarchical features from synthesizing images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4432–4442, 2021.
  • [32] Stefanos Zafeiriou, Dimitrios Kollias, Mihalis A Nicolaou, Athanasios Papaioannou, Guoying Zhao, and Irene Kotsia. Aff-wild: Valence and arousal ‘in-the-wild’challenge. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on, pages 1980–1987. IEEE, 2017.
  • [33] Wei Zhang, Feng Qiu, Suzhen Wang, Hao Zeng, Zhimeng Zhang, Rudong An, Bowen Ma, and Yu Ding. Transformer-based multimodal information fusion for facial expression analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2428–2437, 2022.
  • [34] Zheng Zhang, Taoyue Wang, and Lijun Yin. Region of interest based graph convolution: A heatmap regression approach for action unit detection. In Proceedings of the 28th ACM International Conference on Multimedia, pages 2890–2898, 2020.
  • [35] Kaili Zhao, Wen-Sheng Chu, and Honggang Zhang. Deep region and multi-label learning for facial action unit detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3391–3399, 2016.