
Self-supervised Facial Action Unit Detection with Region and Relation Learning

Abstract

Facial action unit (AU) detection is a challenging task due to the scarcity of manual annotations. Recent works on AU detection with self-supervised learning have emerged to address this problem, aiming to learn meaningful AU representations from large amounts of unlabeled data. However, most existing self-supervised AU detection works utilize only global facial features, while AU-related properties such as locality and relevance are not fully explored. In this paper, we propose a novel self-supervised framework for AU detection with region and relation learning. In particular, AU-related attention maps are utilized to guide the model to focus on AU-specific regions and enhance the integrity of AU local features. Meanwhile, an improved Optimal Transport (OT) algorithm is introduced to exploit the correlation characteristics among AUs. In addition, Swin Transformer is exploited to model the long-distance dependencies within each AU region during feature learning. Evaluation results on BP4D and DISFA demonstrate that our proposed method is comparable or even superior to state-of-the-art self-supervised learning methods and supervised AU detection methods.

Index Terms—  Self-supervised learning, AU detection, correlation, Swin Transformer

1 Introduction

The Facial Action Coding System (FACS) [1] defines a unique set of non-overlapping facial muscle actions known as Action Units (AUs); combinations of the presence or absence of individual AUs can describe almost any facial expression. Due to the objectivity of action units, AU detection has drawn significant interest and has been widely applied in human-computer interaction, affect analysis, mental health assessment, etc. However, current research on AU detection is mostly based on supervised learning methods, which rely on large-scale AU-labeled images.

Supervised learning-based AU detection methods tend to mine AU characteristics such as localization and correlation to obtain more discriminative AU features, and several works exploit these attributes to improve the accuracy and robustness of the model. For example, Zhao et al. [2] proposed a locally connected convolutional layer that learns region-specific convolutional filters from sub-areas of the face. JAA-Net [3, 4] integrated the facial alignment task with AU detection to train the network to learn better local features. To exploit the correlation among AUs, Niu et al. [5] proposed LP-Net, which models person-specific shape information to regularize local relationship learning. Li et al. [6] proposed SRERL to embed relation knowledge in regional features with a gated graph neural network. Recently, transformer-based architectures have also improved detection performance [7, 8]. These works benefit from leveraging more AU-related properties. However, supervised methods depend on a considerable number of accurately labeled images. Since labeling AUs is time-consuming and error-prone, the lack of available AU-labeled data limits their generalization. In contrast, unlabeled face data is easy to collect and available in large quantities.

In this paper, discriminative AU representations are learned with a self-supervised method to reduce the demand for AU annotations. Self-supervised learning (SSL) typically relies on auxiliary tasks to extract supervisory signals from vast amounts of unlabeled data, and it has shown excellent promise in applications such as object detection and image classification [9, 10, 11]. Several recent works have attempted to use self-supervised learning to boost the accuracy of AU detection [12, 13, 14, 15, 7]. Wiles et al. [12] proposed FAb-Net, which utilizes the facial movement transformation between two adjacent frames as the supervisory signal and learns a facial embedding by mapping the source frame to the target frame through a reconstruction loss. Inspired by FAb-Net, Li et al. [14] respectively changed the facial actions and head pose of the source face to those of the target face, decoupling facial features from head pose. Lu et al. [15] leveraged temporal consistency to learn feature representations through contrastive learning. In other work, self-supervised learning is used as an auxiliary task [7]. However, most of these studies learn only global facial features and ignore task-related domain knowledge, such as the unique properties of AUs: locality and relevance. Global features alone have limited impact on AU recognition, since each AU is associated with one or a small group of facial muscles and does not manifest independently. Besides, several self-supervised methods [9, 10] learn powerful visual representations for single-object detection via contrastive learning. However, because most existing contrastive approaches construct self-supervised tasks from random crops or temporal consistency, they tend to produce incomplete AU representations and underutilize the advantages of static datasets.

This paper proposes a novel self-supervised framework for facial action unit detection using region and relation learning (RRL). Our proposed framework learns AU representations through a two-stage training process. The first stage is self-supervised pre-training, where we investigate three levels of representation and exploit the special properties of AUs to learn AU-related features. Specifically, for each image, we create two different augmented views to guarantee global similarity and employ the Swin Transformer [16] as our backbone to extract long-range dependencies and global features. Then, we propose an AU-related local feature learning method that guides the network to learn a variety of AU features in the facial image. Moreover, an improved optimal transport algorithm is proposed to exploit the relations among AUs, providing a new way to learn the correlation characteristics of AUs. In the transfer learning stage, we train simple linear classifiers on top of the self-supervised representations for downstream AU detection tasks.

2 METHOD

In this section, we describe the proposed framework in detail. Fig. 1 shows an overview of the proposed framework. RRL contains two stages: self-supervised pre-training and downstream transfer learning. The pre-training stage involves three modules: global feature learning, local feature learning, and optimal transport for relation learning. After pre-training, the learned facial representations are used for AU detection.

Fig. 1: The framework of RRL. In the self-supervised pre-training stage, $f(\cdot)$ based on Swin Transformer and a CNN are utilized to extract global and local representations for each augmented view. $g(\cdot)$ maps local facial features to a low-dimensional latent space, and three components are introduced to train the framework in a self-supervised manner. At the downstream transfer stage, everything but $f(\cdot)$ is discarded.

2.1 Global Feature Learning

Motivated by BYOL [17], we use two neural networks: an online network defined by a set of parameters $\theta$ and a target network parameterized by $\xi$. The target network provides a regression target to train the online network. The parameters $\xi$ of the target network are updated from the online network's parameters $\theta$ according to Eq. 1, in which $m$ is the exponential moving average (EMA) decay rate.

$\xi = m\xi + (1 - m)\theta$ (1)
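A minimal PyTorch-style sketch of this momentum update is given below; the function and argument names are illustrative, not taken from any released code.

```python
import torch

@torch.no_grad()
def ema_update(online_net, target_net, m=0.98):
    """Update target parameters as an exponential moving average of
    the online parameters: xi = m * xi + (1 - m) * theta (Eq. 1)."""
    for theta, xi in zip(online_net.parameters(), target_net.parameters()):
        xi.data.mul_(m).add_(theta.data, alpha=1 - m)
```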

To model long-distance dependencies, we use Swin Transformer as our backbone. Swin Transformer was selected for its competitive results in both supervised and self-supervised work; its hierarchical architecture and shifted windowing mechanism help extract more precise facial features while reducing computational cost.

Given an input image $x$, we first generate two augmented views $x_1$ and $x_2$. These augmented views are fed into the encoders $f(\cdot)$ of the two networks, denoted $f_\theta$ and $f_\xi$ respectively. Within the online network, we then compute the projection $z_1^g = g_\theta(o_g)$ from the pooled representation $o_g$, followed by the prediction $q_\theta(z_1^g)$. At the same time, the target network outputs the target projection $z_2^g = g_\xi(t_g)$. Our global loss is then defined as the negative cosine similarity between the prediction and the target projection, as shown in Eq. 2.

$\mathcal{L}_{\text{glo}} \triangleq -\frac{\left\langle \mathbf{z}_2^g, q_\theta(\mathbf{z}_1^g) \right\rangle}{\left\|\mathbf{z}_2^g\right\|_2 \cdot \left\|q_\theta(\mathbf{z}_1^g)\right\|_2}$ (2)
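A minimal PyTorch-style sketch of this global branch follows; the projector/predictor dimensions and the pooling choice are our assumptions, not the paper's released configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mlp(in_dim, hidden_dim=4096, out_dim=256):
    """BYOL-style projector / predictor head (dimensions are assumed)."""
    return nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.BatchNorm1d(hidden_dim),
                         nn.ReLU(inplace=True), nn.Linear(hidden_dim, out_dim))

def global_loss(p, z):
    """Negative cosine similarity between the online prediction q_theta(z1^g)
    and the detached target projection z2^g (Eq. 2)."""
    p = F.normalize(p, dim=-1)
    z = F.normalize(z.detach(), dim=-1)   # stop-gradient on the target branch
    return -(p * z).sum(dim=-1).mean()

# Usage sketch (names are illustrative): f_theta / f_xi are the online / target
# Swin encoders, g_theta / g_xi the projectors, q_theta the predictor.
# o_g = f_theta(x1).mean(dim=1); t_g = f_xi(x2).mean(dim=1)   # pooled features
# loss_glo = global_loss(q_theta(g_theta(o_g)), g_xi(t_g))
```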

2.2 Local Feature Learning

As noted in self-supervised learning works, comparing random crops of an image plays a central role in capturing relations between parts of a scene or an object [9]. However, recent work on self-supervised AU detection has not taken this into account. Moreover, the multi-crop strategy used in conventional self-supervised object detection could destroy facial AU information. To obtain rich facial information, the primary issue is how to learn AU local features without destroying AU information. Unlike previous methods that randomly crop patches, RRL uses attention maps to make the model focus more on AU-specific regions, and the local feature learning module is used for AU feature refinement.

Specifically, we first detect 68 facial landmarks for a given input image to initialize the AU attention maps. The landmarks associated with each action unit are defined similarly to EAC-Net [18]. We fit ellipses to these landmarks as the initial region of interest for each AU and smooth the result with a Gaussian filter ($\sigma = 3$), obtaining $K$ AU attention maps $i_k$ of size 14×14, where $k = 1, \ldots, K$. We then fuse the features learned by the Swin Transformer with the attention maps and feed them, together with the global feature, into the local feature learning network. We use $[o_1, o_2, \ldots, o_K]$ for the local vectors and $o_g$ for the global vector, where $o$ denotes the output of the online network and $t$ denotes the output of the target network.
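A hedged sketch of how one such attention map could be constructed is given below. The helper name `au_attention_map`, the circle fallback for small landmark sets, and the final normalization are our assumptions; the exact landmark-to-AU assignment follows EAC-Net and is omitted.

```python
import cv2
import numpy as np
from scipy.ndimage import gaussian_filter

def au_attention_map(landmarks, image_size=224, map_size=14, sigma=3.0):
    """Build one 14x14 AU attention map from the landmark subset of that AU.

    landmarks: (N, 2) array of (x, y) points in image coordinates.
    """
    mask = np.zeros((image_size, image_size), dtype=np.uint8)
    if len(landmarks) >= 5:                      # cv2.fitEllipse needs >= 5 points
        # Fit an ellipse to the AU's landmarks as its initial region of interest.
        ellipse = cv2.fitEllipse(landmarks.astype(np.float32))
        cv2.ellipse(mask, ellipse, 255, -1)
    else:                                        # assumed fallback: a filled circle
        cx, cy = np.round(landmarks.mean(axis=0)).astype(int)
        cv2.circle(mask, (int(cx), int(cy)), image_size // 16, 255, -1)
    # Smooth the binary region (Gaussian, sigma = 3) and downsample to 14x14.
    attn = gaussian_filter(mask.astype(np.float32) / 255.0, sigma=sigma)
    attn = cv2.resize(attn, (map_size, map_size), interpolation=cv2.INTER_AREA)
    return attn / (attn.max() + 1e-8)            # normalize to [0, 1] (assumed)
```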

Finally, we minimize the cosine distance between the global and local features of the two networks, as shown in Eq. 3.

$\mathcal{L}_{\text{loc}} \triangleq -\frac{1}{2K}\sum_{k=1}^{K}\left[\frac{\left\langle q_\theta(\mathbf{z}_1^k), g_\xi(\mathbf{t}_g)\right\rangle}{\left\|q_\theta(\mathbf{z}_1^k)\right\|_2 \cdot \left\|g_\xi(\mathbf{t}_g)\right\|_2} + \frac{\left\langle g_\xi(\mathbf{t}_k), q_\theta(\mathbf{z}_1^g)\right\rangle}{\left\|g_\xi(\mathbf{t}_k)\right\|_2 \cdot \left\|q_\theta(\mathbf{z}_1^g)\right\|_2}\right]$ (3)
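Under the shapes defined above, Eq. 3 can be sketched in PyTorch as follows; the tensor layouts are our assumption.

```python
import torch
import torch.nn.functional as F

def local_loss(q_z1_local, g_t_g, g_t_local, q_z1_g):
    """Symmetric negative cosine similarity between each of the K local AU
    features and the global feature of the other branch (Eq. 3).

    q_z1_local: (K, D) online predictions  q_theta(z1^k)
    g_t_g:      (D,)   target global projection g_xi(t_g)
    g_t_local:  (K, D) target local projections g_xi(t_k)
    q_z1_g:     (D,)   online global prediction q_theta(z1^g)
    """
    q_loc = F.normalize(q_z1_local, dim=-1)
    t_g = F.normalize(g_t_g.detach(), dim=-1)
    t_loc = F.normalize(g_t_local.detach(), dim=-1)
    q_g = F.normalize(q_z1_g, dim=-1)
    term1 = (q_loc * t_g).sum(dim=-1)   # online local vs. target global
    term2 = (t_loc * q_g).sum(dim=-1)   # target local vs. online global
    return -0.5 * (term1 + term2).mean()   # mean over K gives the 1/(2K) factor
```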

2.3 Optimal Transport for Relation Learning

Similar to [19], which uses Optimal Transport (OT) to learn discrimination between instances, we propose an improved OT method to learn the correlations among AUs.

The AU outputs $[o_1, o_2, \ldots, o_K]$ from the online network are regarded as the $K$ goods of a supplier, while the vectors $[t_1, t_2, \ldots, t_K]$ from the target network are regarded as the demands of a demander. In addition, we define the unit transportation cost between demander and supplier based on the AU relationship matrix. Following [20], we count the co-occurrence of AU labels in the training set and obtain the AU relationship matrix $M_{ij}$, as shown in Fig. 2. The larger the matrix value, the stronger the correlation between the two AUs. The unit transportation cost from a supplier node to a demander node is therefore defined as Eq. 4.

$C_{ij} = 1 - M_{ij}$ (4)

The correlation degree of each AU pair can then be expressed as the optimal matching cost between the two sets of vectors: the more relevant the vectors, the lower the transport cost.

Fig. 2: Relationship matrix $M_{ij}$ of AUs.

Following [19], the marginal weights $a_i$ and $b_j$ are computed from the inner products of the global and local vectors. In this way, the importance of an AU region depends on its contribution to the whole face, as defined in Eq. 5, where the function $\max(\cdot)$ ensures the weights are always non-negative.

$a_i = \max\left(\mathbf{o}_i^{T}\mathbf{t}_g, 0\right), \quad b_j = \max\left(\mathbf{t}_j^{T}\mathbf{o}_g, 0\right)$ (5)

We use the cost matrix $C$ to compute the optimal transport plan $\pi^{*}$, solving the OT problem with the Sinkhorn-Knopp iteration. We then define the correlation loss function as shown in Eq. 6. The loss is minimized only if the representation of each AU is similar to those of its associated AUs.

$\mathcal{L}_{\text{corr}}(\mathbf{O},\mathbf{T}) \triangleq -\sum_{i=1}^{K}\sum_{j=1}^{K}\frac{\mathbf{o}_i^{T}\mathbf{t}_j}{\left\|\mathbf{o}_i\right\| \left\|\mathbf{t}_j\right\|}\,\pi^{*}_{ij}$ (6)
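A hedged sketch of Eqs. 4-6 follows. The entropic regularization, its coefficient `eps`, and the iteration count are our assumptions; the paper only states that the OT problem is solved with the Sinkhorn-Knopp iteration.

```python
import torch
import torch.nn.functional as F

def sinkhorn(cost, a, b, eps=0.05, n_iters=100):
    """Approximate transport plan via entropy-regularized Sinkhorn-Knopp."""
    a = a / (a.sum() + 1e-8)                  # normalize supplier weights
    b = b / (b.sum() + 1e-8)                  # normalize demander weights
    G = torch.exp(-cost / eps)                # Gibbs kernel
    u = torch.ones_like(a)
    for _ in range(n_iters):
        v = b / (G.t() @ u + 1e-8)
        u = a / (G @ v + 1e-8)
    return torch.diag(u) @ G @ torch.diag(v)  # pi* = diag(u) G diag(v)

def correlation_loss(O, T, o_g, t_g, M):
    """Eqs. 4-6: cost C = 1 - M, marginal weights from global/local inner
    products, and the plan-weighted pairwise cosine similarity.

    O, T:     (K, D) local AU features from the online / target branch
    o_g, t_g: (D,)   global features from the online / target branch
    M:        (K, K) AU relationship matrix
    """
    C = 1.0 - M                               # unit transportation cost (Eq. 4)
    a = torch.clamp(O @ t_g, min=0.0)         # a_i = max(o_i^T t_g, 0)  (Eq. 5)
    b = torch.clamp(T @ o_g, min=0.0)         # b_j = max(t_j^T o_g, 0)
    pi = sinkhorn(C, a, b).detach()           # optimal transport plan pi*
    sim = F.cosine_similarity(O.unsqueeze(1), T.unsqueeze(0), dim=-1)  # (K, K)
    return -(sim * pi).sum()                  # Eq. 6
```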

After introducing each loss term, the overall loss is defined as follows:

$\mathcal{L}_{\text{all}} = \alpha_1\mathcal{L}_{\text{glo}} + \alpha_2\mathcal{L}_{\text{loc}} + \alpha_3\mathcal{L}_{\text{corr}}$ (7)

where α1\alpha_{1}, α2\alpha_{2}, α3\alpha_{3} are trade-off parameters.
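Combining the three terms, Eq. 7 amounts to the following small sketch; the default weights are the ones reported later in Section 3.1.

```python
def overall_loss(loss_glo, loss_loc, loss_corr, alpha=(0.4, 0.6, 1.0)):
    """Weighted sum of the three self-supervised terms (Eq. 7)."""
    a1, a2, a3 = alpha
    return a1 * loss_glo + a2 * loss_loc + a3 * loss_corr
```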

3 Experiments

3.1 Experimental Setup

Training. We conduct the self-supervised pre-training on the EmotioNet [21] dataset, which was collected in the wild and contains a total of 950,000 images. All images are resized to 224 × 224 as the input to the network. Facial landmarks are detected with Dlib. The data augmentation includes random horizontal flipping, random grayscale conversion, random Gaussian blur, and random color distortion; the color distortion (brightness, saturation, contrast, and hue) is applied with different magnitudes to the two views. We employ an AdamW optimizer with learning rate $lr = 0.05 \times \text{batchsize}/256$ and a cosine decay learning rate schedule. The weight decay also follows a cosine schedule from 0.04 to 0.4. The trade-off parameters $\alpha_1$, $\alpha_2$, and $\alpha_3$ are empirically set to 0.4, 0.6, and 1.0, respectively. The exponential moving average parameter $m$ is initialized to 0.98 and increased to 1.0 during training.
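The optimization recipe above can be sketched as follows. The number of epochs, the steps per epoch, the stand-in `model`, and the cosine shape of the EMA ramp are placeholders or assumptions, not values stated in the paper.

```python
import math
import torch

model = torch.nn.Linear(8, 8)              # stand-in for the Swin online network
batch_size = 256
base_lr = 0.05 * batch_size / 256          # linear lr scaling rule
epochs, steps_per_epoch = 100, 1000        # assumed values

optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr, weight_decay=0.04)

def cosine(step, total, start, end):
    """Cosine schedule from `start` to `end` over `total` steps."""
    return end + 0.5 * (start - end) * (1 + math.cos(math.pi * step / total))

total_steps = epochs * steps_per_epoch
for step in range(total_steps):
    lr = cosine(step, total_steps, base_lr, 0.0)            # cosine-decayed lr
    wd = cosine(step, total_steps, 0.04, 0.4)               # weight decay 0.04 -> 0.4
    m = 1.0 - cosine(step, total_steps, 1.0 - 0.98, 0.0)    # EMA rate 0.98 -> 1.0
    for group in optimizer.param_groups:
        group["lr"], group["weight_decay"] = lr, wd
    # ... forward both views, compute L_all, optimizer.step(), ema_update(..., m)
```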

Evaluation. We then evaluate the method on the BP4D [22] and DISFA [23] datasets. BP4D contains 41 subjects (23 females and 18 males) and 328 videos with about 146,000 frames with available AU labels. DISFA consists of 26 participants, with AUs labeled with intensities from 0 to 5; frames with intensities greater than 1 are treated as positive and the others as negative. We choose $K = 12$ on BP4D and $K = 8$ on DISFA. In the downstream transfer stage, the linear classifier consists of two layers: a batch-norm layer followed by a linear fully connected layer with no bias. The linear classifier is trained with a cross-entropy loss for each AU, and we adopt the F1 score to evaluate performance. We also compute the average over all AUs to measure overall performance.
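A sketch of the linear evaluation head described above. The feature dimension of 768 matches Swin-T's final stage but is otherwise an assumption, and using `BCEWithLogitsLoss` is our reading of the per-AU cross-entropy loss, since each AU is a binary decision.

```python
import torch
import torch.nn as nn

class LinearAUClassifier(nn.Module):
    """Transfer-stage head: a batch-norm layer followed by a bias-free fully
    connected layer, one output per AU (K = 12 on BP4D, 8 on DISFA)."""
    def __init__(self, feat_dim=768, num_aus=12):
        super().__init__()
        self.bn = nn.BatchNorm1d(feat_dim)
        self.fc = nn.Linear(feat_dim, num_aus, bias=False)

    def forward(self, frozen_features):
        # Features come from the frozen, pre-trained encoder f(.).
        return self.fc(self.bn(frozen_features))

criterion = nn.BCEWithLogitsLoss()   # one binary cross-entropy term per AU
```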

3.2 Experimental Results

We compared RRL with several self-supervised methods and supervised AU detection methods. Table 1 and Table 2 report the F1 score on DISFA and BP4D.

Table 1: F1 score for self-supervised and supervised methods on DISFA [23] of multiple facial action units (AUs).
AU Supervised Self-Supervised
DRML [2]∗   JAA-Net [4]∗   FAb-Net [12]   TCAE [14]   TC-Net (k=1) [15]∗   TC-Net [15]∗   RRL (ours)
1 17.3 43.7 27.5 24.8 10.8 18.7 15.4
2 17.7 46.2 19.6 25.5 20.7 27.4 15.9
4 37.4 56.0 28.7 37.3 43.3 35.1 49.5
6 29.0 41.4 45.2 34.7 37.6 33.6 48.8
9 10.7 44.7 20.9 31.1 12.2 20.7 22.1
12 37.7 69.6 65.6 59.6 68.7 67.5 70.3
25 38.5 88.3 67.9 58.1 62.9 68.0 81.4
26 20.1 58.4 24.0 25.2 46.2 43.8 46.8
Avg. 26.7 56.0 37.4 37.0 37.8 39.4 43.8
  • ∗ means that the values are reported in the original papers.

Comparison with self-supervised methods. We compare our method with the self-supervised methods FAb-Net [12], TCAE [14], and TC-Net [15]. The results of FAb-Net and TCAE follow the settings in TC-Net. These models were all pre-trained on the combined VoxCeleb datasets [24], which consist of interview videos of around 7,000 subjects, in contrast to the static, in-the-wild EmotioNet dataset employed in this paper. TC-Net (k=1) denotes the result of TC-Net with a time interval of 1. As shown in Table 1, the average F1 score of our method outperforms all self-supervised methods on DISFA. Even though the DISFA dataset suffers from severe data imbalance, most of our per-AU results are encouraging. This performance demonstrates the benefit of the relation learning module, which provides comprehensive AU correlation knowledge. Evaluations on BP4D are shown in Table 2. Without any temporal information, we still obtain comparable results against recent self-supervised methods. In particular, AU2 (outer brow raiser), which is difficult to capture, achieves remarkable results, suggesting that the local feature learning module can extract subtle AU changes effectively. As the results on BP4D show, our method performs slightly worse than TCAE and TC-Net; both methods were trained on video frames, and in addition, the TC-Net results were obtained with an ensemble model that combines several independently trained encoders.

Table 2: F1 score for self-supervised and supervised methods on BP4D [22] of multiple facial action units (AUs).
AU Supervised Self-Supervised
AlexNet [25]∗   DRML [2]∗   JAA-Net [4]∗   FAb-Net [12]   TCAE [14]   TC-Net (k=1) [15]∗   TC-Net [15]∗   RRL (ours)
1 40.3 36.4 47.2 33.4 33.5 35.2 42.3 42.0
2 39.0 41.8 44.0 24.8 32.2 25.5 24.3 35.7
4 41.7 43.0 54.9 41.0 43.8 30.2 44.1 34.0
6 62.8 55.0 77.5 73.5 73.7 71.3 71.8 67.4
7 54.2 67.0 74.6 66.2 67.7 69.6 70.5 67.8
10 75.1 66.3 84.0 78.8 80.1 81.3 77.6 79.1
12 78.1 65.8 86.9 84.7 81.5 83.3 83.3 80.6
14 44.7 54.1 61.9 57.9 57.4 59.1 61.2 63.9
15 32.9 33.2 43.6 21.2 26.5 30.3 31.6 28.6
17 47.3 48.0 60.3 55.7 54.5 56.1 51.6 48.6
23 27.3 31.7 42.7 26.8 23.2 27.0 29.8 26.5
24 40.1 30.0 41.9 37.9 31.8 33.4 38.6 32.4
Avg. 48.6 48.3 60.0 50.2 50.5 50.2 52.0 50.2
  • ∗ means that the values are reported in the original papers.

Comparison with supervised methods. We also compare our method with state-of-the-art supervised methods, including AlexNet [25], DRML [2], and JAA-Net [4]. On top of the frozen encoder, we simply train a linear classifier. The experimental results in Table 1 and Table 2 demonstrate that our RRL is comparable to fully supervised models: it outperforms AlexNet and DRML on BP4D and outperforms DRML on DISFA. It lags behind JAA-Net, which uses facial landmarks to jointly perform AU detection and face alignment. All supervised results are taken directly from the original papers.

3.3 Ablation Studies

In this section, we conduct ablation experiments on the BP4D dataset to explore the effect of each component of our proposed model and to find the best structural configuration of the framework.

Table 3: Evaluation of different components.
Global        ✓      ✓      ✓      ✓
Local                ✓             ✓
Correlation                 ✓      ✓
F1-score      48.0   48.9   49.5   50.2

The Effectiveness of different components: As shown in Table 3, Global is the baseline, which learns only from the two augmented views. The F1-score rises by 0.9% after adding local learning, indicating the significance of local features in AU detection. By enhancing AU relation learning via OT, the model improves the baseline by 1.5%. This performance demonstrates the necessity of taking the unique characteristics of AUs into account when designing self-supervised tasks. Moreover, the average F1-score is further improved by 2.2% over the baseline when we combine the two components. Overall, both the local learning and relation learning components are essential for improving the performance of the RRL network.

Table 4: Evaluation of the Swin Transformer and augmentation.
VGG-16        ✓
Swin-T               ✓      ✓
Augment       ✓             ✓
F1-score      48.8   49.1   50.2

The Effectiveness of Swin Transformer and Augmentation: To investigate the effect of the backbone on AU detection performance, we replace the Swin Transformer with VGG16 and report the results in Table 4. We find that using the Swin Transformer significantly improves detection performance, as the hierarchical architecture of Swin-T extracts rich facial representations. When the image augmentation is discarded, the result declines by 1.1%, indicating that data augmentation is crucial for self-supervised learning on natural images.

4 CONCLUSION

In this paper, we have proposed a novel self-supervised learning framework for AU detection in which the localization and correlation properties of AUs are fully considered. The framework is trained on unlabeled databases, giving the model better generalization for AU detection. Extensive experimental results demonstrate the superiority of our method. In the future, we will adaptively capture the correlated regions of each AU and further extend our framework to AU intensity estimation.

References

  • [1] Paul Ekman and Wallace V Friesen, “Facial action coding system,” Environmental Psychology & Nonverbal Behavior, 1978.
  • [2] Kaili Zhao, Wen-Sheng Chu, and Honggang Zhang, “Deep region and multi-label learning for facial action unit detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 3391–3399.
  • [3] Zhiwen Shao, Zhilei Liu, Jianfei Cai, and Lizhuang Ma, “Deep adaptive attention for joint facial action unit detection and face alignment,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 705–720.
  • [4] Zhiwen Shao, Zhilei Liu, Jianfei Cai, and Lizhuang Ma, “Jaa-net: joint facial action unit detection and face alignment via adaptive attention,” International Journal of Computer Vision, vol. 129, no. 2, pp. 321–340, 2021.
  • [5] Xuesong Niu, Hu Han, Songfan Yang, Yan Huang, and Shiguang Shan, “Local relationship learning with person-specific shape regularization for facial action unit detection,” in Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition, 2019, pp. 11917–11926.
  • [6] Guanbin Li, Xin Zhu, Yirui Zeng, Qing Wang, and Liang Lin, “Semantic relationships guided representation learning for facial action unit recognition,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2019, vol. 33, pp. 8594–8601.
  • [7] Jingwei Yan, Jingjing Wang, Qiang Li, Chunmao Wang, and Shiliang Pu, “Self-supervised regional and temporal auxiliary tasks for facial action unit recognition,” in Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 1038–1046.
  • [8] Geethu Miriam Jacob and Bjorn Stenger, “Facial action unit detection with transformers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 7680–7689.
  • [9] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin, “Emerging properties in self-supervised vision transformers,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 9650–9660.
  • [10] Xinlei Chen and Kaiming He, “Exploring simple siamese representation learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 15750–15758.
  • [11] Byungseok Roh, Wuhyun Shin, Ildoo Kim, and Sungwoong Kim, “Spatially consistent representation learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 1144–1153.
  • [12] Olivia Wiles, A Koepke, and Andrew Zisserman, “Self-supervised learning of a facial attribute embedding from video,” arXiv preprint arXiv:1808.06882, 2018.
  • [13] Yanan Chang and Shangfei Wang, “Knowledge-driven self-supervised representation learning for facial action unit recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 20417–20426.
  • [14] Yong Li, Jiabei Zeng, Shiguang Shan, and Xilin Chen, “Self-supervised representation learning from videos for facial action unit detection,” in Proceedings of the IEEE/CVF Conference on Computer vision and pattern recognition, 2019, pp. 10924–10933.
  • [15] Liupei Lu, Leili Tavabi, and Mohammad Soleymani, “Self-supervised learning for facial action unit recognition through temporal consistency,” in BMVC, 2020.
  • [16] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10012–10022.
  • [17] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al., “Bootstrap your own latent-a new approach to self-supervised learning,” Advances in Neural Information Processing Systems, vol. 33, pp. 21271–21284, 2020.
  • [18] Wei Li, Farnaz Abtahi, Zhigang Zhu, and Lijun Yin, “Eac-net: A region-based deep enhancing and cropping approach for facial action unit detection,” in 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017). IEEE, 2017, pp. 103–110.
  • [19] Zhaowen Li, Yousong Zhu, Fan Yang, Wei Li, Chaoyang Zhao, Yingying Chen, Zhiyang Chen, Jiahao Xie, Liwei Wu, Rui Zhao, et al., “Univip: A unified framework for self-supervised visual pre-training,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 14627–14636.
  • [20] Zhilei Liu, Jiahui Dong, Cuicui Zhang, Longbiao Wang, and Jianwu Dang, “Relation modeling with graph convolutional networks for facial action unit detection,” in International Conference on Multimedia Modeling. Springer, 2020, pp. 489–501.
  • [21] C Fabian Benitez-Quiroz, Ramprakash Srinivasan, and Aleix M Martinez, “Emotionet: An accurate, real-time algorithm for the automatic annotation of a million facial expressions in the wild,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 5562–5570.
  • [22] Xing Zhang, Lijun Yin, Jeffrey F Cohn, Shaun Canavan, Michael Reale, Andy Horowitz, and Peng Liu, “A high-resolution spontaneous 3d dynamic facial expression database,” in 2013 10th IEEE international conference and workshops on automatic face and gesture recognition (FG). IEEE, 2013, pp. 1–6.
  • [23] S Mohammad Mavadati, Mohammad H Mahoor, Kevin Bartlett, Philip Trinh, and Jeffrey F Cohn, “Disfa: A spontaneous facial action intensity database,” IEEE Transactions on Affective Computing, vol. 4, no. 2, pp. 151–160, 2013.
  • [24] Arsha Nagrani, Joon Son Chung, and Andrew Zisserman, “Voxceleb: a large-scale speaker identification dataset,” arXiv preprint arXiv:1706.08612, 2017.
  • [25] Wen-Sheng Chu, Fernando De la Torre, and Jeffrey F Cohn, “Learning spatial and temporal cues for multi-label facial action unit detection,” in 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017). IEEE, 2017, pp. 25–32.