¹¹institutetext: Department of Artificial Intelligence, Korea University, Seoul, Korea
¹¹email: [email protected]
Department of Artificial Intelligence & Department of Brain and Cognitive Engineering, Korea University, Seoul, Korea
¹¹email: [email protected]

Comparing Facial Expression Recognition in Humans and Machines: Using CAM, GradCAM, and Extremal Perturbation

Serin Park
Christian Wallraven 11 0000-0001-7704-1167 22** 0000-0002-2604-9115

Abstract

Facial expression recognition (FER) is a topic attracting significant research in both psychology and machine learning with a wide range of applications. Despite a wealth of research on human FER and considerable progress in computational FER made possible by deep neural networks (DNNs), comparatively less work has been done on comparing the degree to which DNNs may be comparable to human performance. In this work, we compared the recognition performance and attention patterns of humans and machines during a two-alternative forced-choice FER task. Human attention was here gathered through click data that progressively uncovered a face, whereas model attention was obtained using three different popular techniques from explainable AI: CAM, GradCAM and Extremal Perturbation. In both cases, performance was gathered as percent correct. For this task, we found that humans outperformed machines quite significantly. In terms of attention patterns, we found that Extremal Perturbation had the best overall fit with the human attention map during the task.

Keywords:

Facial Expression Recognition AffectNet humans versus machines human-in-the-loop

1 Introduction

Facial expression is a natural and powerful tool of communication among humans. Facial expressions are instantly processed by humans conveying a wealth of messages: a smiling face spreads happiness, and a sad face makes our heart ache. There is evidence that some emotional facial expressions are largely universal [6], that is, they are recognized well in different cultures and associated with similar semantic content. Recent work, however, has cast some doubt to the degree of this universality [10, 13] and the degree to which facial expressions are actually a reliable signal of an internal mental state [1], but that does not lessen the importance of facial expressions in human-to-human communication [22].

Because of its importance to humans, facial expression recognition is also a major topic in the field of machine learning. If machines would be able to interpret human facial expression correctly - and possibly make appropriate facial expressions in return - human-to-machine interaction would become more natural and efficient. Recognition of facial expression by machines is called automatic facial expression recognition, or automatic FER. Automatic FER has come a long way, from hand-crafted approaches to the current end-to-end deep learning models that locate and recognize facial expressions [16]. Nonetheless, there still is a long way to go: current algorithms are good enough at recognizing laboratory-controlled facial expression images, but they struggle to recognize expressions from naturalistic images [18, 25].

This leads us to the natural question: do humans and machines process facial expression images differently? And if they do, can we teach machines to act more like humans? In this paper, we adopted a human-in-the-loop (HIL) paradigm to address this question: we first collected click data from human participants to gather those spatial locations that may be important for disambiguating an expression in a two-alternative forced-choice task. Click data are reported to be a cost-efficient substitute for eye-tracking, and reflect the regional attention well [5, 14]. For automatic FER, we trained an ensemble of deep neural networks on AffectNet [19], a large, in-the-wild facial expression dataset, and compared its activation map with human click data using three different explainability or visualization methods. We also tried to further fine-tune the models with the human attention map to see whether this would improve the FER performance.

2 Related work

2.1 Automatic FER

After the advent of deep learning, FER has typically been implemented with deep neural networks due to their superior performance and robustness over the past years [16]. In this section, we will go through some of the most popular datasets of FER and their benchmarks.

FER datasets can be broadly categorized as either controlled or in-the-wild. Controlled datasets are posed by trained actors and photographed in the lab with regular illumination. The extended Cohn-Kanade dataset (CK+; [17]) is a classic example of a controlled dataset. It contains 593 video sequences from 123 individuals that start from neutral expression and culminate in the intended expression (one of seven categories: anger, contempt, disgust, fear, happiness, sadness, and surprise). In contrast, in-the-wild datasets are crawled from the web by searching for emotion-related keywords. This type of dataset is typically larger than controlled datasets, and noisier in terms of identity, illumination, etc. FER+ [2], a re-labeled version of FER2013 [11], has been a popular early dataset in the field, with 28,709 training images, 3,589 validation images and 3,589 test images, consisting of the same seven expressions as the CK+ dataset. Currently, AffectNet [19] is the largest publicly-available labelled dataset on facial expressions (see Dataset section below for detailed information).

FER models have reached excellent performance on controlled datasets: on CK+, Frame Attention networks proposed by [18] have attained 99.7% accuracy. However, FER models perform less well with in-the-wild datasets: the state-of-the-art (SOTA) accuracy on the FER+ dataset with cleaned and updated labels is 89.75% reached by a PSR model on seven expressions [29]. SOTA on the AffectNet dataset, in contrast, is only 65.74% for seven of the eight included emotion categories [25]. Given that FER systems in practical use, such as humanoid robots and surveillance systems, will not be fed with regular illuminations, frontal head position and exemplary expressions, it is important to improve FER accuracy on such in-the-wild datasets.

2.2 Human FER

As automatic FER is an effort to mimic the natural capacity of humans to recognize facial expressions, one must look back on humans to get insight for the models. One thing to note is that humans are not necessarily better than deep neural networks in the task of classifying images. Human performance on the (non-updated) FER2013 dataset was 65 $\pm$ 5% [11], while the then-SOTA model, ResMaskingNet [23] reached 76.82%. However, this does not mean machines have outperformed humans in facial expression recognition in general. As stated above, recognizing someone’s facial expression in the real world represents quite different challenges.

One of the key differences between humans and computers in FER is that humans pay attention to a limited region in the face, while computers treat all pixels equally in the initial phase. Humans distribute most of their attention to the eyes and mouth [20, 22], which partly explains the efficiency with which humans recognize facial expression. Interestingly, the region of interest can differ depending on culture [12]. East Asians focus on the eyes, while Western Caucasian also pay attention to the mouth region. This difference leads East Asians to perform less well when discriminating ’fear’ from ’disgust’, in which pair the mouth region holds the key information.

2.3 Transfer learning

Transfer learning has two major approaches: the first approach is to pretrain and then fine-tune. Fine-tuning is a common practice in training neural networks. As it is difficult to collect large datasets for specific problems, researchers often train their networks first on ImageNet, a large-scale object classification dataset that contains 1.2 million images with 1000 classes [4]. ImageNet-pretrained models are also available in deep learning libraries. Fine-tuning may be done several times: [21] introduced cascaded fine-tuning, where the researchers first trained a deep CNN model on ImageNet dataset, an auxiliary dataset related to emotion recognition, and finally on the target dataset.

Another approach is Knowledge Distillation, also known as Teacher-Student model. It was originally developed as a model compression method [3]. The Teacher network first learns the representations and outputs prediction labels. Then the Student network is trained on the prediction labels of Teacher. One key aspect of Knowledge Distillation is that it can transfer knowledge across models with different structure. It even enables human-machine transfer learning: [26] implemented this type of learning, albeit indirectly. The researchers first trained a Teacher model on an FER dataset and obtained a saliency map for each image by visualizing the activations of the Teacher model. Then the researchers masked the images by leaving only the most important parts of the image based on the saliency maps - usually around the eyes and mouth - and used the masked images to train the Student model. The masked images initially helped accelerate the training, but the acceleration was retained only if the training data was switched to unmasked images after a certain point - there was no effect on accuracy, however. The critical aspect was that the researchers validated the masked images by comparing them to eye-tracking results by human observers. As the saliency maps of Teacher model were shown to be similar to human attention maps, this work is an indirect example of human-machine transfer learning.

3 Dataset

We chose the AffectNet dataset [19] to train our models for two reasons. First, it is the largest labelled dataset for facial expressions, containing a total of 440,601 labelled images in eight categories: neutral, happy, sad, surprise, fear, disgust, anger, and contempt. Second, it seems to be a difficult dataset to improve on: it was collected by web-based crawling methods, and each image was labeled by one, though expert, human annotator. Therefore, the AffectNet dataset contains in-the-wild images that may be mislabelled or vague. Specifically, out of 36,000 images that were annotated by two human annotators to calculate agreement, their agreement was only 60.7 percent. Moreover, the dataset is highly imbalanced: the largest class, happy, contains 143,991 images while the smallest class, contempt, contains only 5,119 images. This imbalance reflects the real-world proportion of expression occurrences; one does observe happy expressions more often than one observes contempt.

The baseline for AffectNet was measured with AlexNet [15]. As the test set is not publicly available, the validation set, which contains 500 images for each expression, is used as the benchmark dataset. Baseline accuracy with weighted loss was 58% on the validation set. The state-of-the-art is an SL + SSL in-panting-pl model [24], with an average accuracy of 61.72% for eight emotion categories. Table 1 summarizes major results on AffectNet benchmarks and the baseline. In cases where one paper listed several methods with slight differences on the ranking, we only chose one method with the best result. Moreover, we only list methods that were tested on all eight emotion categories, as we will focus also on eight categories in this paper.

Table 1: SOTA and baseline results on AffectNet dataset

Method	Accuracy	Reference
SL + SSL in-panting-pl (B0)	61.72	[24]
Distilled student	61.60	[27]
Multi-task EfficientNet-B0	61.32	[25]
RW loss	61.03	[7]
PSR	60.68	[29]
Baseline (Weighted Loss)	58.0	[19]

For the human experiment, hand-picked images from the AffectNet validation set were used (see Fig 1 for example images). For each of the eight facial expression categories, 35 images were chosen that were deemed to be good representatives of the intended expressions. The total number of images was therefore 280. All computational experiments used the full training set of AffectNet and the validation set minus our 280 images.

Refer to caption — Figure 1: Sample images used for human experiment

4 Human experiment

4.1 Participants

24 Korea University undergraduates were recruited by an online advertisement (12 female, mean age $23.75\pm 3.33$ years (SD) - all had normal or corrected-to-normal vision). Of the 24 participants, two had to be excluded from statistical analyses; one being an outlier (overall accuracy more than 3 $\sigma$ lower compared to the sample mean), and the other having skipped an entire block due to mistake.

4.2 Methods and task

The 280 experimental images were blurred by an opencv function (cv2.blur) with $k=70$ , and converted to grayscale, so that the expression could not be recognized by just looking at the blurred image. Human participants were asked to click on these blurred images to reveal circular parts. The revealed parts were not blurred and in original colors. From the revealed parts, participants had to determine which expression the picture portrayed. There was no limit to the number of clicks, but the participants were asked to make as few clicks as possible. Participants were seated in a quiet room in front of a monitor at a distance of roughly 57 cms. The faces subtended roughly 4.5^∘ of visual angle.

The response options were given in a two-alternative forced-choice ( $2\mathrm{AFC}$ ) paradigm, one being the correct label and the other the false label. For one picture, the false label was fixed across all participants, so there was a fixed picture set for a pair of labels; such as ’happy’ versus ’sad’. Moreover, participants were instructed to pay attention to the option pair before clicking on the image and use that information to guide clicks. This instruction was given in order to find key regions for discriminating between a pair of expressions. For example, the mouth region is crucial for discriminating ’fear’ from ’surprise’ [12]. There was a total of 280 trials, as 35 images were selected from each of the 8 categories. Order of trial was randomized for each participant. The trials were split into 4 sessions of 70 trials each with breaks in-between.

4.3 Results

The average accuracy across all participants in this 2AFC task was 83.9%. The confusion matrix (Figure 2) illustrates the response pattern of participants. The numbers in the cells are actual normalized values, but the colormap was based upon square roots of the values in order to highlight the differences among non-diagonal values. The confusion pattern shows that some pairs of emotions are confused more often than others: ’contempt’ is mistaken for ’neutral’, ’fear’ for ’surprise’, ’anger’ for ’disgust’, and ’disgust’ for ’contempt’.

Next, we plotted the accuracy for each pair of expressions as a heatmap (Figure 3(a)). ’True’ labels (on the y axis) mean the true label for a given image, and ’false’ labels (on the x axis) are the false options in the $2\mathrm{AFC}$ experiment. This matrix is not symmetric since an image with ’happy’ as true label and ’sad’ as false label is qualitatively different from an image with ’sad’ as true label and ’happy’ as false label. Same-label pairs do not exist in the experiment, but were included as empty cells for a more legible visualization of pair structures. We also explored two other variables as a function of expression pairs: the number of clicks before label decision for one image, and the time taken between the first click and the label decision of one image (Figures 3(b),3(c)). As in the accuracy heatmap, same-label pairs were included as empty cells. A clear positive correlation was observed between number of clicks and time, $r=.93$ , $p<.001$ . Significant negative correlations were observed between number of clicks and accuracy, $r=-.56$ , $p<.001$ , and between time and accuracy, $r=-.65$ , $p<.001$ .

Next, we analyzed the pattern of clicks to investigate the strategies participants used. For visualization, clicks were color-coded according to their sequence: the first click was coded in red, the last click in yellow, and the clicks in between were given interpolated colors. We first obtained the colored click map for one picture, clicked by one participant, and averaged the click maps from all participants on the same picture, given that the label choice was correct. We found that participants mostly click the left eye first, then the right eye, and finally the mouth. In the case of images with low accuracy, the last few clicks were often around the left eye; which means people go back to the left eye when the stimulus is difficult to classify. Figure 4(a) is an example of an image with high accuracy, where the trend to start from the left eye and to end at the mouth is clear. Figure 4(b) is an example of an image with low accuracy, where the last click is often on the left eye.

5 Algorithms

5.1 Model training

To compare the human FER results with those of our DNN, the multiclass problem of classifying eight expression categories of the AffectNet dataset was split into a set of binary classification problems. Specifically, there were 28 classifiers (all possible pairs in eight emotions; ${8\choose 2}$ ) such as ’happy’ versus ’sad’, etc. For all classifiers, a ResNet-50 model was used with cross-entropy-loss and Adam optimizer with a base learning rate of 0.0001. Images were augmented by horizontal flip, small degrees of shifting, scaling, rotating, and changes in brightness and contrast. The batch size was 64. The classifiers were trained with original AffectNet training dataset restricted to the two categories in question, and the larger class was undersampled to match the smaller class. Therefore, the size of the dataset was different for each model. Because of this difference, all models were trained until they reached training accuracy of 90 (with maximum epoch of 150), rather than for a fixed number of epochs, to enable fair comparison.

The trained binary classifiers were fine-tuned with the click-revealed images used in the human experiment. There were only 10 source images for each classifier, but as there were 22 participants, the size of fine-tuning dataset could be as large as 220, if the accuracy was 100 percent. The images were also augmented by shifting the click mask by one pixel in eight directions relative to the image. Lastly, we also gave unmasked versions of hand-picked images to the models to prevent catastrophic forgetting. These unmasked, or original, images were augmented in the same manner that was used in pretraining.

5.2 Model results

We ensembled the prediction results from 28 binary classifiers by simple vote and weighted vote methods (Figure 5(a), 5(b)). In the simple vote method, each classifier votes for a class for each test sample. The class with the most votes becomes the predicted label. In the weighted vote method, the largest output value from the fully connected layer of each classifier becomes the weighted vote [9]. This value reflects the confidence of the model. With the train-up-to-90 classifiers, accuracy was 49% for simple vote and a similar level of 50% for weighted vote.

Additionally, we tested a set of classifiers trained for 30 epochs to compare the pattern of prediction (Figure 6(a), 6(b)). The performance was similar, with 48% for simple vote and 49% for weighted vote. However, classifiers trained for a set number of epochs showed a greater bias towards ’happy’ in the confusion matrix. This tendency was more pronounced in the weighted vote method than in simple vote. This is because ’happy’, the representation of which is learned in a relatively short time, gave high confidence while the confidence was low for other expressions.

We also trained a multiclass model for comparison (Figure 7). In this model, we implemented weighted cross entropy loss instead of undersampling. The number of epochs was 40, with other hyperparameters staying the same. The total accuracy of multiclass model was 54%, which is higher than the ensemble model. However, the two minor classes, ’disgust’ and ’contempt’, showed higher recall rates for the ensemble model (Figure 5(b)). The pattern of confusion of the multiclass model was similar to Figure 6(a), where each pair is trained for the same number of epochs and the simple vote method is used. Overall, the ensemble method showed relatively similar performance across different classes, although the average accuracy dropped by 4%.

Lastly, we fine-tuned the models with masked images. Contrary to initial expectations, the overall accuracy decreased, to 43% for simple vote and 44% for weighted vote. Varying the ratio of masked and unmasked images did not improve performance. In the confusion matrix (Figure 8(a), 8(b)), we observed a strong bias towards the ’neutral’ expression.

5.3 Comparing humans and models

We first looked at correlations of the confusion matrices for the different computational models and the human confusion matrix shown in Figure 2. There was a positive correlation between human and trained-up-to-accuracy-90 simple-vote model (Figure 5(a)), $r=.93$ , $p<.001$ , and weighted-vote model (Figure 5(b)), $r=.92$ , $p<.001$ . We also found a positive correlation between human and trained-for-30-epochs simple-vote model (Figure 6(a)), $r=.91$ , $p<.001$ , and weighted-vote model (Figure 5(b)), $r=.80$ , $p<.001$ . Lastly, we found a positive correlation between human and multiclass model (Figure 7), $r=.91$ , $p<.001$ . This result supports our claim that although the ensemble model has lower average accuracy than multiclass model, its pattern of confusion is slightly more similar to that of humans than multiclass model.

We visualized the activations of pretrained models with three visualization techniques: CAM [30], GradCAM [28] and Extremal Perturbation [8]. Figure 9 demonstrates visualizations of each method over the same image.

To see how similar each method is to the human attention map, we computed dice coefficients between human attention maps and model saliency maps. The dice coefficient is given by two times the area of overlap between two binary masks, divided by the total number of 1’s in both masks. The attention, or saliency, maps were averaged within expression pairs, normalized and scaled to integer values between 0 and 255. The averaged masks were thresholded at value of 50; that is, values below 50 were set to 0 and values over 50 were set to 1.

Figure 10 illustrates dice coefficients of three different visualization methods in a box plot. If the dice coefficient is close to $1$ , it means the method is similar to human attention maps. The plot reveals that Extremal Perturbation has the highest mean dice coefficient value. One-way analysis of variance (ANOVA) verifies this observation, as there is a highly significant difference among variables, $F=85.45$ , $p<.001$ . Pairwise Tukey analysis reveals that Extremal Perturbation had significantly higher dice coefficients than CAM, $t=10.31$ , $p<.001$ , and GradCAM, $t=12.12$ , $p<.001$ . There was no significant difference between CAM and GradCAM, $t=1.81$ , $p>.05$ . Additional analysis shows that the effect of facial expression was not significant, $F=1.34$ , $p>.05$ and hence that all eight expressions were more similar at equivalent levels for Extremal Perturbation compared to the other two methods.

6 Conclusion and future work

Our work compared human attention maps represented by click data, and model saliency maps using three different visualization methods: CAM, GradCAM, and Extremal Perturbation. We found that Extremal Perturbation had the best fit with the human attention. It will be interesting to extend this comparison also to other, more standard N-AFC FER tasks, in which humans need to disambiguate between more than two expressions.

Our computational experiments showed that the ensemble models of binary classifiers for FER did not perform as well as the standard multiclass model. However, by training the binary classifiers until they reach training accuracy of 90% and combining the classification results by weighted vote method, we obtained a model that is less biased towards major expressions. Interestingly, we failed to improve the model using attentional information from humans as using the masked images as fine-tuning dataset proved an inadequate method for guiding the attention of a CNN-based model - this is in some way similar to the work by [26] who also found little to no improvement when using masked images. In the future, we will work on incorporating the attention mechanism to our model to channel the model’s attention to meaningful regions more effectively.

Lastly, our participants pool was limited in that the participants were all East Asians. According to previous research [12], East Asians tend to focus only on the eyes compared to Western Caucasians, and are thus less accurate at discriminating between ’surprise’ and ’fear’, ’anger’ and ’disgust’, respectively. In Figure 4, we can actually see some evidence for this, with the mouth region often being the last to be revealed, which may be in line with this aforementioned research. Future experiments with Western Caucasian participants may show a different pattern of results.

7 Acknowledgments

This work was supported by Institute of Information Communications Technology Planning Evaluation (IITP; No. 2019-0-00079, Department of Artificial Intelligence, Korea University) and National Research Foundation of Korea (NRF; NRF-2017M3C7A1041824) grant funded by the Korean government (MSIT).

References

[1] Barrett, L.F., Adolphs, R., Marsella, S., Martinez, A.M., Pollak, S.D.: Emotional expressions reconsidered: Challenges to inferring emotion from human facial movements. Psychological science in the public interest 20(1), 1–68 (2019)
[2] Barsoum, E., Zhang, C., Ferrer, C.C., Zhang, Z.: Training deep networks for facial expression recognition with crowd-sourced label distribution. In: Proceedings of the 18th ACM International Conference on Multimodal Interaction. pp. 279–283 (2016)
[3] Buciluǎ, C., Caruana, R., Niculescu-Mizil, A.: Model compression. In: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. pp. 535–541 (2006)
[4] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. pp. 248–255. Ieee (2009)
[5] Egner, S., Reimann, S., Hoeger, R., Zangemeister, W.H.: Attention and information acquisition: Comparison of mouse-click with eye-movement attention tracking. Journal of Eye Movement Research 11(6), 1–27 (2018)
[6] Ekman, P., Keltner, D.: Universal facial expressions of emotion. Segerstrale U, P. Molnar P, eds. Nonverbal communication: Where nature meets culture pp. 27–46 (1997)
[7] Fan, X., Deng, Z., Wang, K., Peng, X., Qiao, Y.: Learning discriminative representation for facial expression recognition from uncertainties. In: 2020 IEEE International Conference on Image Processing (ICIP). pp. 903–907. IEEE (2020)
[8] Fong, R., Patrick, M., Vedaldi, A.: Understanding deep networks via extremal perturbations and smooth masks. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 2950–2958 (2019)
[9] Galar, M., Fernández, A., Barrenechea, E., Bustince, H., Herrera, F.: An overview of ensemble methods for binary classifiers in multi-class problems: Experimental study on one-vs-one and one-vs-all schemes. Pattern Recognition 44(8), 1761–1776 (2011)
[10] Gendron, M., Roberson, D., van der Vyver, J.M., Barrett, L.F.: Perceptions of emotion from facial expressions are not culturally universal: evidence from a remote culture. Emotion 14(2), 251 (2014)
[11] Goodfellow, I.J., Erhan, D., Carrier, P.L., Courville, A., Mirza, M., Hamner, B., Cukierski, W., Tang, Y., Thaler, D., Lee, D.H., et al.: Challenges in representation learning: A report on three machine learning contests. In: International conference on neural information processing. pp. 117–124. Springer (2013)
[12] Jack, R.E., Blais, C., Scheepers, C., Schyns, P.G., Caldara, R.: Cultural confusions show that facial expressions are not universal. Current biology 19(18), 1543–1548 (2009)
[13] Jack, R.E., Garrod, O.G., Yu, H., Caldara, R., Schyns, P.G.: Facial expressions of emotion are not culturally universal. Proceedings of the National Academy of Sciences 109(19), 7241–7244 (2012)
[14] Kim, N.W., Bylinskii, Z., Borkin, M.A., Gajos, K.Z., Oliva, A., Durand, F., Pfister, H.: Bubbleview: an interface for crowdsourcing image importance maps and tracking visual attention. ACM Transactions on Computer-Human Interaction (TOCHI) 24(5), 1–40 (2017)
[15] Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems 25, 1097–1105 (2012)
[16] Li, S., Deng, W.: Deep facial expression recognition: A survey. IEEE Transactions on Affective Computing (2020)
[17] Lucey, P., Cohn, J.F., Kanade, T., Saragih, J., Ambadar, Z., Matthews, I.: The extended cohn-kanade dataset (ck+): A complete dataset for action unit and emotion-specified expression. In: 2010 ieee computer society conference on computer vision and pattern recognition-workshops. pp. 94–101. IEEE (2010)
[18] Meng, D., Peng, X., Wang, K., Qiao, Y.: Frame attention networks for facial expression recognition in videos. In: 2019 IEEE International Conference on Image Processing (ICIP). pp. 3866–3870. IEEE (2019)
[19] Mollahosseini, A., Hasani, B., Mahoor, M.H.: Affectnet: A database for facial expression, valence, and arousal computing in the wild. IEEE Transactions on Affective Computing 10(1), 18–31 (2017)
[20] Moon, H.J.: Facial Expression Processing with Deep Neural Networks: from Implementation to Comparison with Humans. Master’s thesis, Korea University, Seoul, Korea (2019)
[21] Ng, H.W., Nguyen, V.D., Vonikakis, V., Winkler, S.: Deep learning for emotion recognition on small datasets using transfer learning. In: Proceedings of the 2015 ACM on international conference on multimodal interaction. pp. 443–449 (2015)
[22] Nusseck, M., Cunningham, D.W., Wallraven, C., Bülthoff, H.H.: The contribution of different facial regions to the recognition of conversational expressions. Journal of vision 8(8), 1–1 (2008)
[23] Pham, L., Vu, T.H., Tran, T.A.: Facial expression recognition using residual masking network. In: 2020 25th International Conference on Pattern Recognition (ICPR). pp. 4513–4519. IEEE (2021)
[24] Pourmirzaei, M., Esmaili, F., Montazer, G.A.: Using self-supervised co-training to improve facial representation. arXiv preprint arXiv:2105.06421 (2021)
[25] Savchenko, A.V.: Facial expression and attributes recognition based on multi-task learning of lightweight neural networks. arXiv preprint arXiv:2103.17107 (2021)
[26] Schiller, D., Huber, T., Dietz, M., André, E.: Relevance-based data masking: a model-agnostic transfer learning approach for facial expression recognition. Frontiers in Computer Science 2(6) (2020)
[27] Schoneveld, L., Othmani, A., Abdelkawy, H.: Leveraging recent advances in deep learning for audio-visual emotion recognition. Pattern Recognition Letters (2021)
[28] Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-cam: Visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE international conference on computer vision. pp. 618–626 (2017)
[29] Vo, T.H., Lee, G.S., Yang, H.J., Kim, S.H.: Pyramid with super resolution for in-the-wild facial expression recognition. IEEE Access 8, 131988–132001 (2020)
[30] Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep features for discriminative localization. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2921–2929 (2016)