
EmotionNet Nano: An Efficient Deep Convolutional Neural Network Design for Real-time Facial Expression Recognition

James Ren Hou Lee, Linda Wang, and Alexander Wong
Department of Systems Design Engineering, University of Waterloo, Canada
Waterloo Artificial Intelligence Institute, Canada
DarwinAI Corp., Canada
{jrhlee, linda.wang, a28wong}@uwaterloo.ca
Abstract

While recent advances in deep learning have led to significant improvements in facial expression classification (FEC), a major bottleneck to the widespread deployment of such systems is their high architectural and computational complexity. This is especially challenging given the operational requirements of various FEC applications, such as safety, marketing, learning, and assistive living, where real-time operation on low-cost embedded devices is desired. Motivated by this need for a compact, low-latency, yet accurate system capable of performing FEC in real-time on low-cost embedded devices, this study proposes EmotionNet Nano, an efficient deep convolutional neural network created through a human-machine collaborative design strategy, where human experience is combined with machine meticulousness and speed in order to craft a deep neural network design catered towards real-time embedded usage. Two different variants of EmotionNet Nano are presented, each with a different trade-off between architectural and computational complexity and accuracy. Experimental results using the CK+ facial expression benchmark dataset demonstrate that the proposed EmotionNet Nano networks achieve accuracies comparable to state-of-the-art FEC networks, while requiring significantly fewer parameters (e.g., 23× fewer than [26] at a higher accuracy). Furthermore, we demonstrate that the proposed EmotionNet Nano networks achieve real-time inference speeds (e.g., >25 FPS and >70 FPS at 15W and 30W, respectively) and high energy efficiency (e.g., >1.7 images/sec/watt at 15W) on an ARM embedded processor, thus further illustrating the efficacy of EmotionNet Nano for deployment on embedded devices.

1 Introduction

Figure 1: Examples of six different expressions, taken from the CK+ dataset, that need to be distinguished via facial expression classification: A) anger, B) disgust, C) fear, D) happiness, E) sadness, and F) surprise.

Facial expression classification (FEC) is an area of computer vision that has benefited significantly from the rapid advances in machine learning, which have enabled data collections comprising a diversity of facial expressions captured from different individuals to be leveraged to learn classifiers for differentiating between facial expression types. In particular, deep learning applied to FEC has led to significant improvements in accuracy under complex conditions, such as varying lighting, angle, or occlusion.

Even though the performance of deep learning-based FEC systems continues to rise, widespread deployment of such systems remains limited, with one of the biggest hurdles being the high architectural and computational complexity of the deep neural networks that drive such systems. This hurdle is particularly limiting for real-time embedded scenarios, where low-latency operation is required on low-cost embedded devices. For example, in the area of assistive technologies for improving quality of life, the majority of individuals using such technologies are unwilling to carry large, bulky, and expensive devices with them during their daily lives, as doing so would greatly limit their ability to leverage the technologies in a seamless manner. As such, assistive devices must leverage small, low-cost, embedded processors, yet provide low latency to enable real-time feedback to the user. Another example is in-car driver monitoring [12], where an FEC system would observe the driver, determine their current mental state, and warn them if their awareness level is deteriorating. In cases such as these, a difference of a few milliseconds of processing can be paramount for the safety of not only the user, but also other drivers on the road. In applications in fields such as marketing or security, real-time processing is important to provide salespeople or security guards with immediate feedback so that an appropriate response can be made as soon as possible. For those relying on software assistance for social purposes, information is required without delay in order to keep a conversation alive and avoid discomfort for either party.

Motivated by the desire to design deep neural network architectures catered for real-time embedded facial expression recognition, in this study we explore the efficacy of a human-machine collaborative design strategy that combines human experience and ingenuity with the raw speed and meticulousness of machine-driven design exploration, in order to find the optimal balance between accuracy and architectural and computational complexity. The resulting deep neural network architecture, which we call EmotionNet Nano, is specifically tailored for real-time embedded facial expression recognition and created via a two-phase design strategy. The first phase leverages residual architecture design principles to capture the complex nuances of facial expressions, while the second phase employs machine-driven design exploration to generate the final tailor-made architecture design that achieves high architectural and computational efficiency while maintaining high performance. We present two variants of EmotionNet Nano, each with a different trade-off between accuracy and complexity, and evaluate both variants on the CK+ [17] benchmark dataset against state-of-the-art facial expression classification networks.

The paper is organized as follows. Section 2 discusses related work in the area of facial expression classification and efficient deep neural network architecture design. Section 3 presents in detail the methodology leveraged to design the proposed EmotionNet Nano. Section 4 presents in detail the network architecture of EmotionNet Nano and explores interesting characteristics observed in the overall design. Section 5 presents the experiments conducted to evaluate the efficacy of EmotionNet Nano in terms of accuracy, architectural complexity, speed, and energy efficiency. Section 6 provides a discussion on not only performance but also social implications of EmotionNet Nano. Finally, conclusions are drawn and future directions are discussed in Section 7.

2 Related Work

A variety of deep neural network architectures have been proposed for FEC, ranging from deep convolutional neural networks (DCNNs) to recurrent neural networks (RNNs) [6] to long short-term memory (LSTM) networks [24], but those introduced in the literature have generally required significant architectural complexity and computational power in order to detect and interpret the nuances of human facial expressions. As an alternative to deep learning, strategies leveraging other machine learning methods such as Support Vector Machines (SVMs) [18] and hand-crafted features such as Local Binary Patterns (LBP) [8, 23], dense optical flow [1], Histogram of Oriented Gradients (HOG) [15], or the Facial Action Coding System [5] have also been explored in the literature, but have generally been shown to achieve lower accuracy than deep learning-based approaches, which can better learn the subtle differences that exist between human facial expressions.

To mitigate the aforementioned hurdle and improve widespread adoption of powerful deep learning-driven approaches for FEC in real-world applications, a key direction worth exploring is the design of highly efficient deep neural network architectures tailored for the task of real-time embedded facial expression recognition. A number of strategies for designing highly efficient architectures have been explored. One strategy is reducing the depth of the neural network architecture [13] to reduce computational and architectural complexity; more specifically, neural networks with a depth of just five layers were leveraged to learn discriminating facial features. Another strategy is reducing the input resolution of the neural network architecture, with Shan et al. [23] showing that FEC can be performed even at image resolutions as low as 14×19 pixels, which can further reduce the number of operations required for inference by a large margin. Despite the architectural and computational efficiencies gained by such strategies, they typically lead to noticeable reductions in facial expression classification accuracy, and as such, alternative strategies that enable a better balance between accuracy, architectural complexity, and computational complexity are highly desired.

More recently, there has been a focus on human-driven design principles for efficient deep neural network architecture design, ranging from depth-wise separable convolutions [2] to Inception [25] macroarchitectures to residual connections [11]. Such design principles can substantially improve FEC performance while reducing architectural complexity [22]. However, despite the improvements gained in architectural efficiency, one challenge with human-driven design principles is that it is quite time-consuming and difficult for humans to hand-craft efficient neural network architectures tailored for specific applications such as FEC that possess a strong balance between high accuracy, fast inference speed, and low memory footprint, primarily due to the sheer complexity of neural network behaviour under different architectural configurations.

In an attempt to address this challenge, neural architecture search (NAS) strategies have been introduced to automate the model architecture engineering process by finding the maximally performing network design from all possible network designs within a search space. However, given the infinitely large search space within which the optimal network architecture may exist, significant human effort is often required to design the search space in a way that reduces it to a feasible size, as well as to define a search strategy that can run within desired operational constraints and requirements in a reasonable amount of time. Therefore, a way to combine both human-driven design principles and machine-driven design exploration is highly desired and can lead to efficient architecture designs catered specifically to FEC.

3 Methods

In this study, we present EmotionNet Nano, a highly efficient deep convolutional neural network architecture design for the task of real-time facial emotion classification for embedded scenarios. EmotionNet Nano was designed using a human-machine collaborative strategy in order to leverage both human experience as well as the meticulousness of machines. The human-machine collaborative design strategy leveraged to create the proposed EmotionNet Nano network architecture design is comprised of two main design stages: i) principled network design prototyping, and ii) machine-driven design exploration.

3.1 Principled Network Design Prototyping

In the first design stage, an initial network design prototype, $\varphi$, was designed using human-driven design principles in order to guide the subsequent machine-driven design exploration stage. In this study, the initial network design prototype of EmotionNet Nano leveraged residual architecture design principles [11], as they were previously demonstrated to achieve strong performance on a variety of recognition tasks. More specifically, the presence of residual connections within a deep neural network architecture has been shown to provide a good solution to both the vanishing gradient and curse of dimensionality problems. Residual connections also enable networks to learn faster and more easily, with little additional cost to architectural or computational complexity. Additionally, as the network architecture depth increases, each consecutive layer should perform no worse than the previous layer due to the identity mapping option. As a result, residual network architecture designs have been shown to work well for the problem of FEC [9, 28, 13]. In this study, the final stages of the initial network design prototype, $\varphi$, consist of an average pooling operation followed by a fully connected softmax layer that produces the final expression classification results. The final macroarchitecture and microarchitecture designs of the individual modules and convolutional layers of the proposed EmotionNet Nano were left to the machine-driven design exploration stage to determine in an automatic manner. To ensure a compact and efficient real-time model catered towards embedded devices, this second stage was guided by human-specified design requirements and constraints targeting embedded devices with limited computational and memory capabilities.
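
To make the role of the prototype concrete, the following is a minimal Keras sketch of a residual block together with the average pooling and softmax head described above. All layer widths, kernel sizes, and the input resolution are illustrative assumptions; the actual EmotionNet Nano layers are produced by the machine-driven design exploration stage and differ from this hand-written example.

```python
# Minimal sketch of a residual-style design prototype (He et al. [11]).
# Layer widths, kernel sizes, and input resolution are assumptions for
# illustration only; the generated EmotionNet Nano architecture differs.
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, channels=64):
    """3x3 -> 3x3 convolution pair with an identity (or projected) shortcut."""
    shortcut = x
    y = layers.Conv2D(channels, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(channels, 3, padding="same")(y)
    # Project the shortcut with a 1x1 convolution if the channel count changes.
    if shortcut.shape[-1] != channels:
        shortcut = layers.Conv2D(channels, 1, padding="same")(shortcut)
    return layers.Activation("relu")(layers.Add()([y, shortcut]))

# Prototype head: average pooling followed by a fully connected softmax layer
# over the seven CK+ expression classes, as described above.
inputs = layers.Input(shape=(48, 48, 1))   # input resolution is an assumption
x = residual_block(inputs)
x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Dense(7, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)
```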

3.2 Machine Driven Design Exploration

Following the initial human-driven network design prototyping stage, a machine-driven design exploration stage was employed to determine the macroarchitecture and microarchitecture designs at the individual module level to produce the final EmotionNet Nano. In order to determine the optimal network architecture based on a set of human-defined constraints, generative synthesis [27] was leveraged for the purpose of machine-driven design exploration. Generative synthesis can be formulated as a constrained optimization problem, defined in Equation 1, where the goal is to find a generator $\mathcal{G}$ that, given a set of seeds $\mathcal{S}$, can generate networks $\{\mathcal{N}_s \,|\, s \in \mathcal{S}\}$ that maximize a universal performance function $\mathcal{U}$ while satisfying constraints defined in an indicator function $1_r(\cdot)$,

$$\mathcal{G} = \max_{\mathcal{G}} \; \mathcal{U}(\mathcal{G}(s)) \quad \text{subject to} \quad 1_r(\mathcal{G}(s)) = 1, \;\; \forall s \in \mathcal{S} \qquad (1)$$

As such, given a human-defined indicator function $1_r(\cdot)$ and an initial network design prototype $\varphi$, generative synthesis is guided towards learning generative machines that generate networks within the human-specified constraints.

An important factor in leveraging generative synthesis for machine-driven design exploration is to define the operational constraints and requirements based on the desired task and scenario in a quantitative manner via the indicator function $1_r(\cdot)$. In this study, in order to learn a compact yet highly efficient facial expression classification network architecture, the indicator function $1_r(\cdot)$ was set up such that: i) accuracy $\geq$ 92% on CK+ [17], and ii) network architecture complexity $\leq$ 1M parameters. These constraint values were chosen to explore how compact a network architecture for facial expression classification can be while still maintaining sufficient classification accuracy for use in real-time embedded scenarios. As such, we use the accuracy of Feng & Ren [7] as the reference baseline for determining the accuracy constraint in the indicator function.
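
As a rough illustration, the indicator function described above can be expressed as a simple predicate over a candidate network, as in the sketch below; the evaluation helpers are hypothetical placeholders for this illustration and are not part of the generative synthesis framework.

```python
# Hedged sketch of the indicator function 1_r(.): a candidate network passes
# only if it reaches at least 92% accuracy on CK+ and uses at most 1M
# parameters. `evaluate_on_ckplus` and `count_parameters` are placeholders.
def indicator(candidate_network) -> int:
    accuracy = evaluate_on_ckplus(candidate_network)   # 10-fold CV accuracy (placeholder)
    params = count_parameters(candidate_network)       # trainable parameter count (placeholder)
    return int(accuracy >= 0.92 and params <= 1_000_000)
```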

4 EmotionNet Nano Architecture

Figure 2: EmotionNet Nano Architecture. The network architecture exhibits high macroarchitecture and microarchitecture heterogeneity, customized towards capturing deep facial features. Furthermore, the network architecture exhibits selective long-range connectivity throughout. The number of channels per layer is based on EmotionNet Nano-B.

The network architecture of the proposed EmotionNet Nano is shown in Figure 2. A number of notable characteristics of the proposed EmotionNet Nano network architecture design are worth discussing as they give insights into architectural mechanisms that strike a strong balance between complexity and accuracy.

4.1 Architectural heterogeneity

A notable characteristic of the architecture that allows the network to achieve high efficiency even with a low number of parameters is its macroarchitecture and microarchitecture heterogeneity. Unlike hand-crafted architecture designs, the macroarchitecture and microarchitecture designs within the EmotionNet Nano network architecture, as generated via machine-driven design exploration, differ greatly from layer to layer. For instance, there is a mix of convolution layers with varying kernel shapes and different numbers of channels per layer, depending on the needs of the network. As shown in Figure 2, a greater number of channels is needed as the spatial size of the feature maps decreases.

The benefit of high microarchitecture and macroarchitecture heterogeneity in the EmotionNet Nano network architecture is that it enables different parts of the network architecture to be tailored to achieve a very strong balance between architectural and computational complexity while maintaining model expressiveness in capturing necessary features. The architectural diversity in EmotionNet Nano demonstrates the advantage of leveraging a human-machine collaborative design strategy, as it would be difficult for a human designer, or for other design exploration methods, to customize a network architecture to the same level of architectural granularity.

4.2 Selective long-range connectivity

Another notable characteristic of the EmotionNet Nano network architecture is that it exhibits selective long-range connectivity throughout the network architecture. The use of long-range connectivity in a very selective manner enables a strong balance between model expressiveness, ease of training, and computational complexity. Most interesting and notable is the presence of two densely connected $1\times 1$ convolution layers that take in outputs from multiple $3\times 3$ convolution layers as input, with their outputs connected to later layers farther down the network. Such a $1\times 1$ convolution layer design provides dimensionality reduction while retaining salient features of the channels through channel mixing, thus further improving architectural and computational efficiency while maintaining strong model expressiveness.
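
A minimal sketch of this pattern is shown below: outputs from several $3\times 3$ convolution layers are concatenated and compressed by a $1\times 1$ convolution before being passed to later layers. This is an illustrative construction only, not the generated EmotionNet Nano layers themselves, and the output channel count is an assumption.

```python
# Illustrative sketch of a densely connected 1x1 convolution that mixes and
# compresses the channels of several 3x3 convolution outputs before feeding
# layers farther down the network. Channel counts are assumptions.
from tensorflow.keras import layers

def pointwise_fusion(feature_maps, out_channels=32):
    """Concatenate multiple feature maps and reduce dimensionality via a 1x1 conv."""
    fused = layers.Concatenate()(feature_maps)      # gather long-range inputs
    return layers.Conv2D(out_channels, 1, activation="relu")(fused)  # channel mixing / reduction
```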

5 Experimental Results

To evaluate the efficacy of the proposed EmotionNet Nano, we examine the network complexity, computational cost and classification accuracy against other facial expression classification networks on the CK+ [17] dataset, which is the most extensively used laboratory-controlled FEC benchmark dataset [16, 21].

5.1 Dataset

The Extended Cohn-Kanade (CK+) [17] dataset contains 593 video sequences from a total of 123 different subjects, ranging from 18 to 50 years of age with a variety of genders and heritages. Each video shows a facial shift from the neutral expression to a targeted peak expression, recorded at 30 frames per second (FPS) with a resolution of either 640×490 or 640×480 pixels. Out of these videos, 327 are labelled with one of seven expression classes: anger, contempt, disgust, fear, happiness, sadness, and surprise. The CK+ database is widely regarded as the most extensively used laboratory-controlled FEC database available, and is used in the majority of facial expression classification methods [16, 21]. Figure 3 shows that the CK+ dataset has good diversity for each expression type, which is important from an evaluation perspective. However, as the CK+ dataset does not provide specific training, validation, and test set splits, a mixture of splitting techniques can be observed in the literature. For experimental consistency, we adopt the most common dataset creation strategy, where the last three frames of each sequence are extracted and labelled with the video label [16]. In this study, we performed subject-independent 10-fold cross-validation on the resulting 981 facial expression images.
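
The following sketch illustrates this dataset preparation and subject-independent splitting procedure, assuming a hypothetical loader that yields each labelled CK+ sequence together with its subject identifier; the loader is not part of the CK+ distribution.

```python
# Sketch of the frame selection and subject-independent split described above.
# `load_ck_sequences` is a hypothetical placeholder returning, per labelled
# sequence, (list_of_frames, subject_id, expression_label).
import numpy as np
from sklearn.model_selection import GroupKFold

sequences = load_ck_sequences()   # placeholder loader, not provided by CK+

images, labels, subjects = [], [], []
for frames, subject_id, label in sequences:
    for frame in frames[-3:]:                 # keep the last three (peak) frames
        images.append(frame)
        labels.append(label)
        subjects.append(subject_id)

images, labels, subjects = map(np.array, (images, labels, subjects))

# Subject-independent 10-fold cross-validation: all frames from a given subject
# fall into the same fold, so no identity appears in both training and testing.
for train_idx, test_idx in GroupKFold(n_splits=10).split(images, labels, groups=subjects):
    x_train, y_train = images[train_idx], labels[train_idx]
    x_test, y_test = images[test_idx], labels[test_idx]
    # ... train and evaluate one fold here ...
```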

Figure 3: Diversity of expressions in the CK+ dataset. Example faces for each expression type in CK+ are shown. Contempt is not included, as the relevant subjects did not give publication consent.

5.2 Implementation Details

EmotionNet Nano was trained for 200 epochs using an initial learning rate of 1e-3, multiplied by 1e-1, 1e-2, 1e-3, and 0.5e-3 at epochs 81, 121, 161, and 181, respectively. Categorical cross-entropy loss was used with the Adam [14] optimizer. Data augmentation was applied to the inputs, including rotation, width and height shifts, zoom, and horizontal flips. Following this initial training, we leveraged a machine-driven exploration stage to fine-tune the network specifically for the task of FEC. Training was performed using a GeForce RTX 2080 Ti GPU. The Keras [3] library was leveraged for this study.
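
A sketch of this training configuration in Keras is shown below. The model, training data, batch size, and augmentation magnitudes are assumptions made for illustration; only the schedule milestones, optimizer, loss, and epoch count follow the description above.

```python
# Sketch of the training setup described above; `model`, `x_train`, and
# `y_train` are assumed to be defined elsewhere. Augmentation magnitudes and
# batch size are assumptions, not values stated in the text.
import tensorflow as tf
from tensorflow.keras.callbacks import LearningRateScheduler
from tensorflow.keras.preprocessing.image import ImageDataGenerator

BASE_LR = 1e-3

def schedule(epoch, lr):
    # Initial LR of 1e-3, scaled by 1e-1, 1e-2, 1e-3, and 0.5e-3 at
    # epochs 81, 121, 161, and 181, respectively.
    if epoch >= 181:
        return BASE_LR * 0.5e-3
    if epoch >= 161:
        return BASE_LR * 1e-3
    if epoch >= 121:
        return BASE_LR * 1e-2
    if epoch >= 81:
        return BASE_LR * 1e-1
    return BASE_LR

augmenter = ImageDataGenerator(rotation_range=10, width_shift_range=0.1,
                               height_shift_range=0.1, zoom_range=0.1,
                               horizontal_flip=True)

model.compile(optimizer=tf.keras.optimizers.Adam(BASE_LR),
              loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(augmenter.flow(x_train, y_train, batch_size=64),
          epochs=200, callbacks=[LearningRateScheduler(schedule)])
```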

5.3 Performance Evaluation

Two variants of EmotionNet Nano were created to examine different trade-offs between architectural and computational complexity and accuracy. In order to demonstrate the efficacy of the proposed models in a quantitative manner, we compare the performance of both variants against state-of-the-art facial expression classification networks introduced in the literature, as shown in Table 1. It can be observed that both EmotionNet Nano-A and Nano-B achieve strong classification accuracy, with EmotionNet Nano-A in particular achieving accuracy comparable to the highest-performing state-of-the-art networks that are more than an order of magnitude larger. While EmotionNet Nano-B has lower accuracy than the highest-performing networks, it is still able to achieve accuracy comparable to [7] while being three orders of magnitude smaller. A more detailed discussion of the performance comparison is provided in the next section; overall, it can be observed that both EmotionNet Nano variants provide the strongest balance between accuracy and complexity, making them well-suited for embedded scenarios.

Table 1: Comparison of facial expression classification networks on the CK+ dataset. We report 10-fold cross-validation average accuracy on the CK+ dataset with 7 classes (anger, contempt, disgust, fear, happiness, sadness, and surprise).
Method                 | Params (M) | Accuracy (%)
Ouellet [20]           | 58         | 94.4
Feng & Ren [7]         | 332        | 92.3
Wang & Gong [26]       | 5.4        | 97.2
Otberdout et al. [19]  | 11         | 98.4
EmotionNet Nano-A      | 0.232      | 97.6
EmotionNet Nano-B      | 0.136      | 92.7

The distribution of expressions in CK+ is unequal, which results in an unbalanced dataset for both training and testing. The effects of this are most apparent when classifying the contempt or fear expressions, both of which are underrepresented in CK+ (e.g., there are only 18 examples of contempt, whereas there are 83 examples of surprise). Due to the nature of human facial expressions, similarities between expressions do exist, but the networks are generally able to learn the high-level distinguishing features that separate one expression from another. However, incorrect classifications can still occur, as shown in Figure 4, where a “disgust” expression is falsely predicted to be “anger.”

Figure 4: Example expression predictions of faces in the CK+ dataset using EmotionNet Nano-A. Five of the faces are classified correctly, indicated in green, with an example of a misclassified expression (disgust), shown in red.

5.4 Speed and Energy Efficiency

We also perform a speed and energy efficiency analysis, shown in Table 2, to demonstrate the efficacy of EmotionNet Nano in real-time embedded scenarios. Here, an ARM v8.2 64-Bit RISC embedded processor was used for evaluation. Referring to Table 2, both EmotionNet Nano variants are able to perform inference at >25 FPS and >70 FPS on the tested embedded processor at 15W and 30W, respectively, which more than fulfills a real-time system constraint. In terms of energy efficiency, both EmotionNet Nano variants demonstrated high power efficiency, with the Nano-B variant achieving 2.19 images/sec/watt at 15W and 2.43 images/sec/watt at 30W on the embedded processor.

Table 2: EmotionNet Nano Speed and Energy Efficiency. All metrics are computed on an ARM v8.2 64-Bit RISC embedded processor at different power levels.
Model              | FPS (15W) | Images/s/watt (15W) | FPS (30W) | Images/s/watt (30W)
EmotionNet Nano-A  | 25.8      | 1.72                | 70.1      | 2.34
EmotionNet Nano-B  | 32.8      | 2.19                | 72.9      | 2.43
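
For reference, a simple way to estimate frames-per-second and images/sec/watt figures of the kind reported in Table 2 is sketched below; the power value is taken to be the device's configured power budget (15W or 30W), which is an assumption about the measurement protocol rather than a detail stated in this study.

```python
# Hedged sketch of FPS and images/sec/watt measurement for a Keras model.
# `model` is assumed to exist; the input shape and iteration count are
# assumptions for illustration.
import time
import numpy as np

def benchmark(model, power_watts, num_images=1000, input_shape=(1, 48, 48, 1)):
    dummy = np.random.rand(*input_shape).astype(np.float32)  # placeholder input
    model.predict(dummy)                       # warm-up run, excluded from timing
    start = time.perf_counter()
    for _ in range(num_images):
        model.predict(dummy)
    elapsed = time.perf_counter() - start
    fps = num_images / elapsed
    return fps, fps / power_watts              # images/sec and images/sec/watt

fps_15w, eff_15w = benchmark(model, power_watts=15)
print(f"{fps_15w:.1f} FPS, {eff_15w:.2f} images/sec/watt at 15W")
```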

6 Discussion

In this study, we explore the human-machine collaborative design of a deep convolutional neural network architecture capable of performing facial expression classification in real-time on embedded devices. It is important to note that other extremely fast deep convolutional neural network architectures exist, such as MicroExpNet [4], which is capable of processing 1851 FPS on an Intel i7 CPU, is less than 1 MB in size, and achieves 84.8% accuracy on the CK+ 8-class problem (the 7 facial expression classes plus neutral). Although a motivating result, a direct comparison cannot be made with EmotionNet Nano or the other facial expression classification networks evaluated in this study due to the different number of classes.

Compared against state-of-the-art facial expression classification network architectures tested on CK+ using the same seven expression classes (see Figure 1), both variants of the proposed EmotionNet Nano are at least an order of magnitude smaller yet provide comparable accuracy to state-of-the-art network architectures. For example, EmotionNet Nano-A is >23× smaller than [26], yet achieves higher accuracy by 0.4%. Furthermore, while EmotionNet Nano-A achieves an accuracy that is 0.8% lower than the top-performing [19], it possesses >47× fewer parameters. In the case of EmotionNet Nano-B, it achieved higher accuracy (by 0.4%) than [7] while having three orders of magnitude fewer parameters.

Looking at the experimental results on inference speed and energy efficiency on an embedded processor at different power levels (see Table 2), it can be observed that both variants of EmotionNet Nano achieved real-time performance and high energy efficiency. For example, EmotionNet Nano-A was able to exceed 25 FPS and 70 FPS at 15W and 30W, respectively, with energy efficiencies of 1.72 images/s/watt and 2.34 images/s/watt at 15W and 30W, respectively. This demonstrates that the proposed EmotionNet Nano is well-suited for high-performance facial expression classification in real-time embedded scenarios. An interesting observation worth noting is that while the inference speed improvement of EmotionNet Nano-B over EmotionNet Nano-A exceeds 27% at 15W, the improvement is only 4% at 30W. As such, EmotionNet Nano-B is more suitable for low-power scenarios, while at higher power levels EmotionNet Nano-A is more appropriate given its significantly higher accuracy.

6.1 Implications and Concerns

The existence of an efficient facial expression classification network capable of running in real-time on embedded devices could have an enormous impact in many fields, including safety, marketing, and assistive technologies. In terms of safety, driver monitoring and improved surveillance systems are both areas that benefit from higher computational efficiency, as it lowers the latency of event notifications and reduces the probability that a signal will be missed. With a real-time facial expression classification system in the marketing domain, companies gain access to enhanced real-time feedback when demonstrating or promoting a product, either in front of live audiences or in a storefront. The largest impact, however, is likely in the assistive technology sector, due to the increased accessibility that this efficiency provides. The majority of individuals do not have access to powerful computing devices, nor are they likely to be willing to carry a large and expensive system with them, as it would be an inconvenience to daily living.

As shown in this study, EmotionNet Nano can achieve accurate real-time performance on embedded devices at a low power budget, granting the user access to a facial expression classification system on their smartphone or similar edge device with embedded processors without rapid depletion of their battery. This can be extremely beneficial towards tasks such as depression detection, empathetic tutoring, or ambient interfaces, and can also help individuals who suffer from Autistic Spectrum Disorder better infer emotional states from facial expressions during social interaction in the form of augmented reality (see Figure 5 for a visual illustration of how EmotionNet Nano can be used to aid in conveying emotional state via an augmented reality overlay).

Figure 5: Assistive technology for Autistic Spectrum Disorder. Example of how EmotionNet Nano can be leveraged to assist individuals with Autistic Spectrum Disorder to better infer emotional states from facial expressions during social interactions in the form of augmented reality.

Although EmotionNet Nano has many positive implications, there exist concerns that must be considered before deployment. The first concern is privacy, as individuals may dislike being on camera, even if no data storage is taking place. Privacy concerns, especially those centered around filming without consent, are likely to arise if these systems start to be used in public areas. The combination of facial expression classification with facial recognition could result in unwanted targeted advertising, even though this could be seen as a positive outcome by some. Additionally, incorrect classifications could have unintended consequences. When assisting a user in an ambient interface or expression interpretation task, a misclassified expression could result in a negative experience with major consequences. For example, predicting “sad” or “angry” expressions as “happy” could influence the user to behave in an inappropriate manner. These concerns and issues are all worth further exploration and investigation to ensure that such systems are used in a responsible manner.

7 Conclusion

In this study, we introduced EmotionNet Nano, a highly efficient deep convolutional neural network design tailored for facial expression classification in real-time embedded scenarios, created by leveraging a human-machine collaborative design strategy. By combining human-driven design principles with machine-driven design exploration, the EmotionNet Nano architecture design possesses several interesting characteristics (e.g., architectural heterogeneity and selective long-range connectivity) that make it well-suited for real-time embedded usage. Two variants of the proposed EmotionNet Nano network architecture design were presented, both of which achieve a strong balance between architectural complexity and accuracy while illustrating the performance trade-offs at that scale. Using the CK+ dataset, we showed that the proposed EmotionNet Nano can achieve accuracy comparable to state-of-the-art facial expression classification networks (97.6%) with a significantly more efficient architecture design (just 232K parameters). Furthermore, we demonstrated that EmotionNet Nano can achieve real-time inference speeds on an embedded processor at different power levels, thus further illustrating its suitability for real-time embedded scenarios.

Future work involves incorporating temporal information into the proposed EmotionNet Nano design when classifying video sequences. Facial expressions are highly dynamic and transient in nature [10], meaning that information about the previous expression is valuable when predicting the current expression. Therefore, the retention of temporal information can lead to increased performance, at the expense of computational complexity. Investigating this trade-off between computational complexity and improved performance when leveraging temporal information would be worthwhile.

References

  • [1] Sarah Adel Bargal, Emad Barsoum, Cristian Canton Ferrer, and Cha Zhang. Emotion recognition in the wild from videos using images. In Proceedings of the 18th ACM International Conference on Multimodal Interaction, pages 433–436. ACM, 2016.
  • [2] François Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1251–1258, 2017.
  • [3] François Chollet et al. Keras. https://keras.io, 2015.
  • [4] İlke Çuğu, Eren Şener, and Emre Akbaş. Microexpnet: An extremely small and fast model for expression recognition from frontal face images. arXiv preprint arXiv:1711.07011, 2017.
  • [5] Paul Ekman and Wallace V. Friesen. Facial action coding system: Manual. 1978.
  • [6] Yin Fan, Xiangju Lu, Dian Li, and Yuanliu Liu. Video-based emotion recognition using CNN-RNN and C3D hybrid networks. In Proceedings of the 18th ACM International Conference on Multimodal Interaction, pages 445–450. ACM, 2016.
  • [7] Duo Feng and Fuji Ren. Dynamic facial expression recognition based on two-stream-cnn with lbp-top. In 2018 5th IEEE International Conference on Cloud Computing and Intelligence Systems (CCIS), pages 355–359. IEEE, 2018.
  • [8] SL Happy, Anjith George, and Aurobinda Routray. A real time facial expression classification system using local binary patterns. In 2012 4th International conference on intelligent human computer interaction (IHCI), pages 1–5. IEEE, 2012.
  • [9] Behzad Hasani and Mohammad H Mahoor. Facial expression recognition using enhanced deep 3d convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 30–40, 2017.
  • [10] Behzad Hasani and Mohammad H Mahoor. Facial expression recognition using enhanced deep 3D convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 30–40, 2017.
  • [11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [12] Mira Jeong and Byoung Chul Ko. Driver’s facial expression recognition in real-time for safe driving. Sensors, 18(12):4270, 2018.
  • [13] Pooya Khorrami, Thomas Paine, and Thomas Huang. Do deep neural networks learn facial action units when doing expression recognition? In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 19–27, 2015.
  • [14] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [15] Pranav Kumar, SL Happy, and Aurobinda Routray. A real-time robust facial expression recognition system using hog features. In 2016 International Conference on Computing, Analytics and Security Trends (CAST), pages 289–293. IEEE, 2016.
  • [16] Shan Li and Weihong Deng. Deep facial expression recognition: A survey. arXiv preprint arXiv:1804.08348, 2018.
  • [17] Patrick Lucey, Jeffrey F Cohn, Takeo Kanade, Jason Saragih, Zara Ambadar, and Iain Matthews. The extended cohn-kanade dataset (ck+): A complete dataset for action unit and emotion-specified expression. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition-Workshops, pages 94–101. IEEE, 2010.
  • [18] Philipp Michel and Rana El Kaliouby. Real time facial expression recognition in video using support vector machines. In Proceedings of the 5th international conference on Multimodal interfaces, pages 258–264. ACM, 2003.
  • [19] Naima Otberdout, Anis Kacem, Mohamed Daoudi, Lahoucine Ballihi, and Stefano Berretti. Automatic analysis of facial expressions based on deep covariance trajectories. IEEE transactions on neural networks and learning systems, 2019.
  • [20] Sébastien Ouellet. Real-time emotion recognition for gaming using deep convolutional network features. arXiv preprint arXiv:1408.3750, 2014.
  • [21] Maja Pantic, Michel Valstar, Ron Rademaker, and Ludo Maat. Web-based database for facial expression analysis. In 2005 IEEE international conference on multimedia and Expo, pages 5–pp. IEEE, 2005.
  • [22] Christopher Pramerdorfer and Martin Kampel. Facial expression recognition using convolutional neural networks: state of the art. arXiv preprint arXiv:1612.02903, 2016.
  • [23] Caifeng Shan, Shaogang Gong, and Peter W McOwan. Recognizing facial expressions at low resolution. In IEEE Conference on Advanced Video and Signal Based Surveillance, 2005., pages 330–335. IEEE, 2005.
  • [24] Bo Sun, Qinglan Wei, Liandong Li, Qihua Xu, Jun He, and Lejun Yu. LSTM for dynamic emotion and group emotion recognition in the wild. In Proceedings of the 18th ACM International Conference on Multimodal Interaction, pages 451–457. ACM, 2016.
  • [25] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
  • [26] Guan Wang and Jun Gong. Facial expression recognition based on improved lenet-5 cnn. In 2019 Chinese Control And Decision Conference (CCDC), pages 5655–5660. IEEE, 2019.
  • [27] Alexander Wong, Mohammad Javad Shafiee, Brendan Chwyl, and Francis Li. Ferminets: Learning generative machines to generate efficient neural networks via generative synthesis. arXiv preprint arXiv:1809.05989, 2018.
  • [28] Yitao Zhou, Fuji Ren, Shun Nishide, and Xin Kang. Facial sentiment classification based on resnet-18 model. In 2019 International Conference on Electronic Engineering and Informatics (EEI), pages 463–466. IEEE, 2019.