
A Spontaneous Driver Emotion Facial Expression (DEFE) Dataset for Intelligent Vehicles

Wenbo Li, Yaodong Cui, Yintao Ma, Xingxin Chen, Guofa Li, Gang Guo, and Dongpu Cao
Wenbo Li and Gang Guo are with the School of Automotive Engineering, Chongqing University, Chongqing, 400044, China (e-mail: [email protected], [email protected]).Yaodong Cui, Yintao Ma, Xingxin Chen and Dongpu Cao are with the Department of Mechanical and Mechatronics Engineering, University of Waterloo, Waterloo, ON, N2L 3G1, Canada. Wenbo Li and Guofa Li are also with this affiliation. (e-mail: [email protected], [email protected], [email protected], [email protected]). G. Li is with the Institute of Human Factors and Ergonomics, College of Mechatronics and Control Engineering, Shenzhen University, Shenzhen, Guangdong, 518060, China (e-mail: [email protected]).
Abstract

In this paper, we introduce a new dataset, the driver emotion facial expression (DEFE) dataset, for spontaneous driver emotion analysis. The dataset includes facial expression recordings of 60 participants during driving. After watching a selected video-audio clip to elicit a specific emotion, each participant completed the driving tasks in the same driving scenario and rated their emotional responses during the driving process in terms of dimensional emotion and discrete emotion. We also conducted classification experiments on the arousal, valence, and dominance scales, as well as on emotion category and intensity, to establish baseline results for the proposed dataset. In addition, this paper compares and discusses the differences in facial expressions between driving and non-driving scenarios. The results show significant differences in the presence of facial expression AUs (Action Units) between driving and non-driving scenarios, indicating that human emotional expressions in driving scenarios differ from those in other life scenarios. Therefore, publishing a human emotion dataset specifically for drivers is necessary for traffic safety improvement. The proposed dataset will be publicly available so that researchers worldwide can use it to develop and examine their driver emotion analysis methods. To the best of our knowledge, this is currently the only public driver facial expression dataset.

Index Terms:
Driving safety, Driver emotion, Facial expression dataset, Spontaneous expression, Affective computing, Intelligent vehicles

I Background and Related Work

Driver emotion plays a vital role in driving because it affects driving safety and comfort. Among the 20-50 million non-fatal injuries and 1.24 million fatal road traffic accidents occurring every year worldwide [1], drivers' inability to control their emotions has been regarded as one of the critical factors degrading driving safety [2][3]. The rapid development of intelligent vehicles also creates an emerging demand for integrating driver-automation interaction and collaboration to enhance driving comfort, where driver emotion is one of the critical driver states [4]. Therefore, recognizing driver emotions is essential for improving the driving safety and comfort of intelligent vehicles [5].

To describe human emotion, psychological researchers have provided two methodologies for classifying emotions: discrete emotions and dimensional emotions [6]. Because humans use discrete language words to describe emotions, discrete models are well established and widely accepted, such as the basic emotions of Ekman et al. [7] and the emotion tree structure of Parrott [8]. Specifically, Ekman et al. categorized discrete emotions into six basic emotions (happiness, sadness, anger, fear, surprise, and disgust) [7], which are supported by cross-cultural studies showing that humans perceive these basic emotions in a similar form regardless of cultural differences [9]. Dimensional emotion models propose that the emotional state can be accurately expressed as a combination of several psychological dimensions, such as the 2D "circumplex model" proposed by Russell [10] and the 3D dimensional model of Mehrabian et al. [11]. In the widely adopted model proposed by Russell [10], the valence dimension measures whether humans feel negative or positive, and the arousal dimension measures whether humans are bored or excited. Mehrabian et al. [11] extended the emotion model from 2D to 3D by adding a dominance dimension, which measures submissive or empowered feelings.

The discrete emotion method is intuitive and widely used in people's daily lives. However, it fails to cover the whole range of emotions exhibited by humans. The dimensional emotion method is less intuitive and often requires training participants to use the dimensional emotion labelling system. Nevertheless, the dimensional emotion method is a more pragmatic and context-dependent approach to describing emotions [6]. In this study, considering the primary emotions of drivers during driving, we combine the discrete and dimensional emotion methods to quantitatively describe drivers' negative emotions (e.g., anger) and positive emotions (e.g., happiness) by employing the well-known differential emotion scale (DES) [12] and the self-assessment manikin (SAM) [13].

TABLE I: Summary of reviewed publicly available datasets for facial expression-based emotion recognition

| Dataset | Content and resolution | Emotions / signals | Participants | Condition | Emotion model / labels |
| --- | --- | --- | --- | --- | --- |
| JAFFE [14] | 213 images, 256×256 | Neutral, sadness, surprise, happiness, fear, anger, disgust | 10 (10 females) | Static life scenario, controlled, posed | Discrete: neutral + 6 basic emotions |
| KDEF [15] | 4,900 images, 562×762 | Neutral, sadness, surprise, happiness, fear, anger, disgust | 70 (35 females, 35 males; age 20-30) | Static life scenario, controlled, posed | Discrete: neutral + 6 basic emotions |
| MMI [16] | Over 2,900 video sequences, 720×576 | Sadness, surprise, fear, happiness, anger, disgust | 75 (age 19-62) | Static life scenario, controlled, posed & spontaneous | Discrete: 6 basic emotions |
| BU-3DFE [17] | 2,500 3D models, 1040×1329 | Neutral, sadness, surprise, happiness, fear, anger, disgust | 100 (56 females, 44 males; age 18-70) | Static life scenario, controlled, posed | Discrete: neutral + 6 basic emotions; 4 levels of emotional intensity |
| Multi-PIE [18] | 755,370 images, 3072×2048 | Neutral, smile, surprise, squint, disgust, scream | 337 (102 females, 235 males) | Static life scenario, controlled, posed | Discrete: neutral + 5 emotion categories |
| CK+ [19] | 593 video sequences, 640×480 and 640×490 | Neutral, sadness, surprise, happiness, fear, anger, contempt, disgust | 123 (age 18-50) | Static life scenario, controlled, posed & spontaneous | Discrete: neutral + 7 emotion categories |
| RaFD [20] | 8,040 images, 681×1024 | Neutral, sadness, surprise, contempt, happiness, fear, anger, disgust | 67 (25 females, 42 males) | Static life scenario, controlled, posed | Discrete: neutral + 7 emotion categories |
| DEAP [21] | 880 video clips (22 subjects), 786×576; physiological signals | Valence, arousal, dominance | 32 (16 females, 16 males; age 19-33) | Static life scenario, controlled, spontaneous | Dimensional: 9 levels of valence, arousal, dominance |
| Belfast [22] | 1,400 video clips, 720×576 and 1920×1080 | Disgust, fear, amusement, frustration, surprise, anger, sadness | 256 (119 females, 137 males) | Static life scenario, controlled, natural tasks induced | Discrete: 7 emotion categories; emotional intensity |
| DISFA [23] | 130,000 video frames, 1024×768 | AU intensity for each video frame (12 AUs) | 27 (12 females, 15 males; age 18-50) | Static life scenario, controlled, spontaneous | 12 AUs |
| RECOLA [24] | 3.8 hours of video, 1080×720; physiological and audio signals | Valence, arousal | 46 (27 females, 19 males; mean age 22) | Static life scenario, controlled, spontaneous | Dimensional: 9 levels of valence, arousal |
| CFEE [25] | 5,060 images, 3000×4000 | 22 categories of basic and compound emotions | 230 (130 females, 100 males; mean age 23) | Static life scenario, controlled, posed | Discrete: 22 categories of basic and compound emotions |
| BP4D-Spontaneous [26] | 328 sequences of 3D + 2D, 1040×1329 | Sadness, surprise, fear, anger, embarrassment, physical pain, happiness, disgust | 41 (23 females, 18 males; age 18-29) | Static life scenario, controlled, spontaneous | Discrete: 8 basic emotions |
| ISED [27] | 428 video sequences, 1920×1080 | Sadness, surprise, happiness, disgust | 50 (21 females, 29 males; age 18-22) | Static life scenario, controlled, spontaneous | Discrete: 4 emotion categories; 5 levels of emotional intensity |
| FER+ [28] | 35,887 images, 48×48 | Neutral, surprise, sadness, happiness, anger, disgust, fear, contempt | ~35,887 | Wild setting | Discrete: neutral + 7 emotion categories |
| EmotioNet [29] | 1,000,000 images, various resolutions | 23 basic or compound emotions | ~100,000 | Wild setting | Discrete: 23 categories of basic and compound emotions |
| Aff-Wild [30] | 298 video clips, various resolutions | Valence and arousal | 200 (70 females, 130 males) | Wild setting | Dimensional: valence and arousal |
| RAVDESS [31] | 7,356 video and audio clips, 1280×720 | Neutral, calm, happiness, sadness, anger, fear, surprise, disgust | 24 (12 females, 12 males; age 21-33) | Static life scenario, controlled, posed | Discrete: neutral + 7 emotion categories; 2 levels of emotional intensity |
| AffectNet [32] | 450,000 manually annotated images, various resolutions | Neutral, sadness, surprise, happiness, fear, disgust, anger, contempt; valence, arousal | 450,000 | Wild setting | Discrete: neutral + 7 emotion categories; Dimensional: valence and arousal |
| DEFE (this work) | 164 video clips, 30 s each, 640×480 and 1920×1080 | Neutral, happiness, anger; valence, arousal, dominance | 60 (13 females, 47 males; age 19-56) | Dynamic driving scenarios, controlled, spontaneous | Discrete: 3 emotion categories, 5 levels of emotional intensity; Dimensional: 9 levels of valence, arousal, dominance |

Driver emotion recognition is often conducted by analyzing driver emotion expressions. Human emotions are expressed through facial expressions, speech, body posture, and physiological changes. So far, behavioural measurements (e.g., facial expression analysis, speech analysis, driving behaviour) [33][34][35], physiological signal measurements (e.g., skin electrical activity, respiration) [36][37], and self-reported scales (e.g., the self-assessment manikin) [38] have been applied in driver emotion recognition. Comparatively, physiological measurements are more objective and can be collected continuously; however, they are highly intrusive and may affect drivers' driving performance. Self-reported measurements capture the subjective experience of drivers when applied correctly, but they cannot be collected during the study without interruption. For studying driver emotion in the driving environment, it is crucial to use non-invasive and non-contact measurement methods, since high intrusiveness has a significant impact on both driver emotion expression and the actual emotional experience and should therefore be avoided [39]. To this end, this study employed facial expressions to recognize driver emotions and to ensure the continuity of data collection.

Facial expression is a powerful channel for drivers to express emotions [40]. Recent advances in facial expression-based emotion recognition have motivated the creation of multiple facial expression datasets, and publicly available datasets are fundamental for accelerating facial expression research. Table I summarizes representative, up-to-date publicly available datasets containing facial expressions. These datasets have been used for emotion recognition with different levels of success. As shown in Table I, a common aspect of these datasets is that participants' facial expression data were collected in static life scenarios or wild settings. Although facial expression data collected in static life scenarios and wild settings can be used to recognize emotions with various algorithms, this restricts the application of these algorithms to static life scenarios.

However, driving a car is a complex cognitive process [41] that requires the driver to respond dynamically to driving tasks, such as processing visual cues, hazard assessment, decision-making, and strategic planning [42]. Consequently, driving occupies many of the driver's cognitive resources [43], and cognitive processing is needed to elicit emotional responses [44]. Driving therefore affects drivers' emotion expressions, which differ from the expressions in static life scenarios. As a result, if the above-mentioned algorithms are applied to dynamic driving scenarios, reliable recognition results may not be obtained. Thus, it is necessary to collect drivers' facial expression data specifically for driver emotion recognition in dynamic driving scenarios and to analyze the differences in human facial expressions between dynamic driving scenarios and static life scenarios.

II DEFE Data Collection Framework

To address the above-mentioned limitations, we introduce a driver emotion facial expression (DEFE) dataset in this study for driver emotion research in intelligent vehicles. Table II presents the details of the experimental design for stimulus material selection, data collection, experiment protocol, and emotion labels. The performance of different emotion recognition algorithms was analyzed in this study. This paper also analyzed the differences in human facial expressions between dynamic driving scenarios and static life scenarios.

TABLE II: Summary of the DEFE dataset

Video-audio stimulus selection
| Number of stimuli | 18 |
| Stimulus duration | 30-120 s |
| Initial stimulus selection | Manually selected |
| Ratings per stimulus | At least 35 |
| Rating scales | Dimensional emotion model (SAM): arousal, valence, dominance; discrete emotion model (DES): emotion category, emotional intensity |
| Rating values | Discrete scale of 1-9 |
| Selection method | Subset of annotated video-audio clips with the clearest and strongest responses |

Driver facial expression data collection
| Number of participants | 60 (13 females, 47 males) |
| Number of stimuli | 3 |
| Number of driving tasks | 3 |
| Rating scales | Dimensional emotion model (SAM): arousal, valence, dominance; discrete emotion model (DES): emotion category, emotional intensity |
| Rating values | Discrete scale of 1-9 |
| Recorded signals | Driver facial expression videos |

DEFE dataset content
| Number of video clips | 164 |
| Emotions elicited | Anger (52 clips), happiness (56 clips), neutral (56 clips) |
| Clip duration | 30 s |
| Video clip format | MP4 |
| Image resolution | 1920×1080, 640×480 |
| Self-report of emotion | Yes |
| Emotion category labels | 3 categories: anger, happiness, neutral |
| Emotion intensity labels | 5 levels for anger and happiness (5 = no emotion, 9 = maximum intensity) |
| Valence, arousal, dominance labels | 9 levels each (1 = not at all, 9 = extremely) |

In our DEFE dataset, video-audio clips were used as the stimuli to induce different emotions. To this end, a large number of video-audio clips were collected using a manual selection method. Subjective annotation was then performed to select the most appropriate stimulus material. Each stimulus material was labelled at least 35 times using SAM and DES scales, and the most effective three video-audio clips were selected to induce a specific driver emotion in our following experiments for data collection. Then, 60 drivers participated in the data collection experiment. After watching each of the three randomly sequenced video-audio clips selected to elicit a specific emotion, each participant completed the driving tasks in the same driving scenarios and rated their emotional responses in the driving process from the aspects of dimensional emotion and discrete emotion. Besides, we conducted classification experiments for the scales of arousal, valence, dominance, as well as the emotion category and intensity to establish the baseline results for our dataset in terms of classification accuracy and F1 score. Furthermore, we discussed the differences in facial expressions between dynamic driving scenarios and static life scenarios in the same culture by comparing the responses of different action units (AUs) in our DEFE and the JAFFE datasets.

The main contributions of this paper can be described as:

  • We provide a new, publicly available dataset DEFE for spontaneous driver emotions analysis. The dataset contains frontal facial videos from 60 drivers, including their biographic information (gender, age, driving age), and subjective ratings on driver emotions (arousal, valence, dominance scales, as well as emotion category and intensity). To the best of our knowledge, this dataset is currently the only public dataset of driver facial expressions.

  • We compared the classification results of driver emotions on our DEFE dataset using the mainstream classification algorithms. The DEFE dataset supports driver emotion classification from two aspects, dimensional emotion (arousal, valence and dominance) and discrete emotion (emotional category and intensity). The comparisons established the baseline results of the introduced dataset with classification accuracy and F1 score.

  • The differences in human facial expressions between dynamic driving scenarios and static life scenarios were compared by analyzing drivers’ AUs presence, and the results showed significant differences between these two types of scenarios. Therefore, the previous human emotion datasets cannot be directly used for driver emotion analysis, and our introduced DEFE dataset fills this research gap.

The structure of this paper is as follows: Section III presents the selection of stimulus materials. Section IV introduces the DEFE data collection details, and the data processing, classification methods and results are described in Section V. Section VI compares human facial expression differences in dynamic driving scenarios and static life scenarios. The final conclusions are shown in Section VII.

III Video-audio Stimulus Selection

A stimulus is necessary to elicit target emotions, and all emotion datasets rely on stimuli to evoke emotions, such as the International Affective Picture System (IAPS) [45] and the International Affective Digitized Sound System (IADS) [46]. Compared with images and music, video-audio clips generally evoke stronger emotional feelings. Existing research has confirmed that video-audio clips can reliably elicit the emotions of subjects [12, 47]; hence, video-audio clips were selected as stimuli in our experiments. Eighteen initial video-audio clips were manually selected, and then we recruited participants for a subjective rating experiment on these clips. Finally, three video-audio clips were selected based on the subjective rating results. Each of these steps is explained in detail as follows.

III-A Initial Video-Audio Clips Selection

To select the most effective video-audio clips, two research assistants (1 male and 1 female) reviewed more than 500 video-audio clips and conducted a preliminary screening. They were asked to select video-audio clips that lasted 30-120 seconds and contained content to elicit a single target emotion: a negative emotion (anger), a positive emotion (happiness), or a neutral state. Another two research experts (1 male and 1 female) with rich experience in driver emotion analysis then evaluated each selected video-audio clip, and the final selection was decided by the consensus of the two experts.

The selected video-audio clips are mainly based on Chinese real-life scenarios and events, such as aggressive driving and chatting. Additional selection criteria were: 1) the video background should not be too dark, 2) the clip should contain complete speech segments, and 3) the clip should elicit only one target emotion. Accordingly, we selected 18 video-audio clips and examined them further in the subjective annotation session.

TABLE III: Brief description of the selected Chinese video-audio stimuli

| Target emotion | Duration (s) | Clip content |
| --- | --- | --- |
| Happiness | 62 | Parents mentor their children to do homework |
| Anger | 45 | Many people were used in cruel human experiments during the war |
| Neutral | 48 | A man drives on a city road with nothing happening |

III-B Subjective Annotation

A web-based subjective emotion annotation experiment was conducted to evaluate the video-audio clips. For each participant, the 18 clips were displayed in random order, with a relatively long break (3 minutes) between every two clips to avoid interference from the previous one. After watching each video-audio clip, participants completed two questionnaires based on their true feelings, namely the self-assessment manikin (SAM) [13] and the differential emotion scale (DES) [12]. SAM uses non-verbal graphical representations to assess the arousal, valence, and dominance levels, and its effectiveness has been demonstrated in [13]. We adopted a 9-point SAM scale (1 = "not at all", 9 = "extremely") [13] in our study. DES is used to assess the different components of emotions and consists of ten basic emotions. In this study, we used a 9-point DES scale (1 = "not at all", 9 = "extremely") [12] to assess the intensity of each self-reported emotion dimension. No clip was evaluated twice by the same participant, and at least 35 assessments were collected for each clip.

Figure 1: Experimental setup of driver facial expression data collection. (a) driver facial expression recording, (b) fixed-base driving simulator, (c) experiment setup, (d) driving scenarios, (e) video-audio stimulus display

III-C Video-Audio Clips Selection

Three video-audio clips were selected by comprehensively considering the SAM and DES results. First, we normalized the variables by calculating Z-scores and then conducted a cluster analysis with the K-means algorithm to identify clusters of emotions based on the SAM data. The clustering produced three emotion categories, corresponding to positive emotion (happiness), negative emotion (anger), and the neutral state, respectively. The video-audio clip whose rating was closest to the extreme corner of each quadrant was selected and marked as the representative clip of that cluster [21].
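For concreteness, the following is a minimal sketch of this clustering step, assuming the per-clip mean SAM ratings are collected in a NumPy array; the placeholder ratings, variable names, and the rule used to pick the representative clip of each cluster are illustrative rather than the exact implementation.

```python
import numpy as np
from scipy.stats import zscore
from sklearn.cluster import KMeans

# Mean SAM ratings per clip, shape (18, 3): columns = [valence, arousal, dominance].
# Random placeholder values; the real ratings come from the annotation experiment.
rng = np.random.default_rng(0)
sam_ratings = rng.uniform(1, 9, size=(18, 3))

# Z-score each dimension so that no single scale dominates the distance metric.
sam_z = zscore(sam_ratings, axis=0)

# Cluster the 18 clips into the three target groups (happiness, anger, neutral).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(sam_z)

# One simple way to pick a "representative" clip per cluster: the clip farthest
# from the origin of the normalized valence-arousal plane, i.e. closest to the
# extreme corner of its quadrant.
for c in range(3):
    members = np.where(kmeans.labels_ == c)[0]
    extremity = np.linalg.norm(sam_z[members, :2], axis=1)  # valence & arousal only
    print(f"cluster {c}: representative clip index {members[np.argmax(extremity)]}")
```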

Moreover, we selected video-audio clips for each emotion category based on the following scores computed from the DES data: 1) the hit rate, defined as the proportion of participants who chose the target emotion; 2) the intensity value, defined as the average score of the target emotion; and 3) the success index, defined as the sum of the two Z-scores obtained by normalizing the hit rate and the intensity value. The video-audio clip with the highest success index was then selected from each emotion category, and the representative clips identified from the SAM data were used for verification. Note that the neutral video-audio clip was selected based on the clustering results only. Eventually, as shown in Table III, the three most effective clips were selected for the driver facial expression data collection experiment.
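A compact sketch of the success-index computation described above, with hypothetical column names and toy ratings standing in for the real DES annotations:

```python
import pandas as pd
from scipy.stats import zscore

# One row per (clip, participant) DES rating; column names are hypothetical.
des = pd.DataFrame({
    "clip": ["a", "a", "a", "b", "b", "b"],
    "chose_target_emotion": [1, 1, 0, 1, 0, 0],  # participant picked the target emotion?
    "target_intensity": [7, 8, 2, 6, 3, 2],      # 1-9 DES rating of the target emotion
})

per_clip = des.groupby("clip").agg(
    hit_rate=("chose_target_emotion", "mean"),   # proportion choosing the target emotion
    intensity=("target_intensity", "mean"),      # average intensity of the target emotion
)

# Success index = Z-scored hit rate + Z-scored intensity; the clip with the
# highest index in each emotion category is retained.
per_clip["success_index"] = zscore(per_clip["hit_rate"]) + zscore(per_clip["intensity"])
print(per_clip.sort_values("success_index", ascending=False))
```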

IV Driver Facial Expression Data Collection

IV-A Ethics Statement

The experimental procedure and the video content shown to the participants were approved by the Chongqing University Cancer Hospital Ethics Committee, China. Participants and their data were treated according to the Declaration of Helsinki. The participants were informed that they had the right to quit the experiment at any time. The video recordings of the participants were included in the dataset only after they gave written consent for the use of their videos for research purposes. A few participants also agreed to the use of their face images in research articles.

IV-B Participants

Sixty Chinese participants (47 males and 13 females) aged from 19 to 56 years (mean [M] = 27.3 years, standard deviation [SD] = 7.7 years) were recruited from Shapingba District, Chongqing, China. Each participant held a valid driving license with at least one year of driving experience (M = 5.5 years, SD = 5.8 years, range = 1-30 years). All participants had normal or corrected-to-normal vision (36 participants wore glasses) and normal hearing. The presence of occlusions such as glasses is a significant research challenge for facial expression recognition; hence, participants wearing glasses were included to evaluate the robustness of emotion recognition. All participants signed the consent form and received 100 CNY as financial reimbursement for their participation.

IV-C Experiment Setup

The experiments were carried out in an illumination-controlled, fixed-base driving simulator (Figure 1(b)) (RDS2000, Realtime Technologies SimCreator, Ann Arbor, Michigan, USA). Figure 1(d) shows the front view, which was presented using three projectors; the rear view was displayed on three LCD screens (one for the in-vehicle rearview and two for the left and right rear views). Another two LCD screens displayed the dashboard and the central stack. Ambient noise and engine sound were played through two speakers, and vehicle vibration was simulated by a woofer under the driver's seat. To present the stimuli without changing the internal environment of the driving simulator, as shown in Figure 1(e), we used a 20-inch central stack screen (1,280×1,024, 60 Hz) to display the video-audio stimulus materials. A stereo Bluetooth speaker (Xiaomi) played the audio at a relatively loud level; each participant was asked before the experiment whether the volume was comfortable, and it was adjusted when necessary for clear hearing. During the experiment, as shown in Figure 1(a), the participants' faces were continuously recorded with a visual camera, an HD Pro Webcam C920 (Logitech, Newark, CA) with a resolution of 1,920×1,080 pixels at a frame rate of 30 fps. An iPad (Apple) was used for participants' self-reported emotions. Figure 1(c) shows the overall data collection setup.
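As an illustration only (not part of the released data pipeline), the face-camera recording configuration described above could be reproduced with OpenCV roughly as follows; the device index and output path are assumptions.

```python
import cv2

# Open the face camera (device index 0 is an assumption) and request the
# 1,920x1,080 / 30 fps recording settings described above.
cap = cv2.VideoCapture(0)
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 1920)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 1080)
cap.set(cv2.CAP_PROP_FPS, 30)

# Write the stream to an MP4 file (hypothetical output path).
writer = cv2.VideoWriter("driver_face.mp4",
                         cv2.VideoWriter_fourcc(*"mp4v"), 30.0, (1920, 1080))

while True:
    ok, frame = cap.read()
    if not ok:
        break
    writer.write(frame)
    cv2.imshow("preview", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):  # press 'q' to stop recording
        break

cap.release()
writer.release()
cv2.destroyAllWindows()
```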

Figure 2: Driving scenarios and tasks. (a) practice driving, (b) emotional driving

IV-D Driving Scenarios and Tasks

Two highway driving scenarios were implemented in the simulator to minimize the impact of complex driving scenarios on driver performance. The first was a practice driving (PD) scenario to help participants familiarize themselves with the simulator before the experiment. As shown in Figure 2(a), the practice scenario was an 8 km straight section of a four-lane highway with two lanes in each direction. Participants were asked to drive in the right lane while changing speed in the sequence 80 km/h – 50 km/h – 100 km/h. The second was an emotional driving (ED) scenario. As shown in Figure 2(b), it was a 3 km straight section of the same highway with a posted speed limit of 80 km/h, and participants were asked to drive in the right lane at around 80 km/h.

IV-E Experiment protocol

To obtain drivers' ED data, we designed an experimental protocol of about 45 minutes of driving, composed of one PD session followed by three ED sessions: angry driving (AD), happy driving (HD), and neutral driving (ND). Figure 3 presents the details of the protocol. Before the experiment, each participant signed a consent form and filled out a basic information questionnaire (gender, age, driving experience). Next, they were given a set of instructions describing the experimental protocol and the definitions of the scales used for self-reported emotions. The participants then completed a 10-minute PD session to get familiar with the operation and motion performance of the driving simulator. After a short break following the PD, the participants started the three ED sessions. At the beginning of each ED session, the corresponding emotion was induced by watching the selected video-audio clip, followed by driving with that emotion. At the end of each ED session, the participant reported his or her self-evaluated emotion level using SAM and DES. There was a 3-minute break between every two ED sessions. Throughout the experiment, participants could withdraw at any time if they felt any discomfort.

IV-F Self-Reported Emotion

To identify the emotions experienced by participants, we employed self-reported scales for the subjective assessment of emotions. After each driving task, the participants assessed their emotional experience while driving using SAM and DES, presented on an iPad. In SAM, the valence scale ranges from unhappy to happy, the arousal scale from calm to stimulated, and the dominance scale from submissive (or "without control") to dominant (or "in control, empowered"). In DES, there are ten emotion dimensions, and each dimension evaluates the intensity of the emotion from "not at all" to "extremely". Each dimension of the SAM and DES scales was rated on a nine-point Likert scale. When a participant's self-assessment was not consistent with the induced target emotion, the participant's self-reported data were used as the ground truth to label the facial video data.

Figure 3: Experiment protocol.

V Data Processing and Evaluation

V-A Data Processing

In this section, we describe the processing of the driver facial expression data. First, we labelled the facial expression data of the 60 drivers according to their self-reported emotions and removed the ED data in which the target emotion was not successfully induced. Second, we describe how the data were split for driver emotion recognition, including clipping the effective video segments from the original recordings and extracting the driver facial expressions.

Figure 4: The emotional driving induction success rate for 60 drivers.

During data collection, each participant completed three ED sessions with an average recording length of 405 s, and we compiled the self-reported data for each participant. As shown in Figure 4, the numbers of drivers whose emotions were successfully induced were 52, 56, and 56 for angry, happy, and neutral driving, respectively. Participants' self-reported data were used as the ground truth to label the driver facial expression data.

Figure 5: Sample images of the 3 emotion categories in DEFE dataset, anger (1st row), happiness (2nd row) and neutral (3rd row).

Following [48] and [49], the facial expression video sequences recorded 15 s after the drivers started driving were clipped as the most effective data. Face detection and alignment in driving environments are challenging due to varying poses, illumination, and occlusions (e.g., glasses). MTCNN (Multi-task Cascaded Convolutional Networks) is a deep learning-based cascade structure that is relatively accurate when detecting faces across multiple pose angles and in unconstrained scenes [50]. Hence, we used MTCNN to track and extract the driver's face from each video frame. After extracting the face data, we obtained a total of 17,310 driver face image frames of 64×64 pixels.
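A minimal sketch of this face-extraction step, assuming the facenet-pytorch implementation of MTCNN (the paper only states that MTCNN [50] was used) and a hypothetical clip path:

```python
import cv2
from facenet_pytorch import MTCNN  # one common PyTorch MTCNN implementation (an assumption)

# Detector configured to return 64x64 face crops, matching the crop size used above.
mtcnn = MTCNN(image_size=64, margin=0, post_process=False)

cap = cv2.VideoCapture("driver_clip.mp4")  # hypothetical path to one 30 s DEFE clip
faces, n_frames = [], 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    n_frames += 1
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # MTCNN expects RGB input
    face = mtcnn(rgb)                             # 3x64x64 tensor, or None if no face is found
    if face is not None:
        faces.append(face)
cap.release()
print(f"extracted {len(faces)} face crops from {n_frames} frames")
```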

The created dataset therefore contains facial expression videos and images from 60 drivers with ground truth for dimensional emotion (valence, arousal, and dominance) and discrete emotion (emotion category and intensity). A few example images from the dataset are provided in Figure 5, which shows that drivers' facial expressions varied with the type of emotion, but the variation was weak in some cases during driving; for example, the difference between AD and ND was tiny. Peak expressions were difficult to observe in most video clips, and the change of expression over the driving duration was also weak, probably because the facial expression of emotion was affected by the driving tasks.

V-B Classification Protocol

In this section, we introduce two protocols for driver emotion recognition based on the facial expression data. (1) To investigate driver emotion classification based on the dimensional emotion model, we propose three separate nine-class classification problems: valence, arousal, and dominance. The participants' SAM scores were used as the ground truth, and each scale was divided into nine levels (1 = "not at all", 9 = "extremely"). (2) To study driver emotion classification based on the discrete emotion model, we propose a three-class emotion protocol covering anger, happiness, and neutral. In addition, we consider intensity recognition for the anger and happiness emotions, with the DES scores taken as the ground truth; the intensity of each emotion (anger and happiness) was divided into five levels (5 = "no emotion", 9 = "maximum intensity").
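To make the two protocols concrete, the following sketch shows one plausible way to turn the self-reported SAM and DES scores of an ED session into class labels; the exact mapping (in particular the handling of DES intensity ratings below 5) is an assumption rather than the paper's implementation.

```python
# Hypothetical self-report of one emotional-driving (ED) session.
sam = {"valence": 7, "arousal": 8, "dominance": 5}   # 1-9 SAM ratings
des = {"category": "happiness", "intensity": 8}      # DES target emotion and 1-9 intensity

# Protocol (1): dimensional model -> three independent 9-class problems (classes 0-8).
dimensional_labels = {k: v - 1 for k, v in sam.items()}

# Protocol (2): discrete model -> 3-class emotion category ...
category_label = {"anger": 0, "happiness": 1, "neutral": 2}[des["category"]]

# ... plus a 5-level intensity label for anger and happiness; DES ratings 5-9 are
# mapped to classes 0-4 (clamping ratings below 5 is an assumption).
intensity_label = max(5, min(9, des["intensity"])) - 5
print(dimensional_labels, category_label, intensity_label)
```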

It should be noted that the above approach can lead to unbalanced classes for some participants and scales. In light of this, we report F1 scores in addition to accuracy in order to obtain reliable results. The F1 score is a commonly used metric in classification tasks that considers both the precision (P) and recall (R) of the model and quantifies the correct prediction of positive samples; when classes are unbalanced, poor performance on minority classes lowers the F1 score [51]. We additionally used accuracy, which quantifies how well the classifier correctly identifies or excludes conditions, as a second metric.
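Both metrics can be computed with scikit-learn; macro averaging of the per-class F1 scores is assumed here, matching the "average score for each class" wording in the result tables.

```python
from sklearn.metrics import accuracy_score, f1_score

# Illustrative per-frame labels: 0 = anger, 1 = happiness, 2 = neutral.
y_true = [0, 0, 1, 1, 2, 2, 2, 1]
y_pred = [0, 1, 1, 1, 2, 2, 0, 1]

acc = accuracy_score(y_true, y_pred)
# Macro averaging weights every class equally, so a poorly recognized minority
# class pulls the overall F1 score down.
f1 = f1_score(y_true, y_pred, average="macro")
print(f"accuracy = {acc:.3f}, macro F1 = {f1:.3f}")
```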

Both traditional and deep learning methods for emotion recognition were included in this study. SVM (Support Vector Machine), one of the most effective traditional methods in classification tasks [19], was implemented with the scikit-learn toolbox using a linear kernel. For the deep learning-based method, Xception [52] was applied. The Xception network has been widely adopted in emotion recognition tasks, and many state-of-the-art emotion recognition networks are built on it [53][54]. For the network, the loss function can be expressed as:

L(y,\hat{y}) = -\sum_{j=0}^{M}\sum_{i=0}^{N} y_{ij}\log(\hat{y}_{ij})    (1)

where \hat{y} is the prediction and y is the ground truth. The deep learning method used the same training strategy for all classification tasks. First, the Adam optimizer [55] was employed with a learning rate of 10^{-3} and a weight decay of 10^{-6}. Second, image augmentations, including random horizontal flips, random cropping, and random rotation, were applied on the fly to effectively increase the amount of training data. The SVM was run on an Intel Core i5 dual-core CPU, and Xception was trained on a TITAN Xp GPU.
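The sketch below illustrates a training setup consistent with the description above (Xception backbone, Adam with learning rate 1e-3 and weight decay 1e-6, the cross-entropy loss of Eq. (1), and on-the-fly flip/crop/rotation augmentation) using Keras; the 96×96 input size, augmentation parameters, and framework choice are assumptions rather than the exact configuration used in the paper, since the Keras Xception implementation requires inputs of at least 71×71. The SVM baseline corresponds roughly to sklearn.svm.SVC(kernel="linear") trained on flattened face crops.

```python
import tensorflow as tf

NUM_CLASSES = 3  # anger / happiness / neutral (protocol two)
IMG = 96         # assumed upsampling of the 64x64 face crops (Keras Xception needs >= 71x71)

# On-the-fly augmentation: random horizontal flip, random crop, and random rotation.
augment = tf.keras.Sequential([
    tf.keras.layers.Resizing(IMG + 8, IMG + 8),
    tf.keras.layers.RandomCrop(IMG, IMG),
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.05),
])

inputs = tf.keras.Input(shape=(64, 64, 3))
x = augment(inputs)
backbone = tf.keras.applications.Xception(
    include_top=False, weights=None, input_shape=(IMG, IMG, 3), pooling="avg")
outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(backbone(x))
model = tf.keras.Model(inputs, outputs)

# Adam with learning rate 1e-3 and weight decay 1e-6 (weight_decay requires Keras >= 2.11),
# minimizing the categorical cross-entropy of Eq. (1) over one-hot labels.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3, weight_decay=1e-6),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
model.summary()
```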

V-C Evaluation Results

Apart from the emotion recognition results on the proposed dataset, we also selected the DEAP [21] and CK+ [19] datasets, which were collected in static life scenarios, as comparison datasets. The DEAP dataset consists of 32 participants; each watched forty one-minute video-audio clips as emotional stimuli while facial videos and physiological signals were recorded. There are 40 trials per participant, each corresponding to one emotion elicited by one video-audio clip. After watching each clip, the participants assessed their emotions on five dimensions: valence, arousal, dominance, liking, and familiarity. The ratings range from 1 (weakest) to 9 (strongest), except for liking and familiarity, which range from 1 to 5. Facial videos were recorded for 22 of the participants; this paper adopted these 22 facial videos and investigated the emotion classification results based on the dimensional emotion model for comparison. The CK+ dataset consists of 123 participants and contains posed and spontaneous expressions whose intensity evolves from neutral to peak. In CK+, 327 sequences have discrete emotion labels, including neutral, sadness, surprise, happiness, fear, anger, contempt, and disgust. This paper selected the neutral, anger, and happiness sequences from this dataset to compare the emotion classification results based on the discrete emotion model.

TABLE IV: Average accuracies (ACC) and F1 scores (F1, averaged over classes) in protocol one, based on the dimensional emotion model (in %)

| Dataset | Method | Valence ACC | Valence F1 | Arousal ACC | Arousal F1 | Dominance ACC | Dominance F1 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DEFE | SVM | 53.39 | 54.79 | 59.49 | 63.04 | 59.49 | 63.04 |
| DEFE | Xception | 86.00 | 83.73 | 91.54 | 91.76 | 88.17 | 79.55 |
| DEAP | SVM | 27.88 | 23.24 | 29.82 | 23.25 | 28.12 | 24.14 |
| DEAP | Xception | 24.10 | 21.41 | 35.06 | 31.80 | 31.00 | 24.24 |

Table IV shows the average accuracies and F1 scores (averaged over the nine classes) for each rating scale (valence, arousal, and dominance) when using protocol one on DEFE. We compared the performance of SVM and Xception on the DEFE dataset. In general, the accuracies obtained with Xception were at least 30% higher than those obtained with SVM. The highest classification accuracies for valence, arousal, and dominance reached 86.00%, 91.54%, and 88.17%, respectively, when using Xception, and the corresponding highest F1 scores were 83.73%, 91.76%, and 79.55%. In addition to the results on the DEFE dataset, Table IV also shows the comparison results on the DEAP dataset with the same recognition algorithms. The DEFE dataset yielded higher recognition accuracies and F1 scores than the DEAP dataset, possibly because in DEAP the participants' faces were partially covered with electrode pads for physiological signal collection, which affected the facial expression recognition results.

TABLE V: Average accuracies (ACC) and F1 scores (F1, averaged over classes) in protocol two, based on the discrete emotion model (in %)

| Dataset | Method | Emotion category ACC | Emotion category F1 | Anger intensity ACC | Anger intensity F1 | Happiness intensity ACC | Happiness intensity F1 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DEFE | SVM | 53.08 | 52.93 | 86.01 | 87.42 | 85.41 | 85.57 |
| DEFE | Xception | 90.34 | 90.21 | 97.60 | 97.12 | 97.88 | 97.59 |
| CK+ | SVM | 82.70 | 71.45 | - | - | - | - |
| CK+ | Xception | 94.31 | 93.25 | - | - | - | - |

Similarly, Table V shows the average accuracies and F1 scores for the emotion categories (anger, happiness, and neutral) when using protocol two, comparing the classification results of SVM and Xception. Both the highest classification accuracy (90.34%) and the highest F1 score (90.21%) were obtained with Xception. Apart from the results on DEFE, Table V also presents the comparison results on the CK+ dataset with the same recognition algorithms; the recognition results on CK+ were higher than those on DEFE.

Moreover, Table V shows the average accuracies and F1 scores of the intensity classification results for the anger and happiness emotions when using protocol two with the different algorithms. Five intensity classes for anger and happiness were classified based on the facial expression data. The highest classification accuracies for angry and happy driving intensity were 97.60% and 97.88%, respectively, and the corresponding highest F1 scores were 97.12% and 97.59%. Note that for emotion intensity recognition we did not compare the results with other datasets, because no spontaneous facial expression dataset with emotion intensity labels is currently available.

The comparison results in this section show that there are differences in human facial expressions between DEFE and CK+. Due to the influence of driving tasks in driving scenarios, drivers' facial expressions may be suppressed when they experience emotional states. Hence, it is necessary to further examine the differences between human facial expressions in dynamic driving scenarios and static life scenarios.

VI Facial Expression Differences between Dynamic Driving and Static Life Conditions

VI-A Dataset Selection for Comparison

In this section, we conducted a differential analysis of facial expressions between dynamic driving and static life conditions by comparing the DEFE and JAFFE datasets. The static life dataset, the Japanese Female Facial Expression (JAFFE) dataset [14], was selected as the baseline. It was posed by 10 East Asian females with seven expressions (happiness, sadness, surprise, anger, disgust, fear, and neutral), and each female provided two to four examples of each expression. In total, the dataset contains 213 grayscale facial expression images.

Given the similar East Asian cultural backgrounds, the JAFFE dataset was the most suitable control group for our DEFE dataset because it excludes most cultural bias [56]. Since the emotional expressions in DEFE comprise only two emotions (anger and happiness), we also selected only the anger and happiness expressions from JAFFE for analysis. Meanwhile, because gender differences may affect the results, we removed the male drivers from the DEFE data used in this comparison.

Figure 6: Facial action coding system (FACS) codes can be used to describe the facial configuration in adults. (a) and (b) display the common FACS codes for anger and happiness, respectively, and (c) presents the AUs description for anger and happiness [57]

VI-B Differential Analysis Protocol

Each participant's facial expressions were evaluated by observing subtle changes in facial features. The Facial Action Coding System (FACS) [58] is a systematic approach to describing what a face looks like when facial muscle movements occur. FACS defines 44 coded facial muscle movements, namely Action Units (AUs), according to the presence and intensity of facial movements. Ekman et al. further proposed that facial emotion expressions can be coded as combinations of several AUs. Figure 6(a) and (b) display the common FACS codes [57] for anger and happiness, respectively, and Figure 6(c) presents the AU descriptions for anger and happiness. In this study, the AU codes for anger (AU 4, 5, 7, and 23) and happiness (AU 6 and 12) were used as the basic units for the differential analysis.

We utilized OpenFace [59], a facial expression analysis toolkit, to detect the presence of AUs; when an AU was detected, it was coded as 1 and otherwise as 0. Because video captures much richer data, DEFE contains far more facial expression observations than JAFFE. In the end, the number of observations of happy and angry expressions in JAFFE was 61, whereas DEFE, as a video dataset, provided 10,020 and 6,660 observations of happy and angry expressions, respectively.
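A minimal sketch of how the per-frame AU presence estimates could be read from OpenFace's CSV output (its FeatureExtraction tool writes one AUxx_c presence column per action unit); the file path and confidence threshold are illustrative.

```python
import pandas as pd

# The AUxx_c columns hold binary per-frame presence estimates for each action unit.
AUS = ["AU04_c", "AU05_c", "AU06_c", "AU07_c", "AU12_c", "AU23_c"]

df = pd.read_csv("driver_clip_openface.csv")  # hypothetical OpenFace output path
df.columns = df.columns.str.strip()           # some OpenFace versions pad column names with spaces

# Keep only frames in which a face was successfully tracked (0.8 is an illustrative threshold).
df = df[(df["success"] == 1) & (df["confidence"] > 0.8)]

# Per-AU presence frequency for this clip, i.e. the quantity compared in Table VI.
print(df[AUS].mean())
```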

To analyze the differences in AU presence between dynamic driving and static life conditions, we conducted a statistical analysis of the presence of AUs in the two datasets. Given that the same emotions appear in both datasets, we should not observe a statistical difference if the facial expressions were similar between dynamic driving and static life conditions. However, an average difference between the two datasets may not fully reflect emotion-related changes; instead, it may be driven by baseline differences between the datasets.

Hence, to study the relationship of these AUs to anger and happiness in the two datasets, a logit regression was performed on each dataset separately, with happiness coded as 1 and anger as 0. If the AU coefficients differ between the two datasets, it can be concluded that some AUs behave differently between dynamic driving and static life scenarios. Note that a positive coefficient means the AU is associated with happiness, whereas a negative coefficient means the AU is associated with anger.
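A sketch of the two statistical steps, using SciPy for the t-test and statsmodels for the logit regression; the data frame below holds random placeholder values, and the column names are assumptions.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind
import statsmodels.api as sm

# One row per observation (frame or image) with binary AU presence and the emotion
# label (1 = happiness, 0 = anger); random placeholder values for illustration.
rng = np.random.default_rng(0)
au = pd.DataFrame(rng.integers(0, 2, size=(200, 7)),
                  columns=["AU04", "AU05", "AU06", "AU07", "AU12", "AU23", "happy"])

# Two-sample t-test on the presence of one AU between two groups of observations
# (in the paper: the DEFE and JAFFE observations of the same emotion).
t_stat, p_val = ttest_ind(au["AU04"][:150], au["AU04"][150:], equal_var=False)
print(f"AU4 presence: t = {t_stat:.3f}, p = {p_val:.3f}")

# Logit regression of happiness (1) vs. anger (0) on AU presence, fitted
# separately for each dataset in the paper.
X = sm.add_constant(au[["AU04", "AU05", "AU06", "AU07", "AU12", "AU23"]])
logit = sm.Logit(au["happy"], X).fit(disp=0)
print(logit.params)
```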

TABLE VI: Statistical analysis of AU presence in anger and happiness across the DEFE and JAFFE datasets

Presence of AUs in anger:
| | AU4 | AU5 | AU7 | AU23 |
| --- | --- | --- | --- | --- |
| JAFFE mean | 0.433 | 0.683 | 0 | 0.05 |
| JAFFE SD | 0.5 | 0.469 | 0 | 0.22 |
| DEFE mean | 0.066 | 0.351 | 0.467 | 0.157 |
| DEFE SD | 0.248 | 0.477 | 0.499 | 0.364 |
| t-test | 5.689*** | 5.464*** | -76.378*** | -3.733*** |

Presence of AUs in happiness:
| | AU6 | AU12 |
| --- | --- | --- |
| JAFFE mean | 0.361 | 0.475 |
| JAFFE SD | 0.484 | 0.504 |
| DEFE mean | 0.177 | 0.18 |
| DEFE SD | 0.382 | 0.384 |
| t-test | 2.950*** | 4.578*** |

Note: ***: p < 0.01, **: 0.01 < p < 0.05
TABLE VII: Logit regression results

| Dataset | | AU4 | AU5 | AU6 | AU7 | AU12 | AU23 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| JAFFE | Coefficient | -1.156*** | -0.415*** | 0.084 | 0.571*** | 1.743*** | -0.450*** |
| JAFFE | S.E. | 0.1 | 0.039 | 0.066 | 0.038 | 0.101 | 0.055 |
| DEFE | Coefficient | -1.6** | 0.373 | 33.442 | 31.959 | -15.207 | 0.978 |
| DEFE | S.E. | 0.631 | 0.729 | 3961.164 | 3826.095 | 2037.702 | 1.516 |

Note: ***: p < 0.01, **: 0.01 < p < 0.05

VI-C Result and Discussion

Figure 7: Sample images of facial expressions in JAFFE (1st row) and DEFE (2nd row) with labelled AUs. Left: anger, right: happiness.

The statistical analysis results of AU presence are shown in Table VI. For happiness, AU6 and AU12 movements were observed in both JAFFE and DEFE; however, compared with JAFFE, the presence frequencies of AU6 and AU12 in DEFE were significantly lower (p<0.01). For anger, AU4, AU5, and AU23 movements were observed in both JAFFE and DEFE, with significant differences between the datasets (p<0.01). In addition, AU7, which was related to anger in DEFE, did not appear in the anger expressions in JAFFE. Sample images of facial expressions in JAFFE and DEFE with labelled AUs are shown in Figure 7.

Compared with JAFFE, DEFE showed lower presence frequencies for AU4, AU5, AU6, and AU12; in particular, AU4, which is highly related to anger, appeared only rarely in DEFE. This may be caused by the main driving task, which requires concentration during driving, and this concentration may reduce the presence of AUs near the eyes. On the other hand, the presence frequencies of AU7 and AU23 were lower in JAFFE, which may be because negative emotions are difficult to express in Japanese culture [60].

The logit regression results are shown in Table VII. In JAFFE, for happiness, the coefficients of AU6 and AU12 were consistent with FACS [57], meaning that AU6 and AU12 were related to happiness; however, only the result for AU12 was significant (p<0.01). For anger, the coefficients of AU4, AU5, and AU23 were consistent with FACS [57], meaning that AU4, AU5, and AU23 were related to anger, and all three results were significant (p<0.01). Interestingly, the presence of AU7 (lid tightener) indicated a relation to happiness, which differs from previous research [57]. In DEFE, only the result for AU4 was significant (0.01<p<0.05), and its coefficient was consistent with FACS, indicating that AU4 had significant predictive ability for anger. No significant results were observed for the other AUs.

Overall, in terms of AU presence, AU4 (Brow Lowerer), AU5 (Upper Lid Raiser), AU6 (Cheek Raiser), AU7 (Lid Tightener), AU12 (Lip Corner Puller), and AU23 (Lip Tightener) show significant differences between dynamic driving and static life scenarios. The presence of AU4, AU5, AU6, and AU12 is higher in static life scenarios, indicating that these AUs are affected in dynamic driving scenarios by the main driving tasks, which suppress the facial expression of the driver's emotions. Meanwhile, the presence of AU7 and AU23 is higher in dynamic driving scenarios, which may be because Japanese culture suppresses the expression of negative emotions [60]. The logit regression results also show significant differences between dynamic driving and static life scenarios. For anger, only AU4 is significantly related to anger in dynamic driving scenarios, while AU4, AU5, and AU23 are all significantly related to anger in static life scenarios. For happiness, the logit regression results show no significant correlation between AUs and happiness in dynamic driving scenarios, whereas AU12 is significantly related to happiness in static life scenarios. These significant differences were most likely due to the main driving tasks, which reduced the frequency and amplitude of facial muscle movements. Due to the limited amount of JAFFE data, these results may require further investigation.

VII Conclusion and future work

In this work, a dataset for the analysis of spontaneous driver emotions elicited by video-audio stimuli is presented. The dataset includes facial expression recordings of 60 participants during driving. After watching each of the three video-audio clips selected to elicit specific emotions, each participant completed the driving tasks in the same driving scenarios and rated their emotional responses during the driving process in terms of dimensional emotion and discrete emotion. These self-reported emotions include the arousal, valence, and dominance scales as well as the emotion category and intensity. We selected the three video-audio clips using the SAM and DES scales, which ensured the effectiveness of these stimulus materials for participants with a Chinese cultural background. In addition, we conducted classification experiments on the arousal, valence, and dominance scales as well as on emotion category and intensity to establish baseline results for the proposed dataset in terms of accuracy and F1 score; these results were significantly higher than those of random classification.

Moreover, we compared the classification results, in terms of accuracy and F1 score, of the DEFE dataset with the DEAP and CK+ datasets, and the results show that the recognition results on the DEFE dataset are lower than those on the CK+ dataset. Furthermore, we discussed the differences in facial expressions between driving and non-driving scenarios by comparing AU presence in the DEFE and JAFFE datasets. The results show significant differences in the AU presence of facial expressions between driving and non-driving scenarios, and these differences affect the results of facial emotion prediction, indicating that human emotional expressions in driving scenarios differ from those in other life scenarios. Therefore, publishing a human emotion dataset specifically for drivers is necessary for traffic safety improvement.

The DEFE dataset will be made publicly available after this work is published, allowing researchers to evaluate their algorithms on an off-the-shelf driver facial expression dataset and to investigate the possibility of applying them to real-world applications. The DEFE dataset makes it possible to study emotion recognition with different emotion models simultaneously. Meanwhile, the DEFE data can also be used to analyze the differences between driving and non-driving scenarios. In addition, DEFE contains facial occlusions, such as glasses and hands, which increase the complexity of facial expression recognition and pose a significant research challenge.

Acknowledgment

The authors would like to thank Peizhi Wang, Qianjing Hu, Mingqing Tang, Bingbing Zhang, Guanzhong Zeng and Mengna Liao for their assistance.

References

  • [1] W. H. Organization, Global status report on road safety 2015. World Health Organization, 2015.
  • [2] L. James, Road rage and aggressive driving: Steering clear of highway warfare. Prometheus Books, 2000.
  • [3] G. Li, W. Lai, X. Sui, X. Li, X. Qu, T. Zhang, and Y. Li, “Influence of traffic congestion on driver behavior in post-congestion driving,” Accident Analysis and Prevention, vol. 141, 2020.
  • [4] G. Li, S. E. Li, R. Zou, Y. Liao, and B. Cheng, “Detection of road traffic participants using cost-effective arrayed ultrasonic sensors in low-speed traffic situations,” Mechanical Systems and Signal Processing, vol. 132, pp. 535–545, 2019.
  • [5] F. Eyben, M. Wöllmer, T. Poitschke, B. Schuller, C. Blaschke, B. Färber, and N. Nguyen-Thien, “Emotion on the road—necessity, acceptance, and feasibility of affective computing in the car,” Advances in human-computer interaction, vol. 2010, 2010.
  • [6] Z. Zeng, M. Pantic, G. I. Roisman, and T. S. Huang, “A survey of affect recognition methods: Audio, visual, and spontaneous expressions,” IEEE transactions on pattern analysis and machine intelligence, vol. 31, no. 1, pp. 39–58, 2008.
  • [7] P. Ekman, W. V. Friesen, M. O’sullivan, A. Chan, I. Diacoyanni-Tarlatzis, K. Heider, R. Krause, W. A. LeCompte, T. Pitcairn, P. E. Ricci-Bitti, et al., “Universals and cultural differences in the judgments of facial expressions of emotion.,” Journal of personality and social psychology, vol. 53, no. 4, p. 712, 1987.
  • [8] W. G. Parrott, Emotions in social psychology: Essential readings. Psychology Press, 2001.
  • [9] P. Ekman and W. V. Friesen, “Constants across cultures in the face and emotion.,” Journal of personality and social psychology, vol. 17, no. 2, p. 124, 1971.
  • [10] J. A. Russell, “A circumplex model of affect.,” Journal of personality and social psychology, vol. 39, no. 6, p. 1161, 1980.
  • [11] A. Mehrabian, “Pleasure-arousal-dominance: A general framework for describing and measuring individual differences in temperament,” Current Psychology, vol. 14, no. 4, pp. 261–292, 1996.
  • [12] J. J. Gross and R. W. Levenson, “Emotion elicitation using films,” Cognition & Emotion, vol. 9, no. 1, pp. 87–108, 1995.
  • [13] M. M. Bradley and P. J. Lang, “Measuring emotion: The self-assessment manikin and the semantic differential,” Journal of Behavior Therapy and Experimental Psychiatry, vol. 25, no. 1, pp. 49–59, 1994.
  • [14] M. J. Lyons, S. Akamatsu, M. Kamachi, J. Gyoba, and J. Budynek, “The Japanese female facial expression (JAFFE) database,” in Proceedings of the Third International Conference on Automatic Face and Gesture Recognition, pp. 14–16, 1998.
  • [15] D. Lundqvist, A. Flykt, and A. Öhman, “The Karolinska directed emotional faces (KDEF),” CD ROM from Department of Clinical Neuroscience, Psychology Section, Karolinska Institutet, vol. 91, no. 630, pp. 2–2, 1998.
  • [16] M. Pantic, M. Valstar, R. Rademaker, and L. Maat, “Web-based database for facial expression analysis,” in 2005 IEEE International Conference on Multimedia and Expo, 5 pp., IEEE, 2005.
  • [17] L. Yin, X. Wei, Y. Sun, J. Wang, and M. J. Rosato, “A 3D facial expression database for facial behavior research,” in 7th International Conference on Automatic Face and Gesture Recognition (FGR06), pp. 211–216, IEEE, 2006.
  • [18] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker, “Multi-PIE,” Image and Vision Computing, vol. 28, no. 5, pp. 807–813, 2010.
  • [19] P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and I. Matthews, “The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression,” in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops, pp. 94–101, IEEE, 2010.
  • [20] O. Langner, R. Dotsch, G. Bijlstra, D. H. Wigboldus, S. T. Hawk, and A. Van Knippenberg, “Presentation and validation of the Radboud Faces Database,” Cognition and Emotion, vol. 24, no. 8, pp. 1377–1388, 2010.
  • [21] S. Koelstra, C. Muhl, M. Soleymani, J.-S. Lee, A. Yazdani, T. Ebrahimi, T. Pun, A. Nijholt, and I. Patras, “DEAP: A database for emotion analysis using physiological signals,” IEEE Transactions on Affective Computing, vol. 3, no. 1, pp. 18–31, 2011.
  • [22] I. Sneddon, M. McRorie, G. McKeown, and J. Hanratty, “The Belfast induced natural emotion database,” IEEE Transactions on Affective Computing, vol. 3, no. 1, pp. 32–41, 2011.
  • [23] S. M. Mavadati, M. H. Mahoor, K. Bartlett, P. Trinh, and J. F. Cohn, “DISFA: A spontaneous facial action intensity database,” IEEE Transactions on Affective Computing, vol. 4, no. 2, pp. 151–160, 2013.
  • [24] F. Ringeval, A. Sonderegger, J. Sauer, and D. Lalanne, “Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions,” in 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), pp. 1–8, IEEE, 2013.
  • [25] S. Du, Y. Tao, and A. M. Martinez, “Compound facial expressions of emotion,” Proceedings of the National Academy of Sciences, vol. 111, no. 15, pp. E1454–E1462, 2014.
  • [26] X. Zhang, L. Yin, J. F. Cohn, S. Canavan, M. Reale, A. Horowitz, P. Liu, and J. M. Girard, “BP4D-Spontaneous: A high-resolution spontaneous 3D dynamic facial expression database,” Image and Vision Computing, vol. 32, no. 10, pp. 692–706, 2014.
  • [27] S. Happy, P. Patnaik, A. Routray, and R. Guha, “The Indian spontaneous expression database for emotion recognition,” IEEE Transactions on Affective Computing, vol. 8, no. 1, pp. 131–142, 2015.
  • [28] E. Barsoum, C. Zhang, C. C. Ferrer, and Z. Zhang, “Training deep networks for facial expression recognition with crowd-sourced label distribution,” in Proceedings of the 18th ACM International Conference on Multimodal Interaction, pp. 279–283, 2016.
  • [29] C. Fabian Benitez-Quiroz, R. Srinivasan, and A. M. Martinez, “EmotioNet: An accurate, real-time algorithm for the automatic annotation of a million facial expressions in the wild,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5562–5570, 2016.
  • [30] S. Zafeiriou, D. Kollias, M. A. Nicolaou, A. Papaioannou, G. Zhao, and I. Kotsia, “Aff-Wild: Valence and arousal ‘in-the-wild’ challenge,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 34–41, 2017.
  • [31] S. R. Livingstone and F. A. Russo, “The Ryerson audio-visual database of emotional speech and song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English,” PLoS ONE, vol. 13, no. 5, 2018.
  • [32] A. Mollahosseini, B. Hasani, and M. H. Mahoor, “AffectNet: A database for facial expression, valence, and arousal computing in the wild,” IEEE Transactions on Affective Computing, vol. 10, no. 1, pp. 18–31, 2017.
  • [33] X. Wang, Y. Liu, F. Wang, J. Wang, L. Liu, and J. Wang, “Feature extraction and dynamic identification of drivers’ emotions,” Transportation Research Part F: Traffic Psychology and Behaviour, vol. 62, pp. 175–191, 2019.
  • [34] H. Gao, A. Yüce, and J.-P. Thiran, “Detecting emotional stress from facial expressions for driving safety,” in 2014 IEEE International Conference on Image Processing (ICIP), pp. 5961–5965, IEEE, 2014.
  • [35] G. Li, S. E. Li, B. Cheng, and P. Green, “Estimation of driving style in naturalistic highway traffic using maneuver transition probabilities,” Transportation Research Part C: Emerging Technologies, vol. 74, pp. 113–125, 2017.
  • [36] P. Wan, C. Wu, Y. Lin, and X. Ma, “On-road experimental study on driving anger identification model based on physiological features by ROC curve analysis,” IET Intelligent Transport Systems, vol. 11, no. 5, pp. 290–298, 2017.
  • [37] B. G. Lee, T. W. Chong, B. L. Lee, H. J. Park, Y. N. Kim, and B. Kim, “Wearable mobile-based emotional response-monitoring system for drivers,” IEEE Transactions on Human-Machine Systems, vol. 47, no. 5, pp. 636–649, 2017.
  • [38] L. Malta, C. Miyajima, N. Kitaoka, and K. Takeda, “Analysis of real-world driver’s frustration,” IEEE Transactions on Intelligent Transportation Systems, vol. 12, no. 1, pp. 109–118, 2010.
  • [39] C. Busso, Z. Deng, S. Yildirim, M. Bulut, C. M. Lee, A. Kazemzadeh, S. Lee, U. Neumann, and S. Narayanan, “Analysis of emotion recognition using facial expressions, speech and multimodal information,” in Proceedings of the 6th international conference on Multimodal interfaces, pp. 205–211, 2004.
  • [40] L. Yang, I. O. Ertugrul, J. F. Cohn, Z. Hammal, D. Jiang, and H. Sahli, “FACS3D-Net: 3D convolution based spatiotemporal representation for action unit detection,” in 2019 8th International Conference on Affective Computing and Intelligent Interaction (ACII), pp. 538–544, IEEE, 2019.
  • [41] J. A. Groeger, Understanding driving: Applying cognitive psychology to a complex everyday task. Psychology Press, 2000.
  • [42] G. Li, Y. Wang, F. Zhu, X. Sui, N. Wang, X. Qu, and P. Green, “Drivers’ visual scanning behavior at signalized and unsignalized intersections: A naturalistic driving study in China,” Journal of Safety Research, vol. 71, pp. 219–229, 2019.
  • [43] T. Lajunen, D. Parker, and H. Summala, “The Manchester Driver Behaviour Questionnaire: A cross-cultural study,” Accident Analysis & Prevention, vol. 36, no. 2, pp. 231–238, 2004.
  • [44] T. Brosch, K. R. Scherer, D. M. Grandjean, and D. Sander, “The impact of emotion on perception, attention, memory, and decision-making,” Swiss Medical Weekly, vol. 143, p. w13786, 2013.
  • [45] P. J. Lang, M. M. Bradley, B. N. Cuthbert, et al., “International Affective Picture System (IAPS): Technical manual and affective ratings,” NIMH Center for the Study of Emotion and Attention, vol. 1, pp. 39–58, 1997.
  • [46] M. M. Bradley and P. J. Lang, “The International Affective Digitized Sounds (IADS-2): Affective ratings of sounds and instruction manual,” University of Florida, Gainesville, FL, Tech. Rep. B-3, 2007.
  • [47] A. Schaefer, F. Nils, X. Sanchez, and P. Philippot, “Assessing the effectiveness of a large database of emotion-eliciting films: A new tool for emotion researchers,” Cognition and Emotion, vol. 24, no. 7, pp. 1153–1172, 2010.
  • [48] D. O. Bos et al., “EEG-based emotion recognition,” The Influence of Visual and Auditory Stimuli, vol. 56, no. 3, pp. 1–17, 2006.
  • [49] R. W. Levenson, L. L. Carstensen, W. V. Friesen, and P. Ekman, “Emotion, physiology, and expression in old age,” Psychology and Aging, vol. 6, no. 1, p. 28, 1991.
  • [50] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, “Joint face detection and alignment using multitask cascaded convolutional networks,” IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1499–1503, 2016.
  • [51] L. A. Jeni, J. F. Cohn, and F. De La Torre, “Facing imbalanced data–recommendations for the use of performance metrics,” in 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction, pp. 245–251, IEEE, 2013.
  • [52] O. Arriaga, M. Valdenegro-Toro, and P. Plöger, “Real-time convolutional neural networks for emotion and gender classification,” arXiv preprint arXiv:1710.07557, 2017.
  • [53] C. Pramerdorfer and M. Kampel, “Facial expression recognition using convolutional neural networks: state of the art,” arXiv preprint arXiv:1612.02903, 2016.
  • [54] S. Li and W. Deng, “Deep facial expression recognition: A survey,” arXiv preprint arXiv:1804.08348, 2018.
  • [55] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [56] R. E. Jack, O. G. B. Garrod, H. Yu, R. Caldara, and P. G. Schyns, “Facial expressions of emotion are not culturally universal,” Proceedings of the National Academy of Sciences, vol. 109, no. 19, pp. 7241–7244, 2012.
  • [57] L. F. Barrett, R. Adolphs, S. Marsella, A. M. Martinez, and S. D. Pollak, “Emotional expressions reconsidered: challenges to inferring emotion from human facial movements,” Psychological Science in the Public Interest, vol. 20, no. 1, pp. 1–68, 2019.
  • [58] P. Ekman, W. V. Friesen, and J. C. Hager, Facial action coding system: the manual. Salt Lake City, Utah: Research Nexus, 2002.
  • [59] T. Baltrušaitis, P. Robinson, and L.-P. Morency, “OpenFace: An open source facial behavior analysis toolkit,” in 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1–10, IEEE, 2016.
  • [60] D. Matsumoto and P. Ekman, “American-Japanese cultural differences in intensity ratings of facial expressions of emotion,” Motivation and Emotion, vol. 13, no. 2, pp. 143–157, 1989.
[Uncaptioned image] Wenbo Li received the B.S. and M.Sc. degrees in automotive engineering from Chongqing University, Chongqing, China, in 2014 and 2017, respectively. He is currently working toward the Ph.D. degree with the Advanced Manufacturing and Information Technology Laboratory, Department of Automotive Engineering, Chongqing University, Chongqing, China. He is also a visiting Ph.D. student at the Waterloo Cognitive Autonomous Driving (CogDrive) Lab at the University of Waterloo, Canada. His research interests include intelligent vehicles, human emotion, driver emotion detection, emotion regulation, human-machine interaction, and brain-computer interfaces.
[Uncaptioned image] Yaodong Cui received the B.S. degree in automation from Chang’an University, Xi’an, China, in 2017, and the M.Sc. degree in Systems, Control and Signal Processing from the University of Southampton, Southampton, UK, in 2019. He is currently working toward the Ph.D. degree with the Waterloo Cognitive Autonomous Driving (CogDrive) Lab, Department of Mechanical and Mechatronics Engineering, University of Waterloo, Waterloo, Canada. His research interests include sensor fusion, perception for intelligent vehicles, and driver emotion detection.
[Uncaptioned image] Yintao Ma received the B.Sc. degree in Engineering Mechanics from the University of Illinois at Urbana-Champaign, USA, in 2018. She is currently working toward the M.Sc. degree with the Waterloo Cognitive Autonomous Driving (CogDrive) Laboratory, Department of Mechanical and Mechatronics Engineering, University of Waterloo, ON, Canada. Her research interests include machine learning, image processing, and facial expression recognition.
[Uncaptioned image] Xingxin Chen received the B.Sc. degree from Nanjing University, Nanjing, China, in 2018. He is a Master of Applied Science (MASc) student in the Waterloo Cognitive Autonomous Driving (CogDrive) Laboratory, Department of Mechanical and Mechatronics Engineering, University of Waterloo, Canada. His research interests include domain adaptation, transfer learning, and computer vision.
[Uncaptioned image] Guofa Li (M’18) received the Ph.D. degree in Mechanical Engineering from Tsinghua University, Beijing, China, in 2016. He is currently an Assistant Professor in mechanical engineering and automation with the College of Mechatronics and Control Engineering, Shenzhen University, Guangdong, China. His research interests include driving safety in autonomous vehicles, driver behavior and decision making, computer vision, machine learning, and human factors in automotive and transportation engineering. He is the recipient of the Young Elite Scientists Sponsorship Program by SAE-China (2018), the Excellent Young Engineer Innovation Award from SAE-China (2017), and the NSK Sino-Japan Outstanding Paper Prize from NSK Ltd. (2014).
[Uncaptioned image] Gang Guo received the B.S., M.S., and Ph.D. degrees in automotive engineering from Chongqing University, Chongqing, China, in 1982, 1984, and 1994, respectively. He is currently the Chair and a Professor at the Department of Automotive Engineering, Chongqing University. He also serves as the Associate Director of the Chongqing Automotive Collaborative Innovation Center. He has authored and co-authored over 100 refereed journal and conference publications. His research interests include intelligent vehicles, multi-sense perception, human-machine interaction, brain-computer interfaces, intelligent manufacturing, and user experience. Dr. Guo is a senior member of the China Mechanical Engineering Society and the Director of the China Automotive Engineering Society. He is also a member of the China User Experience Alliance Committee.
[Uncaptioned image] Dongpu Cao (M’08) received the Ph.D. degree from Concordia University, Canada, in 2008. He is the Canada Research Chair in Driver Cognition and Automated Driving, and is currently an Associate Professor and Director of the Waterloo Cognitive Autonomous Driving (CogDrive) Lab at the University of Waterloo, Canada. His current research focuses on driver cognition, automated driving, and cognitive autonomous driving. He has contributed more than 200 papers and 3 books. He received the SAE Arch T. Colwell Merit Award in 2012, and three Best Paper Awards from ASME and IEEE conferences. Dr. Cao serves as an Associate Editor for IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY, IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, IEEE/ASME TRANSACTIONS ON MECHATRONICS, IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS, IEEE/CAA JOURNAL OF AUTOMATICA SINICA, and ASME JOURNAL OF DYNAMIC SYSTEMS, MEASUREMENT AND CONTROL. He was a Guest Editor for VEHICLE SYSTEM DYNAMICS and IEEE TRANSACTIONS ON SMC: SYSTEMS. He serves on the SAE Vehicle Dynamics Standards Committee and acts as the Co-Chair of the IEEE ITSS Technical Committee on Cooperative Driving.