A Spontaneous Driver Emotion Facial Expression (DEFE) Dataset for Intelligent Vehicles
Abstract
Abstract—In this paper, we introduce a new dataset, the driver emotion facial expression (DEFE) dataset, for the analysis of spontaneous driver emotions. The dataset includes facial expression recordings from 60 participants during driving. After watching a selected video-audio clip to elicit a specific emotion, each participant completed the driving tasks in the same driving scenario and rated their emotional responses during the driving process in terms of both dimensional and discrete emotion. We also conducted classification experiments to recognize the scales of arousal, valence, and dominance, as well as emotion category and intensity, to establish baseline results for the proposed dataset. In addition, this paper compares and discusses the differences in facial expressions between driving and non-driving scenarios. The results show significant differences in the presence of facial action units (AUs) between driving and non-driving scenarios, indicating that human emotional expressions in driving scenarios differ from those in other life scenarios. Therefore, publishing a human emotion dataset specifically for drivers is necessary for traffic safety improvement. The proposed dataset will be made publicly available so that researchers worldwide can use it to develop and examine their driver emotion analysis methods. To the best of our knowledge, this is currently the only public driver facial expression dataset.
Index Terms:
Driving safety, Driver emotion, Facial expression dataset, Spontaneous expression, Affective computing, Intelligent vehicles
I Background and Related Work
Driver emotion plays a vital role in driving because it affects driving safety and comfort. Among the 20-50 million non-fatal injuries and 1.24 million road traffic fatalities occurring worldwide every year [1], drivers' inability to control their emotions has been regarded as one of the critical factors degrading driving safety [2][3]. The rapid development of intelligent vehicles also creates an emerging demand for integrating driver-automation interaction and collaboration to enhance driving comfort, in which driver emotion is one of the critical driver states [4]. Therefore, recognizing driver emotions is essential for improving the driving safety and comfort of intelligent vehicles [5].
To describe human emotion, psychological researchers have provided two methodologies to classify emotions: discrete emotions and dimensional emotions [6]. Because humans use discrete words to describe emotions, discrete models are well-established and widely accepted, such as the basic emotions of Ekman et al. [7] and the emotion tree structure of Parrott [8]. Specifically, Ekman et al. categorized discrete emotions into six basic emotions (happiness, sadness, anger, fear, surprise, and disgust) [7], which are supported by cross-cultural research showing that humans perceive these basic emotions in a similar form regardless of cultural differences [9]. Dimensional emotion models propose that an emotional state can be accurately expressed as a combination of several psychological dimensions, such as the 2D "circumplex model" proposed by Russell [10] and the 3D dimensional model of Mehrabian et al. [11]. In the widely adopted model proposed by Russell [10], the valence dimension measures whether humans feel negative or positive, and the arousal dimension measures whether humans are bored or excited. Mehrabian et al. [11] extended the emotional model from 2D to 3D by adding a dominance dimension, which measures submissive or empowered feelings.
The discrete emotion method is intuitive and widely used in people's daily lives. However, it fails to cover the whole range of emotions exhibited by humans. The dimensional emotion method is less intuitive and often requires training participants to use the dimensional labelling system. Nevertheless, it is a more pragmatic and context-dependent approach to describing emotions [6]. In this study, considering the primary emotions of drivers during driving, we combine the discrete and dimensional emotion methods to quantitatively describe drivers' negative emotions (e.g., anger) and positive emotions (e.g., happiness) by employing the well-known differential emotion scale (DES) [12] and self-assessment manikin (SAM) [13].
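For concreteness, the two labelling schemes used in this study can be represented together in a single annotation record. The following is a minimal, hypothetical Python sketch (the field names are ours and do not correspond to any released file format) showing how one recording could carry both SAM dimensions and a DES category with its intensity, each on the 9-point scales described later in this paper.

```python
from dataclasses import dataclass

@dataclass
class EmotionAnnotation:
    """Hypothetical per-clip label combining dimensional (SAM) and discrete (DES) ratings."""
    participant_id: int
    valence: int    # SAM valence, 1 (negative) .. 9 (positive)
    arousal: int    # SAM arousal, 1 (calm) .. 9 (excited)
    dominance: int  # SAM dominance, 1 (submissive) .. 9 (empowered)
    category: str   # discrete label, e.g. "anger", "happiness", "neutral"
    intensity: int  # DES intensity of the reported category, 1 .. 9

    def __post_init__(self) -> None:
        # All ratings are expected to lie on the 9-point scales.
        for name in ("valence", "arousal", "dominance", "intensity"):
            value = getattr(self, name)
            if not 1 <= value <= 9:
                raise ValueError(f"{name} must be on the 9-point scale, got {value}")

# Example: an angry-driving recording rated as low valence and high arousal.
label = EmotionAnnotation(participant_id=7, valence=2, arousal=8, dominance=4,
                          category="anger", intensity=7)
```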
[Table 1: Summary of representative publicly available facial expression datasets [14]-[32], compared in terms of the elicited emotions, number of participants, recording condition (laboratory or wild setting), and emotion model/labels.]
Driver emotion recognition is often conducted by analyzing drivers' emotional expressions. Human emotional expressions consist of facial expressions, speech, body posture, and physiological changes. So far, different behavioural measurements (e.g., facial expression analysis, speech analysis, driving behaviour) [33][34][35], physiological signal measurements (e.g., skin electrical activity, respiration) [36][37], and self-reported scales (e.g., the self-assessment manikin) [38] have been applied to driver emotion recognition. Comparatively, physiological measurements are more objective and can be recorded continuously. However, they are highly invasive and may affect driving performance. Self-reported measurements capture the subjective experience of drivers when applied correctly, but they cannot be collected during the study without interruption. For studying driver emotion in the driving environment, it is crucial to use non-invasive and non-contact measurement methods; high intrusiveness significantly affects both the driver's emotional expression and the actual emotional experience, and should therefore be avoided [39]. To this end, this study employed facial expressions to recognize driver emotions while ensuring the continuity of data collection.
Facial expression is a powerful channel for drivers to express emotions [40]. Recent advances in facial expression-based emotion recognition have motivated the creation of multiple facial expression datasets, and publicly available datasets are fundamental for accelerating facial expression research. As shown in Table 1, we summarize representative, up-to-date publicly available facial expression datasets. These datasets have been used for emotion recognition with different levels of success. A common aspect of these datasets is that participants' facial expression data were collected in static life scenarios or wild settings. Although such data can be used to recognize emotions with various algorithms, this restricts the application of these algorithms to static life scenarios.
However, driving a car is a complex cognitive process [41] that requires the driver to dynamically respond to driving tasks such as processing visual cues, hazard assessment, decision-making, and strategic planning [42]. Consequently, driving occupies a large portion of the driver's cognitive resources [43], and cognitive processing is needed to elicit emotional responses [44]. Driving therefore affects drivers' emotional expressions, which differ from expressions in static life scenarios. As a result, if the above-mentioned algorithms are applied to dynamic driving scenarios, reliable recognition results may not be obtained. Thus, it is necessary to collect drivers' facial expression data specifically for driver emotion recognition in dynamic driving scenarios and to analyze the differences in human facial expressions between dynamic driving scenarios and static life scenarios.
II DEFE Data Collection Framework
To address the above-mentioned limitations, we introduce the driver emotion facial expression (DEFE) dataset in this study for driver emotion research in intelligent vehicles. Table 2 presents the details of the experimental design in terms of stimulus material selection, data collection, experiment protocol, and emotion labels. The performance of different emotion recognition algorithms is analyzed, and the differences in human facial expressions between dynamic driving scenarios and static life scenarios are also examined.
Video-audio stimulus selection |
---|---
Number of stimuli | 18
Stimulus duration | 30 s - 120 s
Initial stimulus selection | Manually selected
No. of ratings per stimulus | 35
Rating scales | SAM (valence, arousal, dominance); DES (emotion category and intensity)
Rating values | Discrete scale of 1-9
Selection method | K-means clustering of SAM ratings; DES success index
Driver facial expression data collection |
Number of participants | 60 (17 females, 43 males)
Number of stimuli | 3
Number of driving tasks | 3
Rating scales | SAM; DES
Rating values | Discrete scale of 1-9
Recorded signals | Driver facial expression videos
DEFE dataset content |
Number of video clips | 164
Emotions elicited | Anger, happiness, neutral
Clip duration | 30 s
Video clip format | MP4
Image resolution | 1920*1080, 648*480
Self-report of emotion | Yes
Emotion category labels | Anger, happiness, neutral
Emotion intensity labels | Discrete scale of 1-9 (DES)
In our DEFE dataset, video-audio clips were used as stimuli to induce different emotions. To this end, a large number of video-audio clips were collected using a manual selection method, and subjective annotation was then performed to select the most appropriate stimulus materials. Each stimulus material was rated at least 35 times using the SAM and DES scales, and the three most effective video-audio clips were selected to induce specific driver emotions in the subsequent data collection experiments. Then, 60 drivers participated in the data collection experiment. After watching each of the three randomly sequenced video-audio clips selected to elicit a specific emotion, each participant completed the driving tasks in the same driving scenario and rated their emotional responses during the driving process in terms of dimensional and discrete emotion. We also conducted classification experiments for the scales of arousal, valence, and dominance, as well as emotion category and intensity, to establish baseline results for our dataset in terms of classification accuracy and F1 score. Furthermore, we discuss the differences in facial expressions between dynamic driving scenarios and static life scenarios under similar cultural backgrounds by comparing the responses of different action units (AUs) in our DEFE dataset and the JAFFE dataset.
The main contributions of this paper can be described as:
- We provide a new, publicly available dataset, DEFE, for spontaneous driver emotion analysis. The dataset contains frontal facial videos from 60 drivers, together with their biographic information (gender, age, years of driving experience) and subjective ratings of driver emotion (the arousal, valence, and dominance scales, as well as emotion category and intensity). To the best of our knowledge, this is currently the only public dataset of driver facial expressions.
- We compared driver emotion classification results on the DEFE dataset using mainstream classification algorithms. The DEFE dataset supports driver emotion classification from two aspects: dimensional emotion (arousal, valence, and dominance) and discrete emotion (emotion category and intensity). These comparisons established baseline results for the introduced dataset in terms of classification accuracy and F1 score.
- The differences in human facial expressions between dynamic driving scenarios and static life scenarios were examined by analyzing drivers' AU presence, and the results showed significant differences between the two types of scenarios. Therefore, previous human emotion datasets cannot be directly used for driver emotion analysis, and our DEFE dataset fills this research gap.
The structure of this paper is as follows: Section III presents the selection of stimulus materials. Section IV introduces the DEFE data collection details, and the data processing, classification methods and results are described in Section V. Section VI compares human facial expression differences in dynamic driving scenarios and static life scenarios. The final conclusions are shown in Section VII.
III Video-audio Stimulus Selection
Stimuli are necessary to elicit target emotions, and emotion datasets generally rely on standardized stimuli to evoke emotions, such as the international affective picture system (IAPS) [45] and the international affective digitized sound system (IADS) [46]. Compared with images and music, video-audio clips usually evoke stronger emotional responses, and existing research has confirmed that video-audio clips can reliably elicit emotions in subjects [12, 47]; hence, video-audio clips were used in our experiments. Eighteen initial video-audio clips were manually selected, and we then recruited participants for a subjective rating experiment on these clips. Finally, three video-audio clips were selected based on the subjective rating results. Each of these steps is explained in detail as follows.
III-A Initial Video-Audio Clips Selection
To select the most effective video-audio clips, two research assistants (1 male and 1 female) reviewed more than 500 video-audio clips and conducted a preliminary screening. They were asked to select clips that lasted 30-120 seconds and contained content eliciting a single target emotion: a negative emotion (anger), a positive emotion (happiness), or a neutral state. Two additional experts (1 male and 1 female) with rich experience in driver emotion analysis then evaluated each selected clip, and the final clips were chosen by consensus of the two experts.
The selected video-audio clips are mainly based on Chinese real-life scenarios and events, such as aggressive driving and chatting. Other selection criteria included: 1) the video background should not be too dark, 2) the clip should contain complete speech segments, and 3) the clip should express only one target emotion. Accordingly, we selected 18 video-audio clips and examined them further in the subjective annotation session.
Target Emotion | Duration (s) | Clip Content
---|---|---
Happiness | 62 | -
Anger | 45 | -
Neutral | 48 | -
III-B Subjective Annotation
A web-based subjective emotion annotation experiment was conducted to evaluate the video-audio clips. For each participant, the 18 clips were displayed in random order, with a relatively long break (3 minutes) between every two clips to avoid interference from the previous one. After watching each clip, participants completed two questionnaires based on their true feelings, namely the self-assessment manikin (SAM) [13] and the differential emotion scale (DES) [12]. SAM uses non-verbal graphical representations to assess valence, arousal, and dominance levels, and its effectiveness has been demonstrated in [13]. We adopted a 9-point SAM scale (1 = "not at all", 9 = "extremely") [13] for evaluation. DES assesses the different components of emotion and consists of ten basic emotions. In this study, we used a 9-point DES scale (1 = "not at all", 9 = "extremely") [12] to assess the intensity of each self-reported emotional dimension. No clip was evaluated twice by the same participant, and at least 35 assessments were collected for each clip.

III-C Video-Audio Clips Selection
Three video-audio clips were selected by comprehensively considering the SAM and DES results. First, we normalized the variables by calculating Z-scores and then conducted a cluster analysis using the K-means algorithm to identify clusters of emotions based on the SAM data. The clustering produced three emotion categories, corresponding to a positive emotion (happiness), a negative emotion (anger), and neutral, respectively. The video-audio clip whose rating was closest to the extreme corner of each quadrant was selected and marked as the representative clip of that cluster [21].
Moreover, we selected video-audio clips for each emotion category based on the following scores computed from the DES data: 1) the hit rate, defined as the proportion of participants who chose the target emotion; 2) the intensity value, defined as the average score of the target emotion; and 3) the success index, the sum of the two Z-scores obtained by normalizing the hit rate and intensity values. The video-audio clip with the highest success index was selected from each emotion category, and the representative clips selected according to the SAM data were used for verification, as sketched below. Note that the neutral clip was selected based on the clustering results only. Eventually, as shown in Table 3, the three most effective clips were selected for the driver facial expression data collection experiment.
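A minimal sketch of this selection procedure is given below, assuming the per-clip mean SAM ratings, DES hit rates, and DES intensities have already been tabulated; the array contents are random placeholders rather than the ratings actually collected, and scikit-learn/SciPy are assumed. For brevity the sketch applies the success index to every cluster, whereas in our procedure the neutral clip was chosen from the clustering result alone.

```python
import numpy as np
from scipy.stats import zscore
from sklearn.cluster import KMeans

# Placeholder per-clip statistics for the 18 candidate clips (not the real ratings).
sam = np.random.default_rng(0).uniform(1, 9, size=(18, 3))       # mean valence, arousal, dominance
hit_rate = np.random.default_rng(1).uniform(0.3, 1.0, size=18)   # share of raters choosing the target emotion
intensity = np.random.default_rng(2).uniform(1, 9, size=18)      # mean DES intensity of the target emotion

# 1) Cluster the z-scored SAM ratings into three emotion groups (happy / angry / neutral).
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(zscore(sam, axis=0))

# 2) Success index: sum of the z-scored hit rate and the z-scored intensity.
success_index = zscore(hit_rate) + zscore(intensity)

# 3) Within each cluster, keep the clip with the highest success index.
for c in range(3):
    members = np.where(clusters == c)[0]
    best = members[np.argmax(success_index[members])]
    print(f"cluster {c}: selected clip #{best}, success index {success_index[best]:.2f}")
```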
IV Driver Facial Expression Data Collection
IV-A Ethics Statement
The experimental procedure and the video content shown to the participants were approved by the Chongqing University Cancer Hospital Ethics Committee, China. Participants and their data were treated according to the Declaration of Helsinki. The participants were informed that they had the right to quit the experiment at any time. The video recordings of the participants were included in the dataset only after they gave written consent for the use of their videos for research purposes. A few participants also agreed to the use of their face images in research articles.
IV-B Participants
Sixty Chinese participants (47 males and 13 females), aged 19 to 56 years (mean [M] = 27.3 years, standard deviation [SD] = 7.7 years), were recruited from Shapingba District, Chongqing, China. Each participant held a valid driving license with at least one year of driving experience (M = 5.5 years, SD = 5.8 years, range = 1-30 years). All participants had normal or corrected-to-normal vision (36 participants wore glasses) and normal hearing. Since occlusions such as glasses are a significant research challenge in facial expression recognition, participants wearing glasses were included to evaluate the robustness of emotion recognition. All participants signed a consent form and received 100 CNY as financial reimbursement for their participation.
IV-C Experiment Setup
The experiments were carried out in a fixed-base, illumination-controlled driving simulator (Figure 1(b)) (RDS2000, Realtime Technologies SimCreator, Ann Arbor, Michigan, USA). As shown in Figure 1(d), the front view was presented using three projectors, and the rear view was displayed using three LCD screens (one for the in-vehicle rearview and two for the left and right rear views). Another two LCD screens were used to display the dashboard and the central stack. Ambient noise and engine sound were presented through two speakers, and vehicle vibration was simulated through a woofer under the driver's seat. To present the stimuli without changing the internal environment of the driving simulator, as shown in Figure 1(e), we used a 20-inch central stack screen (60 Hz) to display the video-audio stimulus materials. A stereo Bluetooth speaker (Xiaomi) was used to play the audio, and the volume was set to a relatively loud level; each participant was asked before the experiment whether the volume was comfortable, and it was adjusted when necessary for clear hearing. During the experiment, as shown in Figure 1(a), the participants' faces were continuously recorded with a visual camera, an HD Pro Webcam C920 (Logitech, Newark, CA) collecting data at a frame rate of 30 fps. An iPad (Apple) was used for participants' self-reported emotions. Figure 1(c) shows the overall data collection setup.

IV-D Driving Scenarios and Tasks
Two highway driving scenarios were realized in the simulator; these relatively simple scenarios were chosen to minimize the impact of complex driving situations on driver performance. The first was a practice driving (PD) scenario to help participants familiarize themselves with the simulator before the experiment. As shown in Figure 2(a), the practice scenario was an 8 km straight section of a four-lane highway with two lanes in each direction, and participants were asked to drive in the right lane with the speed changing in the sequence 80 km/h - 50 km/h - 100 km/h. The second was an emotional driving (ED) scenario. As shown in Figure 2(b), it was a 3 km straight section of the same highway with a posted speed limit of 80 km/h, and participants were asked to drive in the right lane at a speed of around 80 km/h.
IV-E Experiment Protocol
To obtain drivers' ED data, we designed an experimental protocol comprising about 45 minutes of driving. The protocol was composed of one PD session followed by three ED sessions: angry driving (AD), happy driving (HD), and neutral driving (ND). Figure 3 presents the details of the protocol. Before the experiment, each participant signed a consent form and filled out a basic information questionnaire (gender, age, years of driving experience). Next, they were given a set of instructions describing the experimental protocol and the definitions of the different scales used for self-reported emotions. The participants then drove a 10-minute PD session to become familiar with the operation and motion performance of the driving simulator. After a short break following the PD, the participants started the three EDs. At the beginning of each ED, the corresponding emotion was induced by watching the selected video-audio clip, followed by driving with the induced emotion. At the end of each ED session, the participant reported his/her self-evaluated emotion level using SAM and DES. There was a 3-minute break between every two EDs. Throughout the experiment, participants could withdraw at any time if they felt any discomfort.
IV-F Self-Reported Emotion
To identify the emotions experienced by participants, we employed self-reported scales for subjective assessment. After each driving task, participants assessed their emotional experience while driving using SAM and DES, which were presented on an iPad. In SAM, the valence scale ranges from unhappy to happy, the arousal scale from calm to stimulated, and the dominance scale from submissive (or "without control") to dominant (or "in control, empowered"). In DES, there are ten emotion dimensions, and each dimension rates the intensity of an emotion from "not at all" to "extremely". Each dimension of the SAM and DES scales is rated on a 9-point Likert scale. If a participant's self-assessment was not consistent with the induced target emotion, the participant's self-reported data were used as the ground truth to label the facial video data.

V Data Processing and Evaluation
V-A Data Processing
In this section, we describe the processing of the driver facial expression data. First, we labelled the facial expression data of the 60 drivers according to their self-reported emotions and removed the ED data for which the target emotion was not successfully induced. Second, we describe how the data were split for driver emotion recognition, including clipping the effective video segments from the original recordings and extracting the drivers' facial expressions.

During data collection, each participant completed three ED sessions, with an average recording length of 405 s. We also compiled the self-reported data for each participant. As shown in Figure 4, the numbers of drivers whose target emotions were successfully induced were 52, 56, and 56 for angry, happy, and neutral driving, respectively. Participants' self-reported data were used as the ground truth to label the driver facial expression data.

Following [48] and [49], the facial expression video sequences recorded 15 s after the drivers started driving were clipped as the most effective data. Face detection and alignment in driving environments are challenging due to varying poses, illumination, and occlusions (glasses). MTCNN (Multi-task Cascaded Convolutional Networks) is a deep learning-based cascade structure that detects faces relatively accurately across multiple pose angles and in unconstrained scenes [50]. Hence, we used MTCNN to track and extract driver face data from each video frame, as sketched below. After extracting the driver face data, we obtained a total of 17,310 image frames of driver faces at 64×64 pixels.
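The face extraction step can be sketched as follows, using the open-source `mtcnn` Python package and OpenCV as stand-ins for the MTCNN implementation actually used; the video file name is hypothetical, and only the 64×64 output size follows the description above. Keeping the largest detected box is a simplifying assumption for single-driver footage.

```python
import cv2
from mtcnn import MTCNN

detector = MTCNN()

def extract_faces(video_path: str, size: int = 64):
    """Detect the largest face in each frame and return size x size crops."""
    crops = []
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)          # MTCNN expects RGB input
        detections = detector.detect_faces(rgb)
        if not detections:
            continue
        # Keep the largest detected box, assumed to be the driver.
        x, y, w, h = max(detections, key=lambda d: d["box"][2] * d["box"][3])["box"]
        x, y = max(x, 0), max(y, 0)
        face = rgb[y:y + h, x:x + w]
        crops.append(cv2.resize(face, (size, size)))
    cap.release()
    return crops

faces = extract_faces("DEFE_participant01_angry.mp4")  # hypothetical file name
print(f"extracted {len(faces)} face crops")
```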
Therefore, the created dataset contains facial expression videos and images from 60 drivers with ground-truth labels for dimensional emotion (valence, arousal, and dominance) and discrete emotion (emotion category and intensity). A few example images from the dataset are provided in Figure 5, which shows that drivers' facial expressions varied with the type of emotion, but the variation was weak in some cases during driving; for example, the difference between AD and ND was tiny. Peak expressions were difficult to observe in most video clips, and the change of expression over the driving duration was also weak, probably because the facial expression of emotion was affected by the driving tasks.
V-B Classification Protocol
In this section, we introduce two protocols for driver emotion recognition based on the facial expression data. (1) To investigate driver emotion classification based on the dimensional emotion model, we propose three nine-class classification problems: valence, arousal, and dominance. The SAM scores of the participants were used as the ground truth, and each scale (valence, arousal, dominance) was divided into nine levels (1 = "not at all", 9 = "extremely"). (2) To study driver emotion classification based on the discrete emotion model, we propose a three-class emotion classification protocol covering anger, happiness, and neutral. In addition, we consider intensity recognition for the anger and happiness emotions, with the DES scores taken as the ground truth; the intensity of each emotion (anger and happiness) was divided into five levels (5 = "no emotion", 9 = "maximum intensity").
It should be noted that the above approach can lead to unbalanced classes for some participants and scales. In light of this, we included F1 scores in order to report reliable results. The F1 score is a commonly used metric in classification tasks that considers both the precision (P) and recall (R) of the model and quantifies how well the positive samples are predicted; when categories are unbalanced, the F1 score is attenuated [51], so it reflects class imbalance that plain accuracy can hide. We additionally used accuracy as a second metric, which quantifies the overall proportion of correctly classified samples.
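The two metrics can be computed as in the following minimal sketch with dummy labels, assuming scikit-learn is available; treating the reported "average F1 score" as a macro average over the classes is our interpretation.

```python
from sklearn.metrics import accuracy_score, f1_score

# Dummy ground-truth SAM levels (1-9) and predictions, for illustration only.
y_true = [1, 5, 9, 5, 3, 7, 9, 1]
y_pred = [1, 5, 8, 5, 3, 7, 9, 2]

acc = accuracy_score(y_true, y_pred)
f1_macro = f1_score(y_true, y_pred, average="macro")  # unweighted mean over the classes present
print(f"accuracy = {acc:.3f}, macro F1 = {f1_macro:.3f}")
```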
Both traditional and deep learning methods were included for the emotion recognition tasks. As one of the most effective traditional methods in many classification tasks [19], SVM (Support Vector Machine) with a linear kernel was implemented using the scikit-learn toolbox. For the deep learning-based classification, Xception [52] was applied; the Xception network has been widely adopted in emotion recognition tasks, and many state-of-the-art emotion recognition networks are built on it [53][54]. For the network, the loss function can be expressed as:
$L = -\sum_{c=1}^{C} y_c \log(\hat{y}_c)$   (1)

where $\hat{y}_c$ is the predicted probability for class $c$, $y_c$ is the corresponding ground truth, and $C$ is the number of classes. The deep learning method used the following training strategy. First, it employed the Adam optimizer [55] with a fixed learning rate and weight decay. Second, image augmentations, including random horizontal flips, random crops, and random rotations, were applied on-the-fly to effectively increase the amount of training images. SVM was run on an Intel Core i5 dual-core CPU, and Xception was trained on a TITAN Xp GPU.
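A compact sketch of both baselines is shown below, assuming face crops and integer labels are already in memory; the data are random placeholders, and the resize target, learning rate, batch size, and epoch count are illustrative choices rather than the settings used in the paper.

```python
import numpy as np
import tensorflow as tf
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Placeholder data: N face crops (64x64x3 in [0, 1]) with 9-level valence labels (0..8).
X = np.random.rand(200, 64, 64, 3).astype("float32")
y = np.random.randint(0, 9, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# --- Linear-kernel SVM baseline on flattened pixels ---
svm = SVC(kernel="linear")
svm.fit(X_tr.reshape(len(X_tr), -1), y_tr)
print("SVM accuracy:", svm.score(X_te.reshape(len(X_te), -1), y_te))

# --- Xception baseline trained with cross-entropy and Adam ---
# Keras' Xception needs inputs of at least 71x71, so crops are resized to 96x96 here.
inputs = tf.keras.Input(shape=(64, 64, 3))
x = tf.keras.layers.Resizing(96, 96)(inputs)
x = tf.keras.layers.RandomFlip("horizontal")(x)   # on-the-fly augmentation (training only)
x = tf.keras.layers.RandomRotation(0.05)(x)
backbone = tf.keras.applications.Xception(include_top=False, weights=None,
                                          input_shape=(96, 96, 3), pooling="avg")
outputs = tf.keras.layers.Dense(9, activation="softmax")(backbone(x))
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(X_tr, y_tr, validation_data=(X_te, y_te), batch_size=32, epochs=2)
```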
V-C Evaluation Results
In addition to reporting emotion recognition results on the proposed dataset, we selected the DEAP [21] and CK+ [19] datasets, which were collected in static life scenarios, as comparison datasets. The DEAP dataset consists of 32 participants, each of whom watched forty one-minute video-audio clips as emotional stimuli while facial videos and physiological signals were recorded; there are 40 trials per participant, each corresponding to one emotion elicited by one clip. After watching each clip, the participants assessed their emotions on five dimensions: valence, arousal, dominance, liking, and familiarity. The ratings range from 1 (weakest) to 9 (strongest), except for liking and familiarity, which are rated from 1 to 5. Facial videos were recorded for 22 of the participants; this paper adopts these 22 facial videos to compare emotion classification results based on the dimensional emotion model. The CK+ dataset consists of 123 participants and contains both posed and spontaneous expressions, with each sequence starting from a neutral face and ending at the peak expression. In CK+, 327 sequences have discrete emotion labels, including neutral, sadness, surprise, happiness, fear, anger, contempt, and disgust. This paper selected the neutral, anger, and happiness sequences from this dataset to compare emotion classification results based on the discrete emotion model.
Dataset | Method | Valence | Arousal | Dominance | |||
ACC | F1 | ACC | F1 | ACC | F1 | ||
DEFE | SVM | 53.39 | 54.79 | 59.49 | 63.04 | 59.49 | 63.04 |
Xception | 86.00 | 83.73 | 91.54 | 91.76 | 88.17 | 79.55 | |
DEAP | SVM | 27.88 | 23.24 | 29.82 | 23.25 | 28.12 | 24.14 |
Xception | 24.10 | 21.41 | 35.06 | 31.80 | 31.00 | 24.24 |
Table 4 shows the average accuracies and F1 scores (average F1 scores over the nine classes) for each rating scale (valence, arousal, and dominance) when using protocol one on DEFE. We compared the performance of SVM and Xception on the DEFE dataset. In general, the accuracies obtained with Xception were at least 28 percentage points higher than those obtained with SVM. The highest classification accuracies for valence, arousal, and dominance were 86.00%, 91.54%, and 88.17%, respectively, all achieved with Xception. In terms of F1 scores, the highest scores for valence, arousal, and dominance were 83.73, 91.76, and 79.55, respectively, also with Xception. In addition to the results on the DEFE dataset, Table 4 shows the comparison results on the DEAP dataset using the same recognition algorithms. The DEFE dataset yielded higher recognition accuracies and F1 scores than the DEAP dataset, possibly because the participants' faces in DEAP were affixed with electrode pads for physiological signal collection, which affected the facial expression recognition results.
Dataset | Method | Emotion category | Angry intensity | Happy intensity | |||
ACC | F1 | ACC | F1 | ACC | F1 | ||
DEFE | SVM | 53.08 | 52.93 | 86.01 | 87.42 | 85.41 | 85.57 |
Xception | 90.34 | 90.21 | 97.60 | 97.12 | 97.88 | 97.59 | |
CK+ | SVM | 82.70 | 71.45 | - | - | - | - |
Xception | 94.31 | 93.25 | - | - | - | - |
Similarly, Table 5 shows the average accuracies and F1 scores for the emotion categories (anger, happiness, and neutral) when using protocol two, comparing the classification results of SVM and Xception. Both the highest classification accuracy (90.34%) and the highest F1 score (90.21) were obtained with Xception. Apart from the results on DEFE, Table 5 also presents the comparison results on the CK+ dataset using the same recognition algorithms; the recognition results on CK+ were higher than those on DEFE.
Moreover, Table 5 shows the average accuracies and F1 scores of the intensity classification results for the anger and happiness emotions when using protocol two with different algorithms. Five intensity classes of anger and happiness were classified based on the facial expression data. The highest classification accuracies for angry and happy driving intensity were 97.60% and 97.88%, respectively, and the highest F1 scores were 97.12 and 97.59, respectively. It should be noted that we did not compare the emotion intensity recognition results with other datasets, because there is currently no other spontaneous facial expression dataset with emotion intensity labels.
The comparison results in this section show that there is a difference in human facial expressions between DEFE and CK+. Due to the influence of driving tasks in driving scenarios, drivers' facial expressions may be suppressed when they experience emotional states. Hence, it is necessary to further examine the difference between human facial expressions in dynamic driving scenarios and static life scenarios.
VI The Facial Expression Difference between Dynamic Driving and Static Life Conditions
VI-A Dataset Selection for Comparison
In this section, we conducted a differential analysis of facial expressions between dynamic driving and static life conditions by comparing the DEFE and JAFFE datasets. The static life dataset, the Japanese Female Facial Expression (JAFFE) dataset [14], was selected as the baseline. It was posed by 10 East-Asian females with seven expressions (happiness, anger, disgust, fear, sadness, surprise, and neutral), and each female provided two to four examples of each emotion. In total, the dataset contains 213 grayscale facial expression images.
Given the small differences within East-Asian cultural backgrounds, the JAFFE dataset was the most suitable control group for our DEFE dataset because it excludes most cultural bias [56]. Since DEFE includes only two emotional expressions (anger and happiness), we also selected the anger and happiness expressions from JAFFE for analysis. Meanwhile, because gender differences may affect the results, we removed the male drivers from the DEFE data used in this comparison.

VI-B Differential Analysis Protocol
Each participant's facial expressions were evaluated by observing subtle changes in facial features. The Facial Action Coding System (FACS) [58] is a systematic approach for describing what a face looks like when facial muscle movements occur; it defines 44 coded facial muscle movements, namely Action Units (AUs), according to the presence and intensity of facial movements. Ekman et al. further proposed that facial emotion expressions can be coded as combinations of several AUs. Figures 6(a) and (b) display the common FACS [57] codes for anger and happiness, respectively, and Figure 6(c) presents the AU descriptions for anger and happiness. In this study, the AU codes for anger (AU 4, 5, 7, and 23) and happiness (AU 6 and 12) were used as the basic units for the differential analysis.
We utilized OpenFace [59], a facial expression analysis toolkit, to detect the presence of AUs; when an AU was detected, we coded it as 1, and otherwise as 0. Because video captures much richer data, DEFE contained far more facial expression observations than JAFFE. In the end, the number of observations of happiness and anger expressions in JAFFE was 61, whereas DEFE, as a video dataset, had 10,020 and 6,660 observations of happiness and anger expressions, respectively.
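OpenFace's FeatureExtraction tool writes per-frame CSV files that include binary AU presence columns (named like `AU04_c`); the sketch below shows one way such output can be turned into the 0/1 presence codes used here. The CSV path is hypothetical, and the exact column names should be checked against the OpenFace version in use.

```python
import pandas as pd

# AUs of interest: anger (4, 5, 7, 23) and happiness (6, 12), following FACS.
AU_COLUMNS = ["AU04_c", "AU05_c", "AU06_c", "AU07_c", "AU12_c", "AU23_c"]

def au_presence(openface_csv: str) -> pd.DataFrame:
    """Return per-frame binary AU presence codes from an OpenFace output CSV."""
    df = pd.read_csv(openface_csv)
    df.columns = [c.strip() for c in df.columns]   # some OpenFace versions pad column names with spaces
    return df[AU_COLUMNS].astype(int)              # the *_c columns are already 0/1 presence flags

presence = au_presence("openface_output/participant01_angry.csv")  # hypothetical path
print(presence.mean())   # per-AU presence frequency, comparable to the averages in Table 6
```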
To analyze the differences in AU presence between dynamic driving and static life conditions, we conducted a statistical analysis of the presence of AUs in the two datasets. Given that the same emotions appear in both datasets, we should not observe a statistical difference if the facial expressions were similar between dynamic driving and static life conditions. Meanwhile, the average difference between the two datasets may not fully reflect emotional changes; it may instead stem from baseline differences between the two datasets.
Hence, to study the relationship of these AUs to anger and happiness in the two datasets, a logit regression was performed on each dataset separately, with happiness coded as 1 and anger as 0. If the coefficients of the AUs differed between the two datasets, it could be concluded that some AUs behave differently between dynamic driving and static life scenarios. Note that a positive coefficient means the AU is related to happiness, and a negative coefficient means the AU is related to anger.
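A minimal sketch of this statistical analysis is given below with synthetic placeholder data: `scipy.stats.ttest_ind` stands in for the between-dataset comparisons of AU presence (Table 6), and a `statsmodels` logit regression stands in for the per-dataset regressions (Table 7); neither call is claimed to reproduce the exact tests used in the paper.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# Synthetic binary AU4 presence codes for anger samples in the two datasets (placeholders).
au4_defe = rng.binomial(1, 0.07, size=6660)    # anger frames in DEFE
au4_jaffe = rng.binomial(1, 0.43, size=30)     # anger images in JAFFE
print(ttest_ind(au4_defe, au4_jaffe, equal_var=False))   # between-dataset difference in AU4 presence

# Logit regression within one dataset: happiness coded 1, anger coded 0,
# predicted from the six AU presence codes (AU4, AU5, AU6, AU7, AU12, AU23).
n = 500
X = rng.binomial(1, 0.3, size=(n, 6))
y = rng.binomial(1, 0.5, size=n)
model = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
print(model.summary())   # positive coefficients point toward happiness, negative toward anger
```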
The presence of AUs in anger | AU 4 | AU 5 | AU 7 | AU 23 | ||
Anger | JAFFE | Average | 0.433 | 0.683 | 0 | 0.05 |
STD | 0.5 | 0.469 | 0 | 0.22 | ||
DEFE | Average | 0.066 | 0.351 | 0.467 | 0.157 | |
STD | 0.248 | 0.477 | 0.499 | 0.364 | ||
T-test | 5.689*** | 5.464*** | -76.378*** | -3.733*** | ||
The presence of AUs in Happiness | AU 6 | AU 12 | - | - | ||
Happiness | JAFFE | Average | 0.361 | 0.475 | - | - |
STD | 0.484 | 0.504 | - | - | ||
DEFE | Average | 0.177 | 0.18 | - | - | |
STD | 0.382 | 0.384 | - | - | ||
T-test | 2.950*** | 4.578*** | - | - | ||
Note: p<0.01: ***, 0.01<p<0.05: ** |
Dataset | AUs | AU 4 | AU 5 | AU 6 | AU 7 | AU 12 | AU 23
JAFFE | Coefficient | -1.156*** | -0.415*** | 0.084 | 0.571*** | 1.743*** | -0.450*** |
S.E. | 0.1 | 0.039 | 0.066 | 0.038 | 0.101 | 0.055 | |
DEFE | Coefficient | -1.6** | 0.373 | 33.442 | 31.959 | -15.207 | 0.978 |
S.E. | 0.631 | 0.729 | 3961.164 | 3826.095 | 2037.702 | 1.516 | |
Note: p<0.01: ***, 0.01<p<0.05: ** |
VI-C Results and Discussion

The statistical analysis results for AU presence are shown in Table 6. For happiness, AU6 and AU12 movements were observed in both JAFFE and DEFE; however, compared with JAFFE, the presence frequencies of AU6 and AU12 in DEFE were significantly lower (p<0.01). For anger, AU4, AU5, and AU23 movements were observed in both JAFFE and DEFE, with significant differences between the datasets (p<0.01). In addition, we found that AU7, which was related to anger in DEFE, did not appear in the anger expressions in JAFFE. Sample images of facial expressions in JAFFE and DEFE with labelled AUs are shown in Figure 7.
Compared with JAFFE, DEFE had lower presence frequencies of AU4, AU5, AU6, and AU12; in particular, AU4, which is highly related to anger, appeared only rarely in DEFE. This may be caused by the primary driving task, which requires concentration during driving; such concentration may decrease the presence of AUs near the eyes. On the other hand, the presence frequencies of AU7 and AU23 were lower in JAFFE, which may be because negative emotions are difficult to express in Japanese culture [60].
The logit regression results are shown in Table 7. In JAFFE, for happiness, the coefficients of AU6 and AU12 were consistent with FACS [57], meaning that AU6 and AU12 were related to happiness; however, only the result for AU12 is significant (p<0.01). For anger, the coefficients of AU4, AU5, and AU23 were consistent with FACS [57], meaning that AU4, AU5, and AU23 were related to anger, and all three results are significant (p<0.01). Interestingly, the presence of AU7 (lid tightener) was related to happiness, which differs from previous research [57]. In DEFE, only the result for AU4 was significant (0.01<p<0.05), and its coefficient was consistent with FACS, indicating that AU4 had significant predictive ability for anger; no other AUs showed significant results.
Overall, in terms of AU presence, AU4 (brow lowerer), AU5 (upper lid raiser), AU6 (cheek raiser), AU7 (lid tightener), AU12 (lip corner puller), and AU23 (lip tightener) showed significant differences between dynamic driving and static life scenarios. The presence of AU4, AU5, AU6, and AU12 was higher in static life scenarios, indicating that in dynamic driving scenarios these AUs were affected by the primary driving tasks, which suppress the facial expression of the driver's emotions. Meanwhile, the presence of AU7 and AU23 was higher in dynamic driving scenarios, which may be because Japanese culture suppresses the expression of negative emotions [60]. The logit regression results also show significant differences between dynamic driving and static life scenarios. For anger, only AU4 is significantly related to anger in dynamic driving scenarios, whereas in static life scenarios AU4, AU5, and AU23 are all significantly related to anger. For happiness, the logit regression results in dynamic driving scenarios show no significant correlation between AUs and happiness, whereas in static life scenarios AU12 is significantly related to happiness. These differences were most likely due to the primary driving tasks, which reduced the frequency and amplitude of facial muscle movements. Due to the limited amount of JAFFE data, these results may require further investigation.
VII Conclusion and Future Work
In this work, a dataset for the analysis of spontaneous driver emotions elicited by video-audio stimuli is presented. The dataset includes facial expression recordings of 60 participants during driving. After watching each of the three video-audio clips selected to elicit specific emotions, each participant completed the driving tasks in the same driving scenario and rated their emotional responses during the driving process in terms of dimensional and discrete emotion. These self-reported emotions include the scales of arousal, valence, and dominance, as well as emotion category and intensity. We selected the three video-audio clips using the SAM and DES scales, which ensured the effectiveness of these stimulus materials for the Chinese cultural background. In addition, we conducted classification experiments for the scales of arousal, valence, and dominance, as well as emotion category and intensity, to establish baseline results for the proposed dataset in terms of accuracy and F1 score; these results were significantly higher than random classification.
Moreover, we compared the classification results (accuracy and F1 score) of the DEFE dataset with the DEAP and CK+ datasets; the recognition results on the DEFE dataset were lower than those on the CK+ dataset. Furthermore, we examined the differences in facial expressions between driving and non-driving scenarios by comparing AU presence in the DEFE and JAFFE datasets. The results show significant differences in AU presence between driving and non-driving scenarios, and these differences affect facial emotion prediction, indicating that human emotional expressions in driving scenarios differ from those in other life scenarios. Therefore, publishing a human emotion dataset specifically for drivers is necessary for traffic safety improvement.
The DEFE dataset will be made publicly available after this work is published, allowing researchers to evaluate their algorithms on an off-the-shelf driver facial expression dataset and to investigate the possibility of applying them to practical applications. The DEFE dataset makes it possible to study emotion recognition from different emotion models simultaneously, and it can also be used to analyze the differences between driving and non-driving conditions. In addition, DEFE contains facial occlusions, such as glasses and hands, which increase the complexity of facial expression recognition and constitute a significant research challenge.
Acknowledgment
The authors would like to thank Peizhi Wang, Qianjing Hu, Mingqing Tang, Bingbing Zhang, Guanzhong Zeng and Mengna Liao for their assistance.
References
- [1] World Health Organization, Global status report on road safety 2015. World Health Organization, 2015.
- [2] L. James, Road rage and aggressive driving: Steering clear of highway warfare. Prometheus Books, 2000.
- [3] G. Li, W. Lai, X. Sui, X. Li, X. Qu, T. Zhang, and Y. Li, “Influence of traffic congestion on driver behavior in post-congestion driving,” Accident Analysis and Prevention, vol. 141, 2020.
- [4] G. Li, S. E. Li, R. Zou, Y. Liao, and B. Cheng, “Detection of road traffic participants using cost-effective arrayed ultrasonic sensors in low-speed traffic situations,” Mechanical Systems and Signal Processing, vol. 132, pp. 535–545, 2019.
- [5] F. Eyben, M. Wöllmer, T. Poitschke, B. Schuller, C. Blaschke, B. Färber, and N. Nguyen-Thien, “Emotion on the road—necessity, acceptance, and feasibility of affective computing in the car,” Advances in human-computer interaction, vol. 2010, 2010.
- [6] Z. Zeng, M. Pantic, G. I. Roisman, and T. S. Huang, “A survey of affect recognition methods: Audio, visual, and spontaneous expressions,” IEEE transactions on pattern analysis and machine intelligence, vol. 31, no. 1, pp. 39–58, 2008.
- [7] P. Ekman, W. V. Friesen, M. O’sullivan, A. Chan, I. Diacoyanni-Tarlatzis, K. Heider, R. Krause, W. A. LeCompte, T. Pitcairn, P. E. Ricci-Bitti, et al., “Universals and cultural differences in the judgments of facial expressions of emotion.,” Journal of personality and social psychology, vol. 53, no. 4, p. 712, 1987.
- [8] W. G. Parrott, Emotions in social psychology: Essential readings. Psychology Press, 2001.
- [9] P. Ekman and W. V. Friesen, “Constants across cultures in the face and emotion.,” Journal of personality and social psychology, vol. 17, no. 2, p. 124, 1971.
- [10] J. A. Russell, “A circumplex model of affect.,” Journal of personality and social psychology, vol. 39, no. 6, p. 1161, 1980.
- [11] A. Mehrabian, “Pleasure-arousal-dominance: A general framework for describing and measuring individual differences in temperament,” Current Psychology, vol. 14, no. 4, pp. 261–292, 1996.
- [12] J. J. Gross and R. W. Levenson, “Emotion elicitation using films,” Cognition & emotion, vol. 9, no. 1, pp. 87–108, 1995.
- [13] M. M. Bradley and P. J. Lang, “Measuring emotion: the self-assessment manikin and the semantic differential,” Journal of behavior therapy and experimental psychiatry, vol. 25, no. 1, pp. 49–59, 1994.
- [14] M. J. Lyons, S. Akamatsu, M. Kamachi, J. Gyoba, and J. Budynek, “The japanese female facial expression (jaffe) database,” in Proceedings of third international conference on automatic face and gesture recognition, pp. 14–16, 1998.
- [15] D. Lundqvist, A. Flykt, and A. Öhman, “The karolinska directed emotional faces (kdef),” CD ROM from Department of Clinical Neuroscience, Psychology section, Karolinska Institutet, vol. 91, no. 630, pp. 2–2, 1998.
- [16] M. Pantic, M. Valstar, R. Rademaker, and L. Maat, “Web-based database for facial expression analysis,” in 2005 IEEE international conference on multimedia and Expo, pp. 5–pp, IEEE, 2005.
- [17] L. Yin, X. Wei, Y. Sun, J. Wang, and M. J. Rosato, “A 3d facial expression database for facial behavior research,” in 7th international conference on automatic face and gesture recognition (FGR06), pp. 211–216, IEEE, 2006.
- [18] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker, “Multi-pie,” Image and Vision Computing, vol. 28, no. 5, pp. 807–813, 2010.
- [19] P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and I. Matthews, “The extended cohn-kanade dataset (ck+): A complete dataset for action unit and emotion-specified expression,” in 2010 ieee computer society conference on computer vision and pattern recognition-workshops, pp. 94–101, IEEE, 2010.
- [20] O. Langner, R. Dotsch, G. Bijlstra, D. H. Wigboldus, S. T. Hawk, and A. Van Knippenberg, “Presentation and validation of the radboud faces database,” Cognition and emotion, vol. 24, no. 8, pp. 1377–1388, 2010.
- [21] S. Koelstra, C. Muhl, M. Soleymani, J.-S. Lee, A. Yazdani, T. Ebrahimi, T. Pun, A. Nijholt, and I. Patras, “Deap: A database for emotion analysis; using physiological signals,” IEEE transactions on affective computing, vol. 3, no. 1, pp. 18–31, 2011.
- [22] I. Sneddon, M. McRorie, G. McKeown, and J. Hanratty, “The belfast induced natural emotion database,” IEEE Transactions on Affective Computing, vol. 3, no. 1, pp. 32–41, 2011.
- [23] S. M. Mavadati, M. H. Mahoor, K. Bartlett, P. Trinh, and J. F. Cohn, “Disfa: A spontaneous facial action intensity database,” IEEE Transactions on Affective Computing, vol. 4, no. 2, pp. 151–160, 2013.
- [24] F. Ringeval, A. Sonderegger, J. Sauer, and D. Lalanne, “Introducing the recola multimodal corpus of remote collaborative and affective interactions,” in 2013 10th IEEE international conference and workshops on automatic face and gesture recognition (FG), pp. 1–8, IEEE, 2013.
- [25] S. Du, Y. Tao, and A. M. Martinez, “Compound facial expressions of emotion,” Proceedings of the National Academy of Sciences, vol. 111, no. 15, pp. E1454–E1462, 2014.
- [26] X. Zhang, L. Yin, J. F. Cohn, S. Canavan, M. Reale, A. Horowitz, P. Liu, and J. M. Girard, “Bp4d-spontaneous: a high-resolution spontaneous 3d dynamic facial expression database,” Image and Vision Computing, vol. 32, no. 10, pp. 692–706, 2014.
- [27] S. Happy, P. Patnaik, A. Routray, and R. Guha, “The indian spontaneous expression database for emotion recognition,” IEEE Transactions on Affective Computing, vol. 8, no. 1, pp. 131–142, 2015.
- [28] E. Barsoum, C. Zhang, C. C. Ferrer, and Z. Zhang, “Training deep networks for facial expression recognition with crowd-sourced label distribution,” in Proceedings of the 18th ACM International Conference on Multimodal Interaction, pp. 279–283, 2016.
- [29] C. Fabian Benitez-Quiroz, R. Srinivasan, and A. M. Martinez, “Emotionet: An accurate, real-time algorithm for the automatic annotation of a million facial expressions in the wild,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5562–5570, 2016.
- [30] S. Zafeiriou, D. Kollias, M. A. Nicolaou, A. Papaioannou, G. Zhao, and I. Kotsia, “Aff-wild: Valence and arousal ‘in-the-wild’ challenge,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 34–41, 2017.
- [31] S. R. Livingstone and F. A. Russo, “The ryerson audio-visual database of emotional speech and song (ravdess): A dynamic, multimodal set of facial and vocal expressions in north american english,” PloS one, vol. 13, no. 5, 2018.
- [32] A. Mollahosseini, B. Hasani, and M. H. Mahoor, “Affectnet: A database for facial expression, valence, and arousal computing in the wild,” IEEE Transactions on Affective Computing, vol. 10, no. 1, pp. 18–31, 2017.
- [33] X. Wang, Y. Liu, F. Wang, J. Wang, L. Liu, and J. Wang, “Feature extraction and dynamic identification of drivers’ emotions,” Transportation research part F: traffic psychology and behaviour, vol. 62, pp. 175–191, 2019.
- [34] H. Gao, A. Yüce, and J.-P. Thiran, “Detecting emotional stress from facial expressions for driving safety,” in 2014 IEEE International Conference on Image Processing (ICIP), pp. 5961–5965, IEEE, 2014.
- [35] G. Li, S. E. Li, B. Cheng, and P. Green, “Estimation of driving style in naturalistic highway traffic using maneuver transition probabilities,” Transportation Research Part C: Emerging Technologies, vol. 74, pp. 113–125, 2017.
- [36] P. Wan, C. Wu, Y. Lin, and X. Ma, “On-road experimental study on driving anger identification model based on physiological features by roc curve analysis,” IET Intelligent Transport Systems, vol. 11, no. 5, pp. 290–298, 2017.
- [37] B. G. Lee, T. W. Chong, B. L. Lee, H. J. Park, Y. N. Kim, and B. Kim, “Wearable mobile-based emotional response-monitoring system for drivers,” IEEE Transactions on Human-Machine Systems, vol. 47, no. 5, pp. 636–649, 2017.
- [38] L. Malta, C. Miyajima, N. Kitaoka, and K. Takeda, “Analysis of real-world driver’s frustration,” IEEE Transactions on Intelligent Transportation Systems, vol. 12, no. 1, pp. 109–118, 2010.
- [39] C. Busso, Z. Deng, S. Yildirim, M. Bulut, C. M. Lee, A. Kazemzadeh, S. Lee, U. Neumann, and S. Narayanan, “Analysis of emotion recognition using facial expressions, speech and multimodal information,” in Proceedings of the 6th international conference on Multimodal interfaces, pp. 205–211, 2004.
- [40] L. Yang, I. O. Ertugrul, J. F. Cohn, Z. Hammal, D. Jiang, and H. Sahli, “Facs3d-net: 3d convolution based spatiotemporal representation for action unit detection,” in 2019 8th International Conference on Affective Computing and Intelligent Interaction (ACII), pp. 538–544, IEEE, 2019.
- [41] J. A. Groeger, Understanding driving: Applying cognitive psychology to a complex everyday task. Psychology Press, 2000.
- [42] G. Li, Y. Wang, F. Zhu, X. Sui, N. Wang, X. Qu, and P. Green, “Drivers’ visual scanning behavior at signalized and unsignalized intersections: A naturalistic driving study in china,” Journal of safety research, vol. 71, pp. 219–229, 2019.
- [43] T. Lajunen, D. Parker, and H. Summala, “The manchester driver behaviour questionnaire: a cross-cultural study,” Accident Analysis & Prevention, vol. 36, no. 2, pp. 231–238, 2004.
- [44] T. Brosch, K. R. Scherer, D. M. Grandjean, and D. Sander, “The impact of emotion on perception, attention, memory, and decision-making,” Swiss medical weekly, vol. 143, p. w13786, 2013.
- [45] P. J. Lang, M. M. Bradley, B. N. Cuthbert, et al., “International affective picture system (iaps): Technical manual and affective ratings,” NIMH Center for the Study of Emotion and Attention, vol. 1, pp. 39–58, 1997.
- [46] M. M. Bradley and P. J. Lang, “The international affective digitized sounds (IADS-2): Affective ratings of sounds and instruction manual,” University of Florida, Gainesville, FL, Tech. Rep. B-3, 2007.
- [47] A. Schaefer, F. Nils, X. Sanchez, and P. Philippot, “Assessing the effectiveness of a large database of emotion-eliciting films: A new tool for emotion researchers,” Cognition and Emotion, vol. 24, no. 7, pp. 1153–1172, 2010.
- [48] D. O. Bos et al., “Eeg-based emotion recognition,” The Influence of Visual and Auditory Stimuli, vol. 56, no. 3, pp. 1–17, 2006.
- [49] R. W. Levenson, L. L. Carstensen, W. V. Friesen, and P. Ekman, “Emotion, physiology, and expression in old age.,” Psychology and aging, vol. 6, no. 1, p. 28, 1991.
- [50] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, “Joint face detection and alignment using multitask cascaded convolutional networks,” IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1499–1503, 2016.
- [51] L. A. Jeni, J. F. Cohn, and F. De La Torre, “Facing imbalanced data–recommendations for the use of performance metrics,” in 2013 Humaine association conference on affective computing and intelligent interaction, pp. 245–251, IEEE, 2013.
- [52] O. Arriaga, M. Valdenegro-Toro, and P. Plöger, “Real-time convolutional neural networks for emotion and gender classification,” arXiv preprint arXiv:1710.07557, 2017.
- [53] C. Pramerdorfer and M. Kampel, “Facial expression recognition using convolutional neural networks: state of the art,” arXiv preprint arXiv:1612.02903, 2016.
- [54] S. Li and W. Deng, “Deep facial expression recognition: A survey,” arXiv preprint arXiv:1804.08348, 2018.
- [55] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
- [56] R. E. Jack, O. G. B. Garrod, H. Yu, R. Caldara, and P. G. Schyns, “Facial expressions of emotion are not culturally universal,” Proceedings of the National Academy of Sciences, vol. 109, no. 19, pp. 7241–7244, 2012.
- [57] L. F. Barrett, R. Adolphs, S. Marsella, A. M. Martinez, and S. D. Pollak, “Emotional expressions reconsidered: challenges to inferring emotion from human facial movements,” Psychological Science in the Public Interest, vol. 20, no. 1, pp. 1–68, 2019.
- [58] P. Ekman, W. V. Friesen, and J. C. Hager, Facial action coding system: the manual. Salt Lake City, Utah: Research Nexus, 2002.
- [59] T. Baltrušaitis, P. Robinson, and L.-P. Morency, “Openface: an open source facial behavior analysis toolkit,” in 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1–10, IEEE, 2016.
- [60] D. Matsumoto and P. Ekman, “American-japanese cultural differences in intensity ratings of facial expressions of emotion,” Motivation and Emotion, vol. 13, no. 2, pp. 143–157, 1989.
Wenbo Li received the B.S. and M.Sc. degrees in automotive engineering from Chongqing University, Chongqing, China, in 2014 and 2017, respectively. He is currently working toward the Ph.D. degree with the Advanced Manufacturing and Information Technology Laboratory, Department of Automotive Engineering, Chongqing University, Chongqing, China. He is also a visiting Ph.D. student at the Waterloo Cognitive Autonomous Driving (CogDrive) Lab at the University of Waterloo, Canada. His research interests include intelligent vehicles, human emotion, driver emotion detection, emotion regulation, human-machine interaction, and brain-computer interfaces.
Yaodong Cui received the B.S. degree in automation from Chang'an University, Xi'an, China, in 2017, and the M.Sc. degree in Systems, Control and Signal Processing from the University of Southampton, Southampton, UK, in 2019. He is currently working toward the Ph.D. degree with the Waterloo Cognitive Autonomous Driving (CogDrive) Lab, Department of Mechanical Engineering, University of Waterloo, Waterloo, Canada. His research interests include sensor fusion, perception for intelligent vehicles, and driver emotion detection.
Yintao Ma received the B.Sc. degree in Engineering Mechanics from the University of Illinois at Urbana-Champaign, USA, in 2018. She is currently working toward the M.Sc. degree with the Cognitive Autonomous Driving Laboratory, Department of Mechanical and Mechatronics Engineering, University of Waterloo, ON, Canada. Her research interests include machine learning, image processing, and facial expression recognition.
Xingxin Chen received the B.Sc. degree from Nanjing University, Nanjing, China, in 2018. He is a Master of Applied Science (MASc) student in the Waterloo Cognitive Autonomous Driving (CogDrive) Laboratory, Department of Mechanical and Mechatronics Engineering, University of Waterloo, Canada. His research interests include domain adaptation, transfer learning, and computer vision.
Guofa Li (M’18) received the Ph.D. degree in Mechanical Engineering from Tsinghua University, Beijing, China, in 2016. He is currently an Assistant Professor in mechanical engineering and automation with the College of Mechatronics and Control Engineering, Shenzhen University, Guangdong, China. His research interests include driving safety in autonomous vehicles, driver behavior and decision making, computer vision, machine learning, and human factors in automotive and transportation engineering. He is the recipient of the Young Elite Scientists Sponsorship Program by SAE-China (2018), the Excellent Young Engineer Innovation Award from SAE-China (2017), and the NSK Sino-Japan Outstanding Paper Prize from NSK Ltd. (2014).
Gang Guo received the B.S., M.S., and Ph.D. degrees in automotive engineering from Chongqing University, Chongqing, China, in 1982, 1984, and 1994, respectively. He is currently the Chair and a Professor at the Department of Automotive Engineering, Chongqing University. He also serves as the Associate Director of the Chongqing Automotive Collaborative Innovation Center. He has authored and co-authored over 100 refereed journal and conference publications. His research interests include intelligent vehicles, multi-sense perception, human-machine interaction, brain-computer interfaces, intelligent manufacturing, and user experience. Dr. Guo is a senior member of the China Mechanical Engineering Society and the Director of the China Automotive Engineering Society. He is also a member of the China User Experience Alliance Committee.
Dongpu Cao (M’08) received the Ph.D. degree from Concordia University, Canada, in 2008. He is the Canada Research Chair in Driver Cognition and Automated Driving, and currently an Associate Professor and Director of the Waterloo Cognitive Autonomous Driving (CogDrive) Lab at the University of Waterloo, Canada. His current research focuses on driver cognition, automated driving and cognitive autonomous driving. He has contributed more than 200 papers and 3 books. He received the SAE Arch T. Colwell Merit Award in 2012, and three Best Paper Awards from the ASME and IEEE conferences. Dr. Cao serves as an Associate Editor for IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY, IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, IEEE/ASME TRANSACTIONS ON MECHATRONICS, IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS, IEEE/CAA JOURNAL OF AUTOMATICA SINICA and ASME JOURNAL OF DYNAMIC SYSTEMS, MEASUREMENT AND CONTROL. He was a Guest Editor for VEHICLE SYSTEM DYNAMICS and IEEE TRANSACTIONS ON SMC: SYSTEMS. He serves on the SAE Vehicle Dynamics Standards Committee and acts as the Co-Chair of the IEEE ITSS Technical Committee on Cooperative Driving.