“I think you need help! Here’s why”: Understanding the Effect of Explanations on Automatic Facial Expression Recognition

1^st Sanjeev Nahulanthran Faculty of Information Technology
Monash University
Melbourne, Australia
[email protected] 2^nd Mor Vered Faculty of Information Technology
Monash University
Melbourne, Australia
[email protected] 3^rd Leimin Tian Faculty of Engineering
Monash University
& CSIRO Robotics
Melbourne, Australia
[email protected] 4^th Dana Kulić Faculty of Engineering
Monash University
Melbourne, Australia
[email protected]

Abstract

Facial expression recognition (FER) has emerged as a promising approach to the development of emotion-aware intelligent systems. The performance of FER in multiple domains is continuously being improved, especially through advancements in data-driven learning approaches. However, a key challenge remains in utilizing FER in real-world contexts, namely ensuring user understanding of these systems and establishing a suitable level of user trust towards this technology. We conducted an empirical user study to investigate how explanations of FER can improve trust, understanding and performance in a human-computer interaction task that uses FER to trigger helpful hints during a navigation game. Our results showed that users provided with explanations of the FER system demonstrated improved control in using the system to their advantage, leading to a significant improvement in their understanding of the system, reduced collisions in the navigation game, as well as increased trust towards the system.

Index Terms:

affective computing, explainable artificial intelligence, facial expression recognition, human-computer interaction

I Introduction

Facial expression recognition (FER) is a key challenge in affective computing and has shown potential to improve human-computer interaction (HCI) [12] Despite recent advancements, several challenges remain in developing accurate, reliable, and robust FER that can operate in real-world contexts. One major challenge is the lack of transparency to the end users, which is becoming increasingly important in the context of practical affective computing research and application [39]. A lack of transparency with regards to how an integrated FER system uses the data input to make predictions or decisions could affect a user’s perception and reliance towards the system [29]. A system that is not transparent could be perceived as ambiguous or not accountable. These situations should be avoided to allow a shared understanding between the user and the system [19] and avoid any mismatch of goals [1].

This study aims to develop and evaluate methods for improving an off-the-shelf FER model’s transparency and user understanding, thus allowing users to comprehend the performance of the FER system, allowing for a seamless and effective collaboration. Although there have been previous attempts to address inconsistencies in FER models [28, 10] and improve their performance from the developer’s perspective, to the best of our knowledge, there has been no existing work that focuses on facilitating end-users’ understanding on the predictions of FER systems. Towards this aim, we utilize eXplainable Artificial Intelligence (XAI) and investigate its potential benefits for systems employing FER from the end-user perspective of these technologies.

Specifically, we used an off-the-shelf machine learning model for FER, utilizing Support Vector Machines (SVMs) [3] to predict Facial Action Units (FAUs), which feeds into an expert system [36] to predict categorical emotions. We also use FAUs to generate explanations, which ensures consistency between the explanation and model’s decision, i.e., the features used to make the model decisions (based on FAUs) are directly used as explanations. We then developed an interactive game in which the FER system is used to trigger helpful hints to the user and compared performance, user understanding, number of hints triggered as well as accepted, and trust between cohorts who were provided with explanations about the FER system and those who were not. In the study (N=20), we compared providing explanations to the emotion-aware hint system with a control condition without explanations through a between-subject study. We then conducted human-grounded evaluation methods [13, 21] using a combination of quantitative in-game metrics and surveys as well as qualitative interviews to determine the effect of explanations on FER towards overall system understanding and transparency.

Our study addresses the research question: How do explanations on a FER model’s prediction affect a user’s performance in HCI, as well as their understanding and trust towards a HCI system’s assistance? Based on [42], our work has three main contributions:

•

Artifact: We develop a framework system which can be used to test different XAI methods on HCI tasks. The tool is modular and can be used with FER systems that output the 6-categorical emotion predictions on a variety of user developed tasks. This system is open-source and shall be provided for future FER explainability research;
•

Empirical: We demonstrate that users who are provided with explanations on FER have a better understanding and control of the system, leading to greater task performance. We also demonstrate that these users have a higher degree of trust towards the system;
•

Dataset: We contribute a dataset containing fine-grain interaction data and participant in-game metric, survey results and interview responses for both explanation and control conditions. We also provide open-source code of the game environment for replication.¹¹1Author contact for links to the dataset and code: [email protected]

II Related Work

We briefly review the various benefits and limitations of FER systems for HCI. We then review XAI and discuss its potential benefits in the context of FER and HCI.

II-A FER Systems

FER is a widely adopted, non-contact method for emotion recognition and has garnered growing research interest in the past few years [43, 16, 6]. FER has been shown to be naturalistic, unobtrusive, economical and easy to deploy and maintain using commercially available sensing systems [26, 22, 27, 35]. One commonly adopted FER method is the Facial Action Coding System (FACS) [14]. This method encodes human facial expressions as facial muscle activations associated with a person’s emotional state. Each FAU represents a visible facial muscle movement. We used this method as it provides a formal representation of facial expressions through muscle activations which are easy to explain and grounded in anatomy.

Literature in FER has focused on improving classification accuracy and rarely investigates the perception, reliance and trust of users [11]As the field continues to advance rapidly, it is crucial to provide a deeper understanding of machine learning models so that end users can have a greater sense of autonomy when using these systems [1]. Aside from enhancing the system’s intelligence and social acceptance, empowering users to make informed decisions and retain control is equally vital.

II-B XAI

Explanations are crucial for imparting knowledge and sharing experience among people [24]. It can be defined in multiple ways such as an assignment of causal responsibility [23] or simply as an answer to a “why-question” [18]. In the field of artificial intelligence (AI), explainability (also sometimes equated with interpretability) [13] helps to create a shared understanding [32] between end users and the AI system that is being used.

The two main approaches to explanation generation in XAI are intrinsic explanations and post-hoc explanations. Intrinsic explanations can be generated for white-box models, systems that inherently have some amount of interpretability “built-in” that accompanies the model’s output [34], whereas post-hoc methods are techniques applied “after” a model’s prediction is made [13]. In our research, we focus on intrinsic explanations for any FER model that utilizes FAUs as inputs to the expression recognition system. Specifically, we use SVMs [9] to perform FAU prediction that feeds into an expert system [36] for emotion prediction. We then generate human-readable explanations for the predicted emotions based on the activated FAUs to improve system interpretability (seen in Sec. IV-B).

Explanation can also vary depending on the type of insight provided to users. These are, namely, Global and Local explanations [13, 32]. Global explanations provide the user with general knowledge, such as the features a model uses to make a decision. Local explanations on the other hand provide knowledge about the specific prediction that was made by a model. As there has been no prior literature on which explanation type would be beneficial to end-users in the context of FER, in this research, we use both global and local explanations to provide a complete understanding of our system to end-users. More details on the global and local explanation design is provided in Sec. IV-B as it is embedded in the testing environment.

Miller [32] provides insights on what constitutes a good explanation, specifically that “probabilities and statistics don’t matter” and that “explanations are social”. We incorporate these insights by showing the user which FAUs lead to a prediction accompanied by natural language explanations. This allows the explanation to take the form of a conversation or interaction with the user.

Past studies of XAI found explanations to improve users’ understanding of AI systems [5, 8]. However, empirical results regarding its impact on users’ subjective experience such as trust [44, 41] and acceptance [17] are varied and particularly rare when it comes to the domain of FER.

II-C XAI in the context of FER

In the context of explanations for expression recognition, very few research publications have been produced. One example of how explanations of facial expressions can be beneficial is shown in [15], who used patient-clinician simultaneous fMRI scans synchronised with facial expression recordings in order to correlate active brain regions with FAUs associated with pain stimulus. The authors used SHAP [30], to determine the features most highly associated to pain and correlated them to studies that identified brain regions most associated with pain. While this work is relevant in understanding how pain is related to involuntary facial expressions, it may not generalize to other interaction scenarios. However, this work demonstrates that explanations can be utilised to extract context-specific knowledge necessary to better understand end-users.

Most relevant to our work are [38] and [10], in which the authors themselves received explanations (generated through Grad-CAM, SoftGrad, LIME and CEM methods) on emotions identified by CNN models and proceed to qualitatively analyse the explanations. The explanations highlight the features (superpixels) which contribute most to the predictions being made. The authors distinguish between the different XAI methods through qualitative comparison obtained by the authors themselves. However, how helpful these explanations are to an average, non-expert user remains unclear. Crucially, these studies do not empirically evaluate the perceptions of non-expert users and the usability of the explanations with regards to the emotions predicted and the underlying mechanisms of the system. Moreover, these studies do not evaluate the performance of FER models in a real world setting, nor did they measure perception and trust of the users towards the FER when exposed to explanations [39, 33]. Although there has been work in comparing some XAI techniques in the literature [10] with regards to FER, utilizing FAUs to explain facial expression in an empirical setting has not been carried out to the best of our knowledge and their current benefits or usability are unknown.

Various evaluation methods have been proposed to evaluate the effectiveness of XAI systems from the end users’ perspective [21]. These evaluations have not been conducted in the context of explanations for FER, which is the focus in this work. By moving experiments from a human-grounded approach to an application-grounded approach [13], we expect effective XAI systems can be developed to better suit the context of emotion recognition. In this work, we leverage these evaluation methods and construct an empirical user study to evaluate the need and effects of explanations on FER systems, thereby providing a foundation for future research on explaining FER systems to end-users.

III Hypotheses

The study was designed to determine the effect of explanations on user interaction with a FER System. We hypothesised that the users provided with explanations would have better understanding and control of the system, leading to higher hint acceptance and compliance. Through this greater control and performance, we expect that users will trust the system more when explanations are provided. Therefore, we investigated the following hypotheses:

1.

H1: A user will have better task performance when provided with explanations
2.

H2: A user will report and demonstrate better understanding of the system when provided with explanations
3.

H3: A user will report higher trust towards the system when provided with explanations

IV System Design

IV-A FER System and Hint Trigger Mechanism

We implemented a FER system that streams video data and detects faces using the face predictor implemented with the dlib library [25]. We then utilize OpenFace’s SVM library [3] to predict FAU activations from the detected facial region. FAU activations are fed into the HERCULES production rules expert system [36]. This expert system computes the confidence of the facial expression being associated with each of the “Big-6” emotion categories: happy, sad, angry, surprised, disgusted, fearful. If the confidence value of all emotions is below 50%, the system classifies that facial expression as neutral. Otherwise, the emotion with the highest confidence value is selected as the classification output. The system has an accuracy of 86.3% when tested against annotations by 5 certified FACS coders [36] and 89% on the benchmark Cohn-Kanade dataset [16]. The output of the emotion prediction is used by the Hint System as seen in Fig. 1.

Refer to caption — Figure 1: The Facial Expression Recognition System triggers hints based on a user’s facial expressions.

The Hint System stores the emotions inferred within the last 2 seconds and uses majority voting to determine when to trigger a hint for the user. If the majority of the emotions inferred within the 2 second window was negative emotions (anger, fear, sadness, disgust) or surprise, a hint is triggered.

The Hint System was tested in a separate pilot study with 20 participants who did not participate in the main study. It was conducted as a within-subject experiment to measure the usability of the hint mechanism. The pilot study tested the use of a manual hint button compared with the proposed automatic hint mechanism triggered by facial expressions. Each participant interacted with 3 versions of the system in random order – the autohint system using a 2-second timing window (decided based on previous research [31, 7, 37]), a manually triggered hint system and a system without any hints. A pairwise Wilcoxon signed-rank sum test was used (non-normally distributed data) to compare the means of the game score, number of collisions and hints triggered between each version. The results of the test indicated no significant difference (p $>$ 0.1) between the manual and autohint system. This suggests that the autohint system has similar usability to the manual hint system and therefore we adopted the autohint system in the main study.

IV-B Explanation Mechanism

IV-B1 Global explanations

As part of the global explanation, participants were shown a video at the start of the session, which provided examples of the emotions that trigger hints, as well as FAUs that contribute to the emotion prediction. This video was shown only once prior to the start of the game and was 5 minutes in length. The purpose of global explanations was to provide a high level understanding of what emotions help to trigger the hint function, how long the emotion should be shown for effective hint triggering and which FAUs were important for each emotion. This explanation aimed to answer the question ’How does the model generally work?’ corresponding to the definition of global explanations as defined in [20].

IV-B2 Local explanations

In the case of local explanations, participants had the opportunity to review why a certain emotion prediction was made during the Explorer Game. Participants were able to view these local explanations at any time they wished (with or without the hint system being triggered) through the use of buttons.

The purpose of local explanations was to help a user understand why a particular decision is made in a given situation [20]. In this instance, the facial action units (FAUs) used to determine the emotional expression were also provided as explanations to the predicted emotion. Example FAUs include: ‘Lips parted’, ‘Eyebrow raised’, ‘Nose wrinkled’. The FAUs were presented to the user both visually and textually as shown in Fig. 2. We elected to show explanations on-demand, as opposed to all the time, to avoid overloading the user with too much information [40].

V Methodology

We designed a between-subject study to investigate the effect of explanations on performance, trust and system utilization of a FER system in a navigation-based game. The system triggered helpful hints in the game based on recognized expressions. In addition to detailed objective metrics collected in-game, the experiment was followed by surveys to measure perceived trust and system usability. An overview of the system’s interaction with the user is shown in Fig. 3.

V-A Game Play

The task chosen for this study involves a multi-agent navigation game referred to as the Explorer Game (see Fig. 4a). The participant (blue dot) navigates in a grid-world with obstacles (grey boxes) and enemy agents (red dots). The aim of the game is to navigate to the end zone (green box) while making the least number of moves possible and also avoiding collision with enemy agents or obstacles. The participant begins with 30 fuel units and can move up, down, left and right (not diagonally) or skip their turn. The enemy agents’ moves are random and cannot be predicted by the participant (forcing the participant to place more reliance on the hint option which will be explained in the next section).

Each move the participant makes consumes 1 fuel unit. Colliding with an obstacle or the edges of the grid-world results in a penalty of 1 fuel unit, while colliding into an enemy agent results in a penalty of 3 fuel units. Each game instance had 16 obstacle boxes and 3 enemy agents. If the participant runs out of fuel, they lose that particular game trial and continue to the next. Winning a game involves getting the participant’s agent to the end zone before the fuel runs out. The amount of remaining fuel after completing all trials determines the participants’ score and was used to populate a leader-board which was shared among all participants. Each participant played 9 trials each with different maps configurations.

During the game, if the participant triggers the hint function through the mechanism explained in the previous section, the system will generate a pop-up window stating that the participant may need a hint and asking if the participant would like one (as seen in Fig. 4b (top)). If the participant selects “Yes”, we consider this as hint “acceptance”. If the participant selects “No”, the window will close and the participant continues to play the game. For participants who “accepted” the hint, they are then given information on the enemy agents’ next moves as well as a recommended move to make (as seen in Fig. 4b (bottom)). If the participant selects “Yes” to follow the recommended move, the system will automatically make that move for the participant. This is defined as hint “compliance”.

In addition to the Explorer Game, we designed a Test Game to evaluate the participants’ knowledge of the FER system. This gives us an opportunity to estimate each participant’s particular understanding of how the FER works, following their experience with the Explorer Game. In the Test Game, there are no enemy agents (red dots), obstacles (gray squares) or hints. The agent has a pre-planned set of moves used to get to the end zone and the keyboard is disabled. The participant can only move the agent by triggering the FER system with the same facial expressions used to trigger the hint system in the Explorer Game. The objective of the Test Game is to get as far as possible within a time limit of 1 minute. The farther the participant moved, the higher the score. This Test Game enabled us to objectively measure the users’ understanding of the FER system, through the score.

V-B Study Design

The study was designed as a between-subject study with 2 cohorts:

1.

AutoHint Participants were provided with hints only (no explanations). Participants were instructed prior to the experiment that video data will be used to detect emotions which, in turn, trigger the hint system. No further details were provided.
2.

XAutoHint Participants are exposed to global (mandatory) as well as local (optional) explanations of the FER system explained in Sec. IV-B.

V-C Experimental Procedure

Participants performed the experiment individually on a laptop in a private, well-lit room. An onboard camera was used to capture the video stream for processing. Participants used the onboard keyboard and a mouse to interface with the system. After a few practice runs, both cohorts proceeded to play the same set of maps for the Explorer Game with a randomized map order to control for any ordering effects. Participants in the XAutoHint cohort were able to view local explanations as described in the previous section. After playing the Explorer Game, participants proceeded to play the Test Game which aims to evaluate their understanding of the FER. During the games, the experimenter exited the testing room to ensure that participants could express themselves freely and to reduce experimenter bias. After both games were completed participants answered the the Trust Scale [21].

After the questionnaires were completed and prior to conclusion, participant exit interviews were conducted. Participants were asked to explain how the hint mechanism worked and which facial muscles or emotions were needed to trigger the hint mechanism in the exit interview.

Thematic analysis [4] was used to evaluate the interview responses and identify themes and patterns in the data. Following the thematic analysis method outlined by [4], the qualitative data from the interviews were analysed in depth to extract common themes. We identified themes using a data-driven/inductive approach. The themes were updated throughout the review of all interview data. We reported the identified themes and added relevant extracts from participants’ own ‘verbatim’ transcripts within an analytic narrative [2]. This narrative of the qualitative analysis is presented in Sec. VI-B.

V-D Participants

The study was conducted with a total of 20 participants, 10 participants in each cohort (5 M, 5 F). No significant differences were found between the cohorts in terms of age ( $U=44.5$ , $p=0.70$ ), gender ( $U=50$ , $p=0.97$ ), self-reported navigation capabilities ( $U=38$ , $p=0.38$ ) and self-reported emotional expressiveness/suppressiveness ( $U=42.5$ , $p=0.60$ ) using a Wilcoxon rank-sum test.

VI Results

VI-A Task Performance

The in-game performance data for the Explorer Game was analysed to evaluate H1 (A user will have better task performance when provided with explanations). Since the hints helped users avoid collision with enemy agents we compared the amount of collisions experienced by participants in each cohort. We found a statistically significant difference in the number of collisions ( $W=5434.5$ , $p<0.001$ ) between both cohorts with the XAutoHint cohort experiencing a significantly lower number of collisions (average collision of 0.32) compared to the AutoHint cohort (average collision of 0.99) as can be seen in Fig. 5 ²²2This significance held during post-analysis when outliers were removed ( $W=3967.5$ , $p-value<0.001$ ).). The size of the effect was medium with a Cohen’s d value of 0.681.

When analysing overall game performance as determined by the game scores (amount of fuel leftover at the end of the game) and completion rate we found no significant difference for the scores ( $W=3522$ , $p=0.13$ ) and completion rate ( $W=53$ , $p=0.84$ ) using the Wilcoxon rank-sum test. This, however, is not necessarily a reflection on the helpfulness of the hints since fewer collisions does not directly correlate to less fuel usage. The hints allowed participants to avoid collisions (which have a higher penalty), however the extra moves made to avoid collisions still costs fuel (lower penalty) and may have lowered the scores of participants in the XAutoHint group which resulted in no significance being detected for the overall scores or completion rate.

VI-B FER System Understanding

We next attempt to address H2 (A user will report and demonstrate better understanding of the system when provided with explanations). This can be ascertained by comparing the number of hints triggered during game execution between both cohorts. A higher number of triggered and accepted hints indicates a better understanding of which expressions can trigger the system and which cannot. The number of hints triggered in the Explorer Game is shown to be significantly higher in the XAutoHint cohort ( $W=1906.5$ , $p<0.001$ ) as can be seen in Fig. 6a ³³3This significance held during post-analysis when outliers were removed ( $W=1636.5$ , $p-value<0.001$ ).. The average number of hint triggers for the XAutoHint cohort was 2.76 whereas for the AutoHint cohort it was 0.66. The size of the effect was large with a Cohen’s d value of 1.03. We further evaluated the overall scores of the Test Game which is determined by the number of times the hint system was activated. When comparing between the AutoHint and XAutoHint cohorts, the scores for the Test Game are significantly higher for participants in the XAutoHint cohort ( $W=17.5$ , $p=0.015$ ) indicating more hints were triggered, as seen in Fig. 6b. The average score was 53.8 for the XAutoHint cohort and 26.7 for the AutoHint cohort. The size of the effect was large with a Cohen’s d value of 1.23.

Qualitatively, we compare the interview responses from participants explaining how the hint function works in their own words. For the AutoHint group, the themes found indicated that participants in this group have a ‘vague’ and sometimes ‘incorrect understanding’ of what emotions/expressions seem to trigger the hint mechanism. Although 4 out of 10 participants in this group mentioned that the “eyes getting smaller, not happy face, and squeeze eyebrows” (P1, P5, P13) seem to trigger the hint, 9 out of 10 of the participants also mentioned a mix of different thoughts that indicate incorrect understanding of how the hints work. For example, some mentioned that a “thinking, anxious, concerned or even confused face” (P3, P9, P13, P15, P17, P19) seemed to work in triggering the hints. For the participants in the XAutoHint group however, the understanding of the system is more ‘concise’ and ‘accurate‘. All 10 participants mentioned that the emotions that worked for them were “mainly negative emotions” (P6) such as “disgust, surprise, anger, fear” (P8, P10, P12, P14) whilst others went even further to mention facial movements that worked for them such as “raising/lowering eyebrows, tightening their mouth or opening it, and opening their eyes wider” (all XAutoHint participants).

VI-C User Trust

To evaluate H3 (A user will report higher trust towards the system when provided with explanations), we analysed perceived trust as obtained from the survey data of the Trust Scale [21]. The overall Trust Scale scores are significantly higher for the XAutoHint cohort ( $W=17$ , $p=0.014$ ). The size of the effect was large with a Cohen’s d value of 1.29.

A further indication of demonstrated trust towards the FER is obtained by analysing the number of hints “accepted”. To compare the hint acceptance (measured by Yes/No button clicks) we performed a Welch’s T-test, as the number of hints triggered were different for both groups and equal sample tests could not be carried out. The results indicate that participants demonstrated a significantly higher hint “acceptance” ( $t=-5.7162$ , $df=40.694$ , $p<0.001$ ) in the XAutoHint condition ⁴⁴4This significance held during post-analysis when outliers were removed ( $t=-7.7042$ , $df=32$ , $p<0.001$ ).. The analysis of the data supports H3, indicating that users have a higher trust towards the system when provided with explanations. We also note that participants in the XAutoHint condition had higher hint “compliance” ( $t=-2.3647$ , $df=19.77$ , $p=0.028$ ) ⁵⁵5This significance held during post-analysis when outliers were removed ( $t=-3.2929$ , $df=18$ , $p=0.004$ )..

VII Discussion

In this study we sought to understand how explanations of a Facial Expression Recognition (FER) used to provide assistance affect user perception, reliance and performance. We further investigated how the explanations improved user understanding and subsequent use of the facial expression triggered assistance system.

Explanation effect on FER model understanding and user control: To evaluate the influence of the explanations on user understanding (H2) and control (H1) of the FER model, we examined the number of times the hint function was triggered between both cohorts. We found that participants in the XAutoHint group had a significantly higher hint trigger rate when compared to participants in the AutoHint group, as seen in Fig. 6a. The higher hint trigger rate as well as hint “acceptance” shows that participants in the AutoHint group triggered the hints intentionally to aid their next move and avoid collisions (Fig. 5). The Test Game scores for participants in the XAutoHint group were also significantly higher when compared to participants in the AutoHint group (Fig. 6b). These results empirically show that when using an imperfect, static FER model, explanations improve user understanding and subsequent control of the model. As static models do not change, explanations can provide insights to the user on how to better utilize these models for their benefit and improve their usability in real world scenarios.

By further analysing the local explainer data, we found that when the local explainer was clicked, the hints “accepted” were much higher for the XAutoHint group as compared to the AutoHint group. In terms of the hint “compliance”, when the local explainer was clicked, there was no significant difference in the rates between the two groups. This indicates that local explanations had an effect on user trust towards the accuracy of the hint triggering mechanism, however did not effect user trust towards the quality of the move recommendations. This is a particularly interesting finding as it illustrates the possible different effects that global and local explanations have towards a user’s understanding, control and trust of a FER model and the HCI system adopting it, and thus should be further investigated.

Explanation effect on user trust: When considering explanation influence on user trust (H3), we analyse the perceived trust through subjective surveys as well as demonstrated trust through objective in-game metrics. The purpose of analysing both demonstrated trust and perceived trust is to determine if the practical action taken by the user is the same as the user’s perceived trust, which may not always align [33]. The results of both these analyses show that the XAutoHint group exhibited a higher degree of trust compared to the AutoHint group. This supports the claim that explanations can engender greater trust in a system through understanding how the model functions. Introducing explanations is useful to promoting the user adoption of FER-based applications to their advantage. However, care must be taken to reduce over or under reliance or trust [33]. As the data captured on the hint compliance shows, the XAutoHint group complied with the recommended next move to be taken more often. This could be attributed to the fact that hints were provided in a timely manner when required. Therefore, participants were more likely to follow the recommended move in order to avoid collisions. For the participants in the AutoHint cohort, the hints were not provided at the right time due to a lack of user understanding as to how to trigger hints through the relevant FAUs. We hypothesize that this is the reason participants were less likely to follow the recommended move. This results in user trust in the overall system to reduce. Future research should be carried out to ensure that the explanation mechanism engenders an appropriate level of trust in the FER.

VIII Limitations and Future Work

Our study should be viewed in light of the following limitations. The sample size for the study (N=20) was relatively small and therefore certain effects may not be apparent. However, we still noted significant differences for the number of collisions, the number of hint triggers and the perceived and demonstrated trust between the two cohorts. In future, we intend to conduct validation studies with increased sample size while maintaining an appropriate balance of demographics between different cohorts. DL methods have achieved improved performance in FER compared to the SVM model adopted in this work. Using FAUs as a form of explanation can be combined with these models. It is worth looking into replacing the SVM component to allow the interpretability method to be scaled to DL methods. And finally, we note that having users interact concurrently with the Local Explainer and with the task might affect user behaviour, potentially biasing participants’ decisions. While necessary measures were taken to minimise this influence, such as pausing the timer while participants interacted with the Local Explainer and showing explanations on-demand, in future we plan to implement a similar review interface for both cohorts to further reduce its influence including basic game information without displaying FER explanations.

IX Conclusion

We proposed a novel explanation system for facial expressions recognition (FER) and investigated how explanations affect a user’s task performance, system understanding, and trust when using a human-computer interaction (HCI) system that utilises FER to automatically trigger helpful hints. Through an improved understanding (H2) of the system, participants in our user study showed better task performance (H1) in two screen-based navigation games. This indicates that users provided with explanations have better control of the system compared to those who did not receive explanations. The survey and in-game data also suggests that users provided with explanations have a higher degree of perceived and demonstrated trust towards the system (H3). Our results indicate that explanations on a FER system help to engender a greater understanding and trust by the users towards an emotion-aware HCI system.

Ethical Impact Statement

The experimental protocols were reviewed and approved by the Monash University Human Ethics Review Committee before participant recruitment and user studies (project ID 37086). Full participant consent was obtained and all data in non-identifiable. The majority of the participants were recruited from our University therefore may be biased towards a younger age with higher education background. We did not collect cultural background of participants.

References

[1] Ashraf Abdul, Jo Vermeulen, Danding Wang, Brian Y Lim, and Mohan Kankanhalli. Trends and trajectories for explainable, accountable and intelligible systems: An hci research agenda. In Proceedings of the 2018 CHI conference on human factors in computing systems, pages 1–18, 2018.
[2] Anne Adams, Peter Lunt, and Paul Cairns. A qualititative approach to hci research. 2008.
[3] Tadas Baltrusaitis, Amir Zadeh, Yao Chong Lim, and Louis-Philippe Morency. Openface 2.0: Facial behavior analysis toolkit. In 2018 13th IEEE international conference on automatic face & gesture recognition (FG 2018), pages 59–66. IEEE, 2018.
[4] Virginia Braun and Victoria Clarke. Using thematic analysis in psychology. Qualitative research in psychology, 3(2):77–101, 2006.
[5] Zana Buçinca, Phoebe Lin, Krzysztof Z Gajos, and Elena L Glassman. Proxy tasks and subjective measures can be misleading in evaluating explainable ai systems. In Proceedings of the 25th international conference on intelligent user interfaces, pages 454–464, 2020.
[6] Giovanna Castellano, Berardina De Carolis, and Nicola Macchiarulo. Automatic facial emotion recognition at the covid-19 pandemic time. Multimedia Tools and Applications, 82(9):12751–12769, 2023.
[7] Daniel Cernea, Achim Ebert, and Andreas Kerren. A study of emotion-triggered adaptation methods for interactive visualization. In UMAP Workshops, 2013.
[8] Hao-Fei Cheng, Ruotong Wang, Zheng Zhang, Fiona O’connell, Terrance Gray, F Maxwell Harper, and Haiyi Zhu. Explaining decision-making algorithms through ui: Strategies to help non-expert stakeholders. In Proceedings of the 2019 chi conference on human factors in computing systems, pages 1–12, 2019.
[9] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learning, 20:273–297, 1995.
[10] Guillermo del Castillo Torres, Maria Francesca Roig-Maimó, Miquel Mascaró-Oliver, Esperança Amengual-Alcover, and Ramon Mas-Sansó. Understanding how cnns recognize facial expressions: A case study with lime and cem. Sensors, 23(1):131, 2022.
[11] Laurence Devillers. Human–robot interactions and affective computing: The ethical implications. Robotics, AI, and Humanity: Science, Ethics, and Policy, pages 205–211, 2021.
[12] Fadi Dornaika and Bogdan Raducanu. Facial expression recognition for hci applications. In Encyclopedia of Artificial Intelligence, pages 625–631. IGI Global, 2009.
[13] Finale Doshi-Velez and Been Kim. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 2017.
[14] Paul Ekman and Wallace V Friesen. Facial action coding system. Environmental Psychology & Nonverbal Behavior, 1978.
[15] Dan-Mikael Ellingsen, Andrea Duggento, Kylie Isenburg, Changjin Jung, Jeungchan Lee, Jessica Gerber, Ishtiaq Mawla, Roberta Sclocco, Robert R Edwards, John M Kelley, et al. Patient–clinician brain concordance underlies causal dynamics in nonverbal communication and negative affective expressivity. Translational Psychiatry, 12(1):44, 2022.
[16] Beat Fasel and Juergen Luettin. Automatic facial expression analysis: a survey. Pattern recognition, 36(1):259–275, 2003.
[17] Maximilian Förster, Mathias Klier, Kilian Kluge, and Irina Sigler. Fostering human agency: A process for the design of user-centric xai systems. 2020.
[18] Mara Graziani, Lidia Dutkiewicz, Davide Calvaresi, José Pereira Amorim, Katerina Yordanova, Mor Vered, Rahul Nair, Pedro Henriques Abreu, Tobias Blanke, Valeria Pulignano, et al. A global taxonomy of interpretable ai: unifying the terminology for the technical and social sciences. Artificial intelligence review, 56(4):3473–3504, 2023.
[19] Shirley Gregor and Izak Benbasat. Explanations from intelligent systems: Theoretical foundations and implications for practice. MIS quarterly, pages 497–530, 1999.
[20] Robert Hoffman, Tim Miller, Shane T Mueller, Gary Klein, and William J Clancey. Explaining explanation, part 4: a deep dive on deep nets. IEEE Intelligent Systems, 33(3):87–95, 2018.
[21] Robert R Hoffman, Shane T Mueller, Gary Klein, and Jordan Litman. Metrics for explainable ai: Challenges and prospects. arXiv preprint arXiv:1812.04608, 2018.
[22] Maryam Imani and Gholam Ali Montazer. A survey of emotion recognition methods with emphasis on e-learning environments. Journal of Network and Computer Applications, 147:102423, 2019.
[23] Daniel Kahneman. Thinking, fast and slow. macmillan, 2011.
[24] Frank C Keil. Explanation and understanding. Annu. Rev. Psychol., 57:227–254, 2006.
[25] Davis E. King. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, 10:1755–1758, 2009.
[26] Abhiram Kolli, Alireza Fasih, Fadi Al Machot, and Kyandoghere Kyamakya. Non-intrusive car driver’s emotion recognition using thermal camera. In Proceedings of the Joint INDS’11 & ISTET’11, pages 1–5. IEEE, 2011.
[27] Louisa Kulke, Dennis Feyerabend, and Annekathrin Schacht. A comparison of the affectiva imotions facial expression analysis software with emg for identifying facial expressions of emotion. Frontiers in Psychology, 11:329, 2020.
[28] Agnieszka Landowska, Teresa Zawadzka, and Michał Zawadzki. Mining inconsistent emotion recognition results with the multidimensional model. IEEE Access, 10:6737–6759, 2021.
[29] Pantelis Linardatos, Vasilis Papastefanopoulos, and Sotiris Kotsiantis. Explainable ai: A review of machine learning interpretability methods. Entropy, 23(1):18, 2020.
[30] Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. Advances in neural information processing systems, 30, 2017.
[31] Emily McQuillin, Nikhil Churamani, and Hatice Gunes. Learning socially appropriate robo-waiter behaviours through real-time user feedback. In 2022 17th ACM/IEEE International Conference on Human-Robot Interaction (HRI), pages 541–550. IEEE, 2022.
[32] Tim Miller. Explanation in artificial intelligence: Insights from the social sciences. Artificial intelligence, 267:1–38, 2019.
[33] Tim Miller. Are we measuring trust correctly in explainability, interpretability, and transparency research? arXiv preprint arXiv:2209.00651, 2022.
[34] Dang Minh, H Xiang Wang, Y Fen Li, and Tan N Nguyen. Explainable artificial intelligence: a comprehensive review. Artificial Intelligence Review, pages 1–66, 2022.
[35] Gregory Mone. Sensing emotions. Communications of the ACM, 58(9):15–16, 2015.
[36] Maja Pantic and Leon JM Rothkrantz. Expert system for automatic analysis of facial expressions. Image and Vision Computing, 18(11):881–905, 2000.
[37] Prateek Panwar, Adam Bradley, and Christopher Collins. Providing contextual assistance in response to frustration in visual analytics tasks. In 2018 IEEE Workshop on Machine Learning from User Interaction for Visualization and Analytics (MLUI), pages 1–7. IEEE, 2018.
[38] Manish Rathod, Chirag Dalvi, Kulveen Kaur, Shruti Patil, Shilpa Gite, Pooja Kamat, Ketan Kotecha, Ajith Abraham, and Lubna Abdelkareim Gabralla. Kids’ emotion recognition using various deep-learning models with explainable ai. Sensors, 22(20):8066, 2022.
[39] Leimin Tian, Sharon Oviatt, Michal Muszynski, Brent C. Chamberlain, Jennifer Healey, and Akane Sano. Applied Affective Computing, volume 41. Association for Computing Machinery, New York, NY, USA, 1 edition, 2022.
[40] Mor Vered, Piers Howe, Tim Miller, Liz Sonenberg, and Eduardo Velloso. Demand-driven transparency for monitoring intelligent agents. IEEE Transactions on Human-Machine Systems, 50(3):264–275, 2020.
[41] Mor Vered, Tali Livni, Piers Douglas Lionel Howe, Tim Miller, and Liz Sonenberg. The effects of explanations on automation bias. Artificial Intelligence, page 103952, 2023.
[42] Jacob O Wobbrock. Seven research contributions in hci. studies, 1(1):52–80, 2012.
[43] Zhihong Zeng, Maja Pantic, Glenn I Roisman, and Thomas S Huang. A survey of affect recognition methods: audio, visual and spontaneous expressions. In Proceedings of the 9th international conference on Multimodal interfaces, pages 126–133, 2007.
[44] Yunfeng Zhang, Q Vera Liao, and Rachel KE Bellamy. Effect of confidence and explanation on accuracy and trust calibration in ai-assisted decision making. In Proceedings of the 2020 conference on fairness, accountability, and transparency, pages 295–305, 2020.