
EVOKE: Emotion Enabled Virtual Avatar Mapping Using Optimized Knowledge Distillation
Authors with equal contributions.

Maryam Nadeem1, Raza Imam1,3, Rouqaiah Al-Refai1,3, Meriem Chkir1,
Mohamad Hoda1,2 and Abdulmotaleb El Saddik1,2 {maryam.nadeem, raza.imam, rouqaiah.al-refai, meriem.chkir, mohamad.hoda, a.elsaddik}@mbzuai.ac.ae
1Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI)
2University of Ottawa
Abstract

As virtual environments continue to advance, the demand for immersive and emotionally engaging experiences has grown. Addressing this demand, we introduce Emotion enabled Virtual avatar mapping using Optimized KnowledgE distillation (EVOKE), a lightweight emotion recognition framework designed for the seamless integration of emotion recognition into 3D avatars within virtual environments. Our approach leverages knowledge distillation for multi-label classification on the publicly available DEAP dataset, which covers valence, arousal, and dominance as primary emotional classes. Remarkably, our distilled model, a CNN with only two convolutional layers and 18 times fewer parameters than the teacher model, achieves competitive results, reaching an accuracy of 87% while requiring far fewer computational resources. This balance between performance and deployability positions our framework as an ideal choice for virtual environment systems. Furthermore, the multi-label classification outcomes are used to map emotions onto custom-designed 3D avatars.

Index Terms:
emotion recognition, EEG signals, 3D avatars, knowledge distillation, wellbeing

I Introduction

Nowadays, there is a growing need for better human-computer interaction, especially for expressing emotions in virtual environments [1]. Emotions can be recognized from text, speech, facial expressions, gestures, and physiological signals [2]. EEG-based emotion recognition has evolved from defining small, fixed sets of emotions to recognizing a wide range of them [3, 4, 5]. Mapping these emotions to avatar expressions in virtual environments can enhance the user experience and facilitate human-computer interaction in applications such as healthcare and education, leading to advanced digital connectivity. Different frameworks have been introduced for emotion recognition using EEG signals, ranging from traditional machine learning algorithms to more complex deep learning approaches [6, 7]. However, deploying these approaches in real time is challenging because of their large number of parameters, resource-constrained environments, and the computational power needed to run them. In applications where fast processing is needed, a lightweight and accelerated approach becomes a priority. Our primary objective is to bridge the gap between emotion recognition and digital avatars while focusing on their smooth integration into immersive virtual environments. To achieve this, we propose a knowledge distillation-based framework for multi-label classification that transfers features from a comparatively larger and more complex model (teacher model) to a smaller, more efficient one (student model). The distilled student model retains much of the teacher's accuracy but operates with significantly reduced computational resources [8], making it highly suitable for deployment in resource-constrained environments and real-time applications, as illustrated in Fig. 1.

Figure 1: Inference speed of the teacher and student models. The distilled student model can be deployed for real-time applications, as the smaller model has a lower inference time.

EVOKE finds meaningful application in the realm of healthcare and emotional well-being. It enables the development of recognition systems that can benefit patients and therapists by establishing a baseline of a patient’s typical emotional states through extended EEG data collection, serving as a valuable health indicator. Moreover, visual representations of emotions via avatars enhance communication between therapists and patients, allowing patients to express their emotional experiences more effectively, thereby improving therapy outcomes. This approach also opens doors to virtual therapy and support groups within virtual environments, offering increased accessibility to emotional well-being, particularly for individuals with social anxiety or physical limitations.

The major contributions of this paper are as follows:

  • We propose a lightweight distilled model for EEG-based emotion recognition, called EVOKE, which reduces the number of parameters by a factor of 18 compared to the originally trained teacher model while maintaining comparable performance.

  • This work introduces a combination of Binary Cross-Entropy with Logits loss and knowledge distillation for multi-label classification which, to the best of our knowledge, has not been explored previously for this task.

  • We provide personalized mapping of the multi-label classification outputs to custom-made 3D avatars that are ready to be deployed in any virtual environment.

II Related Works

EEG and Emotion Recognition. Research on EEG-based emotion recognition algorithms has grown rapidly over the past few decades. In 2009, a hierarchical binary classification approach was tested for EEG-based emotion classification [9]. EEG emotion recognition has shown improved performance with deep learning techniques. For instance, in a recent work, Xiao et al. [10] introduced the four-dimensional attention-based neural network (4D-aNN), which transforms raw EEG signals into 4D spatial-spectral-temporal representations and uses attention mechanisms to assign weights to brain regions and frequency bands. Liu et al. [11] introduced a three-dimensional convolutional attention neural network, 3DCANN, which extracts dynamic relationships among multi-channel EEG signals over time and fuses spatio-temporal features with attention weights, outperforming existing models. Yang et al. [12] introduced a 3D representation of EEG segments that amalgamates features from various frequency bands while retaining spatial information among channels, processed by a continuous CNN. Other work highlights attention mechanisms in EEG signal analysis using vision transformer-based methods [13].

Emotion Recognition and Digital Avatars. There are two theoretical models of emotion. The first, known as the discrete model, consists of a set of six basic emotions initially introduced by Ekman et al. [14] and later expanded to 15 emotions. In contrast, the dimensional model expresses a wide range of emotional states in two or three dimensions; in such a multi-dimensional space, emotions are described by multiple fundamental features. According to Russell [15], emotions are mapped to two dimensions, valence and arousal. While the 2D model can span many emotions, it struggles when valence and arousal have the same levels. To address this, Russell and Mehrabian [16, 17] introduced dominance as a third dimension for differentiation. These emotion schemes were used mainly for developing emotion recognition datasets, but they can also be used to connect the research with real-world applications. For instance, in 2010, a fractal dimension-based algorithm for real-time EEG-based emotion recognition that visualizes emotions through Haptek avatars was introduced [18]. While this mapping of emotions to digital avatars has its applications, such avatars lack realism, expressiveness, and adaptability to virtual environments.

Figure 2: Our proposed framework, EVOKE. (1) Raw EEG signals are input to the framework and first undergo preprocessing and label thresholding. (2) Preprocessing includes feature extraction using differential entropy over four frequency bands, followed by noise reduction, resulting in a 3D grid input. (3) The features are passed to the teacher CCNN, which generates soft labels with temperature T > 1; these are used to compute the Binary Cross-Entropy with logits loss (\mathcal{L}_{1}). (4) The final distillation loss (\mathcal{L}_{distill}) is a weighted combination of the soft target loss (\mathcal{L}_{1}) and the hard-label loss (\mathcal{L}_{2}) (refer to Eq. 1). (5) The distilled model, trained on the soft predictions of the teacher model, is then (6) deployed for fast inference and real-time applications.

III Methodology

III-A Preprocessing

III-A1 Signal

In the preprocessing of EEG signals, several steps were conducted to enhance data quality. Initially, the original data was downsampled to 128 Hz, and artifacts from eye movements (EOG) were removed following established procedures [19]. A bandpass frequency filter spanning 4.0–45.0 Hz was then applied to isolate relevant EEG frequencies. The data was subsequently averaged to establish a common reference baseline. To ensure consistent channel ordering, the EEG channels were rearranged according to the prescribed Geneva order. The data was then segmented into 60-second trials, with the preceding 3-second pre-trial baseline removed (see Fig. 2 (2)). Furthermore, to enhance data coherence and relevance, the trial sequences were reorganized to align with experiment IDs. To extract relevant features from the raw EEG signals, differential entropy (DE), denoted h(X), was employed to construct features in four frequency bands (alpha: 8–14 Hz, beta: 14–31 Hz, gamma: 31–49 Hz, theta: 4–8 Hz) [20]. Assuming a signal X follows a Gaussian distribution \mathcal{N}(\mu, \sigma^{2}) with density f(x), then

h(X) = -\int_{-\infty}^{\infty} f(x) \log(f(x)) \, dx

where f(x) = \frac{1}{\sqrt{2\pi\sigma^{2}}} \exp\!\left(-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right). Further, we removed the baseline signal and noise interference not associated with the emotional stimulus. To retain spatial information and ensure model compatibility, the signals were transformed into a grid-like 3D representation aligned with the positions of the electrodes.
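To make the feature-extraction step concrete, the following is a minimal sketch of per-band differential entropy computation under the Gaussian assumption above. The band edges and the 128 Hz sampling rate follow the text; the fourth-order Butterworth filter and the helper names (band_de_features, differential_entropy) are illustrative assumptions rather than the authors' exact pipeline.

```python
import numpy as np
from scipy.signal import butter, filtfilt

# Frequency bands used in the paper (Hz)
BANDS = {"theta": (4, 8), "alpha": (8, 14), "beta": (14, 31), "gamma": (31, 49)}
FS = 128  # sampling rate after downsampling (Hz)

def differential_entropy(segment):
    """Closed-form DE of a Gaussian signal: 0.5 * ln(2 * pi * e * sigma^2)."""
    return 0.5 * np.log(2 * np.pi * np.e * np.var(segment))

def band_de_features(eeg, fs=FS):
    """eeg: (n_channels, n_samples) array for one trial window.
    Returns a (4, n_channels) array of DE features, one row per band."""
    feats = []
    for lo, hi in BANDS.values():
        b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band")
        filtered = filtfilt(b, a, eeg, axis=-1)
        feats.append([differential_entropy(ch) for ch in filtered])
    return np.asarray(feats)

# Example: one 1-second window of 32-channel EEG
window = np.random.default_rng(0).standard_normal((32, FS))
de = band_de_features(window)  # shape (4, 32)
# The 32 per-band values would then be scattered onto a 9x9 grid according to
# the electrode layout, giving the 4x9x9 grid input used by the models.
```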

III-A2 Label

We applied a binary transformation to the emotional ratings using a consistent threshold of 5.0 for all categories (see Fig. 2 (1)), including valence, arousal, and dominance. Labels are set to 1 if the rating exceeds the threshold and to 0 otherwise. This step streamlines the classification process and enables the subsequent mapping of the classification outputs to eight distinct emotions.
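As a small illustration, the label binarization can be expressed in a few lines; the 5.0 threshold follows the text, while the array layout is an assumption.

```python
import numpy as np

THRESHOLD = 5.0

def binarize_labels(ratings):
    """ratings: (n_trials, 3) array of valence, arousal, dominance ratings (1-9 scale).
    Returns a (n_trials, 3) binary label matrix."""
    return (np.asarray(ratings) > THRESHOLD).astype(np.float32)

print(binarize_labels([[7.1, 3.2, 6.0], [4.5, 8.0, 2.1]]))
# [[1. 0. 1.]
#  [0. 1. 0.]]
```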

III-B Knowledge-Distillation

To develop a readily deployable model, we incorporated knowledge distillation into the training process, following the foundational methodology introduced by Hinton et al. [21] with certain adaptations for our specific task.

Understanding the similarity between instances is crucial to comprehending the knowledge acquired by neural networks, especially in the context of multi-label classification of EEG signals into valence, arousal, and dominance. This is vital because gaining insight into how EEG patterns corresponding to these emotions interrelate can significantly enhance system accuracy. The standard sigmoid activation function \sigma(x) saturates quickly, effectively producing a hard binary decision that maps most inputs to values near 0 or 1, with a sharp transition around x = 0. To mitigate this, distillation incorporates a temperature parameter T into the sigmoid activation g, which softens the output:

g(x) = \frac{1}{1 + \exp(-x/T)}, \qquad \lim_{T \to 0^{+}} g(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x < 0 \end{cases}

When T is set to 1, it corresponds to the standard sigmoid operation. This temperature-based operation is represented as:

q_{i} = \frac{1}{1 + \exp(-z_{i}/T)}, \qquad p_{i} = \frac{1}{1 + \exp(-v_{i}/T)}

Here, z_i represents the logit for each class, and q_i transforms the teacher logits into probabilities, with T controlling the level of smoothness. During knowledge distillation, the teacher imparts its knowledge in the form of soft targets, calculated using this modified sigmoid with T > 1. If v_i represents the logits produced by the student model, then the student probabilities are represented as p_i.
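A minimal PyTorch sketch of this temperature-scaled sigmoid, used to soften both teacher and student logits; the function name and tensor shapes are illustrative, and the default T = 1.25 is the best value reported in Section IV.

```python
import torch

def soft_probs(logits, T=1.25):
    """Temperature-scaled sigmoid: sigma(logits / T).
    T > 1 softens the probabilities; T = 1 recovers the standard sigmoid."""
    return torch.sigmoid(logits / T)

teacher_logits = torch.tensor([[2.3, -0.7, 1.1]])  # one sample, 3 classes (V, A, D)
student_logits = torch.tensor([[1.8, -0.2, 0.9]])
q = soft_probs(teacher_logits)  # teacher soft targets q_i
p = soft_probs(student_logits)  # student soft predictions p_i
```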

The knowledge distillation process combines two distinct objective functions, \mathcal{L}_{1} and \mathcal{L}_{2}. The first objective function, \mathcal{L}_{1}, referred to as the soft target loss, captures the knowledge in the soft targets provided by the teacher model. In our case, it calculates the Binary Cross-Entropy (BCE) with logits loss between the teacher's soft targets (q_i) and the student's predictions (p_i), scaled by the square of the temperature (T). If N denotes the number of samples in the dataset and C denotes the number of emotion classes, then

\mathcal{L}_{1} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{C} \left[ q_{i,j} \log(p_{i,j}) + (1 - q_{i,j}) \log(1 - p_{i,j}) \right] \cdot T^{2}

The second objective function, \mathcal{L}_{2}, aligns the student's predictions (p_i) with the true labels (Y_i):

\mathcal{L}_{2} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{C} \left[ Y_{i,j} \log(p_{i,j}) + (1 - Y_{i,j}) \log(1 - p_{i,j}) \right]

The aim is to minimize the student model's distillation loss, \mathcal{L}_{distill}, which combines \mathcal{L}_{1} and \mathcal{L}_{2} with a weighting factor \alpha:

\mathcal{L}_{distill} = \alpha \mathcal{L}_{1} + (1 - \alpha) \mathcal{L}_{2} (1)

This combined loss ensures that the student model effectively captures the knowledge distilled from the teacher model while maintaining alignment with the true labels. Fig. 2 (3), (4), and (5) depict the process discussed above.
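The combined objective can be implemented directly with PyTorch's BCE-with-logits loss; the sketch below is a minimal, illustrative version of Eq. 1, with the defaults T = 1.25 and α = 0.25 taken from the best values reported in Section IV, and the function name and batching chosen for illustration.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, true_labels, T=1.25, alpha=0.25):
    """L_distill = alpha * L1 (soft targets, scaled by T^2) + (1 - alpha) * L2 (true labels)."""
    # Teacher soft targets q_i = sigmoid(z_i / T); the teacher is kept frozen
    q = torch.sigmoid(teacher_logits.detach() / T)
    # L1: BCE-with-logits between the softened student logits and the teacher soft targets
    l1 = F.binary_cross_entropy_with_logits(student_logits / T, q) * (T ** 2)
    # L2: BCE-with-logits between the student logits and the binary ground-truth labels
    l2 = F.binary_cross_entropy_with_logits(student_logits, true_labels)
    return alpha * l1 + (1 - alpha) * l2

# Example with a batch of 4 samples and 3 labels (valence, arousal, dominance)
s = torch.randn(4, 3, requires_grad=True)
t = torch.randn(4, 3)
y = torch.randint(0, 2, (4, 3)).float()
distillation_loss(s, t, y).backward()
```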

TABLE I: Comparison of the evaluated models in terms of computational properties and performance under Differential Entropy (DE) and Power Spectral Density (PSD) features.

Model          | #Layers | #Params  | FLOPs    | Weight   | Inference (ms) | Throughput | DE F1 | DE Acc (%) | PSD F1 | PSD Acc (%)
ViT [22]       | 140     | 85.64M   | 16.86G   | 321.56MB | 6.96           | 277.25     | 0.80  | 79.74      | 0.74   | 75.21
Arjun ViT [13] | 41      | 144.356K | 767.592K | 587.35KB | 1.6691         | 20110.31   | 0.65  | 62.24      | 0.77   | 77.78
CCNN [12]      | 12      | 6.235M   | 79.961M  | 23.79MB  | 0.6414         | 24572.37   | 0.92  | 93.23      | 0.46   | 61.11
EVOKE (Ours)   | 8       | 353.363K | 1.991M   | 1.35MB   | 0.3300         | 80176.24   | 0.88  | 87.62      | -      | -

III-C Model Architectures

III-C1 Teacher

We employed the Continuous Convolutional Neural Network (CCNN) [12] architecture as the teacher model for distillation, which consists of four convolutional layers with no pooling layers between them. The first three convolutional layers use a 4x4 kernel and a stride of 1, and a ReLU activation is applied after each convolution. To fuse the different feature maps and reduce computational cost, a 1x1 convolutional layer with 64 feature maps is added. Following these four continuous convolutional layers, a fully connected layer maps the (64x9x9) feature maps into a final feature vector of size 1x1024, followed by a final softmax layer for classification.
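For concreteness, the following is a rough PyTorch sketch of the teacher as described above. The channel widths of the first three layers, the padding that preserves the 9x9 spatial size, and the 3-logit output head (used here to match the multi-label BCE setup instead of a softmax layer) are assumptions, not details given in the text.

```python
import torch
import torch.nn as nn

class CCNNTeacher(nn.Module):
    """Sketch of the CCNN teacher: four conv layers with no pooling, a 1x1 fusion
    conv with 64 maps, and a fully connected layer to a 1024-d feature vector."""
    def __init__(self, in_channels=4, num_labels=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=4, stride=1, padding="same"), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=4, stride=1, padding="same"), nn.ReLU(),
            nn.Conv2d(128, 256, kernel_size=4, stride=1, padding="same"), nn.ReLU(),
            nn.Conv2d(256, 64, kernel_size=1), nn.ReLU(),  # 1x1 fusion convolution
        )
        self.fc = nn.Linear(64 * 9 * 9, 1024)
        self.head = nn.Linear(1024, num_labels)  # logits for valence, arousal, dominance

    def forward(self, x):                # x: (batch, 4, 9, 9)
        z = self.features(x).flatten(1)  # (batch, 64 * 9 * 9)
        return self.head(torch.relu(self.fc(z)))

logits = CCNNTeacher()(torch.randn(2, 4, 9, 9))  # shape (2, 3)
```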

III-C2 Student

The student model is a lightweight neural network designed for multi-label classification of EEG signals into valence, arousal, and dominance, employing an architecture with two convolutional layers. The input, denoted Z, represents EEG signal data organized in a grid-like fashion with shape [n, 4, 9, 9], where n is the batch size, 4 is the number of input channels corresponding to the four frequency bands, and (9, 9) is the size of the electrode grid. The model comprises two primary components: feature extraction and classification. The feature extraction phase consists of two convolutional layers, c_1 and c_2, each followed by a Rectified Linear Unit (ReLU) activation, which process the input EEG data to produce intermediate outputs Z^{(1)} and Z^{(2)}, respectively. A flattening operation then transforms Z^{(2)} into a one-dimensional tensor Z^{(3)}. Let G represent the classifier (an MLP); the feature representation Z^{(3)} is passed through G to predict the output labels:

\hat{y}_{\text{valence}} = G(Z^{(3)}) \in \{0, 1\}
\hat{y}_{\text{arousal}} = G(Z^{(3)}) \in \{0, 1\}
\hat{y}_{\text{dominance}} = G(Z^{(3)}) \in \{0, 1\}

In our specific case, this classifier enables multi-label classification for the three umbrella classes, namely valence (y_{\text{valence}}), arousal (y_{\text{arousal}}), and dominance (y_{\text{dominance}}).
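A rough sketch of the student under the description above; the exact channel widths, kernel sizes, and the hidden width of the MLP head G are not specified in the text and are assumptions chosen to keep the parameter count small.

```python
import torch
import torch.nn as nn

class EVOKEStudent(nn.Module):
    """Sketch of the lightweight student: two conv layers with ReLU for feature
    extraction, a flatten step, and an MLP head G producing three logits."""
    def __init__(self, in_channels=4, num_labels=3):
        super().__init__()
        self.c1 = nn.Conv2d(in_channels, 16, kernel_size=3, padding="same")
        self.c2 = nn.Conv2d(16, 32, kernel_size=3, padding="same")
        self.G = nn.Sequential(nn.Linear(32 * 9 * 9, 128), nn.ReLU(),
                               nn.Linear(128, num_labels))

    def forward(self, Z):             # Z: (n, 4, 9, 9)
        Z1 = torch.relu(self.c1(Z))   # Z^(1)
        Z2 = torch.relu(self.c2(Z1))  # Z^(2)
        Z3 = Z2.flatten(1)            # Z^(3), flattened features
        return self.G(Z3)             # logits for valence, arousal, dominance

# Binary predictions: threshold the sigmoid outputs at 0.5
preds = torch.sigmoid(EVOKEStudent()(torch.randn(8, 4, 9, 9))) > 0.5
```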

Figure 3: Multi-label classification results are mapped to eight emotions based on different combinations of valence, arousal, and dominance and further associated with 3D avatars through a hashing process.

IV Experiments and Results

IV-A Dataset

The DEAP dataset, developed by Koelstra et al. [19], was employed for our study. It comprises EEG signals from 32 subjects watching a series of 40 one-minute music videos, each accompanied by emotional response ratings on scales of arousal, valence, dominance, liking, and familiarity. In our study, we focus exclusively on arousal, valence, and dominance, as liking and familiarity are more related to individual perspectives [19]. DEAP’s 32-channel EEG data collection during emotional stimuli distinguishes it in EEG emotion recognition research, offering richer features for improved emotion state differentiation. Consequently, it serves as the exclusive dataset in our research.

Figure 4: Performance analysis in terms of accuracy and F1 score across various values of the temperature parameter T ((a) and (c)) and the weight factor α (see Eq. 1) ((b) and (d)). Note that the accuracy and F1 scores presented in the figures are mean values obtained from a 5-fold cross-validation evaluation.

IV-B Implementation Details

The optimization process employed the Adam optimizer with a learning rate of 1 × 10^{-3}. ReLU activation layers were used for the student model. A consistent batch size of 128 was applied across all models. The training phase extended for 100 epochs and encompassed a 5-fold cross-validation strategy. The entire experimentation process was conducted within the PyTorch framework [23], utilizing a single Nvidia RTX A6000 GPU with 40 GB of memory.
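As a concrete (but simplified) illustration of this setup, the snippet below wires the earlier sketches together for one fold: Adam with a learning rate of 1e-3, a batch size of 128, 100 epochs, and a 5-fold split. The tensors are random placeholders, and CCNNTeacher, EVOKEStudent, and distillation_loss refer to the illustrative sketches given earlier, not to the authors' exact code.

```python
import torch
from torch.utils.data import DataLoader, Subset, TensorDataset
from sklearn.model_selection import KFold

# Placeholder data: DE features X of shape (N, 4, 9, 9), binary labels Y of shape (N, 3)
X, Y = torch.randn(1024, 4, 9, 9), torch.randint(0, 2, (1024, 3)).float()
dataset = TensorDataset(X, Y)
teacher = CCNNTeacher().eval()  # pretrained in practice; randomly initialized here

for fold, (train_idx, _) in enumerate(KFold(n_splits=5, shuffle=True).split(X)):
    loader = DataLoader(Subset(dataset, train_idx), batch_size=128, shuffle=True)
    student = EVOKEStudent()
    optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
    for epoch in range(100):
        for xb, yb in loader:
            with torch.no_grad():
                t_logits = teacher(xb)  # frozen teacher provides soft knowledge
            loss = distillation_loss(student(xb), t_logits, yb)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```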

For creating the 3D avatars, we used Character Creator 4's Headshot plugin (https://www.reallusion.com/character-creator/) to turn a character's image into a 3D face mask. This involved adjusting expressions, hair color, and skin tone. Then, in Blender [24], we created clothing based on the 3D avatar, and importing it back into Character Creator ensured proper rigging. After fine-tuning, we had a lifelike 3D avatar, blending real-world aesthetics with virtual artistry. Headshot, a key feature of Character Creator, analyzed real-life photos to generate a detailed 3D face, accounting for contours, skin texture, and expressions.

IV-C Model Selection and Pretraining

We conducted a meticulous evaluation of three state-of-the-art Emotion-Recognition models: Arjun ViT [13], ViT [22], and CCNN [12]. We started with a clean slate, training these models entirely from scratch without the application of any knowledge distillation techniques. Our objective was to select the optimal candidate from these pretrained models to serve as the teacher model for our framework. For model evaluation, we employed accuracy and F1-score, which offer comprehensive insights into a model’s performance in predicting multiple labels [25]. These metrics were derived from the mean values obtained through five-fold cross-validation.

The models were first evaluated under two distinct EEG-based feature extraction techniques, Differential Entropy (DE) and Power Spectral Density (PSD), as presented in Table I. Notably, the results demonstrate the superior performance of CCNN across both evaluation metrics. This can be attributed to several key factors. Firstly, the transformer models, although carefully customized for EEG data, are inherently data-hungry; they struggled to harness the full potential of the dataset and exhibited comparatively limited generalization. Secondly, the unique nature of EEG data, which includes spatial information pertaining to the arrangement of electrodes on the scalp, presents a significant advantage for CCNN. This spatial awareness empowers CCNN to extract fine-grained details from the EEG data, effectively capturing the nuances associated with emotional states. All subsequent experiments were conducted with CCNN as the teacher. For the student model, we experimented with various architectures for our custom CNN, finding that an 8-layer design with two convolutional layers was the most suitable.

IV-D Computation and Performance

Our distilled model, EVOKE, stands out with significantly fewer parameters (18x fewer than the teacher model) while achieving the fastest inference time of 0.33 ms and the highest throughput of 80176 compared to the other models (see Table I). Its performance surpasses that of ViT and Arjun ViT and is comparable with that of the teacher model. This compelling balance between performance and deployability makes the framework suitable for virtual environment systems. Note that we omitted the Power Spectral Density (PSD) feature extraction technique during EVOKE's training due to its inferior results compared to differential entropy, as shown in Table I. Additionally, we conducted experiments with various temperature parameter (T) values, as shown in Fig. 4 (a) and (c). Our findings indicate that the model performs optimally when T is set to 1.25; notably, performance starts to decline gradually for T values exceeding 2. Further, we also explored different values of α, the weight factor in Eq. 1. The model achieved its best performance when α was set to 0.25 (see Fig. 4 (b) and (d)), and excessively large or small α values resulted in decreased performance. In Fig. 4, the plots illustrate the model's performance during the initial 50 epochs.
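Latency and throughput figures of the kind reported in Table I can be obtained with a simple timing loop; the sketch below is one way to measure them and is not the authors' benchmarking protocol. The batch size, warm-up count, and the samples-per-second definition of throughput are assumptions.

```python
import time
import torch

def benchmark(model, batch_size=128, in_shape=(4, 9, 9), n_iters=100, device="cuda"):
    """Rough per-batch latency (ms) and throughput (samples/s) measurement."""
    model = model.to(device).eval()
    x = torch.randn(batch_size, *in_shape, device=device)
    with torch.no_grad():
        for _ in range(10):  # warm-up iterations
            model(x)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_iters):
            model(x)
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return elapsed / n_iters * 1000, batch_size * n_iters / elapsed
```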

IV-E Emotion Mapping to 3D Avatars

The three-dimensional emotion model, namely the Valence-Arousal-Dominance (VAD) model, includes the basic emotions [14] defined by the rating of each dimension. For instance, the closer the rating of each dimension is to zero, the weaker the emotion distinction; as shown in Fig. 3, when all categories are low or zero, the emotion is neutral. Hence, in this paper, we have developed a set of combinations and their corresponding emotion mappings that bridge the gap between emotion classification and its representation in 3D avatars. Unlike some prior approaches [18], which used the 2D Valence-Arousal emotion model, our work incorporates the additional dimension of dominance. This additional dimension allows for a broader range of emotions, including neutral and excited states, alongside the six fundamental emotions. The number of emotions is limited intentionally to maintain focus and manageability, especially concerning the emotion mapping onto avatars. The combinations and their emotion mapping are shown in Fig. 3, where 0 indicates a low level of the corresponding dimension and 1 indicates a high level.
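A minimal sketch of how such a mapping can be implemented as a simple hash from binary (valence, arousal, dominance) triples to emotion tags and avatar assets. Only the all-low triple mapping to a neutral emotion is stated in the text; the remaining emotion assignments and the avatar file names below are illustrative placeholders, and the authors' actual mapping is the one shown in Fig. 3.

```python
# Binary (valence, arousal, dominance) triples hashed to emotion tags.
VAD_TO_EMOTION = {
    (0, 0, 0): "neutral",    # stated in the text: all dimensions low -> neutral
    (1, 1, 1): "excited",    # placeholder assignment
    (1, 1, 0): "happy",      # placeholder assignment
    (0, 0, 1): "sad",        # placeholder assignment
    (0, 1, 1): "angry",      # placeholder assignment
    (0, 1, 0): "fear",       # placeholder assignment
    (1, 0, 1): "surprised",  # placeholder assignment
    (1, 0, 0): "disgusted",  # placeholder assignment
}

def avatar_for(prediction):
    """prediction: binary (V, A, D) tuple from the classifier.
    Returns a hypothetical avatar asset path for the mapped emotion."""
    emotion = VAD_TO_EMOTION[tuple(int(v) for v in prediction)]
    return f"avatars/{emotion}.fbx"

print(avatar_for((0, 0, 0)))  # -> avatars/neutral.fbx
```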

V Conclusion and Future Work

In response to the need for practical, real-time EEG emotion classification, we introduced EVOKE, a knowledge distillation-based lightweight model for emotion recognition and integration into virtual environments. The framework maps multi-label classification results to eight distinct emotions and links them to custom-created 3D avatars. When tested on the publicly available DEAP EEG dataset, our proposed model achieved competitive performance with significantly fewer parameters than other state-of-the-art models. This work paves the way for enhanced emotional communication in virtual settings, offering numerous applications in healthcare, therapy, and beyond, ultimately making emotional well-being more accessible and immersive.

One notable application is in Virtual Therapy Sessions:

  • Scenario: In virtual therapy sessions, EVOKE can be utilized to enhance the emotional interaction between therapists and clients by mapping real-time emotional states onto virtual avatars.

  • Application: As clients express their feelings and emotions, EVOKE processes the emotional cues through its knowledge-distilled model, providing therapists with a visual representation of emotional changes in the virtual avatars. This visual feedback aids therapists in understanding and responding empathetically to the client’s emotional state.

In the future, we aim to integrate real-time EEG signals into a healthcare virtual environment system, making it readily accessible for use by healthcare professionals.

References

  • [1] Roddy Cowie, Ellen Douglas-Cowie, Nicolas Tsapatsoulis, George Votsis, Stefanos Kollias, Winfried Fellenz, and John G Taylor. Emotion recognition in human-computer interaction. IEEE Signal processing magazine, 18(1):32–80, 2001.
  • [2] Yan Wang, Wei Song, Wei Tao, Antonio Liotta, Dawei Yang, Xinlei Li, Shuyong Gao, Yixuan Sun, Weifeng Ge, Wei Zhang, et al. A systematic review on affective computing: Emotion models, databases, and recent advances. Information Fusion, 83:19–52, 2022.
  • [3] Danny Oude Bos et al. Eeg-based emotion recognition. The influence of visual and auditory stimuli, 56(3):1–17, 2006.
  • [4] Nazmi Sofian Suhaimi, James Mountstephens, Jason Teo, et al. Eeg-based emotion recognition: A state-of-the-art review of current trends and opportunities. Computational intelligence and neuroscience, 2020, 2020.
  • [5] Yisi Liu, Olga Sourina, and Minh Khoa Nguyen. Real-time eeg-based emotion recognition and its applications. Transactions on Computational Science XII: Special Issue on Cyberworlds, pages 256–277, 2011.
  • [6] Xiao-Wei Wang, Dan Nie, and Bao-Liang Lu. Emotional state classification from eeg data using machine learning approach. Neurocomputing, 129:94–106, 2014.
  • [7] Omid Bazgir, Zeynab Mohammadi, and Seyed Amir Hassan Habibi. Emotion recognition with machine learning using eeg signals. In 2018 25th national and 3rd international iranian conference on biomedical engineering (ICBME), pages 1–5. IEEE, 2018.
  • [8] Raza Imam, Ibrahim Almakky, Salma Alrashdi, Baketah Alrashdi, and Mohammad Yaqub. Seda: Self-ensembling vit with defensive distillation and adversarial training for robust chest x-rays classification, 2023.
  • [9] Yuan-Pin Lin, Chi-Hong Wang, Tien-Lin Wu, Shyh-Kang Jeng, and Jyh-Horng Chen. Eeg-based emotion recognition in music listening: A comparison of schemes for multiclass support vector machine. In 2009 IEEE international conference on acoustics, speech and signal processing, pages 489–492. IEEE, 2009.
  • [10] Guowen Xiao, Meng Shi, Mengwen Ye, Bowen Xu, Zhendi Chen, and Quansheng Ren. 4d attention-based neural network for eeg emotion recognition. Cognitive Neurodynamics, pages 1–14, 2022.
  • [11] Shuaiqi Liu, Xu Wang, Ling Zhao, Bing Li, Weiming Hu, Jie Yu, and Yu-Dong Zhang. 3dcann: A spatio-temporal convolution attention neural network for eeg emotion recognition. IEEE Journal of Biomedical and Health Informatics, 26(11):5321–5331, 2021.
  • [12] Yilong Yang, Qingfeng Wu, Yazhen Fu, and Xiaowei Chen. Continuous convolutional neural network with 3d input for eeg-based emotion recognition. In Neural Information Processing: 25th International Conference, ICONIP 2018, Siem Reap, Cambodia, December 13–16, 2018, Proceedings, Part VII 25, pages 433–443. Springer, 2018.
  • [13] Arjun Arjun, Aniket Singh Rajpoot, and Mahesh Raveendranatha Panicker. Introducing attention mechanism for eeg signals: Emotion recognition with vision transformers. In 2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), pages 5723–5726. IEEE, 2021.
  • [14] Paul Ekman, E Richard Sorenson, and Wallace V Friesen. Pan-cultural elements in facial displays of emotion. Science, 164(3875):86–88, 1969.
  • [15] James A Russell. A circumplex model of affect. Journal of personality and social psychology, 39(6):1161, 1980.
  • [16] James A Russell and Albert Mehrabian. Evidence for a three-factor theory of emotions. Journal of research in Personality, 11(3):273–294, 1977.
  • [17] Albert Mehrabian. Pleasure-arousal-dominance: A general framework for describing and measuring individual differences in temperament. Current Psychology, 14:261–292, 1996.
  • [18] Yisi Liu, Olga Sourina, and Minh Khoa Nguyen. Real-time eeg-based human emotion recognition and visualization. In 2010 international conference on cyberworlds, pages 262–269. IEEE, 2010.
  • [19] Sander Koelstra, Christian Muhl, Mohammad Soleymani, Jong-Seok Lee, Ashkan Yazdani, Touradj Ebrahimi, Thierry Pun, Anton Nijholt, and Ioannis Patras. Deap: A database for emotion analysis; using physiological signals. IEEE transactions on affective computing, 3(1):18–31, 2011.
  • [20] Ruo-Nan Duan, Jia-Yi Zhu, and Bao-Liang Lu. Differential entropy feature for eeg-based emotion classification. In 2013 6th International IEEE/EMBS Conference on Neural Engineering (NER), pages 81–84. IEEE, 2013.
  • [21] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  • [22] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. International Conference on Learning Representations (ICLR), 2021.
  • [23] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019.
  • [24] Tony Mullen. Mastering blender. John Wiley & Sons, 2011.
  • [25] Gabriel Bénédict, Vincent Koops, Daan Odijk, and Maarten de Rijke. Sigmoidf1: A smooth f1 score surrogate loss for multilabel classification. arXiv preprint arXiv:2108.10566, 2021.