\jyear

2021

[1]\fnmAli \surAbedi

1]\orgdivKITE, \orgnameUniversity Health Network, Canada

Detecting Disengagement in Virtual Learning as an Anomaly using Temporal Convolutional Network Autoencoder

[email protected] \fnmShehroz \surS. Khan [email protected] [

Abstract

Student engagement is an important factor in meeting the goals of virtual learning programs. Automatic measurement of student engagement provides helpful information for instructors to meet learning program objectives and individualize program delivery. Many existing approaches solve video-based engagement measurement using the traditional frameworks of binary classification (classifying video snippets into engaged or disengaged classes), multi-class classification (classifying video snippets into multiple classes corresponding to different levels of engagement), or regression (estimating a continuous value corresponding to the level of engagement). However, we observe that while the engagement behaviour is mostly well-defined (e.g., focused, not distracted), disengagement can be expressed in various ways. In addition, in some cases, the data for disengaged classes may not be sufficient to train generalizable binary or multi-class classifiers. To handle this situation, in this paper, for the first time, we formulate detecting disengagement in virtual learning as an anomaly detection problem. We design various autoencoders, including temporal convolutional network autoencoder, long-short-term memory autoencoder, and feedforward autoencoder using different behavioral and affect features for video-based student disengagement detection. The result of our experiments on two publicly available student engagement datasets, DAiSEE and EmotiW, shows the superiority of the proposed approach for disengagement detection as an anomaly compared to binary classifiers for classifying videos into engaged versus disengaged classes (with an average improvement of $9$ % on the area under the curve of the receiver operating characteristic curve and $22$ % on the area under the curve of the precision-recall curve).

keywords:

Student Engagement, Disengagement Detection, Affect States, Autoencoder, Anomaly Detection, TCN, Temporal Convolutional Autoencoder

1 Introduction

With the widespread accessibility and adoption of internet services across major urban centers and universities, virtual learning programs are becoming more ubiquitous and mainstream dhawan2020online . Virtual learning programs offer many advantages compared to traditional in-person learning programs in terms of being more accessible, economical, and personalizable dumford2018online . However, virtual learning programs also bring other types of challenges. For instance, in a virtual learning setting, students and tutor are behind a “virtual wall”, and it becomes very difficult for the tutor to assess the students’ engagement in the class being taught. This problem is further intensified if the group of students is large dumford2018online ; sumer2021multimodal . Therefore, from a tutor’s perspective, it is crucial to automatically measure student engagement to provide them with real-time feedback and take necessary actions to engage the students to maximize their learning objectives.

According to Sinatra et al. sinatra2015challenges , in engagement measurement, the focus is on the behavioral, affective, and cognitive states of the student in the moment of interaction with a particular contextual environment. Engagement is not stable over time and is best captured with physiological and psychological measures at fine-grained time scales, from seconds to minutes d2017advanced . Behavioral engagement involves general on-task behavior and paying attention at the surface level. The indicators of behavioral engagement, in the moment of interaction, include eye contact, blink rate, and head pose d2017advanced ; woolf2009affect . Affective engagement is defined as the affective and emotional reactions of the student to the content. Its indicators are positive versus negative and activating versus deactivating emotions sinatra2015challenges ; d2017advanced ; woolf2009affect . Cognitive engagement pertains to the psychological investment and effort allocation of the student to deeply understand the learning materials sinatra2015challenges . To measure cognitive engagement, information such as student’s speech could be processed to recognize the level of student’s comprehension of the context sinatra2015challenges . Contrary to behavioral and affective engagements, measuring cognitive engagement requires knowledge about context materials. Measuring student’s engagement in a specific context depends on the knowledge about the student and the context. From a data analysis perspective, it depends on the data modalities available to analyze. In this paper, the focus is on automatic video-based disengagement detection. The only data modality is video, without audio, and with no knowledge about the context.

Majority of the recent works on student engagement measurement are based on the video data of students acquired by cameras and using deep-learning, machine-learning, and computer-vision techniques dewan2019engagement ; doherty2018engagement ; abedi2021improving . Previous works on student engagement measurement tried to solve different machine-learning problems, including

•

Binary classification problem aung2018harnessing ; chen2019faceengage ; booth2017toward ; gupta2016daisee ; mehta2022three , classifying student’s video into engaged or disengaged classes,
•

Multi-class classification problem abedi2021affect ; booth2017toward ; gupta2016daisee ; liao2021deep ; abedi2021improving ; huang2019fine ; ma2021automatic ; dresvyanskiy2021deep ; mehta2022three , classifying student’s video into multiple classes corresponding to different levels of engagement, or
•

Regression abedi2021affect ; booth2017toward ; liao2021deep ; thomas2018predicting ; kaur2018prediction ; copur2022engagement , estimating a continuous value corresponding to the level of engagement of the student in the video.

The state of high and medium engagement is mostly well understood and well defined in terms of behavioral and affective states; if a student looking at the camera with attention and having a focused affect state is considered engaged. However, disengagement could be expressed in diverse ways, including various combinations of behavioral and affective states, such as off-task behavioral states; not looking at the camera sinatra2015challenges ; d2017advanced ; woolf2009affect ; aslan2017human , high blink rate ranti2020blink , and face-palming, or negative and deactivating affect states; annoyed (high negative valence and high positive arousal), and bored (high negative valence and high negative arousal) emotional states sinatra2015challenges ; d2017advanced ; woolf2009affect ; aslan2017human , see Figure 1 for some examples. Collecting large amounts of data corresponding to these diverse disengagement states is challenging. As a consequence, in many engagement measurement datasets, data corresponding to different types of disengagement states could be sparse khan2022inconsistencies , and building supervised models on a diverse class of ”disengagement” can be very difficult dresvyanskiy2021deep ; liao2021deep ; abedi2021improving . In this paper, for the first time, we formulate student disengagement detection as an anomaly detection problem. This formulation does not imply that disengagement is a rare event, rather it is meant to detect the diversity and lack of consistency in expressing disengagement in virtual learning. Using an anomaly detection framework, those behavioral and affect states that significantly deviate from a well-defined engaged state can be identified as the disengaged state. The anomaly detection frameworks to detect (not necessarily rare) deviant behaviors from normal behaviors have been successfully exploited in other domains, such as driver anomalous behavior detection kopuklu2021driver ; khan2021modified .

Refer to caption — Figure 1: 5 (out of 300) frames of six video samples of three students in the DAiSEE dataset gupta2016daisee . While the students in (a), (c), and (e) are engaged, the students in (b), (d), and (f) are disengaged. As can be seen in these exemplary frames, while the students in the engaged videos show similar behavioural and affective states (attentively looking at the camera with relatively positive affect states), the behavioural and affective indicators of disengagement are different in the disengaged students, face-palming and sleepy affect state in (b), looking somewhere other than camera in (d), yawning, eye rubbing, and eye closure in (f). The proposed algorithm for disengagement detection, temporal convolutional autoencoder with behavioural and affect features after applying a threshold on its output reconstruction error (described in Section 3), successfully detects disengagement in (b), (d), and (f) (true positive). It correctly classifies the students in (a), (c), and (e) into engaged (true negative).

We also observed that in many existing student engagement datasets, the distribution of engaged to disengaged samples is highly imbalanced; in many cases, the percentage of disengaged samples is extremely low dresvyanskiy2021deep ; khan2022inconsistencies . The interpretation of this imbalanced data distribution in educational psychology is that the students are more engaged in the courses that are relevant to their major barlow2020development ; canziani2021student . In this severely skewed disengagement class scenario, supervised approaches using weighted loss functions could be helpful abedi2021improving . However, according to the poor disengagement class accuracies reported in previous works liao2021deep ; mehta2022three ; abedi2021improving , the diversity of disengaged states is challenging to be modeled, and the anomaly detection approach is a viable approach to detect these states.

Based on the above discussion of anomaly detection framework to detect disengaged behaviors, we design various autoencoder (AE) neural networks to detect student disengagement in videos. Different behavioral and affect features, as indicators of behavioral and affective components of engagement sinatra2015challenges ; d2017advanced ; woolf2009affect ; aslan2017human are extracted from videos. Various Temporal Convolutional Network (TCN) thill2021temporal , Long-Short-Term Memory (LSTM), and feedforward AEs using the extracted features are designed for video-based student disengagement detection. The AEs are trained on only engaged video snippets that are abundantly available. It is expected that an AE will be able to model the ”engaged” concept through low reconstruction error. An unseen ”disengaged” video snippet would result in a larger reconstruction error indicating deviance from the ”engaged” class. Thus the reconstruction error can be thresholded to detect disengagement. The developed temporal and convolutional network AEs analyze the sequences of features extracted from consecutive video frames. Therefore, in addition to being able to detect anomalous behavioral and affective states (static indicators of disengagement), they are able to detect anomalous temporal changes in the behavioral and affective states (dynamic indicators of disengagement) d2012dynamics . The experimental result on two publicly available student engagement datasets shows that the anomaly detection approach outperforms the equivalent binary classification formulation. Our primary contributions are as follows:

•

This is the first work formulating disengagement detection as an anomaly detection problem and developing AEs for this task dewan2019engagement ; doherty2018engagement ; khan2022inconsistencies .
•

Temporal Convolutional Network AE (TCN-AE) has recently been introduced for unsupervised anomaly detection in time series thill2021temporal . For the first time in the field of affective computing, TCN-AE is used for disengagement detection.
•

Extensive experiments are conducted on the only two publicly available student engagement datasets gupta2016daisee ; kaur2018prediction , and the proposed anomaly detection formulation is compared with various feature-based and end-to-end binary classification methods.

This paper is structured as follows. Section 2 introduces related works on student engagement measurement, focusing on the machine-learning problem they solve. In Section 3, the proposed approach for student disengagement detection is presented. Section 4 describes experimental settings and results on the proposed methodology. In the end, Section 5 presents our conclusions and directions for future works.

2 Literature Review

Over the recent years, extensive research efforts have been devoted to automating student engagement measurement dewan2019engagement ; doherty2018engagement ; abedi2021affect . This section briefly discusses previous works on computer-vision-based engagement measurement, focusing on their machine-learning problem. For more information on features, machine-learning algorithms, deep neural network architectures, and datasets of the previous methods, refer to dewan2019engagement ; doherty2018engagement ; abedi2021affect ; khan2022inconsistencies .

Table 1: Previous video-based engagement measurement approaches compared to the proposed approach in this paper, the machine-learning problem they solve (Binary Classification: BC, Multi-class Classification: MC, or Regression: R), their features, and their machine-learning/deep-learning models.

Authors	Problem	Features	Model
Gupta et el., 2016 gupta2016daisee	MC, BC	end-to-end	C3D, LSTM
Bosch et al. 2016 bosch2016detecting	BC	AU, LBP-TOP, Gabor	SVM
Booth et al., 2017 booth2017toward	BC, MC, R	facial landmark, AU, optical flow, head pose	SVM, KNN, RF
Aung et al., 2018 aung2018harnessing	MC, R	box filter, Gabor filter, facial Action Units (AU)	GentleBoost, SVM, LR
Niu et al., 2018 niu2018automatic	R	gaze, head pose, AU	GRU
Thomas et al., 2018 thomas2018predicting	R	gaze, head pose, AU	TCN
Huang et al., 2019 huang2019fine	MC	gaze, eye location, head pose, AU	LSTM
Kaur et al., 2018 kaur2018prediction	R	LBP-TOP	MLP
Chen et al., 2019 chen2019faceengage	BC	gaze, blink rate, head pose, facial embedding	SVM, KNN, RF, RNN
Wu et al., 2020 wu2020advanced	R	gaze, head pose, body pose, facial embedding	LSTM, GRU
Liao et al., 2021 liao2021deep	MC, R	gaze, eye location, head pose	LSTM
Ma et al., 2021 ma2021automatic	MC	gaze direction, head pose, AU, C3D	neural Turing machine
Abedi et al., 2021 abedi2021affect	MC, R	affect, facial embedding, blink rate, gaze head pose	LSTM, TCN, MLP, SVM, RF
Dresvyanskiy et al., 2021 dresvyanskiy2021deep	MC	AU, facial embedding	LSTM
Abedi et al., 2021 abedi2021improving	MC	end-to-end	ResNet + TCN
Copur et al. 2022 copur2022engagement	R	eyae gaze, head pose, AU	LSTM
proposed method	anomaly detection	valence, arousal, blink rate, gaze, head pose	TCN-AE, LSTM-AE

Table 1 shows previous related works on video-based engagement measurement and the machine-learning problem they solve. The majority of previous works on engagement measurement solve three types of problems: binary classification, multi-class classification, and regression.

According to the datasets used in the previous approaches, the authors developed different machine-learning algorithms or deep neural-network architectures to measure engagement. Two publicly available video-based engagement measurement datasets are (i) DAiSEE (Dataset for Affective States in E-Environments) gupta2016daisee in which engagement measurement is defined as a four-class classification problem, and (ii) EmotiW (Emotion Recognition in the Wild) kaur2018prediction in which engagement measurement is defined as a regression problem. The approaches that use DAiSEE mostly consider it as a four, three, or two-class classification problem abedi2021affect ; gupta2016daisee ; liao2021deep ; abedi2021improving ; huang2019fine ; ma2021automatic ; dresvyanskiy2021deep , whereas the approaches that use EmotiW apply various regression techniques to model the level of engagement abedi2021affect ; liao2021deep ; thomas2018predicting ; kaur2018prediction ; niu2018automatic ; wu2020advanced . Many authors altered the original machine-learning problem in these two datasets and designed models based on the new problem, such as converting the original four-class classification problem in DAiSEE into a regression problem and developing regression deep-learning models liao2021deep . Other previous methods defined their models based on the problems in their non-public datasets. For instance, Aung et al., aung2018harnessing collected a video dataset for student engagement measurement and designed different machine-learning models to solve a four-class classification problem and a regression problem. In the end-to-end approaches in Table 1 gupta2016daisee ; abedi2021improving , there is no feature extraction step, and the deep-learning models output the engagement level of raw video frames as inputs. Abedi and Khan abedi2021affect proposed to use affect features (continuous values of valence and arousal) along with behavioral features for engagement level classification and regression. In this paper, using these handcrafted features, we propose to use AE neural networks for student disengagement detection in videos.

3 Methodology

Figure 2 depicts the conceptual block diagram of the proposed method for student disengagement detection using AEs. The input is a video clip of a student in a virtual learning session sitting in front of the camera of a laptop or PC, see Figure 1. Various behavioral and affective features are extracted from the consecutive frames of the input video to construct a feature vector as the input to a neural network. The neural network model is an AE that has been trained on the feature vectors of the videos of engaged students (normal samples). The AE outputs the error of reconstructing the input video. This reconstruction error shows how much the student in the input video clip is disengaged, i.e., how much the behavioral and affective states of the student are deviant from the states of an engaged student (a normal sample).

3.1 Feature Extraction

Various affect and behavioral features, corresponding to the affective and behavioral states of the student in the video, are extracted from the consecutive video frames.

Affect features: Altuwairqi et al. altuwairqi2021new linked engagement to affective states. They defined different levels of engagement according to different values of valence and arousal in the circumplex model of affect altuwairqi2021new . Abedi et al. abedi2021affect showed that sequences of continuous values of valence and arousal extracted from consecutive video frames are significant indicators of affective engagement. In the circumplex model of affect altuwairqi2021new , different combinations of values of valence and arousal correspond to different affect states. Positive values of valence and arousal with low fluctuations over time correspond to affective engagement, and any deviation from these values indicates affective disengagement. The pretrained EmoFAN toisoul2021estimation on the AffectNet dataset (mollahosseini2017affectnet, ), a deep neural network to analyze facial affects in face images, is used to extract valence and arousal from consecutive video frames (refer to abedi2021affect for more detail).

Behavioral features: Woolf et al. woolf2009affect defined different characteristics of behavioral engagement as on-task behavior against off-task behavior corresponding to behavioral disengagement. In many cases, a student engaged in a virtual learning material focusedly looks at the computer screen, i.e., the student’s head pose and eye gaze direction are perpendicular to the computer screen with low fluctuations. There are low fluctuations in yaw, pitch, and roll of the student’s head while engaged with the learning material. Accordingly, inspired by previous research abedi2021affect ; liao2021deep ; thomas2018predicting ; kaur2018prediction ; niu2018automatic , eye location, head pose, and eye gaze direction in consecutive video frames are considered as one set of behavioral features. Ranti et al. ranti2020blink demonstrated that eye blinks are withdrawn at precise moments in time so as to minimize the loss of visual information that occurs during a blink. High and irregular blink rate indicates disengagement niu2018automatic ; ranti2020blink . The intensity of facial action unit AU45, indicating how closed the eyes are abedi2021affect , will be used as another behavioral feature.

According to the above explanations, the following 11 features are extracted from consecutive video frames, 2-element affect features: valence and arousal (2 features), and 9-element behavioral features: eye-closure intensity (1 feature), x and y components of eye gaze direction w.r.t. camera (2 features), x, y, and z components of head location w.r.t. camera (3 features); pitch, yaw, and roll of the head (3 features).

To illustrate how the extracted behavioral and affect features are capable of differentiating between engaged and disengaged samples, some of the extracted features from the consecutive 300 frames of the videos in Figures 1 (c) and (d) are drawn in Figure 3. The difference between features extracted from the video in Figure 1 (c), an engaged sample, and Figure 1 (d), a disengaged sample, is observable in Figure 3. These differentiating features and anomalies in features of the disengaged sample over time help AEs detect disengagement.

In addition to using the above 11-element frame-level features, being extracted from single frames of video, 37-element segment-level features are also extracted from video segments. Each video segment consists of multiple consecutive video frames. The mean and standard deviation of valence and arousal values over consecutive video frames (4 features), the blink rate, derived by counting the number of peaks above a certain threshold divided by the number of frames in the AU45 intensity time series extracted from the input video (1 feature), the mean and standard deviation of the velocity and acceleration of x and y components of eye gaze direction (8 features), the mean and standard deviation of the velocity and acceleration of x, y, and z components of eye gaze direction (12 features), and the mean and standard deviation of the velocity and acceleration of head’s pitch, yaw, and roll (12 features). As will be described in Section 4, the frame-level, and segment-level features are used to extract features from short, and long videos, respectively.

3.2 Video-based Disengagement Detection Using Temporal Convolutional Network Autoencoder

Thill et al. thill2021temporal introduced TCN-AE for unsupervised anomaly detection in time series. As depicted in Figure 4, a TCN-AE consists of an encoder for compressing (encoding) the input $n$ -dimensional feature vector of length $T$ along the time and feature axes. The TCN-AE decoder then attempts to decode the compressed representation and reconstruct the original input feature vector. The architecture of the TCNs in the encoder (TCN1) and decoder (TCN2) are identical and are the vanilla TCN with dilated convolutions bai2018empirical . However, the weights of the TCNs are updated independently during training. The Conv1 and Conv2 are trainable $1\times 1$ convolution layers. The average-pooling bottleneck layer down-samples the feature map along the time axis. The upsampling layer then up-samples and restores the length of the original feature vector. The up-sampled feature vector is passed through the TCN2 and Conv2 to reconstruct the original feature vector with dimension $n$ and length $T$ .

The Mean Squared Error (MSE) is used as the loss function in the training phase. The TCN-AE neural network is trained only by the normal (engaged) samples and is forced to minimize the reconstruction error for these samples. In the test phase, the MSE between the input feature vector and its reconstructed version by the trained TCN-AE is calculated. High values of MSE are expected for the anomalous (disengaged) samples that significantly differ from the normal (engaged) samples on whom the TCN-AE is trained, see Figure 2. As will be explained in Section 4, in addition to the TCN-AE, LSTM- and feedforward-AEs are also examined for disengagement detection in virtual learning.

4 Experiments

In this section, the performance of the proposed disengagement detection approach is evaluated. There is no previous work on disengagement detection as an anomaly detection problem to be compared with the proposed approach. Therefore, the performance of different AE architectures with different feature sets for disengagement detection is evaluated compared to various binary classification neural networks to show the effectiveness of defining disengagement detection as an anomaly detection problem. A comparison is also made between the proposed method and some of the existing end-to-end and feature-based methods for engagement binary classification.

The evaluation metrics are the Area Under the Curve of the Receiver Operating Characteristic curve (AUC ROC) and the Area Under the Curve of the Precision-Recall curve (AUC PR). The confusion matrix, after finding an optimal threshold on the output reconstruction errors of AEs and on the output class probabilities of binary classifiers, is also calculated (see Section 4.3).

As described in Section 3, while the AEs are trained on only normal (engaged) samples, the binary classifiers are trained on both normal (engaged) and anomalous (disengaged) samples. The test sets of both AE and binary classifiers are the same and contain both normal and anomalous samples.

The experiments were implemented in PyTorch paszke2019pytorch and Scikit-learn pedregosa2011scikit on a server with 64 GB of RAM and NVIDIA TeslaP100 PCIe 12 GB GPU. The code of our implementations is available at https://github.com/abedicodes/ENG-AE.

4.1 Dataset

The performance of the proposed method is evaluated on two publicly available video-only student engagement datasets, DAiSEE gupta2016daisee and EmotiW kaur2018prediction .

DAiSEE: The DAiSEE dataset gupta2016daisee contains 9,068 videos captured from 112 students in online courses, see Figure 1 for some examples. The videos were annotated by four states of students while watching online courses, boredom, confusion, frustration, and engagement. Each state is in one of the four levels (ordinal classes), level 0 (very low), 1 (low), 2 (high), and 3 (very high). The length, frame rate, and resolution of the videos are 10 seconds, 30 frames per second, and 640 × 480 pixels. In this paper, the focus is only on disengagement detection. Therefore, the videos with engagement levels of 0 and 1 are considered as disengaged (anomalous or positive) samples, and the videos with engagement levels of 2 and 3 are considered as engaged (normal or negative) samples. The number of disengaged and engaged samples in training, validation, and test sets of the DAiSEE dataset is shown in Table 2. As can be seen in Table 2, the dataset is highly imbalanced.

EmotiW: The EmotiW dataset kaur2018prediction contains videos of 78 students in online classroom setting. The total number of videos is 195, including 147 training and 48 validation video samples. The videos are at a resolution of 640 × 480 pixels and 30 fps. The lengths of the videos are around 5 minutes. The engagement levels of students in the videos are in the range 0, 0.33, 0.66, and 1, corresponding to the lowest to highest levels of engagement, where 0, and 1 indicate that the person is completely disengaged, and highly engaged, respectively. Two dichotomizations are applied to use EmotiW for disengagement detection. In the first setting (EmotiW1), the samples with engagement level of 0, and the samples with engagement levels of 0.33, 0.66, and 1.0 are considered as disengaged, and engaged samples, respectively. In the second setting (EmotiW2), the samples with engagement levels of 0 and 0.33, and the samples with engagement levels of 0.66 and 1.0 are considered as disengaged, and engaged samples, respectively, see Table 3.

Table 2: The distribution of engaged and disengaged samples in the train, validation, and test sets in the DAiSEE dataset gupta2016daisee after dichotomization of engagement levels in the original dataset, see Section 4.1.

Engagement	Train	Validation	Test
disengaged	247	166	88
engaged	5111	1263	1696
total	5358	1429	1784

Table 3: The distribution of engaged and disengaged samples in the train and validation sets in the EmotiW dataset kaur2018prediction after dichotomization of engagement levels in the original dataset in two different ways resulting in EmotiW1 and EmotiW2, see Section 4.1.

	EmotiW1		EmotiW2
Engagement	Train	Validation		Train	Validation
disengaged	5	4		40	14
engaged	142	44		107	34
total	147	48		147	48

4.2 Experimental Setting

The frame-level behavioral features, described in Section 3.1, are extracted by the OpenFace baltrusaitis2018openface . The OpenFace also outputs the extracted face regions from video frames. The extracted face regions of size 256 × 256 are fed to the pretrained EmoFAN toisoul2021estimation on AffectNet mollahosseini2017affectnet for extracting affect features (see Section 3.1). The frame-level features are extracted from 10-second video samples (with 300 frames) of the DAiSEE dataset to be fed to the AEs. The extracted feature vector from each frame is considered as one time step of the temporal AE models. As the EmotiW dataset contains videos of around 5-minute length, following the previous works on this dataset thomas2018predicting ; liao2021deep ; abedi2021affect , the videos are divided into 10-second segments with $50\%$ overlap, and segment-level features, described in Section 3.1, are extracted from each segment. The segment-level feature vector extracted from each video segment is considered as one time step of the temporal AE models.

For the TCN-AE (described in Section 3.2), the parameters of the TCN1 and TCN2, giving the best results, are as follows, 8, $h=24$ , 8, and 0.05 for the number of levels, number of hidden units, kernel size, and dropout bai2018empirical . The Conv1 and Conv2 are $1\times 1$ convolutional layers with $h=24$ input channels and $n$ output channels. The average-pooling bottleneck layer has a kernel of size $d=4$ . The interpolation layer up-samples the feature maps with a factor of $d=4$ . $n$ is the dimensionality or the number of features in the feature vector. In frame-level features, the number of features n is 9, and 11 when using only behavioral, and both behavioral and affect features, respectively. In segment-level features, the number of features n is 33, and 37 when using only behavioral, and both behavioral and affect features, respectively. To compare the perfomance of the TCN-AE with the TCN binary classifier, one TCN with the same architecture as TCN1, trailed by a $24\times 1$ fully-connected layer and a Sigmoid activation function is also implemented. The binary cross-entropy loss is used as the loss function for the TCN binary classifier.

In addition to the TCN-AE, LSTM- and feedforward AEs are also implemented as follows. The encoder part of the LSTM AE contains two LSTMs. The first LSTM accepts the n-element feature vector and the number of neurons in its hidden layer is h. The second LSTM’s input dimension is h and the number of neurons in its hidden layer is b. b is the dimension of the encoded version of input, i.e., the output of encoder. The decoder part of the LSTM AE contains two LSTMs. The first LSTM’s input, and output dimensions are b, and h, while the second LSTM’s input, and output dimensions are h, and n, respectively. The n-dimensional output of the final LSTM is the reconstruction of the input. To compare the performance of AEs with a binary classifier, the encoder part of the LSTM AE trailed by a fully connected layer and a Sigmoid has also been implemented. The binary cross-entropy loss is used as the loss function for the binary classifier. The feedforward AE contains two fully-connected layers as encoder and two fully-connected layers as decoder of size, n × (2 × b), (2 × b) × b, b × (2 × b), and (2 × b) × n. The values of b, and h are 64, and 128 achieving the best results.

To compare the performance of the above feature-based AEs with end-to-end models, a 3D-Convolutional Neural Network (CNN) AE is also implemented. The encoder part of the 3D-CNN AE contains three 3D convolution, ReLU, and 3D max-pooling blocks. The decoder part contains three 3D transposed convolution, ReLU, and 3D max-pooling blocks to reconstruct the input. For comparison, several end-to-end binary classifiers are also implemented. See the references in Table 1 for more details. In all the models, the Adam optimizer with decaying learning rate is used for optimization.

The majority of the existing engagement measurement approaches solve multi-class classification or regression problems, according to Table 1. To compare the binary classification version of those methods with the proposed method, the final fully-connected layer of their proposed neural network architecture is altered to have one output neuron (for binary classification), and their loss function is changed to binary cross-entropy loss.

4.3 Experimental Result

Table 4 shows the AUC ROC and AUC PR of different models for disengagement detection of the test samples in the DAiSEE dataset. Table 5 shows the AUC ROC and AUC PR of different models for disengagement detection of the validation samples in the EmotiW1 and EmotiW2 settings of the EmotiW dataset. As the main focus of this paper is studying the performance of AEs in disengagement detection as an anomaly detection problem, for each AE model, the result of its binary classification counterpart is also reported. The results of using two feature vectors of behavioral features only and behavioral and affect features together are reported for each model. The higher AUC ROC and AUC PR of the AE models, including feedforward-, LSTM-, and TCN-AEs, compared to their equivalent binary classifiers (described in Section 4.2) show the effectiveness of detecting disengagement in an anomaly detection setting. In almost all the models, after adding the affect features to the behavioral features, the performance improves. It shows the effectiveness of the affect features as indicators of engagement (or disengagement). The superiority of the temporal LSTM- and TCN- models, analyzing sequences of feature vectors, over non-temporal feedforward models, shows the importance of analyzing the temporal changes in the affect and behavioral states for disengagement detection. In all the DAiSEE, EmotiW1, and EmotiW2, TCN-AE significantly outperforms LSTM-AE. This is because of the superiority of TCN in modeling sequences of larger length and retaining memory of history compared to the LSTM.

Table 4: AUC ROC and AUC PR of different sets of features and different neural network models for disengagement detection on the test set of the DAiSEE dataset, see Section 4.3. BC, and AE are Binary Classifier, and Auto Encoder, respectively.

Feature	Model	AUC ROC	AUC PR
end-to-end	C3D + LSTM-BC abedi2021improving	0.5931	0.1058
end-to-end	C3D + TCN-BC abedi2021improving	0.6076	0.1277
end-to-end	ResNet + LSTM-BC abedi2021improving	0.6243	0.146
end-to-end	ResNet + TCN-BC abedi2021improving	0.6084	0.1409
end-to-end	3D CNN-BC gupta2016daisee	0.6022	0.1131
end-to-end	3D CNN-AE	0.6119	0.1501
behavioral	feedforward-BC abedi2021affect	0.6855	0.1262
behavioral + affect	feedforward-BC abedi2021affect	0.6837	0.1141
behavioral	feedforward-AE	0.7093	0.1594
behavioral + affect	feedforward-AE	0.7171	0.2023
behavioral	LSTM-BC abedi2021affect	0.6742	0.1675
behavioral + affect	LSTM-BC abedi2021affect	0.6918	0.1556
behavioral	LSTM-AE	0.7007	0.1972
behavioral + affect	LSTM-AE	0.7733	0.2400
behavioral	TCN-BC abedi2021affect	0.6755	0.2027
behavioral + affect	TCN-BC abedi2021affect	0.7160	0.2182
behavioral + affect	TCN-BC (WL) abedi2021affect	0.7516	0.2007
behavioral	TCN-AE	0.7600	0.2324
behavioral + affect	TCN-AE	0.7974	0.2632

Table 5: AUC ROC and AUC PR of different sets of features and different neural network models for disengagement detection on the validation set of the EmotiW dataset in EmotiW1 and EmotiW2 settings, see Section 4.3. BC, and AE are Binary Classifier, and Auto Encoder, respectively.

Setting	Feature	Model	AUC ROC	AUC PR
EmotiW1	behavioral	LSTM-BC abedi2021affect	0.7811	0.3111
EmotiW1	behavioral + affect	LSTM-BC abedi2021affect	0.8077	0.3302
EmotiW1	behavioral	LSTM-AE	0.9012	0.3494
EmotiW1	behavioral + affect	LSTM-AE	0.907	0.5028
EmotiW1	behavioral	TCN-BC abedi2021affect	0.8372	0.301
EmotiW1	behavioral + affect	TCN-BC abedi2021affect	0.8488	0.3078
EmotiW1	behavioral + affect	TCN-BC (WL) abedi2021affect	0.8641	0.3127
EmotiW1	behavioral	TCN-AE	0.9128	0.3312
EmotiW1	behavioral + affect	TCN-AE	0.936	0.4125
EmotiW2	behavioral	LSTM-BC abedi2021affect	0.6883	0.3908
EmotiW2	behavioral + affect	LSTM-BC abedi2021affect	0.7056	0.4035
EmotiW2	behavioral	LSTM-AE	0.7424	0.4451
EmotiW2	behavioral + affect	LSTM-AE	0.7446	0.4448
EmotiW2	behavioral	TCN-BC abedi2021affect	0.7121	0.4148
EmotiW2	behavioral + affect	TCN-BC abedi2021affect	0.7013	0.4077
EmotiW2	behavioral + affect	TCN-BC (WL) abedi2021affect	0.7169	0.4215
EmotiW2	behavioral	TCN-AE	0.7403	0.4445
EmotiW2	behavioral + affect	TCN-AE	0.7489	0.4582

For the DAiSEE in Table 4, the results of end-to-end binary classification models are also reported for comparison (see abedi2021improving for details of the end-to-end models). Due to the longer length of videos, these models are not applicable to EmotiW. While the performance of 3D CNN AE is better than 3D CNN binary classifier and other end-to-end models, the performance of the end-to-end models is much lower than feature-based models. Therefore, the behavioral and affect handcrafted features are more successful in differentiation between engagement and disengagement.

The baseline value for AUC PR (the performance of a random binary classifier) is the ratio of the number of positive samples to the total number of samples khan2022unsupervised . Referring to Table 2 and Table 3, this baseline value is 88 / 1784 = 0.0493, 4 / 48 = 0.0833, and 14 / 48 = 0.2917 for DAiSEE, EmotiW1, and EmotiW2, respectively. As can be seen in Table 4 and 5, all the developed models outperform the random binary classifier by a large margin. Also, the AUC ROC of the developed models, specially AEs, is much higher than the AUC ROC of a random classifier (0.5).

Due to the imbalanced distribution of samples in the datasets 4.1, we reimplement the best binary classifiers (using both behavioral and affect features and TCN) using a weighted binary cross-entropy loss function abedi2021improving ; paszke2019pytorch . According to Tables 4 and 5, the binary classifiers with weighted loss functions (indicated by WL) outperform their non-weighted counterparts; however, their performance remains inferior to that of AEs.

Comparing the performance of the TCN-AE and TCN binary classifier using behavioral and affect features for the DAiSEE, EmotiW1, and EmotiW2 datasets in Table 4 and Table 5, the percentages of improvement in AUC ROC are 11.37%, 10.27%, and 6.78%, respectively, and the percentages of improvement in AUC PR are 20.62%, 34.01%, and 12.39%, respectively. Therefore, the average improvement of the TCN-AE, compared to the TCN binary classifier, is $9$ % on AUC ROC and $22$ % on AUC PR.

The confusion matrices in Table 6 are the results of thresholding reconstruction errors and output class probabilities of TCN-AE and TCN binary classifier using behavioral and affect features on the DAiSEE, EmotiW1, and EmotiW2 datasets. As the TCN-AE is trained in an unsupervised setting and only on the engaged (normal) samples, the disengaged (anomalous) samples cannot be used for finding an optimal threshold for thresholding reconstruction errors thill2021temporal ; khan2017detecting . Therefore, the instructions in khan2017detecting are followed to find an optimal threshold using only engaged (normal) samples of the training set. The numbers in the confusion matrices are the number of correctly and incorrectly classified samples. In all the DAiSEE, EmotiW1, and EmotiW2 datasets, TCN-AE outperforms TCN binary classifier in detecting disengaged samples.

Table 6: Confusion matrices of the TCN-AE and TCN binary classifier (TCN-BC) using behavioral and affect features on DAiSEE, EmotiW1, and EmotiW2 after thresholding reconstruction error of TCN-AE and output class probabilities of TCN-BC. Eng., and Dis. are Engaged (negative), and Disengaged (positive), respectively. The numbers are the number of correctly and incorrectly classified samples, (starting from Eng. and Eng. and clockwise) corresponding to true-negative, false-positive, true-positive, and false-negative. see Section 4.3.

	DAiSEE	EmotiW1	EmotiW2
TCN-BC	Eng. Dis. Eng. 1254 442 Dis. 51 37	Eng. Dis. Eng. 35 9 Dis. 1 3	Eng. Dis. Eng. 26 8 Dis. 9 5
TCN-AE	Eng. Dis. Eng. 1273 423 Dis. 38 50	Eng. Dis. Eng. 32 12 Dis. 0 4	Eng. Dis. Eng. 27 7 Dis. 3 11

5 Conclusion and Future Work

In this paper, for the first time, we defined and addressed the problem of detecting student disengagement in videos as an anomaly detection problem. We designed various AE neural networks, including TCN-AE thill2021temporal , using different behavioral and affect features and evaluated their performance on two publicly available student engagement measurement datasets, DAiSEE gupta2016daisee and EmotiW kaur2018prediction . The proposed approach successfully detects disengagement with high AUC ROC and AUC PR values compared to binary engagement/disengagement classifiers. In future work, we plan to investigate incorporating attention mechanism into the architecture of AEs for disengagement detection to make them pay attention to significant behavioral and affect states and temporal changes in the behavioral and affect states that are indicators of disengagement and to make the developed AEs more interpretable. We plan to explore supervised and unsupervised contrastive learning approaches kopuklu2021driver ; khan2021modified for video-based engagement/disengagement measurement. We aim to collect video datasets in virtual rehabilitation settings and analyze the effectiveness of the proposed approach to measure engagement in other types of virtual learning environments.

Data availability
The datasets analyzed during the current study are publicly available in the following repositories:
https://people.iith.ac.in/vineethnb/resources/daisee/index.html
https://sites.google.com/view/emotiw2020/

Conflict of Interest
The authors declare that they have no conflict of interest.

References

\bibcommenthead
(1) Dhawan, S.: Online learning: A panacea in the time of covid-19 crisis. Journal of educational technology systems 49(1), 5–22 (2020)
(2) Dumford, A.D., Miller, A.L.: Online learning in higher education: exploring advantages and disadvantages for engagement. Journal of Computing in Higher Education 30(3), 452–465 (2018)
(3) Sümer, Ö., Goldberg, P., D’Mello, S., Gerjets, P., Trautwein, U., Kasneci, E.: Multimodal engagement analysis from facial videos in the classroom. IEEE Transactions on Affective Computing (2021)
(4) Sinatra, G.M., Heddy, B.C., Lombardi, D.: The challenges of defining and measuring student engagement in science. Taylor & Francis (2015)
(5) D’Mello, S., Dieterle, E., Duckworth, A.: Advanced, analytic, automated (aaa) measurement of engagement during learning. Educational psychologist 52(2), 104–123 (2017)
(6) Woolf, B., Burleson, W., Arroyo, I., Dragon, T., Cooper, D., Picard, R.: Affect-aware tutors: recognising and responding to student affect. International Journal of Learning Technology 4(3/4), 129–164 (2009)
(7) Dewan, M., Murshed, M., Lin, F.: Engagement detection in online learning: a review. Smart Learning Environments 6(1), 1–20 (2019)
(8) Doherty, K., Doherty, G.: Engagement in hci: conception, theory and measurement. ACM Computing Surveys (CSUR) 51(5), 1–39 (2018)
(9) Abedi, A., Khan, S.S.: Improving state-of-the-art in detecting student engagement with resnet and tcn hybrid network. In: 2021 18th Conference on Robots and Vision (CRV), pp. 151–157 (2021). IEEE
(10) Aung, A.M., Whitehill, J.: Harnessing label uncertainty to improve modeling: An application to student engagement recognition. In: FG, pp. 166–170 (2018)
(11) Chen, X., Niu, L., Veeraraghavan, A., Sabharwal, A.: Faceengage: robust estimation of gameplay engagement from user-contributed (youtube) videos. IEEE Transactions on Affective Computing (2019)
(12) Booth, B.M., Ali, A.M., Narayanan, S.S., Bennett, I., Farag, A.A.: Toward active and unobtrusive engagement assessment of distance learners. In: 2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII), pp. 470–476 (2017). IEEE
(13) Gupta, A., D’Cunha, A., Awasthi, K., Balasubramanian, V.: Daisee: Towards user engagement recognition in the wild. arXiv preprint arXiv:1609.01885 (2016)
(14) Mehta, N.K., Prasad, S.S., Saurav, S., Saini, R., Singh, S.: Three-dimensional densenet self-attention neural network for automatic detection of student’s engagement. Applied Intelligence, 1–21 (2022)
(15) Abedi, A., Khan, S.: Affect-driven engagement measurement from videos. arXiv preprint arXiv:2106.10882 (2021)
(16) Liao, J., Liang, Y., Pan, J.: Deep facial spatiotemporal network for engagement prediction in online learning. Applied Intelligence 51(10), 6609–6621 (2021)
(17) Huang, T., Mei, Y., Zhang, H., Liu, S., Yang, H.: Fine-grained engagement recognition in online learning environment. In: 2019 IEEE 9th International Conference on Electronics Information and Emergency Communication (ICEIEC), pp. 338–341 (2019). IEEE
(18) Ma, X., Xu, M., Dong, Y., Sun, Z.: Automatic student engagement in online learning environment based on neural turing machine. International Journal of Information and Education Technology 11(3), 107–111 (2021)
(19) Dresvyanskiy, D., Minker, W., Karpov, A.: Deep learning based engagement recognition in highly imbalanced data. In: International Conference on Speech and Computer, pp. 166–178 (2021). Springer
(20) Thomas, C., Nair, N., Jayagopi, D.B.: Predicting engagement intensity in the wild using temporal convolutional network. In: Proceedings of the 20th ACM International Conference on Multimodal Interaction, pp. 604–610 (2018)
(21) Kaur, A., Mustafa, A., Mehta, L., Dhall, A.: Prediction and localization of student engagement in the wild. In: 2018 Digital Image Computing: Techniques and Applications (DICTA), pp. 1–8 (2018). IEEE
(22) Copur, O., Nakıp, M., Scardapane, S., Slowack, J.: Engagement detection with multi-task training in e-learning environments. In: International Conference on Image Analysis and Processing, pp. 411–422 (2022). Springer
(23) Aslan, S., Mete, S.E., Okur, E., Oktay, E., Alyuz, N., Genc, U.E., Stanhill, D., Esme, A.A.: Human expert labeling process (help): towards a reliable higher-order user state labeling process and tool to assess student engagement. Educational Technology, 53–59 (2017)
(24) Ranti, C., Jones, W., Klin, A., Shultz, S.: Blink rate patterns provide a reliable measure of individual engagement with scene content. Scientific reports 10(1), 1–10 (2020)
(25) Khan, S.S., Abedi, A., Colella, T.: Inconsistencies in measuring student engagement in virtual learning–a critical review. arXiv preprint arXiv:2208.04548 (2022)
(26) Kopuklu, O., Zheng, J., Xu, H., Rigoll, G.: Driver anomaly detection: A dataset and contrastive learning approach. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 91–100 (2021)
(27) Khan, S.S., Shen, Z., Sun, H., Patel, A., Abedi, A.: Modified supervised contrastive learning for detecting anomalous driving behaviours. arXiv preprint arXiv:2109.04021 (2021)
(28) Barlow, A., Brown, S., Lutz, B., Pitterson, N., Hunsu, N., Adesope, O.: Development of the student course cognitive engagement instrument (sccei) for college engineering courses. International Journal of STEM Education 7(1), 1–20 (2020)
(29) Canziani, B.F., Esmizadeh, Y., Nemati, H.R.: Student engagement with global issues: the influence of gender, race/ethnicity, and major on topic choice. Teaching in Higher Education, 1–22 (2021)
(30) Thill, M., Konen, W., Wang, H., Bäck, T.: Temporal convolutional autoencoder for unsupervised anomaly detection in time series. Applied Soft Computing 112, 107751 (2021)
(31) D’Mello, S., Graesser, A.: Dynamics of affective states during complex learning. Learning and Instruction 22(2), 145–157 (2012)
(32) Bosch, N.: Detecting student engagement: human versus machine. In: Proceedings of the 2016 Conference on User Modeling Adaptation and Personalization, pp. 317–320 (2016)
(33) Niu, X., Han, H., Zeng, J., Sun, X., Shan, S., Huang, Y., Yang, S., Chen, X.: Automatic engagement prediction with gap feature. In: Proceedings of the 20th ACM International Conference on Multimodal Interaction, pp. 599–603 (2018)
(34) Wu, J., Yang, B., Wang, Y., Hattori, G.: Advanced multi-instance learning method with multi-features engineering and conservative optimization for engagement intensity prediction. In: Proceedings of the 2020 International Conference on Multimodal Interaction, pp. 777–783 (2020)
(35) Altuwairqi, K., Jarraya, S.K., Allinjawi, A., Hammami, M.: A new emotion–based affective model to detect student’s engagement. Journal of King Saud University-Computer and Information Sciences 33(1), 99–109 (2021)
(36) Toisoul, A., Kossaifi, J., Bulat, A., Tzimiropoulos, G., Pantic, M.: Estimation of continuous valence and arousal levels from faces in naturalistic conditions. Nature Machine Intelligence 3(1), 42–50 (2021)
(37) Mollahosseini, A., Hasani, B., Mahoor, M.H.: Affectnet: A database for facial expression, valence, and arousal computing in the wild. IEEE Transactions on Affective Computing 10(1), 18–31 (2017)
(38) Bai, S., Kolter, J.Z., Koltun, V.: An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271 (2018)
(39) Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al.: Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 (2019)
(40) Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al.: Scikit-learn: Machine learning in python. the Journal of machine Learning research 12, 2825–2830 (2011)
(41) Baltrusaitis, T., Zadeh, A., Lim, Y.C., Morency, L.-P.: Openface 2.0: Facial behavior analysis toolkit. In: 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pp. 59–66 (2018). IEEE
(42) Khan, S.S., Mishra, P.K., Javed, N., Ye, B., Newman, K., Mihailidis, A., Iaboni, A.: Unsupervised deep learning to detect agitation from videos in people with dementia. IEEE Access 10, 10349–10358 (2022)
(43) Khan, S.S., Taati, B.: Detecting unseen falls from wearable devices using channel-wise ensemble of autoencoders. Expert Systems with Applications 87, 280–290 (2017)