
Micro-Expression Recognition Based on Attribute Information Embedding and Cross-modal Contrastive Learning

Yanxin Song, Jianzong Wang*, Tianbo Wu, Zhangcheng Huang, Jing Xiao
Ping An Technology (Shenzhen) Co., Ltd., Shenzhen, China
*Corresponding author: Jianzong Wang, [email protected]
Abstract

Facial micro-expression recognition has attracted much attention recently. Micro-expressions have short duration and low intensity, and the limited number of existing micro-expression samples makes it difficult to train a high-performance classifier. Recognizing micro-expressions is therefore a challenging task. In this paper, we propose a micro-expression recognition method based on attribute information embedding and cross-modal contrastive learning. We use a 3D CNN to extract RGB features and optical-flow (FLOW) features of micro-expression sequences and fuse them, and use a BERT network to extract text information from the Facial Action Coding System (FACS). Through a cross-modal contrastive loss, we embed attribute information in the visual network, thereby improving the representation ability of micro-expression recognition in the case of limited samples. We conduct extensive experiments on the CASME II and MMEW databases and achieve accuracies of 77.82% and 71.04%, respectively. Comparative experiments show that this method achieves better recognition performance than other micro-expression recognition methods.

Index Terms:
Micro-expression recognition, 3D CNN, BERT, Attribute embedding, Cross-modal contrastive learning loss

I Introduction

Micro-expression is a very brief, subtle and involuntary facial expression [1]. It can reveal the true emotions people are trying to hide, so micro-expressions can provide useful clues for detecting lies and can help humans make auxiliary judgments. Currently, micro-expressions are being explored in various disciplines such as psychology, sociology, neuroscience, and computer vision. Micro-expressions have shorter durations than macro-expressions: most researchers agree that a macro-expression lasts 0.5 to 4 seconds, while a micro-expression should not exceed 0.5 seconds [2].

To address the task of micro-expression recognition, several methods have been proposed to model the subtle changes of micro-expressions in the spatio-temporal domain [3]. These methods mainly consist of two parts. The first part extracts visual features from facial video clips. The second part chooses a classifier for the extracted features, such as an SVM or a softmax classifier. According to the feature extraction technique, these methods fall into two categories: handcrafted feature methods and deep feature methods. Although handcrafted features are easy to implement and have clear geometric or spatio-temporal interpretations, they are not stable for recognizing and classifying micro-expressions because of the short duration and low intensity of micro-expressions; when deep learning is used to recognize micro-expressions, the small number of samples is an important factor limiting accuracy. The widely used micro-expression databases are SAMM [4], CASME [5], CASME II [6], SMIC [7], CAS(ME)2 [8] and MMEW [9]. However, the number of samples in these databases is limited, and even the largest contains fewer than a thousand samples.

Due to the small number of samples, micro-expression recognition can be regarded as a few-shot learning problem. Few-shot learning aims to learn an effective model that can identify new classes from a small number of samples. In recent years, many methods have achieved good results in this field, mainly falling into three types: data-based [10, 11, 12], model-based [13, 14, 15], and algorithm-based [16, 17, 18]. Among them, contrastive learning is an effective solution. The essence of contrastive learning is to pull similar positive sample pairs closer and push negative sample pairs farther apart in the feature space.

To address the scarcity of micro-expression samples, we propose a micro-expression recognition method based on attribute information embedding and cross-modal contrastive learning. First, the micro-expression video sequence is divided into an RGB sequence and a FLOW sequence, a 3D CNN is used to extract the RGB feature and FLOW feature of the micro-expression sequence, and the two features are then fused. A BERT network is used to extract the text information of the micro-expressions from their FACS encoding. Through a cross-modal contrastive loss, we embed attribute information as auxiliary knowledge into the visual network to enhance the representation ability of micro-expression recognition. Our contributions are:

(1) Attribute information is creatively introduced into micro-expression recognition. This paper uses FACS to map AUs to their corresponding attribute information and embeds this attribute information in the video network.

(2) A cross-modal contrastive learning loss is proposed. It is used to optimize the network so that different modalities of the same sample are drawn closer while different samples are pushed farther apart, thereby learning a stronger feature representation.

This paper is structured as follows: Section II introduces related work; Section III introduces the proposed method; the experimental results and analysis are presented in Section IV; finally, Section V concludes the paper.

II Related work

Much work has been devoted to micro-expression recognition, which is mainly divided into handcrafted feature methods and deep feature methods.

II-A Handcrafted Features

Many researchers have proposed algorithms based on local binary patterns (LBP), such as local binary patterns on three orthogonal planes (LBP-TOP) [19] and local binary patterns with six intersection points (LBP-SIP) [20]. Huang et al. [21] proposed the Spatiotemporal Local Binary Pattern based on Integral Projection (STLBP-IP); this operator uses a global projection method based on the difference image to obtain horizontal and vertical projections, and applies the LBP operator to extract appearance and motion features along these two projection directions. Ben et al. [22] proposed improved Dual-Cross Patterns from Three Orthogonal Planes (DCP-TOP) and Hot Wheel Patterns (HWP). At the same time, there have been many advances in optical-flow-based micro-expression recognition methods. Xu et al. [23] used the optical flow field to build the Facial Dynamics Map (FDM). Since background noise, scale changes, and motion direction affect the calculation of optical flow, Chaudhry et al. [24] proposed the Histogram of Oriented Optical Flow (HOOF), which captures temporal information while remaining robust to scale changes and motion directions. The authors of [25] proposed another optical-flow-based feature descriptor, Bidirectional Weighted Optical Flow (Bi-WOOF), which can adaptively assign different weights to changes in local feature regions. To obtain a simple and effective optical flow feature, Liu et al. [26] proposed the Main Directional Mean Optical-flow (MDMO) feature. MDMO has only 72 dimensions, which effectively reduces the amount of computation. However, MDMO features are computed by averaging a set of features frame by frame; although this averaging operation is simple, it easily loses the underlying manifold structure of the feature space. To improve MDMO, Liu et al. [27] proposed a sparse MDMO feature based on it.

II-B Deep Features

Deep-learning-based methods mainly fall into two categories: two-step networks and 3D convolutional neural networks.

The two-step network first extracts spatial features of the micro-expression and then uses a time-series model, such as an RNN or LSTM, to extract the temporal features of the sequence [28]. Due to the limited samples, handcrafted features such as optical flow and HOOF are often used in the first step [29]. The other approach is the 3D CNN [30]. Wang et al. [31] proposed a deep learning model for micro-expression recognition, the Transferring Long-term Convolutional Neural Network (TLCNN); they used transfer learning to initialize the CNN and trained its weights to avoid overfitting. TLCNN uses two steps of transfer learning: transferring from expression data and transferring from single frames of micro-expression video clips. Liong et al. [32] designed a shallow three-stream 3D CNN (STSTNet). They first preprocessed the micro-expression sequence and computed three optical flow features based on the apex frame and the onset frame; these features are used as input to train the network so that it can learn representative and discriminative features of the micro-expression sequence. To capture fine texture information from micro-expression videos, Gajjala et al. [33] introduced spatio-temporal attention and channel attention into a 3D CNN. The Graph Temporal Convolutional Network (Graph-TCN) [34] was proposed by Lei et al. to capture subtle facial movements; the network uses a video motion magnification method to enhance the motion intensity of micro-expressions, builds a graph structure based on facial landmarks, and then learns the graph representation for classification. Wang et al. [35] proposed a 2D-3D CNN consisting of Net-A and Net-B: the former uses multi-scale convolution to extract spatio-temporal features, and the latter extracts the spatial information of the differential image space.

In addition, analyzing AUs is essential for recognizing subtle physical changes in facial expressions, because AUs are the basic units of micro-expressions. Facial AU recognition has attracted considerable research attention; some studies [36], [37] address recent AU recognition tasks, but they only recognize the facial region of an AU and ignore the fact that the AU itself also carries attribute information. In this paper, we utilize the attribute information of AUs and embed it into the video network to learn more robust features.

III Methodology

Because micro-expression samples are few and of short duration, it is difficult to improve the micro-expression recognition rate. To obtain a stronger feature representation, this paper designs a micro-expression recognition method based on attribute information embedding and cross-modal contrastive learning. By constructing a cross-modal contrastive learning loss, attribute information is embedded in the video network to improve its representation ability. The network structure is shown in Figure 1.

Figure 1: The illustration of our architecture. The whole network consists of two subnets, including the video feature extraction network (upper branch) for visual information and attribute feature extraction network (lower branch) for attribute information.

III-A Video Feature Extraction Network

Micro-expression recognition is an image-sequence recognition task, that is, recognizing changes between image frames. The 3D CNN [38, 39], which extends the 2D CNN, performs well on video sequences: it can extract spatio-temporal information and is often used for video classification and action recognition. At the same time, unlike other image-sequence classification tasks, micro-expression recognition involves very small changes between frames, and optical flow sequences can capture the subtle motion between adjacent frames. The video feature extraction network adopts a dual-stream design, as shown in the upper branch of Figure 1. It includes an RGB network and a FLOW network, both using 3D-Resnet as the encoder with unshared parameters. The layer-wise network parameters are listed in Table I. The encoded features are $z_{rgb}$ and $z_{flow}$, each of dimension 128; the two features are then fused to obtain the 256-dimensional feature $z_m$:

z_m = \mathrm{Concat}(z_{rgb}, z_{flow})   (1)
TABLE I: Network parameters of 3D-Resnet (each bracketed block contains two 3×3×3 convolutions; the multiplier is the number of blocks)
Layer name | Output size | 3D-Resnet10 | 3D-Resnet18 | 3D-Resnet34
conv1 | 8×56×56 | 3×7×7, 64, stride 1×2×2 (same for all depths)
conv2_x | 8×56×56 | [3×3×3, 64; 3×3×3, 64] ×1 | ×2 | ×3
conv3_x | 4×28×28 | [3×3×3, 128; 3×3×3, 128] ×1 | ×2 | ×4
conv4_x | 2×14×14 | [3×3×3, 256; 3×3×3, 256] ×1 | ×2 | ×6
conv5_x | 1×7×7 | [3×3×3, 512; 3×3×3, 512] ×1 | ×2 | ×3
output | 1×1×1 | average pool, 128-d fc, softmax (same for all depths)
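
As an illustration of this dual-stream design and the fusion of Eq. (1), the following PyTorch sketch uses torchvision's r3d_18 as a stand-in backbone (the exact 3D-Resnet10/18/34 configurations of Table I are not reproduced); the 128-d projection heads and the 3-channel flow input are illustrative assumptions.

    import torch
    import torch.nn as nn
    from torchvision.models.video import r3d_18  # stand-in 3D-ResNet backbone

    class DualStreamEncoder(nn.Module):
        """Two-stream video encoder sketch: one 3D CNN for the RGB clip, one for the
        optical-flow clip, with unshared weights and concatenation fusion (Eq. 1)."""
        def __init__(self, feat_dim=128):
            super().__init__()
            self.rgb_net = r3d_18()
            self.flow_net = r3d_18()
            # replace the classification heads with 128-d projection layers
            self.rgb_net.fc = nn.Linear(self.rgb_net.fc.in_features, feat_dim)
            self.flow_net.fc = nn.Linear(self.flow_net.fc.in_features, feat_dim)

        def forward(self, rgb_clip, flow_clip):
            # rgb_clip, flow_clip: (B, 3, T, H, W); the 2-channel flow is assumed to be
            # expanded to 3 channels (e.g. by appending the flow magnitude) beforehand
            z_rgb = self.rgb_net(rgb_clip)            # (B, 128)
            z_flow = self.flow_net(flow_clip)         # (B, 128)
            return torch.cat([z_rgb, z_flow], dim=1)  # z_m: (B, 256), Eq. (1)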

III-B Attribute Feature Extraction Network

Attribute learning has been studied in various applications. Most notably, attribute learning and zero-shot learning have become increasingly closely combined. Zero-shot learning assumes that each class is composed of a series of attributes and identifies new classes based on class-level attributes. To do this, the model takes attribute information as input to train the network so that new classes can be identified without training data. This paper uses attribute information to constrain video features.

TABLE II: Facial action coding system
         Action Unit          Description          Action Unit          Description
         AU1          Inner Brow Raiser          AU18          Lip Puckerer
         AU2          Outer Brow Raiser          AU20          Lip stretcher
         AU4          Brow Lowerer          AU22          Lip Funneler
         AU5          Upper Lid Raiser          AU23          Lip Tightener
         AU6          Cheek Raiser          AU24          Lip Pressor
         AU7          Lid Tightener          AU25          Lips part
         AU9          Nose Wrinkler          AU26          Jaw Drop
         AU10          Upper Lip Raiser          AU27          Mouth Stretch
         AU11          Nasolabial Deepener          AU28          Lip Suck
         AU12          Lip Corner Puller          AU41          Lid droop
         AU13          Cheek Puffer          AU42          Slit
         AU14          Dimpler          AU43          Eyes Closed
         AU15          Lip Corner Depressor          AU46          Wink
         AU16          Lower Lip Depressor          AU44          Squint
         AU17          Chin Raiser          AU45          Blink

FACS (Facial Action Coding System) [40] is a facial behavior coding system that describes groups of facial muscle movement states, as shown in Table II. Emotion can be analyzed and judged by using the facial action coding system. Micro-expression databases usually provide the AU composition of each emotion category, but existing work typically classifies only the emotion categories and ignores the attribute information contained in the AUs. According to Table II, each AU can be mapped to corresponding attribute information. For example, in the CASME II dataset, happiness corresponds to AU6+AU12, and the corresponding attribute information is: the cheeks are raised and the lip corners are pulled up.
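
A minimal sketch of this AU-to-attribute mapping is shown below; the descriptions come from Table II, while the sentence template and the helper name aus_to_attribute_text are illustrative assumptions.

    # Map a sample's AU annotation (e.g. "AU6+AU12") to an attribute sentence
    # using the Table II descriptions (only a few AUs are listed here).
    AU_DESCRIPTIONS = {
        "AU1": "inner brow raiser", "AU2": "outer brow raiser", "AU4": "brow lowerer",
        "AU6": "cheek raiser", "AU12": "lip corner puller", "AU15": "lip corner depressor",
        # ... remaining AUs from Table II
    }

    def aus_to_attribute_text(au_label: str) -> str:
        """'AU6+AU12' -> 'cheek raiser, lip corner puller'"""
        units = [u.strip() for u in au_label.split("+")]
        return ", ".join(AU_DESCRIPTIONS.get(u, u) for u in units)

    print(aus_to_attribute_text("AU6+AU12"))  # cheek raiser, lip corner puller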

BERT [41] is a pre-training model proposed by Google AI in October 2018. Its network structure is mainly implemented with Transformer encoders and is mainly used to extract text information. This paper uses the BERT network to extract the attribute information of the sample, as shown in the lower branch of Figure 1. Given a sample $x_i$, the corresponding semantic information after FACS mapping is $t_i$, and the 256-dimensional feature $z_a^i$ is obtained through the BERT network:

z_a^i = \mathrm{BERT}(t_i)   (2)
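
The attribute branch of Eq. (2) can be sketched as follows with the Hugging Face BERT implementation; the choice of bert-base-uncased, the use of the pooled [CLS] output, and the linear projection to 256 dimensions are assumptions, since the paper only specifies BERT and the output dimension.

    import torch.nn as nn
    from transformers import BertModel, BertTokenizer  # Hugging Face BERT

    class AttributeEncoder(nn.Module):
        """Encodes the FACS-derived attribute text t_i into a 256-d feature z_a (Eq. 2)."""
        def __init__(self, out_dim=256):
            super().__init__()
            self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
            self.bert = BertModel.from_pretrained("bert-base-uncased")
            self.proj = nn.Linear(self.bert.config.hidden_size, out_dim)  # 768 -> 256

        def forward(self, texts):
            toks = self.tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
            out = self.bert(**toks)
            return self.proj(out.pooler_output)  # z_a: (B, 256)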

III-C Cross-Modal Contrastive Loss

Given a sample $x_i$, the input of the video feature extraction network is $v_i$, which is encoded as $z_m^i$. The input of the attribute feature extraction network is $t_i$, which is encoded as $z_a^i$. $f_\theta(\cdot)$ and $f_\varphi(\cdot)$ are the corresponding encoders, with network parameters $\theta$ and $\varphi$, respectively. In this work, we hope to inject attribute information into the visual features so as to learn richer and more representative micro-expression representations.

Depending on whether the inputs of the two networks come from the same micro-expression sample, we construct positive and negative sample pairs: $x=\{v_i, t_i\}$ is a positive sample pair, and $y=\{v_i, t_j\}$ ($i \neq j$) is a negative sample pair. Each time, we select one positive sample pair $x$ and $k$ negative sample pairs $\{y_1, y_2, \ldots, y_k\}$ to compute the loss. The cross-modal contrastive loss is:

L_{\theta,\varphi} = -E_S\left[\log\frac{d_{\theta,\varphi}(x)}{d_{\theta,\varphi}(x)+\sum_{i=1}^{k} d_{\theta,\varphi}(y_i)}\right]   (3)

where $S=\{x, y_1, y_2, \ldots, y_k\}$ is the set of all selected sample pairs, and $d_{\theta,\varphi}(\cdot)$ is the distance function, which measures the cosine similarity of the two modal features:

d_{\theta,\varphi}(\{v_i, t_i\}) = \exp\left(\frac{f_\theta(v_i)\cdot f_\varphi(t_i)}{\left\|f_\theta(v_i)\right\|\cdot\left\|f_\varphi(t_i)\right\|}\right)   (4)

Since $f_\theta(v_i)=z_m^i$ and $f_\varphi(t_i)=z_a^i$, Eq. (4) can also be written as:

d_{\theta,\varphi}(\{v_i, t_i\}) = \exp\left(\frac{z_m^i\cdot z_a^i}{\left\|z_m^i\right\|\cdot\left\|z_a^i\right\|}\right)   (5)
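
A minimal PyTorch sketch of Eqs. (3)-(5) is given below; it assumes the negatives are drawn from the other samples in a mini-batch, which is one common way to realize the sampling of $S$, not necessarily the paper's exact sampling scheme.

    import torch
    import torch.nn.functional as F

    def cross_modal_contrastive_loss(z_m, z_a):
        """z_m, z_a: (B, 256) video and attribute features of the same mini-batch.
        The matching (video, attribute) pair is the positive; the other B-1 attribute
        features act as the k negatives."""
        z_m = F.normalize(z_m, dim=1)
        z_a = F.normalize(z_a, dim=1)
        d = torch.exp(z_m @ z_a.t())                 # exp(cosine similarity), Eq. (5)
        # Eq. (3): -log( d(positive) / (d(positive) + sum of d(negatives)) )
        loss = -torch.log(d.diag() / d.sum(dim=1))
        return loss.mean()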

To ensure classification performance, this paper uses the cross-entropy loss for classification. After adding a softmax classifier to the video feature extraction network and the attribute feature extraction network respectively, the corresponding classification losses are $L_\theta$ and $L_\varphi$:

L_\theta = -\sum_{i=1}^{n} p(v_i)\log(q(v_i))   (6)
L_\varphi = -\sum_{i=1}^{n} p(t_i)\log(q(t_i))   (7)

where $p(\cdot)$ denotes the probability that the sample belongs to a certain class under the true distribution, $q(\cdot)$ denotes the probability under the predicted distribution, and $n$ is the number of classes.

The total loss is:

L = (1-\alpha)(L_\theta + L_\varphi) + \alpha L_{\theta,\varphi}   (8)

where $\alpha$ is a weighting factor used to balance the classification loss and the cross-modal contrastive loss; its value ranges over $[0, 1]$.
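
The sketch below shows how the three terms of Eq. (8) could be combined in one training step; it reuses cross_modal_contrastive_loss from the sketch above, and the classifier heads video_cls and attr_cls are illustrative names for the two softmax classifiers.

    import torch.nn as nn

    ce = nn.CrossEntropyLoss()  # cross-entropy for Eqs. (6) and (7)

    def training_step(video_encoder, attr_encoder, video_cls, attr_cls,
                      rgb, flow, texts, labels, alpha=0.5):
        z_m = video_encoder(rgb, flow)                  # (B, 256)
        z_a = attr_encoder(texts)                       # (B, 256)
        l_theta = ce(video_cls(z_m), labels)            # Eq. (6)
        l_phi = ce(attr_cls(z_a), labels)               # Eq. (7)
        l_con = cross_modal_contrastive_loss(z_m, z_a)  # Eq. (3)
        return (1 - alpha) * (l_theta + l_phi) + alpha * l_con  # Eq. (8)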

IV Experiments

IV-A Datasets

This paper conducts extensive experiments on two datasets, CASME II and MMEW, to fully evaluate the performance of the proposed algorithm.

The CASME II micro-expression database mainly includes seven types of micro-expressions, namely happiness, surprise, fear, sadness, disgust, repression, and others. The samples are selected from video sequences recorded from 26 subjects, and there are 255 micro-expression samples in total. The frame rate of CASME II is 200 fps and the frame size is 680×480. This paper conducts experiments on happiness, disgust, repression, surprise and sadness.

The frame rate of the MMEW micro-expression database is 90 fps, and the frame size is 2040×1088. The database contains 300 micro-expression samples collected from 36 subjects, all from Shandong University. These samples are divided into seven types of emotions, namely sadness, happiness, disgust, surprise, anger, fear, and depression. This paper conducts experiments on happiness, disgust, surprise, sadness, fear and anger.

IV-B Pre-processing

The micro-expression datasets are preprocessed with dlib to align the face and locate facial landmarks. Each video frame is then cropped according to the facial landmarks and resized to 112×112. Finally, the TIM [42] model is used to interpolate both databases, and the number of frames after interpolation is 16. This paper uses the Farneback optical flow algorithm [43] to extract the optical flow sequences.
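
The optical-flow step can be sketched with OpenCV's Farneback implementation as follows; face alignment with dlib and the TIM interpolation are separate steps not shown here, and the Farneback parameter values are common defaults rather than the paper's settings.

    import cv2
    import numpy as np

    def flow_sequence(frames):
        """frames: list of aligned 112x112 grayscale face images (uint8, 16 frames after TIM)."""
        flows = []
        for prev, nxt in zip(frames[:-1], frames[1:]):
            flow = cv2.calcOpticalFlowFarneback(
                prev, nxt, None,
                pyr_scale=0.5, levels=3, winsize=15,
                iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
            flows.append(flow)          # (H, W, 2): horizontal and vertical displacement
        return np.stack(flows)          # (T-1, H, W, 2)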

IV-C Experimental Setup

The experiments use Adam as the optimizer with a learning rate of 0.0001. The number of training iterations is 200, and the batch size is set to 32. The experiments are run on Ubuntu 16.04 with a Tesla V100-PCIE GPU, and the networks are built with the PyTorch framework. The dataset is split by subject, so the training and test sets contain no overlapping subjects. The ratio of the training set to the test set is 3:2, and the same split is used for all experiments. This paper uses accuracy to measure the experimental results.
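
Putting the stated settings together, a training loop could look like the sketch below (Adam, learning rate 1e-4, 200 iterations, batch size 32); the encoder, classifier and data-loader names are the illustrative components from the earlier sketches, not released code.

    import torch

    def train(video_encoder, attr_encoder, video_cls, attr_cls, train_loader, epochs=200):
        params = (list(video_encoder.parameters()) + list(attr_encoder.parameters())
                  + list(video_cls.parameters()) + list(attr_cls.parameters()))
        optimizer = torch.optim.Adam(params, lr=1e-4)       # Adam, learning rate 0.0001
        for _ in range(epochs):                             # 200 iterations
            for rgb, flow, texts, labels in train_loader:   # batches of size 32
                loss = training_step(video_encoder, attr_encoder, video_cls, attr_cls,
                                     rgb, flow, texts, labels)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()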

Implementation Details: During training, the 3D-Resnet and BERT networks are trained jointly, with BERT serving as an auxiliary network that embeds the prior knowledge of attribute information into the 3D-Resnet. During inference, only the trained 3D-Resnet is used to recognize video sequences, and BERT no longer participates in the inference process.

IV-D Selection of 3D Convolutional Neural Networks Layers

To evaluate the influence of the number of convolutional layers of 3D-Resnet, this paper selects 3D-Resnet10, 3D-Resnet18 and 3D-Resnet34 for experiments on the two public datasets. The experiment consists of two parts: the 3D CNN alone and the method proposed in this paper, denoted 3D-Resnet and Multi-3D-Resnet, respectively. The experimental results are shown in Figures 2 and 3.

Figure 2: Results of 3D CNN with different layers on CASME II and MMEW (%).
Figure 3: Results of the proposed method with different layers on CASME II and MMEW (%).

As can be seen from Figures 2 and 3, the recognition rate decreases as the number of layers of the 3D-Resnet network increases: the deeper the network, the lower the recognition rate. Because the number of samples in the micro-expression datasets is small, the model overfits as the number of network layers increases.

Comparing Figure 2 with Figure 3, after attribute information is introduced, the recognition rate of the proposed method is better than that of the single 3D-Resnet for network models with the same number of layers.

IV-E Experimental Results and Analysis

To verify the effectiveness of the proposed algorithm, this paper conducts extensive experiments on the CASME II and MMEW datasets and reports the average accuracy and standard deviation over five runs. The results, compared with other contemporary micro-expression recognition algorithms, are shown in Table III.

TABLE III: The average accuracy and standard deviation of this method and other algorithms on the CASME II and MMEW databases(%).
Method | CASME II | MMEW
FDM [23] | 40.03±2.33 | 34.62±2.54
LBP-TOP [19] | 48.94±3.45 | 37.05±3.54
MDMO [26] | 60.02±3.56 | 65.73±4.54
Sparse MDMO [27] | 64.46±3.64 | 60.07±4.32
ELRCN [28] | 55.63±2.46 | 41.53±3.26
NetAB [35] | 63.32±2.12 | 55.63±2.73
TLCNN [31] | 70.44±3.25 | 69.46±4.01
Ours | 77.82 | 71.04

Table III shows that the algorithm proposed in this paper outperforms the other listed algorithms. This paper makes full use of the attribute information of the AUs and uses cross-modal contrastive learning to embed the attribute information into the video network to guide the learning of the visual network, thereby extracting stronger visual features. In addition, among the handcrafted-feature methods, MDMO performs better than LBP-TOP and FDM: MDMO considers pixel changes from the perspective of optical flow motion, which reflects the subtle movement information of the face at a deeper level. Compared with the traditional machine learning methods, TLCNN achieves a better recognition effect; it combines a CNN with an LSTM and uses transfer learning for data augmentation, but it still uses only the video sequence information and does not exploit information from different modalities.

To explore the impact of the embedded attribute information on the video network, this paper examines the weighting factor $\alpha$ in detail. As $\alpha$ gradually increases, the accuracy first rises and then falls, as shown in Figure 4. When $\alpha$ increases, attribute information is gradually introduced into the video network and the learned features become richer. But when $\alpha$ becomes too large, too much attribute information is embedded in the video network at the expense of learning the specific micro-expression categories, so the accuracy drops.

Figure 4: The influence of the weighting factor $\alpha$ on the accuracy on CASME II and MMEW (%).

IV-F Ablation Study

This paper proposes a micro-expression recognition algorithm based on attribute information embedding and cross-modal contrastive learning. To verify its effectiveness, this paper designs an ablation experiment with two settings: 3D-Resnet10 only, and the method proposed in this paper, represented by $L_\theta$ and $L$ respectively. The ablation experiments are conducted on the CASME II and MMEW datasets, and the results are shown in Table IV.

TABLE IV: Ablation experiment of the method in this paper on two databases(%).
Loss | CASME II | MMEW
$L_\theta$ | 66.74 | 65.83
$L$ | 77.82 | 71.04

As shown in Table IV, the proposed algorithm achieves better results. On the CASME II and MMEW datasets, our method improves the accuracy by 11.08 and 5.21 percentage points, respectively, compared with 3D-Resnet10. When only the 3D-Resnet10 network is used to extract video features, the network is prone to overfitting due to the limited amount of micro-expression data; by constructing the cross-modal contrastive learning loss, our approach introduces prior knowledge into the video network and uses both visual features and attribute information to give the video network stronger recognition ability.

V Conclusion

In this work, we propose a micro-expression recognition method based on attribute information embedding and cross-modal contrastive learning. We use a 3D CNN to extract RGB features and FLOW features of micro-expression sequences and fuse them, and use a BERT network to extract text information from FACS. Through the cross-modal contrastive learning loss, attribute information is embedded in the video network to learn richer visual features. We conduct extensive experiments on two datasets and achieve better recognition rates than existing methods. We creatively introduce the attribute information corresponding to AUs into micro-expression recognition, and fusing information from different modalities is also one of the solutions for few-shot learning. We believe that our work can lead to more valuable explorations of micro-expression recognition.

Acknowledgment

This paper is supported by the Key Research and Development Program of Guangdong Province under grant No. 2021B0101400003. The corresponding author is Jianzong Wang from Ping An Technology (Shenzhen) Co., Ltd. ([email protected]).

References

  • [1] P. Ekman, “Telling lies: Clues to deceit in the marketplace, politics, and marriage,” WW Norton Company. 2009.
  • [2] X. Shen, Q. Wu and X. Fu, “Effects of the duration of expressions on the recognition of micro-expressions,” Journal of Zhejiang University Science, 2012, 13(3), pp. 221-230.
  • [3] X. Li et al., “Towards reading hidden emotions: A comparative study of spontaneous micro-expression spotting and recognition methods,” IEEE Transactions on Affective Computing, 2018, pp. 563-577.
  • [4] C. H. Yap, C. Kendrick and M. H. Yap, “SAMM long videos: A spontaneous facial micro-and macro-expressions dataset,” 15th IEEE International Conference on Automatic Face and Gesture Recognition. 2020, pp. 771-776.
  • [5] W. J. Yan, Q. Wu and Y. J. Liu, “CASME database: a dataset of spontaneous micro-expressions collected from neutralized faces,” 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition. IEEE. 2013, pp. 1-7.
  • [6] W. J. Yan, X. Li and S. J. Wang, “CASME II: An improved spontaneous micro-expression database and the baseline evaluation,” PloS one, 2014, 9(1).
  • [7] X. Li, T. P. Fister and X. Huang, “A Spontaneous Micro-expression Database: Inducement, collection and baseline,” IEEE International Conference and Workshops on Automatic Face and Gesture Recognition. IEEE, 2013, pp. 1-6.
  • [8] F. Qu, S. J. Wang and W. J. Yan, “CAS(ME)2: A Database of Spontaneous Macro-expressions and Micro-expressions,” International Conference on Human-Computer Interaction, 2018, pp. 38-69
  • [9] X. Ben, Y. Ren and J. Zhang, “Video-based Facial Micro-Expression Analysis: A Survey of Datasets, Features and Algorithms,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
  • [10] P. Shyam, S. Gupta, and A. Dukkipati, “Attentive recurrent comparators. In International Conference on Machine Learning,” 2017, pp. 3173–3181
  • [11] E. Schwartz, L. Karlinsky, J. Shtok, S. Harary, M. Marder, A. Kumar, R. Feris, R. Giryes, and A. Bronstein, “Deltaencoder: An effective sample synthesis method for few-shot object recognition,” In Advances in Neural Information Processing Systems. 2018, pp. 2850–2860.
  • [12] B. Liu, X. Wang, M. Dixit, R. Kwitt, and N. Vasconcelos. “Feature space transfer for data augmentation,” In Conference on Computer Vision and Pattern Recognition. 2018, pp. 9090–9098.
  • [13] T. Pfister, J. Charles, and A. Zisserman, “Domain-adaptive discriminative one-shot learning of gestures,” In European Conference on Computer Vision. 2014, pp. 814–829.
  • [14] A. Miller, A. Fisch, J. Dodge, A.-H. Karimi, A. Bordes, and J. Weston. “Key-value memory networks for directly reading documents,” In Conference on Empirical Methods in Natural Language Processing. 2016, pp. 1400–1409.
  • [15] R. Zhang, T. Che, Z. Ghahramani, Y. Bengio, and Y. Song, “MetaGAN: An adversarial approach to few-shot learning,” In Advances in Neural Information Processing Systems,” 2018, pp. 2371–2380.
  • [16] Y. X. Wang and M. Hebert, “Learning from small sample sets by combining unsupervised meta-training with CNNs,” In Advances in Neural Information Processing Systems. 2016, pp. 244–252.
  • [17] Y. X. Wang and M. Hebert, “Learning to learn: Model regression networks for easy small sample learning,” In European Conference on Computer Vision. 2016, pp. 616–634.
  • [18] S. Ravi and H. Larochelle, “Optimization as a model for few-shot learning. In International Conference on Learning Representations,” 2017
  • [19] G. Zhao and M. Pietikainen, “Dynamic texture recognition using local binary patterns with an application to facial expressions,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007, 10(1), pp. 915-928.
  • [20] Y. Wang, J. See and R. C. W. Phan , “LBP with six intersection points: Reducing redundant information in LBP-TOP for micro-expression recognition,” Asian Conference on Computer Vision. Springer, 2014, pp. 525-537.
  • [21] X. Huang, S. J. Wang, and G. Zhao, “Facial micro-expression recognition using spatiotemporal local binary pattern with integral projection,” Proceedings of the IEEE International Conference on Computer Vision Workshops, 2015, pp. 1-9.
  • [22] X. Ben, X. Jia and R. Yan, “Learning effective binary descriptors for micro-expression recognition transferred by macro-information,” Pattern Recognition Letters, 2018, 107(5), pp. 50-58.
  • [23] F. Xu, J. Zhang and Z. J. Wang, “Micro-expression identification and categorization using a facial dynamics map,” IEEE Transactions on Affective Computing, 2017, 8(2), pp. 254-267.
  • [24] R. Chaudhry, A. Ravichandran and G. Hager, “Histograms of oriented optical flow and Binet-Cauchy kernels on nonlinear dynamical systems for the recognition of human actions,” Conference on Computer Vision and Pattern Recognition. IEEE, 2009, pp. 1932-1939.
  • [25] S. T. Liong, J. See and R. C. W. Phan, “Subtle expression recognition using optical strain weighted features,” Asian Conference on Computer Vision. Springer, Cham, 2014, pp. 664-657.
  • [26] Y. J. Liu, J. K. Zhang and W. J. Yan, “A main directional mean optical flow feature for spontaneous micro-expression recognition,” IEEE Transactions on Affective Computing, 2015, 7(4), pp. 299-310.
  • [27] Y. J. Liu, J. B. Li and Y. K. Lai, “Sparse MDMO: Learning a discriminative feature for spontaneous micro-expression recognition,” IEEE Transactions on Affective Computing, 2018.
  • [28] H. Q. Khor, J. See and R. C. W, “Enriched long-term recurrent convolutional network for facial micro-expression recognition,” 2018 13th IEEE International Conference on Automatic Face and Gesture Recognition. IEEE, 2018, 7(4), pp. 667-674.
  • [29] M. Verburg, V. Menkovski, “Micro-expression detection in long videos using optical flow and recurrent neural networks.” 14th IEEE International Conference on Automatic Face & Gesture Recognition. IEEE, 2019, pp. 1-6.
  • [30] S. P. T. Reddy, S. T. Karri and S. R. Dubey, ‘Spontaneous facial micro-expression recognition using 3D spatiotemporal convolutional neural networks,” International Joint Conference on Neural Networks. IEEE, 2019, pp. 1-8.
  • [31] S. J. Wang, B. J. Li and Y. J. Liu, “Micro-expression recognition with small sample size by transferring long-term convolutional neural network.” Neurocomputing, 2018, 312, pp. 251-262.
  • [32] S. T. Liong, Y. S. Gan and J. See, “Shallow triple stream three-dimensional cnn (ststnet) for micro-expression recognition,”2019 14th IEEE International Conference on Automatic Face & Gesture Recognition. IEEE, 2019, pp. 1-5.
  • [33] V. R. Gajjala, S. P. T. Reddy and S. Mukherjee, “MERANet: Facial micro-expression recognition using 3D residual attention network,” Proceedings of the Twelfth Indian Conference on Computer Vision, Graphics and Image Processing, 2021, pp. 1-10.
  • [34] L. Lei, J. Li and T. Chen, et al, “A novel graph-tcn with a graph structured representation for micro-expression recognition,” Proceedings of the 28th ACM International Conference on Multimedia. 2020, pp. 2237-2245.
  • [35] L. Wang, J. Jia and N. Mao, “Micro-Expression Recognition Based on 2D-3D CNN,” 39th Chinese Control Conference. IEEE, 2020, pp. 3152-3157.
  • [36] Z. Liu, J. Dong, C. Zhang, L. Wang, and J. Dang, “Relation modeling with graph convolutional networks for facial action unit detection,” in International Conference on Multimedia Modeling. Springer, 2020, pp. 489–501.
  • [37] Y. Fan, J. C. Lam, and V. O. K. Li, “Facial action unit intensity estimation via semantic correspondence learning with dynamic graph convolution.” in AAAI Conference on Artificial Intelligence. AAAI, 2020, pp. 12 701–12 708.
  • [38] J. Carreira, A. Zisserman, “Quo vadis, Action recognition? A new model and the kinetics dataset,” proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 6299-6308.
  • [39] K. Hara, H. Kataoka and Y. Satoh, “Learning spatio-temporal features with 3d residual networks for action recognition,” Proceedings of the IEEE International Conference on Computer Vision Workshops. 2017, pp. 3154-3160.
  • [40] P. Ekman and E. L. Rosenberg, “What the face reveals: Basic and applied studies of spontaneous expression using the Facial Action Coding System (FACS),” Oxford University Press, 1997.
  • [41] J. Devlin, M. W. Chang, K. Lee, et al., “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
  • [42] Z. Zhou, Guoying Zhao, Yimo Guo and Matti Pietik, “An Image-Based Visual Speech Animation System,” IEEE Transactions on Circuits and Systems for Video Technology. 2012, 22, pp. 1420-1432.
  • [43] Farnebäck, Gunnar “Two-frame motion estimation based on polynomial expansion,” Scandinavian conference on Image analysis. Springer, 2003, pp. 363-370