
PGCN: Pyramidal Graph Convolutional Network for EEG Emotion Recognition

Ming Jin, Enwei Zhu, Member, IEEE, Changde Du, Huiguang He, Senior Member, IEEE, Jinpeng Li*, Member, IEEE. This work was supported in part by the National Natural Science Foundation of China (62106248), the Zhejiang Provincial Natural Science Foundation of China (LQ20F030013), the Ningbo Public Service Technology Foundation, China (202002N3181), and the Medical Scientific Research Foundation of Zhejiang Province, China (2021431314). Ming Jin, Enwei Zhu and Jinpeng Li are with HwaMei Hospital, University of Chinese Academy of Sciences, Ningbo, Zhejiang Province, China. They are also with the Ningbo Institute of Life and Health Industry, University of Chinese Academy of Sciences, Ningbo, Zhejiang Province, China. Changde Du is with the Research Center for Brain-Inspired Intelligence, Institute of Automation, Chinese Academy of Sciences, Beijing, China. Huiguang He is with the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China. Corresponding author: Jinpeng Li (E-mail: [email protected])
Abstract

Emotion recognition is essential in the diagnosis and rehabilitation of various mental diseases. In the last decade, electroencephalogram (EEG)-based emotion recognition has been intensively investigated due to its promising accuracy and reliability, and the graph convolutional network (GCN) has become a mainstream model for decoding emotions from EEG signals. However, the electrode relationships, especially long-range electrode dependencies across the scalp, may be underutilized by GCNs, although such relationships have been proven to be important in emotion recognition. The small receptive field makes shallow GCNs aggregate only local nodes; on the other hand, stacking too many layers leads to over-smoothing. To solve these problems, we propose the pyramidal graph convolutional network (PGCN), which aggregates features at three levels: local, mesoscopic, and global. First, we construct a vanilla GCN based on the 3D topological relationships of electrodes, which integrates two-hop local features; second, we construct several mesoscopic brain regions based on priori knowledge and employ mesoscopic attention to sequentially calculate virtual mesoscopic centers that focus on the functional connections of mesoscopic brain regions; finally, we fuse the node features and their 3D positions to construct a numerical-relationship adjacency matrix that integrates structural and functional connections from the global perspective. Experimental results on three public datasets indicate that PGCN enhances the relationship modelling across the scalp and achieves state-of-the-art performance in both subject-dependent and subject-independent scenarios. Meanwhile, PGCN strikes an effective balance between enlarging the receptive field through deeper networks and suppressing the ensuing over-smoothing. Our code is publicly available at https://github.com/Jinminbox/PGCN.

Index Terms:
Emotion Recognition, Electroencephalogram, Graph Convolutional Network, Knowledge-based Modelling

I Introduction

Emotion recognition is an important module in human-computer interaction [1], mental disease diagnosis and rehabilitation [2, 3], transportation [4] and security [5].

Existing works on emotion recognition fall into two categories. The first category uses low-cost, easily accessible behavioral signals such as speech [6], gesture [7] and facial expression [8]. The second category uses physiological signals, which offer higher reliability and accuracy because they are robust to artifacts and difficult to conceal [9]. Physiological signals include EEG [10], electrocardiogram (ECG), electromyogram (EMG) and galvanic skin response.

EEG stands out among physiological signals because it directly captures emotion-related brain activity. However, the EEG acquisition process is highly susceptible to interference from other physiological signals of the human body and from environmental noise. Therefore, preprocessing and feature extraction are required to expose the emotional information contained in the EEG signal [11]. Figure 1 (a) illustrates the flowchart of EEG emotion recognition. Preprocessing includes removing noise and interference such as electrooculogram artifacts, EMG artifacts, ECG artifacts, galvanic skin response artifacts, and power-frequency interference. Feature extraction exposes emotion-related information in EEG signals and helps to achieve more accurate emotion recognition.

Figure 1: The EEG emotion recognition paradigm and the conceptual design of PGCN. (a) The basic flowchart of EEG emotion recognition, which consists of visual-audio stimuli presentation (to evoke corresponding emotions), EEG signal acquisition, pre-processing, feature extraction, emotion classification and (sometimes) feedback to the subject. (b) There are three main components for feature aggregation in the PGCN. The vanilla GCN extracts local bias information from neighboring nodes, the virtual mesoscopic center aggregates information within brain regions constructed with priori knowledge guidance, and the attentional GCN further fuses structural and functional connectivity at a global level.

Graph convolution models that exploit the topology of electrodes on the scalp have become prevalent because they can leverage the connections between brain electrodes. As the pioneering work of GCN in EEG emotion recognition, DGCNN [10] dynamically updates the connections between EEG channels through gradient backpropagation to obtain more discriminative EEG features. GCB-net [12] combines GCN with a broad learning system to search features in broad spaces while exploring deeper-level EEG information. RGNN [13] proposed two regularizers to better handle cross-subject EEG variations. V-IAG [14] was proposed to deal with the individual differences and the dynamic, uncertain relationships among EEG regions.

Although previous works have applied GCNs to EEG emotion recognition, several open problems remain.

  • (1)

    The existing methods for graph construction are relatively coarse: most methods construct the adjacency matrix simply according to whether the electrodes deployed on the scalp are adjacent to each other, with limited exploration of more in-depth structural and functional connectivity.

  • (2)

    The emotion activation patterns on the brain scalp have been investigated for a long time. Nevertheless, to our knowledge, no prior work has integrated these findings (e.g., functional clusters of electrodes) into the network architecture design for emotion recognition.

  • (3)

    Most importantly, existing GCNs in EEG emotion recognition are shallow networks (2-3 layers) because multi-layer GCN is prone to over-smoothing. However, the human brain possesses long-range connectivity, and shallow networks fail to learn such long-range dependencies.

To address these issues, we propose a graph-based pyramidal network that progressively extends the perceptual field from local to mesoscopic and global and incorporates the priori knowledge from neuroscience research. As shown in Figure 1 (b), PGCN contains three main components. The first part focuses on strong local connections between different nodes. It constructs adjacency matrices based on 3D spatial distances of different electrodes and employs a two-layer GCN to fuse structural associations between adjacent nodes with strong region specificity. The second part focuses on the functional connections between nodes in specific brain regions. Different mesoscopic brain regions are delineated based on the brain research prior [15, 16]. The attention correlation coefficients between nodes within each region are calculated, and virtual mesoscopic nodes are generated. The third part focuses on possible long-distance dependencies between different nodes of the whole brain. We use attention to compute global adjacency matrices that fuse numerical and positional relationships between nodes and use GCN to fuse node relationships at the whole-brain scale. Finally, the features at different scales are fused with the original features, and a 3-layer fully connected network is used for final emotion recognition.

Compared with previous works that mostly use only 2D electrode relations for GCN construction, PGCN fuses absolute positional, relative positional, and numerical relationships between nodes into the network, while drawing on priori studies in emotion-related neuroscience to construct virtual mesoscopic centers of brain regions. To achieve this goal, we construct the PGCN to aggregate EEG features at different scales, from local and mesoscopic to global. The main contributions of this work are threefold:

  • (1)

    We propose the Pyramidal Graph Convolutional Network, which processes EEG electrode features at different scales. PGCN effectively mines the information embedded in the nodes' structural and functional connections, thus improving the network's effectiveness.

  • (2)

    Based on priori knowledge in emotion-related neuroscience, we design different mesoscopic regions and calculate their virtual centers to distinguish the role of different brain regions in emotion recognition tasks.

  • (3)

    We evaluate the network’s performance on three open datasets. Experimental results demonstrate that PGCN achieves state-of-the-art performance. In addition, the visualization results further demonstrate the effectiveness of the proposed method. The code for the paper will be open sourced after publication.

The rest of this paper is organized as follows: Section II gives a brief review of related work; Section III provides a detailed interpretation of the proposed PGCN; the experimental results of the PGCN on three emotion recognition datasets are presented in Section IV; Section V attempts a more in-depth analysis of the PGCN and their visualization; Section VI elaborates our conclusions.

II Related Work

II-A EEG Emotion Recognition

Since the collected raw EEG data contain artifacts and noise, it is challenging to apply raw EEG data directly to emotion recognition. Pre-processing operations such as filtering, baseline correction, and re-referencing can effectively suppress or eliminate the artifacts and noise. Some works have reported using pre-processed EEG signals for end-to-end emotion recognition. TSception [17] uses temporal and spatial convolutional layers to extract the time-frequency characteristics and the differences between the left and right hemispheres for end-to-end emotion recognition. EEGNet [18] uses depthwise separable convolutions to extract spatio-temporal patterns and is prominent in various BCI tasks.

Although the preprocessed EEG data can be directly used for emotion recognition, it is more effective to further extract emotion-related EEG features. After spectral analysis [19] and frequency-band interception, frequency-domain feature extraction can capture rhythmic information of the brain's neural activity. Commonly used frequency-domain features include power spectral density (PSD) [20], differential entropy (DE) [21], differential asymmetry (DASM), and rational asymmetry (RASM).
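As a concrete illustration, under the common assumption that a band-filtered EEG signal is Gaussian, the DE feature reduces to the closed form $\frac{1}{2}\log(2\pi e\sigma^{2})$. The sketch below computes per-band DE features for one window; the sampling rate, filter order, and window length are illustrative assumptions rather than the exact settings of the datasets.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def differential_entropy(x):
    """DE of a 1-D signal under a Gaussian assumption: 0.5 * ln(2*pi*e*var)."""
    return 0.5 * np.log(2 * np.pi * np.e * np.var(x))

def band_de_features(eeg, fs=200, bands=((1, 3), (4, 7), (8, 13), (14, 30), (31, 50))):
    """Band-pass each channel into the five rhythms and compute one DE value per band.

    eeg: array of shape (n_channels, n_samples), e.g. a 1-second window.
    Returns an array of shape (n_channels, n_bands).
    """
    feats = np.empty((eeg.shape[0], len(bands)))
    for b, (lo, hi) in enumerate(bands):
        bb, aa = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band")
        filtered = filtfilt(bb, aa, eeg, axis=1)
        feats[:, b] = [differential_entropy(ch) for ch in filtered]
    return feats

# Example: 62 channels, a 1-second window at 200 Hz (typical SEED-like setup).
window = np.random.randn(62, 200)
print(band_de_features(window).shape)  # (62, 5)
```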

After extracting the EEG features, different models are designed for emotion recognition. Traditional machine learning methods such as support vector machines (SVM) [22] and clustering [23] have been shown to perform emotion recognition. With its rapid development in recent years, deep learning has become mainstream for processing EEG features. Zheng et al. [24] constructed a deep belief network (DBN) to explore the role of different frequency bands and channels. Yang et al. [25] proposed a hierarchical network structure with sub-network nodes to discriminate human emotions. Since human emotions are continuous in time, recurrent neural networks (RNN) or long short-term memory networks can effectively utilize the temporal correlation of EEG [26]. ACRNN [27] employed a CNN to extract spatial information and applied an RNN with extended self-attention to explore the temporal information of EEG features. Zhang et al. [28] introduced CNN and RNN to explore the preserved spatial and temporal information in either a cascade or a parallel manner.

II-B Graph Convolutional Networks

Thanks to the introduction of practical structural information, GCN is considered an effective method for processing non-Euclidean data and has achieved great success in fields such as social networks [29], knowledge graphs [30], and traffic prediction [31].

An undirected graph can be expressed as $\mathcal{G}=(\mathcal{V},\mathcal{E})$, in which $\mathcal{E}$ represents the set of edges that connect the nodes in $\mathcal{V}$. From the edges $\mathcal{E}$, an adjacency matrix $\mathbf{A}\in\mathbb{R}^{n\times n}$ characterizing the graph structure can be constructed; at the same time, by stacking the feature $\mathbf{x}\in\mathbb{R}^{d}$ of each node, a feature matrix $\mathbf{X}\in\mathbb{R}^{n\times d}$ characterizing the node features can be constructed, where $n$ denotes the number of nodes and $d$ is the dimension of the input features.

The simplified GCN proposed by Kipf et al. [29] effectively simplifies the original complex spectral GCN [32] and can be expressed as

\mathbf{Z}=\sigma\left(\tilde{\mathbf{D}}^{-\frac{1}{2}}\tilde{\mathbf{A}}\tilde{\mathbf{D}}^{-\frac{1}{2}}\mathbf{X}\mathbf{\Theta}\right), \qquad (1)

where $\tilde{\mathbf{A}}=\mathbf{A}+\mathbf{I}$ and $\tilde{\mathbf{D}}_{ii}=\sum_{j}\tilde{\mathbf{A}}_{ij}$; $\mathbf{I}\in\mathbb{R}^{n\times n}$ is the identity matrix, $\tilde{\mathbf{D}}$ is the corresponding degree matrix, $\mathbf{\Theta}$ is a trainable weight matrix, and $\sigma$ is an activation function.
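As a minimal illustration of Eq. (1), the following PyTorch sketch implements one simplified GCN layer; the layer sizes and the random adjacency matrix are illustrative assumptions, not values from any particular model.

```python
import torch
import torch.nn as nn

class SimpleGCNLayer(nn.Module):
    """One layer of the simplified GCN in Eq. (1): Z = sigma(D^-1/2 (A+I) D^-1/2 X Theta)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.theta = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, x, adj):
        a_tilde = adj + torch.eye(adj.size(0), device=adj.device)   # add self-loops
        d_inv_sqrt = a_tilde.sum(dim=1).pow(-0.5)                   # D^{-1/2}
        a_norm = d_inv_sqrt.unsqueeze(1) * a_tilde * d_inv_sqrt.unsqueeze(0)
        return torch.relu(a_norm @ self.theta(x))

# Example: 62 electrodes, each with 5-dimensional band features.
x = torch.randn(62, 5)
adj = (torch.rand(62, 62) > 0.8).float()
adj = (adj + adj.T).clamp(max=1)            # make the random graph undirected
print(SimpleGCNLayer(5, 32)(x, adj).shape)  # torch.Size([62, 32])
```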

II-C GCN in EEG emotion recognition

Since EEG features are structured non-Euclidean data, GCN-based EEG emotion recognition has developed rapidly. DGCNN [10] dynamically updates the adjacency matrix characterizing the relationships between nodes by gradient backpropagation to obtain more accurate inter-node relationships and better emotion recognition. RGNN [13] used two regularizers, node-based domain adversarial training and emotion-aware distribution learning, to improve cross-subject emotion recognition. Inspired by neurological knowledge of brain cognitive processes, LGG-net [33] proposed local and global graphical filtering layers to learn brain activities within and between different brain functional regions, simulating the complex relationships in human brain cognition and achieving strong emotion recognition results. LR-GCN [34] employed self-attention in the forward pass to update Laplacian matrices and gradient backpropagation to update adjacency matrices, jointly constructing learnable brain electrode relationships. To deal with individual differences and the dynamic, uncertain relationships between different EEG regions, V-IAG [14] proposed the variational instance-adaptive graph method and achieved good results.

II-D Emotion and Human Brain

Brain networks have been studied for a long time with a view to better understanding human emotions. Sporns et al. [35] proposed that studies of the human brain's connectome can be conducted at the microscale, mesoscale, and macroscale, corresponding to neurons, neuronal clusters, and brain regions, respectively. However, due to the vast number of neurons in the human brain, neuron-level studies are not yet practical, and most work focuses on neuronal clusters and larger scales.

The human brain has significant small-world properties, as demonstrated by the fact that neighboring neurons are more likely to exchange information frequently. He et al. [15] used structural image data to construct a 54-node brain network for the first time and observed small-world properties. Many subsequent works have confirmed the small-world properties with different node partitioning methods, and the distribution of node degrees obeys a power-law distribution with an exponentially truncated tail [36]. Hagmann et al. [16] used the diffusion spectrum imaging technique to construct a weighted brain structural network and found that the brain network could be divided into six modules, each with corresponding core nodes mainly distributed in the frontal, temporal and occipital lobes.

To further enhance communication efficiency between neurons [37], the brain has been found to have significant ”long-range connections.” Although long-distance connections are more costly in energy and volume than short-distance connections, they significantly reduce the cost of wiring between several indirect short-distance connections, thus providing faster, more direct, and less noisy information transport [38].

Figure 2: The flowchart of the proposed PGCN. To excavate the electrode relationships, PGCN aggregates multiscale information, i.e., local, mesoscopic and global features to conduct the emotion classification task. GCNs are used to model the relationships.

III Methodology

III-A Overview

PGCN builds a pyramidal network that fuses EEG features at different scales layer by layer. The overall architecture of the model is shown in Figure 2 and can be divided into the following steps. (1) Considering the frequent local connections of brain networks, we construct a sparse structural adjacency matrix based on the spatial distance between electrodes and introduce a GCN to aggregate local features. (2) To better distinguish the relevance of different brain regions to emotions, we construct mesoscopic-scale brain regions based on priori studies and calculate a virtual mesoscopic center for each brain region to characterize the mesoscopic features. (3) To balance the importance and economy of long-distance connections, we fuse the original nodes with the virtual mesoscopic nodes, construct a sparse global graph connectivity network with the help of an attention mechanism, and aggregate global features with graph convolution. (4) Finally, the fused features are fed into a 3-layer fully connected network for the final emotion recognition task.

III-B Local Feature Aggregation

Local feature aggregation focuses on the frequent local connections of the human brain and constructs an effective information transport mechanism to fuse the EEG information of neighboring nodes. To focus on short-range neighboring features, we first construct sparse graph relations based on the relative spatial positions of the electrodes, such that the node degrees follow a power-law distribution with exponential truncation [36] and the adjacency matrix remains sufficiently sparse [13]. After that, we construct a two-layer GCN to aggregate the features of adjacent nodes. Finally, we fuse the original features with the graph-convolved features to suppress over-smoothing [39].

III-B1 Sparse Graph Relation Reasoning

GCN relies on constructing an accurate node relation matrix and calculating the corresponding Laplacian matrix. In the task of EEG emotion recognition, a common way to construct the adjacency matrix for electrodes $i$ and $j$ is:

\mathbf{A}_{ij}=\begin{cases}1 & \text{if}\quad j\in\mathcal{N}_{i}\\ 0 & \text{if}\quad j\notin\mathcal{N}_{i}\end{cases}, \qquad (2)

where $\mathcal{N}_{i}$ represents the set of 2D spatial neighbors of electrode $i$.

However, this method faces several problems: (1) the electrodes of commercially available EEG caps are not evenly distributed, so the relationships between electrodes cannot be accurately described by the binary values 0 and 1; (2) the constructed adjacency matrix is overly sparse, so some critical connections are lost during optimization; (3) since EEG exhibits very large cross-subject and even cross-session differences, a fixed adjacency matrix leads to under-optimization.

Inspired by the work of Salvador et al. [40] and Zhong et al. [13], and to better describe the connections between different nodes, we construct an initial adjacency matrix based on the inverse square of the spatial distance between nodes:

\mathbf{A}_{ij}=\begin{cases}1 & \text{if}\quad \delta/d_{ij}^{2}\geq 1\\ \delta/d_{ij}^{2} & \text{if}\quad 0.1\leq \delta/d_{ij}^{2}<1\\ 0.1 & \text{if}\quad \delta/d_{ij}^{2}<0.1\end{cases}, \qquad (3)

where $d_{ij}$ is the 3D distance between nodes $i$ and $j$, and $\delta$ is the sparsity factor; in our experiments, the best result is obtained when $\delta$ is set to 9. Values of $\mathbf{A}_{ij}$ below 0.1 are clipped to 0.1 to maintain the sparsity pattern, and values above 1 are clipped to 1 to reduce the weights of the self-loop and extremely close neighbors.

Following the work of Jin et al. [34] on LR-GCN, we set $\mathbf{A}$ as a learnable matrix and update it through gradient backpropagation.
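A possible initialization routine for Eq. (3) is sketched below, assuming 3D electrode coordinates are available; the coordinates here are random placeholders, and in the actual model the resulting matrix would be wrapped as a learnable parameter and refined by backpropagation.

```python
import numpy as np

def init_adjacency(positions, delta=9.0):
    """Initialize A_ij = delta / d_ij^2 from 3-D electrode coordinates (Eq. 3),
    then clip to [0.1, 1] so the matrix stays bounded and relatively sparse."""
    diff = positions[:, None, :] - positions[None, :, :]
    dist2 = (diff ** 2).sum(-1)                  # squared pairwise distances
    with np.errstate(divide="ignore"):
        adj = delta / dist2                      # diagonal (d = 0) becomes inf ...
    adj = np.clip(adj, 0.1, 1.0)                 # ... and is capped at 1, as in Eq. (3)
    return adj

# Hypothetical example: 62 electrodes with coordinates in millimetres.
pos = np.random.uniform(-80, 80, size=(62, 3))
A = init_adjacency(pos)
print(A.shape, A.min(), A.max())
```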

III-B2 GCN for Local Representation Aggregation

After constructing the appropriate adjacency matrix $\mathbf{A}$, we compute the corresponding Laplacian matrix $\hat{\mathbf{L}}$ and perform information transport between neighboring nodes with a GCN:

\mathbf{H}^{(l+1)}=\sigma\left(\hat{\mathbf{L}}\mathbf{H}^{(l)}\mathbf{W}^{(l)}\right), \qquad (4)

where $\mathbf{H}^{(l)}$ and $\mathbf{H}^{(l+1)}$ are the input and output node representations at layer $l$, respectively; the initial input representation $\mathbf{H}^{(0)}$ is the original feature matrix $\mathbf{X}$; $\mathbf{W}^{(l)}$ is a learnable weight matrix and $\sigma$ is an activation function.

To prevent the GCN from over-smoothing, we design only a two-layer local GCN and introduce cross-layer connections; the final output of the local feature aggregation is:

\mathbf{X}^{local}=\text{concat}\left(\mathbf{X},\mathbf{H}^{(1)},\mathbf{H}^{(2)}\right). \qquad (5)
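A minimal sketch of the local aggregation in Eqs. (4)-(5), assuming the normalized Laplacian $\hat{\mathbf{L}}$ has already been computed; the hidden size is an illustrative assumption.

```python
import torch
import torch.nn as nn

class LocalGCN(nn.Module):
    """Two graph-convolution layers (Eq. 4) whose outputs are concatenated with the
    input features (Eq. 5) to limit over-smoothing."""
    def __init__(self, in_dim, hidden_dim):
        super().__init__()
        self.w1 = nn.Linear(in_dim, hidden_dim, bias=False)
        self.w2 = nn.Linear(hidden_dim, hidden_dim, bias=False)

    def forward(self, x, l_hat):
        h1 = torch.relu(l_hat @ self.w1(x))       # H(1)
        h2 = torch.relu(l_hat @ self.w2(h1))      # H(2)
        return torch.cat([x, h1, h2], dim=-1)     # X_local = [X, H(1), H(2)]

# Example with a stand-in normalized Laplacian over 62 electrodes.
l_hat = torch.rand(62, 62).softmax(dim=1)
x = torch.randn(62, 5)
print(LocalGCN(5, 16)(x, l_hat).shape)            # torch.Size([62, 37])
```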

III-C Mesoscopic Feature Aggregation

Different areas of the human cerebral cortex are highly connected and centralized, and several works have divided the cortex into emotion-related brain regions [16, 41]. With reference to priori knowledge from current brain-science research, we construct different mesoscopic brain regions and then calculate the location and features of a virtual mesoscopic center in each region. Since the features within a region converge only to its virtual mesoscopic center, over-smoothing is effectively avoided while the perceptual field is increased.

III-C1 Graph Coarsen for Mesoscopic Regions

To obtain the mesoscopic feature aggregation, we design two different mesoscopic divisions with different perceptual fields, as shown in Figure 3.

(a) 7 mesoscopic regions
(b) 2 mesoscopic regions
Figure 3: Brain region division based on priori knowledge.

Figure 3(a) shows the first partition with reference to the anatomy of the cerebral cortex. The brain’s cortex is generally divided into four lobes, frontal, parietal, temporal, and occipital [42], each lobe being responsible for a different task; for example, the occipital lobe is associated with visual processing and interpretation. Firstly, we divided the electrodes into five regions; secondly, as electrodes located in the temporal lobe (FT7, T7, TP7, FT8, T8, TP8) have been reported to be important for emotion recognition, we performed a more detailed secondary division of the electrodes in the temporal lobe [24, 43]; finally, inspired by the work of Hagmann et al. [16], we adjusted the division of the regions to meet the needs of both structural and functional connectivity.

Figure 3(b) shows the other valid way to divide the mesoscopic regions, based on the brain's two hemispheres. The corpus callosum connects the two hemispheres, and all human activities are realized through their information interaction. However, the functions of the two halves of the brain differ significantly.

III-C2 Mesoscopic Node Relation

After constructing the mesoscopic brain regions, we focus on functional connectivity within each mesoscopic region through a self-attention mechanism. The input is the set of node features within a mesoscopic region, $\mathbf{h}=\{\vec{h}_{1},\vec{h}_{2},\ldots,\vec{h}_{N}\}$, $\mathbf{h}\in\mathbb{R}^{N\times F}$, where $N$ is the number of nodes in the mesoscopic region and $F$ is the number of features per node. The attention-based connectivity matrix $\mathbf{e}$ can be expressed as:

\mathbf{e}=\text{LeakyReLU}\left((\mathbf{h}\mathbf{W})(\mathbf{h}\mathbf{W})^{T}\right), \qquad (6)

where $\mathbf{W}$ is a learnable weight matrix and $(\cdot)^{T}$ denotes transposition.

III-C3 Virtual Mesoscopic Center

After calculating the attention-based connectivity matrix of a mesoscopic region, we construct its virtual mesoscopic center. We compute the weight coefficients $\mathbf{\Lambda}\in\mathbb{R}^{1\times N}$ of the region by summing $\mathbf{e}$ row-wise, where $\mathbf{\Lambda}_{i}$ denotes the functional connection weight of node $i$ to all other nodes. We then calculate the features and location of the virtual mesoscopic center:

\begin{aligned}\mathbf{p}^{locate}&=\text{softmax}(\mathbf{\Lambda})\mathbf{P},\\ \mathbf{m}^{feature}&=\text{softmax}(\mathbf{\Lambda})\mathbf{h},\end{aligned} \qquad (7)

where $\mathbf{P}\in\mathbb{R}^{N\times 3}$ is the absolute position matrix of the electrodes in the region.
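The following sketch illustrates Eqs. (6)-(7) for a single mesoscopic region; the region size and feature dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VirtualCenter(nn.Module):
    """Computes one virtual mesoscopic center per region (Eqs. 6-7): attention scores
    within the region are summed row-wise and softmax-normalized, then used to average
    the node features and their 3-D positions."""
    def __init__(self, feat_dim):
        super().__init__()
        self.w = nn.Linear(feat_dim, feat_dim, bias=False)

    def forward(self, h, p):
        # h: (N, F) node features within one region, p: (N, 3) electrode positions.
        proj = self.w(h)
        e = nn.functional.leaky_relu(proj @ proj.T)  # Eq. (6)
        lam = e.sum(dim=1)                           # row-wise connection weights Lambda
        w = torch.softmax(lam, dim=0)                # softmax(Lambda) in Eq. (7)
        return w @ h, w @ p                          # m_feature, p_locate

# Example: a 6-electrode temporal-lobe region with 37-dim fused local features.
h, p = torch.randn(6, 37), torch.randn(6, 3)
feat, pos = VirtualCenter(37)(h, p)
print(feat.shape, pos.shape)                         # torch.Size([37]) torch.Size([3])
```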

After calculating the regional center of each mesoscopic region, the centers' features and locations are stacked according to the two divisions. For the division in Figure 3(a), they are $\mathbf{M}^{(1)}\in\mathbb{R}^{7\times F}$ and $\mathbf{P}^{(1)}\in\mathbb{R}^{7\times 3}$; for the division in Figure 3(b), they are $\mathbf{M}^{(2)}\in\mathbb{R}^{2\times F}$ and $\mathbf{P}^{(2)}\in\mathbb{R}^{2\times 3}$.

Finally, we perform virtual node imputation to fuse the virtual mesoscopic centers with the original electrode nodes, obtaining fused feature and location matrices that contain both local and mesoscopic attributes:

\begin{aligned}\mathbf{X}^{meso}&=\left(\text{concat}\left((\mathbf{X}^{local})^{T},(\mathbf{M}^{(1)})^{T},(\mathbf{M}^{(2)})^{T}\right)\right)^{T},\\ \mathbf{P}^{meso}&=\left(\text{concat}\left(\mathbf{P}^{T},(\mathbf{P}^{(1)})^{T},(\mathbf{P}^{(2)})^{T}\right)\right)^{T}.\end{aligned} \qquad (8)

III-D Global Feature Aggregation

Although the mesoscopic partitions alleviate the problem that node information cannot be transported over long distances, simply enlarging the mesoscopic regions would easily cause information loss as the number of nodes per region increases.

Attention mechanisms are very effective in constructing global feature relationships [44]. To capture long-distance node correlations, we construct an attention-based adjacency matrix that depends on both node features and absolute positions, characterizes the possible connections between all nodes, and eliminates weak connections among them. After constructing the global electrode connectivity, we use graph convolution for global feature aggregation.

III-D1 Global Node Relation

To better characterize the global node relationships, we encode each node's spatial location. The full position matrix $\mathbf{P}^{meso}$ consists of the original electrode positions $\mathbf{P}$ and the virtual mesoscopic node positions $\mathbf{P}^{virtual}$. Since the virtual nodes also have corresponding features, we sum the features of all nodes with their position encodings to obtain position-enhanced node features as follows:

\mathbf{X}^{enhanced}=\mathbf{X}^{meso}+\text{embed}\left(\mathbf{P}^{meso}\right). \qquad (9)

After that, we compute the relation tensor $\mathbf{G}\in\mathbb{R}^{6\times N^{\prime}\times N^{\prime}}$ using multi-head self-attention with 6 heads and fuse $\mathbf{G}$ into the final attentional relation matrix $\mathbf{A}^{global}\in\mathbb{R}^{N^{\prime}\times N^{\prime}}$ using a learnable weight vector $\mathbf{w}\in\mathbb{R}^{1\times 6}$:

\mathbf{A}^{global}=\mathbf{w}\mathbf{G}. \qquad (10)

Since a dense attention matrix empirically exacerbates over-smoothing, we preserve only the top 20% of connections in the adjacency matrix to ensure its sparsity.
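A sketch of the global relation computation in Eqs. (9)-(10) with the top-20% pruning applied afterwards; the attention parameterization and scaling below are simplified assumptions for illustration rather than the exact implementation.

```python
import torch
import torch.nn as nn

class GlobalRelation(nn.Module):
    """Builds the sparse global adjacency of Eqs. (9)-(10): position-enhanced features,
    6-head attention scores fused by a learnable weight vector, then top-20% pruning."""
    def __init__(self, feat_dim, heads=6, keep_ratio=0.2):
        super().__init__()
        self.heads, self.keep_ratio = heads, keep_ratio
        self.pos_embed = nn.Linear(3, feat_dim)                       # embed(P) in Eq. (9)
        self.q = nn.Linear(feat_dim, feat_dim * heads, bias=False)
        self.k = nn.Linear(feat_dim, feat_dim * heads, bias=False)
        self.head_weight = nn.Parameter(torch.ones(heads) / heads)    # w in Eq. (10)

    def forward(self, x, p):
        n, d = x.shape
        x = x + self.pos_embed(p)                                     # X_enhanced
        q = self.q(x).view(n, self.heads, d).transpose(0, 1)          # (heads, N, d)
        k = self.k(x).view(n, self.heads, d).transpose(0, 1)
        g = torch.softmax(q @ k.transpose(1, 2) / d ** 0.5, dim=-1)   # G: (heads, N, N)
        adj = torch.einsum("h,hij->ij", self.head_weight, g)          # A_global
        thresh = torch.quantile(adj.flatten(), 1 - self.keep_ratio)   # keep top 20%
        return adj * (adj >= thresh).float()

# Example: 62 electrodes plus 9 virtual centers, each with 37-dim features.
x, p = torch.randn(71, 37), torch.randn(71, 3)
print(GlobalRelation(37)(x, p).shape)   # torch.Size([71, 71])
```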

III-D2 GCN for Global Representation Aggregation

After obtaining the global attention-based adjacency matrix $\mathbf{A}^{global}$, we compute the corresponding Laplacian matrix $\hat{\mathbf{L}}^{global}$ and employ it for graph convolution:

\mathbf{O}^{(l+1)}=\sigma\left(\hat{\mathbf{L}}^{global}\mathbf{O}^{(l)}\mathbf{W}^{(l)}\right), \qquad (11)

where $\mathbf{O}^{(l)}$ and $\mathbf{O}^{(l+1)}$ are the input and output node representations, respectively; the initial input representation $\mathbf{O}^{(0)}$ is the mesoscopic feature matrix $\mathbf{X}^{meso}$; $\mathbf{W}^{(l)}$ is a learnable weight matrix and $\sigma$ is an activation function.

After that, we concatenate the input mesoscopic features $\mathbf{X}^{meso}$ with the global GCN output $\mathbf{O}^{(1)}$ to obtain features that cover the local, mesoscopic, and global perceptual fields:

\mathbf{X}^{global}=\text{concat}\left(\mathbf{X}^{meso},\mathbf{O}^{(1)}\right). \qquad (12)

Finally, $\mathbf{X}^{global}$ is fed into a three-layer fully connected network to obtain the final prediction of the subject's emotion. Details of the model implementation are given in Appendix A.
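For completeness, a sketch of such a classification head is shown below; the hidden layer sizes are illustrative assumptions, and the exact settings are those given in Appendix A.

```python
import torch
import torch.nn as nn

class EmotionHead(nn.Module):
    """Three-layer fully connected classifier over the flattened multi-scale features.
    Hidden sizes here are illustrative, not the values used in the paper."""
    def __init__(self, n_nodes, feat_dim, n_classes, hidden=(512, 64)):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(n_nodes * feat_dim, hidden[0]), nn.ReLU(),
            nn.Linear(hidden[0], hidden[1]), nn.ReLU(),
            nn.Linear(hidden[1], n_classes),
        )

    def forward(self, x_global):
        # Flatten the (nodes, features) grid and output class logits.
        return self.mlp(x_global.flatten(start_dim=-2))

# Example: batch of 8 samples, 71 nodes, 74-dim concatenated features, 3 SEED classes.
x = torch.randn(8, 71, 74)
print(EmotionHead(71, 74, 3)(x).shape)   # torch.Size([8, 3])
```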

IV Experiments

In this section, we evaluate the effectiveness of the proposed PGCN on three well-known emotion recognition datasets: SEED [24], SEED-IV [45], and SEED-V [46].

IV-A Datasets and Protocol

To comprehensively evaluate the effectiveness of the proposed PGCN, we conducted subject-dependent and subject-independent experiments on the above three datasets; pre-processing and feature extraction were carried out for a subsequent objective model evaluation. In the subject-dependent experiments, the training data and testing data come from the same subject, evaluating the effectiveness of the model for cross-temporal application to the same subject; in the subject-independent experiments, the training data and testing data come from different subjects, evaluating the effectiveness of the model for cross-subject application. A detailed description of the datasets and protocols is in Appendix B. The experimental results for all baselines are taken from the cited publications.
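For the subject-independent scenario, a common realization is leave-one-subject-out cross-validation; the sketch below illustrates such a split under the assumption that features are grouped per subject (the exact protocol is the one described in Appendix B).

```python
import numpy as np

def leave_one_subject_out(features_by_subject, labels_by_subject):
    """Yield (train, test) splits where the test set is one held-out subject,
    a common way to implement a subject-independent protocol."""
    subjects = sorted(features_by_subject)
    for test_subj in subjects:
        x_test, y_test = features_by_subject[test_subj], labels_by_subject[test_subj]
        x_train = np.concatenate([features_by_subject[s] for s in subjects if s != test_subj])
        y_train = np.concatenate([labels_by_subject[s] for s in subjects if s != test_subj])
        yield (x_train, y_train), (x_test, y_test)

# Hypothetical example with 3 subjects, 100 samples each of 62x5 DE features.
feats = {s: np.random.randn(100, 62, 5) for s in ("s1", "s2", "s3")}
labels = {s: np.random.randint(0, 3, 100) for s in ("s1", "s2", "s3")}
for (xtr, ytr), (xte, yte) in leave_one_subject_out(feats, labels):
    print(xtr.shape, xte.shape)
```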

IV-B Experiment on SEED

Table I presents the subject-dependent emotion recognition accuracy of PGCN and all baselines on the SEED dataset. SVM is a traditional machine learning method; DBN and BiDANN-S are deep learning methods; and DGCNN, GCB-net+BLS, RGNN, and V-IAG are GCN-based methods.

TABLE I: Subject-dependent classification accuracy (mean/std) on the SEED dataset
Method δ (1–3 Hz) θ (4–7 Hz) α (8–13 Hz) β (14–30 Hz) γ (31–50 Hz) all bands
SVM [22] 60.50 / 14.14 60.95 / 10.20 66.64 / 14.41 80.76 / 11.56 79.56 / 11.38 83.99 / 9.72
DBN [24] 64.32 / 12.45 60.77 / 10.42 64.01 / 15.97 78.92 / 12.48 79.19 / 14.58 86.08 / 8.34
DGCNN [10] 74.25 / 11.42 71.52 / 5.99 74.43 / 12.16 83.65 / 10.17 85.73 / 10.64 90.40 / 8.49
BiDANN-S [47] 76.97 / 10.95 75.56 / 7.88 81.03 / 11.74 89.65 / 9.59 88.64 / 9.46 92.38 / 7.04
GCB-net+BLS [12] 79.98 / 8.93 76.51 / 9.56 81.97 / 11.05 89.06 / 8.69 89.10 / 9.55 94.24 / 6.70
RGNN [13] 76.17 / 7.91 72.26 / 7.25 75.33 / 8.85 84.25 / 12.54 89.23 / 8.90 94.24 / 5.95
V-IAG [14] 81.14 / 9.46 82.37 / 7.44 84.51 / 9.68 92.15 / 8.90 92.96 / 6.19 95.64 / 5.08
PGCN (ours) 79.62 / 10.53 83.62 / 6.91 83.74 / 9.57 92.33 / 8.66 93.05 / 5.78 96.93 / 5.11
  • † The average accuracy is calculated over the results of two sessions.

Encouragingly, on the subject-dependent emotion recognition task, our PGCN achieves the best results on the all-band features and on the theta, beta, and gamma bands, and performs marginally worse than the previous best model, V-IAG, on the delta and alpha bands. To our knowledge, this is the first time that the emotion recognition accuracy has been raised to 96.93%. Comparing the results across frequency bands, the model is better at capturing the emotional information carried in high-frequency features, which is in line with the findings of many previous studies [13, 24].

Table II shows the results of PGCN and all baseline models for subject-independent emotion recognition on the SEED dataset. In Table II, we collect the results of both supervised and transfer-learning-based experiments.

The comparison shows that the proposed PGCN achieves the best results among the supervised learning methods, leading RGNN with the domain adaptation module removed by 2.67%. Since transfer learning uses additional data from the testing set (without using its labels) for model training, the supervised learning approaches lag significantly behind the transfer learning approaches; with the DA module added, RGNN is 0.71% ahead of PGCN. However, in practical online emotion recognition, it is almost impossible to obtain enough data from the test set in advance to train a transfer learning model.

TABLE II: Subject-independent classification accuracy (mean/std) on the SEED dataset
Method all bands
Transfer Learning TCA [48] 63.64 / 14.88
DANN [49] 75.08 / 11.18
BiDANN-S [47] 84.14 / 6.87
RGNN [13] 85.30 / 6.72
Supervised Learning SVM [22] 56.73 / 16.29
DGCNN [10] 79.95 / 9.02
RGNN w/o DA [13] 81.92 / 9.35
PGCN (ours) 84.59 / 8.68

IV-C Experiment on SEED-IV

Table III shows the results of the proposed PGCN for emotion recognition on SEED-IV. In the subject-dependent experiments, the accuracy of PGCN was 2.87% higher than that of RGNN under the same experimental setting. In addition, we present the subject-dependent results for different subjects in different sessions in Figure 4. The results show that more than two-thirds of the subjects reached an accuracy of over 80%, and one-third exceeded 90%. Meanwhile, the emotion recognition results in session 1 were almost evenly distributed between 0.6 and 1, visually demonstrating the considerable variation across subjects under the same experimental setting.

TABLE III: Subject-dependent and subject-independent classification accuracy (mean/std) on the SEED-IV dataset
Method subject-dependent subject-independent
SVM [22] 56.61 / 20.05 37.99 / 12.52
DBN [24] 66.77 / 7.38 - / -
DGCNN [10] 69.88 / 16.29 52.82 / 9.23
BiDANN-S [47] 70.29 / 12.63 65.59 / 10.39
BiHDM [50] 74.35 / 14.09 69.03 / 8.66
RGNN [13] 79.37 / 10.54 73.84 / 8.02
BiHDM w/o DA [50] 72.22 / 14.69 67.47 / 8.22
RGNN w/o DA [13] - / - 71.65 / 9.43
PGCN (ours) 82.24 / 14.85 73.69 / 7.16
Figure 4: Subject-dependent emotion recognition accuracy on the SEED-IV dataset.

In the subject-independent experiments, PGCN improved by over 3% relative to all baselines except RGNN. Compared with RGNN without the domain adaptation (DA) module, PGCN showed a 2.04% improvement; when the DA module was added to RGNN, PGCN was slightly worse by 0.15%. We found that BiHDM and RGNN improved by 1.56% and 2.19%, respectively, with the addition of the DA module, demonstrating the excellent ability of domain adaptation to reduce distribution diversity between subjects.

IV-D Experiment on SEED-V

Table IV shows the results of the proposed PGCN for emotion recognition on the SEED-V dataset. All baseline results are taken from the corresponding citations. BDAE [51] enhances emotion recognition with the help of high-level representational features extracted by a bimodal deep auto-encoder; MD-AGCN [52] proposes a multi-domain adaptive graph convolution network that incorporates frequency- and temporal-domain knowledge, making full use of the complementary information in EEG signals. PGCN improves 0.92% over the previous best model, MD-AGCN, on the subject-dependent task, and achieves 61.78% accuracy on the subject-independent task.

TABLE IV: Subject-dependent and subject-independent classification accuracy (mean/std) on the SEED-V dataset
Method subject-dependent subject-independent
SVM [24] 69.5 / 10.28 - / -
BDAE [51] 79.7 / 4.76 - / -
MD-AGCN [52] 80.77 / 6.61 - / -
PGCN (ours) 81.69 / 10.57 61.78 / 8.59

V Discussion

In this section, we demonstrate the role of the various modules of PGCN through ablation experiments and try to explain why PGCN works well. We also visualize the learned representations of PGCN.

V-A Ablation Study

We ablated and recombined the local, mesoscopic and global modules in PGCN to demonstrate the effect of each module on emotion recognition; the results are shown in Table V. In the ablation experiments, we only modified the feature extraction network composed of the local, mesoscopic, and global modules without changing other parts. In Table V, the backbone represents the most basic and commonly used two-layer GCN based on a 2D electrode adjacency matrix [33, 14]; the backbone module and the local module are never active at the same time.

TABLE V: Ablation study for subject-dependent classification accuracy (mean/std) on the SEED and SEED-IV dataset.
Method Backbone Local Mesoscopic Global SEED SEED-IV
Baseline \bullet \circ \circ \circ 92.34 / 7.68 75.94 / 13.02
Global-only \bullet \circ \circ \bullet 92.93 / 8.35 76.36 / 14.17
Local-only \circ \bullet \circ \circ 93.69 / 6.12 77.84 / 14.68
Meso-only \bullet \circ \bullet \circ 94.91 / 6.54 80.71 / 12.72
Meso-removed \circ \bullet \circ \bullet 94.08 / 7.16 77.84 / 13.08
Global-removed \circ \bullet \bullet \circ 95.41 / 6.71 80.6 / 13.27
Local-removed \bullet \circ \bullet \bullet 96.41 / 4.54 80.96 / 14.52
PGCN \circ \bullet \bullet \bullet 96.93 / 5.11 82.24 / 14.85
  • In each line, \bullet means the module is employed, and \circ means that module is blocked. The backbone module and local module do not activate at the same time.

Together, the three modules improve the emotion recognition accuracy of PGCN from 92.34% to 96.93% on the SEED dataset and from 75.94% to 82.24% on the SEED-IV dataset, and each module contributes positively. A more detailed discussion of the data in Table V follows.

  • (1)

    Introducing each module individually gives a comprehensive boost to the model: the local module has a Shapley value of about 2% on the SEED and SEED-IV datasets, the meso module about 3%–3.5% on both datasets, and the global module about 1% and 2% on SEED and SEED-IV, respectively.

  • (2)

    Comparing the baseline and meso-only configurations, the meso module effectively improves the network performance, with a 2.57% improvement on the SEED dataset and a 4.77% improvement on the SEED-IV dataset. We speculate that the improvement comes from the meso module extracting discriminative features between nodes at the mesoscopic scale under the guidance of priori knowledge, and from fusing these features with the output of the backbone.

  • (3)

    Since meso-removed is the combination of the local and global modules, comparing it with the local-only configuration reveals that introducing the global module on top of the local module improves the emotion recognition ability by only 0.39% on SEED and brings no improvement on SEED-IV. We hypothesize that, for GCN-based models, stacking network layers is accompanied by severe over-smoothing; to verify this conjecture, we conducted a more in-depth experiment (Figure 5). As a comparison, global-only brings about a 0.5% improvement over the baseline, probably because the GCN in the baseline is not sufficiently trained and optimized.

Since the meso module contains two different scales, the brain region scale, and the hemisphere scale, we ablated the submodules in the meso module and presented the result in Table VI.

Before the introduction of the meso module, stacking the local and global modules gave the network the corresponding local and global perceptual fields, but the deepening of the GCN caused the fitting ability to stagnate or even decrease. By introducing the 7-region and 2-region meso modules, the network gains mesoscopic perceptual fields while adding critical virtual mesoscopic centers, resulting in improved fitting ability.

TABLE VI: Effectiveness of Meso-layer on the SEED and SEED-IV dataset.
7 regions 2 regions SEED SEED-IV
\circ \circ 94.08 / 7.16 77.84 / 13.08
\bullet \circ 94.74 / 6.66 80.34 / 14.71
\circ \bullet 94.85 / 5.91 79.77 / 13.16
\bullet \bullet 96.93 / 5.11 82.24 / 14.85
  • In each line, \bullet means the module is employed, and \circ means that module is blocked.

In Tables V and VI, the hyperparameters are optimized for the full PGCN, and better choices may exist for the other configurations. However, to control the variables and reduce parameter tuning, we kept the hyperparameters fixed; this may make the improvements in the ablation experiments appear larger than they would under configuration-by-configuration optimization.

V-B Why PGCN works? Network Architecture Analysis

To further explore why PGCN works, we plot line graphs of performance as the number of GCN layers increases in Figure 5 and Figure 6. Here, the vanilla GCN denotes the most commonly used GCN based on a 2D electrode adjacency matrix; for the 6-layer PGCN, we add an additional attention-based global-scale layer.

We plot the node smoothness curve with an increasing number of network layers in Figure 5 to visualize the over-smoothing problem. Here, node smoothness is the cosine similarity between the node representations output by each layer. Thanks to the redesigned initial adjacency matrix, PGCN shows a smaller increase in node smoothness than the vanilla GCN at the local layers [53]. At the mesoscopic layers, with the introduction of virtual nodes, the node smoothness of PGCN remains almost constant or even decreases while the perceptual field grows, whereas the node smoothness of the vanilla GCN continues to increase. At the global layer, the global perceptual field causes the node smoothness of PGCN to rise rapidly and surpass that of the vanilla GCN.

Figure 5: The curve of node smoothness with the increasing number of network layers. Node smoothness is the mean cosine similarity between the node representations output by each layer.
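A small sketch of how this node smoothness metric can be computed from one layer's output; excluding the diagonal self-similarity is our own assumption rather than a detail stated in the text.

```python
import torch
import torch.nn.functional as F

def node_smoothness(h):
    """Mean pairwise cosine similarity between node representations of one layer's
    output; values close to 1 indicate over-smoothing."""
    h_norm = F.normalize(h, dim=-1)
    sim = h_norm @ h_norm.T                           # (N, N) cosine similarities
    n = sim.size(0)
    off_diag = sim[~torch.eye(n, dtype=torch.bool)]   # drop self-similarity
    return off_diag.mean().item()

# Example: 62 node representations with 32 features each.
print(node_smoothness(torch.randn(62, 32)))
```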

Figure 6 shows the average prediction accuracy of the networks with different numbers of layers. For the vanilla GCN, the recognition accuracy reaches a maximum of 92.34% at two layers and then tends to decrease as the number of layers increases, which is why most current GCNs for emotion recognition use two-layer networks [33, 14]. For PGCN, as the number of network layers increases, the recognition accuracy reaches a maximum of 96.93% at five layers and then begins to decline.

Figure 6: Average prediction accuracy of vanilla GCN and PGCN on the SEED dataset. As the perceptual field increases, the accuracy of the vanilla GCN begins to decline after the second layer, and the fitting ability of the PGCN continues to improve until the fifth layer, when it begins to decline.

Comparing the vanilla GCN with PGCN, thanks to the introduction of sparse graph relational reasoning and local feature aggregation, PGCN outperforms the vanilla GCN by 1.2% on shallow networks (layer \leq 2). When the network continues to deepen (2 << layer \leq 4), the benefit of the larger receptive field in the vanilla GCN is gradually outweighed by the high feature similarity caused by over-smoothing, and its effectiveness begins to decline. In PGCN, guided by priori knowledge, the mesoscopic features are aggregated onto virtual nodes in a highly sparse manner; this enlarges the perceptual field without causing excessive feature similarity among the electrode nodes, and the network performance continues to increase. When the network reaches the global perceptual field (layer >> 4), the over-smoothing problem in the vanilla GCN becomes even more pronounced, further reducing the model's effectiveness. In PGCN, by fusing the electrode nodes with the virtual nodes and employing a sparse attention-based adjacency matrix, the benefit of the enlarged perceptual field still outweighs the over-interaction of feature information caused by over-smoothing at five layers, so performance continues to improve; as more layers are stacked, over-smoothing prevails and the model's performance declines.

V-C Representation Visualization

To illustrate brain activation during the emotion recognition process, we selected five subjects from each of the SEED and SEED-IV datasets and display heat maps of the diagonal elements of their learned adjacency matrices in Figure 7. The diagonal elements of the adjacency matrix provide the most intuitive indication of the weights of node features in the graph convolution; for presentation, they were scaled to between 0 and 1.
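A small sketch of this presentation step, assuming simple min-max scaling of the learned adjacency diagonal; the exact scaling used for the figure is not specified in the text.

```python
import numpy as np

def scaled_self_weights(adj):
    """Min-max scale the learned adjacency diagonal to [0, 1] for heat-map display."""
    d = np.diag(adj).astype(float)
    return (d - d.min()) / (d.max() - d.min() + 1e-12)

# Example with a random 62x62 stand-in for a learned adjacency matrix.
print(scaled_self_weights(np.random.rand(62, 62)).round(2))
```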

Figure 7: Heat maps of the learned adjacency matrix diagonal elements for subject-dependent emotion recognition on the (a) SEED and (b) SEED-IV datasets; the diagonal elements were scaled to between 0 and 1 for presentation. For most subjects, electrodes in the frontal, temporal and occipital lobes have greater self-weighting.


Combining the results from the two datasets, electrodes in the frontal, temporal and occipital lobes gained greater weight after learning. In particular, on the 3-category SEED dataset, nodes located in the temporal and occipital lobes were significantly activated, and nodes located in the frontal lobe were partially activated. On the 4-category SEED-IV dataset, nodes in the temporal and occipital lobes were significantly activated, and activation in the frontal lobe was even more pronounced. The difference in node activation between the two datasets may be caused by a degree of inductive bias in the 3-category and 4-category classification tasks, where the 4-category task requires attending to information from the frontal, temporal and occipital lobes simultaneously for accurate emotion recognition.

To show the connections between nodes, we selected two subjects from each of the SEED and SEED-IV datasets and plot the top-10 connections of the learned adjacency matrix in Figure 8. Unlike the heat maps, the diagonal elements are removed and only the connections between different nodes are plotted.


Figure 8: Top-10 connections of the learned adjacency matrix on the (a) SEED and (b) SEED-IV datasets; the electrodes occupying the most important connections are distributed in the frontal, temporal and occipital lobes.


Comparing the different chord diagrams in Figure 8, the top-10 connections vary considerably across subjects. For example, for subject 1 of the SEED dataset, the primary connections are mainly in the frontal and temporal lobes, with fewer links in the occipital lobe; in contrast, for subject 2, the critical connections are located in the temporal and occipital lobes, while the frontal lobe decreases in importance. A similar situation exists for the SEED-IV dataset. Across the two datasets, although critical connectivity varies somewhat between subjects, most critical connections consistently involve the temporal, frontal, and occipital lobes, which are closely related to emotion and vision.

Combining Figures 7 and 8, despite cross-subject and cross-dataset variability, the nodes located in the frontal, temporal and occipital lobes not only have greater self-loop weights but also build strong connections with other nodes. Nodes in these regions therefore play a leading role in the network, ultimately enabling excellent emotion recognition.

VI Conclusion

This paper proposes a pyramidal graph convolutional network that improves emotion recognition accuracy by effectively balancing the expansion of the GCN's receptive field against the consequent over-smoothing problem. In PGCN, three separate modules acquire information between electrodes at different scales: the local module mainly concentrates on small-world properties, the mesoscopic module learns connections between brain regions constructed with priori knowledge, and the global module learns sparse connections between global nodes. The features learned by the three modules are then effectively integrated to improve emotion recognition. The proposed PGCN achieves the best results on all three publicly available datasets. Our future work will focus on 1) designing optimized meso structures to further reduce over-smoothing while acquiring discriminative features; 2) exploring a more effective initial brain adjacency matrix; and 3) introducing more priori knowledge to improve emotion recognition accuracy.

References

  • [1] R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis, G. Votsis, S. Kollias, W. Fellenz, and J. G. Taylor, “Emotion recognition in human-computer interaction,” IEEE Signal processing magazine, vol. 18, no. 1, pp. 32–80, 2001.
  • [2] V. Zotev, A. Mayeli, M. Misaki, and J. Bodurka, “Emotion self-regulation training in major depressive disorder using simultaneous real-time fmri and eeg neurofeedback,” NeuroImage: Clinical, vol. 27, p. 102331, 2020.
  • [3] J. K. Carpenter, L. A. Andrews, S. M. Witcraft, M. B. Powers, J. A. Smits, and S. G. Hofmann, “Cognitive behavioral therapy for anxiety and related disorders: A meta-analysis of randomized placebo-controlled trials,” Depression and anxiety, vol. 35, no. 6, pp. 502–514, 2018.
  • [4] E. Q. Wu, P.-Y. Deng, X.-Y. Qu, Z. Tang, W.-M. Zhang, L.-M. Zhu, H. Ren, G.-R. Zhou, and R. S. Sheng, “Detecting fatigue status of pilots based on deep learning network using eeg signals,” IEEE Transactions on Cognitive and Developmental Systems, 2020.
  • [5] C. D. Katsis, N. Katertsidis, G. Ganiatsas, and D. I. Fotiadis, “Toward emotion recognition in car-racing drivers: A biosignal processing approach,” IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans, vol. 38, no. 3, pp. 502–512, 2008.
  • [6] Y. Gu, S. Chen, and I. Marsic, “Deep multimodal learning for emotion recognition in spoken language,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2018, pp. 5079–5083.
  • [7] F. Noroozi, D. Kaminska, C. Corneanu, T. Sapinski, S. Escalera, and G. Anbarjafari, “Survey on emotional body gesture recognition,” IEEE transactions on affective computing, 2018.
  • [8] N. Zeng, H. Zhang, B. Song, W. Liu, Y. Li, and A. M. Dobaie, “Facial expression recognition via learning deep sparse autoencoders,” Neurocomputing, vol. 273, pp. 643–649, 2018.
  • [9] G. Nam, H. Lee, J.-H. Lee, and J.-W. Hur, “Disguised emotion in alexithymia: subjective difficulties in emotion processing and increased empathic distress,” Frontiers in Psychiatry, vol. 11, 2020.
  • [10] T. Song, W. Zheng, P. Song, and Z. Cui, “Eeg emotion recognition using dynamical graph convolutional neural networks,” IEEE Transactions on Affective Computing, vol. 11, no. 3, pp. 532–541, 2018.
  • [11] K. A. Robbins, J. Touryan, T. Mullen, C. Kothe, and N. Bigdely-Shamlo, “How sensitive are eeg results to preprocessing methods: a benchmarking study,” IEEE transactions on neural systems and rehabilitation engineering, vol. 28, no. 5, pp. 1081–1090, 2020.
  • [12] T. Zhang, X. Wang, X. Xu, and C. P. Chen, “Gcb-net: Graph convolutional broad network and its application in emotion recognition,” IEEE Transactions on Affective Computing, 2019.
  • [13] P. Zhong, D. Wang, and C. Miao, “Eeg-based emotion recognition using regularized graph neural networks,” IEEE Transactions on Affective Computing, 2020.
  • [14] T. Song, S. Liu, W. Zheng, Y. Zong, Z. Cui, Y. Li, and X. Zhou, “Variational instance-adaptive graph for eeg emotion recognition,” IEEE Transactions on Affective Computing, 2021.
  • [15] Y. He, Z. J. Chen, and A. C. Evans, “Small-world anatomical networks in the human brain revealed by cortical thickness from mri,” Cerebral cortex, vol. 17, no. 10, pp. 2407–2419, 2007.
  • [16] P. Hagmann, L. Cammoun, X. Gigandet, R. Meuli, C. J. Honey, V. J. Wedeen, and O. Sporns, “Mapping the structural core of human cerebral cortex,” PLoS biology, vol. 6, no. 7, p. e159, 2008.
  • [17] Y. Ding, N. Robinson, Q. Zeng, D. Chen, A. A. P. Wai, T.-S. Lee, and C. Guan, “Tsception: a deep learning framework for emotion detection using eeg,” in 2020 International Joint Conference on Neural Networks (IJCNN).   IEEE, 2020, pp. 1–7.
  • [18] V. J. Lawhern, A. J. Solon, N. R. Waytowich, S. M. Gordon, C. P. Hung, and B. J. Lance, “Eegnet: a compact convolutional neural network for eeg-based brain–computer interfaces,” Journal of neural engineering, vol. 15, no. 5, p. 056013, 2018.
  • [19] M. X. Cohen, Analyzing neural time series data: theory and practice.   MIT press, 2014.
  • [20] A. H. Kemp, R. B. Silberstein, S. M. Armstrong, and P. J. Nathan, “Gender differences in the cortical electrophysiological processing of visual emotional stimuli,” NeuroImage, vol. 21, no. 2, pp. 632–646, 2004.
  • [21] R.-N. Duan, J.-Y. Zhu, and B.-L. Lu, “Differential entropy feature for eeg-based emotion classification,” in 2013 6th International IEEE/EMBS Conference on Neural Engineering (NER).   IEEE, 2013, pp. 81–84.
  • [22] J. A. Suykens and J. Vandewalle, “Least squares support vector machine classifiers,” Neural processing letters, vol. 9, no. 3, pp. 293–300, 1999.
  • [23] Z. Liang, S. Oba, and S. Ishii, “An unsupervised eeg decoding system for human emotion recognition,” Neural Networks, vol. 116, pp. 257–268, 2019.
  • [24] W.-L. Zheng and B.-L. Lu, “Investigating critical frequency bands and channels for eeg-based emotion recognition with deep neural networks,” IEEE Transactions on Autonomous Mental Development, vol. 7, no. 3, pp. 162–175, 2015.
  • [25] Y. Yang, Q. J. Wu, W.-L. Zheng, and B.-L. Lu, “Eeg-based emotion recognition using hierarchical network with subnetwork nodes,” IEEE Transactions on Cognitive and Developmental Systems, vol. 10, no. 2, pp. 408–419, 2017.
  • [26] S. Alhagry, A. A. Fahmy, and R. A. El-Khoribi, “Emotion recognition based on eeg using lstm recurrent neural network,” Emotion, vol. 8, no. 10, pp. 355–358, 2017.
  • [27] W. Tao, C. Li, R. Song, J. Cheng, Y. Liu, F. Wan, and X. Chen, “Eeg-based emotion recognition via channel-wise attention and self attention,” IEEE Transactions on Affective Computing, 2020.
  • [28] D. Zhang, L. Yao, K. Chen, S. Wang, X. Chang, and Y. Liu, “Making sense of spatio-temporal preserving representations for eeg-based human intention recognition,” IEEE transactions on cybernetics, vol. 50, no. 7, pp. 3033–3044, 2019.
  • [29] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.
  • [30] T. Hamaguchi, H. Oiwa, M. Shimbo, and Y. Matsumoto, “Knowledge transfer for out-of-knowledge-base entities: A graph neural network approach,” arXiv preprint arXiv:1706.05674, 2017.
  • [31] L. Zhao, Y. Song, C. Zhang, Y. Liu, P. Wang, T. Lin, M. Deng, and H. Li, “T-gcn: A temporal graph convolutional network for traffic prediction,” IEEE Transactions on Intelligent Transportation Systems, vol. 21, no. 9, pp. 3848–3858, 2019.
  • [32] M. Defferrard, X. Bresson, and P. Vandergheynst, “Convolutional neural networks on graphs with fast localized spectral filtering,” Advances in neural information processing systems, vol. 29, 2016.
  • [33] Y. Ding, N. Robinson, Q. Zeng, and C. Guan, “Lggnet: learning from local-global-graph representations for brain-computer interface,” arXiv preprint arXiv:2105.02786, 2021.
  • [34] M. Jin, H. Chen, Z. Li, and J. Li, “Eeg-based emotion recognition using graph convolutional network with learnable electrode relations,” in 2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC).   IEEE, 2021, pp. 5953–5957.
  • [35] O. Sporns, G. Tononi, and R. Kötter, “The human connectome: a structural description of the human brain,” PLoS computational biology, vol. 1, no. 4, p. e42, 2005.
  • [36] G. Gong, Y. He, L. Concha, C. Lebel, D. W. Gross, A. C. Evans, and C. Beaulieu, “Mapping anatomical connectivity patterns of human cerebral cortex using in vivo diffusion tensor imaging tractography,” Cerebral cortex, vol. 19, no. 3, pp. 524–536, 2009.
  • [37] E. Bullmore and O. Sporns, “The economy of brain network organization,” Nature reviews neuroscience, vol. 13, no. 5, pp. 336–349, 2012.
  • [38] G. Buzsáki, C. Geisler, D. A. Henze, and X.-J. Wang, “Interneuron diversity series: circuit complexity and axon wiring economy of cortical interneurons,” Trends in neurosciences, vol. 27, no. 4, pp. 186–193, 2004.
  • [39] G. Li, M. Müller, G. Qian, I. C. D. Perez, A. Abualshour, A. K. Thabet, and B. Ghanem, “Deepgcns: Making gcns go as deep as cnns,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
  • [40] R. Salvador, J. Suckling, M. R. Coleman, J. D. Pickard, D. Menon, and E. Bullmore, “Neurophysiological architecture of functional magnetic resonance images of human brain,” Cerebral cortex, vol. 15, no. 9, pp. 1332–1342, 2005.
  • [41] G. E. Bruder, J. W. Stewart, and P. J. McGrath, “Right brain, left brain in depressive disorders: clinical and theoretical implications of behavioral, electrophysiological and neuroimaging findings,” Neuroscience & Biobehavioral Reviews, vol. 78, pp. 178–191, 2017.
  • [42] K. H. Jawabri and S. Sharma, “Physiology, cerebral cortex functions,” 2019.
  • [43] W.-L. Zheng, J.-Y. Zhu, and B.-L. Lu, “Identifying stable patterns over time for emotion recognition from eeg,” IEEE Transactions on Affective Computing, vol. 10, no. 3, pp. 417–429, 2017.
  • [44] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, “Graph attention networks,” arXiv preprint arXiv:1710.10903, 2017.
  • [45] W.-L. Zheng, W. Liu, Y. Lu, B.-L. Lu, and A. Cichocki, “Emotionmeter: A multimodal framework for recognizing human emotions,” IEEE transactions on cybernetics, vol. 49, no. 3, pp. 1110–1122, 2018.
  • [46] W. Liu, J.-L. Qiu, W.-L. Zheng, and B.-L. Lu, “Comparing recognition performance and robustness of multimodal deep learning models for multimodal emotion recognition,” IEEE Transactions on Cognitive and Developmental Systems, 2021.
  • [47] Y. Li, W. Zheng, Y. Zong, Z. Cui, T. Zhang, and X. Zhou, “A bi-hemisphere domain adversarial neural network model for eeg emotion recognition,” IEEE Transactions on Affective Computing, 2018.
  • [48] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang, “Domain adaptation via transfer component analysis,” IEEE transactions on neural networks, vol. 22, no. 2, pp. 199–210, 2010.
  • [49] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, “Domain-adversarial training of neural networks,” The journal of machine learning research, vol. 17, no. 1, pp. 2096–2030, 2016.
  • [50] Y. Li, L. Wang, W. Zheng, Y. Zong, L. Qi, Z. Cui, T. Zhang, and T. Song, “A novel bi-hemispheric discrepancy model for eeg emotion recognition,” IEEE Transactions on Cognitive and Developmental Systems, vol. 13, no. 2, pp. 354–367, 2020.
  • [51] L.-M. Zhao, R. Li, W.-L. Zheng, and B.-L. Lu, “Classification of five emotions from eeg and eye movement signals: complementary representation properties,” in 2019 9th International IEEE/EMBS Conference on Neural Engineering (NER).   IEEE, 2019, pp. 611–614.
  • [52] R. Li, Y. Wang, and B.-L. Lu, “A multi-domain adaptive graph convolutional network for eeg-based emotion recognition,” in Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 5565–5573.
  • [53] D. Chen, Y. Lin, W. Li, P. Li, J. Zhou, and X. Sun, “Measuring and relieving the over-smoothing problem for graph neural networks from the topological view,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 04, 2020, pp. 3438–3445.

Appendix A

We implemented PGCN in PyTorch and ran all experiments on an NVIDIA RTX 2080 Ti GPU. The local module contains two GCN layers, taking a 62 × 5 input and producing a 62 × 30 output; the mesoscopic module contains two layers with a 71 × 30 output; the global module contains one GCN layer with a 71 × 70 output; and the emotion recognition head is a three-layer fully connected network. For optimization, we used the AdamW optimizer with warm-up, a learning rate of 1e-2, and a batch size of 64. We evaluate the model with the average accuracy (ACC) and standard deviation (STD).
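To make these sizes concrete, the following PyTorch sketch wires up placeholder layers with the dimensions listed above, together with the stated AdamW and warm-up settings. It is only an illustrative sketch and not the released implementation: the class name, the linear layers standing in for the actual graph convolutions, the nine virtual mesoscopic centers (inferred from the 62-to-71 node increase), the hidden sizes of the fully connected head, and the warm-up schedule are assumptions.

```python
# Minimal sketch of the reported dimensions and optimizer settings.
# NOT the released implementation; layer types and names are placeholders.
import torch
import torch.nn as nn

class PGCNSketch(nn.Module):
    def __init__(self, n_classes: int = 3):
        super().__init__()
        # Local module: two layers, 62 x 5 input -> 62 x 30 output
        self.local = nn.Sequential(nn.Linear(5, 16), nn.ReLU(), nn.Linear(16, 30))
        # Mesoscopic module: two layers, 71 x 30 output
        self.meso = nn.Sequential(nn.Linear(30, 30), nn.ReLU(), nn.Linear(30, 30))
        # Global module: one layer, 71 x 70 output
        self.glob = nn.Linear(30, 70)
        # Emotion recognition head: three fully connected layers (hidden sizes assumed)
        self.classifier = nn.Sequential(
            nn.Linear(71 * 70, 512), nn.ReLU(),
            nn.Linear(512, 64), nn.ReLU(),
            nn.Linear(64, n_classes),
        )

    def forward(self, x):                                  # x: (batch, 62, 5)
        h = self.local(x)                                  # (batch, 62, 30)
        centers = h.new_zeros(h.size(0), 9, h.size(2))     # placeholder virtual region centers
        h = torch.cat([h, centers], dim=1)                 # (batch, 71, 30)
        h = self.meso(h)                                   # (batch, 71, 30)
        h = self.glob(h)                                   # (batch, 71, 70)
        return self.classifier(h.flatten(1))               # (batch, n_classes)

model = PGCNSketch()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)
# Linear warm-up over the first few epochs; the schedule length is an assumption.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: min(1.0, (epoch + 1) / 5))
```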

Appendix B

B-1 SEED Dataset

The SEED dataset contains EEG data collected from 15 subjects (seven males and eight females) while they watched movie clips eliciting three emotions: negative, positive, and neutral. Three sessions of EEG data were recorded for each subject, each session containing 15 movie clips covering the three emotions. Each movie clip corresponds to one trial, and each trial consisted of a 5-second hint, an approximately 4-minute movie clip, a 45-second self-assessment, and a 15-second break.

In the subject-dependent experiments, we followed the experimental setup of [24, 10, 47]: for each subject, we used the first nine trials of a session as the training set and the last six trials as the testing set, and reported the mean and standard deviation over all subjects on two sessions [24, 13, 14]. In the subject-independent experiments, we followed the setup in [10, 47], performed leave-one-subject-out cross-validation, and reported the mean and standard deviation over all subjects on all three sessions.
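The two protocols can be sketched as follows, assuming a hypothetical in-memory layout in which data[subject][session] is a list of 15 (features, label) trial pairs in presentation order; the helper names are illustrative only.

```python
# Hedged sketch of the SEED evaluation protocols; the data layout and
# helper names are hypothetical.
def subject_dependent_split(session_trials):
    """First nine trials for training, last six for testing (per session)."""
    return session_trials[:9], session_trials[9:]

def leave_one_subject_out(data):
    """Yield (train_subject_ids, test_subject_id) folds over all subjects."""
    subjects = sorted(data.keys())
    for test_subject in subjects:
        train_subjects = [s for s in subjects if s != test_subject]
        yield train_subjects, test_subject
```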

B-2 SEED-IV Dataset

The SEED-IV dataset collected EEG data from 15 subjects (seven males and eight females) who watched movie clips eliciting four emotions: neutral, sad, fear, and happy. Each subject watched 24 movie clips in each of three sessions; each movie clip corresponds to one trial, and each trial lasted around two minutes.

In the subject-dependent experiments, we followed the experimental setup of [50, 13]: for each subject, to keep the classes balanced, we set aside the last two trials of each emotion in each session as the testing set and used the remaining 16 trials as the training set. In the subject-independent experiments, we followed the setup of [50, 13] and performed leave-one-subject-out cross-validation. For evaluation, we report the average accuracy over all subjects on the three sessions.
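A minimal sketch of this class-balanced split is shown below, assuming trials is a hypothetical list of (features, label) pairs for one session in presentation order; the function name is illustrative only.

```python
from collections import defaultdict

def balanced_split(trials, n_test_per_class=2):
    """Hold out the last `n_test_per_class` trials of each emotion as the
    testing set (8 trials for SEED-IV) and keep the remaining 16 for training."""
    by_label = defaultdict(list)
    for idx, (_, label) in enumerate(trials):
        by_label[label].append(idx)
    test_idx = {i for idxs in by_label.values() for i in idxs[-n_test_per_class:]}
    train = [t for i, t in enumerate(trials) if i not in test_idx]
    test = [t for i, t in enumerate(trials) if i in test_idx]
    return train, test
```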

B-3 SEED-V Dataset

The SEED-V dataset collected EEG data from 16 subjects (six males and ten females) and covers five emotions: happy, disgust, neutral, fear, and sad. Each subject watched 15 movie clips in each of three sessions.

In the subject-dependent experiments, following the previous experimental settings [51, 52], we used the first ten trials of each session as the training set and the last five trials as the testing set. In the subject-independent experiments, we likewise used leave-one-subject-out cross-validation. In both settings, we report the average accuracy and standard deviation over all subjects on the three sessions.