\hypersetup

hidelinks

Data Augmentation for Seizure Prediction
with Generative Diffusion Model

Kai Shu, Le Wu, Yuchang Zhao, Aiping Liu, Ruobing Qian, and Xun Chen This work was partially supported by the National Natural Science Foundation of China under Grants 32271431 and W2432042, the Research Project of Health Commission of Anhui Province under Grant AHWJ2022b004, and the Joint Fund for Medical Artificial Intelligence under Grant MAI2023C004. (Corresponding author: Xun Chen.)Kai Shu, Le Wu, Yuchang Zhao, Aiping Liu and Xun Chen are with the School of Information Science and Technology, University of Science and Technology of China, Hefei 230027, China (e-mail: [email protected]).Ruobing Qian is with the Epilepsy Center, Department of Neurosurgery, The First Affiliated Hospital of USTC, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei 230001, China. 0009-0007-8707-0085 0000-0002-8565-9626 0000-0002-1726-7962 0000-0001-8849-5228 0000-0002-4922-8116

Abstract

Data augmentation (DA) can significantly strengthen the electroencephalogram (EEG)-based seizure prediction methods. However, existing DA approaches are just the linear transformations of original data and cannot explore the feature space to increase diversity effectively. Therefore, we propose a novel diffusion-based DA method called DiffEEG. DiffEEG can fully explore data distribution and generate samples with high diversity, offering extra information to classifiers. It involves two processes: the diffusion process and the denoised process. In the diffusion process, the model incrementally adds noise with different scales to EEG input and converts it into random noise. In this way, the representation of data can be learned. In the denoised process, the model utilizes learned knowledge to sample synthetic data from random noise input by gradually removing noise. The randomness of input noise and the precise representation enable the synthetic samples to possess diversity while ensuring the consistency of feature space. We compared DiffEEG with original, down-sampling, sliding windows and recombination methods, and integrated them into five representative classifiers. The experiments demonstrate the effectiveness and generality of our method. With the contribution of DiffEEG, the Multi-scale CNN achieves state-of-the-art performance, with an average sensitivity, FPR, AUC of 95.4%, 0.051/h, 0.932 on the CHB-MIT database and 93.6%, 0.121/h, 0.822 on the Kaggle database.

Index Terms:

Seizure prediction, deep learning, diffusion model, data augmentation.

I Introduction

Epilepsy is caused by abnormal discharge of brain neurons and is one of the world’s most common neurological diseases [1]. It is reported by the World Health Organization (WHO) that about 50 million people are suffered from epilepsy. With the development of medical science, 70% of the patients can be seizure-free after proper treatment. However, the remaining 30% of them still cannot be controlled by drugs [2]. Therefore, seizure prediction has great value for them. A precise prediction can let patients get timely protection, improving the quality of their lives by reducing their mental stress.

Electroencephalogram (EEG) is an efficient method for capturing and monitoring electrical activity of the brain [3]. EEG recordings can be divided into three kinds depending on the position where the signals are collected. There is non-invasive scalp electroencephalogram (sEEG) with electrodes attached to the subject’s scalp, semi-invasive electrocorticogram (ECoG) with electrodes placed between the cerebral cortex and the dura mater, and invasive electroencephalogram (iEEG) with implanted electrodes [4]. EEG is widely used to diagnose epilepsy in clinical [5]. Over the last few years, studies have demonstrated that EEG can be used to predict upcoming seizures [6][7]. In practice, physicians have divided the epilepsy EEG signals into four states: preictal (the period before seizures), ictal (the duration of seizures), postictal (the period following seizures) and interictal state (the interval between seizures) [8]. According to the definitions above, seizure prediction is transferred to a binary classification task which distinguishes preictal states from interictal ones. In this way, a variety of methods have been proposed to achieve the prediction.

In the traditional machine learning methods, many kinds of manually extracted features have been used to predict seizures, including spatial domain, time domain, frequency domain, and time-frequency domain features. After feature extraction, a classifier is used to do the classification by these features [9]. For instance, Chisci et al. utilized the autoregressive coefficient as features, and a support vector machine (SVM) as the classifier [10]. However, extracting features manually requires extensive expertise, and the efficacy of seizure prediction is heavily reliant on the quality of these extracted features. Consequently, deep learning (DL) methods have gained growing attention within the last few years, where representative features can be extracted automatically instead of manually [11][12]. Researchers have designed various DL architectures to achieve high-performance seizure prediction. Li et al. applied the multi-layer perceptrons (MLP) blocks to extract spatial and temporal features [13]. Gao et al. employed a multi-scale convolutional neural network (CNN) by using the dilated convolutions to capture information at different levels [14]. In another work [15], Gao et al. proposed a Transformer-based network. The model utilized the attention mechanism to establish associations between information at different locations and strengthen the role of key information. The multiple features extracted by these networks had great effects on seizure prediction and these DL methods all obtained satisfactory performance.

It is worthy noting that DL algorithms typically need a substantial amount of labeled data to achieve good performance. However, the low-frequency property of seizure onsets causes serious insufficiency of the preictal data for a specific patient [16][17]. Thus, the composition of dataset is seriously imbalanced, leading to the overfitting during model training [18].

To solve the issue of imbalanced data, various methods have been employed. Khan et al. selected interictal samples randomly to down-sample the interictal data [19]. However, down-sampling discards too many interictal samples, which could lead to the loss of useful information. Besides, the small quantity of data after down-sampling makes it hard for models to reach the optimal result. Therefore, some researchers proposed data augmentation (DA) methods. DA is the process that generates new samples to augment a small or imbalanced dataset. Troung et al. generated more preictal segments by using an overlapping technique which slid a 30-s window along the EEG signal [20]. Zhang et al. came up with an idea of recombination. For each seizure, they split every training EEG sample into three segments, and generated new samples as a recombination of the randomly selected segments [21]. Although those traditional DA methods can increase the number of preictal samples, they are just simple transformations of the existing data, with distribution constrained by original data due to the presence of numerous repetitive parts [22]. Because of the nonstationary signals in the brain, the epileptic representation of EEG varies over time [23] and an upcoming seizure may have representation biased from the previous ones. The simple transformations of existing samples can hardly expand the data distribution to cover the variable representation.

With the advancement of generative model, an increasing number of researchers have utilized these models to address the imbalanced data [24]. Rasheed et al. used a generative adversarial network (GAN) to produce synthetic EEG feature maps and evaluated the performance of generated data by a validation method [25]. However, the training process of GAN is difficult. Once the design is improper, the gradient may disappear or explode, resulting in unstable generation quality. In addition, the model may fall into local minimum during training and can only generate samples similar to a limited number of initial data. This is known as the mode collapse problem and will lead to the lack of generation diversity [26].

Recently, diffusion models have been widely used in various fields such as computer vision (CV) and natural language processing (NLP) [27][28]. Diffusion models are a class of promising generative models, which regard data as a step-by-step diffusion of a series of random noises. They use a Markov chain to gradually add Gaussian noise to data, getting the posterior distribution probability and learning the reverse denoised process. Thus they could synthesise new samples from random noise with the learned distribution of original data [29]. One important use of the diffusion model is DA due to its strong generation ability [30][31][32]. For example, Chambon et al. utilized a pre-trained latent diffusion model to generate high-fidelity, diverse synthetic chest x-rays (CXR) and measured a 5% improvement by jointly training a classifier on synthetic and real images [33]. In this paper, we proposed a diffusion-based model named DiffEEG. In specific, DiffEEG consists of two opposite processes, namely the forward/diffusion process and the reverse/denoised process. In the diffusion process, the model adds noise with different scales to the EEG sample step by step and converts it into random noise. This procedure is conditioned on the short-time Fourier transform (STFT) spectrogram of the input EEG to provide the guiding time-frequency features. By minimizing the loss between the output of the network and the added noise, the feature distribution of preictal data could be fully explored. In the denoised process, based on the learned conditional probability distribution, the synthetic data could be sampled from random noise input by gradually removing the noise with different scales. With the randomness of initial noise and the precise representation, the generated samples could have new information and diversity while ensuring the consistency of feature space, narrowing the distance between data clusters composed by initial samples from each seizure [34]. The distribution illustration of down-sampling, sliding windows, recombination and DiffEEG is presented in Fig. 1. Different from the existing DA methods, DiffEEG is capable of utilizing the learned distribution information to generate samples of diversity and provide additional information for model training.

Refer to caption — Figure 1: The distribution illustration of down-sampling, sliding windows, recombination and DiffEEG. Each cluster represents the preictal data of a seizure. The clusters of different preictal data exhibit both similar and distinct distribution. (a) Down-sampling of interictal samples may cause the loss of useful information. (b) When using the sliding windows, the distribution of generated samples is between the front and rear samples. (c) The recombination is implemented within each seizure, because samples recombined by segments from different seizures have poor authenticity. The distribution of recombined samples is at the center of the three component samples and thus limited by them. (d) Instead, DiffEEG could fully explore the feature space and expand the distribution to outward area. The generated samples connect different clusters into a whole.

We train DiffEEG with the patient’s preictal EEG signals so that our model can generate synthetic preictal data until we have an equal number of samples per class. The generation quality is improved with the reduction of the training loss. The samples are then given to a classifier to conduct seizure prediction task. We used five representative network frameworks (SVM, Spatio-temporal MLP, EEGNet, Multi-scale CNN, Transformer) as classifiers on the CHB-MIT and Kaggle dataset to verify the universality and generality of our model. In addition, we compared our model with that using just the original data and those using three existing methods for solving the data imbalance, including down-sampling, sliding windows and recombination, to evaluate the effectiveness and superiority of DiffEEG.

The main contributions of our work are as follows:

•

We propose a method of using diffusion model to solve the imbalanced data problem, which is the first work to introduce diffusion model to the seizure prediction field.
•

We verify that our diffusion model is superior to the existing DA methods and can be utilized in almost any other classification network to substantially improve the seizure prediction performance.
•

With the improvement of DiffEEG, the best prediction performance in our experiments outperforms that of most state-of-the-art (SOTA) methods.

The rest of the paper is structured as follows. Section II details our proposed methods, including the theory of diffusion model, the structure of DiffEEG and five classifiers. Section III shows the datasets, our experimental results and comparison with SOTA works. Section IV discusses the results of our methods. Finally, we conclude our work in Section V.

II Methodology

II-A Diffusion Model

Since its inception, the diffusion model has shown its remarkable advantages in various fields, offering a unified loss function with high simplicity and a stable training process with adequate theoretical support [35]. The diffusion model aims to transform the prior data into random noise and revise the transformations step by step to rebuild a brand new sample with the same distribution as the prior. We define $x_{t}\in R^{H\times L}$ for $t$ = 0, 1, 2, … , $T$ as the data at diffusion step $t$ , where $H$ is the EEG channel and $L$ is the length of EEG segment. The process that transforms the starting state $x_{0}$ into random noise $x_{T}$ through a Markov chain $q(x_{t}|x_{t-1})$ is the forward/diffusion process:

q(x_{1},...\ ,x_{T}|x_{0})=\prod_{i=1}^{T}q(x_{t}|x_{t-1}).

(1)

The process that converts random noise $x_{T}$ to data $x_{0}$ with the starting distribution through a Markov chain $p_{\theta}(x_{t-1}|x_{t})$ is called the reverse/denoised process, which is parameterized by $\theta$ :

p_{\theta}(x_{0},...\ ,x_{T-1}|x_{T})=\prod_{i=1}^{T}p_{\theta}(x_{t-1}|x_{t}).

(2)

The diffusion and reverse process are showed in Fig. 2. Given $x_{T}\sim N(0,I)$ , we can sample $x_{t-1}\sim p_{\theta}(x_{t-1}|x_{t})$ step by step, until we get the final synthetic data $x_{0}$ with the same distribution as real data.

Based on the variance schedule $\{\beta_{t}\}_{t=1}^{T}$ of the noise added during the diffusion process, some constants about noise can be further defined:

\alpha_{t}=1-\beta_{t},\ \overline{\alpha}_{t}=\prod_{s=1}^{t}\alpha_{s},\ \widetilde{\beta}_{t}=\begin{cases}\beta_{1},&t=1\\ \frac{1-\overline{\alpha}_{t-1}}{1-\overline{\alpha}_{t}}\beta_{t},&t>1\\ \end{cases}.

(3)

We then define the mean $\mu_{\theta}$ and standard deviation $\sigma_{\theta}$ of the conditional probability distribution from $x_{t}$ to $x_{t-1}$ :

	$\displaystyle\mu_{\theta}(x_{t},t)=\frac{1}{\sqrt{\alpha_{t}}}(x_{t}-\frac{\beta_{t}}{\sqrt{1-\overline{\alpha}_{t}}}\varepsilon_{\theta}(x_{t},t)),$
	$\displaystyle\sigma_{\theta}(x_{t},t)=\sqrt{\widetilde{\beta}_{t}}.$		(4)

$\varepsilon_{\theta}:R^{H\times L}\rightarrow R^{H\times L}$ is the diffusion network, with inputs $x_{0}$ , Gaussian noise $\varepsilon\sim N(0,I)$ and diffusion step $t$ .

The training of diffusion model is to maximize its variational lower bound (ELBO):

ELBO=E_{q(x_{0},...,x_{T})}log\frac{p_{\theta}(x_{0},...\ ,x_{T-1}|x_{T})\cdot p(x_{T})}{q(x_{1},...\ ,x_{T}|x_{0})}.

(5)

It is proved that high generation quality can be achieved by minimizing the unweighted variant of ELBO [29], which is used as the training objective of our model:

\min\limits_{\theta}L(\theta)=E_{x_{0},\varepsilon,t}\|\varepsilon-\varepsilon_{\theta}(\sqrt{\overline{\alpha}_{t}}x_{0}+\sqrt{1-\overline{\alpha}_{t}}\varepsilon,\ t)\|_{2}^{2}.

(6)

The training and sampling procedures are in Algorithm 1 and 2, respectively.

Algorithm 1 Training

repeat

Sample

x_{0}\sim q_{data}

\varepsilon\sim N(0,I)

, and

t

\sim

Uniform({1, 2, … ,

T

})Take gradient descent step on

\nabla_{\theta}\|\varepsilon-\varepsilon_{\theta}(\sqrt{\overline{\alpha}_{t}}x_{0}+\sqrt{1-\overline{\alpha}_{t}}\varepsilon,\ t)\|_{2}^{2}

until converged

Algorithm 2 Sampling

Sample

x_{T}\sim N(0,I)

for

t

T

T

-1, … , 1 do

Compute

\mu_{\theta}

(

x_{t}

,t) and

\sigma_{\theta}

(

x_{t}

,t)Sample

x_{t-1}\sim p_{\theta}(x_{t-1}|x_{t})

= N(

x_{t-1};\mu_{\theta}(x_{t},t),\sigma_{\theta}(x_{t},t)^{2}I

)

end for

According to Algorithm 2, given $x_{T}\sim N(0,I)$ , $x_{T-1}$ can be generated by sampling the standard Gaussian noise $z$ :

x_{T-1}=\frac{1}{\sqrt{\alpha_{T}}}(x_{T}-\frac{\beta_{T}}{\sqrt{1-\overline{\alpha}_{T}}}\varepsilon_{\theta}(x_{T},T))+\sigma_{\theta}(x_{T},T)\cdot z.

(7)

$\frac{\beta_{T}}{\sqrt{1-\overline{\alpha}_{T}}}\varepsilon_{\theta}(x_{T},T)$ is the Gaussian noise removed in the first step, which is estimated by the trained diffusion model. It’s worth noting that only the combination of mutually independent Gaussian distributions can be Gaussian, but the removed noise is related to $x_{T}$ . As a consequence, $x_{T-1}$ is not necessarily Gaussian. In this manner, the model continues to sample $x_{T-2}$ , $x_{T-3}$ , … , $x_{0}$ . The final generated sample $x_{0}$ may not necessarily be Gaussian either. Hence, the performance of the diffusion model is not affected by whether the data distribution is Gaussian. The model can adapt to generation tasks under any data distribution and synthesis samples conforming to the corresponding distribution, which is one of the reasons why it is widely used in fields such as CV, NLP and others.

In order to shorten the training and sampling time, acceleration strategies are adopted in both the training and inference processes. During the training process, a random-step training strategy is utilized, which has been widely used since its initial proposal by Ho et al. in Denoising Diffusion Probabilistic Models (DDPM) [29]. Based on the properties of the Gaussian distribution, $x_{t}$ can be obtained from $x_{0}$ by adding noise in one step:

x_{t}=\sqrt{\overline{\alpha}_{t}}x_{0}+\sqrt{1-\overline{\alpha}_{t}}{\epsilon}.

(8)

Compared with full-step training, random-step training substantially reduces the computational overhead while ensuring that the model has the opportunity to learn the data distribution at all diffusion steps.

The acceleration strategy used in the inference process is the skip-step sampling scheme proposed by Song et al. in Denoising Diffusion Implicit Models (DDIM) [36]. In order to implement skip-step sampling, it is necessary to construct a non-Markovian noise-adding process. Song et al. proved that the loss function of this process is

	$\displaystyle L=KL(p(x_{t-1}\|x_{t})\|\|p_{\theta}(x_{t-1}\|x_{t}))\propto$
	$\displaystyle\frac{(\sqrt{1-{\alpha}_{t-1}-{\sigma}_{t}^{2}}-\sqrt{\frac{{\alpha}_{t-1}(1-{\alpha}_{t})}{{\alpha}_{t}}})^{2}}{2{\sigma}_{t}}\|\|{\epsilon}_{\theta}(x_{t},t)-{\epsilon}_{t}\|\|^{2},$		(9)

which remains consistent with equation (6), the loss function of the Markovian form in DDPM. This allows the continue utilization of the objective function from DDPM to train the network, while the inference process, not bound by the Markovian property, can employ the skip-step method to accelerate the sampling speed.

II-B DiffEEG Architecture

The raw EEG signals are characterized by serious non-stationarity. Their spectral composition keeps varying over time, making it difficult for models to learn the representation in the time or frequency domain alone [37]. The short-time Fourier transform (STFT) is extensively applied in the analysis of non-stationary signals. It can reflect the temporal variations of the spectral component and capture the epileptic representation of EEG signals. As a matter of fact, many studies have adopted STFT instead of raw EEG signals as inputs to the classification networks for seizure prediction [20] [38]. Therefore, we utilized the STFT spectrum in our DiffEEG model as the conditional guidance.

Based on the characteristic of multi-channel EEG data, we have proposed the diffusion-based network DiffEEG with conditional guidance. The network is composed of the input block, the residual block and the output block. Fig. 3 shows the whole architecture.

The input block consists of three layers. The input layer adds noise to the input EEG data to diffuse it forward by one step, and uses a 1×1 convolution (Conv) with $C$ kernels to map the data into $C$ residual channels. The 1×1 Conv is a way to increase the non-linearity of the network without affecting the receptive fields [39]. It is utilized to raise the channel dimension while preserving the original shape of the data. The step-embedding layer is to encode the current diffusion-step and transform it into a $C$ -dimensional embedding vector. As the added noise and the output of the model vary with different diffusion-step, it is necessary to inject the step information into the network. The sine and cosine functions mentioned in [40] are applied to convert the step $t$ into a learnable embedding $t_{eb}$ , as showed in equation (10). These functions are selected due to the property that for any given offset $k$ , $t_{eb}(x)$ can be expressed as a linear function of $t_{eb}(x+k)$ , making it easy for the model to learn the relative positions between steps. Three fully connected (FC) layers are subsequently employed to project the embedding into the high-dimensional feature space and learn the step information. The condition layer utilizes the STFT spectrogram of the original signal as the conditional guidance. This condition is capable of providing time-frequency information for the network to understand the complex EEG rhythms, by learning the correlation between the fluctuations in EEG and the frequency content behind them. For CHB-MIT dataset, we set the STFT window length to 256 with no overlap while for Kaggle dataset the window length is 400 with 300 overlap. The STFT spectrogram is then up-sampled by two transposed 2-D convolutions to acquire the same shape as the EEG input in time domain. The transposed convolution can learn the up-sampling parameters that best suit the characteristics of the data. At last, there is a 1×1 Conv with 2 $C$ kernels to merge the features across different frequencies.

\begin{split}t_{eb}=[\sin{(10^{\frac{0\times 4}{63}}t)},...\ ,\sin{(10^{\frac{63\times 4}{63}}t)},\\ \cos{(10^{\frac{0\times 4}{63}}t)},...\ ,\cos{(10^{\frac{63\times 4}{63}}t)}]\end{split}

(10)

The residual block contains $N$ residual layers with $C$ residual channels. In each residual layer, we employ a bidirectional dilated convolution (Bi-DilConv) [41] whose dilation is doubled between layers. The dilated convolution introduces intervals within the kernels and enlarges the sliding stride, thereby expanding the receptive field of each neuron without increasing the computational overhead. The exponential growth in dilation is to provide a multi-scale receptive field, learning both local and global information. Meanwhile, we take the modulus of dilation by 10 to prevent the exponential explosion. The diffusion-step embedding and STFT condition from the input block are added as the bias terms before and after the Bi-DilConv. Then a gated-tanh unit (GTU) is applied to learn the nonlinear features [42]. The GTU is a combination of the tanh and sigmoid activation functions. The sigmoid activation acts as a gate for the tanh activation, regulating the weight of the tanh activation’s output. At the end of each residual layer is a 1×1 Conv with 2 $C$ kernels to linearly combine the features from different channels. We take advantage of the residual structure in [43]. The input of each residual layer is divided into two branches: one passes through the Bi-DilConv, GTU and 1×1 Conv as in a general network, while the other is directly connected to the output of the 1×1 Conv. The output of the 1×1 Conv is also divided into two branches: one is added with the input and connect to the next layer, while the other skips all the subsequent residual layers and is directly fed into the final output block. The purpose of this structure is to mitigate the vanishing gradient problem because the gradient of the input and output can propagate directly to the end of the network without traversing the intermediate layers.

The output block involves two layers. The skip-connection layer sums the outcomes from $N$ residual layers to integrate the features of different scales. The output layer utilizes two 1×1 convolutions to transform the data into the final output with the same dimensions as the input EEG data.

II-C Classifier Architecture

We choose five representative SOTA network frameworks as classifiers. They are SVM, Spatio-temporal MLP [13], EEGNet [44], Multi-scale CNN [14] and Transformer [15].

II-C1 SVM

SVM is a commonly used machine learning classifier with strong and stable classification capacity. It aims to find an optimal hyperplane that separates different classes of samples. Support vectors are the samples closest to the hyperplane, which play a crucial role in determining the position of hyperplane. By maximizing the distance between the support vectors and the hyperplane, SVM can provide powerful generalization and robustness. Before employing SVM, features need to be extracted manually from the EEG data. We adopted the features used in [11], including 4 statistical moments: mean, variance, skewness and kurtosis, 8 spectral band power: $\delta$ (0.5-4 Hz), $\theta$ (4-8 Hz), $\alpha$ (8-13 Hz), $\beta$ (13-30 Hz), $\gamma$ -1 (30-50 Hz), $\gamma$ -2 (50-75 Hz), $\gamma$ -3 (75-100 Hz) and $\gamma$ -4 (100-128 Hz), and 2 Hjorth parameters: mobility and complexity.

II-C2 Spatio-temporal MLP

Spatio-temporal MLP is composed by a preprocessing block and an MLP block. The preprocessing block contains three layers: a denoising layer to remove the muscle and ocular artifacts [45], a weighted layer to strengthen effects of the key channels, and a reduction layer to reduce the computation cost. The MLP block involves an inter-channel layer that learns the spatial correlation between channels and an intra-channel layer that extracts information within channels. At last, there is an average pooling layer followed by the FC layer to obtain the probability of each seizure state.

II-C3 EEGNet

EEGNet is a compact CNN that has been successfully applied to various EEG classification tasks [44]. There are two blocks in EEGNet. The first block starts with a convolution to capture the frequency information, followed by a depthwise convolution to learn the frequency-specific spatial filters and reduce the number of trainable parameters. In the second block, a separable convolution is applied to learn how to individually summarize the feature maps and optimally combine them together. As the original EEGNet performed poorly when dealing with the epileptic EEG data [46], we add two 1D CNN before the original layers to adequately extract the representation from each channel and concentrate on the important information. Meanwhile, we modify the size of convolution and average pooling kernels according to the sampling rate of our EEG signals. The structure of the modified EEGNet is shown in Fig. 4.

II-C4 Multi-scale CNN

The structure of Multi-scale CNN can be divided into two parts: the temporal multi-scale part and the spatial multi-scale part. The dilated convolution is employed in both parts to broaden the receptive field and learn global information without increasing extra arguments. In each part, the convolutions with large kernels focus on information during a long term or over a large area of brain, while convolutions with small kernels capture the local information. Afterwards, an attention-based feature-weighted fusion method is taken to reduce the redundancy and fuse the features. The features are finally put into a four-layer CNN classifier.

II-C5 Transformer

Transformer is a DL network characterized by the self-attention mechanism [40][47]. Transformer models are generally composed of an encoder and a decoder. We only take advantage of the encoder part because the purpose of our task is classification. The Transformer-based model we used mainly consists of three modules: the input embedding module to diminish the input dimension and extract initial features, the positional encoding module to learn positional information which is then added to the embedding vector, and the attention module to explore associations between information at different locations and assign weights to them.

II-D Training and Testing

In our work, we use the leave-one-out cross-validation strategy. Specifically, for a patient with $N$ seizures, we have $N$ parts of preictal data. So the whole interictal data is divided into $N$ equal parts accordingly. For every cross-validation, we take one part from the interictal data followed by one part from the preictal data as the test set while the rest of the interictal and preictal data serve as the training set. Due to the low-frequency of epileptic seizures, the total duration of the preictal data is significantly less than that of the interictal data. As we can see in Fig. 5, all subjects in the CHB-MIT and Kaggle database have an interictal-to-preictal ratio greater than 4:1. Some of them even exceed a ratio of 20:1. In order to avoid the adverse effects of imbalanced data on predictive models [48], we employ four methods to solve the imbalanced problem:

•

Down-sampling [19]: Randomly picking samples from the interictal data until we get the same number as the preictal samples.
•

Sliding windows [20]: Sliding windows of length $W$ at every step $S$ to create samples with overlapping parts.
•

Recombination [21]: We first divide each preictal samples into three segments, and then randomly pick three segments to recombine new artificial samples.
•

Diffusion model: We use the proposed DiffEEG to generate high-quality preictal samples.

We randomly cut out EEG samples from the training set and send their STFT spectrograms to the trained model as condition. To further raise the diversity of produced samples, besides the randomly selected spectrograms, we also make use of the recombined spectrograms to each generate half of the data. We randomly pick out three preictal samples from the training set, divide the time dimension of their STFT spectrogram into three segments, and randomly choose one sub-spectrogram from each samples for recombination. The spectrogram is then put into the trained DiffEEG to generate EEG signal. After generating enough preictal samples to equal the interictal samples, we mix them randomly and select 25% of the later samples in the training set for validation and the rest for training. Early-stop strategy is utilized to avoid overfitting. The training stops unless the sensitivity and specificity are both the best for five epochs and the best model is then returned. Fig. 6 is a flowchart of the main process.

II-E Postprocessing

To reduce false alarms, we use the $k$ -of- $n$ method proposed in [20]. The alarm will be triggered only if at least $k$ out of the last $n$ predictions are judged as preictal. We set $k$ to 8 and $n$ to 10 in our experiments. In addition, a refractory period of 30 minutes is implemented to avoid frequent alarms in the short term.

III Experiments and Results

In this section, comparative experiments are conducted on the CHB-MIT and Kaggle database. The details of the two datasets and the evaluation metrics of our experiments are presented, followed by the experimental results and comparisons with several SOTA methods.

III-A Dataset

The CHB-MIT dataset records the sEEG data from 23 patients. It involves recordings for 844 hours with 163 seizure events, at a sampling rate of 256 Hz [49]. As the electrode channels of different patients vary from 18 to 22, we select the 18 common electrodes for our research [21]. We follow the definitions of preictal and interictal period in [20], and consider seizures with an interval of less than 15 minutes as a single seizure. Besides, We only focus on patients experiencing less than 10 seizures per day since it is not essential to perform the prediction task for patients who have an average seizure frequency of every 2 hours. Moreover, we select patients who experienced more than three seizures and had an interictal-to-preictal ratio greater than 2:1. This ensures that we have sufficient data to train the DiffEEG model and necessitates the generation of preictal samples. Based on these considerations, there are 13 patients that meet our requirements. The data information is listed in Table I.

TABLE I: Subject information of the CHB-MIT sEEG database.

Patient	Interictal times (h)	Preictal times (h)	No. of seizures
Pat1	17	3.5	7
Pat2	23	1.5	3
Pat3	22	3	6
Pat5	14	2.5	5
Pat9	46.3	2	4
Pat10	26	3.4	7
Pat13	14	3	7
Pat17	10	1.5	3
Pat18	24	2.5	5
Pat19	25	1.5	3
Pat20	20	3.9	8
Pat21	23.4	2	4
Pat23	12.9	3.2	7
Total	277.6	33.5	69

The Kaggle dataset records the iEEG data from five dogs and two patients. It contains long-term recordings with 48 seizures [50]. For dogs, the electrode channels are 16 for Dog1-Dog4 and 15 for Dog5. The sampling rate is 400 Hz. For patients, the electrode channels are 15 for patient1 and 24 for patient2. The sampling rate is 5 kHz. As the high sampling rate could cause an increase in data dimension and model complexity, we don’t test our method on these two patients. The data information of five dogs is listed in Table II.

TABLE II: Subject information of the Kaggle iEEG database.

Subject	Interictal times (h)	Preictal times (h)	No. of seizures
Dog1	80	4	4
Dog2	83.3	7	7
Dog3	240	12	12
Dog4	134	14	14
Dog5	75	5	5
Total	612.3	42	42

III-B Experiments and Evaluation Metrics

Before the experiments, it is necessary to define the seizure prediction horizon (SPH) and the seizure occurrence period (SOP). We adopt the definitions proposed by Maiwald et al. [51]. The SPH represents the time interval between the seizure alarm and the actual onset of seizure, while the SOP represents the period where seizures are predicted to happen. For a correct prediction, the seizure onset must fall within the SOP. If the system returns a preictal result but no seizure occurs during the SOP, then there is a false alarm. In our experiments, the SPH and SOP are set to 1 minute and 30 minutes for the CHB-MIT database, while for the Kaggle database, the SPH and SOP are set to 5 minutes and 60 minutes.

Seizure prediction experiments are carried on in an event-based way or segment-based way. According to [52], the event-based prediction is more representative of practical situations and holds greater significance in real-world applications, so we take the event-based way. We then apply four evaluation metrics to estimate the prediction results. Sensitivity (Sens) is defined as the proportion of successfully predicted seizures to all seizures. False prediction rate (FPR) is defined as the number of false alarms within an hour. Area under the curve (AUC) can reflect the quality of classification performance, where a value of 0.5 represents a random classification and a value of 1 signifies a perfect classifier.

p-value can determine whether the prediction system demonstrates statistical superiority over a random predictor. According to [15], the sensitivity $S_{c}$ of chance predictor can be formulated as equation (11), where ${\lambda}_{w}$ represents the Poisson rate parameter, ${\tau}_{w_{0}}$ represents SPH and ${\tau}_{w}$ denotes the sum of SPH and SOP.

S_{c}=1-exp(-{\lambda}_{w}{\tau}_{w}+(1-e^{-{\lambda}_{w}{\tau}_{w_{0}}}))

(11)

Under the assumption that the prediction system successfully predicts $n$ out of $N$ seizures, the random predictor can outperform the prediction system if it correctly predicts at least $n$ seizures. The p-value can be calculated as below at a significant level of 0.05:

p\text{-}value=1-\sum_{i=0}^{n-1}\binom{N}{i}S_{c}^{i}(1-S_{c})^{N-i},\ \text{for}\ \frac{n}{N}\geq S_{c}.

(12)

III-C Results and Comparison

To study the impact of different time steps on the generation quality of the DiffEEG model, we select four step values (50, 100, 200, 300) and train the model for 40 epochs respectively. Since the quality of synthetic samples is hard to quantitatively assess, we utilize the training loss as a proxy. A lower loss implies that the model has learned the data distribution and epileptic representation more comprehensively, resulting in higher quality of the generated samples. To observe the experimental results dynamically, we plot the loss curves against epochs for the four diffusion steps. In Fig. 7, we display the curves of two subjects from the CHB-MIT and Kaggle dataset respectively. In the meantime, the average converged loss and the standard deviation between losses of each subject are calculated for an overall comparison. From the curves, it can be observed that as time step increases from 50 to 200, the converged loss value significantly decreases. However, when time step is increased to 300, the decrease in the converged loss is not substantial. Therefore, we opt for 200 as the final value of diffusion step.

To demonstrate the effectiveness of using DiffEEG for DA, we conduct comparative experiments with classification networks using just the original data and the augmented data generated by DiffEEG respectively. To explore the superiority of DiffEEG in solving the imbalanced data problem, we conduct the contrast experiments with models using the down-sampling, sliding windows and recombination methods. In the meantime, we utilize five representative network frameworks: SVM, Spatio-temporal MLP, EEGNet, Multi-scale CNN, Transformer as classifiers and test the prediction performance on the CHB-MIT and Kaggle database, to validate the universality of our approach. The performance is concluded in Table III $\sim$ Table XII. To make the comparative effectiveness more obvious, the best results among all methods are marked with bold.

TABLE III: Performance of SVM on CHB-MIT database.

Patient	Original				Down-sampling				Sliding windows				Recombination				DiffEEG
	Sens (%)	FPR/h	AUC	p-value	Sens (%)	FPR/h	AUC	p-value	Sens (%)	FPR/h	AUC	p-value	Sens (%)	FPR/h	AUC	p-value	Sens (%)	FPR/h	AUC	p-value
Pat1	100.0	0.000	0.9669	<0.001	100.0	0.000	0.9869	<0.001	100.0	0.000	0.9711	<0.001	100.0	0.000	0.9819	<0.001	100.0	0.000	0.9821	<0.001
Pat2	66.7	0.000	0.7776	0.003	66.7	0.043	0.8104	0.009	66.7	0.000	0.8227	0.007	66.7	0.000	0.7773	0.004	66.7	0.000	0.8228	0.005
Pat3	66.7	0.136	0.7551	<0.001	100.0	0.318	0.8301	<0.001	100.0	0.136	0.8467	<0.001	100.0	0.227	0.8444	<0.001	100.0	0.091	0.8496	<0.001
Pat5	80.0	0.071	0.8071	<0.001	80.0	0.071	0.8609	0.002	100.0	0.071	0.8471	<0.001	80.0	0.071	0.8292	0.001	100.0	0.071	0.8332	<0.001
Pat9	25.0	0.022	0.6035	0.054	50.0	0.281	0.6954	0.075	50.0	0.259	0.7188	0.059	50.0	0.216	0.6729	0.068	50.0	0.259	0.6878	0.087
Pat10	71.4	0.346	0.6953	0.002	71.4	0.500	0.7119	0.009	57.1	0.462	0.7499	0.040	71.4	0.423	0.7288	0.004	85.7	0.462	0.7714	<0.001
Pat13	100.0	0.214	0.8867	<0.001	100.0	0.357	0.9125	<0.001	100.0	0.357	0.9057	<0.001	100.0	0.357	0.8983	<0.001	100.0	0.357	0.8899	<0.001
Pat17	66.7	0.200	0.7491	0.060	66.7	0.400	0.7375	0.090	66.7	0.200	0.7538	0.060	66.7	0.200	0.7480	0.070	66.7	0.200	0.7511	0.066
Pat18	60.0	0.000	0.7912	0.002	60.0	0.083	0.7967	0.005	60.0	0.000	0.7969	0.002	60.0	0.041	0.7845	0.002	80.0	0.083	0.8144	<0.001
Pat19	100.0	0.000	0.8000	<0.001	100.0	0.120	0.8867	<0.001	100.0	0.040	0.8885	<0.001	100.0	0.040	0.8883	<0.001	100.0	0.000	0.8990	<0.001
Pat20	100.0	0.050	0.9807	<0.001	100.0	0.150	0.9748	<0.001	100.0	0.050	0.9831	<0.001	100.0	0.100	0.9850	<0.001	100.0	0.000	0.9828	<0.001
Pat21	100.0	0.214	0.9297	<0.001	100.0	0.299	0.9186	0.001	100.0	0.214	0.9401	<0.001	100.0	0.256	0.9251	<0.001	100.0	0.256	0.9457	<0.001
Pat23	100.0	0.077	0.9880	<0.001	100.0	0.077	0.9849	<0.001	100.0	0.077	0.9916	<0.001	100.0	0.077	0.9903	<0.001	100.0	0.077	0.9903	<0.001
Ave	79.7	0.102	0.825	-	84.2	0.208	0.854	-	84.7	0.144	0.863	-	84.2	0.155	0.850	-	88.4	0.143	0.863	-

TABLE IV: Performance of Spatio-temporal MLP on CHB-MIT database.

Patient	Original				Down-sampling				Sliding windows				Recombination				DiffEEG
	Sens (%)	FPR/h	AUC	p-value	Sens (%)	FPR/h	AUC	p-value	Sens (%)	FPR/h	AUC	p-value	Sens (%)	FPR/h	AUC	p-value	Sens (%)	FPR/h	AUC	p-value
Pat1	100.0	0.000	0.9663	<0.001	100.0	0.000	0.9806	<0.001	100.0	0.000	0.9711	<0.001	100.0	0.000	0.9834	<0.001	100.0	0.000	0.9935	<0.001
Pat2	66.7	0.000	0.6302	<0.001	100.0	0.043	0.7336	<0.001	100.0	0.000	0.7755	<0.001	100.0	0.000	0.8144	<0.001	100.0	0.000	0.8537	<0.001
Pat3	100.0	0.045	0.8656	<0.001	83.3	0.091	0.8451	<0.001	100.0	0.045	0.8966	<0.001	100.0	0.091	0.8611	<0.001	100.0	0.045	0.9171	<0.001
Pat5	80.0	0.000	0.8424	0.002	100.0	0.285	0.8380	<0.001	100.0	0.071	0.9082	<0.001	100.0	0.142	0.8979	0.001	100.0	0.000	0.9458	<0.001
Pat9	50.0	0.022	0.6085	<0.001	75.0	0.216	0.6951	<0.001	75.0	0.064	0.6262	<0.001	75.0	0.064	0.7101	<0.001	75.0	0.043	0.7651	<0.001
Pat10	57.1	0.192	0.7042	0.005	71.4	0.269	0.7630	0.192	71.4	0.038	0.8190	0.032	57.1	0.153	0.7436	0.084	85.7	0.038	0.8435	0.001
Pat13	100.0	0.285	0.8578	<0.001	100.0	0.285	0.8903	<0.001	100.0	0.285	0.8683	<0.001	100.0	0.285	0.8913	<0.001	100.0	0.214	0.9086	<0.001
Pat17	66.7	0.300	0.7245	0.325	100.0	0.400	0.8044	0.199	100.0	0.400	0.7883	0.114	100.0	0.300	0.8082	0.057	100.0	0.200	0.8740	0.002
Pat18	80.0	0.042	0.8897	0.001	80.0	0.083	0.8012	0.005	80.0	0.000	0.8933	<0.001	80.0	0.042	0.8722	0.003	80.0	0.000	0.9071	<0.001
Pat19	100.0	0.000	0.9570	<0.001	66.7	0.000	0.6786	<0.001	66.7	0.000	0.8013	0.008	66.7	0.000	0.7950	0.008	100.0	0.000	0.9488	<0.001
Pat20	87.5	0.050	0.8949	<0.001	100.0	0.150	0.9380	<0.001	100.0	0.050	0.9542	<0.001	100.0	0.050	0.9454	<0.001	100.0	0.050	0.9792	<0.001
Pat21	50.0	0.214	0.6013	<0.001	100.0	0.427	0.7237	0.011	100.0	0.214	0.7182	<0.001	100.0	0.214	0.7683	<0.001	100.0	0.214	0.8198	<0.001
Pat23	100.0	0.000	0.9930	<0.001	100.0	0.000	0.9867	<0.001	100.0	0.000	0.9795	<0.001	100.0	0.000	0.9923	<0.001	100.0	0.000	0.9969	<0.001
Ave	79.8	0.089	0.810	-	90.5	0.173	0.821	-	91.8	0.090	0.846	-	90.7	0.103	0.853	-	95.4	0.062	0.904	-

TABLE V: Performance of EEGNet on CHB-MIT database.

Patient	Original				Down-sampling				Sliding windows				Recombination				DiffEEG
	Sens (%)	FPR/h	AUC	p-value	Sens (%)	FPR/h	AUC	p-value	Sens (%)	FPR/h	AUC	p-value	Sens (%)	FPR/h	AUC	p-value	Sens (%)	FPR/h	AUC	p-value
Pat1	100.0	0.000	0.9702	<0.001	100.0	0.000	0.9671	<0.001	100.0	0.000	0.9819	<0.001	100.0	0.000	0.9785	<0.001	100.0	0.000	0.9857	<0.001
Pat2	0.0	0.000	0.5068	1.000	33.3	0.087	0.5664	0.121	66.7	0.000	0.6186	0.036	33.3	0.000	0.5690	0.091	66.7	0.000	0.7150	0.037
Pat3	66.7	0.000	0.7938	<0.001	66.7	0.091	0.7796	<0.001	83.3	0.000	0.8638	<0.001	66.7	0.000	0.8566	0.001	83.3	0.000	0.8783	<0.001
Pat5	80.0	0.000	0.8621	<0.001	100.0	0.357	0.8056	<0.001	100.0	0.000	0.9489	<0.001	100.0	0.000	0.9131	<0.001	100.0	0.000	0.9639	<0.001
Pat9	0.0	0.000	0.5000	1.000	75.0	0.367	0.6563	0.009	50.0	0.000	0.7140	0.197	75.0	0.043	0.7425	0.005	75.0	0.086	0.7712	<0.001
Pat10	57.1	0.077	0.6878	<0.001	85.7	0.346	0.6906	0.012	85.7	0.038	0.8288	<0.001	85.7	0.115	0.8062	<0.001	85.7	0.077	0.8610	<0.001
Pat13	100.0	0.286	0.8799	<0.001	100.0	0.286	0.8783	<0.001	100.0	0.286	0.9060	<0.001	100.0	0.286	0.9046	<0.001	100.0	0.286	0.9278	<0.001
Pat17	66.7	0.200	0.7361	0.037	100.0	0.200	0.7793	0.001	66.7	0.300	0.7538	0.052	66.7	0.300	0.7688	0.075	100.0	0.200	0.9433	0.044
Pat18	40.0	0.042	0.6685	0.020	80.0	0.208	0.6976	<0.001	80.0	0.000	0.8467	<0.001	80.0	0.042	0.8317	<0.001	80.0	0.042	0.8520	<0.001
Pat19	100.0	0.000	0.8264	0.002	66.7	0.040	0.7043	0.004	100.0	0.000	0.9114	<0.001	100.0	0.000	0.8969	<0.001	100.0	0.000	0.9719	<0.001
Pat20	100.0	0.050	0.9430	<0.001	100.0	0.150	0.9429	<0.001	100.0	0.100	0.9553	<0.001	100.0	0.100	0.9585	<0.001	100.0	0.050	0.9761	<0.001
Pat21	100.0	0.128	0.7296	<0.001	100.0	0.299	0.7847	0.001	100.0	0.256	0.8484	<0.001	100.0	0.256	0.8182	<0.001	100.0	0.214	0.8746	<0.001
Pat23	100.0	0.000	0.9862	<0.001	100.0	0.077	0.9673	<0.001	100.0	0.000	0.9977	<0.001	100.0	0.000	0.9945	<0.001	100.0	0.000	0.9984	<0.001
Ave	70.0	0.060	0.776	-	85.2	0.193	0.786	-	87.1	0.075	0.860	-	85.2	0.088	0.849	-	91.6	0.073	0.901	-

TABLE VI: Performance of Multi-scale CNN on CHB-MIT database.

Patient	Original				Down-sampling				Sliding windows				Recombination				DiffEEG
	Sens (%)	FPR/h	AUC	p-value	Sens (%)	FPR/h	AUC	p-value	Sens (%)	FPR/h	AUC	p-value	Sens (%)	FPR/h	AUC	p-value	Sens (%)	FPR/h	AUC	p-value
Pat1	100.0	0.000	0.9721	<0.001	100.0	0.000	0.9799	<0.001	100.0	0.000	0.9812	<0.001	100.0	0.000	0.9927	<0.001	100.0	0.000	0.9989	<0.001
Pat2	66.7	0.000	0.6331	<0.001	66.7	0.087	0.6783	<0.001	100.0	0.000	0.6837	<0.001	66.7	0.000	0.7462	<0.001	100.0	0.044	0.8884	<0.001
Pat3	100.0	0.045	0.8944	<0.001	100.0	0.045	0.8735	<0.001	100.0	0.045	0.9062	<0.001	100.0	0.045	0.9319	<0.001	100.0	0.000	0.9596	<0.001
Pat5	80.0	0.000	0.8917	0.002	100.0	0.214	0.8896	0.003	80.0	0.000	0.9228	0.002	100.0	0.000	0.9509	<0.001	100.0	0.000	0.9961	<0.001
Pat9	50.0	0.022	0.7075	0.004	75.0	0.172	0.6552	0.106	75.0	0.064	0.7402	<0.001	75.0	0.043	0.7722	<0.001	75.0	0.000	0.8244	<0.001
Pat10	85.7	0.077	0.7294	<0.001	85.7	0.307	0.7447	<0.001	85.7	0.115	0.7529	<0.001	85.7	0.192	0.8098	<0.001	85.7	0.077	0.8795	<0.001
Pat13	71.4	0.500	0.7863	0.154	85.7	0.142	0.8352	0.142	85.7	0.500	0.7868	0.147	100.0	0.428	0.8730	0.003	100.0	0.214	0.9469	<0.001
Pat17	66.7	0.200	0.7123	0.193	100.0	0.200	0.8618	0.014	100.0	0.300	0.8804	0.042	100.0	0.300	0.7917	0.012	100.0	0.100	0.8681	<0.001
Pat18	80.0	0.000	0.8923	<0.001	80.0	0.083	0.8760	<0.001	80.0	0.041	0.8854	<0.001	80.0	0.000	0.8920	<0.001	80.0	0.000	0.8987	<0.001
Pat19	100.0	0.040	0.8584	<0.001	100.0	0.200	0.7919	0.003	100.0	0.080	0.9069	<0.001	100.0	0.000	0.9746	<0.001	100.0	0.000	0.9898	<0.001
Pat20	100.0	0.200	0.9398	<0.001	100.0	0.300	0.9337	<0.001	100.0	0.150	0.9636	<0.001	100.0	0.100	0.9723	<0.001	100.0	0.050	0.9853	<0.001
Pat21	100.0	0.256	0.7291	0.014	100.0	0.427	0.7548	0.198	100.0	0.256	0.8448	<0.001	100.0	0.300	0.9052	<0.001	100.0	0.182	0.8967	<0.001
Pat23	100.0	0.000	0.9819	<0.001	100.0	0.077	0.9800	<0.001	100.0	0.077	0.9890	<0.001	100.0	0.000	0.9923	<0.001	100.0	0.000	0.9872	<0.001
Ave	84.7	0.103	0.825	-	91.8	0.173	0.835	-	92.8	0.125	0.865	-	92.9	0.108	0.893	-	95.4	0.051	0.932	-

TABLE VII: Performance of Transformer on CHB-MIT database.

Patient	Original				Down-sampling				Sliding windows				Recombination				DiffEEG
	Sens (%)	FPR/h	AUC	p-value	Sens (%)	FPR/h	AUC	p-value	Sens (%)	FPR/h	AUC	p-value	Sens (%)	FPR/h	AUC	p-value	Sens (%)	FPR/h	AUC	p-value
Pat1	100.0	0.059	0.9542	<0.001	100.0	0.000	0.9861	<0.001	100.0	0.000	0.9899	<0.001	100.0	0.059	0.9854	<0.001	100.0	0.000	0.9954	<0.001
Pat2	33.3	0.000	0.6295	0.011	100.0	0.087	0.7642	0.002	100.0	0.000	0.8321	<0.001	100.0	0.000	0.8205	<0.001	100.0	0.000	0.9096	<0.001
Pat3	66.7	0.000	0.7652	<0.001	66.7	0.090	0.7776	<0.001	83.3	0.045	0.8835	<0.001	100.0	0.136	0.8873	<0.001	100.0	0.045	0.9225	<0.001
Pat5	60.0	0.000	0.7164	0.105	100.0	0.142	0.7840	0.002	100.0	0.071	0.8646	<0.001	100.0	0.428	0.8175	<0.001	100.0	0.142	0.9344	<0.001
Pat9	0.0	0.022	0.5333	1.000	50.0	0.086	0.6327	0.141	75.0	0.064	0.7738	<0.001	75.0	0.108	0.7869	<0.001	75.0	0.043	0.8384	<0.001
Pat10	28.6	0.115	0.5620	0.122	85.7	0.307	0.7130	0.197	85.7	0.192	0.7795	0.005	85.7	0.038	0.7833	0.011	85.7	0.077	0.8556	<0.001
Pat13	100.0	0.285	0.8986	<0.001	100.0	0.285	0.9204	<0.001	100.0	0.357	0.9303	<0.001	100.0	0.214	0.9353	<0.001	100.0	0.214	0.9589	<0.001
Pat17	66.7	0.100	0.8093	0.117	100.0	0.100	0.8946	0.132	100.0	0.100	0.8750	0.020	100.0	0.100	0.9017	0.015	100.0	0.100	0.9421	<0.001
Pat18	40.0	0.000	0.6735	0.042	80.0	0.083	0.7412	0.019	80.0	0.042	0.8069	<0.001	80.0	0.000	0.7849	<0.001	80.0	0.000	0.8331	<0.001
Pat19	0.0	0.000	0.5628	1.000	66.7	0.400	0.7739	0.110	66.7	0.080	0.7869	0.004	66.7	0.000	0.7857	0.035	100.0	0.040	0.9061	<0.001
Pat20	100.0	0.050	0.9074	<0.001	100.0	0.150	0.9435	<0.001	100.0	0.100	0.9631	<0.001	100.0	0.100	0.9604	<0.001	100.0	0.050	0.9828	<0.001
Pat21	75.0	0.085	0.7223	0.004	100.0	0.427	0.7976	<0.001	100.0	0.128	0.8654	<0.001	100.0	0.085	0.8819	<0.001	100.0	0.085	0.9228	<0.001
Pat23	100.0	0.000	0.9760	<0.001	100.0	0.000	0.9808	<0.001	100.0	0.000	0.9910	<0.001	100.0	0.000	0.9916	<0.001	100.0	0.000	0.9941	<0.001
Ave	59.3	0.055	0.747	-	88.4	0.166	0.824	-	91.6	0.091	0.872	-	92.9	0.098	0.871	-	95.4	0.061	0.923	-

TABLE VIII: Performance of SVM on Kaggle database.

Patient	Original				Down-sampling				Sliding windows				Recombination				DiffEEG
	Sens (%)	FPR/h	AUC	p-value	Sens (%)	FPR/h	AUC	p-value	Sens (%)	FPR/h	AUC	p-value	Sens (%)	FPR/h	AUC	p-value	Sens (%)	FPR/h	AUC	p-value
Dog1	0.0	0.025	0.4989	1.000	50.0	0.213	0.5769	0.142	50.0	0.300	0.5779	0.127	50.0	0.263	0.6264	0.110	50.0	0.238	0.6282	0.088
Dog2	85.7	0.036	0.8334	<0.001	100.0	0.072	0.9310	<0.001	85.7	0.060	0.9110	<0.001	100.0	0.072	0.9338	<0.001	100.0	0.060	0.9545	<0.001
Dog3	66.7	0.017	0.6961	<0.001	75.0	0.250	0.7724	<0.001	83.3	0.242	0.7770	<0.001	83.3	0.242	0.7760	<0.001	83.3	0.183	0.7841	<0.001
Dog4	64.3	0.052	0.6286	<0.001	92.9	0.269	0.7589	<0.001	85.7	0.276	0.7383	<0.001	92.9	0.276	0.7376	<0.001	92.9	0.224	0.7097	<0.001
Dog5	60.0	0.040	0.7820	0.001	80.0	0.093	0.7994	<0.001	80.0	0.053	0.7942	<0.001	80.0	0.053	0.7929	<0.001	100.0	0.053	0.8056	<0.001
Ave	55.3	0.034	0.688	-	79.6	0.179	0.768	-	76.9	0.186	0.76	-	81.2	0.181	0.773	-	85.2	0.152	0.776	-

TABLE IX: Performance of Spatio-temporal MLP on Kaggle database.

Patient	Original				Down-sampling				Sliding windows				Recombination				DiffEEG
	Sens (%)	FPR/h	AUC	p-value	Sens (%)	FPR/h	AUC	p-value	Sens (%)	FPR/h	AUC	p-value	Sens (%)	FPR/h	AUC	p-value	Sens (%)	FPR/h	AUC	p-value
Dog1	0.0	0.000	0.5000	1.000	50.0	0.262	0.5491	0.167	50.0	0.200	0.5907	0.120	50.0	0.225	0.5638	0.080	75.0	0.250	0.6213	0.042
Dog2	57.1	0.048	0.6102	<0.001	100.0	0.264	0.7866	<0.001	100.0	0.228	0.8311	<0.001	100.0	0.288	0.8229	<0.001	100.0	0.216	0.8607	<0.001
Dog3	33.3	0.025	0.5966	<0.001	83.3	0.179	0.7196	<0.001	83.3	0.170	0.7215	<0.001	83.3	0.158	0.7492	<0.001	91.7	0.154	0.7855	<0.001
Dog4	14.3	0.075	0.5333	0.031	85.7	0.358	0.6632	<0.001	85.7	0.373	0.6652	<0.001	92.9	0.380	0.6719	<0.001	92.9	0.298	0.7136	<0.001
Dog5	60.0	0.027	0.6584	0.007	100.0	0.173	0.8153	<0.001	100.0	0.133	0.8313	<0.001	100.0	0.133	0.8521	<0.001	100.0	0.093	0.8641	<0.001
Ave	32.9	0.035	0.580	-	83.8	0.247	0.707	-	83.8	0.221	0.728	-	85.2	0.237	0.732	-	91.9	0.202	0.769	-

TABLE X: Performance of EEGNet on Kaggle database.

Patient	Original				Down-sampling				Sliding windows				Recombination				DiffEEG
	Sens (%)	FPR/h	AUC	p-value	Sens (%)	FPR/h	AUC	p-value	Sens (%)	FPR/h	AUC	p-value	Sens (%)	FPR/h	AUC	p-value	Sens (%)	FPR/h	AUC	p-value
Dog1	0.0	0.000	0.5000	1.000	50.0	0.275	0.6331	0.119	50.0	0.175	0.6197	0.083	50.0	0.163	0.6186	0.313	75.0	0.138	0.6653	0.015
Dog2	57.1	0.048	0.7065	<0.001	100.0	0.216	0.8514	<0.001	100.0	0.120	0.8528	<0.001	100.0	0.132	0.8608	<0.001	100.0	0.096	0.9089	<0.001
Dog3	8.3	0.000	0.5159	0.020	83.3	0.167	0.7713	<0.001	91.7	0.117	0.7722	<0.001	91.7	0.121	0.7884	<0.001	91.7	0.054	0.8245	<0.001
Dog4	14.3	0.000	0.5630	0.097	78.6	0.269	0.6822	<0.001	85.7	0.209	0.7127	<0.001	85.7	0.201	0.7208	<0.001	85.7	0.194	0.7471	<0.001
Dog5	40.0	0.040	0.6393	0.008	80.0	0.120	0.8176	<0.001	80.0	0.053	0.8415	<0.001	80.0	0.053	0.8383	<0.001	100.0	0.080	0.9199	<0.001
Ave	23.9	0.018	0.585	-	78.4	0.209	0.751	-	81.5	0.135	0.760	-	81.5	0.134	0.765	-	90.5	0.112	0.813	-

TABLE XI: Performance of Multi-scale CNN on Kaggle database.

Patient	Original				Down-sampling				Sliding windows				Recombination				DiffEEG
	Sens (%)	FPR/h	AUC	p-value	Sens (%)	FPR/h	AUC	p-value	Sens (%)	FPR/h	AUC	p-value	Sens (%)	FPR/h	AUC	p-value	Sens (%)	FPR/h	AUC	p-value
Dog1	0.0	0.000	0.5000	1.000	50.0	0.187	0.6305	<0.001	50.0	0.225	0.6177	<0.001	50.0	0.200	0.6481	<0.001	75.0	0.138	0.6900	<0.001
Dog2	85.7	0.012	0.8003	<0.001	100.0	0.204	0.8728	<0.001	100.0	0.144	0.8865	<0.001	100.0	0.132	0.8927	<0.001	100.0	0.096	0.9065	<0.001
Dog3	58.3	0.013	0.7197	<0.001	91.7	0.116	0.7880	<0.001	91.7	0.138	0.7700	<0.001	83.3	0.100	0.7849	<0.001	100.0	0.100	0.8300	<0.001
Dog4	71.4	0.097	0.6444	<0.001	85.7	0.268	0.7188	<0.001	92.9	0.306	0.7265	<0.001	92.9	0.239	0.7343	<0.001	92.9	0.230	0.7686	<0.001
Dog5	80.0	0.027	0.7651	<0.001	100.0	0.066	0.8886	<0.001	100.0	0.053	0.8716	<0.001	100.0	0.053	0.8877	<0.001	100.0	0.040	0.9147	<0.001
Ave	59.1	0.030	0.686	-	85.5	0.168	0.780	-	86.9	0.173	0.774	-	85.2	0.145	0.790	-	93.6	0.121	0.822	-

TABLE XII: Performance of Transformer on Kaggle database.

Patient	Original				Down-sampling				Sliding windows				Recombination				DiffEEG
	Sens (%)	FPR/h	AUC	p-value	Sens (%)	FPR/h	AUC	p-value	Sens (%)	FPR/h	AUC	p-value	Sens (%)	FPR/h	AUC	p-value	Sens (%)	FPR/h	AUC	p-value
Dog1	0.0	0.000	0.5000	1.000	50.0	0.225	0.6013	0.106	50.0	0.150	0.5941	<0.001	50.0	0.125	0.5962	0.068	75.0	0.138	0.6387	<0.001
Dog2	71.4	0.036	0.6576	<0.001	100.0	0.156	0.8697	<0.001	100.0	0.144	0.8972	<0.001	100.0	0.168	0.8508	<0.001	100.0	0.120	0.9166	<0.001
Dog3	25.0	0.013	0.5398	<0.001	83.3	0.100	0.7607	<0.001	91.7	0.104	0.7435	<0.001	91.7	0.104	0.7644	<0.001	91.7	0.033	0.8155	<0.001
Dog4	0.0	0.000	0.5000	1.000	78.6	0.470	0.6449	<0.001	78.6	0.300	0.6800	<0.001	78.6	0.447	0.6391	<0.001	85.7	0.201	0.7243	<0.001
Dog5	80.0	0.040	0.7522	<0.001	100.0	0.133	0.8166	<0.001	100.0	0.067	0.8478	<0.001	100.0	0.080	0.8756	<0.001	100.0	0.053	0.9179	<0.001
Ave	35.3	0.018	0.590	-	82.4	0.217	0.739	-	84.1	0.153	0.753	-	84.1	0.185	0.745	-	90.5	0.109	0.803	-

It is clear that all the classification models using the augmented data of DiffEEG achieve higher Sens and AUC than those of the models using just the original data. Among the approaches to solve the problem of imbalanced data, our proposed model achieves the highest Sens, AUC and the lowest FPR on all classifiers. Compared with the sliding windows and recombination methods, DiffEEG improves Sens by an average of 3.64% and 4.06%, AUC by an average of 0.043 and 0.041, and reduces FPR by an average of 0.027/h and 0.032/h on the CHB-MIT database. As for the Kaggle database, it improves Sens by an average of 7.70% and 6.90%, AUC by an average of 0.042 and 0.036, and reduces FPR by an average of 0.034/h and 0.037/h. Besides, our approach has increased the number of subjects whose improvement over chance is statistically significant, at a significant level of 0.05.

Among the five classification network, Multi-scale CNN wins the best performance. The combination of DiffEEG and Multi-scale CNN obtains an average Sens, FPR, AUC of 95.4%, 0.051/h, 0.932 on the CHB-MIT database and an average Sens, FPR, AUC of 93.6%, 0.121/h, 0.822 on the Kaggle database.

To further measure our proposed model, we compare our results with several SOTA methods that used the CHB-MIT and Kaggle database. The statistical comparison is performed by counting the number of patients whose improvement over chance is statistically significant at a given confidence interval. The significance level $\alpha$ is set to 0.05 and 0.01, representing a confidence interval of 95% and 99% respectively. The comparison results are listed in Table XIII. It is obvious that our method achieves the optimal performance on the CHB-MIT database, and comparable performance to the SOTA approaches on the Kaggle database. Besides, our work has acquired statistically significant improvement over chance for all the subjects in both databases. For the individuals without statistical significance in these SOTA methods, our work can achieve statistical significance on them, except for Pat7 who does not meet our selection criteria.

TABLE XIII: Comparison to existing methods on CHB-MIT and Kaggle database.

Authors	Database	Features	Classifier	No. of seizures	No. of subjects	Sens(%)	FPR/h	AUC	Interictal-Preictal distance (min)	p-value over chance at $\alpha$ = 0.05	p-value over chance at $\alpha$ = 0.01
Troung et al. 2018 [20]	CHB-MIT	STFT	CNN	64	13	81.2	0.16	NR	240-30	12/13 (miss Pat9)	10/13 (miss Pat5, Pat9, Pat10)
Ozcan et al. 2019 [11]	CHB-MIT	Spectral power	3D CNN	77	16	85.7	0.096	NR	240-60	15/16 (miss Pat18)	13/16 (miss Pat7, Pat10, Pat18)
		Statistical moments
		Hjorth parameters
Zhang et al. 2020 [21]	CHB-MIT	Common spatial	CNN	156	23	92	0.12	0.900	30-30	NR	NR
		pattern statistics
Yang et al. 2021 [38]	CHB-MIT	STFT	RDANet	64	13	89.3	NR	0.913	240-30	NR	NR
Li et al. 2022 [53]	CHB-MIT	raw EEG signals	FB-CapsNet	105	19	93.4	0.096	0.928	240-30	19/19	18/19 (miss Pat17)
Zhao et al. 2022 [54]	CHB-MIT	raw EEG signals	AddNet-SCL	105	19	93.0	0.094	0.929	240-30	19/19	18/19 (miss Pat17)
Our work	CHB-MIT	raw EEG signals	CNN	69	13	95.4	0.051	0.932	240-30	13/13	13/13
		generated EEG
Troung et al. 2018 [20]	Kaggle	STFT	CNN	42	5	73.4	0.186	NR	240-60	4/5 (miss Dog1)	4/5 (miss Dog1)
Daoud et al. 2019 [8]	Kaggle	raw EEG signals	DCNN+Bi-LSTM	42	5	81.5	0.167	0.805	240-60	4/5 (miss Dog1)	4/5 (miss Dog1)
Xu et al. 2020 [55]	Kaggle	raw EEG signals	CNN	42	5	80.9	0.134	0.808	240-60	4/5 (miss Dog1)	4/5 (miss Dog1)
Chen et al. 2021 [56]	Kaggle	STFT	CNN	42	5	80	0.372	0.828	240-60	NR	NR
Zhao et al. 2022 [54]	Kaggle	raw EEG signals	SCL-AddNets	42	5	89.1	0.12	0.831	240-60	5/5	5/5
Gao et al. 2023 [57]	Kaggle	raw EEG signals	ProtoPNet	42	5	88.6	0.146	0.764	240-60	5/5	5/5
Our work	Kaggle	raw EEG signals	CNN	42	5	93.6	0.121	0.822	240-60	5/5	5/5
		generated EEG

TABLE XIV: T-test between real and augmented data.

Subjects	Variance	$\delta$ band	$\theta$ band	$\alpha$ band	$\beta$ band	$\gamma$ band
Pat1	0.970	0.811	0.419	<0.001	<0.001	<0.001
Pat2	0.911	0.516	0.714	0.980	0.480	<0.001
Pat3	0.987	0.898	0.027	0.893	0.813	0.722
Pat5	0.984	<0.001	<0.001	0.305	<0.001	0.127
Pat9	0.072	<0.001	<0.001	<0.001	0.068	<0.001
Pat10	0.082	0.005	0.134	<0.001	<0.001	<0.001
Pat13	0.503	0.063	<0.001	0.059	<0.001	<0.001
Pat17	0.953	0.004	0.992	0.953	0.052	<0.001
Pat18	0.870	0.347	0.094	0.949	0.035	0.514
Pat19	0.110	0.080	0.218	0.003	0.017	0.001
Pat20	0.108	0.310	0.188	<0.001	<0.001	<0.001
Pat21	0.050	0.594	0.050	<0.001	<0.001	<0.001
Pat23	0.582	<0.001	0.360	0.840	<0.001	<0.001
Dog1	0.786	<0.001	<0.001	0.011	0.407	<0.001
Dog2	0.190	<0.001	<0.001	0.164	<0.001	<0.001
Dog3	0.033	<0.001	<0.001	0.032	<0.001	<0.001
Dog4	0.101	<0.001	<0.001	0.109	<0.001	<0.001
Dog5	0.397	<0.001	0.058	<0.001	<0.001	<0.001

IV Discussion

Despite decades of development in medical technology, the collection of high-quality EEG signals still demands expensive medical devices and plenty of expert time [25]. It is impractical to increase the amount of preictal EEG data through continuous collection. Data augmentation is the process that generates new samples to augment a small or imbalanced database by converting existing samples. Exposing the model to various augmented data could raise the precision and stability of classifier, making it more robust to changes [58]. According to the results in Table III $\sim$ Table XII, it can be observed that all the classification models using the augmented data generated by DiffEEG achieve higher Sens and AUC than the models using just the original data, indicating that our method can improve the predictive precision of epilepsy. However, the models using the original data exhibited lower FPR. This is because the interictal samples in the original data are far more than the preictal data, leading to the overfitting of the model. As a result, the model can accurately identify the interictal samples but struggles to recognize the preictal samples. Besides, compared with the existing methods for addressing the imbalanced data problem, we can conclude that the DiffEEG has obvious advantage over them. There has been a notable enhancement in the performance of seizure prediction, proving that the data generated by DiffEEG has high quality and diversity. From a statistical perspective, the augmented data can increase the number of subjects whose improvement over chance is statistically significant, suggesting that our DA technique is capable of enhancing the reliability of the model’s predictions. In practice, the distribution of EEG signals varies over time due to the changes in external environment and mental states of patients. The epileptic representation in different preictal time varies as well if the location or mode of abnormal discharge is different [23]. When the data cluster of an upcoming seizure has few nearby distribution with the existing clusters, the seizure will be difficult to predict. Based on the distribution learned during the diffusion process, the data produced by DiffEEG can narrow the distance between different clusters and provide additional information to classifiers. Instead, the distribution of data generated by sliding windows or recombination is restricted by the original data and cannot be fully explored, so their improvement to prediction performance is limited.

By comparing the results of five classification networks, it is clear that no matter what DA methods we used, the performance of Multi-scale CNN was the best. SVM is a powerful machine learning classifier, but its performance typically relies on the handcrafted feature engineering. The manually extracted features are hard to capture all the complex representations from raw data, making the performance of SVM inferior to that of deep learning networks. Although the blocks in Spatio-temporal MLP can extract information within and between channels, the fully connected structure of MLPs determines that the efficiency of data processing is not high. The lack of weight sharing mechanism also causes a huge amount of parameters, making it difficult to converge to optimal result. The structure of EEGNet is lightweight and computationally efficient. The multi-layer 1D convolution enables the model to effectively extract the time-frequency information. However, EEGNet lacks consideration of the inter-channel correlation, leading to the insufficient extraction of spatial features. For Transformer, the input embedding and position encoding can capture the global information of data. The attention module can learn the relation between features and strengthen the role of key information. For Multi-scale CNN, the local perception ability of convolution makes it easier to extract local information. By the use of dilated convolution, spatial and temporal features of different scales can be extracted, providing multi-scale information for seizure prediction. In the meantime, the model also utilizes attention mechanism as Transformer to assign weights to features and focus on the critical representation. Therefore, the Multi-scale CNN network obtains the best effects. SVM is a widely used traditional classification method. MLP, CNN and Transformer are the most representative frameworks in deep learning and almost all classifiers are based on them. As our proposed DiffEEG shows superiority on these networks, it is proved that DiffEEG can be utilized in almost any other classification network to substantially improve the seizure prediction performance.

The generation of multichannel EEG signals has been a challenge to researchers for a long time. Therefore, the existing generative models either generate the feature maps of EEG signals [25] or generate each channel respectively before combining them [59]. This is done in case the model cannot fully learn the representations of multichannel EEG signals. Our proposed model, DiffEEG, can overcome the challenge and directly generate signals of the whole channels. The real and synthetic samples of 18 channels and 30-s duration generated by DiffEEG are shown in Fig. 8. To further inspect the details of the samples, we only select the first three channels of 10-s duration as shown in Fig. 9. Through visual inspection, we can see that the augmented data is not the simple superposition of existing data, but diversified data generated based on the learned distribution, which could provide additional information to improve the performance.

Furthermore, statistical tests are performed for the synthetic data to validate that the augmented data increase diversity and new information while preserving the characteristics of the genuine EEG data. As the augmented data are generated conditioned on the STFT spectrum of real data, a paired t-test can be appropriately performed for the augmented data and the corresponding real data. T-tests are usually carried out on the feature space instead of the high-dimensional time-domain data [60]. Hence, we select variance to reflect the amplitude distribution, and extract the spectral power of $\delta$ , $\theta$ , $\alpha$ , $\beta$ , $\gamma$ bands to give information about the energy variations. We utilized these 6 features to conduct the paired t-test at a significance level of 0.05. The p-values on each subject are shown in Table XIV. Based on the theory of t-test, if the p-value is less than 0.05, there is considered to be a significant difference between the two sample groups; otherwise, there is no significant difference. From the results, it can be observed that the variance of both groups is not significantly different across all subjects except for Dog3, indicating that the amplitude distribution of the augmented data is similar to that of the real data for most patients. For each spectral feature, some of the patients exhibit significant difference between the real and augmented groups, some of the patients not. This reflects the diversity introduced by our DiffEEG model in the generated samples. The synthetic data retain the temporal rhythms of the original data while bringing in diverse variations in the frequency domain.

The DiffEEG model is trained with the guidance of STFT condition. Compared to directly using the raw EEG signal, the conditional model can generate samples with higher quality. Numerous articles have demonstrated the advantages of conditional diffusion models over unconditional ones. Kong et al. mentioned that generating audios in the time domain without conditional information is a challenging task [28]. Their conditional model achieved higher Mean Opinion Score (MOS) values than the unconditional model, indicating that the generated samples of the conditional model have higher quality. Ho et al. also showed that the sample quality and fidelity can be greatly improved by the conditional guidance [34]. Certainly, the conditional models also have their limitations. Ho et al. explained that the presence of condition could reduce the diversity of generation, resulting in a trade-off between quality and diversity. In our EEG generation task, we prioritize the quality because the complicated distribution of EEG data is difficult to learn. The loss of authenticity in the augmented data will lead to a decline in the seizure prediction performance. Besides, we have made use of the recombined spectrograms to increase the diversity as mentioned in Section II-D.

Finally, seizure prediction experiments are carried out to test the contributions of the synthetic data. The results verify that the data generated by DiffEEG can greatly improve the prediction performance and beat the existing DA methods in various situations.

The outstanding effects achieved by DiffEEG on public datasets have revealed multiple potential impacts on the clinical practice of seizure prediction.

•

Rapidly establishing high-performance and generalizable epilepsy prediction models for patients. In clinical practice, the insufficient preictal data often restricts the performance of seizure prediction models [59]. Relying on collecting more preictal data to build high-performance models is extremely time-consuming. With DiffEEG, preictal data with additional information can be generated in a short time, enhancing the predictive effect of the model. Moreover, the strong generalization capability of the synthetic data allows them to be widely applicable to various prediction models.
•

Assisting doctors in studying biomarkers for epilepsy prediction and enhancing the interpretability of predictions. It is challenging to capture the EEG biomarkers before seizures because of the complex spatial-temporal dynamics present in the epileptic brain [61]. The DiffEEG model can fully explore the feature distribution of EEG signals and generates data with distinct epileptic representation. These data provide a possible reference for doctors in identifying biomarkers for seizure prediction, improving the interpretability of deep learning-based prediction models.
•

Protecting the data privacy of patients. In the medical field, EEG data is typically sensitive information. Utilizing synthetic samples for data augmentation can reduce the reliance on original data, thereby lowering the risks of data leakage and privacy concerns.

Our work also presents vast application prospects in other EEG-based tasks. Issues such as data imbalance and scarcity are prevalent in EEG-based tasks. For instance, in sleep staging tasks, Stage N2 generally occupies a majority of the sleep time (45%-55%), while stage N1 only accounts for very little time (2%-5%) [62]. In emotion recognition tasks, the AMIGOS dataset exhibits a higher arousal ratings, whereas in the DEAP dataset, the proportion of positive emotions is larger than that of negative ones [63]. In the diagnosis of brain diseases such as depression, Alzheimer’s disease and mild cognitive impairment, it is typical to face a lack of data due to the equipment requirements, time constraints and privacy concerns. The EEG signals for all these tasks exhibit significantly different feature distribution. However, DiffEEG can leverage the STFT to provide time-frequency information, guiding the model to learn task-specific representations for each distribution. Therefore, our approach enables the generation of diverse and high-quality EEG samples for data augmentation under various tasks, raising the generalization, accuracy, and robustness of these tasks.

Despite the promising influence and applications demonstrated by DiffEEG, it is important to acknowledge its potential limitations and pursue the directions for future research. One of the primary limitations is the slow training process. Although we speed up the training process by randomly selecting the diffusion steps, it still takes plenty of time to train the model. In future work, the input block could be modified to transform the multichannel EEG data into latent space. By performing the diffusion process in a low-dimensional latent space, the computation cost and training time can be greatly reduced. Another limitation is the lack of cross-subject capabilities. To ensure the generation quality, DiffEEG is trained under a subject-specific strategy. When a new patient arrives, we need to retrain a model rather than utilizing the existing data and models, which is both time-consuming and resource-wasting. In the future, the structure of DiffEEG could be improved for pretraining on a large dataset. In this way, we only need to fine-tune the pretrained model to fit the data distribution of each patient, enabling the model with few-shot learning capability.

In addition to the inherent limitations of the DiffEEG itself, there are several challenges and considerations when integrating it into real-world applications. The first obstacle comes from the acquisition quality of the EEG signals. In real-world scenarios, factors such as the loose or offset of the collecting electrodes, the patient’s chewing and vigorous movement can introduce severe noise in the EEG signals [64]. The noise interference may mask or distort the epileptic representation, hindering the model from learning the distribution characteristics of epileptic EEG data. We cannot guarantee that a model trained on low-quality EEG signals is capable of producing high-quality EEG data. The second challenge is the difficulty in continuously optimizing the model. Patients employing the epilepsy prediction system still experience seizures, and the consequent preictal data could be utilized to update and enhance the DiffEEG model. However, retraining the model with all the data each time new data arrives is too costly, and training the model solely on the new data could lead to the catastrophic forgetting problem. Hence, integrating incremental learning into DiffEEG is worth considering. At last, the highly realistic content generated by DiffEEG can raise ethical concerns. The model has the potential to create deceptive EEG signals including deepfakes and misinformation [65]. Therefore, ensuring responsible use and preventing malicious applications are ongoing challenges.

V Conclusion

In this article, we propose a novel and effective data augmentation method by the use of diffusion model. We conduct contrast experiments with just using the original data and with three existing methods of solving the imbalanced data problem: down-sampling, sliding windows and recombination. We evaluate the prediction performance on five representative classifiers to verify the effectiveness and generality of DiffEEG. The results prove that the data generated by DiffEEG achieve better performance on all five classification networks. Among them, the Multi-scale CNN utilizing the data augmented by DiffEEG obtained the best performance, with an average Sens, FPR, AUC of 95.4%, 0.051/h, 0.932 on the CHB-MIT database and an average Sens, FPR, AUC of 93.6%, 0.121/h, 0.822 on the Kaggle database. Both results reached the SOTA level. To the best of our knowledge, our work is the first diffusion model used in epileptic EEG generation, creating a new and effective way to solve the imbalanced data. This model can be widely applied to all kinds of seizure prediction classifiers to improve the classification performance significantly.

References

[1] R. S. Fisher, C. Acevedo, A. Arzimanoglou, A. Bogacz, J. H. Cross, C. E. Elger, J. Engel Jr, L. Forsgren, J. A. French, M. Glynn et al., “Ilae official report: a practical clinical definition of epilepsy,” Epilepsia, vol. 55, no. 4, pp. 475–482, 2014.
[2] W. H. Organization et al., Epilepsy: a public health imperative. World Health Organization, 2019.
[3] D. Zeng, K. Huang, C. Xu, H. Shen, and Z. Chen, “Hierarchy graph convolution network and tree classification for epileptic detection on electroencephalography signals,” IEEE Transactions on Cognitive and Developmental Systems, vol. 13, no. 4, pp. 955–968, 2020.
[4] B. Maimaiti, H. Meng, Y. Lv, J. Qiu, Z. Zhu, Y. Xie, Y. Li, W. Zhao, J. Liu, M. Li et al., “An overview of eeg-based machine learning methods in seizure prediction and opportunities for neurologists in this field,” Neuroscience, vol. 481, pp. 197–218, 2022.
[5] L. Xiao, C. Li, Y. Wang, J. Chen, W. Si, C. Yao, X. Li, C. Duan, and P.-A. Heng, “Automatic localization of seizure onset zone from high-frequency seeg signals: A preliminary study,” IEEE Journal of Translational Engineering in Health and Medicine, vol. 9, pp. 1–10, 2021.
[6] J. Cao, J. Zhu, W. Hu, and A. Kummert, “Epileptic signal classification with deep eeg features by stacked cnns,” IEEE Transactions on Cognitive and Developmental Systems, vol. 12, no. 4, pp. 709–722, 2019.
[7] L. Kuhlmann, K. Lehnertz, M. P. Richardson, B. Schelter, and H. P. Zaveri, “Seizure prediction—ready for a new era,” Nature Reviews Neurology, vol. 14, no. 10, pp. 618–630, 2018.
[8] H. Daoud and M. A. Bayoumi, “Efficient epileptic seizure prediction based on deep learning,” IEEE Transactions on Biomedical Circuits and Systems, vol. 13, no. 5, pp. 804–813, 2019.
[9] X. Chen, C. Li, A. Liu, M. J. McKeown, R. Qian, and Z. J. Wang, “Toward open-world electroencephalogram decoding via deep learning: A comprehensive survey,” IEEE Signal Processing Magazine, vol. 39, no. 2, pp. 117–134, 2022.
[10] L. Chisci, A. Mavino, G. Perferi, M. Sciandrone, C. Anile, G. Colicchio, and F. Fuggetta, “Real-time epileptic seizure prediction using ar models and support vector machines,” IEEE Transactions on Biomedical Engineering, vol. 57, no. 5, pp. 1124–1132, 2010.
[11] A. R. Ozcan and S. Erturk, “Seizure prediction in scalp eeg using 3d convolutional neural networks with an image-based approach,” IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 27, no. 11, pp. 2284–2293, 2019.
[12] J. Cao, D. Hu, Y. Wang, J. Wang, and B. Lei, “Epileptic classification with deep-transfer-learning-based feature fusion algorithm,” IEEE Transactions on Cognitive and Developmental Systems, vol. 14, no. 2, pp. 684–695, 2021.
[13] C. Li, C. Shao, R. Song, G. Xu, X. Liu, R. Qian, and X. Chen, “Spatio-temporal mlp network for seizure prediction using eeg signals,” Measurement, vol. 206, p. 112278, 2023.
[14] Y. Gao, X. Chen, A. Liu, D. Liang, L. Wu, R. Qian, H. Xie, and Y. Zhang, “Pediatric seizure prediction in scalp eeg using a multi-scale neural network with dilated convolutions,” IEEE Journal of Translational Engineering in Health and Medicine, vol. 10, pp. 1–9, 2022.
[15] Y. Gao, A. Liu, X. Cui, R. Qian, and X. Chen, “A general sample-weighted framework for epileptic seizure prediction,” Computers in Biology and Medicine, vol. 150, p. 106169, 2022.
[16] M. J. Cook, T. J. O’Brien, S. F. Berkovic, M. Murphy, A. Morokoff, G. Fabinyi, W. D’Souza, R. Yerra, J. Archer, L. Litewka et al., “Prediction of seizure likelihood with a long-term, implanted seizure advisory system in patients with drug-resistant epilepsy: a first-in-man study,” The Lancet Neurology, vol. 12, no. 6, pp. 563–571, 2013.
[17] M. Ihle, H. Feldwisch-Drentrup, C. A. Teixeira, A. Witon, B. Schelter, J. Timmer, and A. Schulze-Bonhage, “Epilepsiae–a european epilepsy database,” Computer Methods and Programs in Biomedicine, vol. 106, no. 3, pp. 127–138, 2012.
[18] P. Branco, L. Torgo, and R. P. Ribeiro, “A survey of predictive modeling on imbalanced domains,” ACM Computing Surveys (CSUR), vol. 49, no. 2, pp. 1–50, 2016.
[19] H. Khan, L. Marcuse, M. Fields, K. Swann, and B. Yener, “Focal onset seizure prediction using convolutional networks,” IEEE Transactions on Biomedical Engineering, vol. 65, no. 9, pp. 2109–2118, 2017.
[20] N. D. Truong, A. D. Nguyen, L. Kuhlmann, M. R. Bonyadi, J. Yang, S. Ippolito, and O. Kavehei, “Convolutional neural networks for seizure prediction using intracranial and scalp electroencephalogram,” Neural Networks, vol. 105, pp. 104–111, 2018.
[21] Y. Zhang, Y. Guo, P. Yang, W. Chen, and B. Lo, “Epilepsy seizure prediction on eeg using common spatial pattern and convolutional neural network,” IEEE Journal of Biomedical and Health Informatics, vol. 24, no. 2, pp. 465–474, 2019.
[22] A. Antoniou, A. Storkey, and H. Edwards, “Data augmentation generative adversarial networks,” arXiv preprint arXiv:1711.04340, 2017.
[23] Y. Qi, L. Ding, Y. Wang, and G. Pan, “Learning robust features from nonstationary brain signals by multiscale domain adaptation networks for seizure prediction,” IEEE Transactions on Cognitive and Developmental Systems, vol. 14, no. 3, pp. 1208–1216, 2021.
[24] B. Trabucco, K. Doherty, M. Gurinas, and R. Salakhutdinov, “Effective data augmentation with diffusion models,” arXiv preprint arXiv:2302.07944, 2023.
[25] K. Rasheed, J. Qadir, T. J. O’Brien, L. Kuhlmann, and A. Razi, “A generative model to synthesize eeg data for epileptic seizure prediction,” IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 29, pp. 2322–2332, 2021.
[26] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training gans,” Advances in Neural Information Processing Systems, vol. 29, 2016.
[27] G. Giannone, D. Nielsen, and O. Winther, “Few-shot diffusion models,” arXiv preprint arXiv:2205.15463, 2022.
[28] Z. Kong, W. Ping, J. Huang, K. Zhao, and B. Catanzaro, “Diffwave: A versatile diffusion model for audio synthesis,” arXiv preprint arXiv:2009.09761, 2020.
[29] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in Neural Information Processing Systems, vol. 33, pp. 6840–6851, 2020.
[30] A. Kebaili, J. Lapuyade-Lahorgue, and S. Ruan, “Deep learning approaches for data augmentation in medical imaging: A review,” Journal of Imaging, vol. 9, no. 4, p. 81, 2023.
[31] W. H. Pinaya, P.-D. Tudosiu, J. Dafflon, P. F. Da Costa, V. Fernandez, P. Nachev, S. Ourselin, and M. J. Cardoso, “Brain imaging generation with latent diffusion models,” in Deep Generative Models: Second MICCAI Workshop, DGM4MICCAI 2022, Held in Conjunction with MICCAI 2022, Singapore, September 22, 2022, Proceedings. Springer, 2022, pp. 117–126.
[32] H. Chung, E. S. Lee, and J. C. Ye, “Mr image denoising and super-resolution using regularized reverse diffusion,” IEEE Transactions on Medical Imaging, 2022.
[33] P. Chambon, C. Bluethgen, J.-B. Delbrouck, R. Van der Sluijs, M. Połacin, J. M. Z. Chaves, T. M. Abraham, S. Purohit, C. P. Langlotz, and A. Chaudhari, “Roentgen: Vision-language foundation model for chest x-ray generation,” arXiv preprint arXiv:2211.12737, 2022.
[34] P. Dhariwal and A. Nichol, “Diffusion models beat gans on image synthesis,” Advances in Neural Information Processing Systems, vol. 34, pp. 8780–8794, 2021.
[35] H. Cao, C. Tan, Z. Gao, G. Chen, P.-A. Heng, and S. Z. Li, “A survey on generative diffusion model,” arXiv preprint arXiv:2209.02646, 2022.
[36] J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” arXiv preprint arXiv:2010.02502, 2020.
[37] N. A. Khan and S. Ali, “A new feature for the classification of non-stationary signals based on the direction of signal energy in the time–frequency domain,” Computers in Biology and Medicine, vol. 100, pp. 10–16, 2018.
[38] X. Yang, J. Zhao, Q. Sun, J. Lu, and X. Ma, “An effective dual self-attention residual network for seizure prediction,” IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 29, pp. 1604–1613, 2021.
[39] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[40] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017.
[41] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “Wavenet: A generative model for raw audio,” arXiv preprint arXiv:1609.03499, 2016.
[42] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, “Language modeling with gated convolutional networks,” in International Conference on Machine Learning. PMLR, 2017, pp. 933–941.
[43] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[44] V. J. Lawhern, A. J. Solon, N. R. Waytowich, S. M. Gordon, C. P. Hung, and B. J. Lance, “Eegnet: a compact convolutional neural network for eeg-based brain–computer interfaces,” Journal of Neural Engineering, vol. 15, no. 5, p. 056013, 2018.
[45] J. Hu, C.-s. Wang, M. Wu, Y.-x. Du, Y. He, and J. She, “Removal of eog and emg artifacts from eeg using combination of functional link neural network and adaptive neural fuzzy inference system,” Neurocomputing, vol. 151, pp. 278–287, 2015.
[46] D. Lu, S. Bauer, V. Neubert, L. S. Costard, F. Rosenow, and J. Triesch, “Staging epileptogenesis with deep neural networks,” in Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, 2020, pp. 1–10.
[47] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10 012–10 022.
[48] R. Barandela, R. M. Valdovinos, J. S. Sánchez, and F. J. Ferri, “The imbalanced training sample problem: Under or over sampling?” in Structural, Syntactic, and Statistical Pattern Recognition: Joint IAPR International Workshops, SSPR 2004 and SPR 2004, Lisbon, Portugal, August 18-20, 2004. Proceedings. Springer, 2004, pp. 806–814.
[49] A. H. Shoeb, “Application of machine learning to epileptic seizure onset detection and treatment,” Ph.D. dissertation, Massachusetts Institute of Technology, 2009.
[50] B. H. Brinkmann, J. Wagenaar, D. Abbot, P. Adkins, S. C. Bosshard, M. Chen, Q. M. Tieng, J. He, F. Muñoz-Almaraz, P. Botella-Rocamora et al., “Crowdsourcing reproducible seizure forecasting in human and canine epilepsy,” Brain, vol. 139, no. 6, pp. 1713–1722, 2016.
[51] T. Maiwald, M. Winterhalder, R. Aschenbrenner-Scheibe, H. U. Voss, A. Schulze-Bonhage, and J. Timmer, “Comparison of three nonlinear seizure prediction methods by means of the seizure prediction characteristic,” Physica D: Nonlinear Phenomena, vol. 194, no. 3-4, pp. 357–368, 2004.
[52] Z. Zhang, A. Liu, Y. Gao, X. Cui, R. Qian, and X. Chen, “Distilling invariant representations with domain adversarial learning for cross-subject children seizure prediction,” IEEE Transactions on Cognitive and Developmental Systems, 2023.
[53] C. Li, Y. Zhao, R. Song, X. Liu, R. Qian, and X. Chen, “Patient-specific seizure prediction from electroencephalogram signal via multi-channel feedback capsule network,” IEEE Transactions on Cognitive and Developmental Systems, 2022.
[54] Y. Zhao, C. Li, X. Liu, R. Qian, R. Song, and X. Chen, “Patient-specific seizure prediction via adder network and supervised contrastive learning,” IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 30, pp. 1536–1547, 2022.
[55] Y. Xu, J. Yang, S. Zhao, H. Wu, and M. Sawan, “An end-to-end deep learning approach for epileptic seizure prediction,” in 2020 2nd IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS). IEEE, 2020, pp. 266–270.
[56] R. Chen and K. K. Parhi, “Seizure prediction using convolutional neural networks and sequence transformer networks,” in 2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC). IEEE, 2021, pp. 6483–6486.
[57] Y. Gao, A. Liu, L. Wang, R. Qian, and X. Chen, “A self-interpretable deep learning model for seizure prediction using a multi-scale prototypical part network,” IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 31, pp. 1847–1856, 2023.
[58] E. Lashgari, D. Liang, and U. Maoz, “Data augmentation for deep-learning-based electroencephalography,” Journal of Neuroscience Methods, vol. 346, p. 108885, 2020.
[59] Y. Xu, J. Yang, and M. Sawan, “Multichannel synthetic preictal eeg signals to enhance the prediction of epileptic seizures,” IEEE Transactions on Biomedical Engineering, vol. 69, no. 11, pp. 3516–3525, 2022.
[60] Z. Islam, M. Abdel-Aty, Q. Cai, and J. Yuan, “Crash data augmentation using variational autoencoder,” Accident Analysis & Prevention, vol. 151, p. 105950, 2021.
[61] Y. Li, Y. Liu, Y.-Z. Guo, X.-F. Liao, B. Hu, and T. Yu, “Spatio-temporal-spectral hierarchical graph convolutional network with semisupervised active learning for patient-specific seizure prediction,” IEEE Transactions on Cybernetics, vol. 52, no. 11, pp. 12 189–12 204, 2021.
[62] J. Fan, C. Sun, C. Chen, X. Jiang, X. Liu, X. Zhao, L. Meng, C. Dai, and W. Chen, “Eeg data augmentation: towards class imbalance problem in sleep staging tasks,” Journal of Neural Engineering, vol. 17, no. 5, p. 056017, 2020.
[63] Z. Zhang, S. Zhong, and Y. Liu, “Beyond mimicking under-represented emotions: Deep data augmentation with emotional subspace constraints for eeg-based emotion recognition,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 9, 2024, pp. 10 252–10 260.
[64] O. E. Karpov, V. V. Grubov, V. A. Maksimenko, N. Utaschev, V. E. Semerikov, D. A. Andrikov, and A. E. Hramov, “Noise amplification precedes extreme epileptic events on human eeg,” Physical Review E, vol. 103, no. 2, p. 022310, 2021.
[65] A. Katirai, N. Garcia, K. Ide, Y. Nakashima, and A. Kishimoto, “Situating the social issues of image generation models in the model life cycle: a sociotechnical approach,” AI and Ethics, pp. 1–18, 2024.

Data Augmentation for Seizure Prediction with Generative Diffusion Model