MTSA-SNN: A Multi-modal Time Series Analysis Model Based on Spiking Neural Network
* These authors contributed equally to this work.
† Corresponding Author.
Abstract
Time series analysis and modelling constitute a crucial research area. Traditional artificial neural networks struggle with complex, non-stationary time series data due to high computational complexity, limited ability to capture temporal information, and difficulty in handling event-driven data. To address these challenges, we propose a Multi-modal Time Series Analysis Model Based on Spiking Neural Network (MTSA-SNN). The Pulse Encoder unifies the encoding of temporal images and sequential information in a common pulse-based representation. The Joint Learning Module employs a joint learning function and a weight allocation mechanism to complementarily fuse information from multi-modal pulse signals. Additionally, we incorporate wavelet transform operations to enhance the model's ability to analyze and evaluate temporal information. Experimental results demonstrate that our method achieves superior performance on three complex time-series tasks. This work provides an effective event-driven approach to overcoming the challenges associated with analyzing intricate temporal information. The source code is available at https://github.com/Chenngzz/MTSA-SNN
Index Terms:
Multi-modal, Time series analysis, Spiking neural network, Joint learning, Pulse encoder, Wavelet transform
I Introduction
Traditional artificial neural networks (ANNs) have found extensive applications in time series analysis. They serve as non-parametric, non-linear models capable of effectively capturing complex non-linear relationships within time series data, which is particularly valuable because the relationships within such data are typically non-linear. Deep neural networks (DNNs), as an extension of ANNs, exhibit a multi-layer structure that automatically learns features and hierarchical information from data. This characteristic enhances the capability of DNNs to analyze complex time series data by capturing patterns at various abstraction levels. For instance, deep learning models such as Long Short-Term Memory (LSTM) networks have been widely employed to predict future values or sequences from past time steps [1]. ANNs have also been widely applied across a range of applications traditionally addressed by statistical methods, including classification, pattern recognition, prediction, and process control [2].
However, for complex and volatile time series information, traditional ANNs often face challenges in capturing temporal features accurately. Consequently, Spiking Neural Networks (SNNs), as an alternative approach, have garnered considerable attention. Currently, SNNs have been successfully applied in various time series prediction scenarios, including financial time series forecasting, time series classification [3], and real-time online time series prediction [4].
SNNs rely on discrete signals in continuous time to effectively capture complex time patterns. Nonetheless, current SNN models encounter several challenges. First, the transformation of time series data into a suitable spiking representation poses a significant challenge. Second, the firing times of spiking neurons play a crucial role in model performance, necessitating higher demands for stability and accuracy. Moreover, integrating information from different sources into a single spiking network framework for decision-making involves complex issues related to cross-modal time synchronization and information mapping.
To address these challenges, we propose a Multi-Modal Time Series Analysis model based on Spiking Neural Networks (MTSA-SNN). This model consists of three key components: a single-modal spiking encoder, a spiking joint learning module, and an output layer. The spiking encoder is responsible for transforming time-series information from different modalities into spike signals. It includes alternating layers of feature extraction and neuron layers to selectively process input data from each modality. In the spiking joint learning module, we design a joint learning function and weight allocation mechanism to balance and fuse the complex spike information from multiple modalities. The output layer optimally adjusts the fused spike information to adapt to complex time series analysis tasks. The main contributions are as follows:
• We propose a novel multi-modal time series analysis model based on a spiking neural network. The model introduces an efficient event-driven approach that overcomes the limitations of traditional time series analysis methods.
• We design SNN joint learning functions and a weight allocation mechanism that effectively balance and fuse the pulsed information from different modalities.
• We synergize the wavelet transform with the pulse network to bolster the model's capability to analyze complex and non-stationary temporal data.
• Extensive experiments demonstrate the outstanding performance of our approach across multiple complex time series datasets.
II Related Work
II-A Time Series Forecasting
Modelling and forecasting time series data is a valuable task in various domains. It has evolved significantly, transitioning from traditional methods to deep learning techniques, resulting in improved prediction accuracy and relevance over time.
Initially, time series forecasting relied on traditional approaches such as the ARIMA model [5] and Fourier analysis [6]. ARIMA, which includes auto-regressive (AR) and moving average (MA) components with differencing (I) to address non-stationarity, had challenges related to parameter selection and model identification. Fourier analysis was used for frequency domain analysis to identify periodic and seasonal patterns in the data.
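As a brief illustration of these classical components, the sketch below fits a small ARIMA model with statsmodels; the toy series, the order (2, 1, 1), and the forecast horizon are illustrative choices rather than settings used in any work cited here.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Minimal ARIMA(p, d, q) illustration: AR order 2, one difference (d=1) to handle
# non-stationarity, MA order 1. The data and the order are purely illustrative.
series = np.cumsum(np.random.randn(200))        # random-walk-like toy series
result = ARIMA(series, order=(2, 1, 1)).fit()   # estimate the AR, I, MA components
forecast = result.forecast(steps=10)            # predict the next 10 values
```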
Later, deep learning methods such as RNN and LSTM emerged to handle temporal dependencies [7]. LSTM, an improved version of RNN, performed better with long sequences due to its enhanced memory and forgetting mechanisms, becoming the preferred model for many time series problems. Nonetheless, they encountered challenges related to gradient vanishing and exploding when handling extended sequences, which restricted their practicality.
In contrast to single-modal time series forecasting, multi-modal time series forecasting leverages multiple data sources, such as text, images, and sensor data, to capture a broader perspective, enabling a wider range of pattern and trend recognition. This approach offers benefits like information synthesis, complementarity of different data types, model robustness, and improved generalization. Multi-modal deep learning models use CNN and BiLSTM to extract features from multi-modal time series data. Ensemble models, including probabilistic time series prediction based on Hidden Markov Models [8] and stacked ensembles, have been used to enhance accuracy and reduce overfitting.
Specific algorithms, including interpretable ML models and multi-modal meta-learning techniques [9], have been applied in diverse use cases, ranging from early Parkinson’s disease detection to time series regression tasks. These applications highlight their potential in various domains, reflecting the diversity and complexity of time series modelling and forecasting. They underscore the evolving methods and technologies that offer robust tools for a broad spectrum of application scenarios.

II-B Spiking Neural Network
Multi-modal time series models struggle with complex, irregularly and non-uniformly sampled data due to their continuous computations, difficulty in handling event-driven data patterns, and high computational complexity. However, Spiking Neural Networks (SNNs) hold promise in mitigating these challenges. SNNs, a unique class of neural networks that communicate using discrete spike signals in a continuous-time framework [10], are capable of emulating biological neural systems’ sparsity and encoding temporal information [11]. SNNs find practical application in various time series prediction scenarios, including financial time series forecasting, time series classification [3], and real-time online time series prediction [4].
SNNs pose training challenges due to the complexity of their neuron dynamics and the non-differentiable nature of pulse-based operations. The choice of a multi-modal time series model should therefore depend on the problem and the characteristics of the data: variants such as multi-spike networks, for instance, have proven valuable for time series prediction, particularly on non-stationary data such as financial time series [12].
In summary, the proposed MTSA-SNN model efficiently encodes multimodal information into spikes. It utilizes a spike-based cooperative learning module to effectively map and integrate complex spike information. This method provides an accurate and practical event-driven approach that addresses the analysis of complex and non-stationary temporal information, demonstrating strong performance across multiple time series datasets.
III Methodology
The MTSA-SNN structure consists of three main components: the SNN Encoder Module, which extracts features from time series data; the SNN Joint Learning Module, which uses joint learning and probability-distribution methods to map multi-modal signals into a shared joint learning space and fuse the pulse signals; and the Output Layer, which generates prediction and classification results for multi-modal time series data. The entire workflow is shown in Algorithm 1.
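As a rough sketch of how the three components compose, the module below wires illustrative image and sequence encoders, a joint learning module, and an output head; the class and argument names are our own assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class MTSASNNSketch(nn.Module):
    """Illustrative composition of the three components described above.
    Module and argument names are assumptions, not the released code."""
    def __init__(self, img_encoder: nn.Module, seq_encoder: nn.Module,
                 joint_module: nn.Module, fused_dim: int, out_dim: int):
        super().__init__()
        self.img_encoder = img_encoder   # SNN Image Encoder: temporal images -> pulse features
        self.seq_encoder = seq_encoder   # SNN Series Encoder: value sequence -> pulse features
        self.joint = joint_module        # SNN Joint Learning Module: fuse the two pulse streams
        self.head = nn.Sequential(       # Output layer: maps fused pulses to task outputs
            nn.Linear(fused_dim, fused_dim), nn.ReLU(),
            nn.Linear(fused_dim, out_dim),
        )

    def forward(self, x_img: torch.Tensor, x_seq: torch.Tensor) -> torch.Tensor:
        s_img = self.img_encoder(x_img)      # pulse-encode the image modality
        s_seq = self.seq_encoder(x_seq)      # pulse-encode the sequence modality
        fused = self.joint(s_img, s_seq)     # joint learning and weight allocation
        return self.head(fused)              # prediction / classification result
```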
III-A Single-Modal Pulse Encoding Module
The SNN Image Encoder is a component that processes time-series image information into pulse representations and extracts features. This encoder alternates between the Feature Extraction (FM) module and the Leaky Integrate-and-Fire (LIF) SNN module. Visual information initially passes through the SNN layer to be transformed into a unified and compatible pulse signal format, making it suitable for subsequent network operations. The FM module further performs feature extraction on the visual information converted into pulse signals, including operations such as convolution and pooling. After feature extraction, the pulse signal is then passed to the pulse co-learning module.
The SNN Series Encoder is the other modality encoder, used for pulse-coding and feature extraction of temporal data sequences. The sequence data first pass through the SNN layer and are transformed into pulse signals. The network employs alternating operations between mapping layers and neurons. Each neuron receives the pulse information from the previous layer together with its membrane potential from the preceding time step in the sequence. Through this self-feedback mechanism, the pulse network can use the membrane potential from the previous time step to influence the computation at the current time step. Consequently, the encoder is better equipped to capture the temporal correlations and dynamic changes in time-series data. The pulse information produced by the sequence encoder is denoted $S_{\text{seq}}$ in what follows.
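The sketch below illustrates this alternation between a mapping layer and LIF neurons, with the membrane potential carried across time steps; it anticipates the LIF dynamics formalized next. It is a minimal PyTorch-style assumption of the structure, not the paper's exact encoder (a practical implementation would also need a surrogate gradient for the thresholding step).

```python
import torch
import torch.nn as nn

class LIFCell(nn.Module):
    """Minimal discrete LIF step used between feature-extraction layers (illustrative)."""
    def __init__(self, tau: float = 2.0, v_th: float = 1.0, v_reset: float = 0.0):
        super().__init__()
        self.tau, self.v_th, self.v_reset = tau, v_th, v_reset

    def forward(self, x, v):
        v = v + (x - (v - self.v_reset)) / self.tau   # leaky integration of the input
        spike = (v >= self.v_th).float()              # fire when the threshold is crossed
        v = v * (1.0 - spike) + self.v_reset * spike  # hard reset after a spike
        return spike, v

class SeriesEncoderSketch(nn.Module):
    """Sequence encoder sketch: mapping layer + LIF neurons whose membrane potential
    feeds back across time steps (assumed structure, not the released code)."""
    def __init__(self, in_dim: int, hidden: int):
        super().__init__()
        self.fc = nn.Linear(in_dim, hidden)
        self.lif = LIFCell()

    def forward(self, x):                              # x: (batch, time, in_dim)
        v = torch.zeros(x.size(0), self.fc.out_features)
        spikes = []
        for t in range(x.size(1)):
            s, v = self.lif(self.fc(x[:, t]), v)       # membrane state carried to the next step
            spikes.append(s)
        return torch.stack(spikes, dim=1)              # pulse train S_seq
```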
Due to the strong temporal information processing capabilities of SNN, we employ the Leaky Integrate-and-Fire (LIF) model to describe the neural dynamics of multi-modal information. The following formula can represent the dynamic equation for the LIF model under continuous-time sequences:
$\tau_{m}\,\frac{dV(t)}{dt} = -\big(V(t) - V_{\text{rest}}\big) + R\,I(t)$ (1)
$S(t) = \Theta\big(V(t) - V_{\text{th}}\big)$ (2)
$V(t)$ is the membrane potential as a function of time $t$. $V_{\text{rest}}$ represents the resting membrane potential of the neuron. $\tau_{m}$ is a constant that characterizes the charging and discharging rate of the neuron's membrane potential. $I(t)$ is the synaptic pulse input function, and $R$ denotes the membrane's responsiveness to input currents.
When the membrane potential $V(t)$ exceeds the threshold potential $V_{\text{th}}$, the neuron is activated and triggers a spike, denoted $S(t)$. $\Theta(\cdot)$ is the Heaviside step function, which is 1 when $V(t) \geq V_{\text{th}}$ and 0 otherwise. $V_{\text{reset}}$ is the reset potential, to which the membrane potential is reset when the neuron is activated:
$V(t) \leftarrow V_{\text{reset}}, \quad \text{when } V(t) \geq V_{\text{th}}$ (3)
A neuron receives multiple pulse signals. Their effects are not independent but accumulate within the neuron, leading to a sustained change in membrane potential. By controlling pulse frequency and timing, neurons can integrate and encode input information over time. Assuming that $N$ neurons generate multiple pulses at different time points, these pulse timings can be represented by a series of time sequences $\{t_{i}^{f}\}$, where $t_{i}^{f}$ is the $f$-th firing time of neuron $i$. The cumulative effect of multiple pulses can then be expressed as $S_{\text{out}}(t) = \sum_{i=1}^{N}\sum_{f}\delta\big(t - t_{i}^{f}\big)$.
$\delta(\cdot)$ represents the Dirac delta function, signifying the generation of a pulse at the firing time. $S_{\text{out}}(t)$ is the output of the cumulative effect of multiple pulses, which corresponds to the pulse outputs of the image and sequence encoders, $S_{\text{img}}$ and $S_{\text{seq}}$. Algorithm 2 shows the workflow of the SNN encoder.
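A minimal numerical reading of Eqs. (1)–(3) is given below, using a plain Euler step and illustrative constants; it simply shows how a constant synaptic drive is integrated, fired, and reset.

```python
import numpy as np

def lif_simulate(i_input, dt=1.0, tau_m=10.0, r=1.0,
                 v_rest=0.0, v_th=1.0, v_reset=0.0):
    """Euler integration of Eq. (1) with the firing rule of Eq. (2) and the
    reset of Eq. (3). All constants are illustrative, not values from the paper."""
    v = v_rest
    spikes = np.zeros(len(i_input))
    for k, i_t in enumerate(i_input):
        v += (-(v - v_rest) + r * i_t) * dt / tau_m   # Eq. (1): leaky integration
        if v >= v_th:                                 # Eq. (2): threshold crossing
            spikes[k] = 1.0
            v = v_reset                               # Eq. (3): reset after the spike
    return spikes

# Toy usage: a constant supra-threshold drive yields a regular spike train.
spike_train = lif_simulate(np.full(100, 1.5))
```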
III-B Multi-Modal Pulse Joint Learning Module
The pulse signals extracted from the different encoders are first subjected to normalization and mapping operations before being input into a unified pulse co-learning module. The pulse signals $S_{\text{img}}$ and $S_{\text{seq}}$ obtained from the two heterogeneous spaces are then transformed from the time domain to the frequency domain through the Fourier transform. The Fourier transform decomposes each signal into different frequency components, which aids in analyzing the frequency-domain characteristics of the different modal signals.
$S_{\text{img}}(\omega) = \int_{-\infty}^{\infty} S_{\text{img}}(t)\, e^{-j\omega t}\, dt$ (4)
$S_{\text{seq}}(\omega) = \int_{-\infty}^{\infty} S_{\text{seq}}(t)\, e^{-j\omega t}\, dt$ (5)
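In discrete time, the transform in Eqs. (4)–(5) amounts to an FFT over each pulse train; the toy arrays below merely stand in for the normalized encoder outputs.

```python
import numpy as np

# Discrete analogue of Eqs. (4)-(5): move both pulse trains into the frequency domain.
# s_img and s_seq stand in for the normalized encoder outputs; lengths are illustrative.
rng = np.random.default_rng(0)
s_img = rng.binomial(1, 0.2, size=256).astype(float)   # toy image-branch pulse train
s_seq = rng.binomial(1, 0.3, size=256).astype(float)   # toy sequence-branch pulse train

S_img_freq = np.fft.fft(s_img)   # frequency components of the image-branch pulses
S_seq_freq = np.fft.fft(s_seq)   # frequency components of the sequence-branch pulses
```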
To better integrate and align the information from the two signal spaces, we introduce a joint learning function denoted $\mathcal{F}_{\text{joint}}$. This function adjusts the feature representations of the signals, mapping them from the space of $S_{\text{img}}(\omega)$ and the space of $S_{\text{seq}}(\omega)$ to a common frequency-domain space. During training, this function is continuously adjusted so that the pulse information of the different modalities becomes more consistent, achieving effective fusion and alignment of the heterogeneous signals. $Z$ denotes the fused pulse information in the joint learning space, and $d$ is the dimension of the joint learning space, in which data from different modalities coexist in a shared representation.
$Z_{\text{img}} = \mathcal{F}_{\text{joint}}\big(S_{\text{img}}(\omega)\big), \qquad Z_{\text{seq}} = \mathcal{F}_{\text{joint}}\big(S_{\text{seq}}(\omega)\big)$ (6)
$Z = \big[\,Z_{\text{img}}\,;\,Z_{\text{seq}}\,\big] \in \mathbb{R}^{d}$ (7)
We introduce a more effective pulse-based joint weight allocation mechanism (JWAM). This mechanism maps the similarity results of the multi-modal pulse signals into the different spatial dimensions of a probability distribution matrix $P$. The similarity probability distribution is adaptively adjusted based on the features of each modality and their relative importance so as to achieve information fusion. $P$ integrates information from the various modalities, providing a quantitative way of scoring cross-modal information representations. $D(\cdot,\cdot)$ is a metric function used to measure the similarity between two pulse representations in the heterogeneous spaces; it employs the Euclidean distance to assess the similarity between elements $Z_{\text{img}}^{(i)}$ and $Z_{\text{seq}}^{(j)}$ of the different modality representations. $\sigma$ adjusts the sensitivity of the similarity measurement; notably, it can adapt dynamically to the feature distributions of the different modalities, enhancing the robustness and adaptability of the similarity measurement.
$D_{ij} = \big\lVert Z_{\text{img}}^{(i)} - Z_{\text{seq}}^{(j)} \big\rVert_{2}$ (8)
$P_{ij} = \dfrac{\exp\!\big(-D_{ij}^{2} / 2\sigma^{2}\big)}{\sum_{j'} \exp\!\big(-D_{ij'}^{2} / 2\sigma^{2}\big)}$ (9)
Furthermore, matrix transformations of the information in the joint space are used to adjust the pulse signals. This operation optimizes the feature space while taking into account information from the different modalities, so as to better accommodate the characteristics of the pulse sequences from the other modality. This process interacts with the cross-modal probability distribution to obtain the pulse fusion representation, denoted $S_{\text{fused}}$. This can be expressed as:
$S_{\text{fused}} = P\,\big(W Z\big)$ (10)
where $W$ denotes the learnable matrix transformation applied to the joint-space representation.
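One plausible realization of Eqs. (8)–(10) is sketched below: pairwise Euclidean distances are converted into a probability matrix with a Gaussian kernel of width $\sigma$, which then reweights a linearly transformed representation. The Gaussian kernel, the softmax normalization, and the use of the sequence-branch features in place of the full joint representation are assumptions made for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointWeightAllocationSketch(nn.Module):
    """Sketch of the joint weight allocation mechanism: Euclidean similarity between
    the two modal representations is turned into a probability matrix P, which then
    reweights a matrix-transformed representation (cf. Eqs. (8)-(10)). Details are
    illustrative assumptions, not the released implementation."""
    def __init__(self, dim: int, sigma: float = 1.0):
        super().__init__()
        self.proj = nn.Linear(dim, dim, bias=False)  # learnable matrix transformation W
        self.sigma = sigma                           # sensitivity of the similarity measure

    def forward(self, z_img: torch.Tensor, z_seq: torch.Tensor) -> torch.Tensor:
        # z_img, z_seq: (n, dim) joint-space features of the two modalities
        dist = torch.cdist(z_img, z_seq)                               # Eq. (8): Euclidean distances
        p = F.softmax(-dist.pow(2) / (2.0 * self.sigma ** 2), dim=-1)  # Eq. (9): probability matrix P
        return p @ self.proj(z_seq)                                    # Eq. (10): fused pulse representation

# Toy usage with random joint-space features.
fuse = JointWeightAllocationSketch(dim=32)
s_fused = fuse(torch.randn(8, 32), torch.randn(8, 32))
```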
The Output layer is responsible for two major tasks: predicting and classifying information from multi-modal time series pulse fusion data. It employs network layer techniques such as residual connections and ReLU to transform the fused information into a common format, making it available for various downstream tasks.
III-C Pulse Signal Processing Based on Wavelet Transform
To effectively address the non-stationary, non-linear characteristics and constraints in multi-scale feature analysis of time-series data, we employ the wavelet transform analysis method. Wavelet transform possesses exceptional time-frequency locality and multi-scale analysis capabilities, making it more suitable for capturing local features of signals at different time and frequency scales. The MTSA-SNN network based on wavelet transform can capture richer feature representations, endowing it with a significant advantage in handling non-stationary signals, extracting critical signal features, and analyzing signals across multiple scales.
MTSA-SNN employs wavelet transform to decompose input signals into four subbands: LL, LH, HH and HL, which represent distinct signal characteristics in terms of different frequencies and spatial scales. This multi-scale and multi-frequency analysis approach equips the MTSA-SNN model with a comprehensive understanding of multimodal data, enhancing its learning capabilities. As illustrated in Fig. 3 and Fig. 4, the temporal visualizations of these four subbands in the ETT and stock prediction datasets demonstrate the effectiveness of this multi-scale analysis.
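The four-subband decomposition described above can be reproduced with a single-level 2-D discrete wavelet transform, for example via PyWavelets; the 'haar' wavelet and the input size are illustrative choices rather than the paper's settings.

```python
import numpy as np
import pywt

# Single-level 2-D DWT: one call yields the approximation (LL) and the three detail
# subbands. The LL/LH/HL/HH labels follow the usual convention for pywt's
# (cA, (cH, cV, cD)) output; the 'haar' wavelet and 64x64 input are illustrative.
frame = np.random.rand(64, 64)                 # e.g. one temporal image or feature map
LL, (LH, HL, HH) = pywt.dwt2(frame, 'haar')    # each subband has shape (32, 32)
```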

Fig. 2 depicts the pulse network outputs based on the MIT-BIH dataset with different processing methods. It is evident that the pulse output subjected to wavelet transform more accurately captures the features of multimodal signals, resulting in a more stable and effective neural activation.


IV EXPERIMENT
IV-A Datasets
We conduct experimental evaluations for classification and regression tasks on two traditional time series datasets, MIT-BIH Arrhythmia (MIT-BIH)[13] and Electricity Transformer Temperature (ETT)[14]. Additionally, we perform a market forecasting analysis on relevant stock indices of the Chinese stock market from June 6, 2013, to June 6, 2023, covering a ten-year period, focusing on the opening and closing prices.
IV-B Comparison with other methods
MTSA-SNN demonstrates remarkable performance advantages in the field of biological time-series data analysis. The experimental results in Table I demonstrate that our model achieves advanced performance in detecting cardiac arrhythmias from multimodal electrocardiogram data. With a classification accuracy of 98.75% on this dataset, MTSA-SNN markedly outperforms previous leading algorithms. This is attributable to MTSA-SNN's pulse-based fusion approach, which effectively simulates the neural signal conduction process of biological systems and yields significant performance advantages.
Network | Accuracy (%) | F1 (%) | Precision (%) |
---|---|---|---|
Mousavi et al. [12] | 97.62 | 85.82 | 91.46 |
Yang et al. [15] | 97.76 | 88.28 | 94.34 |
Hammad et al. [16] | 98.00 | 89.70 | 86.55 |
Xing et al. [17] | 98.26 | 89.09 | - |
MTSA-SNN (ours) | 98.75 | 94.31 | 94.62 |
In addition, our method exhibits outstanding performance in various prediction tasks, including transformer temperature monitoring and stock market forecasting. Analyzing the results presented in Table II, our model demonstrates the lowest MAE and MSE across four different time steps in the ETT dataset. Furthermore, in Table III, MTSA-SNN achieves remarkably low errors of 0.96 and 1.15 in the stock market price prediction task compared to traditional time-series prediction models such as LSTM and XGBoost. MTSA-SNN, by converting complex and diverse multimodal time series data into a pulse-based representation, significantly enhances the model’s predictive and analytical capabilities regarding time-series information.
Prediction length | NLinear [18] MSE | NLinear [18] MAE | DLinear [18] MSE | DLinear [18] MAE | Autoformer [19] MSE | Autoformer [19] MAE | Informer [14] MSE | Informer [14] MAE | MTSA-SNN (ours) MSE | MTSA-SNN (ours) MAE |
---|---|---|---|---|---|---|---|---|---|---|
96 | 0.374 | 0.394 | 0.375 | 0.399 | 0.449 | 0.459 | 0.865 | 0.713 | 0.235 | 0.247 |
192 | 0.408 | 0.415 | 0.405 | 0.416 | 0.500 | 0.482 | 1.008 | 0.792 | 0.345 | 0.371 |
336 | 0.429 | 0.427 | 0.439 | 0.443 | 0.521 | 0.496 | 1.107 | 0.809 | 0.358 | 0.362 |
720 | 0.440 | 0.453 | 0.472 | 0.490 | 0.514 | 0.512 | 1.181 | 0.865 | 0.396 | 0.439 |
Metric | LSTM | XGBoost | LSTM-XGBoost | MTSA-SNN (ours) |
---|---|---|---|---|
MAE | 2.465 | 2.317 | 1.394 | 0.961 |
MSE | 2.839 | 2.285 | 1.461 | 1.152 |

IV-C Ablation Study
We conduct a comprehensive ablation study to evaluate the different components of the MTSA-SNN model. As shown in Fig. 5, we present pulse-signal output heatmaps for the different components at the same time step on the MIT-BIH dataset, where the brightness of the colours represents the activation levels of the neurons. Compared with the activation patterns of the single-modal encoders, the joint learning module of MTSA-SNN activates more neurons, enriching the representation of temporal information; this suggests that joint learning of pulses effectively balances and fuses the multi-modal pulse signals. Furthermore, applying the wavelet transform further enhances the representation of temporal information within the MTSA-SNN.
In addition, we analyze the spectral information of the waveform plots during the training of the single-modal encoders and the joint learning module. In Fig. 6, the horizontal axis represents the time steps, while the vertical axis represents the amplitude. These spectra indicate that the MTSA-SNN model effectively integrates and analyzes multi-modal signals while enhancing the overall robustness of the model.

V Conclusion
In this paper, we introduce an innovative Multi-modal Time Series Analysis Model based on the Spiking Neural Network. The model’s pulse encoder is designed to uniformly pulse-code multi-modal information. The pulse joint learning module is employed to effectively integrate complex pulse-encoded data. Additionally, we incorporate wavelet transform operations to enhance the model’s capability to analyze and evaluate time series data. Experimental results on three distinct time series datasets demonstrate the outstanding performance of our proposed approach across multiple tasks.
References
- [1] Y. Hua, Z. Zhao, R. Li, X. Chen, Z. Liu, H. Zhang, Deep learning with long short-term memory for time series prediction, IEEE Communications Magazine 57 (6) (2019) 114–119.
- [2] O. I. Abiodun, A. Jantan, A. E. Omolara, K. V. Dada, N. A. Mohamed, H. Arshad, State-of-the-art in artificial neural network applications: A survey, Heliyon 4 (11) (2018).
- [3] H. Fang, A. Shrestha, Q. Qiu, Multivariate time series classification using spiking neural networks, in: 2020 International Joint Conference on Neural Networks (IJCNN), IEEE, 2020, pp. 1–7.
- [4] A. M. George, S. Dey, D. Banerjee, A. Mukherjee, M. Suri, Online time-series forecasting using spiking reservoir, Neurocomputing 518 (2023) 82–94.
- [5] L. Kong, G. Li, W. Rafique, S. Shen, Q. He, M. R. Khosravi, R. Wang, L. Qi, Time-aware missing healthcare data prediction based on ARIMA model, IEEE/ACM Transactions on Computational Biology and Bioinformatics (2022).
- [6] E. M. Stein, R. Shakarchi, Fourier analysis: an introduction, Vol. 1, Princeton University Press, 2011.
- [7] Q. Xian, W. Liang, A multi-modal time series intelligent prediction model, in: Z. Qian, M. Jabbar, X. Li (Eds.), Proceeding of 2021 International Conference on Wireless Communications, Networking and Applications, Springer Nature Singapore, Singapore, 2022, pp. 1150–1157.
- [8] M. Zhang, X. Jiang, Z. Fang, Y. Zeng, K. Xu, High-order hidden markov model for trend prediction in financial time series, Physica A: Statistical Mechanics and its Applications 517 (2019) 1–12.
- [9] Z. Chen, D. Wang, Multi-initialization meta-learning with domain adaptation, in: ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2021, pp. 1390–1394.
- [10] K. Yamazaki, V.-K. Vo-Ho, D. Bulsara, N. Le, Spiking neural networks and their applications: A review, Brain Sciences 12 (7) (2022) 863.
- [11] S. C.-X. Li, B. Marlin, Learning from irregularly-sampled time series: A missing data perspective, in: International Conference on Machine Learning, PMLR, 2020, pp. 5937–5946.
- [12] Q. Liu, L. Long, Q. Yang, H. Peng, J. Wang, X. Luo, LSTM-SNP: A long short-term memory model inspired from spiking neural P systems, Knowledge-Based Systems 235 (2022) 107656.
- [13] G. B. Moody, R. G. Mark, The impact of the MIT-BIH arrhythmia database, IEEE Engineering in Medicine and Biology Magazine 20 (3) (2001) 45–50.
- [14] H. Zhou, S. Zhang, J. Peng, S. Zhang, J. Li, H. Xiong, W. Zhang, Informer: Beyond efficient transformer for long sequence time-series forecasting, in: Proceedings of the AAAI Conference on Artificial Intelligence, 2021, pp. 11106–11115.
- [15] W. Yang, Y. Si, D. Wang, B. Guo, Automatic recognition of arrhythmia based on principal component analysis network and linear support vector machine, Computers in Biology and Medicine 101 (2018) 22–32.
- [16] M. Hammad, A. M. Iliyasu, A. Subasi, E. S. Ho, A. A. Abd El-Latif, A multitier deep learning model for arrhythmia detection, IEEE Transactions on Instrumentation and Measurement 70 (2020) 1–9.
- [17] Y. Xing, L. Zhang, Z. Hou, X. Li, Y. Shi, Y. Yuan, F. Zhang, S. Liang, Z. Li, L. Yan, Accurate ECG classification based on spiking neural network and attentional mechanism for real-time implementation on personal portable devices, Electronics 11 (12) (2022) 1889.
- [18] A. Zeng, M. Chen, L. Zhang, Q. Xu, Are transformers effective for time series forecasting?, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 37, 9, 2023, pp. 11121–11128.
- [19] H. Wu, J. Xu, J. Wang, M. Long, Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting, Advances in Neural Information Processing Systems (NIPS) 34 (2021) 22419–22430.