
Real-time Forecasting of Time Series in Financial Markets Using Sequentially Trained Many-to-one LSTMs

Kelum Gajamannage [email protected] Yonggi Park [email protected] Department of Mathematics and Statistics, Texas A&M University - Corpus Christi, 6300 Ocean Dr., Corpus Christi, TX 78412, USA
Abstract

Financial markets are highly complex and volatile; thus, learning about such markets for the sake of making predictions is vital to make early alerts about crashes and subsequent recoveries. People have been using learning tools from diverse fields such as financial mathematics and machine learning in an attempt to make trustworthy predictions on such markets. However, the accuracy of such techniques had not been adequate until artificial neural network (ANN) frameworks were developed. Moreover, making accurate real-time predictions of financial time series is highly sensitive to the ANN architecture in use and the procedure of training it. Long short-term memory (LSTM) is a member of the recurrent neural network family that has been widely utilized for time series predictions. Specifically, we train two LSTMs with a known length, say $T$ time steps, of previous data and predict only one time step ahead. At each iteration, while one LSTM is employed to find the best number of epochs, the second LSTM is trained only for that best number of epochs to make predictions. We treat the current prediction as part of the training set for the next prediction and retrain the same LSTM. While classic ways of training result in more error when the predictions are made further away in the test period, our approach maintains superior accuracy, since training continues as it proceeds through the testing period. The forecasting accuracy of our approach is validated using three time series from each of three diverse financial markets: stock, cryptocurrency, and commodity. The results are compared with those of an extended Kalman filter, an autoregressive model, and an autoregressive integrated moving average model.

keywords:
Many-to-one LSTM , sequential training , real-time forecasting , time series , financial markets

1 Introduction

Financial markets refer broadly to any marketplace that enables the trading of securities, commodities, and other fungibles; the financial security market includes the stock market, the cryptocurrency market, etc. [Bahadur et al., 2019]. Among the three markets considered here, namely stock, cryptocurrency, and commodity, stock markets are the most familiar to the general public. A cryptocurrency market exchanges digital or virtual currencies between peers without the need for a third party such as a bank [Squarepants, 2022], whereas a commodity market trades raw materials such as gold and oil rather than manufactured products. These markets are both highly complex and volatile due to diverse economical, social, and political conditions [Qiu et al., 2020]. Learning such markets for the sake of making predictions is vital because it aids market analysts in issuing early alerts about crashes and subsequent recoveries, so that investors can either take better precautions against future crashes or gain more profit during future recoveries. Since it is unreliable and inefficient to rely only on a trader's personal experience and intuition for the analysis and judgment of such markets, traders need smart trading recommendations derived from scientific research methods.

The classical methods of making predictions on time series data are mostly linear statistical approaches such as the linear parametric autoregressive (AR), moving average (MA), and autoregressive integrated moving average (ARIMA) models [Zhao et al., 2018], which assume linear relationships between the current output and previous outputs. Thus, they often do not capture non-linear relationships in the data and cannot cope with certain complex time series. Because financial time series are nonstationary, nonlinear, and contaminated with high noise [Bontempi et al., 2013], traditional statistical models have some limitations in predicting financial time series with high precision. Purely data-driven approaches such as Artificial Neural Networks (ANNs) are adopted to forecast nonlinear and nonstationary time series data with both high efficiency and better accuracy, and they have become popular predictors due to their adaptive self-learning [Gajamannage et al., 2021].

Recurrent Neural Networks (RNNs) are powerful and robust types of ANNs that belong to the most promising algorithms in use because of their internal memory [Park et al., 2022]. This internal memory remembers its inputs and helps the RNN find solutions for a vast variety of problems [Ma & Principe, 2018]. An RNN is optimized with respect to its weights to fit the training data by adopting a technique called backpropagation, which requires the gradient of the RNN. However, the gradient of an RNN may vanish or explode during the optimization routine, which hampers the RNN's ability to learn long data sequences [Allen-Zhu et al., 2019]. As a solution to these two problems [Le & Zuidema, 2016], the LSTM architecture [Hochreiter & Schmidhuber, 1997], which is a special type of RNN, is often used. LSTMs are explicitly designed to learn long-term dependencies of time-dependent data by remembering information for long periods. LSTM performs faithful learning in applications such as speech recognition [Tian et al., 2017, Kim et al., 2017] and text processing [Shih et al., 2018, Simistira et al., 2015]. Moreover, LSTM is also suitable for complex data sequences such as stock time series extracted from financial markets because it has internal memory, has the capability of customization, and is free from gradient-related issues.

We adopt a real-time iterative approach to train an LSTM that makes only one prediction at each iteration. For that, we train this LSTM with a known length, say $T$ time steps, of previous data while setting the loss function to be the mean square error between labels and predictions. The LSTM predicts only one time step ahead during the current iteration, and we treat that prediction as an observation in the next training dataset. We train the same LSTM over all the iterations, where the number of iterations is equal to the number of total predictions. This real-time LSTM model is capable of incorporating every new future observation of the time series into the ongoing training process to make predictions. Since we use a sequence of observed time series to predict only one time step ahead, the prediction accuracy increases significantly. Moreover, the $T-1$ previous observations along with the current prediction are used to predict the next time step, so the prediction error associated with the current prediction is further minimized as the scheme runs through iterations. While classic ways of training result in more error when the predictions are made further away in the test period, our approach maintains superior accuracy, since training continues as it proceeds through the testing period.

This paper is structured with four sections, namely, introduction (Sec. 1), methods (Sec. 2), performance analysis (Sec. 3), and discussion (Sec. 4). In Sec. 2, first, we present the notion of real-time time series predictions. Then, we provide the mathematical formulation of the many-to-one LSTM architecture for sequential training. Finally, for the state-of-the-art time series prediction methods, we provide the formulation of one nonlinear statistical approach called the extended Kalman filter (EKF), and two linear statistical approaches called AR and ARIMA. Sec. 3 provides a detailed analysis of the performance of our LSTM architecture against that of EKF, AR, and ARIMA using three financial stocks (Apple, Microsoft, Google), three cryptocurrencies (Bitcoin, Ethereum, Cardano), and three commodities (gold, crude oil, natural gas). We present the conclusions along with a discussion in Sec. 4.

Notation Description
$t$ Index for time steps
$T$ Length of the training period
$N$ Forecasting length
$L$ Number of epochs
$K$ Number of stacked LSTMs
$\mathcal{L}^{(t)}_{l}$ Training loss at the $l$-th epoch of the $t$-th iteration
$\sigma$ Sigmoid function in LSTM
$p$ Order of the AR model
$q$ Number of past innovations in the MA model
$\mathcal{E}$ Relative root mean square error
$\boldsymbol{x}^{(t)}$ The observation at the $t$-th time step, where $1\leq t\leq T$
$\boldsymbol{X}^{(t)}=\left[\boldsymbol{x}^{(t)},\dots,\boldsymbol{x}^{(T+t-2)}\right]$ The $t$-th input training window
$\hat{\boldsymbol{x}}^{(t)}$ The prediction at the $t$-th time step, where $T<t\leq T+N$
$\boldsymbol{w}^{(t)}$ White Gaussian noise vector with zero mean in the state model of EKF
$\boldsymbol{y}^{(t)}$ Observation vector at the $t$-th time step in EKF
$\boldsymbol{v}^{(t)}$ White Gaussian noise vector with zero mean in the observation model of EKF
$\alpha_{i},\ 1\leq i\leq p$ Parameters of the AR model
$\boldsymbol{\epsilon}^{(t)}$ White Gaussian noise vector with zero mean in the AR model
$\beta_{i},\ 1\leq i\leq q$ Parameters of the MA model
$\boldsymbol{a}^{(t)}$ The $t$-th past innovation of the MA model
$\boldsymbol{c}$ Bias vector in ARIMA
$\boldsymbol{b}_{i},\boldsymbol{b}_{f},\boldsymbol{b}_{c}$ Bias vectors in LSTM
$f$ System dynamics in EKF
$h$ Measurement function in EKF
$L^{i}$ The $i$-th level lag operator
$\Delta^{D}(\cdot)$ The $D$-th differenced time series
$Q^{(t)}$, $R^{(t)}$, $P^{(t)}$ Covariance matrices of $\boldsymbol{w}^{(t)}$, $\boldsymbol{v}^{(t)}$, and $\boldsymbol{x}^{(t)}$, respectively, in EKF
$J_{f}$, $J_{h}$ Jacobian matrices of $f(\cdot)$ and $h(\cdot)$, respectively, in EKF
$W_{i},W_{f},W_{o}$ Weight matrices in LSTM
Table 1: Notations used in this paper and their descriptions
Abbreviations Description
LSTM Long Short-Term Memory
KF Kalman Filter
EKF Extended Kalman Filter
AR AutoRegressive
MA Moving Average
ARMA AutoRegressive Moving Average
ARIMA AutoRegressive Integrated Moving Average
Table 2: Abbreviations used in this paper and their descriptions

2 Methods

In this section, first, we provide technical details of the real-time time series prediction scheme. Then, we present the LSTM architecture that caters to real-time time series prediction, together with the LSTM's training and predicting procedures. Moreover, we apply this real-time prediction scheme to three other time series prediction methods, namely, EKF, AR, and ARIMA. These three methods serve as the state-of-the-art baselines against which we compare the performance of the LSTM.

2.1 Real-time time series prediction

We adopt a “sequential” approach to efficiently train time series models and predict the future. For a fixed-length input data sequence, the model is set to predict only one future time step at each iteration, and the process runs until the required length of the prediction is reached. This real-time prediction approach is capable of incorporating every new data point of the time series into the ongoing training process to make predictions for the next time step. Let the currently observed time series be $\left[\boldsymbol{x}^{(1)},\dots,\boldsymbol{x}^{(T)}\right]$ for some $T$, the unobserved future portion of the time series be $\left[\boldsymbol{x}^{(T+1)},\dots,\boldsymbol{x}^{(T+N)}\right]$ for some $N<T$, and the time series model be $\mathcal{F}$, see Fig. 1. For the first iteration, we train the time series forecasting model with $\boldsymbol{x}^{(T)}=\mathcal{F}\left(\boldsymbol{X}^{(1)}\right)$ where $\boldsymbol{X}^{(1)}=\left[\boldsymbol{x}^{(1)},\dots,\boldsymbol{x}^{(T-1)}\right]$. Then, we predict for the time step $(T+1)$, denoted by $\hat{\boldsymbol{x}}^{(T+1)}$, as $\hat{\boldsymbol{x}}^{(T+1)}=\mathcal{F}\left(\boldsymbol{X}^{(2)}\right)$ where $\boldsymbol{X}^{(2)}=\left[\boldsymbol{x}^{(2)},\dots,\boldsymbol{x}^{(T)}\right]$. In the second iteration, we train the same model $\mathcal{F}$ with $\hat{\boldsymbol{x}}^{(T+1)}=\mathcal{F}\left(\boldsymbol{X}^{(2)}\right)$ where $\boldsymbol{X}^{(2)}=\left[\boldsymbol{x}^{(2)},\dots,\boldsymbol{x}^{(T)}\right]$ and predict for the time step $(T+2)$, denoted by $\hat{\boldsymbol{x}}^{(T+2)}$, as $\hat{\boldsymbol{x}}^{(T+2)}=\mathcal{F}\left(\boldsymbol{X}^{(3)}\right)$ where $\boldsymbol{X}^{(3)}=\left[\boldsymbol{x}^{(3)},\dots,\hat{\boldsymbol{x}}^{(T+1)}\right]$. We continue this process until predictions are made for all $N$ time steps.

Figure 1: Real-time time series prediction scheme where the currently observed time series, $\left[\boldsymbol{x}^{(1)},\dots,\boldsymbol{x}^{(T)}\right]$ for some $T$, is shown in blue. The unobserved future time series, $\left[\boldsymbol{x}^{(T+1)},\dots,\boldsymbol{x}^{(T+N)}\right]$ for some $N<T$, is shown in white while the prediction of the future is shown in red. For the first iteration, we train the time series forecasting model with $\boldsymbol{x}^{(T)}=\mathcal{F}\left(\boldsymbol{X}^{(1)}\right)$ where $\boldsymbol{X}^{(1)}=\left[\boldsymbol{x}^{(1)},\dots,\boldsymbol{x}^{(T-1)}\right]$. Then, we predict for the time step $(T+1)$ as $\hat{\boldsymbol{x}}^{(T+1)}=\mathcal{F}\left(\boldsymbol{X}^{(2)}\right)$ where $\boldsymbol{X}^{(2)}=\left[\boldsymbol{x}^{(2)},\dots,\boldsymbol{x}^{(T)}\right]$. In the second iteration, we train the same model with $\hat{\boldsymbol{x}}^{(T+1)}=\mathcal{F}\left(\boldsymbol{X}^{(2)}\right)$ where $\boldsymbol{X}^{(2)}=\left[\boldsymbol{x}^{(2)},\dots,\boldsymbol{x}^{(T)}\right]$ and predict for the time step $(T+2)$ as $\hat{\boldsymbol{x}}^{(T+2)}=\mathcal{F}\left(\boldsymbol{X}^{(3)}\right)$ where $\boldsymbol{X}^{(3)}=\left[\boldsymbol{x}^{(3)},\dots,\hat{\boldsymbol{x}}^{(T+1)}\right]$. We continue this process until all the predictions are performed.
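The following is a minimal Python sketch of this sequential scheme. The callables `fit_one_step` and `predict_one_step` are hypothetical stand-ins for the forecasting model $\mathcal{F}$ (the LSTM or any of the baselines); the window indexing follows the definitions of $\boldsymbol{X}^{(t)}$ above.

```python
# A minimal sketch of the sequential one-step-ahead scheme in Fig. 1.
# `fit_one_step(window, label)` and `predict_one_step(window)` are hypothetical
# stand-ins for any model F that can be refit on a fixed-length window.
import numpy as np

def rolling_forecast(x_obs, N, fit_one_step, predict_one_step):
    """x_obs: observed series of length T; N: number of future steps (N < T)."""
    T = len(x_obs)
    series = list(x_obs)                      # grows by one prediction per iteration
    preds = []
    for t in range(1, N + 1):
        # t-th iteration: train on the window X^(t) with label x^(T+t-1)
        X_train = np.asarray(series[t - 1:T + t - 2])
        label = series[T + t - 2]
        fit_one_step(X_train, label)
        # predict x^(T+t) from the next window X^(t+1)
        X_next = np.asarray(series[t:T + t - 1])
        x_hat = predict_one_step(X_next)
        preds.append(x_hat)
        series.append(x_hat)                  # the prediction joins the next training window
    return np.asarray(preds)
```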

2.2 Many-to-one LSTM architecture with sequential training

Since we make predictions for only one time step ahead at a time for an input time series, the LSTM architecture implemented here is of the many-to-one type; see Fig. 2(a) for the $K$-stacked LSTM architecture. An LSTM consists of a series of nonlinear recurrent modules, denoted as $M^{(t)}_{j}$ for $t=1,\dots,T$ and $j=1,\dots,N$ in Fig. 2, where each module processes data related to one time step. LSTM introduces a memory cell, a special type of hidden state that has the same shape as the hidden state, which is engineered to record additional information. Each recurrent module in an LSTM filters information through four hidden layers, where three of them are gates, namely, the forget gate, the input gate, and the output gate, and the other is called the cell state that maintains and updates long-term memory, see Fig. 2(b).

The forget gate resets the content of the memory cell by deciding what information should be forgotten or retained. This gate produces a value between zero and one, where zero means completely forgetting the previous hidden state and one means completely retaining it. Information from the previous hidden state, i.e., $\boldsymbol{h}^{(t-1)}$, and information from the current input, i.e., $\boldsymbol{x}^{(t)}$, are passed through the sigmoid function, denoted as $\sigma$, according to

\boldsymbol{f}^{(t)}=\sigma\left(W_{f}\cdot[\boldsymbol{h}^{(t-1)},\boldsymbol{x}^{(t)}]+\boldsymbol{b}_{f}\right), (1)

where $W_{f}$ and $\boldsymbol{b}_{f}$ are the weight matrix and bias vector, respectively. The input gate, consisting of two components, decides what new information is to be stored in the cell state. The first component is a sigmoid layer that decides which values are to be updated based on the previous hidden state and the information from the current input such that

\boldsymbol{i}^{(t)}=\sigma\left(W_{i}\cdot[\boldsymbol{h}^{(t-1)},\boldsymbol{x}^{(t)}]+\boldsymbol{b}_{i}\right), (2)

where $W_{i}$ and $\boldsymbol{b}_{i}$ are the weight matrix and bias vector, respectively. The next component is a $\tanh$ layer that creates a vector of new candidate values, $\tilde{\boldsymbol{c}}^{(t)}$, based on the previous hidden state and the information from the current input as

\tilde{\boldsymbol{c}}^{(t)}=\tanh\left(W_{c}\cdot[\boldsymbol{h}^{(t-1)},\boldsymbol{x}^{(t)}]+\boldsymbol{b}_{c}\right), (3)

where $W_{c}$ and $\boldsymbol{b}_{c}$ are the weight matrix and bias vector, respectively.

Figure 2: $K$-stacked LSTMs for many-to-one forecasting of a single-feature time series where each LSTM is a collection of recurrent modules, denoted as $M$'s. (a) The left figure shows a folded version of the artificial neural network (ANN) whereas the right figure shows its unfolded version. Here, the input to the ANN is $\boldsymbol{X}=\left[\boldsymbol{x}^{(1)},\dots,\boldsymbol{x}^{(t)},\dots,\boldsymbol{x}^{(T-1)}\right]$ and the output from it is $\hat{\boldsymbol{x}}^{(T)}$. (b) Each $M$ filters information through four hidden layers where three of them are gates, namely, forget, input, and output, and the other is called the cell state. The forget gate resets the content of the memory cell, the input gate decides what new information is stored in the memory cell, the cell state stores long-term information in the memory, and the output gate sends out a filtered version of the memory cell's stored information from the $M$. Operations in an $M$ are given as follows: $\boldsymbol{f}^{(t)}=\sigma\left(W_{f}\cdot[\boldsymbol{h}^{(t-1)},\boldsymbol{x}^{(t)}]+\boldsymbol{b}_{f}\right)$, $\boldsymbol{i}^{(t)}=\sigma\left(W_{i}\cdot[\boldsymbol{h}^{(t-1)},\boldsymbol{x}^{(t)}]+\boldsymbol{b}_{i}\right)$, $\tilde{\boldsymbol{c}}^{(t)}=\tanh\left(W_{c}\cdot[\boldsymbol{h}^{(t-1)},\boldsymbol{x}^{(t)}]+\boldsymbol{b}_{c}\right)$, $\boldsymbol{c}^{(t)}=\boldsymbol{f}^{(t)}\odot\boldsymbol{c}^{(t-1)}\oplus\boldsymbol{i}^{(t)}\odot\tilde{\boldsymbol{c}}^{(t)}$, $\boldsymbol{o}^{(t)}=\sigma\left(W_{o}\cdot[\boldsymbol{h}^{(t-1)},\boldsymbol{x}^{(t)}]+\boldsymbol{b}_{o}\right)$, and $\boldsymbol{h}^{(t)}=\boldsymbol{o}^{(t)}\odot\tanh\left(\boldsymbol{c}^{(t)}\right)$, where $\odot$ and $\oplus$ denote point-wise multiplication and point-wise addition, respectively.

The cell state updates the LSTM's memory with new long-term information. For that, first, it multiplies point-wise the old cell state $\boldsymbol{c}^{(t-1)}$ by the forget gate $\boldsymbol{f}^{(t)}$, i.e., $\boldsymbol{f}^{(t)}\odot\boldsymbol{c}^{(t-1)}$, to ensure that the information retained from the old cell state is only what the forget gate allows. Then, we add the point-wise product $\boldsymbol{i}^{(t)}\odot\tilde{\boldsymbol{c}}^{(t)}$ to $\boldsymbol{f}^{(t)}\odot\boldsymbol{c}^{(t-1)}$, i.e.,

\boldsymbol{c}^{(t)}=\boldsymbol{f}^{(t)}\odot\boldsymbol{c}^{(t-1)}\oplus\boldsymbol{i}^{(t)}\odot\tilde{\boldsymbol{c}}^{(t)}, (4)

as the information from the current input state that the ANN finds relevant. The output gate determines the value of the next hidden state with the information from the current cell state, the current input state, and the previous hidden state. First, a sigmoid layer decides how much of the current input and the previous hidden state is going to be output. Then, the current cell state is passed through the $\tanh$ layer to scale the cell state value between -1 and 1. Thus, the output $\boldsymbol{h}^{(t)}$ is

\boldsymbol{h}^{(t)}=\boldsymbol{o}^{(t)}\odot\tanh\left(\boldsymbol{c}^{(t)}\right),\ \text{with}\ \ \boldsymbol{o}^{(t)}=\sigma\left(W_{o}\cdot[\boldsymbol{h}^{(t-1)},\boldsymbol{x}^{(t)}]+\boldsymbol{b}_{o}\right), (5)

where $W_{o}$ and $\boldsymbol{b}_{o}$ are the weight matrix and bias vector, respectively. Based upon $\boldsymbol{h}^{(t)}$, the network decides which information from the current hidden state should be carried over to the next hidden state, where the next hidden state is used for prediction. To conclude, the forget gate determines which relevant information from the prior steps is needed, the input gate decides what relevant information can be added from the current cell state, and the output gate finalizes the input to the next hidden state.
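As a concrete illustration of Eqns. (1)-(5), the following NumPy sketch evaluates one recurrent module $M$ for a single time step; the weight matrices act on the concatenation $[\boldsymbol{h}^{(t-1)},\boldsymbol{x}^{(t)}]$ as in the text, and all array shapes are illustrative assumptions rather than the configuration used in our experiments.

```python
# A minimal NumPy sketch of one recurrent module M, Eqns. (1)-(5).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_module(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    z = np.concatenate([h_prev, x_t])        # [h^(t-1), x^(t)]
    f_t = sigmoid(W_f @ z + b_f)             # forget gate, Eqn. (1)
    i_t = sigmoid(W_i @ z + b_i)             # input gate, Eqn. (2)
    c_tilde = np.tanh(W_c @ z + b_c)         # candidate values, Eqn. (3)
    c_t = f_t * c_prev + i_t * c_tilde       # cell state update, Eqn. (4)
    o_t = sigmoid(W_o @ z + b_o)             # output gate, Eqn. (5)
    h_t = o_t * np.tanh(c_t)                 # hidden state, Eqn. (5)
    return h_t, c_t
```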

2.2.1 Optimization of LSTM

Training an LSTM is the process of minimizing a relevant reconstruction error function, also called the loss function, with respect to the weights and bias vectors of Eqns. (1), (2), (3), (4), and (5). Such a minimization problem is implemented in four steps: first, forward-propagate the input data through the ANN to get the output; second, calculate the loss between the forecasted output and the true output; third, calculate the derivatives of the loss function with respect to the LSTM's weights and bias vectors using backpropagation through time (BTT) Werbos [1990]; and fourth, adjust the weights and bias vectors by a gradient descent method Gruslys et al. [2016].

BTT unrolls backward all the dependencies of the output onto the weights of the ANN Manneschi & Vasilaki [2020], which is represented from left to right in Fig. 2(a). At each iteration, say $t\in[1,N+1]$, we train the LSTM with only one input-label instance, where the input is $\boldsymbol{X}^{(t)}=\left[\boldsymbol{x}^{(t)},\dots,\boldsymbol{x}^{(T)},\hat{\boldsymbol{x}}^{(T+1)},\dots,\hat{\boldsymbol{x}}^{(T+t-2)}\right]$ and the label is $\boldsymbol{x}^{(T+t-1)}$. Due to this process, at the $t$-th iteration, the ANN is trained with the $t$-th input-label instance and predicts for the $(T+t-1)$-th time step. Thus, we formulate the loss function at the $t$-th iteration of the LSTM as the relative mean square error,

\mathcal{L}^{(t)}=\frac{\left\|\hat{\boldsymbol{x}}^{(T+t-1)}-\boldsymbol{x}^{(T+t-1)}\right\|_{F}^{2}}{\left\|\boldsymbol{x}^{(T+t-1)}\right\|_{F}^{2}}, (6)

where $\|\cdot\|_{F}$ denotes the Frobenius norm and $\hat{\boldsymbol{x}}^{(T+t-1)}$ is the output of the LSTM for the input $\boldsymbol{X}^{(t)}$. We use BTT to compute the derivatives of Eqn. (6) with respect to the weights and bias vectors. We update the weights using the gradient descent-based method called Adaptive Moment Estimation (ADAM) Kingma & Ba [2015]. ADAM is an iterative optimization algorithm used in recent machine learning algorithms to minimize loss functions, where it employs estimates of both the first and second moments of the gradients in its computations. It generally converges faster than standard gradient descent methods and saves memory by not accumulating intermediate weights.
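A hedged sketch of one training epoch is given below, using PyTorch purely for illustration. The single-layer `nn.LSTM` with a linear head is an assumed stand-in for the many-to-one architecture of Fig. 2, while the loss implements the relative mean square error of Eqn. (6) and the weights are updated with ADAM.

```python
# A hedged PyTorch sketch of one gradient step on the loss of Eqn. (6) using ADAM.
# The ManyToOneLSTM below is an illustrative stand-in, not the authors' exact model.
import torch
import torch.nn as nn

class ManyToOneLSTM(nn.Module):
    def __init__(self, hidden_size=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, window):                  # window: (1, T-1, 1)
        out, _ = self.lstm(window)
        return self.head(out[:, -1, :])         # use only the last hidden state

def relative_mse(pred, target):
    return torch.sum((pred - target) ** 2) / torch.sum(target ** 2)   # Eqn. (6)

model = ManyToOneLSTM()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_one_epoch(window, label):
    # window: tensor of shape (1, T-1, 1); label: tensor of shape (1, 1)
    optimizer.zero_grad()
    loss = relative_mse(model(window), label)   # forward pass + loss
    loss.backward()                             # backpropagation through time
    optimizer.step()                            # ADAM update
    return loss.item()
```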

To ensure better convergence of the loss function, we integrate epochs into the training process in a unique way that we explain here for the $t$-th iteration. However, if the loss function is non-convex or exhibits semi-convergence, choosing the best number of epochs is challenging. Fig. 3 illustrates the non-convex behavior of an LSTM's loss function that is trained with the closing prices of the Apple stock. Here, we input a sequence of 1227 days of prices into the LSTM and generate the price for the 1228-th day, where the loss is computed as the relative mean square error between the predicted price and the observed price for the 1228-th day. We proceed with this single-day training for 60 epochs as shown in Fig. 3. Since the loss varies non-convexly with respect to the epochs, we adopt a unique way of training the LSTM through epochs. Particularly, we maintain two LSTMs, denoted as $LSTM_{1}$ and $LSTM_{2}$, that are trained through each iteration. We assume that the two LSTMs corresponding to the $(t-1)$-th iteration are given for the $t$-th iteration. For the $t$-th iteration, we train $LSTM_{1}$ with the input $\boldsymbol{X}^{(t)}$ and the label $\boldsymbol{x}^{(T+t-1)}$ for a fixed number of epochs, say $L$. Here, we record $LSTM_{1}$'s optimum weights and bias vectors corresponding to each of the epochs. We reformulate $LSTM_{2}$ with the weights and bias vectors corresponding to the least loss among the $L$ epochs. Finally, we redefine $LSTM_{1}$ as $LSTM_{2}$ and proceed to the $(t+1)$-th iteration. Algorithm 1 summarizes the training and prediction procedure of our sequentially trained many-to-one LSTM scheme.

Figure 3: Non-convex behavior of the loss function. We apply our training scheme to train an LSTM with the closing prices of the Apple stock. We input a sequence of 1227 days of prices into the LSTM and generate the price for the 1228-th day. The loss is computed as the relative mean square error between the predicted price and the observed price for the 1228-th day. We proceed with this single-day training 60 times, also called epochs, where the loss for the $l$-th epoch is denoted as $\mathcal{L}^{(t)}_{l}$.
Algorithm 1: Many-to-one LSTM architecture with sequential training.
Denotation: $\boldsymbol{X}^{(1)}=\left[\boldsymbol{x}^{(1)},\dots,\boldsymbol{x}^{(T-1)}\right]$, $\boldsymbol{X}^{(2)}=\left[\boldsymbol{x}^{(2)},\dots,\boldsymbol{x}^{(T)}\right]$, and $\boldsymbol{X}^{(t)}=\left[\boldsymbol{x}^{(t)},\dots,\boldsymbol{x}^{(T)},\hat{\boldsymbol{x}}^{(T+1)},\dots,\hat{\boldsymbol{x}}^{(T+t-2)}\right]$ for $3\leq t\leq N+1$.
Input: training time series $\left(\left[\boldsymbol{x}^{(1)},\dots,\boldsymbol{x}^{(T)}\right]\right)$; forecast length ($N$); number of maximum epochs ($L$).
Output: time series forecast $\left(\left[\hat{\boldsymbol{x}}^{(T+1)},\dots,\hat{\boldsymbol{x}}^{(T+N)}\right]\right)$; trained LSTM ($LSTM_{1}$ or $LSTM_{2}$).
1:  Initialization: two LSTMs, denoted as $LSTM_{1}$ and $LSTM_{2}$, with the weights $W_{f}=W_{i}=W_{c}=W_{o}=0$ and bias vectors $\boldsymbol{b}_{f}=\boldsymbol{b}_{i}=\boldsymbol{b}_{c}=\boldsymbol{b}_{o}=\boldsymbol{h}^{(0)}=\boldsymbol{0}$.
2:  for $t\in[1,N+1]$ do
3:     for $l\in[1,L]$ do
4:        Compute $\hat{\boldsymbol{x}}^{(T+t-1)}=LSTM_{1}\left(\boldsymbol{X}^{(t)}\right)$ according to the map in Fig. 2(a).
5:        Minimize $\mathcal{L}^{(t)}_{l}=\left\|\hat{\boldsymbol{x}}^{(T+t-1)}-\boldsymbol{x}^{(T+t-1)}\right\|_{F}^{2}\Big{/}\left\|\boldsymbol{x}^{(T+t-1)}\right\|_{F}^{2}$ using BTT with respect to the weights $W_{f}$, $W_{i}$, $W_{c}$, and $W_{o}$, and the bias vectors $\boldsymbol{b}_{f}$, $\boldsymbol{b}_{i}$, $\boldsymbol{b}_{c}$, and $\boldsymbol{b}_{o}$ of the composite representation of the functions in Eqns. (1), (2), (3), (4), and (5).
6:        Record $\mathcal{L}^{(t)}_{l}$ along with $W_{f}$, $W_{i}$, $W_{c}$, $W_{o}$, $\boldsymbol{b}_{f}$, $\boldsymbol{b}_{i}$, $\boldsymbol{b}_{c}$, and $\boldsymbol{b}_{o}$.
7:        Update the weights and bias vectors of $LSTM_{1}$ using the gradient descent-based method ADAM.
8:     end for
9:     Reformulate $LSTM_{2}$ with the $W_{f}$, $W_{i}$, $W_{c}$, $W_{o}$, $\boldsymbol{b}_{f}$, $\boldsymbol{b}_{i}$, $\boldsymbol{b}_{c}$, and $\boldsymbol{b}_{o}$ corresponding to the least loss in $\left\{\mathcal{L}^{(t)}_{l}\ \Big{|}\ \forall l\right\}$.
10:     Replicate $LSTM_{2}$ and define it as $LSTM_{1}$.
11:     Forecast the $(T+t)$-th time step, i.e., $\hat{\boldsymbol{x}}^{(T+t)}=LSTM_{1}\left(\boldsymbol{X}^{(t+1)}\right)$.
12:  end for
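The sketch below illustrates one iteration of Algorithm 1 under the same assumptions as the previous PyTorch sketch (it reuses the hypothetical `ManyToOneLSTM`, `relative_mse`, and ADAM setup): $LSTM_{1}$ is trained for $L$ epochs, the weights attaining the least loss are copied into $LSTM_{2}$, and $LSTM_{2}$ then produces the forecast and replaces $LSTM_{1}$ for the next iteration.

```python
# A hedged sketch of one outer iteration of Algorithm 1; reuses the illustrative
# ManyToOneLSTM and relative_mse defined in the previous sketch.
import copy
import torch

def iterate_once(lstm1, window_t, label_t, window_next, L=100, lr=1e-3):
    # window_t, window_next: tensors of shape (1, T-1, 1); label_t: (1, 1)
    optimizer = torch.optim.Adam(lstm1.parameters(), lr=lr)  # fresh ADAM state (a simplification)
    best_loss, best_state = float("inf"), None
    for _ in range(L):                                       # lines 3-8 of Algorithm 1
        optimizer.zero_grad()
        loss = relative_mse(lstm1(window_t), label_t)
        if loss.item() < best_loss:                          # line 6: record the least loss and its weights
            best_loss = loss.item()
            best_state = copy.deepcopy(lstm1.state_dict())
        loss.backward()
        optimizer.step()                                     # line 7: ADAM update
    lstm2 = copy.deepcopy(lstm1)                             # line 9: reformulate LSTM_2 with the best weights
    lstm2.load_state_dict(best_state)
    with torch.no_grad():
        x_hat = lstm2(window_next)                           # line 11: forecast the (T+t)-th time step
    return lstm2, x_hat                                      # line 10: lstm2 becomes LSTM_1 for the next iteration
```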

2.3 State-of-the-art methods

Here, we present three state-of-the-art time series prediction methods, namely, the extended Kalman filter (EKF), the autoregressive (AR) model, and the autoregressive integrated moving average (ARIMA) model, that we use to validate the performance of our LSTM scheme. We utilize the same sequential training as we did for the LSTMs to make real-time predictions on the same financial time series.

2.3.1 Extended Kalman Filter (EKF)

The EKF is a nonlinear version of the standard Kalman filter (KF), where the formulation of the EKF is based on the linearization of both the state and the observation equations. In an EKF, the state Jacobian and the measurement Jacobian replace the state transition matrix and the measurement matrix, respectively, of a linear KF (Valade et al. [2017]). This process essentially linearizes the nonlinear function around the current estimate, which enables the propagation of both the state and the state covariance in an approximately linear format. Here, the extended Kalman filter is presented in three parts, namely, the dynamic process, the model forecast step, and the data assimilation step.

Dynamic Process

Here, we present both the state model and the observation model of a nonlinear dynamic process. The current state, $\boldsymbol{x}^{(t+1)}$, is modeled as the sum of a nonlinear function of the previous state, $\boldsymbol{x}^{(t)}$, and the noise, $\boldsymbol{w}^{(t)}$, as

\boldsymbol{x}^{(t+1)}=f\left(\boldsymbol{x}^{(t)}\right)+\boldsymbol{w}^{(t)}, (7)

where $\boldsymbol{x}^{(t)},\boldsymbol{w}^{(t)}\in\mathbb{R}^{n}$. Here, the random process $\{\boldsymbol{w}^{(t)}\}$ is Gaussian white noise with zero mean and covariance matrix $Q^{(t)}=E\left[\boldsymbol{w}^{(t)}\left(\boldsymbol{w}^{(t)}\right)^{T}\right]$. The initial state $\boldsymbol{x}^{(0)}$ is a random vector with known mean $\boldsymbol{\mu}_{0}=E\left(\boldsymbol{x}^{(0)}\right)$ and covariance $P^{(0)}=E\left[(\boldsymbol{x}^{(0)}-\boldsymbol{\mu}_{0})(\boldsymbol{x}^{(0)}-\boldsymbol{\mu}_{0})^{T}\right]$. The Jacobian of the predicted state with respect to the previous state, denoted as $J_{f}$, is obtained by partial derivatives as $J_{f}=f_{\boldsymbol{x}}(\cdot)$.

The current observation, $\boldsymbol{y}^{(t)}$, is modeled as the sum of a nonlinear function of the current state, $\boldsymbol{x}^{(t)}$, and the noise, $\boldsymbol{v}^{(t)}$, as

\boldsymbol{y}^{(t)}=h\left(\boldsymbol{x}^{(t)}\right)+\boldsymbol{v}^{(t)}, (8)

where $\boldsymbol{y}^{(t)},\boldsymbol{v}^{(t)}\in\mathbb{R}^{n}$. Here, the random process $\{\boldsymbol{v}^{(t)}\}$ is Gaussian white noise with zero mean and covariance matrix $R^{(t)}=E\left[\boldsymbol{v}^{(t)}\left(\boldsymbol{v}^{(t)}\right)^{T}\right]$. The Jacobian of the predicted observation with respect to the state, denoted as $J_{h}$, is obtained by partial derivatives as $J_{h}=h_{\boldsymbol{x}}(\cdot)$.

Model Forecast Step

The state Jacobian and the measurement Jacobian replace the linear KF's state transition matrix and measurement matrix, respectively Valade et al. [2017]. Let the initial estimates of the state and the covariance be $\boldsymbol{x}^{(0|0)}$ and $P^{(0|0)}$, respectively. The state and the covariance matrix are propagated to the next step using

\displaystyle\hat{\boldsymbol{x}}^{(t+1)}\approx f\left(\hat{\boldsymbol{x}}^{(t)}\right) (9)

and

\displaystyle P^{(t+1)}=J_{f}\left(\hat{\boldsymbol{x}}^{(t)}\right)P^{(t)}\left[J_{f}\left(\hat{\boldsymbol{x}}^{(t)}\right)\right]^{T}+Q^{(t)}, (10)

respectively.

Data Assimilation Step

The measurement at the $(t+1)$-th step is given by

\displaystyle\boldsymbol{y}^{(t+1)}\approx h\left(\hat{\boldsymbol{x}}^{(t+1)}\right). (11)

The filter uses the difference between the actual measurement and the predicted measurement to correct the state at the $(t+1)$-th step. To correct the state, the filter must compute the Kalman gain. First, the filter computes the measurement prediction (innovation) covariance as

\displaystyle S^{(t+1)}=J_{h}\left(\boldsymbol{x}^{(t+1)}\right)P^{(t+1)}\left[J_{h}\left(\boldsymbol{x}^{(t+1)}\right)\right]^{T}+R^{(t+1)}. (12)

Then, the filter computes the Kalman gain as

\displaystyle K_{g}^{(t+1)}=P^{(t+1)}\left[J_{h}\left(\boldsymbol{x}^{(t+1)}\right)\right]^{T}\left[S^{(t+1)}\right]^{-1}. (13)

The filter corrects the predicted estimate by using the observation. The estimate, after the correction using the observation $\boldsymbol{y}^{(t+1)}$, is

\displaystyle\begin{split}\hat{\boldsymbol{x}}^{(t+1)}=\boldsymbol{x}^{(t+1)}+K_{g}^{(t+1)}\left[\boldsymbol{y}^{(t+1)}-h\left(\boldsymbol{x}^{(t+1)}\right)\right],\\ P^{(t+1)}=\left[I-K_{g}^{(t+1)}J_{h}\left(\hat{\boldsymbol{x}}^{(t+1)}\right)\right]P^{(t+1)}.\end{split} (14)

The corrected state is often called the a posteriori estimate of the state, because it is derived after including the observation.
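For reference, a minimal NumPy sketch of one EKF forecast/assimilation cycle, Eqns. (9)-(14), is given below; the functions `f`, `h` and their Jacobians `J_f`, `J_h` are user-supplied callables and are placeholders, not the specific dynamics used in our experiments.

```python
# A minimal NumPy sketch of one EKF cycle, Eqns. (9)-(14).
import numpy as np

def ekf_step(x_hat, P, y_new, f, h, J_f, J_h, Q, R):
    # model forecast step, Eqns. (9)-(10)
    x_pred = f(x_hat)
    F = J_f(x_hat)
    P_pred = F @ P @ F.T + Q
    # data assimilation step, Eqns. (11)-(14)
    H = J_h(x_pred)
    S = H @ P_pred @ H.T + R                    # innovation covariance, Eqn. (12)
    K = P_pred @ H.T @ np.linalg.inv(S)         # Kalman gain, Eqn. (13)
    x_new = x_pred + K @ (y_new - h(x_pred))    # corrected state, Eqn. (14)
    P_new = (np.eye(len(x_new)) - K @ H) @ P_pred
    return x_new, P_new
```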

2.3.2 Autoregression (AR) model

Many observed time series exhibit serial autocorrelation, that is, a linear association between lagged observations. The AR model predicts the value for the current time step, $\boldsymbol{x}^{(t)}$, based on a linear relationship between the $p$ most recent observations, $\boldsymbol{x}^{(t-1)}$, $\boldsymbol{x}^{(t-2)}$, $\dots$, $\boldsymbol{x}^{(t-p)}$, where $p$ is known as the order of the model Geurts et al. [1977]. Let $\alpha_{1},\dots,\alpha_{p}\in\mathbb{R}$ be the coefficients; then, the order-$p$ AR model is given by

\displaystyle\boldsymbol{x}^{(t)}=\boldsymbol{c}+\alpha_{1}\boldsymbol{x}^{(t-1)}+\alpha_{2}\boldsymbol{x}^{(t-2)}+\dots+\alpha_{p}\boldsymbol{x}^{(t-p)}+\boldsymbol{\epsilon}^{(t)}, (15)

where $\boldsymbol{\epsilon}^{(t)}$ is uncorrelated noise with zero mean. Let the lag operator be defined by $L^{i}\boldsymbol{x}^{(t)}=\boldsymbol{x}^{(t-i)}$. We define the order-$p$ autoregressive lag operator polynomial as $\alpha(L)=(1-\alpha_{1}L-\dots-\alpha_{p}L^{p})$. Thus, the AR model is given by

\displaystyle\alpha(L)\boldsymbol{x}^{(t)}=c+\boldsymbol{\epsilon}^{(t)}. (16)

The solution for the AR model is given by

\displaystyle\boldsymbol{x}^{(t)}=\alpha^{-1}(L)\left(c+\boldsymbol{\epsilon}^{(t)}\right). (17)
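A minimal sketch of fitting the AR($p$) model of Eqn. (15) by ordinary least squares on lagged observations, and of producing the one-step-ahead prediction, is given below; the least-squares estimator is used here only for illustration and is not claimed to be the estimator used in our experiments.

```python
# A minimal NumPy sketch of AR(p) fitting, Eqn. (15), and one-step prediction.
import numpy as np

def fit_ar(x, p):
    """x: 1-D series; returns intercept c and coefficients alpha_1..alpha_p."""
    x = np.asarray(x, float)
    rows = [x[t - p:t][::-1] for t in range(p, len(x))]   # [x^(t-1), ..., x^(t-p)]
    X = np.column_stack([np.ones(len(rows)), np.array(rows)])
    y = x[p:]
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)         # ordinary least squares
    return theta[0], theta[1:]

def ar_predict_next(x, c, alpha):
    """Predict the next value from the last p observations of x."""
    x = np.asarray(x, float)
    p = len(alpha)
    return c + alpha @ x[-1:-p - 1:-1]                    # c + sum_i alpha_i x^(t-i)
```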

2.3.3 Autoregressive integrated moving average (ARIMA) model

The ARIMA model is made by combining a differenced version of the AR model with a moving average (MA) model. The MA model captures serial autocorrelation in a time series $\boldsymbol{x}^{(1)},\dots,\boldsymbol{x}^{(t)},\dots,\boldsymbol{x}^{(T)}$ by expressing the conditional mean of $\boldsymbol{x}^{(t)}$ as a function of past innovations, $\boldsymbol{a}^{(t)},\boldsymbol{a}^{(t-1)},\dots,\boldsymbol{a}^{(t-q)}$. An MA model that depends on $q$ past innovations is called an MA model of order $q$, denoted by MA($q$). In general, the MA($q$) model can be represented by the formula

\displaystyle\boldsymbol{x}^{(t)}=\boldsymbol{\mu}+\boldsymbol{a}^{(t)}+\beta_{1}\boldsymbol{a}^{(t-1)}+\dots+\beta_{q}\boldsymbol{a}^{(t-q)}, (18)

where the $\boldsymbol{a}^{(t)}$'s are uncorrelated innovation processes with zero mean and $\boldsymbol{\mu}$ is the unconditional mean of $\boldsymbol{x}^{(t)}$ for all $t$.

For some observed time series, a high-order AR or MA model is needed to capture the underlying process well. In this case, a combined ARMA model can sometimes be a more parsimonious choice. An ARMA model expresses the conditional mean of $\boldsymbol{x}^{(t)}$ as a function of both recent observations, $\boldsymbol{x}^{(t-1)},\boldsymbol{x}^{(t-2)},\dots,\boldsymbol{x}^{(t-p)}$, and recent innovations, $\boldsymbol{a}^{(t)},\boldsymbol{a}^{(t-1)},\dots,\boldsymbol{a}^{(t-q)}$. The ARMA model with AR degree $p$ and MA degree $q$ is denoted by ARMA($p,q$) and is given by

\displaystyle\begin{split}\boldsymbol{x}^{(t)}=\boldsymbol{c}+\alpha_{1}\boldsymbol{x}^{(t-1)}+\alpha_{2}\boldsymbol{x}^{(t-2)}+\dots+\alpha_{p}\boldsymbol{x}^{(t-p)}\\ +\beta_{0}\boldsymbol{a}^{(t)}+\beta_{1}\boldsymbol{a}^{(t-1)}+\dots+\beta_{q}\boldsymbol{a}^{(t-q)},\end{split} (19)

Shumway & Stoffer [2017].

The ARIMA process generates nonstationary series that are integrated of order $D$, where that nonstationary process can be made stationary by taking $D$ differences. A series that can be modeled as a stationary ARMA($p,q$) process after being differenced $D$ times is denoted by ARIMA($p,D,q$), which is given by

\begin{split}\Delta^{D}\boldsymbol{x}^{(t)}=\mu+\alpha_{1}\Delta^{D}\boldsymbol{x}^{(t-1)}+\alpha_{2}\Delta^{D}\boldsymbol{x}^{(t-2)}+\dots+\alpha_{p}\Delta^{D}\boldsymbol{x}^{(t-p)}\\ +\boldsymbol{a}^{(t)}-\beta_{1}\boldsymbol{a}^{(t-1)}-\beta_{2}\boldsymbol{a}^{(t-2)}-\dots-\beta_{q}\boldsymbol{a}^{(t-q)},\end{split} (20)

where $\Delta^{D}\boldsymbol{x}^{(t)}$ denotes the $D$-th differenced time series, the $\boldsymbol{a}^{(t)}$'s are uncorrelated innovation processes with zero mean, and $\mu$ is the unconditional mean of $\boldsymbol{x}^{(t)}$ for all $t$ [Newbold, 1983]. With the lag operator $L^{i}\boldsymbol{x}^{(t)}=\boldsymbol{x}^{(t-i)}$, the ARIMA model can be written as

\displaystyle\alpha(L){(1-L)}^{D}\boldsymbol{x}^{(t)}=c+\beta(L)\boldsymbol{a}^{(t)}, (21)

where $\alpha(L)=(1-\alpha_{1}L-\dots-\alpha_{p}L^{p})$ and $\beta(L)=(1+\beta_{1}L+\dots+\beta_{q}L^{q})$. Thus, the solution for the ARIMA model is given by

\displaystyle\boldsymbol{x}^{(t)}=\left(\alpha(L){(1-L)}^{D}\right)^{-1}\left(c+\beta(L)\boldsymbol{a}^{(t)}\right). (22)
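The following hedged sketch applies an ARIMA($p,D,q$) model within the same sequential one-step-ahead scheme of Fig. 1; it assumes the `statsmodels` package, and the default order is a placeholder rather than the tuned values reported in Sec. 3.

```python
# A hedged sketch of rolling one-step-ahead ARIMA forecasting, assuming statsmodels.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def arima_rolling_forecast(x_obs, N, order=(1, 1, 1)):
    """x_obs: observed series of length T; N: number of one-step-ahead forecasts."""
    T = len(x_obs)
    history = list(x_obs)
    preds = []
    for t in range(N):
        window = np.asarray(history[t:t + T], float)  # slide the training window forward by one
        fitted = ARIMA(window, order=order).fit()     # refit ARIMA(p, D, q) on the current window
        x_hat = float(fitted.forecast(steps=1)[0])    # predict only one time step ahead
        preds.append(x_hat)
        history.append(x_hat)                         # the prediction joins the next window
    return np.asarray(preds)
```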

3 Performance Analysis

The performance analysis of LSTM is conducted using nine financial time series obtained from three markets, namely, stocks, cryptocurrencies, and commodities. We chose Apple, Google, and Microsoft for stocks; Bitcoin, Ethereum, Cardano for cryptocurrencies; and Gold, Oil, and Natural Gas for commodities. These diverse examples validate the broad applicability of LSTMs in analyzing and predicting financial time series.

We follow the procedure given in Fig. 1 to train the real-time many-to-one LSTM architecture given in Fig. 2. Setting the LSTM to run for a fixed number of epochs and then using that trained network to make predictions often yields neither the best training nor accurate predictions, since the loss function undergoes semi-convergence as shown in Fig. 3. To avoid this issue, first, we train the LSTM for 100 epochs; second, we compute the best number of epochs associated with the least loss; and finally, we train a new LSTM for that many epochs. Moreover, the parameter choices for the training length and prediction length are shown in Table 3.

Table 3: Parameter choices for the training length and prediction length used in real-time many-to-one LSTMs.
Time series Training length ($T$) Prediction length ($N$)
Apple 1228 30
Microsoft 1228 30
Google 1228 30
Bitcoin 1064 30
Ethereum 1064 30
Cardano 1064 30
Oil 8248 200
Natural gas 5802 150
Gold 816 30

Now, we incorporate the same one-day recursive prediction procedure in Fig. 1 into the other three state-of-the-art methods, namely, EKF, AR, and ARIMA, to predict the above financial time series. After a trial-and-error process, we found that the best $p$'s of AR are 300, 400, and 400 for Apple, Microsoft, and Google, respectively; and the best ($p,D,q$)'s of ARIMA are (10, 0, 2), (10, 2, 1), and (0, 1, 1) for Apple, Microsoft, and Google, respectively. Then, the best $p$'s of AR were found to be 100, 100, and 300 for Bitcoin, Ethereum, and Cardano, respectively; and the best ($p,D,q$)'s of ARIMA were found to be (6, 0, 2), (6, 1, 1), and (8, 2, 1) for Bitcoin, Ethereum, and Cardano, respectively. Finally, the best $p$'s of AR were 200, 200, and 100 for Oil, Natural gas, and Gold, respectively; and the best ($p,D,q$)'s of ARIMA were (4, 1, 1), (10, 1, 2), and (8, 2, 0) for Oil, Natural gas, and Gold, respectively. Thus, we set the methods with the best parameter values and executed them on the corresponding time series.

We compute the mean of the relative absolute difference between the predicted and the observed time series for the prediction period using

\mathcal{E}=\frac{1}{N}\sum^{T+N}_{t=T+1}\frac{\big{\|}\hat{\boldsymbol{x}}^{(t)}-\boldsymbol{x}^{(t)}\big{\|}_{2}}{\big{\|}\boldsymbol{x}^{(t)}\big{\|}_{2}}, (23)

as an error measure of the predictions, which we report in Table 4. Hereby, we observe that the order from the best to the worst prediction performance is LSTM, ARIMA, AR, and EKF.
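For scalar price series, the 2-norms in Eqn. (23) reduce to absolute values, so $\mathcal{E}$ can be computed with a few lines of NumPy, as sketched below.

```python
# A minimal sketch of the error measure E in Eqn. (23) for scalar price series.
import numpy as np

def relative_error(x_hat, x_obs):
    """Mean relative absolute difference over the prediction period."""
    x_hat = np.asarray(x_hat, float)
    x_obs = np.asarray(x_obs, float)
    return np.mean(np.abs(x_hat - x_obs) / np.abs(x_obs))
```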

Table 4: Prediction error, quantified as the mean of the relative difference between the predicted and the observed time series over the prediction period, of the four methods LSTM, EKF, AR, and ARIMA. The analysis is conducted on three stocks, Apple, Microsoft, and Google; three cryptocurrencies, Bitcoin, Ethereum, and Cardano; and three commodities, oil, natural gas, and gold. On average, LSTM performs 17 times better than EKF, 7 times better than AR, and 4 times better than ARIMA. Moreover, the average prediction errors of LSTM are 0.05, 0.22, and 0.14 for stocks, cryptocurrencies, and commodities, respectively.
Time series LSTM EKF AR ARIMA
Apple $5.9\times10^{-2}$ $4.5\times10^{-1}$ $2.5\times10^{-1}$ $7.5\times10^{-2}$
Microsoft $5.5\times10^{-2}$ $4.2\times10^{-1}$ $4.3\times10^{-1}$ $6.3\times10^{-2}$
Google $3.5\times10^{-2}$ $4.7\times10^{-1}$ $4.7\times10^{-1}$ $5.5\times10^{-2}$
Bitcoin $2.1\times10^{-1}$ $2.7\times10^{0}$ $4.0\times10^{-1}$ $3.8\times10^{-1}$
Ethereum $1.8\times10^{-1}$ $2.9\times10^{0}$ $8.1\times10^{-1}$ $1.0\times10^{0}$
Cardano $2.7\times10^{-1}$ $4.3\times10^{0}$ $1.2\times10^{0}$ $1.7\times10^{0}$
Oil $1.5\times10^{-1}$ $4.6\times10^{0}$ $6.3\times10^{-1}$ $3.0\times10^{-1}$
Natural gas $2.0\times10^{-1}$ $3.4\times10^{0}$ $8.0\times10^{-1}$ $1.0\times10^{0}$
Gold $7.7\times10^{-2}$ $2.4\times10^{0}$ $1.2\times10^{0}$ $1.1\times10^{0}$

Fig. 4 shows the price predictions of the three stocks, Apple, Microsoft, and Google, using our real-time many-to-one LSTM, EKF, AR, and ARIMA. Since some of the predictions mimic the observed time series so closely that they overlap, we compute the absolute difference between the observations and the predictions, see Figs. 4(c), 4(f), and 4(i). We observe that all four methods are capable of capturing the pattern of the time series, where the order from the best to the worst prediction performance is LSTM, ARIMA, AR, and EKF. Moreover, while LSTM and ARIMA perform comparably well, EKF and AR perform comparably weakly.

Figure 4: Price prediction of three stocks, Apple (first row), Microsoft (second row), and Google (third row), using our real-time many-to-one LSTM (blue), EKF (yellow), AR (green), and ARIMA (purple). The first, second, and third columns show the entire time series, the observed and predicted time series for the last 30 days of the prediction period, and the absolute difference between the observed and predicted time series, respectively. We observe that the order from the best to the worst performance is LSTM, ARIMA, AR, and EKF. Note that some of the time series are not visible as they are covered by others.

Fig. 5 shows the price predictions of the three cryptocurrencies, Bitcoin, Ethereum, and Cardano, using LSTM, EKF, AR, and ARIMA. Since some of the predictions mimic the observed time series so closely that they overlap, we compute the absolute difference between the observations and the predictions, see Figs. 5(c), 5(f), and 5(i). We observe that LSTM, ARIMA, and AR are capable of capturing the pattern of the time series in contrast to the weak prediction of EKF. The order from the best to the worst prediction performance is LSTM, AR, ARIMA, and EKF.

Figure 5: Price prediction of three cryptocurrencies, Bitcoin (first row), Ethereum (second row), and Cardano (third row), using our real-time many-to-one LSTM (blue), EKF (yellow), AR (green), and ARIMA (purple). The first, second, and third columns show the entire time series, the observed and predicted time series for the last 30 days of the prediction period, and the absolute difference between the observed and predicted time series, respectively. We observe that the order from the best to the worst performance is LSTM, AR, ARIMA, and EKF. Note that some of the time series are not visible as they are covered by others.

Fig. 6 shows the price predictions of the three commodities, Oil, Natural gas, and Gold, using LSTM, EKF, AR, and ARIMA. We compute the absolute difference between the observations and the predictions, see Figs. 6(c), 6(f), and 6(i), since some of the predictions are similar to the observations. We observe that mostly LSTM, ARIMA, and AR are capable of capturing the pattern of the time series. The order from the best to the worst prediction performance is LSTM, AR, ARIMA, and EKF.

Figure 6: Price prediction of three commodities, Oil (first row), Natural gas (second row), and Gold (third row), using our real-time many-to-one LSTM (blue), EKF (yellow), AR (green), and ARIMA (purple). The first, second, and third columns show the entire time series, the observed and predicted time series for the last 30 days of the prediction period, and the absolute difference between the observed and predicted time series, respectively. We observe that the order from the best to the worst performance is LSTM, AR, ARIMA, and EKF. Note that some of the time series are not visible as they are covered by others.

The performance of this real-time many-to-one LSTM is highly influenced by the number of epochs for which it is trained. To check this assertion, we compute the prediction performance of the LSTM with respect to different numbers of epochs for Apple, Bitcoin, and Gold, see Fig. 7. The prediction performance is computed as the mean of the relative absolute difference, i.e., $\mathcal{E}$, between the prediction and the observed time series. Since EKF, AR, and ARIMA are independent of epochs, we represent their $\mathcal{E}$ as a straight line. We observe that the performance of the LSTM improves from the worst to the best as the number of epochs increases.

Figure 7: Prediction performance of the real-time many-to-one LSTM with respect to different numbers of epochs. The first row shows the mean of the relative absolute difference, denoted as $\mathcal{E}$, between the prediction and the observed time series for Apple, Bitcoin, and Gold. Note that EKF, AR, and ARIMA are independent of epochs; however, we represent their $\mathcal{E}$ as a straight line in the first row. The second, third, and fourth rows show the predicted and observed prices in the prediction periods of Apple, Bitcoin, and Gold for 10 epochs, 20 epochs, and 50 epochs, respectively. We observe that the performance of the LSTM improves from the worst to the best as the number of epochs increases. Note that some of the time series and plots are not visible as they are covered by others.

4 Discussion

The classical methods of solving temporal chaotic systems are mostly linear models which assume linear relationships between a system's previous outputs for stationary time series. Thus, they often do not capture non-linear relationships in the data and cannot cope with certain non-stationary signals. Because financial time series are often nonstationary, nonlinear, and contain noise [Bontempi et al., 2013], traditional statistical models encounter some limitations in predicting them with high precision. In this paper, we have presented a real-time forecasting technique for financial markets using a sequentially trained many-to-one LSTM. We applied this technique to time series obtained from the stock, cryptocurrency, and commodity markets, and then compared its performance against three state-of-the-art methods, namely, EKF, AR, and ARIMA.

Here, we train a many-to-one LSTM with sequential data sampled using a moving window approach such that the succeeding window is shifted forward by one data instance from the preceding window. Such sequential window training plays an important role in time series predictions since it 1) helps generate more training data from a given limited time series and thus enables thorough training of the ANN; 2) makes the data heterogeneous so that the overfitting of the ANN can be reduced; and 3) facilitates the learning of patterns not only of the entire time series but also of short segments of sequential data. Sequential window training maximizes the performance of this LSTM, as it accelerates the LSTM's learning capability and increases the LSTM's robustness to new data.

The performance analysis of this study covers the LSTM applied to nine time series obtained from three financial markets: stocks (Apple, Microsoft, Google), cryptocurrencies (Bitcoin, Ethereum, Cardano), and commodities (gold, crude oil, natural gas). We observed that the LSTM performs considerably better than the other three methods for all nine datasets, whereas the performance of EKF was significantly weak. As seen in Table 4, on average, LSTM performs 17 times better than EKF, 7 times better than AR, and 4 times better than ARIMA. The average prediction errors of LSTM are 0.05, 0.22, and 0.14 for stocks, cryptocurrencies, and commodities, respectively. The reason for this difference is that while prediction of less volatile time series, such as those in the stock market, is easy, prediction of highly volatile time series, such as those in the cryptocurrency market, is challenging.

In future work, we are planning to extend this sequentially trained many-to-one LSTM to employ it as a real-time fault detection technique in industrial production processes. This real-time fault detection scheme will be capable of producing an early alarm about a shift in the production process so that the quality control team can take the necessary actions. Moreover, trajectories of collectively moving agents can be represented on a low-dimensional manifold that underlies a high-dimensional data cloud Gajamannage et al. [2019], Gajamannage & Paffenroth [2021], Gajamannage et al. [2015]. However, some segments of these trajectories are not tracked by multi-object tracking methods due to natural phenomena such as occlusions. Thus, we are planning to utilize our LSTM architecture to make predictions for the fragmented segments of the trajectories.

We empirically validated that our real-time LSTM outperforms EKF, AR, and ARIMA. In the future, we are planning to compare the performance of our real-time LSTM with that of other well-known ANN-based methods such as Facebook's Prophet [Taylor & Letham, 2018], Amazon's DeepAR [Salinas et al., 2020], Google's Temporal Fusion Transformer [Lim et al., 2021], and Element AI's N-BEATS [Oreshkin et al., 2019]. Prophet was designed for automatic forecasting of univariate time series data. DeepAR is a probabilistic forecasting model based on recurrent neural networks. Temporal Fusion Transformer is a novel attention-based architecture that combines high-performance multi-horizon forecasting with interpretable insights into temporal dynamics. N-BEATS is a custom deep learning algorithm that is based on backward and forward residual links for univariate time series point forecasting.

We presented a nonlinear, real-time prediction technique for financial time series, built upon a many-to-one LSTM that is sequentially trained with windows of data. The sequential window training approach has significantly improved the LSTM's learning ability while dramatically reducing its over-fitting issues. We empirically justified that our LSTM possesses superior performance even for highly volatile time series such as those of cryptocurrencies and commodities.

Acknowledgments

The authors would like to thank the Google Cloud Platform for granting Research Credit to access its GPU computing resources under project number 397744870419.

References

  • Allen-Zhu et al. [2019] Allen-Zhu, Z., Li, Y., & Song, Z. (2019). On the convergence rate of training recurrent neural networks. In Advances in Neural Information Processing Systems (pp. 1310–1318). PMLR volume 32. arXiv:1810.12065.
  • Bahadur et al. [2019] Bahadur, N., Paffenroth, R., & Gajamannage, K. (2019). Dimension Estimation of Equity Markets. In Proceedings - 2019 IEEE International Conference on Big Data, Big Data 2019 (pp. 5491--5498). Institute of Electrical and Electronics Engineers Inc. doi:10.1109/BigData47090.2019.9006343.
  • Bontempi et al. [2013] Bontempi, G., Ben Taieb, S., & Le Borgne, Y. A. (2013). Machine learning strategies for time series forecasting. In Lecture Notes in Business Information Processing (pp. 62--77). Springer volume 138 LNBIP. doi:10.1007/978-3-642-36318-4_3.
  • Gajamannage et al. [2015] Gajamannage, K., Butail, S., Porfiri, M., & Bollt, E. M. (2015). Identifying manifolds underlying group motion in Vicsek agents. European Physical Journal: Special Topics, 224, 3245--3256. doi:10.1140/epjst/e2015-50088-2.
  • Gajamannage & Paffenroth [2021] Gajamannage, K., & Paffenroth, R. (2021). Bounded manifold completion. Pattern Recognition, 111, 107661. doi:https://doi.org/10.1016/j.patcog.2020.107661.
  • Gajamannage et al. [2019] Gajamannage, K., Paffenroth, R., & Bollt, E. M. (2019). A nonlinear dimensionality reduction framework using smooth geodesics. Pattern Recognition, 87, 226--236. doi:10.1016/j.patcog.2018.10.020.
  • Gajamannage et al. [2021] Gajamannage, K., Park, Y., Paffenroth, R., & Jayasumana, A. P. (2021). Reconstruction of Fragmented Trajectories of Collective Motion using Hadamard Deep Autoencoders. arXiv preprint arXiv:2110.10428. doi:10.48550/arxiv.2110.10428. arXiv:2110.10428.
  • Geurts et al. [1977] Geurts, M., Box, G. E. P., & Jenkins, G. M. (1977). Time Series Analysis: Forecasting and Control. Journal of Marketing Research, 14, 269. doi:10.2307/3150485.
  • Gruslys et al. [2016] Gruslys, A., Munos, R., Danihelka, I., Lanctot, M., & Graves, A. (2016). Memory-efficient backpropagation through time. Advances in Neural Information Processing Systems, 29, 4132--4140. arXiv:1606.03401.
  • Hochreiter & Schmidhuber [1997] Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9, 1735--1780. doi:10.1162/neco.1997.9.8.1735.
  • Kim et al. [2017] Kim, J., El Khamy, M., & Lee, J. (2017). Residual LSTM: Design of a deep recurrent architecture for distant speech recognition. Proceedings of the Annual Conference of the International Speech Communication Association, 2017-Augus, 1591--1595. doi:10.21437/Interspeech.2017-477.
  • Kingma & Ba [2015] Kingma, D. P., & Ba, J. L. (2015). Adam: A method for stochastic optimization. 3rd International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings. doi:10.48550/arXiv.1412.6980. arXiv:1412.6980.
  • Le & Zuidema [2016] Le, P., & Zuidema, W. (2016). Quantifying the Vanishing Gradient and Long Distance Dependency Problem in Recursive Neural Networks and Recursive LSTMs. arXiv preprint arXiv:1603.00423, (pp. 87--93). doi:10.18653/v1/w16-1610. arXiv:1603.00423.
  • Lim et al. [2021] Lim, B., Arık, S., Loeff, N., & Pfister, T. (2021). Temporal Fusion Transformers for interpretable multi-horizon time series forecasting. International Journal of Forecasting, 37, 1748--1764. doi:10.1016/j.ijforecast.2021.03.012. arXiv:1912.09363.
  • Ma & Principe [2018] Ma, Y., & Principe, J. (2018). Comparison of Static Neural Network with External Memory and RNNs for Deterministic Context Free Language Learning. In Proceedings of the International Joint Conference on Neural Networks (pp. 1--7). IEEE volume 2018-July. doi:10.1109/IJCNN.2018.8489240.
  • Manneschi & Vasilaki [2020] Manneschi, L., & Vasilaki, E. (2020). An alternative to backpropagation through time. Nature Machine Intelligence, 2, 155--156. doi:10.1038/s42256-020-0162-9.
  • Newbold [1983] Newbold, P. (1983). ARIMA model building and the time series analysis approach to forecasting. Journal of Forecasting, 2, 23--35. doi:10.1002/for.3980020104.
  • Oreshkin et al. [2019] Oreshkin, B. N., Carpov, D., Chapados, N., & Bengio, Y. (2019). N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. arXiv preprint arXiv:1905.10437. URL: http://arxiv.org/abs/1905.10437. arXiv:1905.10437.
  • Park et al. [2022] Park, Y., Gajamannage, K., Jayathilake, D. I., & Bollt, E. M. (2022). Recurrent Neural Networks for Dynamical Systems: Applications to Ordinary Differential Equations, Collective Motion, and Hydrological Modeling (pp. 1--15). doi:10.48550/arxiv.2202.07022.
  • Qiu et al. [2020] Qiu, J., Wang, B., & Zhou, C. (2020). Forecasting stock prices with long-short term memory neural network based on attention mechanism. PLoS ONE, 15, e0227222. doi:10.1371/journal.pone.0227222.
  • Salinas et al. [2020] Salinas, D., Flunkert, V., Gasthaus, J., & Januschowski, T. (2020). DeepAR: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting, 36, 1181--1191. doi:10.1016/j.ijforecast.2019.07.001. arXiv:1704.04110.
  • Shih et al. [2018] Shih, C. H., Yan, B. C., Liu, S. H., & Chen, B. (2018). Investigating Siamese LSTM networks for text categorization. In Proceedings - 9th Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2017 (pp. 641--646). IEEE volume 2018-Febru. doi:10.1109/APSIPA.2017.8282104.
  • Shumway & Stoffer [2017] Shumway, R. H., & Stoffer, D. S. (2017). ARIMA Models. (pp. 75--163). Springer, Cham. doi:10.1007/978-3-319-52452-8_3.
  • Simistira et al. [2015] Simistira, F., Ul-Hassan, A., Papavassiliou, V., Gatos, B., Katsouros, V., & Liwicki, M. (2015). Recognition of historical Greek polytonic scripts using LSTM networks. In Proceedings of the International Conference on Document Analysis and Recognition, ICDAR (pp. 766--770). IEEE volume 2015-Novem. doi:10.1109/ICDAR.2015.7333865.
  • Squarepants [2022] Squarepants, S. (2022). Bitcoin: A Peer-to-Peer Electronic Cash System. SSRN Electronic Journal, (p. 21260). doi:10.2139/ssrn.3977007.
  • Taylor & Letham [2018] Taylor, S. J., & Letham, B. (2018). Forecasting at Scale. American Statistician, 72, 37--45. doi:10.1080/00031305.2017.1380080.
  • Tian et al. [2017] Tian, X., Zhang, J., Ma, Z., He, Y., Wei, J., Wu, P., Situ, W., Li, S., & Zhang, Y. (2017). Deep LSTM for large vocabulary continuous speech recognition. doi:10.48550/arXiv.1703.07090. arXiv:1703.07090.
  • Valade et al. [2017] Valade, A., Acco, P., Grabolosa, P., & Fourniols, J. Y. (2017). A study about kalman filters applied to embedded sensors. Sensors (Switzerland), 17, 2810. doi:10.3390/s17122810.
  • Werbos [1990] Werbos, P. J. (1990). Backpropagation Through Time: What It Does and How to Do It. Proceedings of the IEEE, 78, 1550--1560. doi:10.1109/5.58337.
  • Zhao et al. [2018] Zhao, Y., Ge, L., Zhou, Y., Sun, Z., Zheng, E., Wang, X., Huang, Y., & Cheng, H. (2018). A new Seasonal Difference Space-Time Autoregressive Integrated Moving Average (SD-STARIMA) model and spatiotemporal trend prediction analysis for Hemorrhagic Fever with Renal Syndrome (HFRS). PLoS ONE, 13, e0207518. doi:10.1371/journal.pone.0207518.