Quaternion Recurrent Neural Network
with Real-Time Recurrent Learning
and Maximum Correntropy Criterion
Abstract
We develop a robust quaternion recurrent neural network (QRNN) for real-time processing of 3D and 4D data with outliers. This is achieved by combining the real-time recurrent learning (RTRL) algorithm and the maximum correntropy criterion (MCC) as a loss function. While both the mean square error and maximum correntropy criterion are viable cost functions, it is shown that the non-quadratic maximum correntropy loss function is less sensitive to outliers, making it suitable for applications with multidimensional noisy or uncertain data. Both algorithms are derived based on the novel generalised HR (GHR) calculus, which allows for the differentiation of real functions of quaternion variables and offers the product and chain rules, thus enabling elegant and compact derivations. Simulation results in the context of motion prediction of chest internal markers for lung cancer radiotherapy, which includes regular and irregular breathing sequences, support the analysis.
Index Terms:
quaternion recurrent neural network, real-time recurrent learning, maximum correntropy criterion, generalised HR calculus, motion prediction
I Introduction
Recently, the incorporation of quaternion algebra into neural network architectures has paved the way for enhancing performance and robustness, particularly in contexts where data are naturally multidimensional [1, 2, 3, 4]. Quaternion neural networks (QNNs) leverage the inherent multidimensional nature of quaternions to build models that are more compact than quadrivariate real ones, thus improving parameter efficiency and potentially capturing intricate patterns in data [5, 6, 7, 8]. Consequently, extensions of the real-valued backpropagation to the domain of quaternions have led to various QNN architectures [9]. Notably, the development of learning algorithms using the Generalised HR (GHR) calculus has been shown to enable new algorithms that make full use of the structurally rich quaternion algebra and so enhance performance [10, 11, 12].
In the context of recurrent neural networks (RNNs), the real-time recurrent learning (RTRL) algorithm, owing to its online nature, is a preferred choice for real-time applications [13, 14]. It is therefore natural to investigate whether, in conjunction with quaternions, the RTRL would yield similar advantages when tracking non-linear dynamics and adapting to intricate temporal patterns in multidimensional sequences.
Current QNNs are mostly trained with the mean square error (MSE) cost function [15]. Since the MSE measures the average squared difference between the predicted and actual data values, it is sensitive to outliers and is optimal only for normally distributed data. Unlike the MSE, the maximum correntropy criterion (MCC) is a non-quadratic loss function, which employs a nonlinear kernel to measure the similarity between the actual and predicted data [16]. This makes the MCC less sensitive to outliers, and hence suitable for applications with noisy and heavy-tailed data. For quaternion recurrent neural networks (QRNNs), where data are corrupted with multidimensional noise of different channel-wise natures, the MCC promises to offer a robust alternative to the MSE.
This motivates us to introduce a QRNN trained with RTRL and the MCC cost in this setting. The novel GHR calculus is employed for compact derivations of the otherwise very cumbersome gradient expressions in quaternion learning machines. The performance of the proposed network is compared against its counterpart that employs the MSE as a loss function, and against the RNN equipped with RTRL and the MCC or MSE loss, the Quaternion Least Mean Square (QLMS) and the Least Mean Square (LMS) algorithms. The performances are verified in the context of motion prediction of chest internal markers for lung cancer radiotherapy. The approach is general enough to serve as the basis for a whole class of online QNNs.
II Preliminaries
II-A Quaternion algebra
Denote a quaternion, $q$, as

$$q = q_a + iq_b + jq_c + kq_d \qquad (1)$$

where $q_a, q_b, q_c, q_d \in \mathbb{R}$ and the imaginary units $i$, $j$, and $k$ satisfy $i^2 = j^2 = k^2 = ijk = -1$, $ij = -ji = k$, $jk = -kj = i$, $ki = -ik = j$. The set of quaternions is defined as $\mathbb{H} = \{q_a + iq_b + jq_c + kq_d \,:\, q_a, q_b, q_c, q_d \in \mathbb{R}\}$. The multiplication of two quaternions in $\mathbb{H}$ is noncommutative due to the properties of the imaginary units. The real part of $q$ is defined as $\Re(q) = q_a$, whereas the imaginary part is $\Im(q) = iq_b + jq_c + kq_d$. The conjugate of $q$ is $q^* = \Re(q) - \Im(q) = q_a - iq_b - jq_c - kq_d$. The modulus of a quaternion is denoted as $|q| = \sqrt{qq^*} = \sqrt{q_a^2 + q_b^2 + q_c^2 + q_d^2}$. The inverse of a quaternion is $q^{-1} = q^*/|q|^2$. For any quaternion, $q$, and a nonzero quaternion, $\mu$, the transformation [17]

$$q^{\mu} \triangleq \mu q \mu^{-1} \qquad (2)$$

describes a rotation of $q$.
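To make these operations concrete, below is a minimal NumPy sketch (our own illustration, not part of the original formulation), with quaternions stored as 4-arrays $(q_a, q_b, q_c, q_d)$:

```python
import numpy as np

def qmult(p, q):
    """Hamilton product of quaternions p and q, stored as (a, b, c, d)."""
    a, b, c, d = p
    e, f, g, h = q
    return np.array([a*e - b*f - c*g - d*h,
                     a*f + b*e + c*h - d*g,
                     a*g - b*h + c*e + d*f,
                     a*h + b*g - c*f + d*e])

def qconj(q):
    """Conjugate q* = q_a - i q_b - j q_c - k q_d."""
    return np.array([q[0], -q[1], -q[2], -q[3]])

def qinv(q):
    """Inverse q^{-1} = q* / |q|^2 of a nonzero quaternion."""
    return qconj(q) / np.dot(q, q)

def qrotate(q, mu):
    """The transformation q^mu = mu q mu^{-1} of (2)."""
    return qmult(qmult(mu, q), qinv(mu))

# Noncommutativity: ij = k but ji = -k
i, j = np.array([0., 1., 0., 0.]), np.array([0., 0., 1., 0.])
print(qmult(i, j), qmult(j, i))   # [0 0 0 1] and [0 0 0 -1]
```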
II-B The GHR calculus
A quaternion function $f(q) : \mathbb{H} \to \mathbb{H}$, defined as $f(q) = f_a(q_a, q_b, q_c, q_d) + if_b(q_a, q_b, q_c, q_d) + jf_c(q_a, q_b, q_c, q_d) + kf_d(q_a, q_b, q_c, q_d)$, is called real-differentiable if $f_a, f_b, f_c, f_d$ are differentiable as functions of the real variables $q_a, q_b, q_c, q_d$ [18]. If $f$ is real-differentiable, then the left GHR derivative of the function $f$ with respect to $q^{\mu}$ ($\mu \in \mathbb{H}$, $\mu \neq 0$) is

$$\frac{\partial f}{\partial q^{\mu}} = \frac{1}{4}\left(\frac{\partial f}{\partial q_a} - \frac{\partial f}{\partial q_b}i^{\mu} - \frac{\partial f}{\partial q_c}j^{\mu} - \frac{\partial f}{\partial q_d}k^{\mu}\right) \qquad (3)$$

where $i^{\mu} = \mu i \mu^{-1}$, $j^{\mu} = \mu j \mu^{-1}$, $k^{\mu} = \mu k \mu^{-1}$, and $\partial f/\partial q_a$, $\partial f/\partial q_b$, $\partial f/\partial q_c$, $\partial f/\partial q_d$ are the partial derivatives of $f$ with respect to $q_a, q_b, q_c, q_d$. If $f, g : \mathbb{H} \to \mathbb{H}$, the product rule is [10]

$$\frac{\partial (fg)}{\partial q^{\mu}} = f\frac{\partial g}{\partial q^{\mu}} + \frac{\partial f}{\partial q^{g\mu}}\,g \qquad (4)$$

the chain rule is

$$\frac{\partial f(g(q))}{\partial q^{\mu}} = \sum_{\nu \in \{1, i, j, k\}} \frac{\partial f}{\partial g^{\nu}} \frac{\partial g^{\nu}}{\partial q^{\mu}} \qquad (5)$$

and the rotation rule is

$$\left(\frac{\partial f}{\partial q^{\mu}}\right)^{\nu} = \frac{\partial f^{\nu}}{\partial q^{\nu\mu}} \qquad (6)$$
Denote the quaternion gradient of a function $f : \mathbb{H}^{N} \to \mathbb{H}$ as

$$\nabla_{\mathbf{q}} f = \left(\frac{\partial f}{\partial \mathbf{q}^{T}}\right)^{T} \qquad (7)$$

where $\mathbf{q}^{T}$ is the transpose of $\mathbf{q}$ [9]. The quaternion Jacobian matrix of $\mathbf{f} : \mathbb{H}^{N} \to \mathbb{H}^{M}$ is then defined as

$$\frac{\partial \mathbf{f}}{\partial \mathbf{q}^{T}} = \begin{bmatrix} \frac{\partial f_1}{\partial q_1} & \cdots & \frac{\partial f_1}{\partial q_N} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_M}{\partial q_1} & \cdots & \frac{\partial f_M}{\partial q_N} \end{bmatrix} \qquad (8)$$

Note the convention that for two vectors $\mathbf{y} \in \mathbb{H}^{M}$ and $\mathbf{x} \in \mathbb{H}^{N}$, $\partial \mathbf{y}/\partial \mathbf{x}^{T}$ is a matrix for which the $(m, n)$th element is $\partial y_m/\partial x_n$; thus, the dimension of $\partial \mathbf{y}/\partial \mathbf{x}^{T}$ is $M \times N$ [9].
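For a real-valued $f$ and $\mu = 1$, the GHR derivative (3) can be checked numerically. The sketch below (our illustration, reusing the 4-array convention above) confirms the standard result $\partial(qq^*)/\partial q = q^*/2$:

```python
import numpy as np

def ghr_derivative_real_f(f, q, eps=1e-6):
    """Left GHR derivative (3) with mu = 1 of a real-valued f(q),
    built from central-difference partials with respect to the four
    real components of q."""
    partials = np.zeros(4)
    for n in range(4):
        dq = np.zeros(4)
        dq[n] = eps
        partials[n] = (f(q + dq) - f(q - dq)) / (2 * eps)
    # (1/4)(df/dq_a - df/dq_b i - df/dq_c j - df/dq_d k), packed as (a,b,c,d)
    return 0.25 * np.array([partials[0], -partials[1], -partials[2], -partials[3]])

q = np.array([1.0, 2.0, -0.5, 0.3])
numeric = ghr_derivative_real_f(lambda x: np.dot(x, x), q)  # f(q) = q q* = |q|^2
analytic = 0.5 * np.array([q[0], -q[1], -q[2], -q[3]])      # q*/2
print(np.allclose(numeric, analytic))                        # True
```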
II-C Maximum Correntropy criterion
The aim is to form a quaternion function $y(n) = f(x(n))$, based on a sequence $\{x(n), d(n)\}_{n=1}^{N}$, where $x(n)$ is the system input and $d(n)$ is the expected response, both quaternion valued, for a time instant $n$. We denote the expected response as $d(n) = f^{o}(x(n)) + \nu(n)$, where $f^{o}$ is a nonlinear quaternion function to estimate, and $\nu(n)$ is noise with an arbitrary probability density function that does not need to be Gaussian.

When finding $f$, we ideally would like to solve the empirical risk minimization problem, that is, to minimize

$$\frac{1}{N}\sum_{n=1}^{N} \left|d(n) - f(x(n))\right|^{2}, \qquad f \in \mathcal{H} \qquad (9)$$

where $\mathcal{H}$ is a quaternion vector space. Notably, the use of MSE can result in large variations for the weights, or a shift in the output, when the noise distribution contains outlying points, is non-symmetric, or has a nonzero mean.

The correntropy is defined as $V(X, Y) = E[\kappa_{\sigma}(X, Y)]$, where $X$ and $Y$ are quaternion random variables, and $\kappa_{\sigma}(\cdot, \cdot)$ is a symmetric positive-definite quaternion kernel with kernel size $\sigma$ [19]. For real-valued inputs, the Gaussian kernel can be expressed as

$$\kappa_{\sigma}(x, y) = \exp\left(-\frac{(x - y)^{2}}{2\sigma^{2}}\right) \qquad (10)$$

where $x$ and $y$ are real valued [20]. An analogous Gaussian-based kernel for quaternion data may be expressed as

$$\kappa_{\sigma}(q, p) = \exp\left(-\frac{(q - p)(q - p)^{*}}{4\sigma^{2}}\right) = \exp\left(-\frac{|q - p|^{2}}{4\sigma^{2}}\right) \qquad (11)$$

where $q = q_a + iq_b + jq_c + kq_d$ and $p = p_a + ip_b + jp_c + kp_d$ are quaternions [20] (see Fig. 1). The factor of 4 is introduced to normalize the effect of the four components of a quaternion, as opposed to a single real component; in the complex case, a factor of 2 would be included for normalization.
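The bounded nature of this kernel-based similarity, as opposed to the unbounded squared error, can be illustrated with a short sketch (our own example; the normalization constant of the kernel is omitted, as in (11)):

```python
import numpy as np

def quat_gauss_kernel(q, p, sigma):
    """Quaternion Gaussian kernel (11): exp(-|q - p|^2 / (4 sigma^2)),
    with quaternions stored as 4-arrays."""
    d = q - p
    return np.exp(-np.dot(d, d) / (4.0 * sigma**2))

sigma = 1.0
d_true = np.array([1.0, 0.5, -0.2, 0.1])     # desired response
for noise in (0.1, 10.0):                    # inlier vs. gross outlier
    y = d_true + noise                       # corrupted prediction
    e = d_true - y
    mse_term = np.dot(e, e)                               # grows without bound
    mcc_term = 1.0 - quat_gauss_kernel(d_true, y, sigma)  # saturates at 1
    print(f"noise {noise:5.1f}:  MSE-type {mse_term:9.2f}  MCC-type {mcc_term:.4f}")
```

The squared-error term grows quadratically with the outlier amplitude, whereas the correntropy-based loss saturates, which is the mechanism behind the robustness discussed above.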
III Quaternion Recurrent Neural Network with RTRL and Maximum Correntropy Criterion
We next introduce the QRNN equipped with RTRL and MCC; this is achieved based on the GHR calculus. An illustration of a general architecture for QRNN with RTRL is shown in Fig. 2.
III-A Forward pass
Let $h_{n}^{l}(k)$ denote the hidden state of the $n$th neuron in the $l$th layer at time $k$, defined as

$$h_{n}^{l}(k) = \Phi\!\left(s_{n}^{l}(k)\right) \qquad (12)$$

where

$$s_{n}^{l}(k) = \sum_{m} w_{n,m}^{l} \otimes h_{m}^{l-1}(k) + \sum_{m} u_{n,m}^{l} \otimes h_{m}^{l}(k - 1) + b_{n}^{l} \qquad (13)$$

Here, $\mathbf{u}_{n}^{l}$ represents the vector of recurrent quaternion weights, $\Phi(\cdot)$ is the quaternion split activation function (see Appendix VI-B), $h_{m}^{l-1}(k)$ represents the output of the $m$-th neuron in layer $l - 1$ at time $k$, $h_{m}^{l}(k - 1)$ is the hidden state of the $m$-th neuron in layer $l$ at time $k - 1$, and $\otimes$ is the Hamilton product.
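A sketch of this forward pass (our illustration; it reuses `qmult` from the sketch in Section II-A, stores quaternion vectors as arrays of shape (n, 4), and uses the split tanh of Appendix VI-B):

```python
import numpy as np

def qmatvec(W, x):
    """Quaternion matrix-vector product: W has shape (n_out, n_in, 4),
    x has shape (n_in, 4); every entry-wise product is a Hamilton product."""
    out = np.zeros((W.shape[0], 4))
    for n in range(W.shape[0]):
        for m in range(W.shape[1]):
            out[n] += qmult(W[n, m], x[m])   # qmult as in the Sec. II-A sketch
    return out

def qrnn_layer_step(W, U, b, h_below, h_prev):
    """One layer, one time step, as in (12)-(13):
    s = W (x) h^{l-1}(k) + U (x) h^{l}(k-1) + b,  h = split_tanh(s)."""
    s = qmatvec(W, h_below) + qmatvec(U, h_prev) + b
    return np.tanh(s)   # split tanh: real tanh on each quaternion component
```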
III-B Correntropy cost function
Denote the error vector by $\mathbf{e}(k) = \mathbf{d}(k) - \mathbf{y}(k)$, where $\mathbf{y}(k)$ is the network output and $\mathbf{d}(k)$ the desired output. Given the quaternion Gaussian kernel in (11), the correntropy between $\mathbf{y}(k)$ and the desired output $\mathbf{d}(k)$ becomes

$$V\!\left(\mathbf{d}(k), \mathbf{y}(k)\right) = E\!\left[\kappa_{\sigma}\!\left(\mathbf{d}(k), \mathbf{y}(k)\right)\right] \qquad (14)$$

where

$$\kappa_{\sigma}\!\left(\mathbf{d}(k), \mathbf{y}(k)\right) = \exp\left(-\frac{\mathbf{e}^{H}(k)\,\mathbf{e}(k)}{4\sigma^{2}}\right) \qquad (15)$$

Usually, the joint probability density function of $\mathbf{d}$ and $\mathbf{y}$ is unknown and only a finite set of samples is available. To estimate the correntropy, the expectation can be approximated based on the sample mean

$$\hat{V}\!\left(\mathbf{d}, \mathbf{y}\right) = \frac{1}{N}\sum_{k=1}^{N} \exp\left(-\frac{\mathbf{e}^{H}(k)\,\mathbf{e}(k)}{4\sigma^{2}}\right) \qquad (16)$$

The correntropy accounts for all even moments of the error between $\mathbf{d}$ and $\mathbf{y}$, and especially its magnitude. As the kernel size $\sigma$ increases, the higher-order terms diminish, making the second-order term dominant. This facilitates the use of a gradient-based method to maximize correntropy.
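This can be seen from the series expansion of the Gaussian kernel, written here for a scalar error $e(k)$:

$$\exp\left(-\frac{|e(k)|^{2}}{4\sigma^{2}}\right) = \sum_{n=0}^{\infty} \frac{(-1)^{n}}{n!\,(4\sigma^{2})^{n}}\,|e(k)|^{2n} = 1 - \frac{|e(k)|^{2}}{4\sigma^{2}} + \frac{|e(k)|^{4}}{32\sigma^{4}} - \cdots$$

For $|e(k)| \ll \sigma$, the higher-order terms are negligible and the criterion behaves like a scaled squared error, while for large $\sigma$ only the second-order moment effectively remains.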
As we aim to use gradient descent, instead of maximising $\hat{V}$, the objective of the network training is to minimise the loss function $E(k)$, expressed as

$$E(k) = 1 - \exp\left(-\frac{\mathbf{e}^{H}(k)\,\mathbf{e}(k)}{4\sigma^{2}}\right), \qquad k = 1, \ldots, N \qquad (17)$$

where $k$ is the current time index and $N$ is the total number of time steps. Using the correntropy with a Gaussian kernel function as a cost function gives our problem reproducing kernel Hilbert space attributes, which differs from the linear MSE that considers only the Euclidean distance of errors. Essentially, the correntropy represents a weighted sum of error distances, where the kernel function modifies the influence of each error by mapping it from the input space to the reproducing kernel Hilbert space.
Figure 2: General architecture of the QRNN with RTRL.
III-C Backpropagation
We now derive the quaternion real-time recurrent learning algorithm for the QRNN. The quaternion gradient of the correntropy cost function with respect to a weight vector $\mathbf{w}$ can be expressed, following (7), as

$$\nabla_{\mathbf{w}} E(k) = \left(\frac{\partial E(k)}{\partial \mathbf{w}^{T}}\right)^{T} \qquad (18)$$

To find $\partial E(k)/\partial \mathbf{w}^{T}$, we use the GHR chain rule [10] to obtain

$$\frac{\partial E(k)}{\partial \mathbf{w}^{T}} = \frac{\partial E(k)}{\partial u(k)}\,\frac{\partial u(k)}{\partial \mathbf{w}^{T}}, \qquad u(k) = \mathbf{e}^{H}(k)\,\mathbf{e}(k) \qquad (19)$$

The first term can be evaluated as

$$\frac{\partial E(k)}{\partial u(k)} = \frac{1}{4\sigma^{2}} \exp\left(-\frac{u(k)}{4\sigma^{2}}\right) \qquad (20)$$

Upon substituting (20) into the formula for $\partial E(k)/\partial \mathbf{w}^{T}$, we obtain

$$\frac{\partial E(k)}{\partial \mathbf{w}^{T}} = \frac{1}{4\sigma^{2}} \exp\left(-\frac{\mathbf{e}^{H}(k)\,\mathbf{e}(k)}{4\sigma^{2}}\right) \frac{\partial\left(\mathbf{e}^{H}(k)\,\mathbf{e}(k)\right)}{\partial \mathbf{w}^{T}} \qquad (21)$$

Then, upon employing the GHR chain rule [10], the derivative of $\mathbf{e}^{H}(k)\,\mathbf{e}(k)$ can be calculated as

$$\frac{\partial\left(\mathbf{e}^{H}(k)\,\mathbf{e}(k)\right)}{\partial \mathbf{w}^{T}} = \sum_{\nu \in \{1, i, j, k\}} \frac{\partial\left(\mathbf{e}^{H}\mathbf{e}\right)}{\partial \left(\mathbf{e}^{\nu}\right)^{T}} \frac{\partial \mathbf{e}^{\nu}}{\partial \mathbf{w}^{T}} \qquad (22)$$

whereby the GHR product rule gives

$$\frac{\partial\left(\mathbf{e}^{H}\mathbf{e}\right)}{\partial \mathbf{e}^{T}} = \frac{1}{2}\,\mathbf{e}^{H}(k) \qquad (23)$$

Given that $\mathbf{e}(k) = \mathbf{d}(k) - \mathbf{y}(k)$, we obtain

$$\frac{\partial \mathbf{e}(k)}{\partial \mathbf{w}^{T}} = -\frac{\partial \mathbf{y}(k)}{\partial \mathbf{w}^{T}} = -\mathbf{J}_{\mathbf{y}}(k) \qquad (24)$$

where $\mathbf{J}_{\mathbf{y}}(k)$ is the Jacobian matrix of $\mathbf{y}(k)$. Thus,

$$\frac{\partial\left(\mathbf{e}^{H}(k)\,\mathbf{e}(k)\right)}{\partial \mathbf{w}^{T}} = -\frac{1}{2}\,\mathbf{e}^{H}(k)\,\mathbf{J}_{\mathbf{y}}(k) \qquad (25)$$

which allows us to arrive at

$$\frac{\partial E(k)}{\partial \mathbf{w}^{T}} = -\frac{1}{8\sigma^{2}} \exp\left(-\frac{\mathbf{e}^{H}(k)\,\mathbf{e}(k)}{4\sigma^{2}}\right) \mathbf{e}^{H}(k)\,\mathbf{J}_{\mathbf{y}}(k) \qquad (26)$$

This can be simplified into

$$\frac{\partial E(k)}{\partial \mathbf{w}^{T}} = -\lambda(k)\,\mathbf{e}^{H}(k)\,\mathbf{J}_{\mathbf{y}}(k), \qquad \lambda(k) = \frac{1}{8\sigma^{2}} \exp\left(-\frac{\mathbf{e}^{H}(k)\,\mathbf{e}(k)}{4\sigma^{2}}\right) \qquad (27)$$

Finally, to find the quaternion gradient, we take the Hermitian (conjugate transpose) of (27), that is

$$\nabla_{\mathbf{w}^{*}} E(k) = \left(\frac{\partial E(k)}{\partial \mathbf{w}^{T}}\right)^{H} = -\lambda(k)\,\mathbf{J}_{\mathbf{y}}^{H}(k)\,\mathbf{e}(k) \qquad (28)$$
Denote the quaternion error terms at time $k$ as $\boldsymbol{\delta}^{l}(k)$, that is

$$\boldsymbol{\delta}^{L}(k) = \Phi'\!\left(\mathbf{s}^{L}(k)\right) \circ \mathbf{e}(k), \qquad \boldsymbol{\delta}^{l}(k) = \Phi'\!\left(\mathbf{s}^{l}(k)\right) \circ \left[\left(\mathbf{W}^{l+1}\right)^{H} \boldsymbol{\delta}^{l+1}(k) + \left(\mathbf{U}^{l}\right)^{H} \boldsymbol{\delta}^{l}(k-1)\right] \qquad (29)$$

with $\Phi'(\cdot)$ as the derivative of the activation $\Phi(\cdot)$, $\circ$ as the Hadamard product, and $(\mathbf{W}^{l+1})^{H}$ and $(\mathbf{U}^{l})^{H}$ as the Hermitians of the weight matrices for the $(l+1)$th and the $l$th layer, respectively. Then, the weight and bias updates for the output layer are

$$\mathbf{W}^{L}(k+1) = \mathbf{W}^{L}(k) + \eta\,\lambda(k)\,\boldsymbol{\delta}^{L}(k)\left(\mathbf{h}^{L-1}(k)\right)^{H} \qquad (30)$$

$$\mathbf{b}^{L}(k+1) = \mathbf{b}^{L}(k) + \eta\,\lambda(k)\,\boldsymbol{\delta}^{L}(k) \qquad (31)$$

The weight and bias updates for the hidden layers are

$$\mathbf{W}^{l}(k+1) = \mathbf{W}^{l}(k) + \eta\,\lambda(k)\,\boldsymbol{\delta}^{l}(k)\left(\mathbf{h}^{l-1}(k)\right)^{H} \qquad (32)$$

$$\mathbf{U}^{l}(k+1) = \mathbf{U}^{l}(k) + \eta\,\lambda(k)\,\boldsymbol{\delta}^{l}(k)\left(\mathbf{h}^{l}(k-1)\right)^{H} \qquad (33)$$

$$\mathbf{b}^{l}(k+1) = \mathbf{b}^{l}(k) + \eta\,\lambda(k)\,\boldsymbol{\delta}^{l}(k) \qquad (34)$$

Here, $\eta$ is the learning rate, and $\mathbf{U}^{l}$ represents the matrix of recurrent quaternion weights for layer $l$.
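The practical difference from the MSE update of Appendix VI-A is the data-dependent scalar $\lambda(k)$, which multiplies every update. A schematic sketch (our illustration, with real arrays standing in for the quaternion quantities and names following (29)-(34)):

```python
import numpy as np

def kernel_weight(e, sigma):
    """lambda(k) in (27): for small errors it is roughly constant
    (MSE-like behaviour); for outlier errors it is exponentially
    small, suppressing the corresponding weight update."""
    return np.exp(-np.dot(e, e) / (4.0 * sigma**2)) / (8.0 * sigma**2)

def mcc_weight_update(W, eta, sigma, e, delta, h_in):
    """Sketch of (30)/(32): W <- W + eta * lambda(k) * delta h^H,
    with the quaternion outer product replaced by a real one."""
    return W + eta * kernel_weight(e, sigma) * np.outer(delta, h_in)
```

Replacing `kernel_weight` by a constant recovers the MSE-trained updates of Appendix VI-A.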
IV Simulation results
TABLE I: Prediction performance (RMSE, nRMSE, MAE, and jitter, with 95% confidence intervals) of QRNN RTRL w/ MCC, QRNN RTRL w/ MSE, RNN RTRL w/ MCC, RNN RTRL w/ MSE, QLMS, and LMS, reported over all sequences, regular breathing, and irregular breathing.
IV-A Motion prediction of chest internal points for lung cancer radiotherapy
In the context of lung cancer radiotherapy, tracking the position of infrared-reflective markers on the chest is a method used to approximate the location of the tumour. Nonetheless, the precision of radiation delivery in radiotherapy systems is constrained by the inherent latency due to limitations in robotic control. Compensating for delays is crucial to reduce the damage to healthy tissues. Employing an RNN with online learning can facilitate predictions that adapt to the dynamic nature of respiratory signals, potentially mitigating this issue [21]. Unlike offline methods, which do not change after initial training, online methods continually update the synaptic weights of the network with each new data input. This continuous updating allows the neural network to adjust to the patient’s evolving breathing patterns, offering improved resilience against complex movements. Online learning is particularly advantageous in medical settings, where acquiring extensive training datasets can be challenging. It provides a way to adapt to data not included in the initial training set. Adaptive or dynamic learning techniques have been frequently utilized in radiotherapy, and studies have shown the effectiveness of these approaches compared to static models [22, 23, 24]. One such dynamic method is RTRL [13], which has been applied in systems such as the Cyberknife Synchrony system [24] and the SyncTraX system [25]. It has also been used to track the position of internal points within the chest [21], demonstrating its broad applicability in medical technology.
While prior studies in the field of respiratory motion prediction primarily concentrated on univariate signal analysis, our simulation focuses on the prediction of 3D marker position, represented as pure quaternions, setting a prediction horizon of two seconds. This is achieved through the use of a QRNN trained with RTRL. We compare its effectiveness, when associated with the MCC loss, against other forecasting algorithms. Furthermore, we conducted predictions for all three markers concurrently, enabling the QRNN to further identify and leverage the interrelationships in their movements.
IV-B 3D marker position data
The data used for this study were collected from nine instances of 3D positional tracking of three external markers placed on the chest and abdomen of subjects resting on a HexaPOD treatment couch. The tracking was accomplished using an NDI Polaris infrared camera. Each recorded sequence varied in length, ranging from 73 to 320 seconds, and was sampled at a frequency of 10 Hz. The movement trajectories of the markers were recorded in three dimensions: superior-inferior (ranging from 6 mm to 40 mm), left-right (2 mm to 10 mm), and antero-posterior (18 mm to 45 mm). Out of these recordings, five depict normal breathing patterns, while the remaining four capture subjects engaging in activities such as talking or laughing. Further specifics about the public dataset are elaborated in [26, 27]. The recorded sequences were accordingly divided into two groups, regular and irregular breathing, to assess the robustness of the algorithms.
IV-C Algorithms & Training
We compared the performance of the QRNN equipped with RTRL and the MCC loss function (Sec. III) against its counterpart that employs the MSE loss function (see Appendix VI-A). The performances of the standard RNN with RTRL and the MSE loss as described in [21], or the MCC loss with the real-valued kernel in (10), as well as the Quaternion Least Mean Square (QLMS) algorithm [28] and the real-valued Least Mean Square (LMS) algorithm, were also included in the comparison of the prediction methods. Note that the predictions for all three markers were conducted concurrently for the standard RNNs with RTRL. The QRNNs and RNNs had one hidden layer. The activation function was the hyperbolic tangent for the RNNs, and the split hyperbolic tangent for the QRNNs (see Appendix VI-B). The cross-validation metric was the RMSE. Cross-validation with grid search was performed over the learning rate of the QRNNs and RNNs with RTRL, the number of hidden units of the QRNNs and RNNs, the step size of QLMS and LMS, and, for all prediction methods, the regressor length, defined as the number of past time steps used as input. There were 50 runs successively performed for cross-validation, and 300 runs for evaluation. Updating the QRNNs and RNNs with the gradient rule may induce numerical instability; the gradient norm was therefore clipped to avoid large weight updates (a sketch is given below), while the optimization method was stochastic gradient descent. The training and validation of these models were conducted on the first minute of each recorded sequence, with 30 seconds for each. The four error types used to evaluate the prediction performance, presented in Table I, are implemented following [27].
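The gradient-norm clipping mentioned above can be sketched as follows (the threshold `max_norm` is a tuned hyper-parameter, not a value reported here):

```python
import numpy as np

def clip_gradient(grad, max_norm):
    """Rescale the gradient if its norm exceeds max_norm, leaving its
    direction unchanged; this avoids destabilising weight updates."""
    norm = np.linalg.norm(grad)
    return grad if norm <= max_norm else grad * (max_norm / norm)
```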
IV-D Prediction performance
The QRNN equipped with RTRL and the MCC loss produced the lowest mean absolute error (MAE), RMSE, and nRMSE averaged over all the sequences, as well as in the context of irregular breathing (Table I). The QRNN with RTRL and the MSE loss performed slightly better than its counterpart with the MCC loss for predicting sequences of regular breathing, in terms of the RMSE and nRMSE metrics. The robustness of the QRNN with RTRL and the MCC loss against irregular movements was further confirmed as its nRMSE exhibited an increase of 8.2% between regular and irregular breathing patterns, which is the smallest among the algorithms assessed. More generally, the QRNNs with RTRL displayed consistently better performance for forecasting the breathing sequences than the standard RNNs with RTRL, QLMS, and LMS algorithms. The compact 95% confidence intervals accompanying the performance metrics presented in Table I suggest that choosing 300 test runs was adequate to yield precise results. In addition, the jitter was evaluated. It refers to the fluctuation or oscillation amplitude in the predicted signal, and its extent can significantly affect robotic control during treatment. The QRNN with RTRL and MSE displayed a slightly lower jitter overall than its counterpart with the MCC loss. The QRNN with RTRL and MCC exhibited the lowest jitter in the context of irregular breathing, significantly outperforming the other algorithms, including the RNN with RTRL and MCC (0.77 mm against 0.88 mm) and the QLMS and LMS algorithms (0.77 mm against 1.73 mm).
IV-E Discussion of simulation limits
The simulation has certain limitations, particularly regarding the quantity and length of the sequences utilized, which are relatively modest. Nonetheless, the dataset used encompasses a broad spectrum of respiratory patterns, including various shifts, drifts, slow and abrupt irregular movements, and both active and calm breathing states. A notable aspect of the QRNNs with RTRL and the MCC or MSE loss function we examined is their ability to perform online learning, which does not require extensive pre-existing data to make precise predictions. This is evident from the high accuracy achieved with just one minute of training data. Based on these factors, we believe our results could be extrapolated to larger datasets. Another strength of the simulations is the open-source dataset we used, which promotes reproducibility of our findings. This contrasts with many prior studies in this field, which often rely on private datasets, complicating comparative analyses. We also addressed challenging scenarios like laughing and talking, which are generally controlled in clinical settings. Analyzing performance under such conditions provides insights into other less predictable events that may occur during treatment, such as yawning, hiccupping, or coughing. The current clinical practice is to halt radiation when such anomalies are detected. By differentiating between regular and irregular breathing patterns, we could objectively assess and measure the resilience of the compared algorithms. Given that almost half of our dataset consisted of irregular breathing sequences, the average numerical error metrics across all nine sequences might be higher than what would be encountered in more standard scenarios. Moreover, the QRNN models do take roughly 50% longer to train than the RNN models (Table II). The QRNN with RTRL and MCC demonstrated an average computation time per time step of 184.5 ms. Note that this computation time is less than the approximate marker position sampling interval, which is around 400 ms.
TABLE II: Calculation time per time step.

Prediction algorithm | Calculation time per time step (ms)
---|---
QRNN with RTRL and MCC | 184.5
QRNN with RTRL and MSE | 182.4
RNN with RTRL and MCC | 116.1
RNN with RTRL and MSE | 115.7
QLMS | 0.507
LMS | 0.313
V Conclusion
The incorporation of the quaternion algebra into RNNs offers an efficient way to capture, and hence make use of, the multidimensional structures present in 3D and 4D data. When combined with RTRL, the QRNN model has been demonstrated to adapt and learn intricate multidimensional temporal patterns in real time. While both MSE and MCC are viable cost functions, the choice depends on the specific application and the nature of the data. Simulations in the context of motion prediction of chest internal points for safer lung cancer radiotherapy have shown that the kernel-based similarity within the MCC helps the QRNN to be consistently more robust to outliers in data. The successful training of QRNNs with RTRL, using just one minute of respiratory data per sequence, has moreover demonstrated the efficacy of dynamic training even with constrained data availability.
Acknowledgement
Pauline Bourigault was supported by the UKRI CDT in AI for Healthcare http://ai4health.io (Grant No. EP/S023283/1).
References
- [1] T. Isokawa, T. Kusakabe, N. Matsui, and F. Peper, Quaternion Neural Network and Its Application. Springer Berlin, 2003, pp. 318–324.
- [2] T. Parcollet, M. Morchid, and G. Linarès, “A survey of quaternion neural networks,” Artificial Intelligence Review, vol. 53, no. 4, pp. 2957–2982, 2019.
- [3] X. Zhu, Y. Xu, H. Xu, and C. Chen, “Quaternion convolutional neural networks,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018.
- [4] S. Zhang, Y. Tay, L. Yao, and Q. Liu, “Quaternion knowledge graph embeddings,” in Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), 2019.
- [5] C. J. Gaudet and A. S. Maida, “Deep quaternion networks,” in Proceedings of the International Joint Conference on Neural Networks (IJCNN), 2018, pp. 1–8.
- [6] T. Parcollet, M. Morchid, and G. Linares, “Quaternion convolutional neural networks for heterogeneous image processing,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 8514–8518.
- [7] K. Takahashi, E. Tano, and M. Hashimoto, “Feedforward–feedback controller based on a trained quaternion neural network using a generalised HR calculus with application to trajectory control of a three-link robot manipulator,” Machines, vol. 10, no. 5, p. 333, 2022.
- [8] D. Comminiello, M. Lella, S. Scardapane, and A. Uncini, “Quaternion convolutional neural networks for detection and localization of 3D sound events,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 8533–8537.
- [9] D. Xu, Y. Xia, and D. P. Mandic, “Optimization in quaternion dynamic systems: Gradient, Hessian, and learning algorithms,” IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 2, pp. 249–261, 2016.
- [10] D. Xu, C. Jahanchahi, C. C. Took, and D. P. Mandic, “Enabling quaternion derivatives: The generalized HR calculus,” Royal Society Open Science, vol. 2, no. 8, p. 150255, 2015.
- [11] D. Xu, L. Zhang, and H. Zhang, “Learning algorithms in quaternion neural networks using GHR calculus,” Neural Network World, vol. 27, no. 3, pp. 271–282, 2017.
- [12] C. Popa, “Learning algorithms for quaternion-valued neural networks,” Neural Processing Letters, vol. 47, no. 3, pp. 949–973, 2017.
- [13] R. J. Williams and D. Zipser, “A learning algorithm for continually running fully recurrent neural networks,” Neural Computation, vol. 1, no. 2, pp. 270–280, 1989.
- [14] D. P. Mandic and J. A. Chambers, Recurrent Neural Networks for Prediction. Wiley, 2001.
- [15] T. Parcollet, M. Ravanelli, M. Morchid, G. Linarès, C. Trabelsi, R. De Mori, and Y. Bengio, “Quaternion recurrent neural networks,” in Proceedings of the International Conference on Learning Representations (ICLR), 2019.
- [16] I. Santamaria, P. P. Pokharel, and J. C. Principe, “Generalized correlation function: Definition, properties, and application to blind equalization,” IEEE Transactions on Signal Processing, vol. 54, no. 6, pp. 2187–2197, 2006.
- [17] J. P. Ward, Quaternions and Cayley Numbers. Springer Netherlands, 1997.
- [18] A. Sudbery, “Quaternionic analysis,” Mathematical Proceedings of the Cambridge Philosophical Society, vol. 85, no. 2, pp. 199–225, 1979.
- [19] W. Liu, P. P. Pokharel, and J. C. Principe, “Correntropy: A localized similarity measure,” in Proceedings of the International Joint Conference on Neural Networks (IJCNN), 2006, pp. 4919–4924.
- [20] T. Ogunfunmi and T. Paul, “The quaternion maximum correntropy algorithm,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 62, no. 6, pp. 598–602, 2015.
- [21] M. Pohl, M. Uesaka, K. Demachi, and R. B. Chhatkuli, “Prediction of the motion of chest internal points using a recurrent neural network trained with real-time recurrent learning for latency compensation in lung cancer radiotherapy,” Computerized Medical Imaging and Graphics, vol. 91, p. 101941, 2022.
- [22] A. Krauss, S. Nill, and U. Oelfke, “The comparative performance of four respiratory motion predictors for real-time tumour tracking,” Physics in Medicine and Biology, vol. 56, no. 16, pp. 5303–5317, 2011.
- [23] T. P. Teo, S. B. Ahmed, P. Kawalec, N. Alayoubi, N. Bruce, E. Lyn, and S. Pistorius, “Feasibility of predicting tumor motion using online data acquired during treatment and a generalized neural network optimized with offline patient tumor trajectories,” Medical Physics, vol. 45, no. 2, pp. 830–845, 2018.
- [24] M. Mafi and S. M. Moghadam, “Real-time prediction of tumor motion using a dynamic neural network,” Medical & Biological Engineering & Computing, vol. 58, no. 3, pp. 529–539, 2020.
- [25] K. Jiang, F. Fujii, and T. Shiinoki, “Prediction of lung tumor motion using nonlinear autoregressive model with exogenous input,” Physics in Medicine & Biology, vol. 64, no. 21, p. 21NT02, 2019.
- [26] T. Krilavicius, I. Zliobaite, H. Simonavicius, and L. Jarusevicius, “Predicting respiratory motion for real-time tumour tracking in radiotherapy,” in Proceedings of the IEEE 29th International Symposium on Computer-Based Medical Systems (CBMS), 2016.
- [27] M. Pohl, M. Uesaka, H. Takahashi, K. Demachi, and R. Bhusal Chhatkuli, “Prediction of the position of external markers using a recurrent neural network trained with unbiased online recurrent optimization for safe lung cancer radiotherapy,” Computer Methods and Programs in Biomedicine, vol. 222, p. 106908, 2022.
- [28] C. Took and D. Mandic, “The Quaternion LMS algorithm for adaptive filtering of hypercomplex processes,” IEEE Transactions on Signal Processing, vol. 57, no. 4, pp. 1316–1327, 2009.
- [29] N. Benvenuto and F. Piazza, “On the complex backpropagation algorithm,” IEEE Transactions on Signal Processing, vol. 40, no. 4, pp. 967–969, 1992.
VI Appendices
VI-A Quaternion RNN with RTRL and Mean Squared Error
VI-A1 Forward pass
The forward pass is defined as in (12)–(13) of Section III-A.
VI-A2 Loss function
The network error at the processing layer is defined by $\mathbf{e}(k) = \mathbf{d}(k) - \mathbf{y}(k)$, where $\mathbf{d}(k)$ denotes the desired output. The objective of the network training is to minimize a real-valued mean squared error (MSE) loss function [9], that is

$$E(k) = \frac{1}{2}\,\mathbf{e}^{H}(k)\,\mathbf{e}(k) \qquad (35)$$
VI-A3 Backpropagation
We now derive the quaternion RTRL algorithm for the QRNN trained with the MSE loss. The quaternion gradient of the error function can be expressed as

$$\nabla_{\mathbf{w}^{*}} E(k) = -\frac{1}{2}\,\mathbf{J}_{\mathbf{y}}^{H}(k)\,\mathbf{e}(k) \qquad (36)$$

where $\mathbf{J}_{\mathbf{y}}(k)$ is the Jacobian matrix of $\mathbf{y}(k)$ [9]. Denote the quaternion error terms in the recurrent setting at time $k$ as $\boldsymbol{\delta}^{l}(k)$. Recall that $\boldsymbol{\delta}^{l}(k)$ is essentially the local error term that contributes to the gradient at layer $l$. These error terms can be calculated recursively as

$$\boldsymbol{\delta}^{L}(k) = \Phi'\!\left(\mathbf{s}^{L}(k)\right) \circ \mathbf{e}(k), \qquad \boldsymbol{\delta}^{l}(k) = \Phi'\!\left(\mathbf{s}^{l}(k)\right) \circ \left[\left(\mathbf{W}^{l+1}\right)^{H} \boldsymbol{\delta}^{l+1}(k) + \left(\mathbf{U}^{l}\right)^{H} \boldsymbol{\delta}^{l}(k-1)\right], \quad k = 1, \ldots, N \qquad (37)$$

where $\circ$ is the element-wise (Hadamard) product, and $N$ is the total number of time steps. The weight and bias update rules for the output layer are

$$\mathbf{W}^{L}(k+1) = \mathbf{W}^{L}(k) + \eta\,\boldsymbol{\delta}^{L}(k)\left(\mathbf{h}^{L-1}(k)\right)^{H} \qquad (38)$$

$$\mathbf{b}^{L}(k+1) = \mathbf{b}^{L}(k) + \eta\,\boldsymbol{\delta}^{L}(k) \qquad (39)$$

The weight and bias update rules for the hidden layers are

$$\mathbf{W}^{l}(k+1) = \mathbf{W}^{l}(k) + \eta\,\boldsymbol{\delta}^{l}(k)\left(\mathbf{h}^{l-1}(k)\right)^{H} \qquad (40)$$

$$\mathbf{U}^{l}(k+1) = \mathbf{U}^{l}(k) + \eta\,\boldsymbol{\delta}^{l}(k)\left(\mathbf{h}^{l}(k-1)\right)^{H} \qquad (41)$$

$$\mathbf{b}^{l}(k+1) = \mathbf{b}^{l}(k) + \eta\,\boldsymbol{\delta}^{l}(k) \qquad (42)$$

Here, $\eta$ is the learning rate, $\mathbf{U}^{l}$ represents the matrix of recurrent quaternion weights for layer $l$, and $(\cdot)^{H}$ represents the Hermitian transpose.
VI-B Split activation function
Denote the split activation function as

$$\Phi(q) = \phi(q_a) + i\,\phi(q_b) + j\,\phi(q_c) + k\,\phi(q_d) \qquad (43)$$

with $\phi(\cdot)$ representing any real-valued activation function. To perform backpropagation using split activation functions, [29] proposed a “pseudo-gradient” update wherein the gradient is computed component-wise. Accordingly, the “pseudo-derivative” of $\Phi$ in (43), written as $\Phi'$, is given by

$$\Phi'(q) = \phi'(q_a) + i\,\phi'(q_b) + j\,\phi'(q_c) + k\,\phi'(q_d) \qquad (44)$$

On the other hand, the compact GHR derivative of any split activation function is defined as [9]

$$\frac{\partial \Phi(q)}{\partial q} = \frac{1}{4}\left(\phi'(q_a) + \phi'(q_b) + \phi'(q_c) + \phi'(q_d)\right) \qquad (45)$$
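In code, the split tanh and its component-wise pseudo-derivative take a particularly simple form for quaternions stored as 4-arrays (a sketch; the function names are ours):

```python
import numpy as np

def split_tanh(q):
    """Split activation (43): the real tanh applied to each component."""
    return np.tanh(q)

def split_tanh_pseudo_derivative(q):
    """Pseudo-derivative (44): tanh'(x) = 1 - tanh(x)^2, component-wise."""
    return 1.0 - np.tanh(q)**2

def split_tanh_ghr_derivative(q):
    """Compact GHR derivative (45): the real-valued average of the
    component derivatives."""
    return 0.25 * np.sum(1.0 - np.tanh(q)**2)
```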