
Quaternion Recurrent Neural Network
with Real-Time Recurrent Learning
and Maximum Correntropy Criterion

Pauline Bourigault Department of Computing
Imperial College London
London, United Kingdom
[email protected]
   Dongpo Xu School of Mathematics and Statistics
Northeast Normal University
Changchun, P.R. China
[email protected]
   Danilo P. Mandic Department of Electrical and
Electronic Engineering
Imperial College London
London, United Kingdom
[email protected]
Abstract

We develop a robust quaternion recurrent neural network (QRNN) for real-time processing of 3D and 4D data with outliers. This is achieved by combining the real-time recurrent learning (RTRL) algorithm and the maximum correntropy criterion (MCC) as a loss function. While both the mean square error and maximum correntropy criterion are viable cost functions, it is shown that the non-quadratic maximum correntropy loss function is less sensitive to outliers, making it suitable for applications with multidimensional noisy or uncertain data. Both algorithms are derived based on the novel generalised HR (GHR) calculus, which allows for the differentiation of real functions of quaternion variables and offers the product and chain rules, thus enabling elegant and compact derivations. Simulation results in the context of motion prediction of chest internal markers for lung cancer radiotherapy, which includes regular and irregular breathing sequences, support the analysis.

Index Terms:
quaternion recurrent neural network, real-time recurrent learning, maximum correntropy criterion, generalised HR calculus, motion prediction

I Introduction

Recently, the incorporation of quaternion algebra into neural network architectures has paved the way for enhanced performance and robustness, particularly in contexts where data are naturally multidimensional [1, 2, 3, 4]. Quaternion neural networks (QNNs) leverage the inherent multidimensional nature of quaternions to build models that are more compact than their quadrivariate real-valued counterparts, thus improving parameter efficiency and potentially capturing intricate patterns in data [5, 6, 7, 8]. Consequently, extensions of real-valued backpropagation to the domain of quaternions have led to various QNN architectures [9]. Notably, learning algorithms based on the generalised HR (GHR) calculus have been shown to make full use of the structurally rich quaternion algebra and so enhance performance [10, 11, 12].

In the context of recurrent neural networks (RNNs), the real-time recurrent learning (RTRL) algorithm, owing to its online nature, is a preferred choice for real-time applications [13, 14]. It is therefore natural to investigate whether, in conjunction with quaternions, the RTRL would yield similar advantages when tracking non-linear dynamics and adapting to intricate temporal patterns in multidimensional sequences.

Current QNNs are mostly trained with the mean square error (MSE) cost function [15]. Since the MSE measures the average squared difference between the predicted and actual data values, it is sensitive to outliers and is optimal only for normally distributed data. Unlike the MSE, the maximum correntropy criterion (MCC) is a non-quadratic loss function which employs a nonlinear kernel to measure the similarity between the actual and predicted data [16]. This makes the MCC less sensitive to outliers, and hence suitable for applications with noisy and heavy-tailed data. For quaternion recurrent neural networks (QRNNs), where data are corrupted with multidimensional noise of different channel-wise natures, the MCC promises to offer a robust alternative to the MSE.

This motivates us to introduce a QRNN trained with RTRL and the MCC cost. The novel GHR calculus is employed for compact derivations of the otherwise very cumbersome gradient expressions in quaternion learning machines. The performance of the proposed network is compared against its counterpart that employs the MSE as a loss function, against the RNN equipped with RTRL and either the MCC or MSE loss, and against the Quaternion Least Mean Square (QLMS) and Least Mean Square (LMS) algorithms. The performances are verified in the context of motion prediction of chest internal markers for lung cancer radiotherapy. The approach is general enough to serve as a basis for a whole class of online QNNs.

II Preliminaries

II-A Quaternion algebra

Denote a quaternion, q, as

q = q_a + iq_b + jq_c + kq_d\,,   (1)

where q_a, q_b, q_c, q_d \in \mathbb{R} and the imaginary units i, j, and k satisfy i^2 = j^2 = k^2 = ijk = -1, ij = -ji = k, jk = -kj = i, ki = -ik = j. The set of quaternions is defined as \mathbb{H} \triangleq \{q = q_a + iq_b + jq_c + kq_d \;|\; q_a, q_b, q_c, q_d \in \mathbb{R}\}. The multiplication of two quaternions in \mathbb{H} is noncommutative due to the properties of the imaginary units. The real part of q is defined as \operatorname{Re}\{q\} = q_a, whereas the imaginary part is \operatorname{Im}\{q\} = iq_b + jq_c + kq_d. The conjugate of q is q^* = \operatorname{Re}\{q\} - \operatorname{Im}\{q\} = q_a - iq_b - jq_c - kq_d. The modulus of a quaternion is |q| = \sqrt{qq^*}. The inverse of a quaternion q \neq 0 is q^{-1} = q^*/|q|^2. For any quaternion q and a nonzero quaternion \mu, the transformation [17]

q^{\mu} \triangleq \mu q \mu^{-1}   (2)

describes a rotation of q.
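For readers who prefer code, the following minimal Python/NumPy sketch (our own illustration, not part of the original derivations; all function names are ours) implements the Hamilton product, conjugate, modulus, inverse, and the rotation q^{\mu} = \mu q \mu^{-1} of (2) directly from the component definitions above.

```python
import numpy as np

def qmul(p, q):
    """Hamilton product of quaternions given as arrays [a, b, c, d]."""
    pa, pb, pc, pd = p
    qa, qb, qc, qd = q
    return np.array([
        pa*qa - pb*qb - pc*qc - pd*qd,
        pa*qb + pb*qa + pc*qd - pd*qc,
        pa*qc - pb*qd + pc*qa + pd*qb,
        pa*qd + pb*qc - pc*qb + pd*qa,
    ])

def qconj(q):
    return np.array([q[0], -q[1], -q[2], -q[3]])

def qnorm(q):
    return np.sqrt(np.sum(q**2))          # |q| = sqrt(q q*)

def qinv(q):
    return qconj(q) / qnorm(q)**2         # q^{-1} = q* / |q|^2

def qrot(q, mu):
    """The rotation q^mu = mu q mu^{-1} of (2)."""
    return qmul(qmul(mu, q), qinv(mu))

# Noncommutativity check: p q != q p in general
p = np.array([1.0, 2.0, 3.0, 4.0])
q = np.array([0.5, -1.0, 0.0, 2.0])
print(qmul(p, q), qmul(q, p))
```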

II-B The GHR calculus

A quaternion function f: \mathbb{H} \rightarrow \mathbb{H}, defined as f(q) = f_a(q_a, q_b, q_c, q_d) + i f_b(q_a, q_b, q_c, q_d) + j f_c(q_a, q_b, q_c, q_d) + k f_d(q_a, q_b, q_c, q_d), is called real-differentiable if f_a, f_b, f_c, f_d are differentiable as functions of the real variables q_a, q_b, q_c, q_d [18]. If f: \mathbb{H} \rightarrow \mathbb{H} is real-differentiable, then the left GHR derivative of the function f with respect to q^{\mu} (\mu \neq 0, \mu \in \mathbb{H}) is

\frac{\partial f}{\partial q^{\mu}} = \frac{1}{4}\left(\frac{\partial f}{\partial q_a} - \frac{\partial f}{\partial q_b}\,i^{\mu} - \frac{\partial f}{\partial q_c}\,j^{\mu} - \frac{\partial f}{\partial q_d}\,k^{\mu}\right) \in \mathbb{H}\,,   (3)

where q = q_a + iq_b + jq_c + kq_d with q_a, q_b, q_c, q_d \in \mathbb{R}, and \frac{\partial f}{\partial q_a}, \frac{\partial f}{\partial q_b}, \frac{\partial f}{\partial q_c}, \frac{\partial f}{\partial q_d} \in \mathbb{H} are the partial derivatives of f with respect to q_a, q_b, q_c, q_d. If f: \mathbb{H} \rightarrow \mathbb{H} and g: \mathbb{H} \rightarrow \mathbb{H}, the product rule is [10]

\frac{\partial (fg)}{\partial q^{\mu}} = f\,\frac{\partial g}{\partial q^{\mu}} + \frac{\partial f}{\partial q^{g\mu}}\,g\,,   (4)

the chain rule is

\frac{\partial f(g(q))}{\partial q^{\mu}} = \sum_{\upsilon\in\{1,i,j,k\}} \frac{\partial f}{\partial g^{\upsilon}}\,\frac{\partial g^{\upsilon}}{\partial q^{\mu}}\,,   (5)

and the rotation rule is

\left(\frac{\partial f}{\partial q^{\mu}}\right)^{\upsilon} = \frac{\partial f^{\upsilon}}{\partial q^{\upsilon\mu}}\,.   (6)

Denote the quaternion gradient of a function f: \mathbb{H}^{n} \rightarrow \mathbb{R} as

\nabla_{\mathbf{q}} f \triangleq \left(\frac{\partial f}{\partial \mathbf{q}}\right)^{T} = \left(\frac{\partial f}{\partial q_1}, \cdots, \frac{\partial f}{\partial q_n}\right)^{T} \in \mathbb{H}^{n}\,,   (7)

where \left(\frac{\partial f}{\partial \mathbf{q}}\right)^{T} is the transpose of \frac{\partial f}{\partial \mathbf{q}} [9]. The quaternion Jacobian matrix of \mathbf{f}: \mathbb{H}^{N\times 1} \rightarrow \mathbb{H}^{M\times 1} is then defined as

\frac{\partial \mathbf{f}}{\partial \mathbf{q}} = \begin{pmatrix} \frac{\partial f_1}{\partial q_1} & \cdots & \frac{\partial f_1}{\partial q_N} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_M}{\partial q_1} & \cdots & \frac{\partial f_M}{\partial q_N} \end{pmatrix}.   (8)

Note the convention that, for two vectors \mathbf{f} \in \mathbb{H}^{M\times 1} and \mathbf{q} \in \mathbb{H}^{N\times 1}, \frac{\partial \mathbf{f}}{\partial \mathbf{q}} is a matrix whose (m,n)th element is \partial f_m / \partial q_n; thus, the dimension of \frac{\partial \mathbf{f}}{\partial \mathbf{q}} is M \times N [9].
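As a numerical sanity check on definition (3), the sketch below (our own illustration; the name ghr_derivative is ours) evaluates the left GHR derivative for \mu = 1 by central differences of the four real partial derivatives, and recovers the known GHR results \partial q / \partial q = 1 and \partial q^{*} / \partial q = -1/2 [10].

```python
import numpy as np

def qmul(p, q):
    # Hamilton product (as in the earlier sketch), quaternions as [a, b, c, d]
    pa, pb, pc, pd = p; qa, qb, qc, qd = q
    return np.array([pa*qa - pb*qb - pc*qc - pd*qd,
                     pa*qb + pb*qa + pc*qd - pd*qc,
                     pa*qc - pb*qd + pc*qa + pd*qb,
                     pa*qd + pb*qc - pc*qb + pd*qa])

I, J, K = np.array([0., 1, 0, 0]), np.array([0., 0, 1, 0]), np.array([0., 0, 0, 1])

def ghr_derivative(f, q, eps=1e-6):
    """Left GHR derivative of f: H -> H at q, for mu = 1, via central differences
    of the four real partials entering definition (3)."""
    partials = []
    for n in range(4):
        dq = np.zeros(4); dq[n] = eps
        partials.append((f(q + dq) - f(q - dq)) / (2 * eps))
    d_a, d_b, d_c, d_d = partials
    return 0.25 * (d_a - qmul(d_b, I) - qmul(d_c, J) - qmul(d_d, K))

q = np.array([0.3, -1.2, 0.7, 2.0])
print(ghr_derivative(lambda x: x, q))                            # identity:  expect [1, 0, 0, 0]
print(ghr_derivative(lambda x: x * np.array([1., -1, -1, -1]), q))  # conjugate: expect [-0.5, 0, 0, 0]
```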

II-C Maximum Correntropy criterion

Figure 1: Mean square error (MSE) (left) and correntropy (right) in the 3D space where each point represents a pure quaternion. We assume that the true quaternion is (0, 0i, 0j, 0k) for simplicity.

The aim is to form a quaternion function z, based on a sequence (\mathbf{s}_1, d_1), (\mathbf{s}_2, d_2), \dots, (\mathbf{s}_N, d_N), where \mathbf{s}_p is the system input and d_p is the expected response, both quaternion-valued, at time instant p. We denote the expected response as d_p = z_{opt}(\mathbf{s}_p) + \xi_p, where z_{opt} is a nonlinear quaternion function to estimate, and \xi_p is noise with an arbitrary probability density function that need not be Gaussian.

When finding zz, we ideally would like to solve the empirical risk minimization problem, that is, to minimize

\mathrm{R}_{emp}[z \in Q, \mathbf{Z}_N] = \sum_{p=1}^{N} \left| z_{opt}(\mathbf{s}_p) - z(\mathbf{s}_p) \right|^{2}   (9)

where Q is a quaternion vector space. Notably, the use of the MSE can result in large variations in the weights, or a shift in the output, when the noise distribution contains outlying points, is non-symmetric, or has a nonzero mean.

The correntropy is defined as V_{\sigma}(X, Y) = E\left[\kappa_{\sigma}(X - Y)\right], where X and Y are quaternion random variables, and \kappa_{\sigma}(\cdot) is a symmetric positive-definite quaternion kernel with kernel size \sigma [19]. For real-valued inputs, the Gaussian kernel can be expressed as

\kappa_{real,\sigma}(X - Y) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{1}{2\sigma^{2}}(X - Y)^{2}\right]   (10)

where X and Y are real-valued [20]. An analogous Gaussian-based kernel for quaternion data may be expressed as

\kappa_{\sigma}(X - Y) = \frac{4}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{1}{2\sigma^{2}}\left[(X_r - Y_r)^{2} + (X_i - Y_i)^{2} + (X_j - Y_j)^{2} + (X_k - Y_k)^{2}\right]\right] = \frac{4}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{1}{2\sigma^{2}}\left|X - Y\right|^{2}\right]   (11)

where X and Y are the quaternions X = X_r + iX_i + jX_j + kX_k and Y = Y_r + iY_i + jY_j + kY_k [20] (see Fig. 1). The factor of 4 is introduced to normalize the effect of the four quaternion components, as opposed to a single real component; in the complex case, a factor of 2 would be used for normalization.
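A brief Python/NumPy illustration of the kernel in (11) and of the outlier behaviour depicted in Fig. 1 (our own sketch; the name quat_gauss_kernel is ours): the squared error grows without bound as an estimate moves away from the true quaternion, whereas the kernel value saturates towards zero.

```python
import numpy as np

def quat_gauss_kernel(x, y, sigma):
    """Quaternion Gaussian kernel of (11); the last axis holds the (r, i, j, k) parts."""
    sq_dist = np.sum((x - y) ** 2, axis=-1)                    # |X - Y|^2
    return 4.0 / (np.sqrt(2.0 * np.pi) * sigma) * np.exp(-sq_dist / (2.0 * sigma ** 2))

# Mirror Fig. 1: the true quaternion is (0, 0i, 0j, 0k); move an estimate away from it.
truth = np.zeros(4)
for dist in [0.0, 0.5, 1.0, 2.0, 5.0, 50.0]:
    est = np.array([dist, 0.0, 0.0, 0.0])
    mse = np.sum((truth - est) ** 2)                           # grows without bound
    corr = quat_gauss_kernel(truth, est, sigma=1.0)            # saturates towards 0
    print(f"|X - Y| = {dist:5.1f}   squared error = {mse:8.1f}   kernel value = {corr:.4f}")
```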

III Quaternion Recurrent Neural Network with RTRL and Maximum Correntropy Criterion

We next introduce the QRNN equipped with RTRL and MCC; this is achieved based on the GHR calculus. An illustration of a general architecture for QRNN with RTRL is shown in Fig. 2.

III-A Forward pass

Let \mathbf{h}_a^{(l)}(n) denote the hidden state of the a-th neuron in the l-th layer at time n, defined as

\mathbf{h}_a^{(l)}(n) = \Phi\left(\mathbf{f}_a^{(l)}(n)\right)\,,   (12)

where

\mathbf{f}_a^{(l)}(n) = \sum_{b=1}^{A_{l-1}} \mathbf{u}_{ab}^{(l)} \otimes \mathbf{h}_b^{(l-1)}(n-1) + \mathbf{w}_{ab}^{(l)} \otimes \mathbf{v}_b^{(l-1)}(n) + \mathbf{b}_a^{(l)}\,.   (13)

Here, \mathbf{u}_{ab}^{(l)} represents the vector of recurrent quaternion weights, \Phi is the quaternion split activation function (see Appendix VI-B), \mathbf{v}_b^{(l-1)}(n) is the output of the b-th neuron in layer l-1 at time n, \mathbf{h}_b^{(l-1)}(n-1) is the hidden state of the b-th neuron in layer l-1 at time n-1, and \otimes is the Hamilton product.
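A minimal single-hidden-layer sketch of the forward pass (12)-(13) in Python/NumPy is given below (our own illustration, not the authors' implementation). It assumes a split tanh activation and takes the recurrent weights U to act on the layer's own previous hidden state, consistent with the update of \mathbf{U}^{(l)} in (33); all function names are ours.

```python
import numpy as np

def qmul(p, q):
    # Hamilton product; the last axis holds the (a, b, c, d) components
    pa, pb, pc, pd = np.moveaxis(p, -1, 0)
    qa, qb, qc, qd = np.moveaxis(q, -1, 0)
    return np.stack([pa*qa - pb*qb - pc*qc - pd*qd,
                     pa*qb + pb*qa + pc*qd - pd*qc,
                     pa*qc - pb*qd + pc*qa + pd*qb,
                     pa*qd + pb*qc - pc*qb + pd*qa], axis=-1)

def qmatvec(M, x):
    """Quaternion matrix-vector product: M is (out, in, 4), x is (in, 4)."""
    return qmul(M, x[None, :, :]).sum(axis=1)

def split_tanh(q):
    return np.tanh(q)                        # split activation: tanh on each component

def qrnn_step(v, h_prev, U, W, b):
    """One forward step in the spirit of (12)-(13) for a single hidden layer."""
    f = qmatvec(U, h_prev) + qmatvec(W, v) + b
    return split_tanh(f)

rng = np.random.default_rng(0)
n_in, n_hid = 3, 8                           # e.g. three pure-quaternion markers
U = 0.1 * rng.normal(size=(n_hid, n_hid, 4))
W = 0.1 * rng.normal(size=(n_hid, n_in, 4))
b = np.zeros((n_hid, 4))
h = np.zeros((n_hid, 4))
for n in range(5):                           # unroll a few time steps
    v = rng.normal(size=(n_in, 4))
    h = qrnn_step(v, h, U, W, b)
print(h.shape)                               # (8, 4)
```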

III-B Correntropy cost function

Denote the error vector by \mathbf{e}(n) = \mathbf{d}(n) - \mathbf{h}(n). Given the quaternion Gaussian kernel in (11), the correntropy between \mathbf{h}(n) and the desired output \mathbf{d}(n) becomes

V_{\sigma}\bigl(\mathbf{e}(n)\bigr) = E\Bigl[\kappa_{\sigma}\bigl(\mathbf{e}(n)\bigr)\Bigr]\,,   (14)

where

\kappa_{\sigma}\bigl(\mathbf{e}(n)\bigr) = \frac{4}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{1}{2\sigma^{2}}|\mathbf{e}(n)|^{2}\right]\,.   (15)

Usually, the joint probability density function is unknown and only a finite set of samples is available. To estimate the correntropy, the expectation can be approximated based on

\hat{V}_{N,\sigma}\bigl(\mathbf{d}(n),\mathbf{h}(n)\bigr) = \frac{1}{N}\sum_{n=1}^{N} \kappa_{\sigma}\bigl(\mathbf{d}(n) - \mathbf{h}(n)\bigr)\,.   (16)

The correntropy accounts for the even moments of the error between \mathbf{d}(n) and \mathbf{h}(n), in particular of the magnitude |\mathbf{d}(n) - \mathbf{h}(n)|. As \sigma increases, the higher-order terms diminish and the second-order term becomes dominant. This facilitates the use of a gradient-based method to maximise correntropy.
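To make this explicit, expanding the exponential in (15) (a standard Taylor series, stated here for illustration) gives

\kappa_{\sigma}\bigl(\mathbf{e}(n)\bigr) = \frac{4}{\sqrt{2\pi}\,\sigma}\sum_{k=0}^{\infty} \frac{(-1)^{k}}{2^{k}\,\sigma^{2k}\,k!}\,|\mathbf{e}(n)|^{2k} = \frac{4}{\sqrt{2\pi}\,\sigma}\left(1 - \frac{|\mathbf{e}(n)|^{2}}{2\sigma^{2}} + \frac{|\mathbf{e}(n)|^{4}}{8\sigma^{4}} - \cdots\right),

so the correntropy is a weighted sum of all even error moments; for a large kernel size \sigma the terms beyond |\mathbf{e}(n)|^{2} become negligible, and maximising correntropy then reduces to minimising a scaled MSE.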

As we aim to use gradient descent, instead of maximising \hat{V}_{N,\sigma}(\mathbf{d}(n), \mathbf{h}(n)), the objective of the network training is to minimise the loss function -\hat{V}_{N,\sigma}(\mathbf{d}(n), \mathbf{h}(n)), expressed as

J(n) = -\frac{1}{N}\sum_{n=1}^{N} \kappa_{\sigma}\bigl(\mathbf{e}(n)\bigr)\,,   (17)

where n is the current time index and N is the total number of time steps. Using the correntropy with a Gaussian kernel as a cost function gives our problem reproducing kernel Hilbert space attributes, unlike the MSE, which considers only the Euclidean distance of the errors. Essentially, the correntropy represents a weighted sum of error distances, with the kernel function modifying the influence of each error by mapping it from the input space to the reproducing kernel Hilbert space.
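The loss (17) translates directly into code. The sketch below (an illustration under our own naming, with a plain squared-error loss alongside for comparison) shows how a single corrupted time step dominates the MSE while barely affecting the MCC loss.

```python
import numpy as np

def mcc_loss(d, h, sigma):
    """Negative sample correntropy of (17); the last axis holds the four quaternion parts."""
    sq_err = np.sum((d - h) ** 2, axis=-1)                          # |e(n)|^2
    kernel = 4.0 / (np.sqrt(2.0 * np.pi) * sigma) * np.exp(-sq_err / (2.0 * sigma ** 2))
    return -np.mean(kernel)

def mse_loss(d, h):
    """Quadratic reference loss for comparison."""
    return np.mean(np.sum((d - h) ** 2, axis=-1))

rng = np.random.default_rng(1)
d = rng.normal(size=(200, 3, 4))                                    # N time steps, 3 quaternion outputs
h = d + 0.05 * rng.normal(size=d.shape)
h[10] += 20.0                                                       # one corrupted time step
print(mcc_loss(d, h, sigma=1.0), mse_loss(d, h))                    # MCC barely moves, MSE explodes
```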

Figure 2: A general architecture for QRNN with RTRL.

III-C Backpropagation

We now derive the quaternion real-time recurrent learning algorithm for the QRNN. The quaternion gradient of the correntropy cost function can be expressed as

\nabla_{\mathbf{q}^{*}} J(n) = \left(\frac{\partial J(n)}{\partial \mathbf{q}^{*}}\right)^{T} = \left(\frac{\partial J(n)}{\partial \mathbf{q}}\right)^{H}\,.   (18)

To find \frac{\partial J(n)}{\partial \mathbf{q}}, we use the GHR chain rule [10] to obtain

\frac{\partial J(n)}{\partial \mathbf{q}} = -\frac{1}{N}\sum_{n=1}^{N} \sum_{\mu\in\{1,i,j,k\}} \frac{\partial \kappa_{\sigma}(\mathbf{e}(n))}{\partial \mathbf{e}^{\mu}(n)}\,\frac{\partial \mathbf{e}^{\mu}(n)}{\partial \mathbf{q}}\,.   (19)

The first term, \frac{\partial \kappa_{\sigma}(\mathbf{e}(n))}{\partial \mathbf{e}^{\mu}(n)}, can be evaluated as

\frac{\partial \kappa_{\sigma}(\mathbf{e}(n))}{\partial \mathbf{e}^{\mu}(n)} = -\frac{4}{\sqrt{2\pi}\,\sigma^{3}}\exp\left[-\frac{1}{2\sigma^{2}}|\mathbf{e}(n)|^{2}\right]\frac{\partial |\mathbf{e}(n)|^{2}}{\partial \mathbf{e}^{\mu}(n)}\,.   (20)

Upon substituting (20) into the formula for \frac{\partial J(n)}{\partial \mathbf{q}}, we obtain

\frac{\partial J(n)}{\partial \mathbf{q}} = \frac{4}{N\sqrt{2\pi}\,\sigma^{3}}\sum_{n=1}^{N} \sum_{\mu\in\{1,i,j,k\}} \exp\left[-\frac{1}{2\sigma^{2}}|\mathbf{e}(n)|^{2}\right]\frac{\partial |\mathbf{e}(n)|^{2}}{\partial \mathbf{e}^{\mu}(n)}\,\frac{\partial \mathbf{e}^{\mu}(n)}{\partial \mathbf{q}}\,.   (21)

Then, upon employing the GHR chain rule [10], the derivative of |\mathbf{e}(n)|^{2} can be calculated as

\frac{\partial |\mathbf{e}(n)|^{2}}{\partial \mathbf{q}} = \sum_{\mu\in\{1,i,j,k\}} \frac{\partial |\mathbf{e}(n)|^{2}}{\partial \mathbf{e}^{\mu}(n)}\,\frac{\partial \mathbf{e}^{\mu}(n)}{\partial \mathbf{q}}\,,   (22)

whereby the GHR product rule gives

\frac{\partial |\mathbf{e}(n)|^{2}}{\partial \mathbf{e}^{\mu}(n)} = \frac{\partial (\mathbf{e}^{*}(n)\,\mathbf{e}(n))}{\partial \mathbf{e}^{\mu}(n)} = \mathbf{e}^{*}(n)\frac{\partial \mathbf{e}(n)}{\partial \mathbf{e}^{\mu}(n)} + \frac{\partial \mathbf{e}^{*}(n)}{\partial \mathbf{e}^{\mathbf{e}(n)\mu}(n)}\,\mathbf{e}(n) = \frac{1}{2}\mathbf{e}^{\mu}(n)\,.   (23)

Given that \mathbf{e}(n) = \mathbf{d}(n) - \mathbf{h}(n), we obtain

\frac{\partial \mathbf{e}^{\mu}(n)}{\partial \mathbf{q}} = -\bigl(J_{\mathbf{q}^{\mu}}(n)\bigr)^{\mu}\,,   (24)

where J_{\mathbf{q}^{\mu}}(n) \triangleq \frac{\partial \mathbf{h}(n)}{\partial \mathbf{q}^{\mu}} is the Jacobian matrix of \mathbf{h}(n). Thus,

\frac{\partial |\mathbf{e}(n)|^{2}}{\partial \mathbf{q}} = -\sum_{\mu\in\{1,i,j,k\}} \frac{1}{2}\mathbf{e}^{\mu}(n)\bigl(J_{\mathbf{q}^{\mu}}(n)\bigr)^{\mu}\,,   (25)

which allows us to arrive at

\frac{\partial J(n)}{\partial \mathbf{q}} = \frac{4}{N\sqrt{2\pi}\,\sigma^{3}}\sum_{n=1}^{N} \sum_{\mu\in\{1,i,j,k\}} \exp\left[-\frac{1}{2\sigma^{2}}|\mathbf{e}(n)|^{2}\right]\left(-\frac{1}{2}\mathbf{e}^{\mu}(n)\bigl(J_{\mathbf{q}^{\mu}}(n)\bigr)^{\mu}\right)\,.   (26)

This can be simplified into

\frac{\partial J(n)}{\partial \mathbf{q}} = -\frac{2}{N\sqrt{2\pi}\,\sigma^{3}}\sum_{n=1}^{N} \exp\left[-\frac{1}{2\sigma^{2}}|\mathbf{e}(n)|^{2}\right] \sum_{\mu\in\{1,i,j,k\}} \bigl(J_{\mathbf{q}^{\mu}}(n)\,\mathbf{e}(n)\bigr)^{\mu}\,.   (27)

Finally, to find the quaternion gradient, we take the Hermitian (conjugate transpose) of (27), that is

\nabla_{\mathbf{q}^{*}} J(n) = \left(\frac{\partial J(n)}{\partial \mathbf{q}}\right)^{H} = \left(-\frac{2}{N\sqrt{2\pi}\,\sigma^{3}}\sum_{n=1}^{N} \exp\left[-\frac{1}{2\sigma^{2}}|\mathbf{e}(n)|^{2}\right] \sum_{\mu\in\{1,i,j,k\}} \bigl(J_{\mathbf{q}^{\mu}}(n)\,\mathbf{e}(n)\bigr)^{\mu}\right)^{H}\,.   (28)

Denote the quaternion error terms at time n as \delta^{(l)}(n), that is

\delta^{(l)}(n) = \begin{cases} \mathbf{e}(n) = \mathbf{d}(n) - \mathbf{h}^{(L)}(n)\,, & l = L,\; n = N \\[6pt] \bar{\Phi}\bigl(\mathbf{f}^{(l)}(n)\bigr) \odot \biggl(\exp\Bigl[-\frac{1}{2\sigma^{2}}\bigl|\delta^{(l+1)}(n)\bigr|^{2}\Bigr] \Bigl[\bigl[(\mathbf{W}^{(l+1)})^{\mu} \times (\delta^{(l+1)}(n))^{\mu}\bigr]^{H} + \bigl[(\mathbf{U}^{(l)})^{\mu} \times (\delta^{(l)}(n+1))^{\mu}\bigr]^{H}\Bigr]\biggr)\,, & \text{otherwise} \end{cases}   (29)

with \bar{\Phi} the derivative of the activation \Phi, \odot the Hadamard product, and (\mathbf{W}^{(l+1)})^{H} and (\mathbf{U}^{(l)})^{H} the Hermitian transposes of the weight matrices of the (l+1)-th and the l-th layer, respectively. Then, the weight and bias updates for the output layer are

\mathbf{W}^{(l)} = \mathbf{W}^{(l)} + \alpha\sum_{n=1}^{N} \delta^{(l)}(n) \times \bigl(\mathbf{v}^{(l-1)}(n)\bigr)^{H}\,,   (30)
\mathbf{b}^{(l)} = \mathbf{b}^{(l)} + \alpha\sum_{n=1}^{N} \delta^{(l)}(n)\,.   (31)

The weight and bias updates for the hidden layers are

\mathbf{W}^{(l)} = \mathbf{W}^{(l)} + \alpha\sum_{n=1}^{N}\sum_{\mu\in\{1,i,j,k\}} \delta^{(l)}(n) \times \bigl(\mathbf{v}^{(l-1)}(n)\bigr)^{H}\,,   (32)
\mathbf{U}^{(l)} = \mathbf{U}^{(l)} + \alpha\sum_{n=2}^{N}\sum_{\mu\in\{1,i,j,k\}} \delta^{(l)}(n) \times \bigl(\mathbf{h}^{(l)}(n)\bigr)^{H}\,,   (33)
\mathbf{b}^{(l)} = \mathbf{b}^{(l)} + \alpha\sum_{n=1}^{N}\sum_{\mu\in\{1,i,j,k\}} \delta^{(l)}(n)\,.   (34)

Here, \alpha > 0 is the learning rate, and \mathbf{U}^{(l)} represents the matrix of recurrent quaternion weights for layer l.
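The update equations (30)-(31) translate almost literally into code once a quaternion outer product is available. The sketch below is our own illustration, assuming that \times in (30) denotes the outer product \delta^{(l)}(n)\,(\mathbf{v}^{(l-1)}(n))^{H} with a Hamilton product per entry; the error terms \delta are taken as already computed via (29).

```python
import numpy as np

def qmul(p, q):
    # Hamilton product; the last axis holds the (a, b, c, d) components
    pa, pb, pc, pd = np.moveaxis(p, -1, 0)
    qa, qb, qc, qd = np.moveaxis(q, -1, 0)
    return np.stack([pa*qa - pb*qb - pc*qc - pd*qd,
                     pa*qb + pb*qa + pc*qd - pd*qc,
                     pa*qc - pb*qd + pc*qa + pd*qb,
                     pa*qd + pb*qc - pc*qb + pd*qa], axis=-1)

def qconj(q):
    return q * np.array([1.0, -1.0, -1.0, -1.0])

def qouter(delta, v):
    """Quaternion outer product delta x v^H: delta is (out, 4), v is (in, 4)."""
    return qmul(delta[:, None, :], qconj(v)[None, :, :])        # (out, in, 4)

def output_layer_update(W, b, deltas, vs, alpha):
    """Accumulated output-layer updates (30)-(31); deltas and vs are lists over time."""
    for delta, v in zip(deltas, vs):
        W = W + alpha * qouter(delta, v)
        b = b + alpha * delta
    return W, b

rng = np.random.default_rng(0)
W, b = 0.1 * rng.normal(size=(3, 8, 4)), np.zeros((3, 4))       # 3 outputs, 8 hidden units
deltas = [rng.normal(size=(3, 4)) for _ in range(5)]            # error terms from (29)
vs = [rng.normal(size=(8, 4)) for _ in range(5)]                # hidden outputs v^(l-1)(n)
W, b = output_layer_update(W, b, deltas, vs, alpha=0.05)
print(W.shape, b.shape)
```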

IV Simulation results

TABLE I: Forecasting performances of the QRNN RTRL with MCC or MSE loss, RNN RTRL with MCC or MSE loss, QLMS, and LMS. The errors indicate the average of the performance measure over the sequences included in the test dataset for each algorithm. The 95% mean confidence intervals for the QRNNs and RNNs are shown. The error types presented are the root mean squared error (RMSE, in mm), normalised RMSE (nRMSE, no unit), mean average error (MAE, in mm), and jitter (in mm).
Error   Prediction method     All sequences      Regular breathing   Irregular breathing
RMSE    QRNN RTRL w/ MCC      1.486 ± 0.004      1.245 ± 0.004       1.683 ± 0.005
        QRNN RTRL w/ MSE      1.513 ± 0.006      1.239 ± 0.005       1.741 ± 0.008
        RNN RTRL w/ MCC       1.691 ± 0.006      1.437 ± 0.004       1.896 ± 0.005
        RNN RTRL w/ MSE       1.709 ± 0.005      1.421 ± 0.005       1.932 ± 0.006
        QLMS                  1.696              1.549               2.004
        LMS                   1.728              1.573               2.038
nRMSE   QRNN RTRL w/ MCC      0.3148 ± 0.0007    0.3115 ± 0.0006     0.3392 ± 0.0008
        QRNN RTRL w/ MSE      0.3186 ± 0.0005    0.3092 ± 0.0004     0.3473 ± 0.0011
        RNN RTRL w/ MCC       0.3327 ± 0.0006    0.3319 ± 0.0005     0.3618 ± 0.0009
        RNN RTRL w/ MSE       0.3358 ± 0.0008    0.3261 ± 0.0007     0.3694 ± 0.0008
        QLMS                  0.3461             0.3328              0.4352
        LMS                   0.3509             0.3391              0.4427
MAE     QRNN RTRL w/ MCC      0.752 ± 0.007      0.617 ± 0.005       0.882 ± 0.009
        QRNN RTRL w/ MSE      0.771 ± 0.006      0.625 ± 0.006       0.918 ± 0.007
        RNN RTRL w/ MCC       0.854 ± 0.005      0.701 ± 0.003       0.971 ± 0.004
        RNN RTRL w/ MSE       0.863 ± 0.003      0.713 ± 0.002       1.002 ± 0.003
        QLMS                  0.935              0.881               1.156
        LMS                   0.988              0.936               1.214
Jitter  QRNN RTRL w/ MCC      0.7083 ± 0.0013    0.6217 ± 0.0012     0.7741 ± 0.0014
        QRNN RTRL w/ MSE      0.6971 ± 0.0014    0.5948 ± 0.0014     0.8201 ± 0.0016
        RNN RTRL w/ MCC       0.7899 ± 0.0016    0.6932 ± 0.0014     0.8803 ± 0.0013
        RNN RTRL w/ MSE       0.7862 ± 0.0014    0.6745 ± 0.0017     0.9021 ± 0.0012
        QLMS                  1.693              1.680               1.726
        LMS                   1.697              1.683               1.728

IV-A Motion prediction of chest internal points for lung cancer radiotherapy

In the context of lung cancer radiotherapy, tracking the position of infrared-reflective markers on the chest is a method used to approximate the location of the tumour. Nonetheless, the precision of radiation delivery in radiotherapy systems is constrained by the inherent latency due to limitations in robotic control. Compensating for delays is crucial to reduce the damage to healthy tissues. Employing RNN with online learning can facilitate predictions that adapt to the dynamic nature of respiratory signals, potentially mitigating this issue [21]. Unlike offline methods, which do not change after initial training, online methods continually update the synaptic weights of the network with each new data input. This continuous updating allows the neural network to adjust to the patient’s evolving breathing patterns, offering improved resilience against complex movements. Online learning is particularly advantageous in medical settings, where acquiring extensive training datasets can be challenging. It provides a way to adapt to data not included in the initial training set. Adaptive or dynamic learning techniques have been frequently utilized in radiotherapy. Studies have shown the effectiveness of these approaches compared to static models [22, 23, 24]. One such dynamic method is RTRL [13], which has been applied in various systems like the Cyberknife Synchrony system [24] and the SyncTraX system [25]. It has also been used to track the position of internal points within the chest [21], demonstrating its broad applicability in medical technology.

While prior studies in the field of respiratory motion prediction primarily concentrated on univariate signal analysis, our simulation focuses on the prediction of 3D marker position, represented as pure quaternions, setting a prediction horizon of two seconds. This is achieved through the use of a QRNN trained with RTRL. We compare its effectiveness, when associated with the MCC loss, against other forecasting algorithms. Furthermore, we conducted predictions for all three markers concurrently, enabling the QRNN to further identify and leverage the interrelationships in their movements.

IV-B 3D marker position data

The data used for this study was collected from nine instances of 3D positional tracking of three external markers placed on the chest and abdomen of subjects resting on a HexaPOD treatment couch. The tracking was accomplished using an NDI Polaris infrared camera. Each recorded sequence varied in length, ranging from 73 to 320 seconds, and was sampled at a frequency of 10 Hz. The movement trajectories of the markers were recorded in three dimensions: superior-inferior (ranging from 6 mm to 40 mm), left-right (2 mm to 10 mm), and antero-posterior (18 mm to 45 mm). Out of these recordings, five depict normal breathing patterns, while the remaining four capture subjects engaging in activities such as talking or laughing. Further specifics about the public dataset are elaborated in [26, 27]. The data from subjects was divided into two groups to assess the robustness of the algorithms.

IV-C Algorithms & Training

We compared the performance of the QRNN equipped with RTRL and the MCC loss function (Sec. III) against its counterpart that employs the MSE loss function (see Appendix VI-A). The comparison also included the standard RNN with RTRL and either the MSE loss, as described in [21], or the MCC loss with the kernel in (10), as well as the Quaternion Least Mean Square (QLMS) algorithm [28] and the real-valued Least Mean Square (LMS) algorithm. Note that for the standard RNNs with RTRL, the predictions for all three markers were conducted concurrently. The QRNNs and RNNs had one hidden layer. The activation function was the hyperbolic tangent for the RNNs, and the split hyperbolic tangent for the QRNNs (see Appendix VI-B). The cross-validation metric was the RMSE. The ranges of hyper-parameters for cross-validation with grid search were as follows. For the QRNNs and RNNs with RTRL, the learning rate range was \eta \in \{0.02, 0.05, 0.1, 0.2\}. The range of the number of hidden units was h \in \{10, 20, 30, 40, 45\} for the QRNNs and h \in \{20, 40, 60, 80, 90\} for the RNNs. For QLMS and LMS, the parameter range was \eta \in \{0.002, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2\}. For all prediction methods, the regressor length, defined as the number of time steps, was l \in \{10, 30, 50, 70, 90\}. Fifty runs were successively performed for cross-validation, and 300 runs for evaluation. Updating QRNNs and RNNs with the gradient rule may induce numerical instability; the gradient norm was therefore clipped to avoid large weight updates, while the optimisation method was stochastic gradient descent. The training and validation of these models were conducted on the first minute of each recorded sequence, using 30 seconds for each. The four error types in Table I, used to evaluate the prediction performance, are implemented following [27].
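The stabilisation step mentioned above can be made concrete as follows (a sketch only: the paper does not specify the exact clipping scheme, so global-norm clipping and the clipping threshold are our assumptions; the grids are those listed for the QRNNs).

```python
import numpy as np

def clip_gradient(grad, max_norm):
    """Rescale the gradient so that its global norm does not exceed max_norm,
    one common way of avoiding the large RTRL weight updates mentioned above."""
    norm = np.sqrt(np.sum(np.abs(grad) ** 2))
    return grad if norm <= max_norm else grad * (max_norm / norm)

# Cross-validation grids for the QRNNs with RTRL (Sec. IV-C); the clipping
# threshold itself is a hypothetical value, for illustration only.
learning_rates = [0.02, 0.05, 0.1, 0.2]
hidden_units = [10, 20, 30, 40, 45]
regressor_lengths = [10, 30, 50, 70, 90]
max_grad_norm = 1.0

grad = np.random.default_rng(0).normal(size=(45, 45, 4)) * 10.0  # an oversized gradient
print(np.linalg.norm(clip_gradient(grad, max_grad_norm)))        # <= 1.0
```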

IV-D Prediction performance

The QRNN equipped with RTRL and the MCC loss produced the lowest mean average error, RMSE, and nRMSE averaged over all the sequences as well as in the context of irregular breathing (Table I). The QRNN with RTRL and the MSE loss performed slightly better than its counterpart with the MCC loss for predicting sequences of regular breathing, in terms of the RMSE and nRMSE metrics. The robustness of the QRNN with RTRL and the MCC loss against irregular movements was further confirmed as its nRMSE exhibited an increase of 8.2% between regular and irregular breathing patterns, which is the smallest among the algorithms assessed. More generally, the QRNNs with RTRL displayed consistently better performance for forecasting the breathing sequences than the standard RNNs with RTRL, QLMS, and LMS algorithms. The compact 95% confidence intervals accompanying the performance metrics presented in Table I suggest that choosing 300 test runs was adequate to yield precise results. In addition, the jitter was evaluated. It refers to the fluctuation or oscillation amplitude in the predicted signal, and its extent can significantly affect robotic control during treatment. The QRNN with RTRL and MSE displayed a slightly lower jitter overall than its counterpart with the MCC loss. The QRNN with RTRL and MCC exhibited the lowest jitter in the context of irregular breathing, significantly outperforming the other algorithms, including the RNN with RTRL and MCC (0.77 mm against 0.88 mm) and the QLMS and LMS algorithms (0.77 mm against 1.73 mm).

IV-E Discussion of simulation limits

The simulation has certain limitations, particularly regarding the quantity and length of the sequences utilized, which are relatively modest. Nonetheless, the dataset used encompasses a broad spectrum of respiratory patterns, including various shifts, drifts, slow and abrupt irregular movements, and both active and calm breathing states. A notable aspect of the QRNNs with RTRL and the MCC or MSE loss function we examined is their ability to perform online learning, which does not require extensive pre-existing data to make precise predictions. This is evident from the high accuracy achieved with just one minute of training data. Based on these factors, we believe our results could be extrapolated to larger datasets. Another strength of the simulations is the open-source dataset we used, which promotes reproducibility of our findings. This contrasts with many prior studies in this field, which often rely on private datasets, complicating comparative analyses. We also addressed challenging scenarios like laughing and talking, which are generally controlled in clinical settings. Analyzing performance under such conditions provides insights into other less predictable events that may occur during treatment, such as yawning, hiccupping, or coughing. The current clinical practice is to halt radiation when such anomalies are detected. By differentiating between regular and irregular breathing patterns, we could objectively assess and measure the resilience of the compared algorithms. Given that almost half of our dataset consisted of irregular breathing sequences, the average numerical error metrics across all nine sequences might be higher than what would be encountered in more standard scenarios. Moreover, the QRNN models do take roughly 50% longer to train than the RNN models (Table II). The QRNN with RTRL and MCC demonstrated an average computation time per time step of 184.5 ms. Note that this computation time is less than the approximate marker position sampling interval, which is around 400 ms.

TABLE II: Time performance of the QRNN trained with RTRL and MCC loss in comparison with other prediction methods (Ubuntu Linux, Intel i7-10700K 3.8 GHz CPU, NVIDIA GeForce RTX 3080 GPU, 32 GB RAM).
Prediction algorithm        Calculation time per time step (ms)
QRNN with RTRL and MCC      184.5
QRNN with RTRL and MSE      182.4
RNN with RTRL and MCC       116.1
RNN with RTRL and MSE       115.7
QLMS                        0.507
LMS                         0.313

V Conclusion

The incorporation of the quaternion algebra into RNNs offers an efficient way to capture, and hence make use of, the multidimensional structures present in 3D and 4D data. When combined with RTRL, the QRNN model has been demonstrated to adapt and learn intricate multidimensional temporal patterns in real time. While both the MSE and the MCC are viable cost functions, the choice depends on the specific application and the nature of the data. Simulations in the context of motion prediction of chest internal points for safer lung cancer radiotherapy have shown that the kernel-based similarity within the MCC helps the QRNN to be consistently more robust to outliers in data. The successful training of QRNNs with RTRL, using just one minute of respiratory data per sequence, has moreover demonstrated the efficacy of dynamic training even with constrained data availability.

Acknowledgement

Pauline Bourigault was supported by the UKRI CDT in AI for Healthcare http://ai4health.io (Grant No. P/S023283/1).

References

  • [1] T. Isokawa, T. Kusakabe, N. Matsui, and F. Peper, Quaternion Neural Network and Its Application.   Springer Berlin, 2003, pp. 318–324.
  • [2] T. Parcollet, M. Morchid, and G. Linarès, “A survey of quaternion neural networks,” Artificial Intelligence Review, vol. 53, no. 4, pp. 2957–2982, 2019.
  • [3] X. Zhu, Y. Xu, H. Xu, and C. Chen, “Quaternion convolutional neural networks,” in Proceedings of the European Conference on Computer Vision (ECCV), 2019.
  • [4] S. Zhang, Y. Tay, L. Yao, and Q. Liu, “Quaternion knowledge graph embeddings,” in Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), 2019.
  • [5] C. J. Gaudet and A. S. Maida, “Deep quaternion networks,” in Proceedings of the International Joint Conference on Neural Networks (IJCNN), 2018, pp. 1–8.
  • [6] T. Parcollet, M. Morchid, and G. Linares, “Quaternion convolutional neural networks for heterogeneous image processing,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 8514–8518.
  • [7] K. Takahashi, E. Tano, and M. Hashimoto, “Feedforward–feedback controller based on a trained quaternion neural network using a generalised HR calculus with application to trajectory control of a three-link robot manipulator,” Machines, vol. 10, no. 5, p. 333, 2022.
  • [8] D. Comminiello, M. Lella, S. Scardapane, and A. Uncini, “Quaternion convolutional neural networks for detection and localization of 3D sound events,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 8533–8537.
  • [9] D. Xu, Y. Xia, and D. P. Mandic, “Optimization in quaternion dynamic systems: Gradient, Hessian, and learning algorithms,” IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 2, pp. 249–261, 2016.
  • [10] D. Xu, C. Jahanchahi, C. C. Took, and D. P. Mandic, “Enabling quaternion derivatives: The generalized HR calculus,” Royal Society Open Science, vol. 2, no. 8, p. 150255, 2015.
  • [11] D. Xu, L. Zhang, and H. Zhang, “Learning algorithms in quaternion neural networks using GHR calculus,” Neural Network World, vol. 27, no. 3, pp. 271–282, 2017.
  • [12] C. Popa, “Learning algorithms for quaternion-valued neural networks,” Neural Processing Letters, vol. 47, no. 3, pp. 949–973, 2017.
  • [13] R. J. Williams and D. Zipser, “A learning algorithm for continually running fully recurrent neural networks,” Neural Computation, vol. 1, no. 2, pp. 270–280, 1989.
  • [14] D. P. Mandic and J. A. Chambers, Recurrent Neural Networks for Prediction.   Wiley, 2001.
  • [15] T. Parcollet, M. Ravanelli, M. Morchid, G. Linarès, C. Trabelsi, R. De Mori, and Y. Bengio, “Quaternion recurrent neural networks,” in Proceedings of the International Conference on Learning Representations (ICLR), 2018.
  • [16] I. Santamaria, P. P. Pokharel, and J. C. Principe, “Generalized correlation function: Definition, properties, and application to blind equalization,” IEEE Transactions on Signal Processing, vol. 54, no. 6, pp. 2187–2197, 2006.
  • [17] J. P. Ward, Quaternions and Cayley Numbers.   Springer Netherlands, 1997.
  • [18] A. Sudbery, “Quaternionic analysis,” Mathematical Proceedings of the Cambridge Philosophical Society, vol. 85, no. 2, pp. 199–225, 1979.
  • [19] W. Liu, P. P. Pokharel, and J. C. Principe, “Correntropy: A localized similarity measure,” in Proceedings of the International Joint Conference on Neural Networks (IJCNN), 2006, pp. 4919–4924.
  • [20] T. Ogunfunmi and T. Paul, “The quaternion maximum correntropy algorithm,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 62, no. 6, pp. 598–602, 2015.
  • [21] M. Pohl, M. Uesaka, K. Demachi, and R. B. Chhatkuli, “Prediction of the motion of chest internal points using a recurrent neural network trained with real-time recurrent learning for latency compensation in lung cancer radiotherapy,” Computerized Medical Imaging and Graphics, vol. 91, p. 101941, 2022.
  • [22] A. Krauss, S. Nill, and U. Oelfke, “The comparative performance of four respiratory motion predictors for real-time tumour tracking,” Physics in Medicine and Biology, vol. 56, no. 16, pp. 5303–5317, 2011.
  • [23] T. P. Teo, S. B. Ahmed, P. Kawalec, N. Alayoubi, N. Bruce, E. Lyn, and S. Pistorius, “Feasibility of predicting tumor motion using online data acquired during treatment and a generalized neural network optimized with offline patient tumor trajectories,” Medical Physics, vol. 45, no. 2, pp. 830–845, 2018.
  • [24] M. Mafi and S. M. Moghadam, “Real-time prediction of tumor motion using a dynamic neural network,” Medical & Biological Engineering & Computing, vol. 58, no. 3, pp. 529–539, 2020.
  • [25] K. Jiang, F. Fujii, and T. Shiinoki, “Prediction of lung tumor motion using nonlinear autoregressive model with exogenous input,” Physics in Medicine & Biology, vol. 64, no. 21, p. 21NT02, 2019.
  • [26] T. Krilavicius, I. Zliobaite, H. Simonavicius, and L. Jarusevicius, “Predicting respiratory motion for real-time tumour tracking in radiotherapy,” in Proceedings of the IEEE 29th International Symposium on Computer-Based Medical Systems (CBMS), 2016.
  • [27] M. Pohl, M. Uesaka, H. Takahashi, K. Demachi, and R. Bhusal Chhatkuli, “Prediction of the position of external markers using a recurrent neural network trained with unbiased online recurrent optimization for safe lung cancer radiotherapy,” Computer Methods and Programs in Biomedicine, vol. 222, p. 106908, 2022.
  • [28] C. C. Took and D. P. Mandic, “The Quaternion LMS algorithm for adaptive filtering of hypercomplex processes,” IEEE Transactions on Signal Processing, vol. 57, no. 4, pp. 1316–1327, 2009.
  • [29] N. Benvenuto and F. Piazza, “On the complex backpropagation algorithm,” IEEE Transactions on Signal Processing, vol. 40, no. 4, pp. 967–969, 1992.

VI Appendices

VI-A Quaternion RNN with RTRL and Mean Squared Error

VI-A1 Forward pass

The forward pass is defined as in Section III-A, i.e., by (12) and (13).

VI-A2 Loss function

The network error at the output layer is defined by \mathbf{e}(n) = \mathbf{d}(n) - \mathbf{h}^{(L)}(n), where \mathbf{d}(n) denotes the desired output. The objective of the network training is to minimize a real-valued mean squared error (MSE) loss function [9], that is

J(n) = \lVert\mathbf{e}(n)\rVert^{2} = \mathbf{e}^{H}(n)\,\mathbf{e}(n)   (35)

VI-A3 Backpropagation

We now derive the quaternion RTRL algorithm for the QRNN. The quaternion gradient of the error function can be expressed as

\nabla_{\mathbf{q}^{*}} J(n) = \left(\frac{\partial J(n)}{\partial \mathbf{q}^{*}}\right)^{T} = \left(\frac{\partial J(n)}{\partial \mathbf{q}}\right)^{H} = -\frac{1}{2}\sum_{\mu\in\{1,i,j,k\}} \bigl(\mathbf{J}_{\mathbf{q}^{\mu}}^{H}(n)\,\mathbf{e}(n)\bigr)^{\mu}\,,   (36)

where \mathbf{J}_{\mathbf{q}^{\mu}}(n) \triangleq \frac{\partial \mathbf{h}(n)}{\partial \mathbf{q}^{\mu}} is the Jacobian matrix of \mathbf{h}(n) [9]. Denote the quaternion error terms in the recurrent setting at time n as \delta^{(l)}(n); recall that \delta^{(l)}(n) is essentially the local error term that contributes to the gradient \nabla_{\mathbf{q}^{*}} J(n). These error terms can be calculated recursively as

\delta^{(l)}(n) = \begin{cases} \mathbf{e}(n) = \mathbf{d}(n) - \mathbf{h}^{(L)}(n)\,, & l = L,\; n = N \\[6pt] \bar{\Phi}\bigl(\mathbf{f}^{(l)}(n)\bigr) \odot \Bigl[(\mathbf{W}^{(l+1)})^{H} \times \delta^{(l+1)}(n) + (\mathbf{U}^{(l)})^{H} \times \delta^{(l)}(n+1)\Bigr]\,, & \text{otherwise} \end{cases}   (37)

where \odot is the element-wise (Hadamard) product, and N is the total number of time steps. The weight and bias update rules for the output layer are

\mathbf{W}^{(l)} = \mathbf{W}^{(l)} + \alpha\sum_{n=1}^{N} \delta^{(l)}(n) \times \bigl(\mathbf{v}^{(l-1)}(n)\bigr)^{H}\,,   (38)
\mathbf{b}^{(l)} = \mathbf{b}^{(l)} + \alpha\sum_{n=1}^{N} \delta^{(l)}(n)\,.   (39)

The weight and bias update rules for the hidden layers are

\mathbf{W}^{(l)} = \mathbf{W}^{(l)} + \alpha\sum_{n=1}^{N}\sum_{\mu\in\{1,i,j,k\}} \bigl(\delta^{(l)}(n)\bigr)^{\mu} \times \bigl(\mathbf{v}^{(l-1)}(n)\bigr)^{H}\,,   (40)
\mathbf{U}^{(l)} = \mathbf{U}^{(l)} + \alpha\sum_{n=2}^{N}\sum_{\mu\in\{1,i,j,k\}} \bigl(\delta^{(l)}(n)\bigr)^{\mu} \times \bigl(\mathbf{h}^{(l)}(n)\bigr)^{H}\,,   (41)
\mathbf{b}^{(l)} = \mathbf{b}^{(l)} + \alpha\sum_{n=1}^{N}\sum_{\mu\in\{1,i,j,k\}} \bigl(\delta^{(l)}(n)\bigr)^{\mu}\,.   (42)

Here, \alpha > 0 is the learning rate, \mathbf{U}^{(l)} represents the matrix of recurrent quaternion weights for layer l, and (\cdot)^{H} denotes the Hermitian transpose.

VI-B Split activation function

Denote the split activation function as

\Phi(\cdot) = \Phi_{\zeta}(\cdot) + \Phi_{\zeta}(\cdot)\,i + \Phi_{\zeta}(\cdot)\,j + \Phi_{\zeta}(\cdot)\,k\,,   (43)

with \Phi_{\zeta}(\cdot) representing any real-valued activation function. To perform backpropagation using split activation functions, [29] proposed a "pseudo-gradient" update wherein the gradient is computed component-wise. Accordingly, the "pseudo-derivative" of \Phi(\cdot) in (43), written as \bar{\Phi}(\cdot), is given by

\bar{\Phi}(\cdot) = \bar{\Phi}_{\zeta}(\cdot) + \bar{\Phi}_{\zeta}(\cdot)\,i + \bar{\Phi}_{\zeta}(\cdot)\,j + \bar{\Phi}_{\zeta}(\cdot)\,k\,.   (44)

On the other hand, the compact GHR derivative of any split activation function \Phi(\cdot) is defined as [9]

\frac{\partial \Phi(q)}{\partial q} = \frac{1}{4}\left(\frac{\partial \Phi_{\zeta}(q)}{\partial r} - \frac{\partial \Phi_{\zeta}(q)}{\partial x}\,i - \frac{\partial \Phi_{\zeta}(q)}{\partial y}\,j - \frac{\partial \Phi_{\zeta}(q)}{\partial z}\,k\right) = \frac{1}{4}\Bigl(\bar{\Phi}_{\zeta}(r) + \bar{\Phi}_{\zeta}(x) + \bar{\Phi}_{\zeta}(y) + \bar{\Phi}_{\zeta}(z)\Bigr).   (45)
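For completeness, a Python/NumPy sketch of the split hyperbolic tangent used by the QRNNs in Sec. IV-C, together with its component-wise pseudo-derivative (44) (our own illustration; function names are ours):

```python
import numpy as np

def split_tanh(q):
    """Split activation (43) with the real tanh applied to each of the four
    components; the last axis of q holds (a, b, c, d)."""
    return np.tanh(q)

def split_tanh_pseudo_derivative(q):
    """Component-wise pseudo-derivative (44): d tanh(x)/dx = 1 - tanh(x)^2."""
    return 1.0 - np.tanh(q) ** 2

q = np.array([0.5, -1.0, 2.0, 0.1])
print(split_tanh(q))
print(split_tanh_pseudo_derivative(q))
```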