
Neural Generalised AutoRegressive Conditional Heteroskedasticity

Zexuan Yin† and Paolo Barucca†‡
Corresponding author. Email: [email protected]
†Department of Computer Science, University College London, WC1E 7JE, United Kingdom
[email protected]
(v1.1 released November 2021)
Abstract

We propose Neural GARCH, a class of methods to model conditional heteroskedasticity in financial time series. Neural GARCH is a neural network adaptation of the GARCH(1,1) model in the univariate case, and the diagonal BEKK(1,1) model in the multivariate case. We allow the coefficients of a GARCH model to be time-varying in order to reflect the constantly changing dynamics of financial markets. The time-varying coefficients are parameterised by a recurrent neural network that is trained with stochastic gradient variational Bayes. We propose two variants of our model, one with normal innovations and the other with Student’s t innovations. We test our models on a wide range of univariate and multivariate financial time series, and we find that the Neural Student’s t model consistently outperforms the others.

keywords:
Heteroskedasticity; Recurrent neural networks; Variational inference; Volatility Forecasting
Classification codes: C32, C45, C53

1 Introduction

Modelling conditional heteroskedasticity (time-varying volatility) in financial time series such as energy prices (Chan and Grant, 2016), cryptocurrencies (Chu et al., 2017), and foreign currency exchange rates (Malik, 2005) is of great importance to financial practitioners, as it allows better decision making with regards to portfolio selection, asset pricing and risk management. In the univariate setting, popular methods include Autoregressive Conditional Heteroskedastic (ARCH) models (Engle, 1982) and Generalised ARCH (GARCH) models (Bollerslev, 1986). ARCH and GARCH models are regression-based models estimated using maximum likelihood, and are capable of capturing stylised facts about financial time series such as volatility clustering (Bauwens et al., 2006). The ARCH($p$) model describes the conditional volatility as a function of $p$ lagged squared residuals, and the GARCH($p$,$q$) model additionally includes contributions from the last $q$ conditional variances. Many variants of the GARCH model have been proposed to better capture properties of financial time series; for example, the EGARCH (Nelson, 1991) and GJR-GARCH (Glosten et al., 1993) models were designed to capture the so-called leverage effect, which describes the negative relationship between asset price and volatility.

In a multivariate setting, instead of modelling only time-varying conditional variances, for an $n$-dimensional system we estimate the $n\times n$ time-varying variance-covariance matrix. This allows us to investigate interactions between the volatility of different time series and whether there is a transmission of volatility (spillover effect) between markets (Bauwens et al., 2006; Erten et al., 2012). Popular multivariate GARCH models include the VEC model (Bollerslev et al., 1988), the BEKK model (Engle and Kroner, 1995), the GO-GARCH model (Van Der Weide, 2002) and the DCC model (Christodoulakis and Satchell, 2002; Tse and Tsui, 2002; Engle, 2002).

In this paper we focus specifically on GARCH(1,1) models in the univariate case and the diagonal BEKK(1,1) model in the multivariate case to model daily financial asset returns. We consider several asset classes such as foreign exchange rates, commodities and stock indices. GARCH(1,1) models work well in general practical settings due to their simplicity and robustness to overfitting (Wu et al., 2013).

In traditional GARCH models, the estimated coefficients are constant, which implies a stationary returns process with a constant unconditional mean and variance (Bollerslev, 1986). However, there is evidence in the existing literature that relaxing the stationarity constraint on the returns time series can often lead to better performance, as it allows the model to better capture time-varying market conditions. In Stǎricǎ and Granger (2005) the authors modelled daily S&P 500 returns with locally stationary models and found that most of the dynamics were concentrated in shifts of the unconditional variance, and forecasts based on non-stationary unconditional modelling yielded better performance than a stationary GARCH(1,1) model. Similarly, the authors in Wu et al. (2013) designed a GARCH(1,1) model with time-varying coefficients that followed a random walk process, and they reported better forecasting performance on the test dataset relative to the GARCH(1,1) model.

To this end, we propose univariate and multivariate GARCH models with time-varying coefficients that are parameterised by a recurrent neural network. Our method combines the simplicity and interpretability of GARCH models with the expressive power of neural networks, and this approach follows a trend in the literature that combines classical time series models with deep learning. In Rangapuram et al. (2018), for example, the authors proposed to parameterise the coefficients of a linear Gaussian state space model with a recurrent neural network, and the latent states were then inferred using a Kalman filter. This approach is advantageous as the neural network allows modelling of more complex relationships between time steps whilst preserving the structural form of the state space model. Similarly, by preserving the structural form of the BEKK model, we obtain covariance matrices that are symmetric and positive definite (Engle and Kroner, 1995) without the need to impose further constraints. We treat the time-varying GARCH coefficients as latent variables to be inferred, and to achieve this we leverage recent advances in amortised variational inference in the form of the variational autoencoder (VAE) (Kingma and Welling, 2014), and subsequent combinations of a VAE with a recurrent neural network (the so-called variational RNN, or VRNN) (Chung et al., 2015; Bayer and Osendorfer, 2014; Krishnan et al., 2017; Fabius and van Amersfoort, 2015; Fraccaro et al., 2016; Karl et al., 2017), to allow efficient structured inference over a sequence of latent random variables.

The rest of the paper is organised as follows: in Section 2 we outline the preliminary mathematical concepts of GARCH modelling and amortised variational inference, in Section 3 we introduce the generative and inference model components of Neural GARCH, and in Section 4 we present the performance of Neural GARCH on univariate and multivariate daily returns time series covering foreign exchange rates, commodity prices, and stock indices.

2 Preliminaries

2.1 Univariate GARCH Model

The GARCH($p$,$q$) model (Bollerslev, 1986) for a returns process $r_t$ is specified in terms of the conditional mean equation:

$r_t \sim \mathcal{N}(0,\sigma_t^2)$, (1)

and the conditional variance equation:

$\sigma_t^2 = \omega + \sum_{i=1}^{p}\alpha_i r_{t-i}^2 + \sum_{j=1}^{q}\beta_j\sigma_{t-j}^2$. (2)

Under the GARCH(1,1) model, the returns process $r_t$ is covariance stationary with a constant unconditional mean and variance given by $\mathbb{E}[r_t]=0$ and $\mathbb{E}[r_t^2]=\frac{\omega}{1-\alpha-\beta}$, where $\omega>0$, $\alpha\geq 0$ and $\beta\geq 0$ ensure that $\sigma_t^2>0$, and $\alpha+\beta<1$ ensures a finite unconditional variance. For parameter estimation assuming normal innovations, the following log-likelihood function is maximised:

$\mathcal{L} = -\sum_{t=1}^{T}\left(\frac{1}{2}\log(\sigma_t^2) + \frac{r_t^2}{2\sigma_t^2}\right)$. (3)

To model the leptokurtic (fat-tailed) behaviour of financial returns, Bollerslev (1987) considered GARCH models with Student's t innovations, with the following log-likelihood function to be maximised:

$\mathcal{L} = \sum_{t=1}^{T}\left(\log\Gamma\left(\frac{\nu+1}{2}\right) - \log\Gamma\left(\frac{\nu}{2}\right) - \frac{1}{2}\log(\pi(\nu-2)) - \frac{1}{2}\log(\sigma_t^2) - \frac{\nu+1}{2}\log\left(1+\frac{r_t^2}{(\nu-2)\sigma_t^2}\right)\right)$, (4)

where $\nu>2$ is the degrees-of-freedom parameter and $\Gamma$ is the gamma function.
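
As a concrete illustration of (2)-(4), the sketch below (our own, not the authors' released code) filters GARCH(1,1) conditional variances for fixed coefficients and evaluates both log-likelihoods; initialising $\sigma_1^2$ with the sample variance is an assumption on our part.

```python
import numpy as np
from scipy.special import gammaln

def garch11_variance(r, omega, alpha, beta):
    """Filter conditional variances sigma_t^2 for a returns series r, cf. (2)."""
    sigma2 = np.empty_like(r)
    sigma2[0] = r.var()  # initialisation is an assumption, not specified above
    for t in range(1, len(r)):
        sigma2[t] = omega + alpha * r[t - 1] ** 2 + beta * sigma2[t - 1]
    return sigma2

def normal_loglik(r, sigma2):
    # Gaussian log-likelihood of (3); additive constants omitted
    return -0.5 * np.sum(np.log(sigma2) + r ** 2 / sigma2)

def student_t_loglik(r, sigma2, nu):
    # standardised Student's t log-likelihood of (4), valid for nu > 2
    const = gammaln((nu + 1) / 2) - gammaln(nu / 2) - 0.5 * np.log(np.pi * (nu - 2))
    return np.sum(const - 0.5 * np.log(sigma2)
                  - (nu + 1) / 2 * np.log1p(r ** 2 / ((nu - 2) * sigma2)))
```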

2.2 BEKK Model

The BEKK multivariate GARCH model (Engle and Kroner, 1995) parameterises an $n$-dimensional multivariate returns process $\boldsymbol{r}_t \in \mathbb{R}^n$:

$\boldsymbol{r}_t \sim \mathcal{N}(0,\boldsymbol{\Sigma}_t)$, (5)
$\boldsymbol{\Sigma}_t = \boldsymbol{\Omega}^T\boldsymbol{\Omega} + \sum_{i=1}^{p}\boldsymbol{A}_i^T\boldsymbol{r}_{t-i}\boldsymbol{r}_{t-i}^T\boldsymbol{A}_i + \sum_{j=1}^{q}\boldsymbol{B}_j^T\boldsymbol{\Sigma}_{t-j}\boldsymbol{B}_j$, (6)

where $\boldsymbol{\Sigma}_t$ is the $n\times n$ symmetric and positive-definite covariance matrix, $\boldsymbol{\Omega}$ is an upper triangular matrix with $\frac{n(n+1)}{2}$ non-zero entries, and $\boldsymbol{A}$ and $\boldsymbol{B}$ are $n\times n$ coefficient matrices. In this paper we consider the diagonal BEKK model, in which $\boldsymbol{A}$ and $\boldsymbol{B}$ are diagonal matrices.
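
The recursion in (6) is straightforward to implement; the following minimal sketch (our illustration, with our own variable names) filters the covariance matrices of a diagonal BEKK(1,1) model and shows why each $\boldsymbol{\Sigma}_t$ is symmetric positive semi-definite by construction.

```python
import numpy as np

def diagonal_bekk_filter(R, Omega, a_diag, b_diag, Sigma0):
    """R: (T, n) returns; Omega: (n, n) upper triangular; a_diag, b_diag: (n,) diagonals."""
    A, B = np.diag(a_diag), np.diag(b_diag)
    C = Omega.T @ Omega                    # constant term, PSD by construction
    Sigmas = [Sigma0]
    for t in range(1, len(R)):
        rr = np.outer(R[t - 1], R[t - 1])  # r_{t-1} r_{t-1}^T
        # each summand is a congruence M^T X M of a PSD matrix X, hence PSD
        Sigmas.append(C + A.T @ rr @ A + B.T @ Sigmas[-1] @ B)
    return np.stack(Sigmas)
```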

2.3 Neural Network Variational Inference

For a latent variable model with parameters $\theta$, target variable $y$ and latent variable $z$, we wish to maximise the marginal likelihood with the latent variable integrated out, which often involves an intractable integral:

$\log P_\theta(y) = \log\int P_\theta(y|z)P_\theta(z)\,dz$. (7)

Instead, we perform variational inference by approximating the true posterior distribution $P_\theta(z|y)$ with a variational approximation $q_\phi(z|y)$ and maximising the evidence lower bound ($ELBO$), where $\log P_\theta(y) \geq ELBO$; this is equivalent to minimising the Kullback-Leibler ($KL$) divergence between the variational posterior $q_\phi(z|y)$ and the true posterior $P_\theta(z|y)$ (Kingma and Welling, 2014):

$\log P_\theta(y) = ELBO + KL(q_\phi(z|y)\,||\,P_\theta(z|y))$, (8)

where the $ELBO$ is given by:

$ELBO = \mathbb{E}_{z\sim q_\phi(z|y)}[\log P_\theta(y|z)] - KL(q_\phi(z|y)\,||\,P_\theta(z))$, (9)

where $P_\theta(z)$ is a prior distribution for $z$; in a variational autoencoder (VAE), the generative and inference distributions $P_\theta(y|z)$ and $q_\phi(z|y)$ are parameterised by neural networks. An uninformative prior such as $\mathcal{N}(0,1)$ is often used for $P_\theta(z)$; in our model, however, we adopt a learned prior distribution $P_\theta(z|\mathcal{I}_{t-1})$, where $\mathcal{I}_{t-1}$ is the information set up to time step $t-1$. This learned-prior approach has achieved great success in sequential generation tasks such as video prediction (Franceschi et al., 2020; Denton and Fergus, 2018).
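
For concreteness, the sketch below (our illustration; all names are ours) computes a one-sample Monte Carlo estimate of the $ELBO$ in (9) for a diagonal Gaussian posterior and prior, using the reparameterisation trick of Kingma and Welling (2014); `log_likelihood_fn` is a placeholder for the model's likelihood term.

```python
import torch

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """Closed-form KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians."""
    var_q, var_p = logvar_q.exp(), logvar_p.exp()
    return 0.5 * torch.sum(logvar_p - logvar_q
                           + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def elbo_one_sample(log_likelihood_fn, mu_q, logvar_q, mu_p, logvar_p):
    eps = torch.randn_like(mu_q)                 # reparameterised draw z ~ q(z|y)
    z = mu_q + (0.5 * logvar_q).exp() * eps
    return log_likelihood_fn(z) - gaussian_kl(mu_q, logvar_q, mu_p, logvar_p)
```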

3 Materials and Methods

3.1 Neural GARCH Models

In this section we introduce the intuition behind neural GARCH and its various components. We focus specifically on univariate and multivariate GARCH(1,1) models, as we would like to keep the GARCH model structure as simple as possible and delegate the modelling of complex relationships between time steps to the underlying neural network, which outputs the coefficients of the GARCH models. For the rest of this paper, we use the terms multivariate GARCH(1,1) and BEKK(1,1) interchangeably when referring to multivariate systems.

In neural GARCH, the coefficients $\{\omega, \alpha, \beta\}$ in the univariate case and $\{\boldsymbol{\Omega}, \boldsymbol{A}, \boldsymbol{B}\}$ in the multivariate case are allowed to vary freely with time. This approach allows the model to capture the time-varying nature of market dynamics (Wu et al., 2013). The GARCH(1,1) and BEKK(1,1) models thus become:

$\sigma_t^2 = \omega_t + \alpha_t r_{t-1}^2 + \beta_t\sigma_{t-1}^2$, (10)
$\boldsymbol{\Sigma}_t = \boldsymbol{\Omega}_t^T\boldsymbol{\Omega}_t + \boldsymbol{A}_t^T\boldsymbol{r}_{t-1}\boldsymbol{r}_{t-1}^T\boldsymbol{A}_t + \boldsymbol{B}_t^T\boldsymbol{\Sigma}_{t-1}\boldsymbol{B}_t$. (11)

For notational purposes we define the parameter set $\boldsymbol{\gamma}_t = [\omega_t, \alpha_t, \beta_t]^T$ for GARCH(1,1) and $\boldsymbol{\gamma}_t = [\boldsymbol{\Omega}_t, \boldsymbol{A}_t, \boldsymbol{B}_t]^T$ for BEKK(1,1).

In our proposed framework, $\boldsymbol{\gamma}_t$ is a multivariate normal latent random variable with a diagonal covariance matrix, to be estimated at every time step. For GARCH(1,1) with normal innovations, this involves estimating a vector of size 3:

$\boldsymbol{\gamma}_t = [\omega_t, \alpha_t, \beta_t]^T \sim \mathcal{N}(\boldsymbol{\mu}_t, \boldsymbol{\Sigma}_{\gamma,t})$, (12)

and the vector $[\sigma_{\omega_t}^2, \sigma_{\alpha_t}^2, \sigma_{\beta_t}^2]^T$ represents the diagonal elements of the covariance matrix $\boldsymbol{\Sigma}_{\gamma,t}$. We write the covariance matrix of the parameter set $\boldsymbol{\gamma}_t$ as $\boldsymbol{\Sigma}_{\gamma,t}$ to distinguish it from the covariance matrix of the asset returns $\boldsymbol{\Sigma}_t$. For neural GARCH(1,1) with Student's t innovations, $\boldsymbol{\gamma}_t$ is augmented with the degrees-of-freedom parameter $\nu_t$ such that $\boldsymbol{\gamma}_t = [\omega_t, \alpha_t, \beta_t, \nu_t]^T$.

For the multivariate diagonal BEKK(1,1), we adopt a similar methodology. For a system of $n$ assets, $\boldsymbol{\gamma}_t$ of a model with normal innovations is a vector of size $2n+\frac{n(n+1)}{2}$ (Engle and Kroner, 1995), and with Student's t innovations $\boldsymbol{\gamma}_t$ is of size $2n+\frac{n(n+1)}{2}+1$. As an example, for a system of 2 assets ($n=2$), the BEKK model is given by:

$\boldsymbol{\Sigma}_t = \begin{bmatrix}c_{11,t}&0\\ c_{12,t}&c_{22,t}\end{bmatrix}\begin{bmatrix}c_{11,t}&c_{12,t}\\ 0&c_{22,t}\end{bmatrix} + \begin{bmatrix}a_{11,t}&0\\ 0&a_{22,t}\end{bmatrix}\begin{bmatrix}r_{1,t-1}\\ r_{2,t-1}\end{bmatrix}\begin{bmatrix}r_{1,t-1}\\ r_{2,t-1}\end{bmatrix}^T\begin{bmatrix}a_{11,t}&0\\ 0&a_{22,t}\end{bmatrix} + \begin{bmatrix}b_{11,t}&0\\ 0&b_{22,t}\end{bmatrix}\begin{bmatrix}\sigma_{11,t-1}^2&\sigma_{12,t-1}^2\\ \sigma_{21,t-1}^2&\sigma_{22,t-1}^2\end{bmatrix}\begin{bmatrix}b_{11,t}&0\\ 0&b_{22,t}\end{bmatrix}$, (13)

where $a_{ij,t}$ denotes the $(i,j)$-th element of the matrix $\boldsymbol{A}_t$. The parameter set $\boldsymbol{\gamma}_t$, which also follows a multivariate normal distribution, is given by:

$\boldsymbol{\gamma}_t = [a_{11,t}, a_{22,t}, b_{11,t}, b_{22,t}, c_{11,t}, c_{12,t}, c_{22,t}]^T$. (14)
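
To make the parameterisation explicit, the sketch below (our illustration, not the authors' code) maps the vector $\boldsymbol{\gamma}_t$ of (14) to the matrices of (13) and performs one covariance update for $n=2$ assets.

```python
import numpy as np

def bekk2_step(gamma_t, r_prev, Sigma_prev):
    """gamma_t = [a11, a22, b11, b22, c11, c12, c22]; r_prev: (2,); Sigma_prev: (2, 2)."""
    a11, a22, b11, b22, c11, c12, c22 = gamma_t
    A = np.diag([a11, a22])
    B = np.diag([b11, b22])
    Omega = np.array([[c11, c12],
                      [0.0, c22]])          # upper triangular, cf. (13)
    rr = np.outer(r_prev, r_prev)
    return Omega.T @ Omega + A.T @ rr @ A + B.T @ Sigma_prev @ B
```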

The main contribution of our paper is the estimation of $\boldsymbol{\gamma}_t$ with a recurrent neural network (RNN) and a multilayer perceptron (MLP); we provide the exact estimation schemes in Sections 3.2 and 3.3. Since we assume a multivariate normal distribution with a diagonal covariance matrix for $\boldsymbol{\gamma}_t$, our neural network estimates the means and variances of the elements of $\boldsymbol{\gamma}_t$.

3.2 Generative Model

The generative model distribution $P_\theta(\boldsymbol{r}_{1:T},\boldsymbol{\Sigma}_{1:T},\boldsymbol{\gamma}_{1:T})$ of a general multivariate neural GARCH is presented in Figure 1 and given by (15). For the univariate case, one simply replaces $\boldsymbol{\Sigma}_t$ in (15) with $\sigma_t^2$.

$P_\theta(\boldsymbol{r}_{1:T},\boldsymbol{\Sigma}_{1:T},\boldsymbol{\gamma}_{1:T}) = P(\boldsymbol{\gamma}_0)P(\boldsymbol{\Sigma}_0)\prod_{t=1}^{T}P_\theta(\boldsymbol{r}_t|\boldsymbol{\Sigma}_t)P_\theta(\boldsymbol{\Sigma}_t|\boldsymbol{\gamma}_t,\boldsymbol{r}_{t-1},\boldsymbol{\Sigma}_{t-1})P_\theta(\boldsymbol{\gamma}_t|\boldsymbol{\gamma}_{t-1},\boldsymbol{r}_{1:t-1})$. (15)

The initial priors were set to delta distributions: $P(\boldsymbol{\Sigma}_0)$ was centered on the covariance matrix estimated from the training dataset, and $P(\boldsymbol{\gamma}_0)$ was centered on a vector of ones. The predictive distribution $P_\theta(\boldsymbol{\gamma}_t|\boldsymbol{\gamma}_{t-1},\boldsymbol{r}_{1:t-1})$ takes as input the information set $\mathcal{I}_{t-1}=\{\boldsymbol{\gamma}_{t-1},\boldsymbol{r}_{1:t-1}\}$ and predicts the 1-step-ahead value $\boldsymbol{\gamma}_t$. For this parameterisation, we leverage a recurrent neural network to carry $\boldsymbol{r}_{1:t-1}$ such that:

$P_\theta(\boldsymbol{\gamma}_t|\boldsymbol{\gamma}_{t-1},\boldsymbol{r}_{1:t-1}) = P_\theta(\boldsymbol{\gamma}_t|\boldsymbol{\gamma}_{t-1},\boldsymbol{h}_{t-1})$, (16)

where $\boldsymbol{h}_t$ is the hidden state of the underlying RNN; in our model we use a gated recurrent unit (GRU) (Cho et al., 2014). We then use an MLP which takes $\mathcal{I}_{t-1}$ as input and maps it to the means and variances of the elements of $\boldsymbol{\gamma}_t$. In the 2-dimensional example given in (14), the estimation is done using:

$[\mu_{a_{11,t}},...,\mu_{c_{22,t}},\sigma^2_{a_{11,t}},...,\sigma^2_{c_{22,t}}]^T = MLP_{pred}(\boldsymbol{\gamma}_{t-1},\boldsymbol{h}_{t-1})$, (17)

and we apply a sigmoid function to the neural network output to ensure that the estimated variances of the elements of $\boldsymbol{\gamma}_t$, and hence the GARCH coefficients, are non-negative. We also tested other ways of ensuring non-negativity, such as a softplus function, but found that the sigmoid gave the best performance. For neural GARCH with Student's t innovations we require $\nu>2$ in order to have a well-defined covariance; since applying the sigmoid ensures that our estimated coefficients are non-negative, we estimate $\nu'=\nu-2$ (instead of $\nu$ directly) so that $\nu>2$ is guaranteed.

The conditional distribution $P_\theta(\boldsymbol{\Sigma}_t|\boldsymbol{\gamma}_t,\boldsymbol{r}_{t-1},\boldsymbol{\Sigma}_{t-1})$ is a delta distribution centered on (10) in the univariate case and (11) in the multivariate case, since the covariance matrix $\boldsymbol{\Sigma}_t$ can be calculated deterministically given $\{\boldsymbol{\gamma}_t,\boldsymbol{r}_{t-1},\boldsymbol{\Sigma}_{t-1}\}$. The distribution $P_\theta(\boldsymbol{r}_t|\boldsymbol{\Sigma}_t)$ is the likelihood function, whose logarithm (in the univariate case) is given in (3) for normal innovations and (4) for Student's t innovations.

Figure 1: Generative model of neural GARCH. The generative MLP takes as input $\{\boldsymbol{\gamma}_{t-1},\boldsymbol{h}_{t-1}\}$ and outputs the estimated means and variances of the elements of $\boldsymbol{\gamma}_t$.

3.3 Inference Model

The inference model distribution $q_\phi(\boldsymbol{\Sigma}_{1:T},\boldsymbol{\gamma}_{1:T}|\boldsymbol{r}_{1:T})$ is presented in Figure 2 and can be factorised as:

$q_\phi(\boldsymbol{\Sigma}_{1:T},\boldsymbol{\gamma}_{1:T}|\boldsymbol{r}_{1:T}) = P(\boldsymbol{\gamma}_0)P(\boldsymbol{\Sigma}_0)\prod_{t=1}^{T}q_\phi(\boldsymbol{\Sigma}_t|\boldsymbol{\gamma}_t,\boldsymbol{r}_{t-1},\boldsymbol{\Sigma}_{t-1})q_\phi(\boldsymbol{\gamma}_t|\boldsymbol{\gamma}_{t-1},\boldsymbol{r}_{1:t})$, (18)

where $P(\boldsymbol{\gamma}_0)$ and $P(\boldsymbol{\Sigma}_0)$ are the same as in the generative model, and $q_\phi(\boldsymbol{\Sigma}_t|\boldsymbol{\gamma}_t,\boldsymbol{r}_{t-1},\boldsymbol{\Sigma}_{t-1})$ has the same functional form (a delta distribution) as $P_\theta(\boldsymbol{\Sigma}_t|\boldsymbol{\gamma}_t,\boldsymbol{r}_{t-1},\boldsymbol{\Sigma}_{t-1})$; however, $\boldsymbol{\gamma}_t$ is now drawn from the posterior distribution $q_\phi(\boldsymbol{\gamma}_t|\boldsymbol{\gamma}_{t-1},\boldsymbol{r}_{1:t})$, where:

$q_\phi(\boldsymbol{\gamma}_t|\boldsymbol{\gamma}_{t-1},\boldsymbol{r}_{1:t}) = q_\phi(\boldsymbol{\gamma}_t|\boldsymbol{\gamma}_{t-1},\boldsymbol{h}_t)$. (19)

We note that the generative and inference networks share the same underlying recurrent neural network but use information from different time steps: the generative model predicts $\boldsymbol{\gamma}_t$ using the information set $\mathcal{I}_{t-1}$, whereas the inference model infers $\boldsymbol{\gamma}_t$ using $\mathcal{I}_t$. The inference MLP ($MLP_{inf}$), however, is different from the generative MLP ($MLP_{pred}$), and it outputs the posterior estimates of the elements of $\boldsymbol{\gamma}_t$:

$[\mu_{a_{11,t}},...,\mu_{c_{22,t}},\sigma^2_{a_{11,t}},...,\sigma^2_{c_{22,t}}]_{post}^T = MLP_{inf}(\boldsymbol{\gamma}_{t-1},\boldsymbol{h}_t)$. (20)
Figure 2: Inference model of neural GARCH. The inference MLP outputs the posterior estimate of $\boldsymbol{\gamma}_t$ conditioned on available information up to time $t$.
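
To make Sections 3.2 and 3.3 concrete, the condensed PyTorch sketch below shows one possible implementation of the shared GRU and the two MLPs. Layer sizes follow Section 3.6, but all names are ours, and squashing both the coefficient means and their variances with a sigmoid is our reading of the non-negativity constraint; this is not the authors' released code.

```python
import torch
import torch.nn as nn

class NeuralGARCHNets(nn.Module):
    def __init__(self, gamma_dim=3, n_assets=1, hidden=64):
        super().__init__()
        self.gru = nn.GRU(input_size=n_assets, hidden_size=hidden, batch_first=True)
        def make_mlp():
            return nn.Sequential(
                nn.Linear(gamma_dim + hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 2 * gamma_dim))   # means and variances
        self.mlp_pred = make_mlp()   # prior network, cf. (17): uses h_{t-1}
        self.mlp_inf = make_mlp()    # posterior network, cf. (20): uses h_t

    def encode(self, returns):      # returns: (batch, T, n_assets)
        h, _ = self.gru(returns)
        return h                    # h[:, t] summarises the returns seen so far

    def _params(self, mlp, gamma_prev, h):
        mu, var = mlp(torch.cat([gamma_prev, h], dim=-1)).chunk(2, dim=-1)
        # one reading of the sigmoid constraint of Section 3.2: squash both the
        # coefficient means and their variances into (0, 1) to keep them non-negative
        return torch.sigmoid(mu), torch.sigmoid(var)

    def prior(self, gamma_prev, h_prev):    # P(gamma_t | gamma_{t-1}, h_{t-1})
        return self._params(self.mlp_pred, gamma_prev, h_prev)

    def posterior(self, gamma_prev, h_t):   # q(gamma_t | gamma_{t-1}, h_t)
        return self._params(self.mlp_inf, gamma_prev, h_t)
```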

3.4 Model Training

For neural network training we optimise the generative and inference model parameters ($\theta$ and $\phi$) jointly using stochastic gradient variational Bayes (Kingma and Welling, 2014). Our objective function is the $ELBO$, defined as:

$ELBO(\theta,\phi) = \sum_{t=1}^{T}\mathbb{E}_{\boldsymbol{\gamma}_t\sim q_\phi}[\log P_\theta(\boldsymbol{r}_t|\boldsymbol{\gamma}_t)] - KL(q_\phi(\boldsymbol{\gamma}_t|\boldsymbol{\gamma}_{t-1},\boldsymbol{r}_{1:t})\,||\,P_\theta(\boldsymbol{\gamma}_t|\boldsymbol{\gamma}_{t-1},\boldsymbol{r}_{1:t-1}))$, (21)

and we seek:

$\{\theta^*,\phi^*\} = \operatorname*{argmax}_{\theta,\phi} ELBO(\theta,\phi)$. (22)
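
A schematic training step for (21)-(22) in the univariate normal case might look as follows, reusing `gaussian_kl` and `NeuralGARCHNets` from the earlier sketches; this is our illustration of the recursion, with initialisation and batching details omitted.

```python
import torch

def elbo_univariate(nets, returns, gamma0, sigma2_0):
    """One-series ELBO of (21) with normal innovations; returns: (T,) tensor."""
    h = nets.encode(returns.view(1, -1, 1)).squeeze(0)
    gamma_prev, sigma2_prev = gamma0, sigma2_0
    elbo = returns.new_zeros(())
    for t in range(1, len(returns)):
        mu_p, var_p = nets.prior(gamma_prev, h[t - 1])        # P(gamma_t | I_{t-1})
        mu_q, var_q = nets.posterior(gamma_prev, h[t])        # q(gamma_t | I_t)
        gamma_t = mu_q + var_q.sqrt() * torch.randn_like(mu_q)  # reparameterised draw
        omega, alpha, beta = gamma_t                          # gamma_dim = 3
        sigma2_t = omega + alpha * returns[t - 1] ** 2 + beta * sigma2_prev  # eq. (10)
        loglik = -0.5 * (torch.log(sigma2_t) + returns[t] ** 2 / sigma2_t)   # eq. (3)
        elbo = elbo + loglik - gaussian_kl(mu_q, var_q.log(), mu_p, var_p.log())
        gamma_prev, sigma2_prev = gamma_t, sigma2_t
    return elbo   # maximise over (theta, phi), i.e. minimise -elbo with an optimiser
```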

3.5 Model Prediction

Neural GARCH produces 1-step-ahead conditional volatility predictions. Given $\mathcal{I}_t=\{\boldsymbol{\gamma}_t,\boldsymbol{\Sigma}_t,\boldsymbol{r}_{1:t}\}$, we use (17) to obtain our prediction of $\boldsymbol{\gamma}_{t+1}$ by drawing from the multivariate normal distribution whose parameters are given by $MLP_{pred}$. We then obtain our estimate of $\boldsymbol{\Sigma}_{t+1}$ deterministically using (11). To estimate $\boldsymbol{\Sigma}_{t+2}$, we now have access to $\boldsymbol{r}_{t+1}$, and we therefore obtain the posterior estimate of $\boldsymbol{\gamma}_{t+1}$ using (20) and predict $\boldsymbol{\Sigma}_{t+2}$ using the posterior estimate of $\boldsymbol{\Sigma}_{t+1}$. This posterior update is crucial, as it ensures that we use all available and up-to-date information to predict the next covariance matrix.
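
In code, the univariate prediction step might look like the sketch below (again our illustration, reusing `NeuralGARCHNets` from the sketch in Section 3.3):

```python
import torch

def predict_next_variance(nets, returns, gamma_t, sigma2_t):
    """1-step-ahead sigma^2_{t+1} given I_t; returns: (t,) tensor of r_1..r_t."""
    h_t = nets.encode(returns.view(1, -1, 1)).squeeze(0)[-1]  # summarises r_{1:t}
    mu, var = nets.prior(gamma_t, h_t)                        # P(gamma_{t+1} | I_t)
    gamma_next = mu + var.sqrt() * torch.randn_like(mu)       # or simply take mu
    omega, alpha, beta = gamma_next
    return omega + alpha * returns[-1] ** 2 + beta * sigma2_t  # eq. (10)
```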

3.6 Experiments

We test neural GARCH on a range of daily asset log-returns time series covering univariate and multivariate foreign exchange rates (20 pairs), commodity prices (Brent crude, silver and gold) and stock indices (DAX, S&P 500, NASDAQ, FTSE 100, Dow Jones). We provide a brief data description in Table 1.

Table 1: Description of asset log-returns time series analysed in our experiments.
Dataset | No. of time series | Frequency | Observations | Date range
Foreign exchange | 20 | daily | 3128 | 05/08/2011 - 05/08/2021
Brent crude | 1 | daily | 2065 | 05/08/2013 - 05/08/2021
Silver & gold | 2 | daily | 3109 | 05/08/2011 - 05/08/2021
Stock indices | 5 | daily | 2054 | 05/08/2013 - 05/08/2021

For model training, we split each time series such that 80% was used for training, 10% for validation and 10% for testing. The underlying recurrent neural network (GRU) has a hidden state of size 64; the generative and inference MLPs ($MLP_{pred}$ and $MLP_{inf}$) are both 3-layer MLPs with 64 hidden nodes and ReLU activation functions.

For univariate time series, we compare the performance of six models: GARCH(1,1)-Normal, GARCH(1,1)-Student's t, Neural-GARCH(1,1), Neural-GARCH(1,1)-Student's t, EGARCH(1,1,1)-Normal and EGARCH(1,1,1)-Student's t. Although neural GARCH is an adaptation of the GARCH(1,1) model, we include the EGARCH(1,1,1) model (where the middle index denotes the order of the asymmetric term) as a benchmark, because it is capable of accounting for the asymmetric leverage effect: negative shocks lead to larger volatilities than positive shocks. We would like to investigate whether the data-driven approach of neural GARCH allows it to model the leverage effect without the explicit dependence on an asymmetric term as in an EGARCH model. For multivariate time series, we compare the performance of multivariate GARCH(1,1) (BEKK(1,1)) with normal and Student's t innovations against their neural network adaptations. We evaluate model performance using the log-likelihood of the test dataset.

4 Results & Discussion

In Tables 2, 3, 4 and 5 we provide the log-likelihoods evaluated on the test dataset for commodity prices, stock indices, and univariate and multivariate foreign exchange time series; the best model for each time series is highlighted in bold. For commodity prices, we observe that EGARCH(1,1,1)-Student's t is the best performer on Brent crude, whilst Neural-GARCH(1,1)-Student's t performs best on silver and gold price returns.

For stock indices, we observe that Neural-GARCH(1,1)-Student's t performs best on the DAX and Dow Jones indices, whilst EGARCH(1,1,1)-Student's t performs best on the S&P 500, NASDAQ and FTSE 100. The fact that the neural GARCH models perform better than EGARCH on some datasets shows that our data-driven approach can learn to accommodate many, but not all, scenarios of the leverage effect; in cases where EGARCH outperforms, there are benefits to modelling the asymmetric effect directly. For univariate foreign exchange time series, we observe that the neural GARCH variants outperform traditional GARCH models on 16 out of 20 time series; where neural GARCH outperforms, Neural-GARCH(1,1) with normal innovations is the better variant on 5 of the 16 series and Neural-GARCH(1,1)-Student's t on the remaining 11.

Table 2: Test log-likelihoods for commodity price time series. Best result highlighted in bold; higher log-likelihood is better.
Time series | GARCH(1,1)-Normal | GARCH(1,1)-Student's t | Neural-GARCH(1,1) | Neural-GARCH(1,1)-Student's t | EGARCH(1,1,1)-Normal | EGARCH(1,1,1)-Student's t
BRENT | -298.738 | -298.689 | -307.921 | -295.895 | -299.966 | $\boldsymbol{-292.798}$
SILVER | -554.595 | -551.936 | -541.713 | $\boldsymbol{-514.476}$ | -572.780 | -581.834
GOLD | -462.28 | -450.752 | -473.074 | $\boldsymbol{-421.566}$ | -462.857 | -468.509
Table 3: Test log-likelihoods for stock index time series.
Time series | GARCH(1,1)-Normal | GARCH(1,1)-Student's t | Neural-GARCH(1,1) | Neural-GARCH(1,1)-Student's t | EGARCH(1,1,1)-Normal | EGARCH(1,1,1)-Student's t
DAX | -261.275 | -268.944 | -259.321 | $\boldsymbol{-244.190}$ | -257.767 | -266.163
SNP | -300.849 | -298.614 | -308.559 | -295.934 | -300.577 | $\boldsymbol{-284.841}$
NASDAQ | -327.547 | -326.401 | -331.539 | -320.387 | -334.237 | $\boldsymbol{-312.366}$
FTSE | -324.437 | $\boldsymbol{-314.480}$ | -326.572 | -315.606 | -322.425 | $\boldsymbol{-311.135}$
DOW | -298.406 | -302.196 | -315.164 | $\boldsymbol{-284.247}$ | -292.974 | -293.486

For multivariate foreign exchange time series, we observe that Neural-BEKK(1,1)-Student's t is the best performer on 8 of the 9 time series considered. Across different asset classes, the Student's t version of neural GARCH consistently performs better than the traditional GARCH models as well as neural GARCH with normal innovations. This suggests that a model with Student's t innovations does indeed capture the leptokurtic behaviour of financial returns better than a model with normal innovations, a finding in line with our expectations from the literature (for example Bollerslev (1987) and Heracleous (2007)).

Table 4: Test log-likelihoods for univariate foreign exchange time series.
Time series | GARCH(1,1)-Normal | GARCH(1,1)-Student's t | Neural-GARCH(1,1) | Neural-GARCH(1,1)-Student's t | EGARCH(1,1,1)-Normal | EGARCH(1,1,1)-Student's t
AUDCAD | $\boldsymbol{-397.251}$ | -402.582 | -409.553 | -398.645 | -397.776 | -473.302
AUDCHF | -311.566 | -308.029 | $\boldsymbol{-293.853}$ | -294.010 | -309.295 | -312.965
AUDJPY | -346.024 | -350.401 | -353.213 | $\boldsymbol{-335.945}$ | -346.478 | -354.095
AUDNZD | -303.986 | -318.345 | -307.44 | $\boldsymbol{-301.514}$ | -303.627 | -322.777
AUDUSD | -423.602 | -424.594 | -432.518 | $\boldsymbol{-422.753}$ | -424.498 | -425.807
CADJPY | -351.749 | -359.545 | $\boldsymbol{-349.209}$ | -349.842 | -350.460 | -362.875
CHFJPY | -238.566 | -241.360 | -215.536 | $\boldsymbol{-208.710}$ | -230.120 | -253.050
EURAUD | -338.378 | -344.922 | -347.995 | $\boldsymbol{-336.604}$ | -337.481 | -347.259
EURCAD | -347.177 | -359.499 | $\boldsymbol{-345.989}$ | -347.730 | -346.547 | -366.701
EURCHF | -277.643 | -153.502 | -156.567 | $\boldsymbol{-142.963}$ | -275.073 | -321.051
EURGBP | -366.187 | -378.950 | -373.515 | $\boldsymbol{-364.619}$ | -364.727 | -389.416
EURJPY | -266.674 | -278.327 | -267.374 | $\boldsymbol{-256.341}$ | -262.667 | -290.897
EURUSD | -332.917 | -347.818 | $\boldsymbol{-330.471}$ | -334.488 | -334.178 | -361.348
GBPAUD | $\boldsymbol{-335.530}$ | -346.944 | -353.800 | -344.842 | -335.812 | -353.034
GBPJPY | -330.030 | -348.729 | -337.981 | $\boldsymbol{-324.559}$ | -329.013 | -359.506
GBPUSD | $\boldsymbol{-418.593}$ | -431.554 | -423.460 | -419.658 | -420.534 | -441.162
NZDUSD | $\boldsymbol{-415.648}$ | -416.944 | -425.841 | -417.380 | -416.094 | -417.153
USDCAD | -408.008 | -416.483 | $\boldsymbol{-404.614}$ | -413.507 | -406.735 | -419.863
USDCHF | -315.963 | -303.351 | -276.461 | $\boldsymbol{-260.177}$ | -282.682 | -308.410
USDJPY | -295.295 | -304.539 | -291.419 | $\boldsymbol{-277.477}$ | -294.519 | -318.100
Table 5: Test log-likelihoods for multivariate foreign exchange time series.
Time series | GARCH(1,1)-Normal | GARCH(1,1)-Student's t | Neural-GARCH(1,1) | Neural-GARCH(1,1)-Student's t
EURGBP,EURCHF | -643.521 | -558.275 | -523.725 | $\boldsymbol{-513.214}$
GBPJPY,GBPUSD | -629.950 | -656.198 | -649.221 | $\boldsymbol{-605.305}$
AUDCHF,AUDJPY | -534.49 | -522.934 | -497.726 | $\boldsymbol{-477.992}$
EURGBP,EURUSD,EURJPY | -920.085 | -959.420 | -985.156 | $\boldsymbol{-917.907}$
USDCAD,USDCHF,USDJPY | -1008.821 | -998.041 | -990.601 | $\boldsymbol{-957.912}$
EURGBP,GBPJPY,USDJPY | $\boldsymbol{-916.957}$ | -943.66 | -1011.435 | -966.806
GBPAUD,GBPJPY,GBPUSD | -971.522 | -991.8238 | -1037.296 | $\boldsymbol{-967.500}$
EURCHF,EURGBP,EURJPY,EURUSD | -1196.477 | -1127.192 | -1105.298 | $\boldsymbol{-1078.165}$
AUDJPY,AUDCHF,EURCHF,GBPJPY | -1505.540 | -862.995 | -865.471 | $\boldsymbol{-783.955}$

In order to evaluate whether the models' performance differences across time series are statistically significant, we plot a critical difference (CD) diagram following the approach of Ismail Fawaz et al. (2019): a Friedman test at $\alpha=0.05$ (Friedman, 1940) is first used to reject the null hypothesis that the models are equivalent and have equal rankings, and a post-hoc analysis is then performed using Wilcoxon signed-rank tests (Wilcoxon, 1945) at the 95% confidence level. The critical difference diagram shows the average ranking of each model across the different datasets.
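
The testing procedure can be sketched as follows (our illustration; `scores` is a hypothetical array of test log-likelihoods, one row per dataset and one column per model):

```python
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon, rankdata

def compare_models(scores, alpha=0.05):
    """scores: (n_datasets, n_models) test log-likelihoods, higher is better."""
    _, p = friedmanchisquare(*scores.T)          # H0: all models rank equally
    print(f"Friedman test p-value: {p:.4f}")
    if p < alpha:                                # pairwise post-hoc comparisons
        m = scores.shape[1]
        for i in range(m):
            for j in range(i + 1, m):
                _, p_ij = wilcoxon(scores[:, i], scores[:, j])
                print(f"model {i} vs model {j}: Wilcoxon p = {p_ij:.4f}")
    # average rank per model (rank 1 = best), as shown on a CD diagram
    print("average ranks:", rankdata(-scores, axis=1).mean(axis=0))
```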

In Figure 3 we show the CD plot for the univariate time series. A bold horizontal line indicates no significant difference amongst the group of models on the line. In the univariate experiments we observe no significant difference amongst EGARCH(1,1,1)-Student's t, GARCH(1,1)-Student's t and Neural-GARCH(1,1); likewise, there is no significant difference amongst GARCH(1,1)-Student's t, Neural-GARCH(1,1), GARCH(1,1)-Normal and EGARCH(1,1,1)-Normal. We also observe that, on average, GARCH(1,1)-Normal and EGARCH(1,1,1)-Normal perform significantly better than EGARCH(1,1,1)-Student's t. We establish that Neural-GARCH(1,1)-Student's t is the best performer overall on the univariate datasets, significantly outperforming the other models with an average rank of 1.8929.

Figure 3: Critical difference diagram of the univariate experiments. A horizontal bold line indicates no significant difference amongst the group of models. We establish that Neural-GARCH(1,1)-Student's t is the best performer in the univariate experiments.
Figure 4: Critical difference diagram showing the average rankings of GARCH(1,1) and Neural-GARCH(1,1) with normal and Student's t innovations on all time series experiments. We find that Neural-GARCH(1,1)-Student's t is the best-performing model with an average rank of 1.4324.

In Figure 4 we show the CD plot constructed using all the time series experiments (univariate and multivariate); our aim here is to compare the class of traditional GARCH(1,1) models against their neural network adaptations. We observe no significant difference between GARCH(1,1)-Student's t, Neural-GARCH(1,1) and GARCH(1,1)-Normal, and we establish that Neural-GARCH(1,1)-Student's t is the best performer overall with an average ranking of 1.4324.

For a GARCH(1,1) model, the returns process is often assumed to be stationary with a constant unconditional mean and variance; Neural-GARCH(1,1) relaxes this stationarity assumption. The unconditional variance of Neural-GARCH(1,1) in the univariate case

$\sigma_t^2 = \omega_t + \alpha_t r_{t-1}^2 + \beta_t\sigma_{t-1}^2$ (23)

is obtained by taking the expectation of (23):

$\mathbb{E}[r_t^2] = \mathbb{E}[\omega_t + \alpha_t r_{t-1}^2 + \beta_t\sigma_{t-1}^2] = \omega_t + \alpha_t\mathbb{E}[r_{t-1}^2] + \beta_t\mathbb{E}[\sigma_{t-1}^2] = \omega_t + (\alpha_t+\beta_t)\mathbb{E}[r_{t-1}^2]$. (24)

For a GARCH(1,1) model with constant coefficients $\{\omega,\alpha,\beta\}$, we have $\mathbb{E}[r_t^2]=\mathbb{E}[r_{t-1}^2]$ (constant unconditional variance) and therefore $\mathbb{E}[r_t^2]=\frac{\omega}{1-\alpha-\beta}$. With Neural-GARCH(1,1), $\mathbb{E}[r_t^2]\neq\mathbb{E}[r_{t-1}^2]$; however, we can assume that the parameters $\{\omega_t,\alpha_t,\beta_t\}$ change gradually with no sudden jumps, so that $\mathbb{E}[r_t^2]\approx\mathbb{E}[r_{t-1}^2]$ (Bringmann et al., 2017), and we can approximate the time-varying unconditional variance of Neural-GARCH(1,1) by $\mathbb{E}[r_t^2]\approx\frac{\omega_t}{1-\alpha_t-\beta_t}$, with $\alpha_t+\beta_t<1$.

Our analysis of the Neural-GARCH(1,1) coefficients shows a consistent pattern when compared to GARCH(1,1) models. We provide an example for the currency pair USDCHF in Figure 5, which shows the time-varying parameter set $\{\omega_t,\alpha_t,\beta_t\}$ of Neural-GARCH(1,1) against the constant set $\{\omega,\alpha,\beta\}$ of GARCH(1,1). Across different time series, Neural-GARCH(1,1) consistently estimates higher values for $\omega$ and $\alpha$, and a lower value for $\beta$. In Figure 6 we show zoomed-in views of the Neural-GARCH(1,1) coefficients from Figure 5 for the currency pair USDCHF. The coefficients exhibit well-behaved time-varying dynamics, and similar dynamics are observed across all three parameters. This shows the effectiveness of our learned-prior neural network ($MLP_{pred}$), which models the distribution $P_\theta(\boldsymbol{\gamma}_t|\boldsymbol{\gamma}_{t-1},\boldsymbol{r}_{1:t-1})$.

Figure 5: Plots of Neural-GARCH(1,1) coefficients against GARCH(1,1) coefficients. The blue line shows the Neural-GARCH(1,1) $\alpha_t$ (left), $\beta_t$ (middle) and $\omega_t$ (right); the orange line shows the GARCH(1,1) coefficients.
Figure 6: Zoomed-in plots of the Neural-GARCH(1,1) coefficients shown in Figure 5 for USDCHF.

Having time-varying coefficients allows us to model the financial returns time series as a non-stationary process with a zero unconditional mean but time-varying unconditional variance. Similarly, the authors in Stǎricǎ and Granger (2005) reported that relaxing the stationarity assumption on daily S&P 500 returns and using locally stationary linear models achieved better forecasting performance, and their analysis showed most of the dynamics of the returns time series to be concentrated in shifts of the unconditional variance. Our model provides a data-driven approach to modelling the returns process: during model training we optimise the neural network parameters without imposing any external constraints, yet we observe in Figure 6 that the model nonetheless outputs time-varying coefficients satisfying $\alpha_t+\beta_t<1$, which is required for the model to have a well-defined unconditional variance.

5 Conclusions

In this paper we propose neural GARCH: a neural network adaptation of the univariate GARCH(1,1) and multivariate diagonal BEKK(1,1) models for modelling conditional heteroskedasticity in financial time series. Our model consists of a recurrent neural network that captures the temporal dynamics of the returns process and a multilayer perceptron that predicts the next-step-ahead GARCH coefficients, which are then used to determine the conditional volatilities. The generative model of neural GARCH makes predictions based on all available information, and the inference model makes updated posterior estimates of the GARCH coefficients when new information becomes available. We tested two versions of neural GARCH on univariate and multivariate financial returns time series: one with normal innovations and the other with Student's t innovations. When compared against their GARCH counterparts, we observe that neural GARCH Student's t is the best performer, and from our analysis we hypothesise that this is due to the neural network's ability to capture complex temporal dynamics present in the time series, and to the relaxation of the stationarity assumption that is fundamental to traditional GARCH models.

Acknowledgement

The authors would like to thank Fabio Caccioli, Department of Computer Science, University College London, for proofreading the manuscript and providing feedback.

References

  • Bringmann et al. (2017) Bringmann, L.F., Hamaker, E.L., Vigo, D.E., Aubert, A., Borsboom, D. and Tuerlinckx, F., Changing dynamics: Time-varying autoregressive models using generalized additive modeling. Psychological Methods, 2017, 22, 409–425.
  • Bauwens et al. (2006) Bauwens, L., Laurent, S. and Rombouts, J.V., Multivariate GARCH models: A survey. Journal of Applied Econometrics, 2006, 21, 79–109.
  • Bayer and Osendorfer (2014) Bayer, J. and Osendorfer, C., Learning Stochastic Recurrent Networks. arXiv preprint, 2014, pp. 1–9.
  • Bollerslev (1986) Bollerslev, T., Generalised Autoregressive Conditional Heteroskedasticity. Journal of Econometrics, 1986, 31, 307–327.
  • Bollerslev (1987) Bollerslev, T., A Conditionally Heteroskedastic Time Series Model for Speculative Prices and Rates of Return. The Review of Economics and Statistics, 1987, 69, 542–547.
  • Bollerslev et al. (1988) Bollerslev, T., Engle, R.F. and Wooldridge, J.M., A Capital Asset Pricing Model with Time-Varying Covariances. Journal of Political Economy, 1988, 96, 116–131.
  • Chan and Grant (2016) Chan, J.C. and Grant, A.L., Modeling energy price dynamics: GARCH versus stochastic volatility. Energy Economics, 2016, 54, 182–189.
  • Cho et al. (2014) Cho, K., van Merrienboer, B., Bahdanau, D. and Bengio, Y., On the Properties of Neural Machine Translation: Encoder–Decoder Approaches. In Proceedings of the Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation (SSST-8), 2014.
  • Christodoulakis and Satchell (2002) Christodoulakis, G.A. and Satchell, S.E., Correlated ARCH (CorrARCH): Modelling the time-varying conditional correlation between financial asset returns. European Journal of Operational Research, 2002, 139, 351–370.
  • Chu et al. (2017) Chu, J., Chan, S., Nadarajah, S. and Osterrieder, J., GARCH Modelling of Cryptocurrencies. Journal of Risk and Financial Management, 2017, 10, 17.
  • Chung et al. (2015) Chung, J., Kastner, K., Dinh, L., Goel, K., Courville, A. and Bengio, Y., A recurrent latent variable model for sequential data. In Proceedings of the Advances in Neural Information Processing Systems, Vol. 2015-January, pp. 2980–2988, 2015.
  • Denton and Fergus (2018) Denton, E. and Fergus, R., Stochastic Video Generation with a Learned Prior. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Vol.  3, pp. 1906–1919, 2018.
  • Engle (1982) Engle, R., Autoregressive Conditional Heteroscedasticity with Estimates of the Variance of United Kingdom Inflation. Econometrica, 1982, 50, 987–1007.
  • Engle (2002) Engle, R., Dynamic Conditional Correlation. Journal of Business and Economic Statistics, 2002, 20, 339–350.
  • Engle and Kroner (1995) Engle, R. and Kroner, K., Multivariate Simultaneous Generalized ARCH. Econometric Theory, 1995, 11, 122–150.
  • Erten et al. (2012) Erten, I., Murat, M. and Okay, N., Volatility Spillovers in Emerging Markets During the Global Financial Crisis : Diagonal BEKK Approach. Munich Personal RePEc Archive, 2012, pp. 1–18.
  • Fabius and van Amersfoort (2015) Fabius, O. and van Amersfoort, J.R., Variational recurrent auto-encoders. 3rd International Conference on Learning Representations, ICLR 2015 - Workshop Track Proceedings, 2015, pp. 1–5.
  • Fraccaro et al. (2016) Fraccaro, M., Sønderby, S.K., Paquet, U. and Winther, O., Sequential neural models with stochastic layers. In Proceedings of the Advances in Neural Information Processing Systems, pp. 2207–2215, 2016.
  • Franceschi et al. (2020) Franceschi, J.Y., Delasalles, E., Chen, M., Lamprier, S. and Gallinari, P., Stochastic latent residual video prediction. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, Vol. PartF16814, pp. 3191–3204, 2020.
  • Friedman (1940) Friedman, M., A comparison of alternative tests of significance for the problem of m rankings. The Annals of Mathematical Statistics, 1940, 11, 86–92.
  • Glosten et al. (1993) Glosten, L.R., Jagannathan, R. and Runkle, D.E., On the Relation between the Expected Value and the Volatility of the Nominal Excess Return on Stocks. The Journal of Finance, 1993, 48, 1779–1801.
  • Heracleous (2007) Heracleous, M., Sample Kurtosis, GARCH-t and the Degrees of Freedom Issue. EUR Working Papers, 2007.
  • Ismail Fawaz et al. (2019) Ismail Fawaz, H., Forestier, G., Weber, J., Idoumghar, L. and Muller, P.A., Deep learning for time series classification: a review. Data Mining and Knowledge Discovery, 2019, 33, 917–963.
  • Karl et al. (2017) Karl, M., Soelch, M., Bayer, J. and Van Der Smagt, P., Deep variational Bayes filters: Unsupervised learning of state space models from raw data. In Proceedings of the 5th International Conference on Learning Representations, ICLR 2017 - Conference Track Proceedings, ii, pp. 1–13, 2017.
  • Kingma and Welling (2014) Kingma, D.P. and Welling, M., Auto-encoding variational bayes. In Proceedings of the 2nd International Conference on Learning Representations, ICLR 2014 - Conference Track Proceedings, pp. 1–14, 2014.
  • Krishnan et al. (2017) Krishnan, R.G., Shalit, U. and Sontag, D., Structured inference networks for nonlinear state space models. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, AAAI 2017, pp. 2101–2109, 2017.
  • Malik (2005) Malik, A.K., European exchange rate volatility dynamics: An empirical investigation. Journal of Empirical Finance, 2005, 12, 187–215.
  • Nelson (1991) Nelson, D., Conditional Heteroskedasticity in Asset Returns : A New Approach. Econometrica, 1991, 59, 347–370.
  • Rangapuram et al. (2018) Rangapuram, S.S., Seeger, M., Gasthaus, J., Stella, L., Wang, Y. and Januschowski, T., Deep state space models for time series forecasting. In Proceedings of the Advances in Neural Information Processing Systems, pp. 7785–7794, 2018.
  • Stǎricǎ and Granger (2005) Stǎricǎ, C. and Granger, C., Nonstationarities in stock returns. Review of Economics and Statistics, 2005, 87, 503–522.
  • Tse and Tsui (2002) Tse, Y.K. and Tsui, A.K., A multivariate generalized autoregressive conditional heteroscedasticity model with time-varying correlations. Journal of Business and Economic Statistics, 2002, 20, 351–362.
  • Van Der Weide (2002) Van Der Weide, R., GO-GARCH: A multivariate generalized orthogonal GARCH model. Journal of Applied Econometrics, 2002, 17, 549–564.
  • Wilcoxon (1945) Wilcoxon, F., Individual comparisons by ranking methods. Biometrics Bulletin, 1945, 1, 80–83.
  • Wu et al. (2013) Wu, Y., Lobato, J.M.H. and Ghahramani, Z., Dynamic covariance models for multivariate financial time series. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Vol.  28, pp. 1595–1603, 2013.