
Mixture Density Conditional Generative Adversarial Network Models (MD-CGAN)

Jaleh Zand & Stephen Roberts
Machine Learning Research Group & Oxford-Man Institute of Quantitative Finance
University of Oxford
{jz,sjrob}@robots.ox.ac.uk
Abstract

Generative Adversarial Networks (GANs) have gained significant attention in recent years, with impressive applications highlighted in computer vision in particular. Compared to such examples, however, there have been more limited applications of GANs to time series modelling, including forecasting. In this work, we present the Mixture Density Conditional Generative Adversarial Network (MD-CGAN), with a focus on time series forecasting. We show that our model is capable of estimating a probabilistic posterior distribution over forecasts and that, in comparison to a set of benchmark methods, the MD-CGAN model performs well, particularly in situations where noise is a significant component of the observed time series. Further, by using a Gaussian mixture model as the output distribution, MD-CGAN offers posterior predictions that are non-Gaussian.

1 Introduction

Generative Adversarial Networks (GANs) have been one of the many breakthroughs in deep learning in recent years. Several variations of the model have been proposed since the method was first introduced in [1]. One of the most popular is the Conditional Generative Adversarial Network (CGAN) [2], in which the generator and discriminator (we briefly review the GAN process in Section 2) are both conditioned on some observed information. In the application of time series forecasting, future values are conditioned on information observed from the past, either from the time series itself, some set of associated exogenous data, or a combination of the two. This conditioning addition to the GAN formalism makes the CGAN approach particularly useful as a foundational model for time series prediction. Most applications of (C)GANs, however, have been within computer vision and, to a lesser extent, natural language processing and simulation models [3, 4, 5].

The literature on the application of any form of GAN model to problems associated with time series is, to date, limited. However, some recent literature shows the potential usefulness of the method. For example, the work reported in [6] applies a recurrent GAN to generate realistic, synthetic, medical data series and in [7] a (standard) GAN model is used to generate realistic financial asset prices and analyse their distributions.

In [8] a GAN is used to forecast high-frequency stock data and in [9] GANs are used to generate missing values in incomplete time series. We note that the GAN models used in all these applications make point estimates for the forecast. Although a perfectly valid approach, and one that has a long history in time series forecasting, we argue that probabilistic forecasts are a prerequisite in many application domains, in which knowledge of the predictive uncertainty is as vital as the prediction itself. In this work, we expand on the CGAN algorithm to allow a full predictive probability distribution, rather than a point value. To obtain richer predictive densities we model the posterior distribution using a finite Gaussian Mixture model (GMM). Although we find only occasional evidence that such non-Gaussian predictions offer significant benefits, we note that producing them is not much more costly than single Gaussian predictive distributions, and so present our approach as a more general multi-component model.

1.1 Related work

We here briefly review recent literature which is close to our approach. We start by noting the work of [10], in which a mixture of GAN models is proposed as a data clustering model. Although clearly related, this is somewhat different to our approach, in which we use a single GAN generator, linked to the hyperparameters of the posterior mixture model, rather than a mixture of GAN models. Furthermore, our goal is forecasting rather than unsupervised data classification. The approaches advocated in [11] and [12] propose a latent space, used for sampling latent vectors in the GAN, formulated via a Gaussian mixture. The latter replaces, in these papers, the single Gaussian distribution used for such generation in standard GAN models. In both these papers, the generator and discriminator have a similar structure to a standard GAN. In [13] a mixture model is used, but for the discriminator alone, with the generator being that of a standard GAN model; we note the difference to our approach, in which the generator is a GMM. Finally, [14] compare (standard) GAN models to GMMs for image generation. The authors show that GANs are superior in their ability to generate sharp images, but note that mixture models offer more efficient inference. They propose a combination of the two and introduce a GAN model in which the GAN generator is a mixture model. However, the sample generator still makes point estimates from the multimodal distribution in order to retain a discriminator function the same as that of a standard GAN model. We offer discriminator extensions which allow the GAN process to operate on the full (multi-modal) posterior distribution.

1.2 Paper structure

The rest of this paper is set out as follows: in Section 2 we provide a brief overview of the (C)GAN model, introducing the key concepts. In Section 3 we present the structure of the MD-CGAN model. In Section 4 we test the model on a variety of ‘real-world’ datasets and discuss the results. Finally, in Section 5, we conclude.

2 The GAN and CGAN Model

The goal of the GAN model is to estimate a generative model using an adversarial process [1]. This is achieved by simultaneously training two models: a generative model $G$, which, in the case of data forecasting, learns past patterns in the data and infers predictive values; and a discriminative model $D$, which determines how likely a sample is to originate from the 'true' training data rather than from the generator. The generator is hence pitted against an adversary, the discriminator, whose goal is to detect the difference between a true data sample and one created by the generator. The components of the model are then trained (via an optimization process) to maximize the probability of the discriminator being unable to distinguish true from generated data samples. Typically, including in the approach we take here, the generator and the discriminator are both constructed as multilayer perceptrons and optimized using stochastic gradient methods.

We start by formulating the definitions used in common across the GAN models we test. We consider a time series, $y_t$. Our aim is to estimate a forecast $y^f_{t'>t}$, conditioned on a set of observations which we denote $\mathbf{x}_t$. The inputs to the generator network are $\mathbf{x}_t$ and $\mathbf{z}_g$, where $\mathbf{z}_g$ is a collection of samples from a normal distribution, $p(z_n)_g = \mathcal{N}(0, \mathrm{var}_{\mathrm{data}})$. During model training, the output from the generator, $y^f_{t'}$, as well as the true forecast sample $y_{t'}$, are fed to the discriminator, whose role is to discriminate between them, i.e. to identify $y^f_{t'}$ as the 'fake' sample.

In an unconditioned GAN model, there is no control over the data that is generated. In the CGAN model, in contrast, by conditioning the model on additional information, it is possible to direct the data generation process [2]. Schematics of the GAN and CGAN models are depicted in Figure 1.

[Figure 1 diagram: (a) GAN, in which the generator maps $\mathbf{z}_t$ to $y^f_{t'}$ and the discriminator labels $y^f_{t'}$ versus the true $y_{t'}$ as fake or real; (b) CGAN, as (a) but with both generator and discriminator additionally conditioned on $\mathbf{x}_t$.]
Figure 1: Schematic of (a) GAN and (b) CGAN models, showing Generator and Discriminator components and associated variables.

3 The MD-CGAN Model Framework

As with the GAN and CGAN methods, we consider a time series, $y_t$. Our aim now is to infer the posterior distribution over some $y_{t'>t}$, conditioned on the set of observations, $\mathbf{x}_t$. In order to form the posterior distribution we model the full conditional density $p(y_{t'}|\mathbf{x}_t)$ as an adversarial network. To achieve this we use a Mixture Density Network (MDN) model [15] for the generator $G$. The inputs to the generator network are as per the CGAN approach, $\mathbf{x}_t$ and $\mathbf{z}_g$, where $\mathbf{z}_g$ is, as before, a collection of samples from a normal distribution, $p(z_n)_g = \mathcal{N}(0, \mathrm{var}_{\mathrm{data}})$. The outputs of $G_{t'}(\mathbf{x}_t, \mathbf{z}_g)$ are now, however, the parameters of the Gaussian mixture posterior over the forecast. This mixture has mixing coefficient, standard deviation and mean for the $i$-th component denoted as $\alpha_i$, $\sigma_i$ and $\mu_i$ respectively. As first proposed in [15], we achieve this by using latent variables $\mathbf{s} = \{\mathbf{s}_\alpha, \mathbf{s}_\sigma, \mathbf{s}_\mu\}$, conditioned on the inputs. The mapping $[\mathbf{x}_t, \mathbf{z}_g] \mapsto \mathbf{s} \mapsto \{\alpha_i, \sigma_i, \mu_i\}$ is modelled via our network. As the mixing coefficients must satisfy $\sum_i \alpha_i = 1$, we map $\mathbf{s}_\alpha$ to $\boldsymbol{\alpha}$ via the softmax function, $\alpha_i = \exp(s_{\alpha,i}) / \sum_{i'} \exp(s_{\alpha,i'})$. The elements of $\boldsymbol{\sigma}$ are strictly positive, so we adopt $\sigma_i = \exp(s_{\sigma,i})$. Finally, the means are mapped directly from the latent variables, hence $\mu_i = s_{\mu,i}$. Schematically, the MD-CGAN method is depicted in Figure 2.
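As a concrete illustration of the mapping from the latent variables $\mathbf{s}$ to mixture parameters, the following minimal numpy sketch applies the softmax, exponential and identity transforms described above; the numerical values are arbitrary and purely illustrative.

```python
import numpy as np

def mdn_head(s_alpha, s_sigma, s_mu):
    """Map raw latent outputs s = {s_alpha, s_sigma, s_mu} to valid mixture
    parameters: softmax for the mixing coefficients, exp for the standard
    deviations, identity for the means."""
    e = np.exp(s_alpha - np.max(s_alpha))   # subtract max for numerical stability
    alpha = e / np.sum(e)                   # mixing coefficients sum to one
    sigma = np.exp(s_sigma)                 # strictly positive standard deviations
    mu = s_mu                               # means taken directly from the latents
    return alpha, sigma, mu

# Example with m = 3 components (illustrative latent values only)
alpha, sigma, mu = mdn_head(np.array([0.2, -1.0, 0.5]),
                            np.array([-0.5, 0.1, 0.0]),
                            np.array([0.3, 0.7, 1.2]))
print(alpha.sum())  # 1.0
```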

[Figure 2 diagram: the MD-CGAN generator maps $\mathbf{x}_t$ and $\mathbf{z}_t$ to mixture parameters $\{\alpha_i, \sigma_i, \mu_i\}$, giving $\mathcal{L}(G_{t'})$; the discriminator, conditioned on $\mathbf{x}_t$, compares $\mathcal{L}(G_{t'})$ with $\mathcal{L}(y_{t'})$ and labels fake versus real.]
Figure 2: Schematic of the MD-CGAN model. We note that the discriminator in the MD-CGAN model has a different loss function and structure to the GAN and CGAN models.

The above formalism allows us to directly model the predictive likelihood conditioned on an input; we write the likelihood of $G$, conditioned on the observations $\mathbf{x}_t$ and samples $\mathbf{z}_g$, as:

$$\mathcal{L}(G_{t'}(\mathbf{x}_t,\mathbf{z}_g)) = \sum_{i=1}^{m} \alpha_i(\mathbf{x}_t,\mathbf{z}_g)\,\mathcal{N}_i\!\left(y_{t'}\,\middle|\,\mu_i(\mathbf{x}_t,\mathbf{z}_g),\sigma_i(\mathbf{x}_t,\mathbf{z}_g)\right) \qquad (1)$$

where $m$ is the number of mixture components.
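For reference, Equation (1) can be transcribed directly; the sketch below evaluates the mixture likelihood of a single target value given generator outputs (the example values are placeholders, not taken from the paper).

```python
import numpy as np

def mixture_likelihood(y, alpha, mu, sigma):
    """Equation (1): L(G) = sum_i alpha_i * N(y | mu_i, sigma_i), the Gaussian
    mixture likelihood of the target y under the generator's parameters."""
    comp = np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)
    return float(np.sum(alpha * comp))

# Illustrative call with m = 2 components
print(mixture_likelihood(0.4,
                         alpha=np.array([0.6, 0.4]),
                         mu=np.array([0.3, 0.8]),
                         sigma=np.array([0.1, 0.2])))
```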

As in the CGAN model, the discriminator, $D$, is also conditioned on $\mathbf{x}_t$. The input to the discriminator model is, by design, $\mathbf{x}_t \sqrt{2\pi}\sigma_a \mathcal{L}(y_{t'})$, where $\sigma_a$ is the standard deviation of the set of observed $y_t$. (We note that the GAN approach is not sensitive, within reason, to this value; it is, in effect, a constant in the update equations, and we discuss its setting later in the paper.) For true values of $y_{t'}$, $\sqrt{2\pi}\sigma_a \mathcal{L}(y_{t'})$ is maximized. The generator tries to 'fool' the discriminator by generating $G_{t'}$ such that $\sqrt{2\pi}\sigma_a \mathcal{L}(G_{t'})$ is maximized. The loss function for the generator, $L_G$, is given in Equation 2. The discriminator network, on the other hand, attempts to differentiate between true $y_{t'}$ values and the pseudo-values created by the generator; its loss function, $L_D$, is given in Equation 3. We note that the lowest value of the discriminator loss is achieved when $\sqrt{2\pi}\sigma_a \mathcal{L}(y_{t'})$ is maximal (unity) and $\mathcal{L}(G_{t'}(\mathbf{x}_t,\mathbf{z}_g))$ is minimal (zero).

$$L_G = \mathbb{E}_{z \sim P_z(z)}\left[-\mathcal{L}(G_{t'}(\mathbf{x}_t,\mathbf{z}_g))\right] \qquad (2)$$

$$L_D = \mathbb{E}_{y \sim P_{\mathrm{data}}(y)}\left[\left\|\mathbf{x}_t\sqrt{2\pi}\sigma_a\mathcal{L}(y_{t'}) - \mathbf{x}_t\right\|^2\right] + \mathbb{E}_{z \sim P_z(z)}\left[\left\|\mathbf{x}_t\sqrt{2\pi}\sigma_a\mathcal{L}(G_{t'}(\mathbf{x}_t,\mathbf{z}_g))\right\|^2\right] \qquad (3)$$
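A direct numpy transcription of the loss values in Equations (2) and (3) over a minibatch might look as follows; here `x` holds the conditioning vectors, while `lik_true` and `lik_fake` hold the mixture likelihoods of Equation (1) evaluated on true and generated samples respectively. The variable names are ours, and how the discriminator network of Figure 3 enters these expressions is determined by its wiring rather than by this sketch.

```python
import numpy as np

SQRT_2PI = np.sqrt(2.0 * np.pi)

def generator_loss(lik_fake):
    """Equation (2): L_G = E_z[ -L(G(x, z)) ], estimated as a minibatch mean."""
    return float(np.mean(-lik_fake))

def discriminator_loss(x, lik_true, lik_fake, sigma_a=0.2):
    """Equation (3): push x*sqrt(2*pi)*sigma_a*L(y) towards x (scaled likelihood
    of true samples towards unity) and x*sqrt(2*pi)*sigma_a*L(G) towards zero."""
    real_term = np.sum((x * SQRT_2PI * sigma_a * lik_true[:, None] - x) ** 2, axis=1)
    fake_term = np.sum((x * SQRT_2PI * sigma_a * lik_fake[:, None]) ** 2, axis=1)
    return float(np.mean(real_term) + np.mean(fake_term))
```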

Our algorithm thus follows the steps in Algorithm 1 below.

Algorithm 1 MD-CGAN Algorithm
1: for number of training iterations do
2:     for $j$ steps do
3:         Sample $N$ noise samples $\{\mathbf{z}^{1},\ldots,\mathbf{z}^{N}\}$ from $p_g(\mathbf{z})$
4:         Sample $N$ data points $\{\mathbf{x}^{1},\ldots,\mathbf{x}^{N}\}$ from $p_{\mathrm{data}}(\mathbf{x})$
5:         Update the discriminator by descending its stochastic gradient:
               $\nabla_{\theta_{\mathcal{L}}} \sum_{n=1}^{N} \left[\left\|\mathbf{x}^{(n)}\sqrt{2\pi}\sigma_a\mathcal{L}(y^{(n)}) - \mathbf{x}^{(n)}\right\|^2 + \left\|\mathbf{x}^{(n)}\sqrt{2\pi}\sigma_a\mathcal{L}(G(\mathbf{z}^{(n)},\mathbf{x}^{(n)}))\right\|^2\right]$
6:     end for
7:     Sample $N$ noise samples $\{\mathbf{z}^{1},\ldots,\mathbf{z}^{N}\}$ from $p_g(\mathbf{z})$
8:     Update the generator by descending its stochastic gradient:
               $\nabla_{\theta_g} \sum_{n=1}^{N} -\mathcal{L}(G(\mathbf{z}^{(n)},\mathbf{x}^{(n)}))$
9: end for

4 Experiments

4.1 Comparison with other Learning Models

To provide a range of comparisons to methods related to this work, we compare our MD-CGAN model to the following baseline methods: the Mixture Density Network (MDN) model, chosen as a baseline for mixture density outputs [15]; the CGAN model, chosen as a well-known GAN approach [2]; and a "standard" Multi-Layer Perceptron (MLP) neural network (SNN) as a simple, yet effective, baseline. As a more traditional, parametric benchmark we use regular (linear-Gaussian) Auto-Regressive (AR) models [16], with parameters obtained by standard least-squares maximum-likelihood estimation. To promote as fair a comparison as possible for the nonlinear methods, we use a common neural network architecture across the core components of the models, choosing the neural network structure commonly used for GAN models in recent literature [2]. We do not alter this structure throughout our experiments and we keep to fixed hyperparameter settings (guided by those used in [2]). Figure 3 provides a schematic of the structure used. We note that, whilst the lengths of the input and output vectors are dependent on the model, the structure of all networks remains constant.
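For completeness, the linear baseline is straightforward to reproduce; the sketch below fits an AR(k) model by ordinary least squares and issues a one-step forecast. The synthetic series and variable names are illustrative only, not the paper's data.

```python
import numpy as np

def fit_ar(y, k):
    """Least-squares fit of an AR(k) model y_t = c + sum_j a_j * y_{t-j} + noise."""
    X = np.column_stack([y[k - j - 1:len(y) - j - 1] for j in range(k)])
    X = np.column_stack([np.ones(len(X)), X])          # intercept term
    coef, *_ = np.linalg.lstsq(X, y[k:], rcond=None)
    return coef

def ar_forecast(y_hist, coef):
    """One-step forecast from the last k observations of the series."""
    k = len(coef) - 1
    return coef[0] + coef[1:] @ y_hist[-1:-k - 1:-1]   # most recent lag first

rng = np.random.default_rng(0)
y = np.sin(np.linspace(0, 20, 500)) + 0.1 * rng.standard_normal(500)  # toy series
coef = fit_ar(y, k=5)
print(ar_forecast(y, coef))
```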

4.2 Details of implementation

All models were coded in the Python language and we use the Keras library [17] to build the neural networks.

The neural networks in Figure 3 follow the structure of the CGAN model detailed in [2]. The hyperparameters of the models were set as follows: for the discriminator, in both CGAN and MD-CGAN, the dropout rate is set to 0.4 and the leaky ReLU alpha is set to 0.2; for the generator, the dropout rate is set to 0.5 and the leaky ReLU alpha is, again, 0.2. The parameter governing the number of neurons, $n$, in the dense layers of the network modules is set to 20. The parameter $\sigma_a$ in the MD-CGAN models is set to 0.2. During training, we follow the steps of Algorithm 1 and set $j$ to 1 for all datasets.

The number of training iterations is, however, specific to each model and dataset. With the exception of the GAN models, we monitor the errors until they reach saturation during training. For the GAN models, in which both the generator and the discriminator have a loss value per iteration, we instead monitor the average sample error on the training data and continue iterating until the errors reach saturation.

In all models we optimize parameters using the Adam optimizer [18], with the learning rate set to 0.001, the exponential decay rate for the first moment set to 0.9, the exponential decay rate for the second moment set to 0.999, and epsilon set to $10^{-7}$.
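Under the hyperparameters listed above, a minimal Keras sketch of the shared generator trunk of Figure 3 (three Dense blocks of width $n$, $2n$ and $4n$, each followed by Leaky ReLU, Dropout and Batch Normalization) and of the optimizer configuration might look as follows; the input widths and layer names are our assumptions.

```python
from tensorflow.keras import layers, optimizers, Model, Input

n = 20   # width parameter of the dense layers (Section 4.2)
k = 5    # length of the conditioning input window (Section 4.3)

def trunk_block(h, units, dropout_rate):
    """One Dense -> Leaky ReLU -> Dropout -> Batch Normalization block (Figure 3)."""
    h = layers.Dense(units)(h)
    h = layers.LeakyReLU(0.2)(h)          # slope 0.2 (the paper's leaky ReLU 'alpha')
    h = layers.Dropout(dropout_rate)(h)
    return layers.BatchNormalization()(h)

# Generator trunk: widths n, 2n, 4n with dropout 0.5 (0.4 is used in the discriminator)
x_in = Input(shape=(k,), name="x")        # conditioning observations
z_in = Input(shape=(k,), name="z")        # noise samples; the width here is an assumption
h = layers.concatenate([x_in, z_in])
for units in (n, 2 * n, 4 * n):
    h = trunk_block(h, units, dropout_rate=0.5)
generator_trunk = Model([x_in, z_in], h, name="generator_trunk")

# Adam settings common to all models (Section 4.2)
opt = optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-7)
```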

[Figure 3 diagram. Generators (SNN, MDN, CGAN, MD-CGAN): inputs are $\mathbf{x}$ (SNN, MDN) or $\{\mathbf{x}, \mathbf{z}\}$ (CGAN, MD-CGAN); the shared trunk is three blocks of Dense layer ($n$, $2n$, $4n$ units) $\rightarrow$ Leaky ReLU $\rightarrow$ Dropout $\rightarrow$ Batch Normalization; outputs are $\{\alpha_i, \sigma_i, \mu_i\}$ (MD-CGAN, MDN) or $y^f_{t'}$ (CGAN, SNN). Discriminators: the CGAN discriminator takes $[\mathbf{x}, y_{t'}]$ or $[\mathbf{x}, y^f_{t'}]$; the MD-CGAN discriminator takes $\mathbf{x}_t\sqrt{2\pi}\sigma_a\mathcal{L}(y_{t'})$ or $\mathbf{x}_t\sqrt{2\pi}\sigma_a\mathcal{L}(G_{t'})$; both pass through three blocks of Dense layer ($2n$ units) $\rightarrow$ Leaky ReLU $\rightarrow$ Dropout, feeding the (C)GAN or MD-CGAN loss function and the fake/real decision.]
Figure 3: Common Neural Network structure used across all models.

4.3 Data

We perform experiments on ten datasets with differing provenance: the Mackey-Glass chaotic dataset [19], the Sunspot dataset [20], and eight financial time series. We first look at one-step forecasting and the issue of (test set) noise resilience, focusing on controlled experiments with the Mackey-Glass and Sunspot datasets (Subsection 4.4). We then expand our experiments to consider the financial datasets and increased forecast horizons in Subsection 4.5.

For all the datasets we use, the time series is split into training and out-of-sample test sets. The training sets in all our experiments comprise 2000 samples and all test sets consist of the 400 sequential data points following the training set. All performance metrics are obtained from the test set only. All nonlinear algorithms are provided, as input, the last $k$ data points, with $k=5$ for all our experiments. For the linear AR models, we run both an AR(5) model (corresponding to $k=5$) and, as a martingale baseline, the AR(0) model, in which the forecast is simply the previous observation. We note, however, that the value $k=5$ is not optimized, and is chosen merely to allow simple comparisons across methods. All datasets are pre-normalized to the [0,1] interval, again to allow simpler comparison across data and methods. We further note that both the CGAN and SNN models make point-estimate predictions, whilst MD-CGAN and MDN estimate posterior distributions. To enable a simple comparison across models we therefore report the mean-square error (MSE) for all methods. The number of mixture components, $m$, in both MD-CGAN and MDN is set to unity (we vary this in Section 4.6), to further ease comparison. The most likely value of the predictive distribution (which for $m=1$ is simply the posterior mean) is taken as the forecast value for both the MDN and MD-CGAN models for the purposes of point-prediction error reporting.
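A minimal sketch of this preprocessing under the stated choices ($[0,1]$ normalization, lag length $k=5$, 2000 training and 400 test points); the function name and exact slicing conventions are ours.

```python
import numpy as np

def prepare(series, k=5, n_train=2000, n_test=400):
    """Normalize a series to [0, 1], build lagged input windows of length k,
    and split into 2000 training points and the following 400 test points."""
    s = (np.asarray(series) - np.min(series)) / (np.max(series) - np.min(series))
    X = np.stack([s[t - k:t] for t in range(k, len(s))])  # last k observations
    y = s[k:]                                             # one-step-ahead target
    return (X[:n_train], y[:n_train],
            X[n_train:n_train + n_test], y[n_train:n_train + n_test])
```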

4.4 One-step forecasting

Mackey-Glass and Sunspot time series: We start our experiments looking at one-step ahead forecasts on two well known data sets, the Mackey-Glass chaotic time series [19] and the Sunspot data set [20]. We consider one-step forecast errors in the presence of increasing test set noise. We add 5% to 30% (by amplitude) normally distributed noise to the test data (from a GAN perspective, these input perturbations are, in effect, treated as adversarial attacks). We note that no noise is added to the training dataset. Mean Square Errors (MSE) are presented in Tables 1 and 2 and Figure 4 for all algorithms considered. For both datasets we see that MD-CGAN has the best performance for noise levels of 10% and above. Indeed we see that the GAN models (particularly MD-CGAN) perform consistently well (especially in the Mackey-Glass example) across multiple noise levels, indicating that the approach is particularly resilient to additive observation noise. This is to be expected, as GAN approaches treat the additive noise as adversarial perturbation to the input, against which they are designed to be robust.
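A sketch of the test-set perturbation described above: zero-mean Gaussian noise scaled to a given fraction of the signal amplitude is added to the test inputs only, with training data left noise-free as in the text. The exact scaling convention is our assumption.

```python
import numpy as np

def add_amplitude_noise(x_test, level, seed=0):
    """Add zero-mean Gaussian noise at `level` (e.g. 0.05 ... 0.30) of the data
    amplitude to the test inputs; training inputs are not perturbed."""
    rng = np.random.default_rng(seed)
    amplitude = np.max(x_test) - np.min(x_test)
    return x_test + level * amplitude * rng.standard_normal(np.shape(x_test))

# e.g. x_test_10pct = add_amplitude_noise(x_test, 0.10)
```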

In the next section we investigate model performance at longer forecast horizons in the finance domain, which is known to contain variable amounts of stochasticity. Financial time series are often dominated by stochastic components, and we expect GAN approaches to be well suited to forecasting in these circumstances.

Figure 4: Comparative MSE plots with increasing test set noise perturbation. We note that the GAN-based methods, particularly MD-CGAN, perform consistently even as noise levels increase.
Table 1: Mackey-Glass data: MSE variation with added noise level.
0% noise 5% noise 10% noise 15% noise 20% noise 25% noise 30% noise
AR(0) 0.0020 0.0042 0.0133 0.0246 0.0418 0.0624 0.0951
AR(5) 4.1e-06 0.2794 1.1558 2.6581 4.4868 7.3152 10.1527
SNN 0.0014 0.0047 0.0242 0.0519 0.0570 0.1013 0.1640
CGAN 0.0036 0.0061 0.0155 0.0240 0.0259 0.0347 0.0360
MDN 0.0002 0.0064 0.0278 0.0589 0.0780 0.0980 0.1402
MD-CGAN 0.0026 0.0044 0.0126 0.0165 0.0197 0.0233 0.0264
Table 2: Sunspot data: MSE variation with added noise level.
0% noise 5% noise 10% noise 15% noise 20% noise 25% noise 30% noise
AR(0) 0.0080 0.0109 0.0200 0.0262 0.0494 0.0795 0.0974
AR(5) 0.0068 0.0081 0.0137 0.0138 0.0226 0.0314 0.0409
SNN 0.0114 0.0132 0.0154 0.0201 0.0335 0.0401 0.0536
CGAN 0.0137 0.0143 0.0149 0.0161 0.0228 0.0269 0.0266
MDN 0.0105 0.0140 0.0161 0.0263 0.0384 0.0592 0.0758
MD-CGAN 0.0093 0.0096 0.0113 0.0126 0.0159 0.0194 0.0203

4.5 Financial forecasts over longer-horizons

One-step forecasts were presented in Subsection 4.4. Here we extend the analysis to financial data over longer horizons. All models were used to make estimates over a horizon of ten weeks for an extended set of financial time series, namely: US initial jobless claims (USIJC, weekly intervals, [21]), EURUSD foreign exchange daily rates (EURUSD FX rate, [22]), WTI crude oil spot prices (WTI, [23]), Henry Hub natural gas spot prices (Nat Gas, [23]), the CBOE Volatility Index (VIX, [24]), New York Harbor No. 2 heating oil spot prices (Heating Oil, [23]), the Invesco DB US Dollar Index Bullish Fund (USD Index, [25]), and the iShares MSCI Brazil Small-Cap ETF (EM ETF, [26]). The ten-week horizon represents a 50-step forecast for the daily datasets (FX, WTI, Nat Gas, VIX, Heating Oil, USD Index & EM ETF) and a 10-step forecast for the weekly USIJC dataset. Our comparisons, as previously, include standard linear econometric models, namely the 5th-order autoregressive, AR(5), model and the martingale, or AR(0), model, in which the forecast is the last observed datum. Taking the martingale model as a baseline, we present in Table 3 the mean-square errors as ratios to the martingale model error. We note that the MD-CGAN approach delivers ratios below unity throughout and provides the lowest error on almost all datasets in this scenario.
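Given per-model forecasts, the entries of Table 3 reduce to a ratio of mean-square errors against the martingale baseline; a minimal sketch follows (the alignment convention and names are ours).

```python
import numpy as np

def mse(a, b):
    return float(np.mean((np.asarray(a) - np.asarray(b)) ** 2))

def ratio_to_martingale(y, model_forecast, h):
    """MSE of a model's h-step-ahead forecasts of y[h:], divided by the MSE of
    the martingale AR(0) baseline, whose h-step forecast of y[t] is y[t - h]."""
    y = np.asarray(y)
    target = y[h:]
    return mse(target, model_forecast) / mse(target, y[:-h])
```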

Table 3: Ratio of model MSE to martingale, AR(0), baseline model over 10-week forecast horizon.
USIJC EURUSD FX rate WTI Nat Gas VIX index Heating Oil USD Index EM ETF
AR(5) 0.78 1.91 0.85 1.01 0.71 0.82 1.24 0.89
SNN 0.79 1.25 0.89 0.94 0.71 0.93 1.34 0.82
CGAN 0.77 0.85 1.53 1.07 0.91 0.54 1.37 0.69
MDN 0.84 3.48 1.48 1.13 0.77 0.89 0.68 0.81
MD-CGAN 0.73 0.76 0.80 0.82 0.66 0.59 0.54 0.65

4.6 Multi-modal posterior predictions

Finally, we compare the performance of MD-CGAN over varying numbers of mixture components. In all previous experiments we set $m=1$ (hence the model produced a single Gaussian predictive posterior). Here we briefly present results for the financial datasets with $m \in \{1,2,3\}$. We report negative log-likelihood measures (as we do not compare against point-value models in this section) and consider one-step forecasts on all the datasets. Table 4 presents the performance across datasets for varying numbers of mixture components in the posterior prediction. Figure 5 shows the predicted distribution over the test datasets for Nat Gas and the VIX index.
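The per-sample negative log-likelihoods of Table 4 can be computed from the mixture parameters produced for each test point; a minimal sketch using the log-sum-exp form for numerical stability (array shapes and names are ours).

```python
import numpy as np

def neg_log_likelihood(y, alpha, mu, sigma):
    """Mean negative log-likelihood of targets y (shape (N,)) under per-sample
    mixture parameters alpha, mu, sigma (each of shape (N, m))."""
    y = np.asarray(y)[:, None]
    log_comp = (np.log(alpha) - np.log(sigma) - 0.5 * np.log(2.0 * np.pi)
                - 0.5 * ((y - mu) / sigma) ** 2)
    mx = log_comp.max(axis=1, keepdims=True)              # log-sum-exp over components
    log_lik = mx[:, 0] + np.log(np.exp(log_comp - mx).sum(axis=1))
    return float(-np.mean(log_lik))
```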

We note a performance improvement for some datasets for $m>1$. We do not attempt to infer $m$, though choosing its value based on performance on a set of cross-validation data would be an option, as would enforcing regularization over the mixture model posterior through more extensive use of Bayesian inference. We leave these extensions for future research.

Table 4: Negative log-likelihood variation with mm, with standard deviations in brackets.
USIJC EURUSD FX rate WTI Nat Gas VIX index Heating Oil USD index EM ETF
m=1m=1 -1.01 -1.79 -1.75 -1.28 -1.50 -1.63 -0.65 -1.26
(0.65) (0.42) (0.27) (0.72) (0.94) (0.23) (0.38) (0.41)
m=2m=2 -1.05 -0.98 -1.24 -1.33 -1.39 -1.37 -0.83 -1.10
(0.39) (0.33) (0.32) (0.66) (0.78) (0.24) (0.32) (0.32)
m=3m=3 -1.09 -0.67 -1.33 -1.25 -1.48 -1.35 -0.86 -1.09
(0.34) (0.18) (0.37) (0.63) (0.92) (0.26) (0.12) (0.27)
Figure 5: Estimated distributions for (left) Nat Gas with m=2m=2 and (right) VIX index with m=1m=1 over out-of-sample test datasets. True samples are shown as white dots. Red indicates high data likelihood and blue data low likelihood under the MD-CGAN model.

5 Conclusion

In this paper we present the MD-CGAN model, which extends the CGAN [2] methodology, in particular to allow GAN inference of a (multi-modal) posterior distribution over forecast values. In the experiments considered, we find the MD-CGAN approach outperforms other methods on all datasets in which noise is prevalent, including all the financial time series investigated, over long-term forecast horizons. As a GAN model, our approach retains adversarial robustness in forecasting, which we find is most notable when noise is extensively present in the data. Our method is thus particularly well suited to dealing with financial data. Furthermore, MD-CGAN can effectively estimate a flexible posterior distribution, in contrast to standard GAN models, which (almost without exception) produce point-value outputs. Exploiting this rich, multi-modal posterior distribution is not reported in detail here but will feature in follow-up work. In summary, the MD-CGAN model combines the advantageous features of both flexible probabilistic forecasting and GAN methods. We see this as a particularly useful approach for dealing with time series in which noise is significant and for providing robust, long-term forecasts beyond simple point estimates.

References

  • [1] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, in: Advances in neural information processing systems, 2014, pp. 2672–2680.
  • [2] M. Mirza, S. Osindero, Conditional generative adversarial nets, arXiv preprint arXiv:1411.1784.
  • [3] H. Wu, B. Gu, X. Wang, V. Pickert, B. Ji, Design and control of a bidirectional wireless charging system using GAN devices, in: 2019 IEEE Applied Power Electronics Conference and Exposition (APEC), IEEE, 2019, pp. 864–869.
  • [4] J. A. Hodge, K. V. Mishra, A. I. Zaghloul, Joint multi-layer GAN-based design of tensorial RF metasurfaces, in: 2019 IEEE 29th International Workshop on Machine Learning for Signal Processing (MLSP), IEEE, 2019, pp. 1–6.
  • [5] C. B. Barth, P. Assem, T. Foulkes, W. H. Chung, T. Modeer, Y. Lei, R. C. Pilawa-Podgurski, Design and control of a GAN-based, 13-level, flying capacitor multilevel inverter, IEEE Journal of Emerging and Selected Topics in Power Electronics.
  • [6] C. Esteban, S. L. Hyland, G. Rätsch, Real-valued (Medical) Time Series Generation with Recurrent Conditional GANs, arXiv preprint arXiv:1706.02633.
  • [7] M. Wiese, R. Knobloch, R. Korn, P. Kretschmer, Quant GANs: Deep Generation of Financial Time Series, Quantitative Finance (2020) 1–22.
  • [8] X. Zhou, Z. Pan, G. Hu, S. Tang, C. Zhao, Stock market prediction on high-frequency data using generative adversarial nets, Mathematical Problems in Engineering 2018.
  • [9] Y. Luo, X. Cai, Y. Zhang, J. Xu, Y. Xiaojie, Multivariate time series imputation with generative adversarial networks, in: Advances in Neural Information Processing Systems, 2018, pp. 1596–1607.
  • [10] Y. Yu, W. J. Zhou, Mixture of GANs for Clustering., in: IJCAI, 2018, pp. 3047–3053.
  • [11] S. Gurumurthy, R. Kiran Sarvadevabhatla, R. Venkatesh Babu, Deligan: Generative adversarial networks for diverse and limited data, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 166–174.
  • [12] M. Ben-Yosef, D. Weinshall, Gaussian mixture generative adversarial networks for diverse datasets, and the unsupervised clustering of images, arXiv preprint arXiv:1808.10356.
  • [13] H. Eghbal-zadeh, W. Zellinger, G. Widmer, Mixture density generative adversarial networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 5820–5829.
  • [14] E. Richardson, Y. Weiss, On GANs and GMMs, in: Advances in Neural Information Processing Systems, 2018, pp. 5847–5858.
  • [15] C. M. Bishop, Pattern recognition and machine learning, Springer, 2006.
  • [16] A. Papoulis, Probability, Random Variables, and Stochastic Processes., McGraw–Hill, 1984.
  • [17] F. Chollet, et al., Keras, https://keras.io (2015).
  • [18] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980.
  • [19] M. C. Mackey, L. Glass, Oscillation and chaos in physiological control systems, Science 197 (4300) (1977) 287–289.
  • [20] F. Clette, et al., WDC-SILSO, http://www.sidc.be/silso/ (2015).
  • [21] Bureau of Labor Statistics, United States Department of Labor, https://www.bls.gov/ (2018).
  • [22] Investing.com, Euro US Dollar Daily Price, https://www.investing.com/currencies/eur-usd-historical-data (2018).
  • [23] U.S. Energy Information Administration, https://www.eia.gov/ (2020).
  • [24] Microsoft Corp. (MSFT), Yahoo!Finance, CBOE Volatility Index Historical Data, https://finance.yahoo.com/quote/%5EVIX/history?p=%5EVIX (2020).
  • [25] Microsoft Corp. (MSFT), Yahoo!Finance, Invesco DB US Dollar Index Bullish Fund Historical Data, https://finance.yahoo.com/quote/UUP/history?p=UUP (2020).
  • [26] Microsoft Corp. (MSFT), Yahoo!Finance, iShares MSCI Brazil Small-Cap ETF Historical Data, https://finance.yahoo.com/quote/EWZS/history?p=EWZS (2020).