Mixture Density Conditional Generative Adversarial Network Models (MD-CGAN)
Abstract
Generative Adversarial Networks (GANs) have gained significant attention in recent years, with impressive applications highlighted in computer vision in particular. Compared to such examples, however, there have been more limited applications of GANs to time series modelling, including forecasting. In this work, we present the Mixture Density Conditional Generative Adversarial Network (MD-CGAN), with a focus on time series forecasting. We show that our model is capable of estimating a probabilistic posterior distribution over forecasts and that, in comparison to a set of benchmark methods, the MD-CGAN model performs well, particularly in situations where noise is a significant component of the observed time series. Further, by using a Gaussian mixture model as the output distribution, MD-CGAN offers posterior predictions that are non-Gaussian.
1 Introduction
Generative Adversarial Networks (GANs) have been one of the major breakthroughs in Deep Learning methods in recent years. Several variations of the model have been proposed since the method was first introduced in [1]. One of the most popular is the Conditional Generative Adversarial Network (CGAN) [2], in which the generator and discriminator (we briefly review the GAN process in Section 2) are both conditioned on some observed information. In time series forecasting, future values are conditioned on information observed from the past, either from the time series itself, from some set of associated exogenous data, or from a combination of the two. This addition to the GAN formalism makes the CGAN approach particularly useful as a foundational model for time series prediction. Most applications of (C)GANs, however, have been within computer vision and, to a lesser extent, in natural language processing and simulation models [3, 4, 5].
The literature on the application of any form of GAN model to problems associated with time series is, to date, limited. However, some recent literature shows the potential usefulness of the method. For example, the work reported in [6] applies a recurrent GAN to generate realistic, synthetic, medical data series and in [7] a (standard) GAN model is used to generate realistic financial asset prices and analyse their distributions.
In [8] a GAN is used to forecast high-frequency stock data and in [9] GANs are used to generate missing values in incomplete time series. We note that the GAN models used in all these applications produce point estimates. Although point estimation is a perfectly valid approach, and one with a long history in time series forecasting, we argue that probabilistic forecasts are a prerequisite in many application domains, in which knowledge of the predictive uncertainty is as vital as the prediction itself. In this work, we expand on the CGAN algorithm to allow a full predictive probability distribution, rather than a point value. To obtain richer predictive densities we model the posterior distribution using a finite Gaussian mixture model (GMM). Although we find only occasional evidence that such non-Gaussian predictions offer significant benefits, we note that producing them is not much more costly than single Gaussian predictive distributions, and so present our approach as a more general multi-component model.
1.1 Related work
Here we briefly review recent literature close to our approach. We start by noting the work of [10], in which a mixture of GAN models is proposed for data clustering. Although clearly related, this is somewhat different to our approach, in which we use a single GAN generator, linked to the parameters of the posterior mixture model, rather than a mixture of GAN models; furthermore, our goal is forecasting rather than unsupervised data classification. The approaches advocated in [11] and [12] formulate the latent space, from which the GAN's latent vectors are sampled, as a Gaussian mixture, replacing the single Gaussian distribution used for such sampling in standard GAN models. In both these papers, the generator and discriminator retain a structure similar to a standard GAN. In [13] a mixture model is used, but for the discriminator alone, with the generator being that of a standard GAN model; we note the difference to our approach, in which the generator outputs a GMM. Finally, [14] compare (standard) GAN models to GMMs for image generation. The authors show that GANs are superior in their ability to generate sharp images, but note that mixture models offer more efficient inference. They propose a combination of the two, introducing a GAN model in which the generator is a mixture model; however, the sample generator still makes point estimates from the multimodal distribution in order to retain a discriminator of the same form as in a standard GAN model. We offer discriminator extensions which allow the GAN process to operate on the full (multi-modal) posterior distribution.
1.2 Paper structure
The rest of this paper is set out as follows: in Section 2 we provide a brief overview of the (C)GAN model, introducing the key concepts. In Section 3 we present the structure of the MD-CGAN model. In Section 4 we test the model on a variety of ‘real-world’ datasets and discuss the results. Finally, in Section 5, we conclude.
2 The GAN and CGAN Model
The goal of the GAN model is to estimate a generative model using an adversarial process [1]. This is achieved by simultaneously training two models: firstly, a generative model $G$ which, in the case of data forecasting, learns past patterns in the data and infers the predictive values; secondly, a discriminative model $D$, which determines how likely a sample is to have originated from the 'true' training data rather than from the generator. The generator is hence matched against an adversary, the discriminator, whose goal is to detect the difference between a true data sample and one created by the generator. Components of the model are then trained (via an optimization process) to maximize the probability of the discriminator being unable to distinguish true from generated data samples. Typically, and in the approach we take here, the generator and the discriminator are both constructed as multilayer perceptrons, with stochastic gradient methods employed for optimization.
We start by formulating the definitions which we use in common across the GAN models we test. We consider a time series, $\{x_t\}$. Our aim is to estimate the forecast of some $x_{t+1}$, conditioned on a set of past observations which we denote $\mathbf{x}$. The inputs to the generator network are $\mathbf{x}$ and $\mathbf{z}$, where $\mathbf{z}$ is a collection of samples drawn from a normal distribution, $\mathcal{N}(0, 1)$. During model training, the output from the generator, $G(\mathbf{z} \mid \mathbf{x})$, as well as the true forecast sample $x_{t+1}$, are fed to the discriminator, whose role is to discriminate between them, i.e. to identify $G(\mathbf{z} \mid \mathbf{x})$ as the 'fake' sample.
In an unconditioned GAN model, there is no control over the data that is generated. In the CGAN model, in contrast, by conditioning the model on additional information, it is possible to direct the data generation process [2]. Schematics of the GAN and CGAN models are depicted in Figure 1.
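For reference, the conditional objective introduced in [2], written in the notation above (with $\mathbf{x}$ the conditioning information and $\mathbf{z}$ the noise input), is the familiar two-player minimax game:

$$\min_G \max_D \, V(D, G) = \mathbb{E}_{x_{t+1} \sim p_{\mathrm{data}}}\big[\log D(x_{t+1} \mid \mathbf{x})\big] + \mathbb{E}_{\mathbf{z} \sim \mathcal{N}(0,1)}\big[\log\big(1 - D(G(\mathbf{z} \mid \mathbf{x}) \mid \mathbf{x})\big)\big].$$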
3 The MD-CGAN Model Framework
As with the GAN and CGAN methods, we consider a time series, $\{x_t\}$. Our aim now is to infer the posterior distribution over some $x_{t+1}$, conditioned on the set of observations, $\mathbf{x}$. In order to form the posterior distribution we model the full conditional density as an adversarial network. To achieve this we use a Mixture Density Network (MDN) model [15] for the generator $G$. The inputs to the generator network are as per the CGAN approach, $\mathbf{x}$ and $\mathbf{z}$, where $\mathbf{z}$ is, as before, a collection of samples from a normal distribution, $\mathcal{N}(0, 1)$. The outputs of $G$ are now, however, the parameters of the Gaussian mixture posterior over the forecast. This mixture has mixing coefficient, standard deviation and mean for the $i$-th component denoted as $\pi_i$, $\sigma_i$ and $\mu_i$ respectively. As first proposed in [15], we achieve this by using latent variables $a^{\pi}_i$, $a^{\sigma}_i$ and $a^{\mu}_i$, conditioned on the inputs; the mapping from $(\mathbf{x}, \mathbf{z})$ to these latent variables is modelled via our network. As the mixing coefficients must satisfy $\sum_{i=1}^{m} \pi_i = 1$, we map $a^{\pi}_i$ to $\pi_i$ via the softmax function, $\pi_i = \exp(a^{\pi}_i) / \sum_{j=1}^{m} \exp(a^{\pi}_j)$, where $m$ is the number of mixture components. The elements of $\boldsymbol{\sigma}$ are strictly positive so we adopt $\sigma_i = \exp(a^{\sigma}_i)$. Finally, the means can be mapped directly from the latent variables, hence $\mu_i = a^{\mu}_i$. Schematically, the MD-CGAN method is depicted in Figure 2.
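As an illustration, this output mapping can be sketched as follows (NumPy; the function and argument names are our own, not taken from the paper's implementation):

```python
import numpy as np

def mixture_parameters(a_pi, a_sigma, a_mu):
    """Map raw (unconstrained) network outputs, one value per mixture
    component, onto valid GMM parameters: softmax for the mixing
    coefficients, exponential for the strictly positive standard
    deviations, identity for the means."""
    a_pi = np.asarray(a_pi, dtype=float)
    a_pi = a_pi - a_pi.max()                          # for numerical stability
    pi = np.exp(a_pi) / np.exp(a_pi).sum()            # sum_i pi_i = 1
    sigma = np.exp(np.asarray(a_sigma, dtype=float))  # sigma_i > 0
    mu = np.asarray(a_mu, dtype=float)                # unconstrained means
    return pi, sigma, mu
```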
The above formalism allows us to directly model the predictive likelihood of $x_{t+1}$, conditioned on the observations $\mathbf{x}$ and samples $\mathbf{z}$, as:
$$p(x_{t+1} \mid \mathbf{x}, \mathbf{z}) = \sum_{i=1}^{m} \pi_i(\mathbf{x}, \mathbf{z}) \, \mathcal{N}\!\left(x_{t+1} \,\middle|\, \mu_i(\mathbf{x}, \mathbf{z}), \sigma_i^2(\mathbf{x}, \mathbf{z})\right) \tag{1}$$
where $m$ is the number of mixture components.
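A minimal sketch of evaluating Equation 1 for a scalar forecast (NumPy; names are illustrative):

```python
import numpy as np

def gmm_density(x_next, pi, mu, sigma):
    """Evaluate the predictive mixture density of Equation 1 at the scalar
    forecast value x_next, given per-component mixing coefficients pi,
    means mu and standard deviations sigma (arrays of length m)."""
    pi, mu, sigma = (np.asarray(a, dtype=float) for a in (pi, mu, sigma))
    comp = np.exp(-0.5 * ((x_next - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    return float(np.sum(pi * comp))
```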
As in the CGAN model, the discriminator, $D$, is also conditioned on $\mathbf{x}$. The input to the discriminator model is, by design, a density value rather than a sample: the generated mixture density of Equation 1, evaluated at $x_{t+1}$, for generated inputs, and $\mathcal{N}(x_{t+1} \mid x_{t+1}, \sigma_x)$ for true inputs, where $\sigma_x$ is the standard deviation of the set of observed $x_t$ (we note that the GAN approach is not sensitive, within reason, to this value, as it is, in effect, a constant in the update equations; we discuss its setting later in the paper). For true values of $x_{t+1}$, this density is maximized. The generator tries to 'fool' the discriminator by generating mixture parameters such that the discriminator's output on the generated density is maximized. The loss function for the generator, $\mathcal{L}_G$, is given in Equation 2. The discriminator network, on the other hand, attempts to differentiate between true values and the pseudo-values created by the generator; its loss function, $\mathcal{L}_D$, is given in Equation 3. We note that the lowest value of the discriminator loss is achieved when $D$ applied to the true input is maximal (unity) and $D$ applied to the generated input is minimal (zero).
$$\mathcal{L}_G = \log\!\left(1 - D\!\left(p(x_{t+1} \mid \mathbf{x}, \mathbf{z}) \,\middle|\, \mathbf{x}\right)\right) \tag{2}$$
$$\mathcal{L}_D = -\log D\!\left(\mathcal{N}(x_{t+1} \mid x_{t+1}, \sigma_x) \,\middle|\, \mathbf{x}\right) - \log\!\left(1 - D\!\left(p(x_{t+1} \mid \mathbf{x}, \mathbf{z}) \,\middle|\, \mathbf{x}\right)\right) \tag{3}$$
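A per-sample sketch of these losses (NumPy; d_real and d_fake denote the discriminator's outputs on the true and generated density inputs respectively, and are assumptions of this sketch rather than names used in the paper):

```python
import numpy as np

def generator_loss(d_fake):
    # Decreases as the discriminator is fooled, i.e. as d_fake -> 1 (Equation 2).
    return np.log(1.0 - d_fake)

def discriminator_loss(d_real, d_fake):
    # Minimal (zero) when d_real = 1 and d_fake = 0 (Equation 3).
    return -(np.log(d_real) + np.log(1.0 - d_fake))
```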
Our algorithm thus follows the steps in Algorithm 1 below.
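Algorithm 1 is not reproduced verbatim here; the following is a schematic sketch of the alternating optimisation it describes, under the assumption that the two callables each apply one optimiser step on the discriminator loss (Equation 3) and generator loss (Equation 2). All names and default values are illustrative.

```python
import numpy as np

def train_md_cgan(update_discriminator, update_generator,
                  data_windows, targets, z_dim=10, k=1,
                  n_iter=5000, batch_size=64, seed=0):
    """Schematic alternating adversarial training loop: for each iteration,
    take k discriminator steps followed by one generator step, each on a
    freshly sampled mini-batch of (window, target, noise) triples."""
    rng = np.random.default_rng(seed)
    for _ in range(n_iter):
        idx = rng.integers(0, len(targets), size=batch_size)
        x, x_next = data_windows[idx], targets[idx]
        for _ in range(k):                         # k discriminator steps per iteration
            z = rng.standard_normal((batch_size, z_dim))
            update_discriminator(x, x_next, z)     # one step on Equation 3
        z = rng.standard_normal((batch_size, z_dim))
        update_generator(x, x_next, z)             # one step on Equation 2
```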
4 Experiments
4.1 Comparison with other Learning Models
To provide a range of comparisons to methods related to this work, we compare our MD-CGAN model to the following baseline methods: the Mixture Density Network (MDN) model, chosen as a baseline for mixture density outputs [15]; the CGAN model, chosen as a well-known GAN approach [2]; and a “standard” Multi-Layer Perceptron (MLP) neural network (SNN) as a simple, yet effective, baseline. As a more traditional, parametric benchmark we use regular (linear-Gaussian) Auto-Regressive (AR) models [16], with parameters obtained by standard least-squares maximum-likelihood estimation. To promote as fair a comparison as possible for the nonlinear methods, we use a common neural network architecture across the core components of the models, choosing the neural network structure commonly used for GAN models in recent literature [2]. We do not alter this structure throughout our experiments and we keep to fixed hyperparameter settings (guided by those used in [2]). Figure 3 provides a schematic of the structure used. We note that, whilst the lengths of the input and output vectors are model-dependent, the structure of all networks remains constant.
4.2 Details of implementation
All models were coded in Python, using the Keras library [17] to build the neural networks.
The neural networks in Figure 3 follow the structure of the CGAN model detailed in [2]. The hyperparameters of the models were set as follows: for the discriminator, in both CGAN and MD-CGAN, the dropout rate is set to 0.4 and the leaky ReLU alpha is set to 0.2; for the generator, the dropout rate is set to 0.5 and the leaky ReLU alpha is, again, 0.2. The number of neurons in the dense layers of the neural network modules is set to 20. The parameter $\sigma_x$ in the MD-CGAN models (Section 3) is set to 0.2. During training we follow the steps of Algorithm 1, with the number of discriminator updates per generator update set to 1 for all datasets.
The number of training iterations is, however, specific to each model and dataset. For all but the GAN models, we monitor the training errors until they saturate. For the GAN models, in which both the generator and the discriminator have a loss value at each iteration, we instead monitor the average sample error on the training data and continue training until this error saturates.
In all models we optimize parameters using the Adam optimizer [18], with the learning rate set to 0.001, the exponential decay rate for the first moment estimate set to 0.9, the exponential decay rate for the second moment estimate set to 0.999, and epsilon set to $10^{-8}$.
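As an illustration of how such a module might be assembled in Keras, a sketch is given below. The latent dimension, the number of hidden blocks and the exact wiring are our assumptions, not the precise architecture of Figure 3.

```python
from tensorflow import keras
from tensorflow.keras import layers

H = 20  # neurons per dense layer, as listed above

def dense_block(h, dropout_rate):
    """One hidden block of the shared architecture: Dense -> LeakyReLU -> Dropout."""
    h = layers.Dense(H)(h)
    h = layers.LeakyReLU(0.2)(h)            # leaky ReLU with alpha = 0.2
    h = layers.Dropout(dropout_rate)(h)
    return h

def build_generator(d=5, z_dim=10, m=1):
    """Sketch of an MD-CGAN generator mapping (observation window, noise)
    to 3*m unconstrained mixture outputs (a_pi, a_sigma, a_mu)."""
    x_in = keras.Input(shape=(d,))
    z_in = keras.Input(shape=(z_dim,))
    h = layers.Concatenate()([x_in, z_in])
    h = dense_block(h, dropout_rate=0.5)    # generator dropout rate
    h = dense_block(h, dropout_rate=0.5)
    out = layers.Dense(3 * m)(h)            # raw mixture parameters per component
    return keras.Model([x_in, z_in], out)

# Adam configured with the settings listed above.
adam = keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9,
                             beta_2=0.999, epsilon=1e-8)
```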
4.3 Data
We perform experiments on ten datasets of differing provenance: the Mackey-Glass chaotic time series [19], the Sunspot dataset [20], and eight financial time series. We first look at one-step forecasting and the issue of (test-set) noise resilience, focusing on controlled experiments with the Mackey-Glass and Sunspot datasets (Subsection 4.4). We then expand our experiments to consider the financial datasets and increased forecast horizons in Subsection 4.5.
For all datasets we use, the time series is split into training and out-of-sample test sets. The training sets in all our experiments comprise 2000 samples and all test sets consist of the 400 sequential data points immediately following the training set. All performance metrics are obtained from the test set only. All nonlinear algorithms are provided, as input, the last $d$ data points, with $d$ set to 5 for the purpose of all our experiments. For the linear AR models, we run both an AR(5) model (corresponding to $d = 5$) and, as a martingale baseline, the AR(0) model, in which the forecast is simply the previous observation. We note, however, that the value of $d$ is not optimized, and is chosen merely to allow simple comparisons across methods. All data sets are pre-normalized to the [0,1] interval, again to allow for simpler comparison across data and methods. We further note that both the CGAN and SNN models make point-estimate predictions, whilst MD-CGAN and MDN estimate posterior distributions. To enable a simple comparison across models we therefore report the mean-square error (MSE) for all methods. The number of mixture components, $m$, in both MD-CGAN and MDN is set to unity (we vary this in Section 4.6), to further ease comparison. The most likely value of the predictive distribution (which for $m = 1$ is simply the posterior mean) is taken as the forecast value for both the MDN and MD-CGAN models for the purposes of point-prediction error reporting.
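A sketch of this data preparation (NumPy; function and variable names are our own, and the normalisation here uses the full series range for simplicity):

```python
import numpy as np

def prepare_series(series, d=5, n_train=2000, n_test=400):
    """Normalise a 1-D series to [0, 1] and build (window, target) pairs
    from the last d observations, split into a 2000-sample training set
    and the 400 points that follow it."""
    s = np.asarray(series, dtype=float)
    s = (s - s.min()) / (s.max() - s.min())           # rescale to [0, 1]
    windows = np.stack([s[i - d:i] for i in range(d, len(s))])
    targets = s[d:]                                   # one-step-ahead targets
    x_train, y_train = windows[:n_train], targets[:n_train]
    x_test = windows[n_train:n_train + n_test]
    y_test = targets[n_train:n_train + n_test]
    return x_train, y_train, x_test, y_test
```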
4.4 One-step forecasting
Mackey-Glass and Sunspot time series: We start our experiments by looking at one-step-ahead forecasts on two well-known data sets, the Mackey-Glass chaotic time series [19] and the Sunspot data set [20]. We consider one-step forecast errors in the presence of increasing test-set noise, adding 5% to 30% (by amplitude) normally distributed noise to the test data; from a GAN perspective, these input perturbations are, in effect, treated as adversarial attacks. We note that no noise is added to the training data. Mean-square errors (MSE) are presented in Tables 1 and 2 and Figure 4 for all algorithms considered. For both datasets we see that MD-CGAN has the best performance for noise levels of 10% and above. Indeed, the GAN models (particularly MD-CGAN) perform consistently well across all noise levels, especially in the Mackey-Glass example, indicating that the approach is particularly resilient to additive observation noise. This is to be expected, as GAN approaches treat the additive noise as an adversarial perturbation to the input, against which they are designed to be robust.
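One plausible reading of the "5% to 30% (by amplitude)" perturbation is sketched below (NumPy; the interpretation of noise level as a fraction of the series amplitude is our assumption):

```python
import numpy as np

def add_test_noise(x_test, level, seed=0):
    """Perturb (normalised) test inputs with zero-mean Gaussian noise whose
    standard deviation is `level` (0.05 to 0.30) times the series amplitude.
    The training data are left untouched."""
    rng = np.random.default_rng(seed)
    amplitude = float(x_test.max() - x_test.min())
    return x_test + rng.normal(0.0, level * amplitude, size=x_test.shape)
```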
In the next section we investigate model performance at longer forecast horizons in the finance domain, where the data are known to contain variable amounts of stochasticity. Financial time series are often dominated by stochastic components, and we expect GAN approaches to be well suited to forecasting in these circumstances.
[Figure 4: one-step forecast MSE as a function of test-set noise level for all models, on the Mackey-Glass and Sunspot datasets.]
Table 1: One-step forecast MSE on the Mackey-Glass dataset under increasing test-set noise.

| Model | 0% noise | 5% noise | 10% noise | 15% noise | 20% noise | 25% noise | 30% noise |
| --- | --- | --- | --- | --- | --- | --- | --- |
| AR(0) | 0.0020 | 0.0042 | 0.0133 | 0.0246 | 0.0418 | 0.0624 | 0.0951 |
| AR(5) | 4.1e-06 | 0.2794 | 1.1558 | 2.6581 | 4.4868 | 7.3152 | 10.1527 |
| SNN | 0.0014 | 0.0047 | 0.0242 | 0.0519 | 0.0570 | 0.1013 | 0.1640 |
| CGAN | 0.0036 | 0.0061 | 0.0155 | 0.0240 | 0.0259 | 0.0347 | 0.0360 |
| MDN | 0.0002 | 0.0064 | 0.0278 | 0.0589 | 0.0780 | 0.0980 | 0.1402 |
| MD-CGAN | 0.0026 | 0.0044 | 0.0126 | 0.0165 | 0.0197 | 0.0233 | 0.0264 |
Table 2: One-step forecast MSE on the Sunspot dataset under increasing test-set noise.

| Model | 0% noise | 5% noise | 10% noise | 15% noise | 20% noise | 25% noise | 30% noise |
| --- | --- | --- | --- | --- | --- | --- | --- |
| AR(0) | 0.0080 | 0.0109 | 0.0200 | 0.0262 | 0.0494 | 0.0795 | 0.0974 |
| AR(5) | 0.0068 | 0.0081 | 0.0137 | 0.0138 | 0.0226 | 0.0314 | 0.0409 |
| SNN | 0.0114 | 0.0132 | 0.0154 | 0.0201 | 0.0335 | 0.0401 | 0.0536 |
| CGAN | 0.0137 | 0.0143 | 0.0149 | 0.0161 | 0.0228 | 0.0269 | 0.0266 |
| MDN | 0.0105 | 0.0140 | 0.0161 | 0.0263 | 0.0384 | 0.0592 | 0.0758 |
| MD-CGAN | 0.0093 | 0.0096 | 0.0113 | 0.0126 | 0.0159 | 0.0194 | 0.0203 |
4.5 Financial forecasts over longer-horizons
One-step forecasts were presented in Subsection 4.4. Here we extend the analysis to financial data over longer horizons. All models were used to make estimates over a horizon of ten weeks for an extended set of financial time series, namely: US initial jobless claims (USIJC, weekly intervals, [21]), EURUSD foreign exchange daily rates (EURUSD FX rate, [22]), WTI crude oil spot prices (WTI, [23]), Henry Hub natural gas spot prices (Nat Gas, [23]), the CBOE Volatility Index (VIX, [24]), New York Harbor No. 2 heating oil spot prices (Heating Oil, [23]), the Invesco DB US Dollar Index Bullish Fund (USD Index, [25]), and the iShares MSCI Brazil Small-Cap ETF (EM ETF, [26]). The ten-week horizon represents a 50-step forecast for the daily datasets (FX, WTI, Nat Gas, VIX, Heating Oil, USD Index and EM ETF) and a 10-step forecast for the weekly USIJC dataset. Our comparisons, as previously, include standard linear econometric models, namely the 5th-order autoregressive, AR(5), model and the martingale, or AR(0), model, in which the forecast is the last observed datum. Taking the martingale model as the baseline, we present in Table 3 the mean-square errors as a ratio to the martingale model error. We note that the MD-CGAN approach delivers ratios below unity for all datasets and provides the lowest error of all models on all but one dataset (Heating Oil) in this scenario.
Table 3: MSE over the ten-week forecast horizon, reported as a ratio to the martingale (AR(0)) model error.

| Model | USIJC | EURUSD FX rate | WTI | Nat Gas | VIX index | Heating Oil | USD Index | EM ETF |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AR(5) | 0.78 | 1.91 | 0.85 | 1.01 | 0.71 | 0.82 | 1.24 | 0.89 |
| SNN | 0.79 | 1.25 | 0.89 | 0.94 | 0.71 | 0.93 | 1.34 | 0.82 |
| CGAN | 0.77 | 0.85 | 1.53 | 1.07 | 0.91 | 0.54 | 1.37 | 0.69 |
| MDN | 0.84 | 3.48 | 1.48 | 1.13 | 0.77 | 0.89 | 0.68 | 0.81 |
| MD-CGAN | 0.73 | 0.76 | 0.80 | 0.82 | 0.66 | 0.59 | 0.54 | 0.65 |
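For clarity, a sketch of how such ratios can be computed is given below (NumPy; holding the last pre-horizon observation flat as the martingale forecast is one plausible convention, not necessarily the exact one used for Table 3):

```python
import numpy as np

def mse_ratio_to_martingale(last_obs, y_true, y_model):
    """MSE of a model's forecast over the horizon, divided by the MSE of the
    martingale (AR(0)) forecast, taken here as the last observation before
    the horizon repeated across all steps."""
    y_true = np.asarray(y_true, dtype=float)
    y_model = np.asarray(y_model, dtype=float)
    mse_model = np.mean((y_true - y_model) ** 2)
    mse_martingale = np.mean((y_true - last_obs) ** 2)
    return float(mse_model / mse_martingale)
```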
4.6 Multi-modal posterior predictions
Finally, we compare the performance of MD-CGAN over varying numbers of mixture components. In all the previous experiments we set $m = 1$ (hence the model produced a single predictive Gaussian posterior). Here we briefly present results for the finance datasets with up to three mixture components. We report (negative) log-likelihood measures (as we do not compare against point-value models in this section) and consider one-step forecasts on all the data sets. Table 4 presents the performance across data sets for varying numbers of mixture components in the posterior prediction. Figure 5 shows the predicted distributions over the test data for the Nat Gas and VIX index datasets.
We note a performance improvement for some datasets when $m > 1$. We do not attempt to infer $m$; choosing its value based on performance on a set of cross-validation data would be an option, as would enforcing regularization over the mixture model posterior through more extensive use of Bayesian inference. We leave these extensions for future research.
Table 4: (Negative) log-likelihood measures for one-step MD-CGAN forecasts with varying numbers of mixture components, $m$.

| | USIJC | EURUSD FX rate | WTI | Nat Gas | VIX index | Heating Oil | USD Index | EM ETF |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| $m = 1$ | -1.01 (0.65) | -1.79 (0.42) | -1.75 (0.27) | -1.28 (0.72) | -1.50 (0.94) | -1.63 (0.23) | -0.65 (0.38) | -1.26 (0.41) |
| $m = 2$ | -1.05 (0.39) | -0.98 (0.33) | -1.24 (0.32) | -1.33 (0.66) | -1.39 (0.78) | -1.37 (0.24) | -0.83 (0.32) | -1.10 (0.32) |
| $m = 3$ | -1.09 (0.34) | -0.67 (0.18) | -1.33 (0.37) | -1.25 (0.63) | -1.48 (0.92) | -1.35 (0.26) | -0.86 (0.12) | -1.09 (0.27) |
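For completeness, the kind of log-likelihood measure reported above can be computed as in the sketch below (NumPy; names are our own and the exact aggregation used for Table 4 may differ in detail):

```python
import numpy as np

def mean_log_likelihood(y_test, pi, mu, sigma):
    """Average log-likelihood of the test targets under the predicted
    mixtures; pi, mu and sigma each hold one row of m mixture parameters
    per test point (arrays of shape (n_test, m))."""
    y = np.asarray(y_test, dtype=float)[:, None]
    pi, mu, sigma = (np.asarray(a, dtype=float) for a in (pi, mu, sigma))
    comp = np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    return float(np.mean(np.log(np.sum(pi * comp, axis=1))))
```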
[Figure 5: predicted MD-CGAN posterior distributions over the test period for the Nat Gas and VIX index datasets.]
5 Conclusion
In this paper we present the MD-CGAN model, which extends the CGAN [2] methodology, particularly to allow GAN inference of a (multi-modal) posterior distribution over forecast values. In the experiments considered, we find the MD-CGAN approach outperforms the other methods on datasets in which noise is prevalent, including almost all the financial time series investigated over long-term forecast horizons. As a GAN model, our approach retains adversarial robustness in forecasting, which we find is most notable when significant noise is present in the data; our method is thus particularly well suited to dealing with financial data. Furthermore, MD-CGAN can effectively estimate a flexible posterior distribution, in contrast to standard GAN models which (almost without exception) produce point-value outputs. Exploiting this rich, multi-modal posterior distribution is not reported in detail here but will feature in follow-up work. In summary, the MD-CGAN model combines the advantageous features of both flexible probabilistic forecasting and GAN methods. We see it as a particularly useful approach for dealing with time series in which noise is significant and for providing robust, long-term forecasts beyond simple point estimates.
References
- [1] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, in: Advances in neural information processing systems, 2014, pp. 2672–2680.
- [2] M. Mirza, S. Osindero, Conditional generative adversarial nets, arXiv preprint arXiv:1411.1784.
- [3] H. Wu, B. Gu, X. Wang, V. Pickert, B. Ji, Design and control of a bidirectional wireless charging system using GAN devices, in: 2019 IEEE Applied Power Electronics Conference and Exposition (APEC), IEEE, 2019, pp. 864–869.
- [4] J. A. Hodge, K. V. Mishra, A. I. Zaghloul, Joint multi-layer GAN-based design of tensorial RF metasurfaces, in: 2019 IEEE 29th International Workshop on Machine Learning for Signal Processing (MLSP), IEEE, 2019, pp. 1–6.
- [5] C. B. Barth, P. Assem, T. Foulkes, W. H. Chung, T. Modeer, Y. Lei, R. C. Pilawa-Podgurski, Design and control of a GAN-based, 13-level, flying capacitor multilevel inverter, IEEE Journal of Emerging and Selected Topics in Power Electronics.
- [6] C. Esteban, S. L. Hyland, G. Rätsch, Real-valued (Medical) Time Series Generation with Recurrent Conditional GANs, arXiv preprint arXiv:1706.02633.
- [7] M. Wiese, R. Knobloch, R. Korn, P. Kretschmer, Quant GANs: Deep Generation of Financial Time Series, Quantitative Finance (2020) 1–22.
- [8] X. Zhou, Z. Pan, G. Hu, S. Tang, C. Zhao, Stock market prediction on high-frequency data using generative adversarial nets, Mathematical Problems in Engineering 2018.
- [9] Y. Luo, X. Cai, Y. Zhang, J. Xu, Y. Xiaojie, Multivariate time series imputation with generative adversarial networks, in: Advances in Neural Information Processing Systems, 2018, pp. 1596–1607.
- [10] Y. Yu, W. J. Zhou, Mixture of GANs for Clustering., in: IJCAI, 2018, pp. 3047–3053.
- [11] S. Gurumurthy, R. Kiran Sarvadevabhatla, R. Venkatesh Babu, Deligan: Generative adversarial networks for diverse and limited data, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 166–174.
- [12] M. Ben-Yosef, D. Weinshall, Gaussian mixture generative adversarial networks for diverse datasets, and the unsupervised clustering of images, arXiv preprint arXiv:1808.10356.
- [13] H. Eghbal-zadeh, W. Zellinger, G. Widmer, Mixture density generative adversarial networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 5820–5829.
- [14] E. Richardson, Y. Weiss, On GANs and GMMs, in: Advances in Neural Information Processing Systems, 2018, pp. 5847–5858.
- [15] C. M. Bishop, Pattern recognition and machine learning, Springer, 2006.
- [16] A. Papoulis, Probability, Random Variables, and Stochastic Processes., McGraw–Hill, 1984.
- [17] F. Chollet, et al., Keras, https://keras.io (2015).
- [18] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980.
- [19] M. C. Mackey, L. Glass, Oscillation and chaos in physiological control systems, Science 197 (4300) (1977) 287–289.
- [20] F. Clette, et al., WDC-SILSO, http://www.sidc.be/silso/ (2015).
- [21] Bureau of Labor Statistics, United States Department of Labor, https://www.bls.gov/ (2018).
- [22] Investing.com, Euro US Dollar Daily Price, https://www.investing.com/currencies/eur-usd-historical-data (2018).
- [23] U.S. Energy Information Administration, https://www.eia.gov/ (2020).
- [24] Microsoft Corp. (MSFT), Yahoo!Finance, CBOE Volatility Index Historical Data, https://finance.yahoo.com/quote/%5EVIX/history?p=%5EVIX (2020).
- [25] Microsoft Corp. (MSFT), Yahoo!Finance, Invesco DB US Dollar Index Bullish Fund Historical Data, https://finance.yahoo.com/quote/UUP/history?p=UUP (2020).
- [26] Microsoft Corp. (MSFT), Yahoo!Finance, iShares MSCI Brazil Small-Cap ETF Historical Data, https://finance.yahoo.com/quote/EWZS/history?p=EWZS (2020).