Fully Embedded Time-Series Generative Adversarial Networks

Joe Beck
[email protected]
University of Tennessee
Department of Mechanical, Aerospace, & Biomedical Engineering
Knoxville, TN, USA
ORCID: 0000-0003-4257-1138 &Subhadeep Chakraborty
[email protected] (Corresponding author)
University of Tennessee
Department of Mechanical, Aerospace, & Biomedical Engineering
Knoxville, TN, USA
ORCID: 0000-0001-5035-9925

Abstract

Generative Adversarial Networks (GANs) should produce synthetic data that fits the underlying distribution of the data being modeled. For real valued time-series data, this implies the need to simultaneously capture the static distribution of the data, but also the full temporal distribution of the data for any potential time horizon. This temporal element produces a more complex problem that can potentially leave current solutions under-constrained, unstable during training, or prone to varying degrees of mode collapse. In FETSGAN, entire sequences are translated directly to the generator’s sampling space using a seq2seq style adversarial autoencoder (AAE), where adversarial training is used to match the training distribution in both the feature space and the lower dimensional sampling space. This additional constraint provides a loose assurance that the temporal distribution of the synthetic samples will not collapse. In addition, the First Above Threshold (FAT) operator is introduced to supplement the reconstruction of encoded sequences, which improves training stability and the overall quality of the synthetic data being generated. These novel contributions demonstrate a significant improvement to the current state of the art for adversarial learners in qualitative measures of temporal similarity and quantitative predictive ability of data generated through FETSGAN.

Keywords: Generative Adversarial Networks (GANs), adversarial autoencoder, synthetic time series data

Statements and Declarations

Competing Interests

The authors declare no known competing interests in the content of this manuscript.

Funding

Funding for this work was partially provided by the Collaborative Sciences Center for Road Safety (CSCRS), as well as the University of Tennessee, Knoxville.

1 Introduction

Generative modeling is the field of research concerned with producing new and unique data that is similar to the data that was used to produce the model. More specifically, we can say that this similarity is defined in terms of the ability to model the underlying distribution represented by the training data. In the case of sequential vector $x_{1:T}$ with length $T$ , the data distribution is characterized by the temporal distribution $p(x_{1},...,x_{T})$ . Even with relatively simple datasets where the vector $x_{t}$ is low dimensional, compounding dependencies in the temporal distribution increase with $T$ until it becomes difficult to measure or even visualize the similarity or differences between the temporal distributions of the training data and the generated data. Generative Adversarial Networks (GANs) have demonstrated exceptional ability in modeling complex distributions such as these. However, GANs are notoriously difficult to train, with instability often preventing convergence and the final generative models featuring some degree of mode collapse, where only a portion of the full target distribution is represented in the synthetic samples.

Refer to caption — Figure 1: The overall training scheme is shown. Above the dashed line, the RNN style architectures are detailed, showing the model outputs (red) of each network as a function of the inputs. $h^{(d)}_{t}$ indicates a hidden state of the network, where $d$ represents the weights associated with a specific depth and $t$ represents that state any specific time. FC indicates a fully connected output layer, and these weights are shared for every output across time, i.e. in the case of the generator $\hat{{x}}_{1:T}$ . Below the dashed line, the training flow is visualized. The mechanisms for producing FETSGAN’s five objective functions are shown in red.

Like much of the work surrounding GANs, the novel process presented here provides additional constraints to the adversarial learning process that regularizes learning, resulting in greater stability during training, higher quality data, and less susceptibility to mode collapse, specifically in the temporal distributions. The architecture presented is a modification of RCGAN[1] or C-RNN-GAN[2] that features a seq2seq style AAE as encoder and decoder for data generation [3], [4]. The complete model is visualized in Fig. 1. There are three primary benefits to using an adversarial autoencoder. First, there is an additional constraint that matches the posterior distribution of the encodings to the prior distribution of the samples at inference time, further combating mode collapse beyond feature-level adversarial training. Second, the seq2seq encoder is forced to summarize the entire sequence at once, allowing it to capture relevant time dependencies of arbitrary length. This can be compared to the approach in the TimeGAN, where the regularizing effect of the teacher-forced supervised losses of the encodings are only applied one time-step into the future [5]. Finally, the posterior encodings of adversarial autoencoders are natively interpretable. This allows fine control over the style of the synthetic data being generated, even in the completely unsupervised learning setting of this work.

The decoder in our framework can potentially suffer from the same supervised training problems as any other autoregressive model where compounding errors over time across shared weights can cause slow or unstable training, especially during the reconstruction of long encoded sequences. Typically this is resolved with some variation of teacher-forcing [6], [7], where the network is provided the ground truth from the previous timestep, and learns to predict only one timestep into the future. This method often leaves the network with some degree of exposure bias, where the compounding error of the model’s own predictions are neglected. Additionally in the context of real-valued time-series generation, statistical properties in the data may produce harsh local minimums for regression based optimization. In this work, we do not use an autoregressive decoder at all. Instead, our solution to reconstruction loss is coined First Above Threshold (FAT) loss. In this scheme, stochastic gradient descent is only applied to the model parameters at one time instance per generated sequence. This allows the network to learn progressively longer sequences during training, and can be applied to any element-wise loss function in a supervised or unsupervised manner. The work described here is an extension of RCGAN and a reformulation of TimeGAN that utilizes a supervised loss in the feature space as well as an unsupervised loss on the encodings of the data which produces improved matching of the temporal distribution of the training data. In addition, FAT loss can improve the quality of the data being generated, as well as producing a standardizing effect on the training dynamics. For our experiments, stocks data, energy usage data, metro traffic data, and a synthetic sine wave dataset are used. The sequences produced are qualitatively analysed to show significant improvement in preventing mode collapse along the temporal distribution. In addition, we demonstrate the ability to selectively sample our model at inference time to produce realistic data of a specific style. Finally, we measure the performance of our model against the primary adversarial learners in terms of predictive and discriminative scores. FETSGAN shows significant improvement to these methods in all stated dimensions of analysis.

2 Related Work

As a generative model, FETSGAN builds on the adversarial training scheme of RCGAN [1] with mixed supervised and unsupervised learning elements, along with a Recurrent Neural Network (RNN) style architecture for the Encoder, Decoder, and Discriminator, which are common for many sequence modeling tasks [8]. The work most closely matching this description is TimeGAN [5], where the key distinction is our use of an AAE for full time sequences, as opposed to the element-wise encodings in TimeGAN. There is also the application specific R-GAN which combined some time-series heuristic information like Fourier representations with a WGAN approach to produce synthetic energy consumption data [9], [10]. COT-GAN is a recent approach to this problem that defines an adversarial loss through an expansion of the Sinkhorn Divergence into the time domain [11]. Due to the inherent instability of adversarial training, [12] regularizes training with a contrastive model and a training scheme grounded in imitation learning. Finally, [13] incorporates two separate approaches using diffusion models and more traditional constrained optimization techniques to produce time-series data. The experimental comparison here is limited to methods that use the most straightforward approach of applying adversarial learning to the feature space, or latent encodings of the feature space. Namely, these models are TimeGAN and RCGAN. This is partly due to the lack of working implementations of the alternative approaches, and also to the similarity in approach with these methods, as well as the fact that these simpler adversarial approaches still seem to remain the preeminent method of time series data generation where such models are applied, with applications in fields such as medicine [14, 15], energy management [16, 17], and sensor simulation [18]. Additionally, the use of transformers [19] has become very popular in the field of sequence generation, particularly with respect to large language models (LLMs) such as the popular Generative Pre-trained Transformer (GPT) architecture [20]. While we did experiment with transformer-based approaches, we found that the method described here outperformed these methods. We postulate the difference in performance was due to transformer-based methods being auto-regressive in nature, exposing the model to inference bias that was not present in our non-auto-regressive approach.

Beyond the direct comparisons with other models that produce realistic time-series data, FETSGAN also incorporates an interpretive latent space, allowing the selective sampling of the posterior distribution at inference time to reach some desired effect, specified by the user. This bears a direct connection to the field of representation learning, where data is compressed in a meaningful way in order to accomplish some downstream task. This has been accomplished on real valued time series data, as data is embedded in RNN style architectures for the purpose of forecasting [21], supervised learning [22], and data imputation [23]. Perhaps the most well-known interpretive generative models are variational autoencoders (VAEs) [24]. The alternative AAE used here has a close resemblance to this approach, as described in [3], with the primary benefit in our use case being the ability to choose an arbitrary prior distribution instead of a standard Gaussian. The simplistic AAE used in this work has also been extended to utilize Wasserstein loss [25] in the image-generation space, demonstrating some improved stability that is typical of WGANs.

A theme of this work and all recent work on the creation of synthetic time-series is the regularization or complete abandonment of adversarial training as a means to produce more stability, and thus higher quality data. Obviously, this challenge is not unique to the time-series domain, and inspiration can be drawn from any generation process that utilizes adversarial learning, particularly image creation. Given the seq2seq translation element of FETSGAN, it stands to reason inspiration could be drawn from image-to-image translation techniques. We can see that the regularization effect of reconstruction loss or cycle loss is present in many adversarial approaches [26], [27]. WGANs [9] and LSGANs [28] focus primarily on the output layer of the discriminator and the loss function to produce a regularizing effect on reducing exploding or vanishing gradients. Spectral normalization [29], particularly in the weights of the discriminator, assures Lipchitz continuity in the gradients, further ensuring training stability. While not directly applicable to the RNN architecture itself, batch normalization in the discriminator has also demonstrated an ability to speed up adversarial training [30]. Spectral normalization is utilized in the linear layers of the our discriminator’s training process, while reconstruction loss bares a resemblance to the cycle consistency loss that has a regularizing effect on training.

3 Proposed Method

In this section, the methodology of the proposed method is described in full detail. Additional motivation for the work is also provided.

3.1 Problem Formulation

Consider the random vector $X\in\mathcal{X}$ , where individual instances are denoted by $x$ . We operate in a discrete time setting where fixed time intervals exist between samples $X_{t}$ , forming sequences of length $T$ , such that $(X_{1},...,X_{T}):=X_{1:T}\in\mathcal{X}^{T}$ . We note that $T$ may be a constant value or may be a random variable itself, and that the proposed methodology does not differ in either case. The representative dataset of length $N$ is given by $\mathcal{D}=\{x_{n,1:T}\}_{n=1}^{N}$ . For convenience, the subscript $n$ is omitted in further notation.

The data distribution $p(X_{1:T})$ describes the sequences, such that any possible conditional $p(X_{t}\mid X_{t-i})$ for $t\leq T$ and $t-i\geq 0$ is absorbed into $p(X_{1:T})$ . Thus, the overall learning objective is to produce the model distribution $\hat{p}(X_{1:T})$ ,

\displaystyle\min_{\hat{p}}D(p(X_{1:T})\mid\mid\hat{p}(X_{1:T}\ ))

(1)

where $D$ is some measure of divergence between the two distributions. We note here that RCGAN [1] applies adversarial loss that minimizes this divergence directly. In the following works, higher quality data is generated by supplementing this loss function with a supervised loss that is more stable [5], or by avoiding adversarial loss altogether [12]. While the adversarial autoencoder in Section 3.2 is intended to supplement and further stabilize training as well, we also highlight the additional ability of the $\operatorname{FAT}$ operator to stabilize the minimization of this objective on reconstructions in Section 3.3.

3.2 Adversarial Autoencoder

In order to represent complex temporal relationships, we would like to leverage the ability to encode the entire time-series as a low dimensional vector in a seq2seq style model. We introduce the random variable $Z\in\mathcal{Z}$ as an intermediate encoding for producing $\hat{p}(X_{1:T})$ . We also introduce $\eta_{1:T}$ as a random noise vector from a known distribution $p_{\eta}$ sampled independent of time. Let $p_{z}(Z)$ be a known prior distribution, the encoder function $e:(\mathcal{X}^{T},\mathcal{\eta}^{T})\rightarrow\mathcal{Z}$ have the encoding distribution $q(Z\mid X_{1:T},\eta_{1:T})$ , and the generator (or decoder) function $g:(\mathcal{Z},\mathcal{\eta}^{T})\rightarrow\mathcal{X}^{T}$ have the decoding distribution $\hat{p}(X_{1:T}\mid Z,\eta_{1:T})$ .

Also, we can define an ideal mapping function that maps sequences to vectors $M:(\mathcal{X}^{T},\eta^{T})\rightarrow\mathcal{M}$ , such that $M_{X}=M(X_{1:T},\eta_{1:T})$ and $p_{M}(M_{X})=p(X_{1:T})p_{\eta}(\eta_{1:T})$ . The aggregated posterior can now be defined,

\displaystyle q(Z_{X})=\int_{M_{X}}q(Z\mid X_{1:T},\eta_{1:T})p_{M}(M_{X})dM_{X}

(2)

Similar to Eq. 1, encoder training occurs by matching this aggregated posterior distribution $q(Z_{X})$ to an arbitrary prior distribution $p_{Z}(Z)$ .

\displaystyle\min_{q}D(p_{z}(Z)\mid\mid q(Z_{X}))

(3)

The posterior distribution $q(Z_{X})$ is equivalent to the universal approximator function in [3], where the noise $\eta_{t}$ exists to provide stochasticity to the model in the event that not enough exists in the data $\mathcal{D}$ for $q(Z_{X})$ to match $p_{z}(Z)$ deterministically.

The generator is trained through reconstructing the data distribution $p(X_{1:T})$ while being conditioned on the encodings.

\displaystyle\min_{\hat{p}}D(p(X_{1:T})\mid\mid\hat{p}(X_{1:T}\mid Z_{X},\eta_{1:T})))

(4)

If divergence is minimized in Eq. 3 and Eq. 4, then the encoder is able to perfectly replicate the prior distribution, and the generator is perfectly able to reconstruct $X_{1:T}$ , resulting in a complete model of $p(X_{1:T})$ when the prior distribution $p_{z}(Z)$ is sampled and decoded at inference time.

3.3 First Above Threshold (FAT) Operator

In our formulation, the generator model minimizes two measures of divergence. First, there is the direct matching of the target distribution through adversarial training, characterized by Eq. 1. Then, there is the secondary reconstruction loss from the intermediate encodings, characterized by Eq. 4. Reconstructions of time series data are denoted $\bar{x}_{1:T}=g(e(x_{1:T}))$ . The simplest loss function to enforce reconstruction of the original data might be,

\displaystyle\mathcal{L}=\mathbb{E}_{x_{1:T}\sim p}(\sum_{t}\|x_{t}-\bar{x}_{t}\|^{2})

(5)

where this objective is minimized in tandem by the generator and the encoder. This may prove to be challenging as our generator model is not an auto regressive model. It takes the intermediate encodings as input at every timestep, as shown in Fig. 1. This reduces the possibility of a vanishing gradient to the encoder itself, and also prevents the generator from forgetting the initial encoding for long sequences. This does, however, make learning through reconstruction more difficult due to the inability to incorporate teacher forcing methods. In addition, time-series reconstruction with real-valued data faces the optimization problem of large local minimums, where the model may collapse into producing only mean values across any one time-series or the mean of the entire dataset. To alleviate these problems, we propose the solution of only applying reconstruction loss at one time instance $t=\tau$ in the sequence, instead of the entire time series at once. Thus, the reconstruction loss,

\displaystyle\mathcal{L}_{recon}=\mathbb{E}_{x_{1:T}\sim p}(\|x_{\tau}-\bar{x}_{\tau}\|^{2})

(6)

trains the generator in a supervised manner. The question remains which time instance $\tau$ to choose. We propose a simple solution. Prior to training, we define a real-valued threshold $\epsilon$ , such that any error that has compounded to produce the reconstruction $\bar{x}_{t}$ is acceptable so long as $\|x_{t}-\bar{x}_{t}\|^{2}<\epsilon$ . In this way, time-series reconstructions are progressively learned from short-term to long-term, instead of all at once. To this end, the First Above Threshold (FAT) operator is introduced. This operator takes as input a sequence $l_{1:T}$ and a threshold $\epsilon$ , such that the minimum value for $t$ is returned where $l_{t}>\epsilon$ . In the event that $l_{t}<\epsilon\,\forall\,t\in T$ , $\operatorname{FAT}_{t}(l_{1:T},\epsilon)=\operatorname*{arg\,max}_{t}(l_{1:T})$ . With this newly defined operator,

\displaystyle\tau=\operatorname{FAT}_{t}(\|x_{t}-\bar{x}_{t}\|^{2},\epsilon)

(7)

defines $\tau$ , thus defining the complete form of Eq. 6. The benefits are two fold. First, a progressive learning approach stabilizes the early portion of training, as the objective function may be less likely to become stuck at a local minimum while trying to encode and reconstruct long sequences all at once. Second, updating the parameters corresponding to only one time instance at a time has a regularizing effect by providing a more granular gradient that is less likely to interfere with the adversarial training for both the encoder and generator. The combination of both of these effects are demonstrated in Fig. 2, as the training dynamics of the model are compared between using the reconstruction loss of Eq. 5 and Eq. 6. We show that the application of the objective in Eq. 6 actually minimizes the loss of Eq. 5 more effectively than applying it directly in the sines dataset. The efficacy of the $\operatorname{FAT}_{t}$ operator can be expected to grow with longer, more complex sequences.

3.4 Complete Model

Time series data collected in the physical world such as sensor measurements will have some degree of stochasticity. We cannot replicate this stochasticity using a reconstruction objective alone. To this end, we introduce the feature space discriminator $d_{x}:\mathcal{X}^{T}\rightarrow\mathcal{Y}^{T}$ which maps sequences to classifications, such that $y_{1:T}=d(x_{1:T})$ and $\hat{y}_{1:T}=d(\bar{x}_{1:T})$ . This loss is applied to the reconstructions $\bar{x}_{1:T}$ , such that the objective can be minimized both through the encoder and the generator. In the Least Squares GAN form, the objective function of the discriminator described by,

		$\displaystyle\mathcal{L}_{dx}=\frac{1}{2}\mathbb{E}_{x_{1:T}\sim p}(\sum_{t}\\|1-y_{t}\\|^{2})+\frac{1}{2}\mathbb{E}_{\bar{x}_{1:T}\sim\hat{p}}(\sum_{t}\\|\hat{y}_{t}\\|^{2})$		(8)
		$\displaystyle\mathcal{L}_{fx}=\frac{1}{2}\mathbb{E}_{\bar{x}_{1:T}\sim\hat{p}}(\sum_{t}\\|1-\hat{y}_{t}\\|^{2})$		(9)

We now introduce the encoding discriminator $d_{z}:\mathcal{Z}\rightarrow\mathcal{Y_{Z}}$ such that $y_{z}=d_{z}(z)$ and $\hat{y}_{z}=d_{z}(z_{x})$ . Also in the LSGAN form,

		$\displaystyle\mathcal{L}_{dz}=\frac{1}{2}\mathbb{E}_{z\sim p_{z}}(\\|1-y_{z}\\|^{2})+\frac{1}{2}\mathbb{E}_{z_{x}\sim q}(\\|\hat{y_{z}}\\|^{2})$		(10)
		$\displaystyle\mathcal{L}_{ez}=\frac{1}{2}\mathbb{E}_{z_{x}\sim q}(\\|1-\hat{y_{z}}\\|^{2})$		(11)

describe the objective functions for the discriminator and encoder, respectively.

Putting everything together, there are three measures of divergence we will minimize with our complete model. The adversarial training between the objective functions described by Eq. 8 and Eq. 9 minimize Eq. 1 directly, where $D$ is the $\chi^{2}$ -divergence in the LSGAN formulation. The adversarial training described by the objective functions Eq. 10 and Eq. 11 apply to the divergence of Eq. 3, also minimizing $\chi^{2}$ -divergence of the intermediate encodings. Finally, Eq. 6 corresponds to Eq. 4, minimizing the Kullback–Leibler (KL) divergence through maximum likelihood (ML) supervised training. In total, all parameter optimization occurs through,

		$\displaystyle\min_{e}\min_{g}(\lambda_{1}\mathcal{L}_{recon}+\mathcal{L}_{ez}+\lambda_{2}\mathcal{L}_{fx})$
		$\displaystyle\min_{d_{z}}(\mathcal{L}_{dz})$
		$\displaystyle\min_{d_{x}}(\mathcal{L}_{dx})$

thus describing the complete objective of the FETSGAN architecture.

3.5 Implementation

The hyperparameters of the model are $\lambda_{1}$ , $\lambda_{2}$ , and $\epsilon$ . Additionally, while not described in notation, the dimensionality of $p_{z}$ and $p_{\eta}$ are also parameters of the model. In terms of tuning these values, $p_{z}$ should generally correspond to the estimated complexity of the input signals while $p_{\eta}$ should correspond to the complexity of the noise. Overestimating these values may simply lead to more complex encodings than are required, and potentially more unstable learning in the embedding space described by Eqs. 10 and 11. $\lambda_{1}$ and $\lambda_{2}$ depend primarily on the scaling of the data. In our experiments, all data is normalized between $(-1,1)$ . Finally, $\epsilon$ should be chosen based on the model’s use-case regarding what amount of error is acceptable in reconstruction. In other words, this value represents the amount of error tolerable for the signal, where any error under this value can be considered stochastic noise subject to adversarial feature space learning. All hyperparameters are fairly robust. To demonstrate this, the parameters are fixed for all experiments. The primary parameters are $\lambda_{1}=10$ , $\lambda_{2}=1$ , and ${\epsilon}=0.1$ . The prior distribution $p_{z}$ and the noise distribution $p_{\eta}$ contain four dimensions, and both are sampled from $p_{z},p_{\eta}\sim\mathcal{U}(-1,1)$ . All models are trained with the Adam optimization strategy [31], where the learning rate for the generator, encoder and both discriminators is $0.001$ . These learning rates decay exponentially in the last $10\%$ of epochs. The full model implementation in Pytorch and instructions for experimental reproduction are provided in the link. ¹¹1https://github.com/jbeck9/FETSGAN

4 Experiments

4.1 Experimental Setup & Datasets

For our experiments, we compare the performance of our model primarily against TimeGAN [5] and RCGAN [1]. As a baseline comparison, we have also included a purely autoregressive method that was trained using only teacher forcing. The methods of [11, 12, 13] are omitted at the time of writing due to a lack of available implementations for reproduction. Due to those limitations, we limit the scope of our conclusions to time series models with adversarial learning applied directly to the feature space, or low dimensional encodings of the feature space.

The efficacy of our model is shown along three dimensions. First, we demonstrate the qualitative similarity between the original data and our generated data in Section 4.2. Then, we demonstrate the unique ability of our method to interpret the prior sampling distribution as a way of providing selective samples that are similar in style to specific samples from the original dataset in Section 4.3. Finally, we demonstrate that under strenuous classification and prediction tasks, our method holds state of the art performance among adversarial learners that apply directly on the feature space for time series data in Section 4.4.

We use three primary datasets for analysis. First, we generate a one dimensional sines dataset of length $T=100$ , where the amplitude, frequency, and phase for each sequence are sampled from a uniform distribution. We also use the six-dimensional historical Google stocks dataset, containing stock price information from 2004 to 2019 of various lengths $T$ . This dataset is provided directly in the code repository. Finally, we use 6 dimensions of the UCI Appliances energy dataset [32] and real-valued traffic and weather data from the UCI Metro Interstate dataset [33].

4.2 Distribution Matching

Visualizing synthetic data generated from an adversarial learning process is important to analyze the extent of mode collapse that may have occurred. This task is tricky for time series data, as it is possible that temporal mode collapse could exist, but be obscured if the temporal dimension is flattened for analysis. In the case of the sines dataset, we can reduce the sequences to a single dimension by simply capturing the dominant frequency in the sequence, using the Discrete Fourier Transform (DFT). Here, the dominant frequency is given by $\operatorname*{arg\,max}_{f}DFT(x_{1:T})$ . Since the original data consists of a sine wave with a single frequency, this provides a valid analysis of the entire sequence. The amplitude and phase of the corresponding dominant frequency are taken as well. A histogram can then be produced, comparing the distribution of frequencies, amplitudes, and phases in each dataset. For the stocks and energy datasets, we settled for a visual comparison in the flattened temporal dimension using TSNE visualization [34], repeating the procedure reported in [5]. The results of these visualizations are shown in Fig. 3. While the results for TimeGAN generally match what is shown in [5], FETSGAN demonstrates a substantial improvement over TimeGAN, RCGAN, and the baseline autoregressive models in matching the underlying distribution for all datasets used.

4.3 Selective Sampling

There is an obvious use case that at inference time, perhaps there is a need to produce a specific style of data, or to sample from a specific portion of the data distribution $p(X_{1:T})$ . Because our model forces dimensionality reduction through an encoder that is forced to match $q$ to $p_{z}$ , we can leverage spatial relationships in the latent space $z$ to selectively sample from the prior distribution $p_{z}$ at inference time. This allows us to produce synthetic data that retains a specific style. To demonstrate this, three sine waves were taken from the data, corresponding to sequences $x=sin(2\pi ft)$ with $f=2,5,8$ . These sequences were then encoded as $z_{x}=e(x)$ . Finally, new sequences were generated by adding noise to these encodings, such that $x_{\eta}=g(z_{x}+\eta)$ , where $\eta\sim\mathcal{N}(0,0.1)$ . The results in Fig. 4 show that, as expected, spatial relationships are maintained in the latent space $z$ . We are able to produce synthetic sine waves that maintain the close relationship to the anchor point $x$ they were sampled near, allowing the ability to produce synthetic sequences within an expected range of style.

Table 1: Predictive & Discriminative Scores. Best scores shown in bold.

Model	Metric	Sines	Energy	Stocks	Metro
T-Forcing (64,134 parameters)	+1 Step Predict	$.011\pm.004$	$.018\pm.002$	$.047\pm.003$	$.093\pm.002$
	+3 Step Predict	$.083\pm.010$	$.028\pm.002$	$.064\pm.006$	$.168\pm.004$
	+5 Step Predict	$.421\pm.220$	$.034\pm.003$	$.084\pm.014$	$.223\pm.013$
	Dis. Score	$.494\pm.008$	$.246\pm.053$	$.243\pm.031$	$.264\pm.021$
RCGAN (93,717 parameters)	+1 Step Predict	$.011\pm.002$	$.035\pm.007$	$.026\pm.001$	$.065\pm.002$
	+3 Step Predict	$.038\pm.010$	$.045\pm.007$	$.073\pm.020$	$.134\pm.003$
	+5 Step Predict	$.073\pm.020$	$.054\pm.008$	$.094\pm.026$	$.180\pm.003$
	Dis. Score	$.467\pm.040$	$.490\pm.011$	$.459\pm.018$	$.114\pm.031$
TimeGAN (337,858 parameters)	+1 Step Predict	$.046\pm.043$	$.021\pm.002$	$.039\pm.001$	$.146\pm.038$
	+3 Step Predict	$.133\pm.121$	$.026\pm.001$	$.048\pm.002$	$.271\pm.047$
	+5 Step Predict	$.167\pm.132$	$.031\pm.001$	$.053\pm.002$	$.330\pm.054$
	Dis. Score	$.317\pm.105$	$.164\pm.066$	$.268\pm.054$	$.374\pm.104$
FETSGAN-FAT (262,686 parameters)	+1 Step Predict	$.016\pm.004$	$.025\pm.003$	$.030\pm.001$	$.063\pm.001$
	+3 Step Predict	$.041\pm.006$	$.029\pm.001$	$.039\pm.002$	$\textbf{.124}\pm.002$
	+5 Step Predict	$.058\pm.008$	$.032\pm.001$	$.042\pm.001$	$\textbf{.165}\pm.002$
	Dis. Score	$.249\pm.125$	$.237\pm.114$	$.055\pm.032$	$.056\pm.031$
FETSGAN-FD (182,827 parameters)	+1 Step Predict	$.008\pm.005$	$.018\pm.002$	$.031\pm.002$	$.077\pm.003$
	+3 Step Predict	$.019\pm.004$	$.026\pm.001$	$.037\pm.001$	$.155\pm.008$
	+5 Step Predict	$.028\pm.003$	$.031\pm.002$	$.041\pm.002$	$.192\pm.006$
	Dis. Score	$.002\pm.007$	$.189\pm.041$	$.086\pm.031$	$.122\pm.066$
FETSGAN (262,686 parameters)	+1 Step Predict	$\textbf{.007}\pm.006$	$\textbf{.016}\pm.001$	$\textbf{.026}\pm.001$	$\textbf{.062}\pm.001$
	+3 Step Predict	$\textbf{.017}\pm.005$	$\textbf{.023}\pm.001$	$\textbf{.036}\pm.001$	$.128\pm.002$
	+5 Step Predict	$\textbf{.026}\pm.003$	$\textbf{.028}\pm.001$	$\textbf{.040}\pm.001$	$.169\pm.001$
	Dis. Score	$\textbf{.001}\pm.004$	$\textbf{.030}\pm.012$	$\textbf{.005}\pm.007$	$\textbf{.025}\pm.014$

4.4 Performance Metrics

To compare quantitative performance between models, we apply two testing metrics, discriminative score, and predictive score. Discriminative score is measured by training an ad hoc RNN classifier to discriminate between real dataset and a static synthetic dataset generated by each model. The best model will have the lowest score $0.5-pred$ , corresponding to how far the predictions were below the decision boundary, where a score of $0$ corresponds to indistinguishable data. In the case of prediction, the “Train on Synthetic, Test on Real” (TSTR) approach is used [1]. A simple RNN is trained as a forecasting model, predicting 1, 3, and 5 steps into the future, given the sequence $x_{1:t}$ for any valid step where $t+step\leq T$ on synthetic data. Then, MAE prediction error from the trained model is measured on the real dataset. The best model is the one which produces the lowest prediction error on real data. Whenever the model under test calls for an RNN style network, a Gated Recurrent Unit (GRU) of 64 cells and 3 layers was used. For each architecture, 3 models were trained, and sampled 5 times. Thus, 15 tests were conducted for each value in Table 1. Training for all models occurred under 1000 epochs using the Adam optimizer with a learning rate of $0.001$ , including the classification and prediction models. Variations of FETSGAN are included to analyze sources of gain. FETSGAN-FAT removes the $\operatorname{FAT}_{t}$ operation by replacing Eq. 6 with Eq. 5. FETSGAN-FD trains without the feature space discriminator $d_{x}$ . The complete model scores best in totality, and the variations of the FETSGAN show statistically significant improvement over RCGAN and TimeGAN in all cases. We note that in the case of the noisy and lengthy Energy dataset, only the complete FETSGAN model was able to produce data realistic enough to completely fool the ad-hoc discriminator with a score under $0.1$ for all experiments. The number of trainable parameters for each model is also shown in Table 1, which was fixed for all experiments.

4.5 Limitations, Future Work, & Ethical Considerations

Fully embedding time-series data using an adversarial autoencoder ensures the synthetic data produced more closely matches the full distribution of the target data, and this auto-encoding reconstruction has a regularizing effect on the adversarial training. However, this also limits the potential applications of this methodology. Data such as video or language tokens are too highly dimensional to be reliably reduced to vector embeddings. As such, the realistic use cases are limited to real-valued time-series signals with relatively low dimensionality. The architecture chosen in this work was motivated in part by the fact that it is not auto-regressive in nature, and thus does not suffer from exposure bias. Limiting exposure bias would allow for more robust use of auto-regressive methods such as attention-based transformers, and constitutes a promising pathway for future work.

While the applications of our work may not reach the scope and magnitude of other synthetic data generation models such as large language models (LLMs), they are similar with respect to the ethical considerations highlighted in [20]. Our methodology can straightforwardly be used to produce synthetic datasets where privacy is a concern within the original data. The primary responsibility for using this methodology and those like it is to ensure it is made explicitly clear that the data produced is synthetic. Additionally, the creator of the model should thoroughly ensure it sufficiently matches the distributions of the target data before applying it to problems with real-world impact.

5 Conclusion

In this paper we introduce FETSGAN, a novel approach to real-valued time series generation that combines feature space adversarial learning with the adversarial autoencoder framework. In addition, we introduce the $\operatorname{FAT}$ operator, which provides a regularizing effect on training complex temporal sequences that are produced from an intermediate encoding. Finally, the method shown here provides an interpretable latent space, allowing higher flexibility for selective sampling at inference time. We demonstrate significant improvement over current adversarial methods applied directly to the feature space or encodings thereof. In future work, we intend to leverage the accuracy and interpretability of this model on a variety of datasets to demonstrate real world utility for synthetic data to aid in various applied machine learning models in forecasting and classification.

6 Competing Interests, Author Contribution, & Data Availability

6.1 Competing Interests

The authors declare no known competing interests in the content of this manuscript. Funding for this work was partially provided by the Collaborative Sciences Center for Road Safety (CSCRS), as well as the University of Tennessee, Knoxville.

6.2 Data Availability and Ethical Use

All the data used for experimentation was publicly available, and contains no sensitive or personal information of any kind. No original datasets were produced through this research. All datasets are either provided through citation, or provided directly at the linked repository.

References

[1] C. Esteban, S. L. Hyland, and G. Rätsch, “Real-valued (Medical) Time Series Generation with Recurrent Conditional GANs,” Dec. 2017.
[2] O. Mogren, “C-RNN-GAN: Continuous recurrent neural networks with adversarial training,” Nov. 2016.
[3] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey, “Adversarial Autoencoders,” May 2016.
[4] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to Sequence Learning with Neural Networks,” Dec. 2014.
[5] J. Yoon, D. Jarrett, and M. van der Schaar, “Time-series Generative Adversarial Networks,” in Advances in Neural Information Processing Systems, vol. 32. Curran Associates, Inc., 2019.
[6] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, “Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks,” Sep. 2015.
[7] A. Lamb, A. Goyal, Y. Zhang, S. Zhang, A. Courville, and Y. Bengio, “Professor Forcing: A New Algorithm for Training Recurrent Networks,” Oct. 2016.
[8] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling,” Dec. 2014.
[9] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein GAN,” Dec. 2017.
[10] M. N. Fekri, A. M. Ghosh, and K. Grolinger, “Generating Energy Data for Machine Learning with Recurrent Generative Adversarial Networks,” Energies, vol. 13, no. 1, p. 130, Jan. 2020.
[11] T. Xu, L. K. Wenliang, M. Munn, and B. Acciaio, “COT-GAN: Generating Sequential Data via Causal Optimal Transport,” in Advances in Neural Information Processing Systems, vol. 33. Curran Associates, Inc., 2020, pp. 8798–8809.
[12] D. Jarrett, I. Bica, and M. van der Schaar, “Time-series Generation by Contrastive Imitation,” in Advances in Neural Information Processing Systems, vol. 34. Curran Associates, Inc., 2021, pp. 28 968–28 982.
[13] A. Coletta, S. Gopalakrishnan, D. Borrajo, and S. Vyetrenko, “On the Constrained Time-Series Generation Problem,” Advances in Neural Information Processing Systems, vol. 36, pp. 61 048–61 059, Dec. 2023.
[14] S. Dash, A. Yale, I. Guyon, and K. P. Bennett, “Medical Time-Series Data Generation Using Generative Adversarial Networks,” in Artificial Intelligence in Medicine, ser. Lecture Notes in Computer Science, M. Michalowski and R. Moskovitch, Eds. Cham: Springer International Publishing, 2020, pp. 382–391.
[15] H. Li, S. Yu, and J. Principe, “Causal Recurrent Variational Autoencoder for Medical Time Series Generation,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 7, pp. 8562–8570, Jun. 2023.
[16] S. Chattoraj, S. Pratiher, S. Pratiher, and H. Konik, “Improving Stability of Adversarial Li-ion Cell Usage Data Generation using Generative Latent Space Modelling,” in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Jun. 2021, pp. 8047–8051.
[17] M. Fochesato, F. Khayatian, D. F. Lima, and Z. Nagy, “On the use of conditional TimeGAN to enhance the robustness of a reinforcement learning agent in the building domain,” in Proceedings of the 9th ACM International Conference on Systems for Energy-Efficient Buildings, Cities, and Transportation, ser. BuildSys ’22. New York, NY, USA: Association for Computing Machinery, Dec. 2022, pp. 208–217.
[18] E. Adib, A. S. Fernandez, F. Afghah, and J. J. Prevost, “Synthetic ECG Signal Generation Using Probabilistic Diffusion Models,” IEEE Access, vol. 11, pp. 75 818–75 828, 2023.
[19] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. ukasz Kaiser, and I. Polosukhin, “Attention is All you Need,” in Advances in Neural Information Processing Systems, vol. 30. Curran Associates, Inc., 2017.
[20] P. P. Ray, “ChatGPT: A comprehensive review on background, applications, key challenges, bias, ethics, limitations and future scope,” Internet of Things and Cyber-Physical Systems, vol. 3, pp. 121–154, Jan. 2023.
[21] X. Lyu, M. Hueser, S. L. Hyland, G. Zerveas, and G. Raetsch, “Improving Clinical Predictions through Unsupervised Time Series Representation Learning,” Dec. 2018.
[22] A. M. Dai and Q. V. Le, “Semi-supervised Sequence Learning,” Nov. 2015.
[23] F. M. Bianchi, L. Livi, K. Ø. Mikalsen, M. Kampffmeyer, and R. Jenssen, “Learning representations for multivariate time series with missing data using Temporal Kernelized Autoencoders,” Jul. 2019.
[24] D. P. Kingma and M. Welling, “Auto-Encoding Variational Bayes,” Dec. 2013.
[25] I. Tolstikhin, O. Bousquet, S. Gelly, and B. Schoelkopf, “Wasserstein Auto-Encoders,” Dec. 2019.
[26] M.-Y. Liu, T. Breuel, and J. Kautz, “Unsupervised Image-to-Image Translation Networks,” in Advances in Neural Information Processing Systems, vol. 30. Curran Associates, Inc., 2017.
[27] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2223–2232.
[28] X. Mao, Q. Li, H. Xie, R. Y. K. Lau, Z. Wang, and S. P. Smolley, “Least Squares Generative Adversarial Networks,” Apr. 2017.
[29] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida, “Spectral Normalization for Generative Adversarial Networks,” Feb. 2018.
[30] S. Xiang and H. Li, “On the Effects of Batch and Weight Normalization in Generative Adversarial Networks,” Dec. 2017.
[31] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[32] L. M. Candanedo, V. Feldheim, and D. Deramaix, “Data driven prediction models of energy use of appliances in a low-energy house,” Energy and Buildings, vol. 140, pp. 81–97, Apr. 2017.
[33] J. Hogue, “Traffic data from mn department of transportation, weather data from openweathermap,” 2018.
[34] L. van der Maaten and G. Hinton, “Visualizing Data using t-SNE,” Journal of Machine Learning Research, vol. 9, no. 86, pp. 2579–2605, 2008.