
AMLNet: Adversarial Mutual Learning Neural Network for Non-AutoRegressive Multi-Horizon Time Series Forecasting

Yang Lin School of Computer Science, The University of Sydney
Sydney, Australia
[email protected]
Abstract

Multi-horizon time series forecasting, crucial across diverse domains, demands high accuracy and speed. While AutoRegressive (AR) models excel in short-term predictions, they suffer from speed and error-accumulation issues as the horizon extends. Non-AutoRegressive (NAR) models suit long-term predictions but neglect interdependencies between outputs, yielding unrealistic results. We introduce AMLNet, an innovative NAR model that achieves realistic forecasts through an online Knowledge Distillation (KD) approach. AMLNet harnesses the strengths of both AR and NAR models by training a deep AR decoder and a deep NAR decoder in a collaborative manner, serving as ensemble teachers that impart knowledge to a shallower NAR decoder. This knowledge transfer is facilitated through two key mechanisms: 1) outcome-driven KD, which dynamically weights the contribution of KD losses from the teacher models, enabling the shallow NAR decoder to incorporate the ensemble’s diversity; and 2) hint-driven KD, which employs adversarial training to extract valuable insights from the models’ hidden states for distillation. Extensive experimentation showcases AMLNet’s superiority over conventional AR and NAR models, thereby presenting a promising avenue for multi-horizon time series forecasting that enhances accuracy and expedites computation.

Index Terms:
time series forecasting, deep learning, Transformer, knowledge distillation

I Introduction

Time-series forecasting is integral to various practical applications, from electricity grid control [1] and economic trend prediction [2] to traffic control [3]. Such scenarios often require multi-step ahead forecasts for informed decision-making; for instance, accurately predicting hourly electricity consumption for upcoming days or weeks aids efficient resource allocation. Classical statistical methods like AutoRegressive Integrated Moving Average (ARIMA) and exponential smoothing [4] excel in single time series forecasting, yet fall short when handling related time series collectively, as they treat each series independently [5, 6, 7]. Emerging as a promising alternative, deep learning techniques have gained traction for large-scale related time series forecasting [8, 7, 9, 3].

These deep learning methods can be categorized into AutoRegressive (AR) and Non-AutoRegressive (NAR) models. AR models, including DeepAR [8], TCNN [10, 11] and LogSparse Transformer [7], predict one step ahead, using prior forecasts as input for subsequent predictions. While effective at capturing interdependencies in the output space [12, 13], AR models face issues such as training-inference discrepancies [14, 15], error accumulation [16], and high inference latency [17, 18]. Conversely, NAR models (e.g., MQ-RNN [15], N-BEATS [6], AST [5], and Informer [19]) overcome AR modeling problems by generating parallel predictions, proving superior in long horizon forecasting. However, NAR models may yield unrealistic, disjointed series due to a lack of interdependence consideration, leading to unrelated forecasts [16, 5]. Our work addresses this by employing Knowledge Distillation (KD) to incorporate both model outcomes and hidden states, yielding more coherent and accurate forecasts.

To tackle NAR model limitations, we introduce the Adversarial Mutual Learning Neural Network (AMLNet), an NAR model utilizing online KD methods. AMLNet comprises an encoder, a deep AR decoder, a deep NAR decoder, and a shallow NAR decoder (Fig. 1). During training, the encoder extracts patterns for all decoders; the deep AR and NAR decoders are trained in a mutual-learning fashion and then serve as ensemble teachers that transfer knowledge to the shallow NAR decoder, enhancing its error handling and output interdependence. Testing employs only the encoder and the shallow NAR decoder for forecast generation. AMLNet’s knowledge transfer employs two techniques: outcome-driven KD, which dynamically weights the distillation loss based on network performance to prevent error circulation; and hint-driven KD, which distills knowledge from the hidden states via adversarial training, as these states contain valuable information for enhanced transfer.

Our contributions encompass: 1) Introduction of AMLNet, pioneering online KD for time series forecasting. It trains deep AR and NAR decoders mutually as ensemble teachers, transferring knowledge to a shallow NAR decoder, resulting in contiguous forecasts and superior performance with fast inference speed, as demonstrated across four time series datasets. 2) Proposal of outcome-driven and hint-driven online KD, simultaneously learning from teacher network predictions and inner features. Our experiments, compared to state-of-the-art KD methods, affirm the efficacy of both proposed techniques.

II Problem Formulation

II-A Data Sets

We conducted experiments on four publicly available real-world datasets: Sanyo [20], Hanergy [21], Solar [22], and Electricity [23]. The Sanyo and Hanergy datasets consist of solar power generation data from two PV plants in Australia. The data for Sanyo spans from 01/01/2011 to 31/12/2016 (6 years), while the data for Hanergy covers the period from 01/01/2011 to 31/12/2017 (7 years). The Solar dataset comprises solar power data from 137 PV plants in Alabama, USA, gathered between 01/01/2006 and 31/08/2006. The Electricity dataset contains electricity consumption data from 370 households, recorded from 01/01/2011 to 07/09/2014.

A summary of the data statistics is provided in Table I. For the Sanyo and Hanergy datasets, we considered data between 7 am and 5 pm and aggregated it at half-hourly intervals. Additionally, weather and weather forecast data were collected and used as covariates in the experiments (refer to [24] for further details). For the Solar and Electricity datasets, the data was aggregated into 1-hour intervals. Following the approach in [7, 24], calendar features were incorporated based on the granularity of each dataset: the Sanyo and Hanergy datasets used month, hour-of-the-day, and minute-of-the-hour; the Solar dataset used month, hour-of-the-day, and age; and the Electricity dataset used month, day-of-the-week, hour-of-the-day, and age. For consistent preprocessing, all data was normalized to have zero mean and unit variance [24].
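
As an illustration, the sketch below builds hourly calendar covariates and applies z-normalisation for one series; it assumes a pandas DataFrame with a DatetimeIndex and a hypothetical "power" column, and is not the exact preprocessing code used in the paper.

import pandas as pd

def preprocess_hourly(df, value_col="power"):
    """Add calendar covariates and z-normalise one hourly series (a sketch)."""
    df = df.copy()
    df["month"] = df.index.month            # calendar features
    df["day_of_week"] = df.index.dayofweek
    df["hour_of_day"] = df.index.hour
    df["age"] = range(len(df))              # distance (in steps) from the series start
    mean, std = df[value_col].mean(), df[value_col].std()
    df[value_col] = (df[value_col] - mean) / std   # zero mean, unit variance
    return df, mean, std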

Start date End date Granularity $L_d$ $N$ $n_T$ $n_C$ $T_l$ $T_h$
Sanyo 01/01/2011 31/12/2016 30 minutes 20 1 4 3 20 20
Hanergy 01/01/2011 31/12/2017 30 minutes 20 1 4 3 20 20
Solar 01/01/2006 31/08/2006 1 hour 24 137 0 3 24 24
Electricity 01/01/2011 07/09/2014 1 hour 24 370 0 4 168 24
TABLE I: Dataset statistics. $L_d$ - number of steps per day, $N$ - number of series, $n_T$ - number of time-based features, $n_C$ - number of calendar features, $T_l$ - length of input series, $T_h$ - length of forecasting horizon.

II-B Problem Statement

Given is: 1) a set of $N$ univariate time series (solar or electricity series) $\{\mathbf{Y}_{i,1:T_l}\}_{i=1}^{N}$, where $\mathbf{Y}_{i,1:T_l}\coloneqq[y_{i,1},y_{i,2},...,y_{i,T_l}]$, $T_l$ is the input sequence length, and $y_{i,t}\in\Re$ is the value of the $i$th time series (generated PV solar power or consumed electricity) at time $t$; 2) a set of associated time-based multi-dimensional covariate vectors $\{\mathbf{X}_{i,1:T_l+T_h}\}_{i=1}^{N}$, where $T_h$ denotes the length of the forecasting horizon. Our goal is to predict the future values of the time series $\{\mathbf{Y}_{i,T_l+1:T_l+T_h}\}_{i=1}^{N}$, i.e. the PV power or electricity usage for the next $T_h$ time steps after $T_l$.

The covariates for the Sanyo and Hanergy datasets include: weather data $\{\mathbf{W1}_{i,1:T_l}\}_{i=1}^{N}$, weather forecasts $\{\mathbf{WF}_{i,T_l+1:T_l+T_h}\}_{i=1}^{N}$ and calendar features $\{\mathbf{Z}_{i,1:T_l+T_h}\}_{i=1}^{N}$, while the covariates for the Solar and Electricity datasets include only calendar features.

Specifically, AMLNet produces the probability distribution of the future values, given the past history:

p\left(\mathbf{Y}_{i,T_l+1:T_l+T_h}\mid\mathbf{Y}_{i,1:T_l},\mathbf{X}_{i,1:T_l+T_h};\theta\right) = \prod_{t=T_l+1}^{T_l+T_h} p\left(y_{i,t}\mid\mathbf{Y}_{i,1:T_l},\mathbf{X}_{i,1:T_l+T_h};\theta\right) \quad (1)

where the input of the model at step $t$ is the concatenation of $y_{i,t-1}$ and $x_{i,t}$.

III Related Work

III-A Non-AutoRegressive Sequence Modelling

NAR forecasting models [15, 6, 5, 19] directly eliminate the AR connection from the decoder side, instead modeling a separate conditional distribution for each prediction independently. Unlike AR models, NAR models enable parallelized training and inference. However, NAR models can produce disjointed forecasts and introduce discontinuities [16] due to the erroneous assumption of independence, limiting their ability to capture interdependencies among predictions. AST [5] stands as the sole approach addressing this within the NAR forecasting framework, utilizing adversarial training to enhance the global perspective. Recent NAR models that address the lack of output space interdependence have primarily emerged in Natural Language Processing (NLP) tasks [12, 25]. Various strategies have been proposed, with KD [26, 27] garnering substantial attention. KD effectively transfers knowledge from a larger teacher to a smaller student network by offering softer, more informative target distributions. KD methods for NAR either distill the prediction distribution of a pre-trained AR teacher model [12] or incorporate hidden state patterns of the AR model [28].

III-B Online Knowledge Distillation

Classic KD methods are offline and can incur computational and memory overhead due to the reliance on powerful pre-trained teachers. In response, online KD techniques [29, 30, 31, 32] have emerged, showing superior results. These methods treat all networks as peers, enabling mutual exchange of output information and requiring less training time and memory. DML [29] introduced collaborative training, where each model can be both a student and a teacher. Further advancements, such as Wu and Gong’s work [32], assemble teachers into online KD, enhancing generalization. Notably, online KD techniques can capture intermediate feature distributions through adversarial training, as seen in AFD [30] and AMLN [31].

III-C Generative Adversarial Networks

Generative Adversarial Networks (GANs) [33], comprising a generator GG and a discriminator DD engaged in adversarial training, were initially proposed for sample generation. However, the adversarial training paradigm has found applications in diverse domains, including computer vision, NLP [30, 31], and time series forecasting [34, 5].

III-D Summary

In contrast to prior work, our AMLNet introduces several advancements: 1) We pioneer the application of online KD for forecasting, introducing AMLNet as the first model to employ online KD methods for training a NAR forecasting model to capture target sequence interdependencies. Specifically, AMLNet trains a deep AR and a deep NAR model mutually as ensemble teachers before transferring their knowledge to a shallow NAR student. 2) While Wu and Gong [32] construct ensemble teachers and adjust KD loss weights based on training epoch number, they overlook teacher model instability during training. AMLNet utilizes outcome-driven KD, assigning dynamic weights to KD losses based on teacher model performance, specifically tailored for probabilistic forecasting. 3) We address the issue of discontinuous predictions stemming from NAR model hidden states, proposing hint-driven KD to capture hidden state distribution information. Unlike previous approaches [30, 31], designed for networks with differing layer counts, our method is tailored to AMLNet’s architecture.

IV Adversarial Mutual Learning Neural Network

The proposed architecture, Adversarial Mutual Learning Neural Network (AMLNet), addresses the challenges in NAR forecasting by modeling output space interdependence. This section presents the architecture of AMLNet, the proposed outcome-driven and hint-driven online KD methods and the optimisation and inference process of AMLNet.

Figure 1: AMLNet comprises an encoder, P1, P2, and S decoders, with each P1 and P2 layer accompanied by a dedicated discriminator. P1 operates as an AR component, while P2 and S function as NAR components. The solid lines depict the feedforward process, while dashed lines represent the data flow for knowledge distillation.

IV-A Network Architecture

The central components of AMLNet, as depicted in Figure 1, encompass a shared encoder $f_{\theta_e}$ consisting of $n_e$ layers, a deep AR Peer1 (P1) decoder $f_{\theta_{P1}}$ with $n_d$ hidden layers, a deep NAR Peer2 (P2) decoder $f_{\theta_{P2}}$ also possessing $n_d$ hidden layers, and a shallow NAR Student (S) decoder $f_{\theta_S}$ equipped with $n_S$ hidden layers. It is noteworthy that Informer [19] serves as the foundational framework for AMLNet, although other deep learning forecasting models could replace Informer.

To harness temporal patterns from historical data, the encoder extracts insights from past time steps by processing the concatenation of the covariate $x_t$ and the ground truth $y_t$ as input at time step $t$:

h_{e;1:T_l} = f_{\theta_e}(y_{1:T_l}, x_{1:T_l}) \quad (2)

The shared encoder temporal patterns $h_{e;1:T_l}$ are uniformly leveraged across all decoders, exploiting the fact that all decoders condition on the same past input sequence. This shared approach significantly reduces network parameters and computational overhead.

The P1, P2 and S decoders are formulated as:

y_{P1;T_l+1:T} = f_{\theta_{P1}}(y_{T_l:T-1}, x_{T_l+1:T}, h_{e;1:T_l}) \quad (3)
y_{P2;T_l+1:T} = f_{\theta_{P2}}(x_{T_l-T_{de}:T_l}, h_{e;1:T_l}) \quad (4)
y_{S;T_l+1:T} = f_{\theta_S}(x_{T_l-T_{de}:T_l}, h_{e;1:T_l}) \quad (5)

where $T_{de}$ is the length of the start token used by Informer [19], and the prediction $y$ consists of a mean and a variance. NAR models whose decoders receive no input sequence perform poorly, and copying part of the past input series to the decoder as its input can enhance model performance [12].
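
To make the data flow of Eq. (2)-(5) concrete, the following PyTorch-style sketch shows the shared-encoder/three-decoder interface; the module classes, argument names and shapes are illustrative assumptions rather than the exact AMLNet implementation.

import torch.nn as nn

class AMLNetSketch(nn.Module):
    """Shared encoder feeding the P1 (AR), P2 (NAR) and S (shallow NAR) decoders."""
    def __init__(self, encoder, dec_p1, dec_p2, dec_s):
        super().__init__()
        self.encoder = encoder   # f_theta_e
        self.dec_p1 = dec_p1     # deep AR decoder
        self.dec_p2 = dec_p2     # deep NAR decoder
        self.dec_s = dec_s       # shallow NAR decoder

    def forward(self, y_past, x_past, y_prev, x_future, x_token):
        h_e = self.encoder(y_past, x_past)          # Eq. (2): patterns from the past
        y_p1 = self.dec_p1(y_prev, x_future, h_e)   # Eq. (3): AR, conditions on previous outputs
        y_p2 = self.dec_p2(x_token, h_e)            # Eq. (4): NAR, start token as decoder input
        y_s = self.dec_s(x_token, h_e)              # Eq. (5): shallow NAR
        return y_p1, y_p2, y_s                      # each a (mean, variance) pair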

Notably, the AR model excels at capturing output space interdependence, whereas the NAR model is adept at mitigating error propagation. Through a mutually beneficial relationship, the deep AR P1 and NAR P2 decoders coalesce as peers, leveraging each other’s strengths to augment their individual abilities. This collective knowledge is then harnessed to train the shallow NAR S decoder, effectively establishing a dynamic ensemble, as illustrated in Figure 1.

IV-B Outcome-driven Online Knowledge Distillation

The conventional optimization objective of probabilistic time series forecasting is the Negative Log Likelihood (NLL) loss, denoted as $\mathcal{L}_{NLL}$. This loss function, prevalent in training probabilistic forecasting models [8, 7], is formally defined in Eq. (6):

\mathcal{L}_{NLL}(\hat{y}_{T_l+1:T}, y_{T_l+1:T}) = \frac{1}{2T_h}\Big(T_h\log(2\pi) + \sum_{t=T_l+1}^{T}\log\sigma_t^2 + \sum_{t=T_l+1}^{T}(y_t-\mu_t)^2\sigma_t^{-2}\Big) \quad (6)

where $y_{T_l+1:T}$ represents the ground truth, while $\hat{y}_{T_l+1:T}$ pertains to the predicted distribution, encompassing the mean $\mu_t$ and standard deviation $\sigma_t$ across the forecasting horizon.
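
For reference, a minimal sketch of this Gaussian NLL in PyTorch; the tensor shapes are assumed to be (batch, T_h) and the helper name is ours, not the paper's.

import math
import torch

def gaussian_nll(mu, sigma, y):
    """Average Gaussian negative log-likelihood over the horizon (Eq. 6)."""
    var = sigma ** 2
    nll = 0.5 * (math.log(2 * math.pi) + torch.log(var) + (y - mu) ** 2 / var)
    return nll.mean()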

To address the respective limitations of the AR and NAR models trained with the NLL loss alone, we enlist the counterpart model as a peer network. This peer relationship serves as a guiding mechanism: each network learns to approximate both the ground truth and the predicted distribution of its peer (teacher) network.

In the realm of online Knowledge Distillation (KD), traditional methodologies involve aggregating prediction and KD losses with fixed predefined proportions [29] or gradually increasing the proportion of KD loss over training epochs [32]. However, these approaches overlook the diverse abilities of the peer networks and the varying quality of their predictions during training. Consequently, the contribution of each peer network to the student network should be weighted according to its performance.

By allocating a constant weight to the KD loss irrespective of the peers’ performance, inaccurately predicted distributions could propagate errors within the network, limiting overall performance. Conventional remedies, such as removing misclassified data for offline KD, are unsuitable for online distillation or forecasting tasks due to reduced training data.

To address these challenges, we introduce an attention-based KD loss that assigns higher weights to well-forecasted samples. The KD loss functions for the P1, P2, and S decoders are formulated in Eq. (8), (9), and (10). P1 and P2 aim to mimic each other’s predicted distributions, while S simultaneously distills knowledge from the outputs of both P1 and P2. The Kullback-Leibler (KL) divergence serves as a measure of discrepancy between predicted distributions. Given distributions $P$ and $Q$, their KL divergence is defined in Eq. (7).

\mathcal{L}_{\mathrm{KL}}(P\|Q) = \int_{-\infty}^{\infty} p(x)\log\left(\frac{p(x)}{q(x)}\right)dx \quad (7)
\mathcal{L}_{P1} = \frac{\alpha_o}{T_h} \times \omega_e(y_{P2;T_l+1:T}, y_{T_l+1:T}) \cdot \mathcal{L}_{\mathrm{KL}}(y_{P2;T_l+1:T}\|y_{P1;T_l+1:T}) \quad (8)
\mathcal{L}_{P2} = \frac{\alpha_o}{T_h} \times \omega_e(y_{P1;T_l+1:T}, y_{T_l+1:T}) \cdot \mathcal{L}_{\mathrm{KL}}(y_{P1;T_l+1:T}\|y_{P2;T_l+1:T}) \quad (9)
\mathcal{L}_{S} = \frac{\alpha_o}{T_h} \times \omega_e(y_{P1;T_l+1:T}, y_{T_l+1:T}) \cdot \mathcal{L}_{\mathrm{KL}}(y_{P1;T_l+1:T}\|y_{S;T_l+1:T}) + \frac{\alpha_o}{T_h} \times \omega_e(y_{P2;T_l+1:T}, y_{T_l+1:T}) \cdot \mathcal{L}_{\mathrm{KL}}(y_{P2;T_l+1:T}\|y_{S;T_l+1:T}) \quad (10)

In these equations, $\alpha_o$ modulates the weight of the outcome-driven KD loss, while the weight $\omega_e(\cdot)$ captures the importance of the teacher model’s predictions. Assuming the data follow a Gaussian distribution, we define $\omega_e(\cdot)$ as follows:

\omega_e(\hat{y}_{T_l+1:T}, y_{T_l+1:T}) = \frac{1}{\sigma_{T_l+1:T}\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{y_{T_l+1:T}-\mu_{T_l+1:T}}{\sigma_{T_l+1:T}}\right)^2} \quad (11)

where $\mu_{T_l+1:T}$ and $\sigma_{T_l+1:T}$ are the mean and standard deviation of the teacher’s predicted distribution. The weight $\omega_e(\cdot)$ ranges from 0 to 1 and changes during training. A higher weight signifies a more accurate teacher prediction, steering the student network towards approximating the teacher’s more reliable outputs.
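
Since both teacher and student outputs are Gaussian, Eq. (7) has a closed form per time step, and Eq. (8)-(11) can be sketched as below. The tensor shapes, the use of the closed-form Gaussian KL, and detaching the teacher-quality weight are our assumptions for illustration.

import math
import torch

def gaussian_kl(mu_t, sigma_t, mu_s, sigma_s):
    """Closed-form KL(teacher || student) between per-step Gaussian predictions (Eq. 7)."""
    return (torch.log(sigma_s / sigma_t)
            + (sigma_t ** 2 + (mu_t - mu_s) ** 2) / (2 * sigma_s ** 2) - 0.5)

def outcome_kd_weight(mu_t, sigma_t, y):
    """Attention weight omega_e (Eq. 11): density of the ground truth under the teacher."""
    return torch.exp(-0.5 * ((y - mu_t) / sigma_t) ** 2) / (sigma_t * math.sqrt(2 * math.pi))

def outcome_kd_loss(mu_t, sigma_t, mu_s, sigma_s, y, alpha_o):
    """Outcome-driven KD loss of one student against one teacher (cf. Eq. 8-10).
    All tensors have shape (batch, T_h)."""
    w = outcome_kd_weight(mu_t, sigma_t, y).detach()   # do not backpropagate through the weight
    kl = gaussian_kl(mu_t, sigma_t, mu_s, sigma_s)
    T_h = y.shape[-1]
    return alpha_o / T_h * (w * kl).sum(dim=-1).mean()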

IV-C Hidden State-driven Online Knowledge Distillation

Figure 2: Hidden state cosine distance of: (a) DeepAR; (b) Informer and (c) AMLNet. (d) Ground truth vs predictions.

Time series data exhibit a natural continuity, yet NAR forecasting models often yield discontinuous and unrelated predictions [16]. This disconnect arises because these models ignore output interdependence and position information, and we trace the phenomenon to the behavior of their hidden states. To illustrate this, we conducted a case study in which an AR model outperformed an NAR model. In Figure 2 (d), we display a smooth and continuous PV power measurement from the Sanyo dataset alongside forecasts from an AR model (DeepAR), an NAR model (Informer), and our proposed AMLNet. Notably, the trajectory of DeepAR is noticeably more continuous and smooth than that of the NAR models. To quantify this observation, we employed the Dynamic Time Warping (DTW) algorithm [35] to measure the similarity between two series. Specifically, the DTW distance and Mean Absolute Percentage Error (MAPE) between the DeepAR prediction and the ground truth are 3.05 and 0.102 kW, respectively, compared to 4.91 and 0.143 kW for Informer, indicating that the DeepAR forecast aligns more closely with the ground truth.

We visualize the cosine distances between hidden states at the last hidden layer in Figure 2, where each pixel at row $i$ and column $j$ represents the distance between hidden states $h_i$ and $h_j$ at steps $i$ and $j$; lighter colors correspond to lower distances. Given two hidden states $h_i$ and $h_j$, their cosine distance is calculated as $1-\frac{h_i\cdot h_j}{||h_i||\cdot||h_j||}$ (a code sketch of this computation follows the list below). Notably, the average cosine distance between DeepAR hidden states $h_i$ and their six closest neighbors is 0.016, while the corresponding value for Informer is 0.084. Clearly, the cosine distances of DeepAR hidden states are substantially lower, indicating greater similarity to their neighbors. Such similarity suggests that the hidden states of DeepAR generate predictions with consistent and gradual variations, yielding trajectories that are more continuous and smooth. From our analysis, several key observations emerge:

  • AR models exhibit similar hidden state patterns, while NAR models lack this property.

  • Dissimilar hidden states can lead to discontinuous predicted trajectories.

  • Hidden states hold meaningful information, and distilling knowledge from them could be beneficial [31, 30].
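
A minimal sketch of the pairwise cosine-distance computation used for this analysis, assuming the last-layer hidden states are stored as a (T_h, d_hid) PyTorch tensor:

import torch

def pairwise_cosine_distance(h):
    """Pairwise distances 1 - (h_i . h_j) / (||h_i|| ||h_j||) between hidden states.
    h: tensor of shape (T_h, d_hid); returns a (T_h, T_h) distance matrix."""
    h_norm = h / h.norm(dim=-1, keepdim=True)
    return 1.0 - h_norm @ h_norm.T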

Thus, we capitalize on the hidden state patterns through online KD to generate continuous predictions while utilizing their inherent information. Unlike offline KD methods which regularize the distance between pairs of hidden states [28], the hidden states of online KD models exhibit greater variability compared to model outputs during training [30]. Direct regularization in this context impairs the learning process and fails to guarantee convergence. To address this, we adopt an adversarial training approach to extract distributional information from hidden states.

In our approach, Peer 1 (P1) learns from Peer 2 (P2) to counter error accumulation, while P2 learns output interdependence from P1. The Student (S) inherits abilities from both P1 and P2. Given a shared encoder, we focus solely on the decoders’ hidden states. The adversarial training involves two components: a generator producing feature mappings and a discriminator classifying these mappings. Each decoder layer serves as a generator for feature mappings, and each P1 and P2 layer is paired with a discriminator acting as a classifier (refer to Figure 1).

The discriminators receive hidden states $h\in\Re^{T_h\times d_{hid}}$ and output probabilities ranging between 0 (fake) and 1 (real). A discriminator’s architecture consists of a sequence of ConvLayer-BatchNorm-LeakyReLU-ConvLayer-LinearLayer-Sigmoid operations. The initial ConvLayer has an output dimension of 16, stride of 2, and kernel size of 3, while the second ConvLayer has an output dimension of 1, stride of 1, and kernel size of 3.
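
A possible PyTorch realisation of this discriminator stack is sketched below; the padding, LeakyReLU slope and tensor layout are assumptions not specified above.

import torch.nn as nn

class HintDiscriminator(nn.Module):
    """ConvLayer-BatchNorm-LeakyReLU-ConvLayer-LinearLayer-Sigmoid over hidden states."""
    def __init__(self, T_h, d_hid):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(d_hid, 16, kernel_size=3, stride=2),   # first ConvLayer
            nn.BatchNorm1d(16),
            nn.LeakyReLU(0.2),
            nn.Conv1d(16, 1, kernel_size=3, stride=1),       # second ConvLayer
        )
        len1 = (T_h - 3) // 2 + 1      # length after the strided convolution (no padding)
        len2 = len1 - 2                # length after the second convolution
        self.head = nn.Sequential(nn.Linear(len2, 1), nn.Sigmoid())

    def forward(self, h):
        # h: (batch, T_h, d_hid) -> (batch, d_hid, T_h) for Conv1d
        z = self.conv(h.transpose(1, 2)).squeeze(1)   # (batch, len2)
        return self.head(z).squeeze(-1)               # probability of being "real"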

In our training process, the generators aim to fool the discriminators by approximating the distribution of hidden states of their teacher networks. The discriminators, on the other hand, strive to distinguish the origin of the hidden states. Specifically, we denote the $i$th layer of the P1, P2, and S decoders as generators $G_{i,P1}$, $G_{i,P2}$, and $G_{i,S}$, which produce hidden states $h_{i,P1}$, $h_{i,P2}$, and $h_{i,S}$, respectively. The $i$th P1 discriminator $D_{i,P1}$ is trained to classify P1-generated mappings as real (output: 1) and P2- or S-generated mappings as fake (output: 0). Analogously, the $i$th P2 discriminator $D_{i,P2}$ distinguishes P2-generated mappings as real and P1- or S-generated mappings as fake. The parameters of $G_{i,P1}$ are optimised by minimising the hidden state-driven KD loss $\mathcal{L}_{i,P1}(h_{i,P1})$:

\mathcal{L}_{i,P1} = \alpha_h \log(1 - D_{i,P2}(h_{i,P1})). \quad (12)

where the hyperparameter $\alpha_h$ controls the weight of the hint-driven KD loss. Similarly, the $i$th P2 layer $G_{i,P2}$ minimises the KD loss:

\mathcal{L}_{i,P2} = \alpha_h \log(1 - D_{i,P1}(h_{i,P2})). \quad (13)

For hint-driven KD, the student network’s shallow layers imitate low-level features from the teacher networks, while its deep layers learn higher-level features. Since the S decoder has fewer layers, we let each shallow S layer acquire knowledge from multiple teacher layers. More specifically, the $i$th S layer learns features from the $j$th teacher layers with $j\in[1+(i-1)\times\lfloor\frac{n_d-1}{n_S-1}\rfloor,\ \min(\lceil\frac{n_d-1}{n_S-1}\rceil+(i-1)\times\lfloor\frac{n_d-1}{n_S-1}\rfloor,\ n_d)]$. The $\min(\cdot,\ n_d)$ term ensures that $j$ does not exceed the depth $n_d$ of the teacher network.
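
One way to read this mapping is sketched below; the function name and the example values of n_d and n_S are ours, chosen only to illustrate the formula.

import math

def teacher_layers_for_student_layer(i, n_S, n_d):
    """Teacher layer indices j (1-indexed) assigned to the i-th shallow S layer."""
    step = (n_d - 1) // (n_S - 1)             # floor((n_d - 1) / (n_S - 1))
    width = math.ceil((n_d - 1) / (n_S - 1))  # ceil((n_d - 1) / (n_S - 1))
    lo = 1 + (i - 1) * step
    hi = min(width + (i - 1) * step, n_d)     # clipped so j never exceeds n_d
    return list(range(lo, hi + 1))

# For example, with n_d = 4 teacher layers and n_S = 2 student layers:
# teacher_layers_for_student_layer(1, 2, 4) -> [1, 2, 3]
# teacher_layers_for_student_layer(2, 2, 4) -> [4]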

Consequently, each $i$th shallow S layer ($G_{i,S}$) distills knowledge from both the P1 and P2 decoders, attempting to deceive the corresponding discriminators ($D_{j,P1}$ and $D_{j,P2}$) for $j$ within the specified range. The KD loss for the $i$th S layer is formulated in Eq. (14), where $\alpha_h$ again governs the weight of the hint-driven KD loss.

\mathcal{L}_{i,S} = \alpha_h \sum_{j}\big(\log(1 - D_{j,P1}(h_{i,S})) + \log(1 - D_{j,P2}(h_{i,S}))\big). \quad (14)

In parallel, the discriminators are trained to classify the origin of the hidden states: the $i$th P1 discriminator ($D_{i,P1}$) classifies features generated by the $i$th P1 layer as real (output: 1), and P2- or S-generated features as fake (output: 0), while the $i$th P2 discriminator ($D_{i,P2}$) classifies P2-generated features as real and P1- or S-generated features as fake. The loss functions for the P1 and P2 discriminators are defined in Eq. (15) and (16), respectively, where $k$ indexes the shallow NAR (S) layers that attempt to deceive the $i$th discriminator. For example, the first and second S layers aim to deceive the first P1 discriminator in Figure 1, so $D_{1,P1}$ has $k=\{1,2\}$.

\mathcal{L}_{D;i,P1} = -\log D_{i,P1}(h_{i,P1}) - \log(1 - D_{i,P1}(h_{i,P2})) - \sum_{k}\log(1 - D_{i,P1}(h_{k,S})) \quad (15)
\mathcal{L}_{D;i,P2} = -\log D_{i,P2}(h_{i,P2}) - \log(1 - D_{i,P2}(h_{i,P1})) - \sum_{k}\log(1 - D_{i,P2}(h_{k,S})) \quad (16)
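
The sketch below shows how these adversarial losses could be computed with the discriminator module above; treating the per-layer discriminators as Python lists, averaging over the batch, and detaching generator outputs inside the discriminator loss are our assumptions.

import torch

def student_hint_loss(h_S, D_P1, D_P2, teacher_idx, alpha_h):
    """Hint-driven KD loss of one shallow S layer (Eq. 14).
    h_S: (batch, T_h, d_hid) hidden states; D_P1, D_P2: lists of discriminators
    (0-indexed); teacher_idx: 0-indexed teacher layers assigned to this S layer."""
    loss = 0.0
    for j in teacher_idx:
        loss = loss + torch.log(1.0 - D_P1[j](h_S)).mean() \
                    + torch.log(1.0 - D_P2[j](h_S)).mean()
    return alpha_h * loss

def discriminator_loss_P1(D_i_P1, h_i_P1, h_i_P2, h_S_list):
    """Classification loss of the i-th P1 discriminator (Eq. 15)."""
    loss = -torch.log(D_i_P1(h_i_P1.detach())).mean() \
           - torch.log(1.0 - D_i_P1(h_i_P2.detach())).mean()
    for h_k_S in h_S_list:   # S layers trying to deceive this discriminator
        loss = loss - torch.log(1.0 - D_i_P1(h_k_S.detach())).mean()
    return loss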

IV-D Optimisation and Inference

The optimization process for AMLNet involves minimizing the forecasting loss, as well as the outcome-driven and hint-driven KD losses. During each training iteration, the encoder and P1 and P2 decoders are optimized first by minimizing Equations (17) and (18). Subsequently, the S decoder is optimized by minimizing Equation (19). Lastly, each discriminator associated with the iith P1 or P2 layer is optimized by minimizing Equation (15) or (16). During testing, only the encoder and S decoder are utilized to generate results. The entire training and testing process is outlined in Algorithm 1.

\mathcal{L}^{total}_{P1} = \mathcal{L}_{NLL;P1} + \mathcal{L}_{P1} + \sum_{i=1}^{n_d}\mathcal{L}_{i,P1} \quad (17)
\mathcal{L}^{total}_{P2} = \mathcal{L}_{NLL;P2} + \mathcal{L}_{P2} + \sum_{i=1}^{n_d}\mathcal{L}_{i,P2} \quad (18)
\mathcal{L}^{total}_{S} = \mathcal{L}_{NLL;S} + \mathcal{L}_{S} + \sum_{i=1}^{n_S}\mathcal{L}_{i,S} \quad (19)
Algorithm 1 Training and Testing Process of AMLNet
0:  Training data $\{(y_{i,1:T-1}, x_{i,1:T}, y_{i,T_l+1:T})\}_{i=1}^{n}$; initialised parameters of the shared encoder $f_{\theta_e}$, P1 decoder $f_{\theta_{P1}}$, P2 decoder $f_{\theta_{P2}}$, shallow S decoder $f_{\theta_S}$ and discriminators $\{D_{i,P1}\}_{i=1}^{n_d}$ and $\{D_{i,P2}\}_{i=1}^{n_d}$; training epochs $e_{max}$.
1:  // Training stage
2:  for $e = 1 \rightarrow e_{max}$ do
3:     // Optimizing encoder and P1, P2 decoders
4:     Compute encoder output $h_{e;1:T_l}$ (Eq. (2))
5:     Compute P1 and P2 decoder forecasts $y_{P1;T_l+1:T}$ and $y_{P2;T_l+1:T}$ and their hidden states $h_{1:n_d,P1}$ and $h_{1:n_d,P2}$ (Eq. (3) and (4))
6:     Compute loss of P1 forecasts $\mathcal{L}^{total}_{P1}$ (Eq. (17))
7:     Compute loss of P2 forecasts $\mathcal{L}^{total}_{P2}$ (Eq. (18))
8:     Update encoder, P1 and P2 decoders by minimising $\mathcal{L}^{total}_{P1}$ and $\mathcal{L}^{total}_{P2}$
9:     // Optimizing S decoder
10:     Compute forecasts $y_{S;T_l+1:T}$ and hidden states $h_{1:n_S,S}$ of the shallow S decoder (Eq. (5))
11:     Compute loss of S forecasts $\mathcal{L}^{total}_{S}$ (Eq. (19))
12:     Update S decoder by minimising $\mathcal{L}^{total}_{S}$
13:     // Optimizing discriminators
14:     Compute classification losses $\mathcal{L}_{D;i,P1}$ and $\mathcal{L}_{D;i,P2}$ for every P1 and P2 discriminator (Eq. (15) and (16))
15:     Update discriminators by minimising the classification losses
16:  end for
17:  // Testing stage
18:  Compute encoder output $h_{e;1:T_l}$ (Eq. (2))
19:  Compute forecasting results $y_{S;T_l+1:T}$ by the S decoder (Eq. (5))

V Experiments

V-A Experimental Details

We compare the performance of AMLNet with six methods: four state-of-the-art deep learning models (DeepAR, LogSparse Transformer, N-BEATS and Informer), a statistical model (SARIMAX) and a persistence model. 1) Persistence is a typical baseline in forecasting which considers the time series of the previous day as the prediction for the next day; 2) SARIMAX [4] is an extension of ARIMA which can handle seasonality and exogenous variables; 3) DeepAR [8] is a widely used RNN-based forecasting model; 4) LogSparse Transformer [7] is a Transformer-based forecasting model; it is denoted as ”LogTrans” in Table III; 5) N-BEATS [6] consists of blocks of fully-connected neural networks, organised into stacks using residual links; we introduced covariates at the input of each block to facilitate multivariate series forecasting; 6) Informer [19] is a Transformer-based forecasting model; we modified it to generate probabilistic forecasts in the form of a mean value and variance. Note that Persistence, N-BEATS and Informer are NAR models while the others are AR models.

$\lambda_G$ $\lambda_D$ $\alpha_o$ $\alpha_h$ $\delta$ $d_{hid}$ $n_e$ $n_d$ $n_S$ $d_f$ $n_h$
Sanyo 0.005 0.001 0.1 0.5 0 48 4 4 2 16 8
Hanergy 0.005 0.001 0.1 0.5 0 48 4 4 2 16 8
Solar 0.005 0.001 0.5 0.001 0.2 96 4 3 2 48 32
Electricity 0.001 0.001 0.1 0.001 0.1 48 4 3 2 48 32
TABLE II: Hyperparameters for AMLNet

All models were implemented using PyTorch 1.6 and evaluated on a Tesla V100 16GB GPU. The deep learning models were optimised by mini-batch gradient descent with the Adam optimiser and a maximum of 200 epochs. Following the experimental setup in [24, 9], we used the following training, validation and test split: for Sanyo and Hanergy - the data from the last year as test set, the second last year as validation set for early stopping and the remaining data (4 years for Sanyo and 5 years for Hanergy) as training set; for Solar and Electricity - the last week as test set (from 25/08/2006 for Solar and 01/09/2014 for Electricity) and the week before as validation set. For all data sets, the data preceding the validation set is split in the same way into three subsets and the corresponding validation set is used to select the best hyperparameters. We selected the hyperparameters with minimum loss on the validation set, using Bayesian optimisation with a maximum of 20 iterations. The models used for comparison were tuned based on the authors’ recommendations. For the Transformer-based models, we used learnable position and ID (for the Solar and Electricity sets) embeddings. For AMLNet, the constant sampling factor of the Informer backbone was set to 2, and the length of the start token $T_{de}$ was fixed to half of the forecasting horizon. The learning rates of the generator $\lambda_G$ and discriminator $\lambda_D$ were fixed; the loss function regularisation parameters $\alpha_o$ and $\alpha_h$ were chosen from {0.001, 0.05, 0.1, 0.5}; the dropout rate $\delta$ was chosen from {0, 0.1, 0.2}; the hidden layer dimension size $d_{hid}$ was chosen from {8, 12, 16, 24, 48}; the Informer backbone position-wise FFN dimension size $d_f$ and number of heads $n_h$ were chosen from {8, 12, 16, 24, 48, 96} and {4, 8, 16, 24, 32}; and the number of encoder layers $n_e$, P1 and P2 decoder layers $n_d$ and shallow NAR decoder layers $n_S$ were chosen from {2, 3, 4}. Note that the number of encoder layers is not less than the number of decoder layers, the P1 and P2 decoders have the same number of layers, and the shallow NAR decoder has fewer layers than the deep decoders. The discriminators are a series of ConvLayer-BatchNorm-LeakyReLU-ConvLayer-LinearLayer-Sigmoid operations; the first ConvLayer has an output dimension of 16, stride of 2 and kernel size of 3, while the second has an output dimension of 1, stride of 1 and kernel size of 3.

The selected best hyperparameters for AMLNet are listed in Table II and used for the evaluation of the test set.

Following [8], we report the standard $\rho$0.5 and $\rho$0.9-quantile losses. The quantile loss function evaluates a predicted quantile: quantile $\rho$ is the value below which a fraction $\rho$ of the observations in a group falls. Given the ground truth $y$ and the $\rho$-quantile of the predicted distribution $\hat{y}$, the $\rho$-quantile loss is given by $\mathrm{QL}_{\rho}(y,\hat{y})$:

\mathrm{QL}_{\rho}(y,\hat{y}) = \frac{2\times\sum_{t}P_{\rho}(y_t,\hat{y}_t)}{\sum_{t}|y_t|}, \quad P_{\rho}(y,\hat{y}) = \begin{cases} \rho(y-\hat{y}) & \text{if } y>\hat{y} \\ (1-\rho)(\hat{y}-y) & \text{otherwise} \end{cases} \quad (20)
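
A minimal sketch of this metric in NumPy; the function name and the assumption that y_hat holds the predicted rho-quantile per time step are ours.

import numpy as np

def quantile_loss(y, y_hat, rho):
    """Normalised rho-quantile loss QL_rho of Eq. (20)."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    p_rho = np.where(y > y_hat, rho * (y - y_hat), (1.0 - rho) * (y_hat - y))
    return 2.0 * p_rho.sum() / np.abs(y).sum()

# e.g. quantile_loss(y_true, median_forecast, rho=0.5) gives the rho-0.5 loss
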
Sanyo Hanergy Solar Electricity
Persistence 0.154/- 0.242/- 0.256/- 0.091/-
SARIMAX 0.124/0.096 0.145/0.098 0.256/0.192 0.196/0.079
DeepAR 0.070/0.031 0.092/0.045 0.222/0.093 0.075/0.040
LogTrans 0.067/0.036 0.124/0.066 0.210/0.082 0.059/0.034
N-BEATS 0.077/- 0.132/- 0.212/- 0.071/-
Informer 0.046/0.022 0.084/0.046 0.215/0.115 0.068/0.033
AMLNet-P1 0.044/0.021 0.084/0.043 0.224/0.091 0.068/0.034
AMLNet-P2 0.040/0.019 0.078/0.040 0.206/0.090 0.065/0.033
AMLNet-S 0.042/0.020 0.077/0.038 0.204/0.088 0.067/0.032
TABLE III: $\rho$0.5/$\rho$0.9-loss of data sets with various granularities. $\diamond$ denotes results from [7].

V-B Accuracy Analysis

Table III shows the $\rho$0.5 and $\rho$0.9 losses of all models, including the three AMLNet versions which use different decoders (AMLNet-P1, AMLNet-P2, AMLNet-S). As N-BEATS and Persistence produce point forecasts, only the $\rho$0.5-loss is reported for them. AMLNet is the most accurate method - it outperforms the other methods on all data sets except for $\rho$0.5 on Electricity, where the LogSparse Transformer is the best model. Within AMLNet, the S decoder successfully inherits the abilities of the P1 and P2 decoders with fewer layers and has the best performance on the Hanergy and Solar sets. Overall, all decoders of AMLNet outperform their backbone model Informer, except that P1 underperforms on the Solar and Electricity sets, indicating that AMLNet is beneficial. Both NAR branches (P2 and S) exhibit clear improvement over Informer, suggesting that the design for overcoming the disadvantages of AR and NAR models succeeds in improving forecasting accuracy.

AST [5] employs adversarial training to improve continuity and fidelity at the sequence level. AST is not compared directly because it generates quantile forecasts and minimises the quantile loss, while we consider probabilistic forecasting with the different objective function in Eq. (6). To compare the effectiveness of AMLNet with AST, we apply the adversarial training of AST to AMLNet and its backbone Informer; the results are shown in Table IV. The adversarial training improves the performance of Informer, while AMLNet still exhibits advantages. The design of AMLNet is compatible with the AST adversarial training, and combining both techniques achieves further performance improvement on the S decoder.

Sanyo Hanergy
Informer 0.046/0.022 0.084/0.046
Informer+Adv 0.045/0.022 0.079/0.041
AMLNet-P1 0.044/0.021 0.084/0.043
AMLNet-P2 0.040/0.019 0.078/0.040
AMLNet-S 0.042/0.020 0.077/0.038
AMLNet-P2+Adv 0.047/0.023 0.079/0.039
AMLNet-P1+Adv 0.045/0.022 0.099/0.047
AMLNet-S+Adv 0.039/0.020 0.072/0.035
TABLE IV: $\rho$0.5/$\rho$0.9-loss of adversarial training study.

V-C Case Analysis

To study the capabilities of AMLNet in addressing error accumulation and modeling output space interdependence, we conduct a comparative analysis with two benchmark models: the classic AR model DeepAR and the NAR model Informer. The evaluation is performed on both the Sanyo and Hanergy datasets, providing insights into AMLNet’s performance across different scenarios.

Fig. 3 illustrates the $\rho$0.5-loss of various models across different forecasting horizons, using a fixed input history. The loss of all models tends to increase as the forecasting horizon expands. However, it is evident that the performance of AR models, such as DeepAR, deteriorates more significantly compared to NAR models. Remarkably, AMLNet’s P1 decoder consistently outperforms DeepAR across different horizons, demonstrating its capability to mitigate the adverse effects of error accumulation. Conversely, NAR models, including Informer and AMLNet, exhibit relatively stable performance over varying forecasting horizons. This observation indicates that AMLNet’s design effectively addresses the issue of error accumulation in its P1 decoder.

Figure 3: $\rho$0.5-loss of DeepAR, Informer and AMLNet with various forecasting horizons on (a) the Sanyo set and (b) the Hanergy set.

Referencing Fig. 2 (c), it becomes evident that AMLNet’s S decoder exhibits lower cosine distances in hidden states compared to its backbone counterpart shown in Fig. 2 (b). This distinction is particularly pronounced when observing the lighter color in Fig. 2 (c) as compared to Fig. 2 (b). Additionally, the average cosine distances of the hidden states in the P2 and S decoders on both the Sanyo and Hanergy datasets are significantly lower, by 28% and 23% respectively, compared to the backbone model. Furthermore, the average DTW distances of the P2 and S decoder predictions exhibit a reduction of 18% and 17%, respectively. These findings underline the efficacy of our designed approach in learning and leveraging output space interdependence. This enables the model’s hidden states to exhibit greater similarity to neighboring states and subsequently generates more realistic and coherent prediction trajectories.

Sanyo Hanergy Solar Electricity
LogTrans 101.5±3.5 112.7±7.7 171.8±9.6 437.4±21.5
Informer 18.1±0.5 18.7±1.1 44.7±0.6 213.7±0.6
AMLNet-P1 150.2±2.4 148.4±0.8 249.5±0.4 600.1±0.7
AMLNet-P2 21.7±0.2 21.5±0.1 48.4±0.1 289.6±0.4
AMLNet-S 11.4±0.5 11.6±0.5 30.0±5.8 152.2±0.2
TABLE V: Inference time (ms) of data sets.

V-D Speed Analysis

We conducted an evaluation of the inference time for different configurations of AMLNet, as well as the NAR backbone Informer and the AR baseline LogTrans. The results are summarized in Table V. All experiments were conducted on the same computer configuration, and the reported values represent the average elapsed time in milliseconds along with the standard deviation from 10 runs. Notably, the NAR models exhibit faster inference times compared to the P1 decoder, primarily due to their inherent parallelizability. Informer and AMLNet-P2 demonstrate similar inference speeds, which is consistent with their comparable architectural characteristics. AMLNet-S, designed with fewer layers, stands out as the fastest among the models evaluated.

V-E Ablation Analysis

To assess the effectiveness of our proposed methods, we conducted an ablation study, reporting the $\rho$0.5/$\rho$0.9-losses of various model configurations. The results are presented in Table VI. In this table, $\mathcal{L}_o$ corresponds to the classic online KD, $\mathcal{L}_{wo}$ represents our outcome-driven KD, $\mathcal{L}_{GAN}$ indicates adversarial KD applied to the last hidden layer, and $\mathcal{L}_{hGAN}$ refers to our hint-driven KD.

Among the key findings from this ablation analysis:

  • AMLNet, when combined with our proposed KD methods, emerges as the most effective model configuration, attaining the highest accuracy.

  • Both outcome-driven KD and hint-driven KD lead to improved accuracy when incorporated into the frameworks, underscoring the efficacy of both design approaches.

  • Backbone+$\mathcal{L}_{wo}$+$\mathcal{L}_{GAN}$ outperforms Backbone+$\mathcal{L}_o$+$\mathcal{L}_{hGAN}$ substantially, suggesting that outcome-driven KD exerts a more pronounced impact on accuracy enhancement compared to hint-driven KD.

  • S decoders tend to outperform P1 and P2 decoders, supporting the notion that our design of online knowledge distillation from P1 and P2 to S is a beneficial strategy.

Sanyo Hanergy Solar
Backbone (Informer) 0.046/0.022 0.084/0.046 0.215/0.115
Backbone +$\mathcal{L}_o$ (DML) P1 0.053/0.024 0.098/0.048 0.258/0.090
P2 0.051/0.024 0.092/0.047 0.211/0.104
S 0.048/0.023 0.083/0.042 0.219/0.111
Backbone +$\mathcal{L}_o$+$\mathcal{L}_{GAN}$ (AFD) P1 0.053/0.023 0.096/0.046 0.556/0.202
P2 0.043/0.020 0.079/0.040 0.228/0.104
S 0.048/0.022 0.084/0.041 0.221/0.100
Backbone +$\mathcal{L}_o$+$\mathcal{L}_{hGAN}$ P1 0.051/0.023 0.095/0.046 0.276/0.098
P2 0.045/0.021 0.095/0.045 0.266/0.150
S 0.046/0.022 0.079/0.039 0.233/0.108
Backbone +$\mathcal{L}_{wo}$ P1 0.049/0.022 0.087/0.045 0.235/0.089
P2 0.045/0.022 0.093/0.047 0.215/0.090
S 0.049/0.023 0.083/0.041 0.208/0.086
Backbone +$\mathcal{L}_{wo}$+$\mathcal{L}_{GAN}$ P1 0.043/0.020 0.082/0.040 0.215/0.086
P2 0.043/0.020 0.079/0.038 0.205/0.087
S 0.043/0.020 0.079/0.037 0.205/0.089
Backbone +$\mathcal{L}_{wo}$+$\mathcal{L}_{hGAN}$ (AMLNet) P1 0.044/0.021 0.084/0.043 0.224/0.091
P2 0.040/0.019 0.078/0.040 0.206/0.090
S 0.042/0.020 0.077/0.038 0.204/0.088
TABLE VI: $\rho$0.5/$\rho$0.9-loss of data sets for ablation study.

VI Conclusion

We introduce AMLNet, a NAR forecasting model that harnesses both outcome-driven and hint-driven online KD methods. It comprises a shared encoder alongside deep AR (P1), deep NAR (P2), and shallow NAR (S) decoders. P1 and P2 operate collaboratively, mutually distilling knowledge from each other, and collectively acting as ensemble teachers to effectively transfer this knowledge to S. Our method dynamically assigns attention-based weights to the models’ output KD losses, thereby effectively mitigating the risk of learning from less reliable predictions. Additionally, we employ adversarial training to distill knowledge from the distribution of hidden states. This is particularly significant as the root of unrealistic forecasts in NAR models often lies within the hidden states, which inherently carry valuable information. Our extensive experimental evaluations substantiate the performance and effectiveness of AMLNet in comparison to state-of-the-art forecasting models and existing online KD methods. AMLNet excels not only in modeling output space interdependence, resulting in more plausible forecasts, but also in addressing the challenge of error accumulation, all while maintaining a low inference latency.

References

  • [1] X. Jiao, X. Li, D. Lin, and W. Xiao, “A graph neural network based deep learning predictor for spatio-temporal group solar irradiance forecasting,” IEEE Transactions on Industrial Informatics, vol. 18, no. 9, pp. 6142–6149, 2022.
  • [2] J. He, M. Khushi, N. H. Tran, and T. Liu, “Robust dual recurrent neural networks for financial time series prediction,” in Proceedings of the SIAM International Conference on Data Mining (SDM), 2021.
  • [3] T. Zhou, Z. Ma, Q. Wen, X. Wang, L. Sun, and R. Jin, “FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting,” in Proceedings of the International Conference on Machine Learning (ICML), 2022.
  • [4] J. Durbin and S. J. Koopman, Time Series Analysis by State Space Methods.   Oxford University Press, 2001.
  • [5] S. Wu, X. Xiao, Q. Ding, P. Zhao, Y. Wei, and J. Huang, “Adversarial sparse transformer for time series forecasting,” in Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), 2020.
  • [6] B. N. Oreshkin, D. Carpov, N. Chapados, and Y. Bengio, “N-BEATS: Neural basis expansion analysis for interpretable time series forecasting,” in Proceedings of the International Conference on Learning Representations (ICLR), 2020.
  • [7] S. Li, X. Jin, Y. Xuan, X. Zhou, W. Chen, Y.-X. Wang, and X. Yan, “Enhancing the locality and breaking the memory bottleneck of Transformer on time series forecasting,” in Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), 2019.
  • [8] D. Salinas, V. Flunkert, J. Gasthaus, and T. Januschowski, “DeepAR: Probabilistic forecasting with autoregressive recurrent networks,” International Journal of Forecasting, vol. 36, no. 3, pp. 1181 – 1191, 2020.
  • [9] Y. Lin, I. Koprinska, and M. Rana, “Ssdnet: State space decomposition neural network for time series forecasting,” in Proceedings of the IEEE International Conference on Data Mining (ICDM), 2021.
  • [10] ——, “Temporal convolutional neural networks for solar power forecasting,” in Proceedings of the International Joint Conference on Neural Networks (IJCNN), 2020.
  • [11] ——, “Temporal convolutional attention neural networks for time series forecasting,” in Proceedings of the International Joint Conference on Neural Networks (IJCNN), 2021.
  • [12] J. Gu, J. Bradbury, C. Xiong, V. O. K. Li, and R. Socher, “Non-autoregressive neural machine translation,” in Proceedings of the International Conference on Learning Representations (ICLR), 2018.
  • [13] E. J. Barezi, I. Calixto, K. Cho, and P. Fung, “A study on the autoregressive and non-autoregressive multi-label learning,” 2020.
  • [14] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, “Scheduled sampling for sequence prediction with recurrent neural networks,” in Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), 2015.
  • [15] R. Wen, K. Torkkola, B. Narayanaswamy, and D. Madeka, “A multi-horizon quantile recurrent forecaster,” in Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), 2018.
  • [16] S. B. Taieb and A. F. Atiya, “A bias and variance analysis for multistep-ahead time series forecasting,” IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 1, pp. 62–76, 2016.
  • [17] K. Cho, “Noisy parallel approximate decoding for conditional recurrent language model,” CoRR, 2016.
  • [18] J. Lee, E. Mansimov, and K. Cho, “Deterministic non-autoregressive neural sequence modeling by iterative refinement,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2018.
  • [19] H. Zhou, S. Zhang, J. Peng, S. Zhang, J. Li, H. Xiong, and W. Zhang, “Informer: Beyond efficient transformer for long sequence time-series forecasting,” in Proceedings of the Association for the Advancement of Artificial Intelligence (AAAI), 2021.
  • [20] D1, “Sanyo dataset,” http://dkasolarcentre.com.au/source/alice-springs/dka-m4-b-phase, 2020.
  • [21] D2, “Hanergy dataset,” http://dkasolarcentre.com.au/source/alice-springs/dka-m16-b-phase, 2020.
  • [22] D3, “Solar dataset,” https://www.nrel.gov/grid/solar-power-data.html, 2014.
  • [23] D4, “Electricity dataset,” https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014, 2015.
  • [24] Y. Lin, I. Koprinska, and M. Rana, “SpringNet: Transformer and spring dtw for time series forecasting,” in Proceedings of the International Conference on Neural Information Processing (ICONIP), 2020.
  • [25] Y. Ren, Y. Ruan, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, “Fastspeech: Fast, robust and controllable text to speech,” in Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), 2019.
  • [26] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
  • [27] Y. Kim and A. M. Rush, “Sequence-level knowledge distillation,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2016.
  • [28] Z. Li, Z. Lin, D. He, F. Tian, T. Qin, L. Wang, and T.-Y. Liu, “Hint-based training for non-autoregressive machine translation,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing and the International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019.
  • [29] Y. Zhang, T. Xiang, T. M. Hospedales, and H. Lu, “Deep mutual learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [30] I. Chung, S. Park, J. Kim, and N. Kwak, “Feature-map-level online adversarial knowledge distillation,” in Proceedings of the International Conference on Machine Learning (ICML), 2020.
  • [31] X. Zhang, S. Lu, H. Gong, Z. Luo, and M. Liu, “Amln: Adversarial-based mutual learning network for online knowledge distillation,” in Proceedings of the European Conference on Computer Vision (ECCV), 2020.
  • [32] G. Wu and S. Gong, “Peer collaborative learning for online knowledge distillation,” in Proceedings of the Association for the Advancement of Artificial Intelligence (AAAI), 2021.
  • [33] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial networks,” 2014.
  • [34] R. Dang-Nhu, G. Singh, P. Bielik, and M. Vechev, “Adversarial attacks on probabilistic autoregressive forecasting models,” in Proceedings of the International Conference on Machine Learning (ICML), 2020.
  • [35] H. Sakoe and S. Chiba, “Dynamic programming algorithm optimization for spoken word recognition,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 26, no. 1, pp. 43–49, 1978.

Appendix A Dynamic Time Warping

DTW is a classic algorithm to measure the similarity between two trajectories which may have different lengths and vary in time [35]. It finds the optimal alignment between two series via dynamic programming. Given two univariate time series $\mathbf{x}\coloneqq[x_1,...,x_n]\in\Re^{1\times n}$ and $\mathbf{y}\coloneqq[y_1,...,y_m]\in\Re^{1\times m}$, DTW can be considered as an optimization problem:

\text{DTW}(\mathbf{x},\mathbf{y}) = \min_{\pi\in\mathcal{A}(\mathbf{x},\mathbf{y})}\sum_{(i,j)\in\pi} d(x_i,y_j) = \min_{\pi\in\mathcal{A}(\mathbf{x},\mathbf{y})}\sum_{(i,j)\in\pi}|x_i-y_j| \quad (21)

where $\mathcal{A}(\mathbf{x},\mathbf{y})$ is the set of admissible alignment paths between $\mathbf{x}$ and $\mathbf{y}$, $\pi$ is a sequence of time index pairs and $d(x_i,y_j)=|x_i-y_j|$ is the distance measure. The alignment $\pi$ increases monotonically and continuously from $(1,1)$ to $(n,m)$.

The details of DTW are shown in Algorithm 2, where line 4 involves dynamic programming with a quadratic cost $O(nm)$.

Algorithm 2 Dynamic Time Warping
0:  $\mathbf{x}\coloneqq[x_1,...,x_n]\in\Re^{1\times n}$, $\mathbf{y}\coloneqq[y_1,...,y_m]\in\Re^{1\times m}$.
1:  $d_{0,0}=0$; $d_{i,0}=d_{0,j}=\infty$; $i\in[\![1,n]\!]$, $j\in[\![1,m]\!]$
2:  for $j\leftarrow 1$ to $m$ do
3:     for $i\leftarrow 1$ to $n$ do
4:        $d_{i,j}=|x_i-y_j|+\min\{d_{i-1,j-1},d_{i-1,j},d_{i,j-1}\}$
5:     end for
6:  end for
7:  Return $d_{n,m}$
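
For reference, a short Python sketch of Algorithm 2 (quadratic time and memory); the function name is ours.

import numpy as np

def dtw_distance(x, y):
    """Dynamic Time Warping distance between two 1-D series (Algorithm 2)."""
    n, m = len(x), len(y)
    d = np.full((n + 1, m + 1), np.inf)
    d[0, 0] = 0.0
    for j in range(1, m + 1):
        for i in range(1, n + 1):
            cost = abs(x[i - 1] - y[j - 1])
            d[i, j] = cost + min(d[i - 1, j - 1], d[i - 1, j], d[i, j - 1])
    return d[n, m]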