
A Supervised Contrastive Learning Pretrain-Finetune Approach for Time Series

Trang H. Tran1,2   Lam M. Nguyen2   Kyongmin Yeo2   Nam Nguyen2   Roman Vaculin2
1 School of Operations Research and Information Engineering, Cornell University, Ithaca, NY, USA
2 IBM Research, Thomas J. Watson Research Center, Yorktown Heights, NY, USA
[email protected], [email protected]
Work done while an intern at IBM Research
Abstract

Foundation models have recently gained attention within the field of machine learning thanks to their efficiency in broad data processing. While researchers have attempted to extend this success to time series models, the main challenge is effectively extracting representations and transferring knowledge from pretraining datasets to the target finetuning dataset. To tackle this issue, we introduce a novel pretraining procedure that leverages supervised contrastive learning to distinguish features within each pretraining dataset. This pretraining phase enables a probabilistic similarity metric, which assesses the likelihood of a univariate sample being closely related to one of the pretraining datasets. Subsequently, using this similarity metric as a guide, we propose a fine-tuning procedure designed to enhance the accurate prediction of the target data by aligning it more closely with the learned dynamics of the pretraining datasets. Our experiments show promising results which demonstrate the efficacy of our approach.

Correspondence to: Lam M. Nguyen.

1 Introduction

1.1 Motivations

Foundation Models For Time Series.

Lately, foundation models have gained significant prominence in the field of artificial intelligence and machine learning (Bommasani et al., 2021); notable examples include BERT (Devlin et al., 2019) and GPT-3 (Brown et al., 2020). These models are characterized by their training on extensive datasets, typically employing self-supervised methods at a large scale, and they possess the capability to be adapted for a wide array of downstream tasks (Bommasani et al., 2021). Therefore, there have been efforts to extend this success to other applications, including time series (Zhou et al., 2023; Rasul et al., 2023; Xue et al., 2023).

Challenges in General Representation Learning.

One of the main challenges in training foundation models for time series is to address the discrepancy between pretraining and finetuning data (Zhang et al., 2022b; Yeh et al., 2023). As demonstrated in Figure 1, this discrepancy arises at various levels. In Figure 1(a), although the last feature shows a slight deviation, most of the features bear some resemblance within the same dataset. However, across different datasets, the dynamics are vastly different, as shown in Figure 1(b). As a result, a foundation model should possess the capability to adapt to a heterogeneous collection of datasets. Therefore, it is desirable to find a general representation that captures the diverse knowledge in the pretraining task (Zhang et al., 2022b; Zerveas et al., 2021).

[Figure 1 panels: (a) ETTh1 dataset; (b) 5 different datasets]
Figure 1: Plots of different features from ETTh1 and from five datasets. A description of the datasets is in Section 4.1. The five datasets featured are Electricity, Exchange-Rate, Traffic, Weather and ETTm1.

The next challenge concerns the design of the model to transfer the knowledge to the finetuning task (Fawaz et al., 2018). It is reasonable to assume that the dynamics of the finetune dataset should be close, in some sense, to the dynamics of the collection of pretraining datasets. In this work, we adopt the assumption that the representations of the finetune dataset lie within the span of the representations of the different pretraining datasets.

Our approach is to differentiate the features that originate from different datasets by utilizing the learned representations of these features, thereby partially addressing the high-level discrepancy in time series dynamics and enhancing the knowledge within the foundation model. We summarize our contributions below.

Contributions.
  • We use a pretraining procedure with contrastive learning to differentiate the features in each pretraining dataset. This pretraining enables a probabilistic similarity metric that measures whether a univariate sample is likely to share similar dynamics with one of the pretraining datasets.

  • Using the similarity metric, we propose a finetuning procedure to aid the correct prediction of the target data, by making the finetune representation closer to the learned representations of the corresponding pretrain datasets.

  • Our experiments show that the pretrained models have promising performance compared to supervised training approaches. The finetuned model shows better generalization than prior approaches on some of the datasets, while having competitive results in the other settings.

1.2 Related Work

Time Series Forecasting

There are two primary approaches to multi-step-ahead time series prediction. Early approaches focus on the joint probability distributions of future system states by iteratively computing their evolution over time, typically using techniques like recurrent neural networks (RNNs), as demonstrated in (Levin, 1990; Yeo et al., 2022). The second approach revolves around training a time series model capable of directly predicting future time steps based on historical data input. This includes multilayer perceptron (MLP)-based methods (Gardner and Dorling, 1998; Zhang et al., 2022a) along with convolutional neural networks (O’Shea and Nash, 2015; Gu et al., 2018). With the rise of attention-based models and their success in natural language processing (Vaswani et al., 2017), attention-based time series models have gained popularity since they can discover the temporal dependencies among time points. Nevertheless, these models face a challenge due to their quadratic time and memory complexity when learning long-range temporal correlations. To address this, LogTrans (Li et al., 2019) and Pyraformer (Liu et al., 2021) propose strategies to introduce sparsity bias and reduce computational complexity. Informer (Zhou et al., 2021) and FEDformer (Zhou et al., 2022) leverage the low-rank properties of the self-attention matrix to enhance performance.

In contrast, Autoformer (Wu et al., 2021) introduces a novel architecture with an auto-correlation mechanism as an alternative to traditional attention-based models. Conversely, the work presented in (Zeng et al., 2022) takes a different approach, using a simple set of linear models and suggesting that these simplicity-driven models may outperform more complex structures. On the other hand, (Wu et al., 2023) learns temporal patterns by exploring the multi-periodicity of time series and capturing the temporal 2D-variations in 2D space.

Contrastive Learning.

In recent years, there have been significant advancements in self-supervised representation learning (Ericsson et al., 2022; Jaiswal et al., 2020; Misra and Maaten, 2020), with applications to time series (Kiyasseh et al., 2021; Tonekaboni et al., 2021; Franceschi et al., 2019; Eldele et al., 2021; Yue et al., 2022; Tang et al., 2020; Yang and Qiao, 2022; Zhang et al., 2022c; Nguyen et al., 2023). The common concept within these works is the idea of bringing an anchor and a positive sample closer in the embedding space, while separating the anchor from numerous negative samples. The work by (Khosla et al., 2020) extends the self-supervised contrastive approach to the fully-supervised setting, enabling effective utilization of label information. The contribution of (Khosla et al., 2020) is considering multiple positive pairs per anchor, in addition to the numerous negative pairs, in contrast to self-supervised contrastive learning with a single positive pair. In this work, we utilize the supervised contrastive learning framework of (Khosla et al., 2020) for the pretrain-finetune process. While there has been a self-supervised contrastive pretraining framework for time series (Zhang et al., 2022b), our approach is different since we utilize the labels for training.

2 Problem Description

(a) The encoder-decoder model and our pretrain loss. The pretrain loss has two components: a prediction loss on the model output, and a contrastive loss enforced on the representation $z$ of the model.
(b) Each dataset is assigned a label, and the contrastive loss maximizes the similarity (minimizes the difference) of representations from the same label group while minimizing the similarity across different label groups.
Figure 2: Description of our pretrain process with supervised contrastive learning

We are given a collection of multivariate training datasets $\{X^k\}_{\text{pretrain}}$, $k = 1, \dots, P$, each of size $T^k \times d^k$, where $T^k$ is the time dimension and $d^k$ is the number of features. The number of pretrain datasets is $P$. Our goal is to train a foundation model $M$ on the collection $\{X^k\}_{\text{pretrain}}$ and then finetune it to adapt to a new dataset $X_{\text{finetune}}$ of size $T^f \times d^f$, where $T^f$ and $d^f$ are the time dimension and the number of features of the finetune dataset, respectively.

We consider the time series forecasting problem where the model has the information of the previous $I$ time steps and aims to predict the next $O$ future time steps. Since the different datasets have different numbers of features, the design of the foundation model must have the capability to process such data. A typical approach to this problem is channel independence, which learns a common model for every univariate time series (Nie et al., 2023; Han et al., 2023; Li et al., 2023; Xue et al., 2023). It often consists of an encoder that transforms the input data $x_{t:t+I}$ into a representation (context vector) and a decoder that generates the output sequence $x_{t+I:t+I+O}$ based on this context (Zhang et al., 2023; Rasul et al., 2023). The encoder and the decoder can have various structures ranging from simple fully connected layers to complex designs, e.g., attention-based models (Rasul et al., 2023; Das et al., 2023). We describe this model structure in Figure 2(a).
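To make the channel-independent structure concrete, the following is a minimal PyTorch sketch of such an encoder-decoder forecaster. The class and argument names are our own illustration, not the exact implementation; the specific layer types and representation dimensions used in this paper are described in Section 4.2 and Appendix A.2.

import torch.nn as nn

class ChannelIndependentForecaster(nn.Module):
    """A univariate encoder-decoder applied to every channel separately."""

    def __init__(self, input_len: int, rep_dim: int, output_len: int):
        super().__init__()
        self.encoder = nn.Linear(input_len, rep_dim)    # x_{t:t+I} -> representation z
        self.decoder = nn.Linear(rep_dim, output_len)   # z -> forecast of x_{t+I:t+I+O}

    def forward(self, x):
        # x: (batch, input_len); each row is one univariate window (one channel of one series)
        z = self.encoder(x)
        return self.decoder(z), z   # z is returned so that losses can also act on the representation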

From the multivariate training datasets $\{X^k\}_{\text{pretrain}}$, we collect the univariate time series, which are further transformed into data samples using sliding windows. We note that the number of univariate data samples in each pretrain dataset is different. We build a pretrain sample collection which has an equal number of data samples from each of the pretrain datasets $\{X^k\}_{\text{pretrain}}$.

3 Pretrain-Finetune Approach with Supervised Contrastive Learning

3.1 Pretrain Process

In this section, we describe our pretraining process. Our framework uses an encoder-decoder model which takes a univariate time series as input. The pretrain loss function consists of two components: the first is the mean squared error between the predicted values and the ground truth; the second is a contrastive loss computed on the representation $z$ of the model. Accordingly:

$$\text{Loss}_{\text{pretrain}}\left(x_{t:t+I}\right) = \left\|\hat{x}_{t+I:t+I+O}-x_{t+I:t+I+O}\right\|^{2}+\lambda\operatorname{SupCon}\left(x_{t:t+I}\right), \quad (1)$$

where $\lambda$ is a regularizer and the (modified) supervised contrastive loss $\operatorname{SupCon}(x_{t:t+I})$ is

$$\frac{-1}{|P(z)|}\sum_{p\in P(z)}\log\frac{\exp\left(z\cdot z_{p}/\tau\right)}{\sum_{n\in N(z)}\exp\left(z\cdot z_{n}/\tau\right)+\epsilon},$$

where $z$ is the representation of the input data $x_{t:t+I}$, $\tau>0$ is a scalar temperature parameter, and $P(z)$ and $N(z)$ are the sets of positive and negative representations for $z$ within a batch of time series, respectively. The representations are negative if they come from different datasets, and positive if they are from the same pretraining dataset. The operator $\cdot$ is a similarity metric, e.g., the inner (dot) product or cosine similarity.

We apply this loss function to each batch of data samples. By minimizing this contrastive loss, the model maximizes the similarity between $z$ and the set of positive representations and minimizes the similarity with the set of negative representations, within the batch. Compared to (Khosla et al., 2020), we add a small factor $\epsilon$ in the denominator since, theoretically, a batch could contain no negative representations; with $\epsilon$, the loss remains well-defined.
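For concreteness, a minimal PyTorch sketch of this SupCon term is given below, assuming cosine similarity for the operator $\cdot$ and a batch whose samples are labeled by their pretrain dataset; the function and variable names are ours and this is not the exact implementation used in the paper. The full pretrain loss in (1) is then the forecasting MSE plus $\lambda$ times this term.

import torch
import torch.nn.functional as F

def sup_con_loss(z, dataset_labels, tau=0.1, eps=1e-8):
    """Modified supervised contrastive loss: for each anchor, positives come
    from the same pretrain dataset, negatives from other datasets; eps keeps
    the loss defined when a batch contains no negatives."""
    z = F.normalize(z, dim=1)                        # cosine similarity via dot products (assumption)
    sim = torch.exp(z @ z.T / tau)                   # pairwise exp(z . z' / tau)
    same = dataset_labels.unsqueeze(0) == dataset_labels.unsqueeze(1)
    eye = torch.eye(len(z), dtype=torch.bool, device=z.device)
    pos_mask = same & ~eye                           # same dataset, excluding the anchor itself
    neg_mask = ~same                                 # different datasets
    neg_sum = (sim * neg_mask).sum(dim=1) + eps      # denominator of the SupCon term, per anchor
    log_ratio = torch.log(sim / neg_sum.unsqueeze(1))
    num_pos = pos_mask.sum(dim=1).clamp(min=1)       # avoid division by zero if no positives
    return (-(log_ratio * pos_mask).sum(dim=1) / num_pos).mean()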

3.2 Probability of Similarity to A Pretrain Dataset

After the pretraining process, the pretrained model $M_{\text{pretrained}}$ is better equipped to differentiate the heterogeneous temporal dynamics of time series data. However, the next question is how to leverage this knowledge in the finetune process, i.e., how the model recognizes the dynamics it has learned in the past. In this section, we propose to use a quantity that approximates the probability of similarity to a previously seen pretrain dataset. This helps to analyze the model better and aids the finetuning process.

Let $z$ be a representation of a finetune data sample and $\{z_l\}_{l=1,2,\dots}$ be all the representations of the pretraining samples, produced by the pretrained model $M_{\text{pretrained}}$. We note that those representations depend on the current learning model. We recall that $P$ is the number of pretraining datasets and the similarity metric is $z\cdot z_{l}$. Thus the approximate probability that the finetune data corresponding to $z$ comes from dataset $i$ is:

$$p_{i}=\frac{\sum_{l\in\operatorname{Dataset}(i)}\exp\left(z\cdot z_{l}/\tau\right)}{\sum_{j=1,\ldots,P}\sum_{l\in\operatorname{Dataset}(j)}\exp\left(z\cdot z_{l}/\tau\right)} \quad (2)$$

This estimate naturally arises from the design of the supervised contrastive loss. Since the model maximizes $\exp\left(z\cdot z_{p}/\tau\right)$ where $z_{p}$ is a positive representation and minimizes $\exp\left(z\cdot z_{n}/\tau\right)$ where $z_{n}$ is a negative representation, then for a new representation $z$, a dataset $i$ with dynamics similar to $z$ should have a higher value of $\sum_{l\in\operatorname{Dataset}(i)}\exp\left(z\cdot z_{l}/\tau\right)$. Dividing this quantity by the sum over all the datasets, we get the estimated probability in (2).
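A possible implementation of the estimate in (2) is sketched below, given stored representations of the pretrain samples and their dataset labels; the cosine-similarity normalization and all names are our assumptions rather than the authors' code.

import torch
import torch.nn.functional as F

def dataset_similarity_probs(z, pretrain_reps, pretrain_labels, num_datasets, tau=0.1):
    """Estimate, for each finetune representation in z, the probability of
    similarity to each of the num_datasets pretrain datasets (Eq. 2)."""
    z = F.normalize(z, dim=-1)                           # cosine similarity (assumption)
    reps = F.normalize(pretrain_reps, dim=-1)
    sim = torch.exp(z @ reps.T / tau)                    # (n_finetune, n_pretrain)
    probs = torch.zeros(z.shape[0], num_datasets, device=z.device)
    for i in range(num_datasets):
        probs[:, i] = sim[:, pretrain_labels == i].sum(dim=1)
    return probs / probs.sum(dim=1, keepdim=True)        # each row sums to one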

3.3 Finetune Using Similarity Metrics

An estimated probability is a good tool to analyze the finetune sample data and gain insight on whether the finetune data is likely to belong to, or behave similarly to, any of the pretraining datasets. In this section, we propose to utilize this insight. Intuitively, if there is a high chance that a finetune sample belongs to a pretrain dataset $i$, then it is best for the model to use the dynamics learned from dataset $i$. On the other hand, if there is more than one dominant dataset, e.g., $[0.4, 0.4, 0.1, 0.1]$, then it is beneficial to consider all the datasets with high chances (see Figure 3). This observation motivates our finetune process.

Figure 3: The representation of the finetune data can be closer to some sample groups (0) and (1) than to another group (2). In such cases, we give priority to groups (0) and (1) and avoid being close to (2).

The finetune loss function is similar to the pretrain loss, consisting of a prediction component and a supervised contrastive component:

$$\operatorname{Loss}_{\text{finetune}}\left(x_{t:t+I}\right) = \left\|\hat{x}_{t+I:t+I+O}-x_{t+I:t+I+O}\right\|^{2}+\lambda^{\prime}\operatorname{FTCon}\left(x_{t:t+I}\right), \quad (3)$$

where $\lambda^{\prime}$ is a regularizer and the finetune contrastive loss $\operatorname{FTCon}(x)$ is

$$\frac{-1}{|P(z)|}\sum_{p\in P(z)}\log\frac{\exp\left(z\cdot z_{p}/\tau\right)}{\sum_{n\in N(z)}\exp\left(z\cdot z_{n}/\tau\right)+\epsilon},$$

where $z$ is the (current) representation of the finetune input data $x_{t:t+I}$, and the sets $P(z)$ and $N(z)$ of positive and negative representations of pretrain samples are defined as:

$$\begin{cases} i\in P(z) & \text{if } p_{i}>1/P,\\ i\in N(z) & \text{if } p_{i}<1/P,\end{cases}$$

where $p_{i}$ is the estimated probability in (2). The choice of $1/P$ stems from the fact that there are $P$ pretrain datasets, i.e., a probability higher than $1/P$ indicates higher similarity, which is considered positive. If the model predicts $p_{i}=1/P$, then it offers no information on whether the finetune data is similar to dataset $i$ or not, and dataset $i$ is discarded (not considered in the process). We note that $p_{i}$ is not fixed throughout the finetuning process, as it changes with the weights of the model as the representations are updated.

The advantage of the finetune loss is two-fold. On the one hand, the information of $p_{i}$ helps the model to find better representations of the finetune data that are closer to those of the pretraining datasets to which it is similar. On the other hand, when the representations learned by the pretrained model are not good enough for the finetune data (and might lead to inaccurate or mismatched probabilities), the prediction loss helps to find better representations, which gradually changes their estimated probabilities and gives better context for the finetune time series dynamics.
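Putting the pieces together, the sketch below illustrates one way the finetune contrastive term could be computed; it reuses dataset_similarity_probs from the previous sketch, applies the $1/P$ thresholding rule exactly as described above, and leaves the batching and similarity choices as our assumptions. The finetune loss (3) then adds the forecasting MSE and weights this term by $\lambda^{\prime}$.

import torch
import torch.nn.functional as F

def finetune_contrastive_loss(z_finetune, pretrain_reps, pretrain_labels,
                              num_datasets, tau=0.1, eps=1e-8):
    """FTCon sketch: pretrain datasets with p_i > 1/P supply positives and
    p_i < 1/P supply negatives; p_i == 1/P is discarded (Section 3.3)."""
    probs = dataset_similarity_probs(z_finetune, pretrain_reps,
                                     pretrain_labels, num_datasets, tau)
    z = F.normalize(z_finetune, dim=-1)
    reps = F.normalize(pretrain_reps, dim=-1)
    sim = torch.exp(z @ reps.T / tau)                    # (n_finetune, n_pretrain)
    losses = []
    for i in range(z.shape[0]):
        pos_sets = (probs[i] > 1.0 / num_datasets).nonzero().flatten()
        neg_sets = (probs[i] < 1.0 / num_datasets).nonzero().flatten()
        pos = torch.isin(pretrain_labels, pos_sets)      # pretrain samples acting as positives
        neg = torch.isin(pretrain_labels, neg_sets)      # pretrain samples acting as negatives
        if pos.sum() == 0:
            continue
        denom = sim[i, neg].sum() + eps
        losses.append(-torch.log(sim[i, pos] / denom).mean())
    return torch.stack(losses).mean() if losses else z.new_zeros(())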

4 Experiments

4.1 Experiment Settings

Datasets

To maintain a fair comparison between our work and prior benchmarks, we test our model using the standard experiment procedure as in (Wu et al., 2021). Our experiments use the following real-world datasets: ETT, Traffic, Electricity, Weather, Exchange-Rate and ILI (Wu et al., 2021). The ETT dataset contains information collected from electricity transformers, with load and oil temperature data recorded from July 2016 to July 2018 (hourly for ETTh and at 15-minute intervals for ETTm). The Electricity dataset records the hourly electricity consumption of 321 customers over a three-year period. The Exchange-Rate dataset contains daily exchange rate data from eight different countries spanning from 1990 to 2016. The Traffic dataset records hourly data from the California Department of Transportation, including road occupancy rates measured by various sensors throughout the San Francisco Bay area. The Weather dataset provides 21 meteorological indicators, with data recorded at 10-minute intervals throughout the year 2020. Lastly, the ILI dataset includes weekly influenza-like illness (ILI) patient data from the CDC between 2002 and 2021.

We follow the standard experiment procedure as in (Wu et al., 2021). The time series data is split into training, validation and test sets in chronological order by the ratio of 7:1:2 for all the data sets. To ensure fair comparison, in our pretraining process we only use the training proportions of the original datasets. In our test, we use the same metrics as the prior reference (Wu et al., 2021) with batch size 32.

Table 1: Comparisons of the test performance from our pretrained model and other supervised learning models*.

Models Our PT model Ratio TimesNet ETSformer LightTS DLinear FEDformer Stationary Autoformer Data Metric MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE \cellcolor[HTML]FFF7E7 96 0.415 0.430 \cellcolor[HTML]CCDDF81.228 \cellcolor[HTML]DFEAFB1.147 0.338 0.375 0.375 0.398 0.374 0.400 0.345 0.372 0.379 0.419 0.386 0.398 0.505 0.475 \cellcolor[HTML]FFF7E7 192 0.471 0.451 \cellcolor[HTML]C5D9F81.259 \cellcolor[HTML]DAE7FA1.165 0.374 0.387 0.408 0.410 0.400 0.407 0.380 0.389 0.426 0.441 0.459 0.444 0.553 0.496 \cellcolor[HTML]FFF7E7 336 0.513 0.479 \cellcolor[HTML]C7DAF81.251 \cellcolor[HTML]DAE7FA1.165 0.410 0.411 0.435 0.428 0.438 0.438 0.413 0.413 0.445 0.459 0.495 0.464 0.621 0.537 \cellcolor[HTML]FFF7E7 ETTm1 720 0.565 0.521 \cellcolor[HTML]D7E4FA1.182 \cellcolor[HTML]DCE8FB1.158 0.478 0.450 0.499 0.462 0.527 0.502 0.474 0.453 0.543 0.490 0.585 0.516 0.671 0.561 \cellcolor[HTML]FFF7E7 96 0.413 0.417 \cellcolor[HTML]EFF4FD1.076 \cellcolor[HTML]F7FAFE1.037 0.384 0.402 0.494 0.479 0.424 0.432 0.386 0.400 0.376 0.419 0.513 0.491 0.449 0.459 \cellcolor[HTML]FFF7E7 192 0.455 0.442 \cellcolor[HTML]F6F9FE1.044 \cellcolor[HTML]F9FBFF1.030 0.436 0.429 0.538 0.504 0.475 0.462 0.437 0.432 0.420 0.448 0.534 0.504 0.500 0.482 \cellcolor[HTML]FFF7E7 336 0.496 0.467 \cellcolor[HTML]FDFEFF1.010 \cellcolor[HTML]F4CCCC0.996 0.491 0.469 0.574 0.521 0.518 0.488 0.481 0.459 0.459 0.465 0.588 0.535 0.521 0.496 \cellcolor[HTML]FFF7E7 ETTh1 720 0.537 0.525 \cellcolor[HTML]F9FBFF1.031 \cellcolor[HTML]F4F8FE1.050 0.521 0.500 0.562 0.535 0.547 0.533 0.519 0.516 0.506 0.507 0.643 0.616 0.514 0.512 \cellcolor[HTML]FFF7E7 96 0.103 0.239 \cellcolor[HTML]F4CCCC0.963 \cellcolor[HTML]FBFCFF1.021 0.107 0.234 0.085 0.204 0.116 0.262 0.088 0.218 0.148 0.278 0.111 0.237 0.197 0.323 \cellcolor[HTML]FFF7E7 192 0.184 0.325 \cellcolor[HTML]F4CCCC0.814 \cellcolor[HTML]F4CCCC0.945 0.226 0.344 0.182 0.303 0.215 0.359 0.176 0.315 0.271 0.380 0.219 0.335 0.300 0.369 \cellcolor[HTML]FFF7E7 336 0.296 0.420 \cellcolor[HTML]F4CCCC0.807 \cellcolor[HTML]F4CCCC0.938 0.367 0.448 0.348 0.428 0.377 0.466 0.313 0.427 0.460 0.500 0.421 0.476 0.509 0.524 \cellcolor[HTML]FFF7E7 Exchange 720 0.537 0.588 \cellcolor[HTML]F4CCCC0.557 \cellcolor[HTML]F4CCCC0.788 0.964 0.746 1.025 0.774 0.831 0.699 0.839 0.695 1.195 0.841 1.092 0.769 1.447 0.941 \cellcolor[HTML]FFF7E7 96 0.253 0.336 \cellcolor[HTML]8EB4F01.506 \cellcolor[HTML]CBDCF81.235 0.168 0.272 0.187 0.304 0.207 0.307 0.197 0.282 0.193 0.308 0.169 0.273 0.201 0.317 \cellcolor[HTML]FFF7E7 192 0.247 0.337 \cellcolor[HTML]B3CCF51.342 \cellcolor[HTML]DAE7FA1.166 0.184 0.289 0.199 0.315 0.213 0.316 0.196 0.285 0.201 0.315 0.182 0.286 0.222 0.334 \cellcolor[HTML]FFF7E7 336 0.268 0.360 \cellcolor[HTML]B0CBF51.354 \cellcolor[HTML]D3E2F91.200 0.198 0.300 0.212 0.329 0.230 0.333 0.209 0.301 0.214 0.329 0.200 0.304 0.231 0.338 \cellcolor[HTML]FFF7E7 Electricity 720 0.310 0.398 \cellcolor[HTML]A4C2F31.409 \cellcolor[HTML]C9DBF81.244 0.220 0.320 0.233 0.345 0.265 0.360 0.245 0.333 0.246 0.355 0.222 0.321 0.254 0.361 \cellcolor[HTML]F1E5FF 96 0.200 0.296 \cellcolor[HTML]F0F5FD1.070 \cellcolor[HTML]E7EFFC1.109 0.187 0.267 0.189 0.280 0.209 0.308 0.193 0.292 0.203 0.287 0.192 0.274 0.255 0.339 \cellcolor[HTML]F1E5FF 192 0.286 0.360 \cellcolor[HTML]DEE9FB1.149 \cellcolor[HTML]DAE7FA1.165 0.249 0.309 0.253 0.319 0.311 0.382 0.284 0.362 0.269 0.328 0.280 0.339 0.281 0.340 \cellcolor[HTML]F1E5FF 336 0.424 0.452 \cellcolor[HTML]B7D0F61.321 \cellcolor[HTML]BFD5F71.288 0.321 0.351 0.314 0.357 0.442 
0.466 0.369 0.427 0.325 0.366 0.334 0.361 0.339 0.372 \cellcolor[HTML]F1E5FF ETTm2 720 0.673 0.570 \cellcolor[HTML]6D9EEB1.650 \cellcolor[HTML]A2C2F31.414 0.408 0.403 0.414 0.413 0.675 0.587 0.554 0.522 0.421 0.415 0.417 0.413 0.433 0.432 \cellcolor[HTML]F1E5FF 96 0.317 0.370 \cellcolor[HTML]F4CCCC0.932 \cellcolor[HTML]F4CCCC0.989 0.340 0.374 0.340 0.391 0.397 0.437 0.333 0.387 0.358 0.397 0.476 0.458 0.346 0.388 \cellcolor[HTML]F1E5FF 192 0.420 0.433 \cellcolor[HTML]F5F9FE1.045 \cellcolor[HTML]F5F9FE1.046 0.402 0.414 0.430 0.439 0.520 0.504 0.477 0.476 0.429 0.439 0.512 0.493 0.456 0.452 \cellcolor[HTML]F1E5FF 336 0.536 0.507 \cellcolor[HTML]D6E4FA1.186 \cellcolor[HTML]E4EDFC1.122 0.452 0.452 0.485 0.479 0.626 0.559 0.594 0.541 0.496 0.487 0.552 0.551 0.482 0.486 \cellcolor[HTML]F1E5FF ETTh2 720 0.719 0.601 \cellcolor[HTML]82ACEE1.556 \cellcolor[HTML]C0D5F71.284 0.462 0.468 0.500 0.497 0.863 0.672 0.831 0.657 0.463 0.474 0.562 0.560 0.515 0.511 \cellcolor[HTML]F1E5FF 96 0.888 0.518 \cellcolor[HTML]90B5F01.497 \cellcolor[HTML]76A4ED1.614 0.593 0.321 0.607 0.392 0.615 0.391 0.650 0.396 0.587 0.366 0.612 0.338 0.613 0.388 \cellcolor[HTML]F1E5FF 192 0.781 0.477 \cellcolor[HTML]C4D8F71.266 \cellcolor[HTML]A1C1F31.420 0.617 0.336 0.621 0.399 0.601 0.382 0.598 0.370 0.604 0.373 0.613 0.340 0.616 0.382 \cellcolor[HTML]F1E5FF 336 0.786 0.484 \cellcolor[HTML]C7DAF81.250 \cellcolor[HTML]9CBEF21.440 0.629 0.336 0.622 0.396 0.613 0.386 0.605 0.373 0.621 0.383 0.618 0.328 0.622 0.337 \cellcolor[HTML]F1E5FF Traffic 720 0.813 0.497 \cellcolor[HTML]C3D7F71.270 \cellcolor[HTML]A1C1F31.420 0.640 0.350 0.632 0.396 0.658 0.407 0.645 0.394 0.626 0.382 0.653 0.355 0.660 0.408 \cellcolor[HTML]F1E5FF 96 0.222 0.276 \cellcolor[HTML]BED4F71.291 \cellcolor[HTML]C6D9F81.255 0.172 0.220 0.197 0.281 0.182 0.242 0.196 0.255 0.217 0.296 0.173 0.223 0.266 0.336 \cellcolor[HTML]F1E5FF 192 0.265 0.311 \cellcolor[HTML]D0E0F91.210 \cellcolor[HTML]D4E3FA1.192 0.219 0.261 0.237 0.312 0.227 0.287 0.237 0.296 0.276 0.336 0.245 0.285 0.307 0.367 \cellcolor[HTML]F1E5FF 336 0.302 0.345 \cellcolor[HTML]EEF4FD1.079 \cellcolor[HTML]E3ECFC1.127 0.280 0.306 0.298 0.353 0.282 0.334 0.283 0.335 0.339 0.380 0.321 0.338 0.359 0.395 \cellcolor[HTML]F1E5FF Weather 720 0.375 0.406 \cellcolor[HTML]F9FBFF1.027 \cellcolor[HTML]E2ECFB1.131 0.365 0.359 0.352 0.288 0.352 0.386 0.345 0.381 0.403 0.428 0.414 0.410 0.419 0.428

(*) The red bold text indicates the best amongst all methods, and the blue underlined text indicates the second best method. The datasets within the pretrain collection are highlighted in yellow, while the others are highlighted in purple. In the Ratio column, the numbers highlighted in red indicate the settings where the pretrained model has lower (better) metrics than TimesNet. The blue color scale indicates the settings where the pretrained test metrics are worse than the TimesNet test results.

4.2 Pretrain and Results

We choose a collection of pretraining datasets including ETTh1, ETTm1, Electricity and Exchange-Rate. The remaining datasets (ETTh2, ETTm2, Weather, Traffic, ILI) are reserved for finetuning and/or further testing the generalization ability of our model. Note that we do not compare the test performance for the ILI data with the previous results because its setting of input and prediction lengths is different from the other datasets. In order to handle the discrepancy in the sizes of the datasets, we design a pretraining collection such that it has an equal number of data samples from each pretrain dataset. Note that we only choose a subset of features from the Electricity data because it has many more features than the other datasets. The detailed description of this collection is in the Appendix.

In our experiment, we choose a simple linear layer for the encoder and the decoder of the model. We choose this simple architecture to better analyze the pretrain-finetune process and also because simple models have shown impressive performance in time series forecasting (Zeng et al., 2022; Tran et al., 2023). We train our model using PyTorch (Paszke et al., 2019) with the ADAM optimizer (Kingma and Ba, 2014) for 10 training epochs. We choose the regularization parameter $\lambda=0.1$ and the temperature parameter $\tau=0.1$, and the pretrain batch size is 512. The experiments are repeated 3 times. All the experiment details are in the Appendix. We compare the test errors (MSE and MAE) of our model with the following supervised learning models for time series forecasting: TimesNet (Wu et al., 2023), ETSformer (Woo et al., 2022), LightTS (Zhang et al., 2022a), DLinear (Zeng et al., 2022), FEDformer (Zhou et al., 2022), Stationary (Liu et al., 2022), Autoformer (Wu et al., 2021). Table 1 records their test results.

Note that we do not remove the last layer of our model before finetuning, and the pretrained model can be tested on all of the datasets. Our pretrain results are reported in Table 1. In this table, we also report the ratios of the test errors between our method and TimesNet, a state-of-the-art model for time series processing.

Results. Table 1 shows that our pretrained model has promising generalization results on most of the pretraining datasets. For ETTh1, the model performs only slightly worse than TimesNet, i.e., the difference is $3.4\%$ on average relative to the TimesNet accuracy. The pretrained model generalizes very well on the Exchange-Rate dataset: it is better than TimesNet in 7/8 settings and even better than all the supervised methods for the long-term predictions (336 and 720 forward time steps). We argue that the pretrain process acts as a regularization for the Exchange-Rate dataset in this case. On the other hand, for the ETTm1 and Electricity datasets the pretrained model is worse than TimesNet (the average differences are $19.4\%$ and $30.7\%$, respectively). However, that is not surprising because the supervised models are each trained specifically on that dataset.

For the other datasets, to which the pretrained model did not have access, it is reasonable that the test performance is worse than for the datasets within the pretrain collection. The Weather and ETTh2 datasets have the best generalization, with average differences of $16.4\%$ and $14.5\%$ compared to the TimesNet method.

4.3 Similarity Results

In this section, we show the estimated probability from the pretrained model. We compute this metric using our collection of pretrain samples and average over the finetune samples. We report the percentages in Table 2.

Table 2: The chance (%) that a finetune dataset is similar to one of the pretrain datasets, estimated by our pretrained model. The first four rows correspond to the datasets in the pretrain collection.

Finetune       % of similarity to the pretrain datasets
datasets       ETTh1    ETTm1    Exchange    Electricity
ETTh1          67.62    11.86    11.82        8.70
ETTm1          11.55    53.95    27.58        6.93
Exchange        1.56    10.01    87.79        0.65
Electricity    15.20    22.08    53.75        8.97
ETTh2          30.87    22.65    30.49       16.00
ETTm2          15.71    30.10    44.00       10.19
Traffic        41.40    21.66    17.39       19.54
Weather         0.52     7.40    91.87        0.22
ILI            25.97    34.07    30.93        9.04

This analysis shows that the pretrained model predicts well for the three datasets ETTh1, ETTm1 and Exchange. However, the model cannot classify the Electricity data, and this fact explains its bad generalization error in Table 1. The Exchange-Rate data is predicted with a high probability and the model shows good generalization there. Among the other finetune datasets, a related phenomenon appears for the Weather data: a high probability of similarity and good generalization. The ETTh2 dataset has good metrics since it is close to the ETTh1 dataset, while ETTm2 has a higher probability of similarity to the Exchange dataset. This confirms our finding that the model prediction and contrastive representation learning complement each other. Another observation from this experiment is that the correct classifications do not vary much between different sample batches.

Table 3: Comparisons of the test performance from our finetuned model and other supervised learning models**.

Models Our FT model Ratio TimesNet ETSformer LightTS DLinear FEDformer Stationary Autoformer Data Metric MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE \cellcolor[HTML]FFF7E7 96 0.347 0.373 \cellcolor[HTML]F8FBFE1.027 \cellcolor[HTML]F4CCCC0.995 0.338 0.375 0.375 0.398 0.374 0.400 0.345 0.372 0.379 0.419 0.386 0.398 0.505 0.475 \cellcolor[HTML]FFF7E7 192 0.384 0.393 \cellcolor[HTML]F8FBFE1.027 \cellcolor[HTML]FBFDFF1.016 0.374 0.387 0.408 0.410 0.400 0.407 0.380 0.389 0.426 0.441 0.459 0.444 0.553 0.496 \cellcolor[HTML]FFF7E7 336 0.414 0.415 \cellcolor[HTML]FDFEFF1.010 \cellcolor[HTML]FDFEFF1.010 0.410 0.411 0.435 0.428 0.438 0.438 0.413 0.413 0.445 0.459 0.495 0.464 0.621 0.537 \cellcolor[HTML]FFF7E7 ETTm1 720 0.473 0.451 \cellcolor[HTML]F4CCCC0.990 \cellcolor[HTML]FFFFFF1.002 0.478 0.450 0.499 0.462 0.527 0.502 0.474 0.453 0.543 0.490 0.585 0.516 0.671 0.561 \cellcolor[HTML]FFF7E7 96 0.385 0.398 \cellcolor[HTML]FFFFFF1.003 \cellcolor[HTML]F4CCCC0.990 0.384 0.402 0.494 0.479 0.424 0.432 0.386 0.400 0.376 0.419 0.513 0.491 0.449 0.459 \cellcolor[HTML]FFF7E7 192 0.432 0.425 \cellcolor[HTML]F4CCCC0.991 \cellcolor[HTML]F4CCCC0.991 0.436 0.429 0.538 0.504 0.475 0.462 0.437 0.432 0.420 0.448 0.534 0.504 0.500 0.482 \cellcolor[HTML]FFF7E7 336 0.473 0.448 \cellcolor[HTML]F4CCCC0.963 \cellcolor[HTML]F4CCCC0.955 0.491 0.469 0.574 0.521 0.518 0.488 0.481 0.459 0.459 0.465 0.588 0.535 0.521 0.496 \cellcolor[HTML]FFF7E7 ETTh1 720 0.492 0.490 \cellcolor[HTML]F4CCCC0.944 \cellcolor[HTML]F4CCCC0.980 0.521 0.500 0.562 0.535 0.547 0.533 0.519 0.516 0.506 0.507 0.643 0.616 0.514 0.512 \cellcolor[HTML]FFF7E7 96 0.081 0.204 \cellcolor[HTML]F4CCCC0.757 \cellcolor[HTML]F4CCCC0.872 0.107 0.234 0.085 0.204 0.116 0.262 0.088 0.218 0.148 0.278 0.111 0.237 0.197 0.323 \cellcolor[HTML]FFF7E7 192 0.164 0.300 \cellcolor[HTML]F4CCCC0.726 \cellcolor[HTML]F4CCCC0.872 0.226 0.344 0.182 0.303 0.215 0.359 0.176 0.315 0.271 0.380 0.219 0.335 0.300 0.369 \cellcolor[HTML]FFF7E7 336 0.295 0.407 \cellcolor[HTML]F4CCCC0.804 \cellcolor[HTML]F4CCCC0.908 0.367 0.448 0.348 0.428 0.377 0.466 0.313 0.427 0.460 0.500 0.421 0.476 0.509 0.524 \cellcolor[HTML]FFF7E7 Exchange 720 0.535 0.587 \cellcolor[HTML]F4CCCC0.555 \cellcolor[HTML]F4CCCC0.787 0.964 0.746 1.025 0.774 0.831 0.699 0.839 0.695 1.195 0.841 1.092 0.769 1.447 0.941 \cellcolor[HTML]FFF7E7 96 0.197 0.282 \cellcolor[HTML]D0E0F91.173 \cellcolor[HTML]F5F9FE1.037 0.168 0.272 0.187 0.304 0.207 0.307 0.197 0.282 0.193 0.308 0.169 0.273 0.201 0.317 \cellcolor[HTML]FFF7E7 192 0.195 0.282 \cellcolor[HTML]EFF5FD1.060 \cellcolor[HTML]F4CCCC0.976 0.184 0.289 0.199 0.315 0.213 0.316 0.196 0.285 0.201 0.315 0.182 0.286 0.222 0.334 \cellcolor[HTML]FFF7E7 336 0.207 0.296 \cellcolor[HTML]F3F7FE1.045 \cellcolor[HTML]F4CCCC0.987 0.198 0.300 0.212 0.329 0.230 0.333 0.209 0.301 0.214 0.329 0.200 0.304 0.231 0.338 \cellcolor[HTML]FFF7E7 Electricity 720 0.242 0.229 \cellcolor[HTML]E4EDFC1.100 \cellcolor[HTML]F4CCCC0.716 0.220 0.320 0.233 0.345 0.265 0.360 0.245 0.333 0.246 0.355 0.222 0.321 0.254 0.361 \cellcolor[HTML]F1E5FF 96 0.196 0.294 \cellcolor[HTML]F2F7FE1.048 \cellcolor[HTML]E4EDFC1.101 0.187 0.267 0.189 0.280 0.209 0.308 0.193 0.292 0.203 0.287 0.192 0.274 0.255 0.339 \cellcolor[HTML]F1E5FF 192 0.266 0.342 \cellcolor[HTML]EDF3FD1.068 \cellcolor[HTML]E2ECFB1.107 0.249 0.309 0.253 0.319 0.311 0.382 0.284 0.362 0.269 0.328 0.280 0.339 0.281 0.340 \cellcolor[HTML]F1E5FF 336 0.365 0.412 \cellcolor[HTML]DAE7FA1.137 \cellcolor[HTML]D0E0F91.174 0.321 0.351 0.314 0.357 0.442 
0.466 0.369 0.427 0.325 0.366 0.334 0.361 0.339 0.372 \cellcolor[HTML]F1E5FF ETTm2 720 0.492 0.485 \cellcolor[HTML]C7DAF81.206 \cellcolor[HTML]C8DAF81.203 0.408 0.403 0.414 0.413 0.675 0.587 0.554 0.522 0.421 0.415 0.417 0.413 0.433 0.432 \cellcolor[HTML]F1E5FF 96 0.316 0.372 \cellcolor[HTML]F4CCCC0.929 \cellcolor[HTML]F4CCCC0.995 0.340 0.374 0.340 0.391 0.397 0.437 0.333 0.387 0.358 0.397 0.476 0.458 0.346 0.388 \cellcolor[HTML]F1E5FF 192 0.419 0.433 \cellcolor[HTML]F4F8FE1.042 \cellcolor[HTML]F3F7FE1.046 0.402 0.414 0.430 0.439 0.520 0.504 0.477 0.476 0.429 0.439 0.512 0.493 0.456 0.452 \cellcolor[HTML]F1E5FF 336 0.536 0.507 \cellcolor[HTML]CDDEF91.186 \cellcolor[HTML]DEE9FB1.122 0.452 0.452 0.485 0.479 0.626 0.559 0.594 0.541 0.496 0.487 0.552 0.551 0.482 0.486 \cellcolor[HTML]F1E5FF ETTh2 720 0.708 0.543 \cellcolor[HTML]6D9EEB1.532 \cellcolor[HTML]D4E2F91.160 0.462 0.468 0.500 0.497 0.863 0.672 0.831 0.657 0.463 0.474 0.562 0.560 0.515 0.511 \cellcolor[HTML]F1E5FF 96 0.664 0.410 \cellcolor[HTML]DFEAFB1.120 \cellcolor[HTML]B3CDF51.277 0.593 0.321 0.607 0.392 0.615 0.391 0.650 0.396 0.587 0.366 0.612 0.338 0.613 0.388 \cellcolor[HTML]F1E5FF 192 0.606 0.380 \cellcolor[HTML]F4CCCC0.982 \cellcolor[HTML]DCE8FB1.131 0.617 0.336 0.621 0.399 0.601 0.382 0.598 0.370 0.604 0.373 0.613 0.340 0.616 0.382 \cellcolor[HTML]F1E5FF 336 0.608 0.378 \cellcolor[HTML]F4CCCC0.967 \cellcolor[HTML]DDE9FB1.125 0.629 0.336 0.622 0.396 0.613 0.386 0.605 0.373 0.621 0.383 0.618 0.328 0.622 0.337 \cellcolor[HTML]F1E5FF Traffic 720 0.647 0.398 \cellcolor[HTML]FDFEFF1.011 \cellcolor[HTML]DAE7FA1.137 0.640 0.350 0.632 0.396 0.658 0.407 0.645 0.394 0.626 0.382 0.653 0.355 0.660 0.408 \cellcolor[HTML]F1E5FF 96 0.194 0.251 \cellcolor[HTML]DCE8FB1.128 \cellcolor[HTML]D9E6FA1.141 0.172 0.220 0.197 0.281 0.182 0.242 0.196 0.255 0.217 0.296 0.173 0.223 0.266 0.336 \cellcolor[HTML]F1E5FF 192 0.235 0.291 \cellcolor[HTML]EBF2FD1.073 \cellcolor[HTML]E0EBFB1.115 0.219 0.261 0.237 0.312 0.227 0.287 0.237 0.296 0.276 0.336 0.245 0.285 0.307 0.367 \cellcolor[HTML]F1E5FF 336 0.281 0.327 \cellcolor[HTML]FFFFFF1.004 \cellcolor[HTML]EDF3FD1.069 0.280 0.306 0.298 0.353 0.282 0.334 0.283 0.335 0.339 0.380 0.321 0.338 0.359 0.395 \cellcolor[HTML]F1E5FF Weather 720 0.344 0.368 \cellcolor[HTML]F4CCCC0.942 \cellcolor[HTML]F9FBFF1.025 0.365 0.359 0.352 0.288 0.352 0.386 0.345 0.381 0.403 0.428 0.414 0.410 0.419 0.428

(**) The red bold text indicates the best amongst all methods, and the blue underlined text indicates the second best method. The datasets within the pretrain collection are highlighted in yellow, while the others are highlighted in purple. In the Ratio column, the numbers highlighted in red indicate the settings where the finetuned model has lower (better) metrics than TimesNet. The blue color scale indicates the settings where the finetuned test metrics are worse than TimesNet.

4.4 Finetune and Results

For the finetune step, we estimate the probabilities using a batch of pretrain samples (512 samples) and update the model weights using the finetune loss on that batch. In this stage, the batch of pretrain samples has equal proportions from each dataset. On each finetune dataset, we finetune the model for 10 epochs using 50% of the training set used by the supervised approach (split into training, validation and test sets in chronological order by the ratio of 7:1:2). In Table 3, we report the test performance of the model which yields the best validation result, along with the corresponding ratio to TimesNet. Note that the test batch size is 32, consistent with prior practice (Wu et al., 2021).

Results. Table 3 shows that the finetuned model generalizes better than the supervised methods on the two datasets ETTh1 and Exchange-Rate. The Exchange-Rate data shows an impressive improvement, yielding better results than the supervised methods in all settings. This is consistent with our probabilistic result that the model predicts the Exchange-Rate and ETTh1 data best amongst the datasets. On the ETTm1 and Electricity datasets, the model is comparable with or only slightly worse than TimesNet, where the average differences are $0.9\%$ and $1.1\%$, respectively.

When finetuned on new datasets, the model shows competitive performance for the Traffic and Weather datasets (with $6.2\%$ and $9.4\%$ worse average differences). The ETTh2 and ETTm2 datasets follow, with average differences of $12.6\%$ and $13.0\%$. Given the fact that the model is trained on a different collection of datasets, this experiment shows promising results for our pretrain-finetune approach.

5 Further Analysis

5.1 Variations Analysis

In this section, we analyze some variations of our model. In the first variation, instead of applying the contrastive loss directly on the representation $z$, we use another decoder to transform it into a representation $y$, then apply the contrastive learning on $y$. We also use a linear layer for the decoder transforming $z$ to $y$. For the second variation, we do not use a decoder for the prediction output (i.e., using an identity layer in Figure 2(a) instead of the decoder). In the final variation, we replace the linear encoder and decoder by two-layer neural networks. We report the results in Table 4. This analysis shows that the linear encoder-decoder is essential for the good generalization performance of our method. The first variation with two decoders is also able to capture the dynamics; its performance differs by $6.2\%$ on average from applying contrastive learning directly on the representation $z$. The second variation shows that a decoder for the prediction output is needed. The full results are in the Appendix.

Table 4: Comparisons of the test performance from our finetuned model and the variations. We report the average MSE and MAE over four prediction lengths (96, 192, 336, 720) as in Table 3.
            Finetuned        Variation 1      Variation 2      Variation 3
Data        MSE    MAE       MSE    MAE       MSE    MAE       MSE    MAE
ETTm1       0.405  0.408     0.410  0.424     0.473  0.461     0.480  0.465
ETTh1       0.446  0.440     0.482  0.484     0.476  0.528     0.518  0.525
Exchange    0.269  0.375     0.269  0.396     0.281  0.414     0.282  0.423
Electricity 0.210  0.272     0.227  0.298     0.221  0.338     0.243  0.320
ETTm2       0.330  0.383     0.356  0.418     0.367  0.418     0.395  0.451
ETTh2       0.495  0.464     0.514  0.498     0.532  0.549     0.522  0.542
Traffic     0.631  0.392     0.643  0.412     0.693  0.431     0.773  0.428
Weather     0.264  0.309     0.282  0.344     0.271  0.368     0.310  0.354

5.2 Parameters Analysis

In this section, we analyze how the performance of the model changes when the model parameter $\lambda$ changes. A similar analysis for $\tau$ is deferred to the Appendix. Note that $\lambda$ and $\tau$ are the regularization factor and the temperature parameter of the model, respectively, which control the supervised contrastive loss term.

The choice of the regularization parameter $\lambda$ varies in $[0.01, 0.05, 0.1, 0.5, 1]$. We perform the test for eight datasets using the pretrained models and report the average errors (MAE and MSE) over the four prediction lengths in Figure 4. We observe that in most datasets, the choice $\lambda=0.1$ performs the best.

Figure 4: The average test performance of the pretrained model by parameter $\lambda$, for eight datasets. We plot both the $x$-axis and the $y$-axis on a logarithmic scale to highlight the differences in accuracy.

6 Conclusion

In conclusion, our approach aims to address the discrepancy in pretrain-finetune time series and enhance the knowledge within foundation model training. We employ a pretraining procedure incorporating contrastive learning to distinguish features from different pretraining datasets. This supports the development of a probabilistic similarity metric, allowing us to assess the likelihood of a univariate sample's similarity to one of the pretraining datasets. We introduce a fine-tuning procedure designed to leverage the estimated probability. Our experiments demonstrate that our approach yields favorable results, with accuracy levels comparable to, or in some cases outperforming, supervised models.

Future work in this direction offers promising problems. Addressing inaccurate probability estimation is one interesting question, which may require further study of the dynamics of the pretraining datasets. There could be many potential reasons for this phenomenon: two datasets may have similar dynamics that are difficult to distinguish, or the collective training with other datasets may make the model converge to some sub-optimal solutions. Another potential problem involves the discrepancy at a lower level: within each dataset. While we consider the simplified setting in which features from the same dataset should be closer than features from different datasets, it would still be beneficial to take into account the potential dynamic variations within each dataset, and further apply that knowledge to improve the models.

Appendix A Experiment Details

A.1 Pretrain Sample Collection

We consider the time series forecasting problem where the model has the information of the previous $I$ time steps and aims to predict the next $O$ future time steps. Thus every univariate sample has $I+O$ time steps. Note that we follow the standard experiment procedure as in (Wu et al., 2021), using the following real-world datasets: ETT, Traffic, Electricity, Weather, Exchange-Rate and ILI. The time series data is split into training, validation and test sets in chronological order by the ratio of 7:1:2 for all datasets. To ensure a fair comparison, our pretraining collection only contains samples from the training proportions of the original datasets.

We choose a collection of pretraining datasets including ETTh1, ETTm1, Electricity and Exchange-Rate. The remaining datasets (ETTh2, ETTm2, Weather, Traffic) are reserved for fine-tuning and/or further testing the generalization ability of our model. Note that in order to handle the discrepancy within the sizes of the datasets, we design a pretraining collection such that it has (approximately) the same number of data samples from each pretrain dataset.

The total number of (univariate) samples within each dataset scales proportionally with its number of features $d$ and total time steps $T$. ETTh1 and ETTm1 represent similar data; however, their granularities are different: ETTh1 is recorded hourly while ETTm1 is collected at 15-minute intervals. We choose ETTm1 to be the base dataset, and we sample the other datasets so that they have approximately the same number of samples as ETTm1. Since the total number of time steps of ETTh1 is 4 times smaller than that of ETTm1, we repeat the ETTh1 data 4 times. Similarly, we repeat the Exchange-Rate data 6 times since it has 8 features with a total time length of 7,588, compared to the total length of 69,680 and 7 features of ETTm1. Finally, since Electricity is too large, we choose a stride of 2 to reduce the total length by half, and we only choose a subset of features (27 features) because the Electricity data has many more features (321) than the other datasets. These 27 features have equally spaced indices among the original 321 features. Table 5 describes our sampling procedure along with the descriptions of all the datasets in our experiments.
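As an illustration of this balancing procedure, the sketch below builds such a collection from sliding windows, applying the per-dataset repetition, stride, and feature-subset choices of Table 5; the helper names and the exact interpretation of the stride (here, the step between consecutive windows) are our assumptions rather than the paper's implementation.

import numpy as np

def sliding_windows(series, window, stride=1):
    """All length-`window` (= I + O) univariate windows from a 1-D array."""
    starts = range(0, len(series) - window + 1, stride)
    return np.stack([series[s:s + window] for s in starts])

def build_pretrain_collection(datasets, window):
    """Roughly balance the number of univariate samples per pretrain dataset.
    `datasets` is a list of (array of shape (T, d), repetition, stride, feature indices)."""
    samples, labels = [], []
    for label, (data, repeat, stride, feat_idx) in enumerate(datasets):
        data = data[:, feat_idx]                  # e.g. 27 of the 321 Electricity features
        for _ in range(repeat):                   # repeat short datasets (ETTh1 x4, Exchange-Rate x6)
            for d in range(data.shape[1]):        # channel independence: one series per feature
                w = sliding_windows(data[:, d], window, stride)
                samples.append(w)
                labels.append(np.full(len(w), label))
    return np.concatenate(samples), np.concatenate(labels)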

Table 5: Description of the datasets within the pretrain collection and the other datasets used in our experiment.

Datasets        d     T        Granularity   Used in pretrain data   Stride   Repetition   Number of features in pretrain
ETTh1           7     17,420   1 hour        Yes                     1        4            7
ETTm1           7     69,680   15 min        Yes                     1        1            7
Electricity     321   26,304   1 hour        Yes                     2        1            27
Exchange-Rate   8     7,588    1 day         Yes                     1        6            8
ETTh2           7     17,420   1 hour        No
ETTm2           7     69,680   15 min        No
Traffic         862   17,544   1 hour        No
Weather         21    52,696   10 min        No
ILI             7     966      1 week        No

A.2 Other experiment details

In our experiment, we choose a simple linear layer (with bias) for the encoder and the decoder of the model. We choose this simple architecture to better analyze the pretrain-finetune process and also because linear models have shown impressive performance in time series forecasting (Zeng et al., 2022). We test different dimensions for the representation of our model; for example, we perform a grid search over $\{48, 96, 192, 384\}$ for outputs 96 and 192, while we search over $\{180, 360, 720, 1440\}$ for the larger output of 720. We report the results where the representation space is half the output space, as it performs well in our experiments. Table 6 describes this choice and reports our model size.

We pretrain our model using PyTorch (Paszke et al., 2019) with the ADAM optimizer (Kingma and Ba, 2014) for 10 training epochs. We note that the size of our pretraining sample collection is approximately four times the size of the ETTm1 (univariate) dataset. We choose the regularization parameter $\lambda=0.1$ and the temperature parameter $\tau=0.1$, and the pretrain batch size is 512. The experiments are repeated 3 times.

Table 6: Description of the model used in our pretrain and finetune experiments
Input Representation Dimension Output Model size
96 48 96 9360
96 96 192 27936
96 168 336 73080
96 360 720 294840
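As a sanity check on these sizes, with a linear encoder and a linear decoder that both include bias terms (Section A.2), an input length $I$, representation dimension $r$ and output length $O$ give

$$\text{model size} = (I\,r + r) + (r\,O + O).$$

For example, the first row gives $(96\cdot 48 + 48) + (48\cdot 96 + 96) = 4656 + 4704 = 9360$ parameters, matching Table 6.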

Evaluation metrics. We report the MAE and MSE on the test data, where lower metrics indicate better results. We note that our model applies the same function to every channel of the test data (i.e., channel independence) and returns a matrix output which contains $O$ time steps and $D$ features (the number of features of the corresponding test data). Let $P\in\mathbb{R}^{D\times O}$ be the predicted values of our model and $V\in\mathbb{R}^{D\times O}$ be the ground truth values. The metrics are defined as follows:

$$\text{MAE}(P,V) = \frac{1}{DO}\sum_{d=1}^{D}\sum_{t=1}^{O}\left|P_{t}^{d}-V_{t}^{d}\right|, \qquad \text{MSE}(P,V) = \frac{1}{DO}\sum_{d=1}^{D}\sum_{t=1}^{O}\left(P_{t}^{d}-V_{t}^{d}\right)^{2}.$$

In our test, we use the same metrics as the prior reference (Wu et al., 2021) with batch size 32. Note that we only implemented our own method; the test results of the other methods are taken from the TimesNet reference (Wu et al., 2023).

Appendix B Experiment Results

Table 7: Comparisons of the test performance from our finetuned model and the variations. We report the average MSE and MAE over four prediction lengths: 96, 192, 336, 720.
Models Standard Variation 1 Variation 2 Variation 3
Data Metric MSE MAE MSE MAE MSE MAE MSE MAE
\cellcolor[HTML]FFF7E7 96 0.347 0.373 0.354 0.393 0.414 0.400 0.423 0.433
\cellcolor[HTML]FFF7E7 192 0.384 0.393 0.389 0.409 0.446 0.447 0.458 0.449
\cellcolor[HTML]FFF7E7 336 0.414 0.415 0.418 0.429 0.480 0.473 0.488 0.471
\cellcolor[HTML]FFF7E7 720 0.473 0.451 0.479 0.464 0.554 0.523 0.550 0.509
\cellcolor[HTML]FFF7E7ETTm1 Avg. 0.405 0.408 0.410 0.424 0.473 0.461 0.480 0.465
\cellcolor[HTML]FFF7E7 96 0.385 0.398 0.428 0.448 0.412 0.484 0.456 0.481
\cellcolor[HTML]FFF7E7 192 0.432 0.425 0.473 0.474 0.461 0.512 0.504 0.510
\cellcolor[HTML]FFF7E7 336 0.473 0.448 0.512 0.494 0.503 0.537 0.546 0.533
\cellcolor[HTML]FFF7E7 720 0.492 0.490 0.515 0.522 0.527 0.580 0.566 0.576
\cellcolor[HTML]FFF7E7ETTh1 Avg. 0.446 0.440 0.482 0.484 0.476 0.528 0.518 0.525
\cellcolor[HTML]FFF7E7 96 0.081 0.204 0.080 0.220 0.082 0.236 0.082 0.227
\cellcolor[HTML]FFF7E7 192 0.164 0.300 0.165 0.318 0.168 0.333 0.169 0.324
\cellcolor[HTML]FFF7E7 336 0.295 0.407 0.296 0.434 0.314 0.440 0.315 0.438
\cellcolor[HTML]FFF7E7 720 0.535 0.587 0.535 0.614 0.561 0.646 0.560 0.703
\cellcolor[HTML]FFF7E7Exchange Avg. 0.269 0.375 0.269 0.396 0.281 0.414 0.282 0.423
\cellcolor[HTML]FFF7E7 96 0.197 0.282 0.210 0.294 0.204 0.330 0.227 0.309
\cellcolor[HTML]FFF7E7 192 0.195 0.282 0.215 0.299 0.206 0.331 0.229 0.313
\cellcolor[HTML]FFF7E7 336 0.207 0.296 0.228 0.313 0.220 0.338 0.242 0.328
\cellcolor[HTML]FFF7E7 720 0.242 0.229 0.256 0.288 0.255 0.352 0.274 0.331
\cellcolor[HTML]FFF7E7Electricity Avg. 0.210 0.272 0.227 0.298 0.221 0.338 0.243 0.320
\cellcolor[HTML]F1E5FF 96 0.196 0.294 0.237 0.342 0.200 0.310 0.257 0.359
\cellcolor[HTML]F1E5FF 192 0.266 0.342 0.304 0.386 0.286 0.371 0.336 0.416
\cellcolor[HTML]F1E5FF 336 0.365 0.412 0.384 0.440 0.401 0.448 0.419 0.475
\cellcolor[HTML]F1E5FF 720 0.492 0.485 0.498 0.504 0.581 0.545 0.568 0.555
\cellcolor[HTML]F1E5FFETTm2 Avg. 0.330 0.383 0.356 0.418 0.367 0.418 0.395 0.451
\cellcolor[HTML]F1E5FF 96 0.316 0.372 0.350 0.417 0.340 0.450 0.333 0.440
\cellcolor[HTML]F1E5FF 192 0.419 0.433 0.442 0.471 0.453 0.514 0.453 0.512
\cellcolor[HTML]F1E5FF 336 0.536 0.507 0.537 0.528 0.565 0.578 0.554 0.573
\cellcolor[HTML]F1E5FF 720 0.708 0.543 0.725 0.575 0.769 0.653 0.748 0.644
\cellcolor[HTML]F1E5FFETTh2 Avg. 0.495 0.464 0.514 0.498 0.532 0.549 0.522 0.542
\cellcolor[HTML]F1E5FF 96 0.664 0.410 0.665 0.419 0.706 0.440 0.791 0.421
\cellcolor[HTML]F1E5FF 192 0.606 0.380 0.624 0.408 0.670 0.421 0.754 0.423
\cellcolor[HTML]F1E5FF 336 0.608 0.378 0.631 0.405 0.677 0.422 0.758 0.429
\cellcolor[HTML]F1E5FF 720 0.647 0.398 0.651 0.415 0.719 0.442 0.790 0.439
\cellcolor[HTML]F1E5FFTraffic Avg. 0.631 0.392 0.643 0.412 0.693 0.431 0.773 0.428
\cellcolor[HTML]F1E5FF 96 0.194 0.251 0.210 0.304 0.197 0.301 0.239 0.302
\cellcolor[HTML]F1E5FF 192 0.235 0.291 0.251 0.333 0.240 0.346 0.280 0.340
\cellcolor[HTML]F1E5FF 336 0.281 0.327 0.304 0.368 0.290 0.378 0.330 0.379
\cellcolor[HTML]F1E5FF 720 0.344 0.368 0.363 0.370 0.357 0.446 0.390 0.396
\cellcolor[HTML]F1E5FFWeather Avg. 0.264 0.309 0.282 0.344 0.271 0.368 0.310 0.354

B.1 Variations Analysis

In this section, we analyze some variations of our model. In the first variation, instead of applying the contrastive loss directly on the representation $z$, we use another decoder to transform it into a representation $y$, then apply the contrastive learning on $y$. We also use a linear layer for the decoder transforming $z$ to $y$. We choose the same dimension for $z$ as in the original approach (as described in Table 6). For the dimension of $y$, we perform a grid search over three choices (half the dimension of $z$, the same dimension as $z$, and double the dimension of $z$) and report the best results among these choices.

For the second variation, we do not use a decoder for the prediction output (i.e., using an identity layer in Figure 2(a) instead of the decoder). In the final variation, we replace the linear encoder and decoder by two-layer neural networks. We keep the same representation dimension for $z$ as described in Table 6. We also perform a grid search on the number of hidden layers to find the best (training) model, then report the test result of that model.

We report the full results in Table 7. This analysis shows that the linear encoder-decoder is essential for the good generalization performance of our method. The first variation with two decoders is also able to capture the dynamics; its performance differs by $6.2\%$ on average from applying contrastive learning directly on the representation $z$. The second variation shows that a decoder for the prediction output is needed.

B.2 Parameters Analysis

In this section, we analyze how the performance of the model changes when the model parameters $\lambda$ and $\tau$ change. Note that $\lambda$ and $\tau$ are the regularization factor and the temperature parameter of the model, respectively, which control the supervised contrastive loss term.

B.2.1 Parameter $\lambda$

Here we set the temperature parameter of the model $\tau$ to be 0.1. The choice of the regularization parameter $\lambda$ varies in $[0.01, 0.05, 0.1, 0.5, 1]$. We perform the test for eight datasets using the pretrained models and report the average errors (MAE and MSE) over the four prediction lengths in Table 8 and Figure 5. We observe that in most datasets, the choice $\lambda=0.1$ performs the best.

Table 8: The average test performance of the pretrained model by parameter $\lambda$, for eight datasets. We report the average MSE and MAE over four prediction lengths: 96, 192, 336, 720.
MSE
$\lambda$ ETTm1 ETTh1 Exchange Electricity ETTm2 ETTh2 Traffic Weather
0.01 0.608 0.628 0.445 0.320 0.482 0.681 0.920 0.363
0.05 0.523 0.526 0.336 0.288 0.409 0.511 0.867 0.301
0.1 0.491 0.475 0.280 0.270 0.396 0.498 0.817 0.291
0.5 0.536 0.531 0.387 0.290 0.430 0.587 0.850 0.319
1 0.676 0.605 0.539 0.305 0.513 0.663 0.935 0.368
MAE
$\lambda$ ETTm1 ETTh1 Exchange Electricity ETTm2 ETTh2 Traffic Weather
0.01 0.589 0.576 0.495 0.420 0.483 0.544 0.585 0.386
0.05 0.479 0.490 0.401 0.385 0.426 0.488 0.514 0.340
0.1 0.470 0.463 0.393 0.358 0.420 0.478 0.494 0.335
0.5 0.516 0.482 0.434 0.389 0.448 0.544 0.537 0.343
1 0.581 0.548 0.519 0.415 0.513 0.572 0.550 0.408
Figure 5: The average test performance of the pretrained model by parameter λ, for eight datasets. We plot both the x-axis and y-axis in logarithmic scale to highlight the differences in accuracy.

B.2.2 Parameter τ

We set the regularization factor λ to 0.1. The temperature parameter τ varies over [0.05, 0.1, 0.5, 1, 5]. We perform the test for eight datasets using the pretrained models and report the average errors (MAE and MSE) over the four prediction lengths in Table 9 and Figure 6. Since the choice τ = 0.1 performs reasonably well in most of the datasets, we choose τ = 0.1 in our experiment.

Table 9: The average test performance of the pretrained model by parameter τ, for eight datasets. We report the average MSE and MAE over four prediction lengths: 96, 192, 336, and 720.
MSE
τ ETTm1 ETTh1 Exchange Electricity ETTm2 ETTh2 Traffic Weather
0.05 0.537 0.481 0.285 0.279 0.416 0.558 0.819 0.294
0.1 0.491 0.475 0.280 0.270 0.396 0.498 0.817 0.291
0.5 0.529 0.478 0.288 0.285 0.397 0.490 0.833 0.291
1 0.523 0.479 0.304 0.289 0.410 0.506 0.833 0.292
5 0.563 0.481 0.313 0.306 0.411 0.498 0.849 0.292
MAE
τ ETTm1 ETTh1 Exchange Electricity ETTm2 ETTh2 Traffic Weather
0.05 0.491 0.478 0.398 0.374 0.428 0.479 0.497 0.350
0.1 0.470 0.463 0.393 0.358 0.420 0.478 0.494 0.335
0.5 0.468 0.463 0.400 0.379 0.412 0.473 0.503 0.337
1 0.468 0.465 0.419 0.383 0.432 0.487 0.504 0.336
5 0.469 0.465 0.426 0.402 0.430 0.485 0.512 0.337
Figure 6: The average test performance of the pretrained model by parameter τ, for eight datasets. We plot both the x-axis and y-axis in logarithmic scale to highlight the differences in accuracy.
