
A Supervised Contrastive Learning Pretrain-Finetune Approach for Time Series

Trang H. Tran1,2   Lam M. Nguyen2   Kyongmin Yeo2   Nam Nguyen2   Roman Vaculin2
1 School of Operations Research and Information Engineering, Cornell University, Ithaca, NY, USA
2 IBM Research, Thomas J. Watson Research Center, Yorktown Heights, NY, USA
[email protected], [email protected]
Work done while an intern at IBM Research
Abstract

Foundation models have recently gained attention within the field of machine learning thanks to their efficiency in broad data processing. While researchers have attempted to extend this success to time series models, the main challenge is effectively extracting representations and transferring knowledge from pretraining datasets to the target finetuning dataset. To tackle this issue, we introduce a novel pretraining procedure that leverages supervised contrastive learning to distinguish features within each pretraining dataset. This pretraining phase enables a probabilistic similarity metric, which assesses the likelihood of a univariate sample being closely related to one of the pretraining datasets. Subsequently, using this similarity metric as a guide, we propose a fine-tuning procedure designed to enhance the accurate prediction of the target data by aligning it more closely with the learned dynamics of the pretraining datasets. Our experiments show promising results which demonstrate the efficacy of our approach.

Correspondence to: Lam M. Nguyen.

1 Introduction

1.1 Motivations

Foundation Models For Time Series.

Lately, foundation models have gained significant prominence in the field of artificial intelligence and machine learning (Bommasani et al., 2021); notable examples include BERT (Devlin et al., 2019) and GPT-3 (Brown et al., 2020). These models are characterized by their training on extensive datasets, typically employing self-supervised methods at a large scale, and they possess the capability to be adapted for a wide array of downstream tasks (Bommasani et al., 2021). Therefore, there have been efforts to extend this success to other applications, including time series (Zhou et al., 2023; Rasul et al., 2023; Xue et al., 2023).

Challenges in General Representation Learning.

One of the main challenges in training foundation models for time series is to address the discrepancy between pretraining and finetuning data (Zhang et al., 2022b; Yeh et al., 2023). As demonstrated in Figure 1, this discrepancy arises at various levels. In Figure 1(a), although the last feature shows a slight deviation, most of the features bear some resemblance within the same dataset. However, across different datasets, the dynamics are vastly different, as shown in Figure 1(b). As a result, a foundation model should possess the capability to adapt to a heterogeneous collection of datasets. Therefore, it is desirable to find a general representation that captures the diverse knowledge in the pretraining task (Zhang et al., 2022b; Zerveas et al., 2021).

[Figure 1 panels: (a) ETTh1 dataset; (b) 5 different datasets]
Figure 1: Plots of different features from ETTh1 and from five datasets. A description of the datasets is in Section 4.1. The five datasets featured are Electricity, Exchange-Rate, Traffic, Weather and ETTm1.

The next challenge concerns the design of the model to transfer the knowledge to the finetuning task (Fawaz et al., 2018). It is reasonable to assume that the dynamics of the finetune dataset should be close, in some sense, to the dynamics of the collection of pretraining datasets. In this work, we adopt the assumption that the representations of the finetune dataset lie within the span of the representations of the different pretraining datasets.

Our approach is to differentiate the features that originate from different datasets by utilizing the learned representations of these features, thereby partially addressing the high-level discrepancy in time series dynamics and enhancing the knowledge within the foundation model. We summarize our contributions below.

Contributions.
  • We use a pretraining procedure with contrastive learning to differentiate the features in each pretraining dataset. This pretraining enables a probabilistic similarity metric that measures whether a univariate sample is likely to share similar dynamics with one of the pretraining datasets.

  • Using the similarity metric, we propose a finetuning procedure to aid the correct prediction of the target data, by making the finetune representation closer to the learned representations of the corresponding pretrain datasets.

  • Our experiments show that the pretrained models have promising performance compared to supervised training approaches. The finetuned model shows better generalization than prior approaches on some of the datasets, while having competitive results in the other settings.

1.2 Related Work

Time Series Forecasting

There are two primary approaches to multi-step-ahead time series prediction. Early approaches focus on the joint probability distributions of future system states by iteratively computing their evolution over time, typically using techniques like recurrent neural networks (RNNs), as demonstrated in (Levin, 1990; Yeo et al., 2022). The second approach revolves around training a time series model capable of directly predicting future time steps based on historical data input. This includes multilayer perceptron (MLP)-based methods (Gardner and Dorling, 1998; Zhang et al., 2022a) along with convolutional neural networks (O’Shea and Nash, 2015; Gu et al., 2018). With the rise of attention-based models and their success in natural language processing (Vaswani et al., 2017), attention-based time series models have gained popularity since they can discover the temporal dependencies among time points. Nevertheless, these models face a challenge due to their quadratic time and memory complexity when learning long-range temporal correlations. To address this, LogTrans (Li et al., 2019) and Pyraformer (Liu et al., 2021) propose strategies to introduce sparsity bias and reduce computational complexity. Informer (Zhou et al., 2021) and FEDformer (Zhou et al., 2022) leverage the low-rank properties of the self-attention matrix to enhance performance.

In contrast, Autoformer (Wu et al., 2021) introduces a novel architecture with an auto-correlation mechanism as an alternative to traditional attention-based models. Conversely, the work presented in (Zeng et al., 2022) takes a different approach, using a simple set of linear models and suggesting that these simplicity-driven models may outperform more complex structures. On the other hand, (Wu et al., 2023) learns temporal patterns by exploring the multi-periodicity of time series and capturing the temporal 2D-variations in 2D space.

Contrastive Learning.

In recent years, there have been significant advancements in self-supervised representation learning (Ericsson et al., 2022; Jaiswal et al., 2020; Misra and Maaten, 2020), with applications to time series (Kiyasseh et al., 2021; Tonekaboni et al., 2021; Franceschi et al., 2019; Eldele et al., 2021; Yue et al., 2022; Tang et al., 2020; Yang and Qiao, 2022; Zhang et al., 2022c; Nguyen et al., 2023). The common concept within these works is the idea of bringing an anchor and a positive sample closer in the embedding space, while separating the anchor from numerous negative samples. The work by (Khosla et al., 2020) extends the self-supervised contrastive approach to the fully-supervised setting, enabling effective utilization of label information. The contribution of (Khosla et al., 2020) is considering multiple positive pairs per anchor, in addition to the numerous negative pairs, in contrast to self-supervised contrastive learning with a single positive pair. In this work, we utilize the supervised contrastive learning framework of (Khosla et al., 2020) for the pretrain-finetune process. While there has been a self-supervised contrastive pretraining framework for time series (Zhang et al., 2022b), our approach is different since we utilize the labels for training.

2 Problem Description

(a) The encoder-decoder model and our pretrain loss. The pretrain loss has two components: a prediction loss on the model output, and a contrastive loss enforced on the representation $z$ of the model.
(b) Each dataset is assigned a label, and the contrastive loss maximizes the similarity (minimizes the difference) of representations from the same label group while minimizing the similarity across different label groups.
Figure 2: Description of our pretrain process with supervised contrastive learning

We are given a collection of multivariate training datasets $\{X^k\}_{\text{pretrain}}$, $k = 1, \dots, P$, each of size $T^k \times d^k$, where $T^k$ is the time dimension and $d^k$ is the number of features. The number of pretrain datasets is $P$. Our goal is to train a foundation model $M$ on the collection $\{X^k\}_{\text{pretrain}}$ and then finetune it to adapt to a new dataset $X_{\text{finetune}}$ of size $T^f \times d^f$, where $T^f$ and $d^f$ are the time dimension and the number of features of the finetune dataset, respectively.

We consider the time series forecasting problem where the model has the information of the previous $I$ time steps and aims to predict the next $O$ future time steps. Since the different datasets have different numbers of features, the design of the foundation model must have the capability to process such data. A typical approach to this problem is channel independence, which learns a common model for every univariate time series (Nie et al., 2023; Han et al., 2023; Li et al., 2023; Xue et al., 2023). It often consists of an encoder that transforms the input data $x_{t:t+I}$ into a representation (context vector) and a decoder that generates the output sequence $x_{t+I:t+I+O}$ based on this context (Zhang et al., 2023; Rasul et al., 2023). The encoder and the decoder can have various structures ranging from simple fully connected layers to complex designs, e.g., attention-based models (Rasul et al., 2023; Das et al., 2023). We describe this model structure in Figure 2(a).
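To make the channel-independent structure concrete, the following is a minimal PyTorch sketch of such an encoder-decoder forecaster. The class and argument names are our own illustration, not the exact implementation; the specific layer types and representation dimensions used in this paper are described in Section 4.2 and Appendix A.2.

import torch.nn as nn

class ChannelIndependentForecaster(nn.Module):
    """A univariate encoder-decoder applied to every channel separately."""

    def __init__(self, input_len: int, rep_dim: int, output_len: int):
        super().__init__()
        self.encoder = nn.Linear(input_len, rep_dim)    # x_{t:t+I} -> representation z
        self.decoder = nn.Linear(rep_dim, output_len)   # z -> forecast of x_{t+I:t+I+O}

    def forward(self, x):
        # x: (batch, input_len); each row is one univariate window (one channel of one series)
        z = self.encoder(x)
        return self.decoder(z), z   # z is returned so that losses can also act on the representation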

From the multivariate training datasets $\{X^k\}_{\text{pretrain}}$, we collect the univariate time series, which are further transformed into data samples using sliding windows. We note that the number of univariate data samples in each pretrain dataset is different. We build a pretrain sample collection which has an equal number of data samples from each of the pretrain datasets $\{X^k\}_{\text{pretrain}}$.

3 Pretrain-Finetune Approach with Supervised Contrastive Learning

3.1 Pretrain Process

In this section, we describe our pretraining process. Our framework uses an encoder-decoder model which takes a univariate time series as input. The pretrain loss function consists of two components: the first is the mean squared error between the predicted values and the ground truth; the second is a contrastive loss computed on the representation $z$ of the model. Accordingly:

$$\text{Loss}_{\text{pretrain}}\left(x_{t:t+I}\right) = \left\|\hat{x}_{t+I:t+I+O}-x_{t+I:t+I+O}\right\|^{2}+\lambda\operatorname{SupCon}\left(x_{t:t+I}\right), \quad (1)$$

where $\lambda$ is a regularizer and the (modified) supervised contrastive loss $\operatorname{SupCon}(x_{t:t+I})$ is

$$\frac{-1}{|P(z)|}\sum_{p\in P(z)}\log\frac{\exp\left(z\cdot z_{p}/\tau\right)}{\sum_{n\in N(z)}\exp\left(z\cdot z_{n}/\tau\right)+\epsilon},$$

where $z$ is the representation of the input data $x_{t:t+I}$, $\tau>0$ is a scalar temperature parameter, and $P(z)$ and $N(z)$ are the sets of positive and negative representations for $z$ within a batch of time series, respectively. The representations are negative if they come from different datasets, and positive if they are from the same pretraining dataset. The operator $\cdot$ is a similarity metric, e.g., the inner (dot) product or cosine similarity.

We apply this loss function to each batch of data samples. By minimizing this contrastive loss, the model maximizes the similarity between $z$ and the set of positive representations and minimizes the similarity with the set of negative representations, within the batch. Compared to (Khosla et al., 2020), we add a small factor $\epsilon$ in the denominator since, theoretically, a batch could contain no negative representations; with $\epsilon$, the loss remains well-defined.
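For concreteness, a minimal PyTorch sketch of this SupCon term is given below, assuming cosine similarity for the operator $\cdot$ and a batch whose samples are labeled by their pretrain dataset; the function and variable names are ours and this is not the exact implementation used in the paper. The full pretrain loss in (1) is then the forecasting MSE plus $\lambda$ times this term.

import torch
import torch.nn.functional as F

def sup_con_loss(z, dataset_labels, tau=0.1, eps=1e-8):
    """Modified supervised contrastive loss: for each anchor, positives come
    from the same pretrain dataset, negatives from other datasets; eps keeps
    the loss defined when a batch contains no negatives."""
    z = F.normalize(z, dim=1)                        # cosine similarity via dot products (assumption)
    sim = torch.exp(z @ z.T / tau)                   # pairwise exp(z . z' / tau)
    same = dataset_labels.unsqueeze(0) == dataset_labels.unsqueeze(1)
    eye = torch.eye(len(z), dtype=torch.bool, device=z.device)
    pos_mask = same & ~eye                           # same dataset, excluding the anchor itself
    neg_mask = ~same                                 # different datasets
    neg_sum = (sim * neg_mask).sum(dim=1) + eps      # denominator of the SupCon term, per anchor
    log_ratio = torch.log(sim / neg_sum.unsqueeze(1))
    num_pos = pos_mask.sum(dim=1).clamp(min=1)       # avoid division by zero if no positives
    return (-(log_ratio * pos_mask).sum(dim=1) / num_pos).mean()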

3.2 Probability of Similarity to A Pretrain Dataset

After the pretraining process, the pretrained model $M_{\text{pretrained}}$ is better equipped to differentiate the heterogeneous temporal dynamics of time series data. However, the next question is how to leverage this knowledge in the finetune process, i.e., how the model recognizes the dynamics it has learned in the past. In this section, we propose to use a quantity that approximates the probability of similarity to a previously seen pretrain dataset. This helps to analyze the model better and aids the finetuning process.

Let $z$ be a representation of a finetune data sample and $\{z_l\}_{l=1,2,\dots}$ be all the representations of the pretraining samples, produced by the pretrained model $M_{\text{pretrained}}$. We note that those representations depend on the current learning model. We recall that $P$ is the number of pretraining datasets and the similarity metric is $z\cdot z_{l}$. Thus the approximate probability that the finetune data corresponding to $z$ comes from dataset $i$ is:

$$p_{i}=\frac{\sum_{l\in\operatorname{Dataset}(i)}\exp\left(z\cdot z_{l}/\tau\right)}{\sum_{j=1,\ldots,P}\sum_{l\in\operatorname{Dataset}(j)}\exp\left(z\cdot z_{l}/\tau\right)} \quad (2)$$

This estimate naturally arises from the design of the supervised contrastive loss. Since the model maximizes $\exp\left(z\cdot z_{p}/\tau\right)$ where $z_{p}$ is a positive representation and minimizes $\exp\left(z\cdot z_{n}/\tau\right)$ where $z_{n}$ is a negative representation, then for a new representation $z$, a dataset $i$ with dynamics similar to $z$ should have a higher value of $\sum_{l\in\operatorname{Dataset}(i)}\exp\left(z\cdot z_{l}/\tau\right)$. Dividing this quantity by the sum over all the datasets, we get the estimated probability in (2).
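A possible implementation of the estimate in (2) is sketched below, given stored representations of the pretrain samples and their dataset labels; the cosine-similarity normalization and all names are our assumptions rather than the authors' code.

import torch
import torch.nn.functional as F

def dataset_similarity_probs(z, pretrain_reps, pretrain_labels, num_datasets, tau=0.1):
    """Estimate, for each finetune representation in z, the probability of
    similarity to each of the num_datasets pretrain datasets (Eq. 2)."""
    z = F.normalize(z, dim=-1)                           # cosine similarity (assumption)
    reps = F.normalize(pretrain_reps, dim=-1)
    sim = torch.exp(z @ reps.T / tau)                    # (n_finetune, n_pretrain)
    probs = torch.zeros(z.shape[0], num_datasets, device=z.device)
    for i in range(num_datasets):
        probs[:, i] = sim[:, pretrain_labels == i].sum(dim=1)
    return probs / probs.sum(dim=1, keepdim=True)        # each row sums to one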

3.3 Finetune Using Similarity Metrics

An estimated probability is a good tool to analyze the finetune sample data and gain insight on whether the finetune data is likely to belong to, or behave similarly to, any of the pretraining datasets. In this section, we propose to utilize this insight. Intuitively, if there is a high chance that a finetune sample belongs to a pretrain dataset $i$, then it is best for the model to use the dynamics learned from dataset $i$. On the other hand, if there is more than one dominant dataset, e.g., $[0.4, 0.4, 0.1, 0.1]$, then it is beneficial to consider all the datasets with high chances (see Figure 3). This observation motivates our finetune process.

Figure 3: The representation of the finetune data can be closer to some sample groups (0) and (1) than to another group (2). In such cases, we give priority to groups (0) and (1) and avoid being close to (2).

The finetune loss function is similar to the pretrain loss, consisting of a prediction component and a supervised contrastive component:

$$\operatorname{Loss}_{\text{finetune}}\left(x_{t:t+I}\right) = \left\|\hat{x}_{t+I:t+I+O}-x_{t+I:t+I+O}\right\|^{2}+\lambda^{\prime}\operatorname{FTCon}\left(x_{t:t+I}\right), \quad (3)$$

where $\lambda^{\prime}$ is a regularizer and the finetune contrastive loss $\operatorname{FTCon}(x)$ is

$$\frac{-1}{|P(z)|}\sum_{p\in P(z)}\log\frac{\exp\left(z\cdot z_{p}/\tau\right)}{\sum_{n\in N(z)}\exp\left(z\cdot z_{n}/\tau\right)+\epsilon},$$

where $z$ is the (current) representation of the finetune input data $x_{t:t+I}$, and the sets $P(z)$ and $N(z)$ of positive and negative representations of pretrain samples are defined as:

$$\begin{cases} i\in P(z) & \text{if } p_{i}>1/P,\\ i\in N(z) & \text{if } p_{i}<1/P,\end{cases}$$

where $p_{i}$ is the estimated probability in (2). The choice of $1/P$ stems from the fact that there are $P$ pretrain datasets, i.e., a probability higher than $1/P$ indicates higher similarity, which is considered positive. If the model predicts $p_{i}=1/P$, then it offers no information on whether the finetune data is similar to dataset $i$ or not, and dataset $i$ is discarded (not considered in the process). We note that $p_{i}$ is not fixed throughout the finetuning process, as it changes with the weights of the model as the representations are updated.

The advantage of the finetune loss is two-fold. On the one hand, the information of $p_{i}$ helps the model to find better representations of the finetune data that are closer to those of the pretraining datasets to which it is similar. On the other hand, when the representations learned by the pretrained model are not good enough for the finetune data (and might lead to inaccurate or mismatched probabilities), the prediction loss helps to find better representations, which gradually changes their estimated probabilities and gives better context for the finetune time series dynamics.
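Putting the pieces together, the sketch below illustrates one way the finetune contrastive term could be computed; it reuses dataset_similarity_probs from the previous sketch, applies the $1/P$ thresholding rule exactly as described above, and leaves the batching and similarity choices as our assumptions. The finetune loss (3) then adds the forecasting MSE and weights this term by $\lambda^{\prime}$.

import torch
import torch.nn.functional as F

def finetune_contrastive_loss(z_finetune, pretrain_reps, pretrain_labels,
                              num_datasets, tau=0.1, eps=1e-8):
    """FTCon sketch: pretrain datasets with p_i > 1/P supply positives and
    p_i < 1/P supply negatives; p_i == 1/P is discarded (Section 3.3)."""
    probs = dataset_similarity_probs(z_finetune, pretrain_reps,
                                     pretrain_labels, num_datasets, tau)
    z = F.normalize(z_finetune, dim=-1)
    reps = F.normalize(pretrain_reps, dim=-1)
    sim = torch.exp(z @ reps.T / tau)                    # (n_finetune, n_pretrain)
    losses = []
    for i in range(z.shape[0]):
        pos_sets = (probs[i] > 1.0 / num_datasets).nonzero().flatten()
        neg_sets = (probs[i] < 1.0 / num_datasets).nonzero().flatten()
        pos = torch.isin(pretrain_labels, pos_sets)      # pretrain samples acting as positives
        neg = torch.isin(pretrain_labels, neg_sets)      # pretrain samples acting as negatives
        if pos.sum() == 0:
            continue
        denom = sim[i, neg].sum() + eps
        losses.append(-torch.log(sim[i, pos] / denom).mean())
    return torch.stack(losses).mean() if losses else z.new_zeros(())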

4 Experiments

4.1 Experiment Settings

Datasets

To maintain a fair comparison between our work and prior benchmarks, we test our model using the standard experiment procedure as in (Wu et al., 2021). Our experiments use the following real-world datasets: ETT, Traffic, Electricity, Weather, Exchange-Rate and ILI (Wu et al., 2021). The ETT dataset contains information collected from electricity transformers, with load and oil temperature data recorded from July 2016 to July 2018 (hourly for ETTh and at 15-minute intervals for ETTm). The Electricity dataset records the hourly electricity consumption of 321 customers over a three-year period. The Exchange-Rate dataset contains daily exchange rate data from eight different countries spanning from 1990 to 2016. The Traffic dataset records hourly data from the California Department of Transportation, including road occupancy rates measured by various sensors throughout the San Francisco Bay area. The Weather dataset provides 21 meteorological indicators, with data recorded at 10-minute intervals throughout the year 2020. Lastly, the ILI dataset includes weekly influenza-like illness (ILI) patient data from the CDC between 2002 and 2021.

We follow the standard experiment procedure as in (Wu et al., 2021). The time series data is split into training, validation and test sets in chronological order by the ratio of 7:1:2 for all the data sets. To ensure fair comparison, in our pretraining process we only use the training proportions of the original datasets. In our test, we use the same metrics as the prior reference (Wu et al., 2021) with batch size 32.

Table 1: Comparisons of the test performance from our pretrained model and other supervised learning models*.

Models Our PT model Ratio TimesNet ETSformer LightTS DLinear FEDformer Stationary Autoformer Data Metric MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE \cellcolor[HTML]FFF7E7 96 0.415 0.430 \cellcolor[HTML]CCDDF81.228 \cellcolor[HTML]DFEAFB1.147 0.338 0.375 0.375 0.398 0.374 0.400 0.345 0.372 0.379 0.419 0.386 0.398 0.505 0.475 \cellcolor[HTML]FFF7E7 192 0.471 0.451 \cellcolor[HTML]C5D9F81.259 \cellcolor[HTML]DAE7FA1.165 0.374 0.387 0.408 0.410 0.400 0.407 0.380 0.389 0.426 0.441 0.459 0.444 0.553 0.496 \cellcolor[HTML]FFF7E7 336 0.513 0.479 \cellcolor[HTML]C7DAF81.251 \cellcolor[HTML]DAE7FA1.165 0.410 0.411 0.435 0.428 0.438 0.438 0.413 0.413 0.445 0.459 0.495 0.464 0.621 0.537 \cellcolor[HTML]FFF7E7 ETTm1 720 0.565 0.521 \cellcolor[HTML]D7E4FA1.182 \cellcolor[HTML]DCE8FB1.158 0.478 0.450 0.499 0.462 0.527 0.502 0.474 0.453 0.543 0.490 0.585 0.516 0.671 0.561 \cellcolor[HTML]FFF7E7 96 0.413 0.417 \cellcolor[HTML]EFF4FD1.076 \cellcolor[HTML]F7FAFE1.037 0.384 0.402 0.494 0.479 0.424 0.432 0.386 0.400 0.376 0.419 0.513 0.491 0.449 0.459 \cellcolor[HTML]FFF7E7 192 0.455 0.442 \cellcolor[HTML]F6F9FE1.044 \cellcolor[HTML]F9FBFF1.030 0.436 0.429 0.538 0.504 0.475 0.462 0.437 0.432 0.420 0.448 0.534 0.504 0.500 0.482 \cellcolor[HTML]FFF7E7 336 0.496 0.467 \cellcolor[HTML]FDFEFF1.010 \cellcolor[HTML]F4CCCC0.996 0.491 0.469 0.574 0.521 0.518 0.488 0.481 0.459 0.459 0.465 0.588 0.535 0.521 0.496 \cellcolor[HTML]FFF7E7 ETTh1 720 0.537 0.525 \cellcolor[HTML]F9FBFF1.031 \cellcolor[HTML]F4F8FE1.050 0.521 0.500 0.562 0.535 0.547 0.533 0.519 0.516 0.506 0.507 0.643 0.616 0.514 0.512 \cellcolor[HTML]FFF7E7 96 0.103 0.239 \cellcolor[HTML]F4CCCC0.963 \cellcolor[HTML]FBFCFF1.021 0.107 0.234 0.085 0.204 0.116 0.262 0.088 0.218 0.148 0.278 0.111 0.237 0.197 0.323 \cellcolor[HTML]FFF7E7 192 0.184 0.325 \cellcolor[HTML]F4CCCC0.814 \cellcolor[HTML]F4CCCC0.945 0.226 0.344 0.182 0.303 0.215 0.359 0.176 0.315 0.271 0.380 0.219 0.335 0.300 0.369 \cellcolor[HTML]FFF7E7 336 0.296 0.420 \cellcolor[HTML]F4CCCC0.807 \cellcolor[HTML]F4CCCC0.938 0.367 0.448 0.348 0.428 0.377 0.466 0.313 0.427 0.460 0.500 0.421 0.476 0.509 0.524 \cellcolor[HTML]FFF7E7 Exchange 720 0.537 0.588 \cellcolor[HTML]F4CCCC0.557 \cellcolor[HTML]F4CCCC0.788 0.964 0.746 1.025 0.774 0.831 0.699 0.839 0.695 1.195 0.841 1.092 0.769 1.447 0.941 \cellcolor[HTML]FFF7E7 96 0.253 0.336 \cellcolor[HTML]8EB4F01.506 \cellcolor[HTML]CBDCF81.235 0.168 0.272 0.187 0.304 0.207 0.307 0.197 0.282 0.193 0.308 0.169 0.273 0.201 0.317 \cellcolor[HTML]FFF7E7 192 0.247 0.337 \cellcolor[HTML]B3CCF51.342 \cellcolor[HTML]DAE7FA1.166 0.184 0.289 0.199 0.315 0.213 0.316 0.196 0.285 0.201 0.315 0.182 0.286 0.222 0.334 \cellcolor[HTML]FFF7E7 336 0.268 0.360 \cellcolor[HTML]B0CBF51.354 \cellcolor[HTML]D3E2F91.200 0.198 0.300 0.212 0.329 0.230 0.333 0.209 0.301 0.214 0.329 0.200 0.304 0.231 0.338 \cellcolor[HTML]FFF7E7 Electricity 720 0.310 0.398 \cellcolor[HTML]A4C2F31.409 \cellcolor[HTML]C9DBF81.244 0.220 0.320 0.233 0.345 0.265 0.360 0.245 0.333 0.246 0.355 0.222 0.321 0.254 0.361 \cellcolor[HTML]F1E5FF 96 0.200 0.296 \cellcolor[HTML]F0F5FD1.070 \cellcolor[HTML]E7EFFC1.109 0.187 0.267 0.189 0.280 0.209 0.308 0.193 0.292 0.203 0.287 0.192 0.274 0.255 0.339 \cellcolor[HTML]F1E5FF 192 0.286 0.360 \cellcolor[HTML]DEE9FB1.149 \cellcolor[HTML]DAE7FA1.165 0.249 0.309 0.253 0.319 0.311 0.382 0.284 0.362 0.269 0.328 0.280 0.339 0.281 0.340 \cellcolor[HTML]F1E5FF 336 0.424 0.452 \cellcolor[HTML]B7D0F61.321 \cellcolor[HTML]BFD5F71.288 0.321 0.351 0.314 0.357 0.442 
0.466 0.369 0.427 0.325 0.366 0.334 0.361 0.339 0.372 \cellcolor[HTML]F1E5FF ETTm2 720 0.673 0.570 \cellcolor[HTML]6D9EEB1.650 \cellcolor[HTML]A2C2F31.414 0.408 0.403 0.414 0.413 0.675 0.587 0.554 0.522 0.421 0.415 0.417 0.413 0.433 0.432 \cellcolor[HTML]F1E5FF 96 0.317 0.370 \cellcolor[HTML]F4CCCC0.932 \cellcolor[HTML]F4CCCC0.989 0.340 0.374 0.340 0.391 0.397 0.437 0.333 0.387 0.358 0.397 0.476 0.458 0.346 0.388 \cellcolor[HTML]F1E5FF 192 0.420 0.433 \cellcolor[HTML]F5F9FE1.045 \cellcolor[HTML]F5F9FE1.046 0.402 0.414 0.430 0.439 0.520 0.504 0.477 0.476 0.429 0.439 0.512 0.493 0.456 0.452 \cellcolor[HTML]F1E5FF 336 0.536 0.507 \cellcolor[HTML]D6E4FA1.186 \cellcolor[HTML]E4EDFC1.122 0.452 0.452 0.485 0.479 0.626 0.559 0.594 0.541 0.496 0.487 0.552 0.551 0.482 0.486 \cellcolor[HTML]F1E5FF ETTh2 720 0.719 0.601 \cellcolor[HTML]82ACEE1.556 \cellcolor[HTML]C0D5F71.284 0.462 0.468 0.500 0.497 0.863 0.672 0.831 0.657 0.463 0.474 0.562 0.560 0.515 0.511 \cellcolor[HTML]F1E5FF 96 0.888 0.518 \cellcolor[HTML]90B5F01.497 \cellcolor[HTML]76A4ED1.614 0.593 0.321 0.607 0.392 0.615 0.391 0.650 0.396 0.587 0.366 0.612 0.338 0.613 0.388 \cellcolor[HTML]F1E5FF 192 0.781 0.477 \cellcolor[HTML]C4D8F71.266 \cellcolor[HTML]A1C1F31.420 0.617 0.336 0.621 0.399 0.601 0.382 0.598 0.370 0.604 0.373 0.613 0.340 0.616 0.382 \cellcolor[HTML]F1E5FF 336 0.786 0.484 \cellcolor[HTML]C7DAF81.250 \cellcolor[HTML]9CBEF21.440 0.629 0.336 0.622 0.396 0.613 0.386 0.605 0.373 0.621 0.383 0.618 0.328 0.622 0.337 \cellcolor[HTML]F1E5FF Traffic 720 0.813 0.497 \cellcolor[HTML]C3D7F71.270 \cellcolor[HTML]A1C1F31.420 0.640 0.350 0.632 0.396 0.658 0.407 0.645 0.394 0.626 0.382 0.653 0.355 0.660 0.408 \cellcolor[HTML]F1E5FF 96 0.222 0.276 \cellcolor[HTML]BED4F71.291 \cellcolor[HTML]C6D9F81.255 0.172 0.220 0.197 0.281 0.182 0.242 0.196 0.255 0.217 0.296 0.173 0.223 0.266 0.336 \cellcolor[HTML]F1E5FF 192 0.265 0.311 \cellcolor[HTML]D0E0F91.210 \cellcolor[HTML]D4E3FA1.192 0.219 0.261 0.237 0.312 0.227 0.287 0.237 0.296 0.276 0.336 0.245 0.285 0.307 0.367 \cellcolor[HTML]F1E5FF 336 0.302 0.345 \cellcolor[HTML]EEF4FD1.079 \cellcolor[HTML]E3ECFC1.127 0.280 0.306 0.298 0.353 0.282 0.334 0.283 0.335 0.339 0.380 0.321 0.338 0.359 0.395 \cellcolor[HTML]F1E5FF Weather 720 0.375 0.406 \cellcolor[HTML]F9FBFF1.027 \cellcolor[HTML]E2ECFB1.131 0.365 0.359 0.352 0.288 0.352 0.386 0.345 0.381 0.403 0.428 0.414 0.410 0.419 0.428

(*) The red bold text indicates the best amongst all methods, and the blue underlined text indicates the second best method. The datasets within the pretrain collection are highlighted in yellow, while the others are highlighted in purple. In the Ratio column, the numbers highlighted in red indicate the settings where the pretrained model has lower (better) metrics than TimesNet. The blue color scale indicates the settings where the pretrained test metrics are worse than the TimesNet test results.

4.2 Pretrain and Results

We choose a collection of pretraining datasets including ETTh1, ETTm1, Electricity and Exchange-Rate. The remaining datasets (ETTh2, ETTm2, Weather, Traffic, ILI) are reserved for finetuning and/or further testing the generalization ability of our model. Note that we do not compare the test performance for the ILI data with the previous results because its setting of input and prediction lengths is different from the other datasets. In order to handle the discrepancy in the sizes of the datasets, we design a pretraining collection such that it has an equal number of data samples from each pretrain dataset. Note that we only choose a subset of features from the Electricity data because it has many more features than the other datasets. The detailed description of this collection is in the Appendix.

In our experiment, we choose a simple linear layer for the encoder and the decoder of the model. We choose this simple architecture to better analyze the pretrain-finetune process and also because simple models have shown impressive performance in time series forecasting (Zeng et al., 2022; Tran et al., 2023). We train our model using PyTorch (Paszke et al., 2019) with the ADAM optimizer (Kingma and Ba, 2014) for 10 training epochs. We choose the regularization parameter $\lambda=0.1$ and the temperature parameter $\tau=0.1$, and the pretrain batch size is 512. The experiments are repeated 3 times. All the experiment details are in the Appendix. We compare the test errors (MSE and MAE) of our model with the following supervised learning models for time series forecasting: TimesNet (Wu et al., 2023), ETSformer (Woo et al., 2022), LightTS (Zhang et al., 2022a), DLinear (Zeng et al., 2022), FEDformer (Zhou et al., 2022), Stationary (Liu et al., 2022), Autoformer (Wu et al., 2021). Table 1 records their test results.

Note that we do not remove the last layer of our model before finetuning, and the pretrained model can be tested on all of the datasets. Our pretrain results are reported in Table 1. In this table, we also report the ratios of the test errors between our method and TimesNet, a state-of-the-art model for time series processing.

Results. Table 1 shows that our pretrained model has promising generalization results on most of the pretraining datasets. For ETTh1, the model performs only slightly worse than TimesNet, i.e., the difference is $3.4\%$ on average relative to the TimesNet accuracy. The pretrained model generalizes very well on the Exchange-Rate dataset: it is better than TimesNet in 7/8 settings and even better than all the supervised methods for the long-term predictions (336 and 720 forward time steps). We argue that the pretrain process acts as a regularization for the Exchange-Rate dataset in this case. On the other hand, for the ETTm1 and Electricity datasets the pretrained model is worse than TimesNet (the average differences are $19.4\%$ and $30.7\%$, respectively). However, that is not surprising because the supervised models are each trained specifically on that dataset.

For the other datasets, to which the pretrained model did not have access, it is reasonable that the test performance is worse than for the datasets within the pretrain collection. The Weather and ETTh2 datasets have the best generalization, with average differences of $16.4\%$ and $14.5\%$ compared to the TimesNet method.

4.3 Similarity Results

In this section, we show the estimated probability from the pretrained model. We compute this metric using our collection of pretrain samples and average over the finetune samples. We report the percentages in Table 2.

Table 2: The chance (%) that a finetune dataset is similar to one of the pretrain datasets, estimated by our pretrained model. The first four rows correspond to the datasets in the pretrain collection.

Finetune       % of similarity to the pretrain datasets
datasets       ETTh1    ETTm1    Exchange    Electricity
ETTh1          67.62    11.86    11.82        8.70
ETTm1          11.55    53.95    27.58        6.93
Exchange        1.56    10.01    87.79        0.65
Electricity    15.20    22.08    53.75        8.97
ETTh2          30.87    22.65    30.49       16.00
ETTm2          15.71    30.10    44.00       10.19
Traffic        41.40    21.66    17.39       19.54
Weather         0.52     7.40    91.87        0.22
ILI            25.97    34.07    30.93        9.04

This analysis shows that the pretrained model predicts well for the three datasets ETTh1, ETTm1 and Exchange. However, the model cannot classify the Electricity data, and this fact explains its bad generalization error in Table 1. The Exchange-Rate data is predicted with a high probability and the model shows good generalization there. Among the other finetune datasets, a related phenomenon appears for the Weather data: a high probability of similarity and good generalization. The ETTh2 dataset has good metrics since it is close to the ETTh1 dataset, while ETTm2 has a higher probability of similarity to the Exchange dataset. This confirms our finding that the model prediction and contrastive representation learning complement each other. Another observation from this experiment is that the correct classifications do not vary much between different sample batches.

Table 3: Comparisons of the test performance from our finetuned model and other supervised learning models**.

Models Our FT model Ratio TimesNet ETSformer LightTS DLinear FEDformer Stationary Autoformer Data Metric MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE \cellcolor[HTML]FFF7E7 96 0.347 0.373 \cellcolor[HTML]F8FBFE1.027 \cellcolor[HTML]F4CCCC0.995 0.338 0.375 0.375 0.398 0.374 0.400 0.345 0.372 0.379 0.419 0.386 0.398 0.505 0.475 \cellcolor[HTML]FFF7E7 192 0.384 0.393 \cellcolor[HTML]F8FBFE1.027 \cellcolor[HTML]FBFDFF1.016 0.374 0.387 0.408 0.410 0.400 0.407 0.380 0.389 0.426 0.441 0.459 0.444 0.553 0.496 \cellcolor[HTML]FFF7E7 336 0.414 0.415 \cellcolor[HTML]FDFEFF1.010 \cellcolor[HTML]FDFEFF1.010 0.410 0.411 0.435 0.428 0.438 0.438 0.413 0.413 0.445 0.459 0.495 0.464 0.621 0.537 \cellcolor[HTML]FFF7E7 ETTm1 720 0.473 0.451 \cellcolor[HTML]F4CCCC0.990 \cellcolor[HTML]FFFFFF1.002 0.478 0.450 0.499 0.462 0.527 0.502 0.474 0.453 0.543 0.490 0.585 0.516 0.671 0.561 \cellcolor[HTML]FFF7E7 96 0.385 0.398 \cellcolor[HTML]FFFFFF1.003 \cellcolor[HTML]F4CCCC0.990 0.384 0.402 0.494 0.479 0.424 0.432 0.386 0.400 0.376 0.419 0.513 0.491 0.449 0.459 \cellcolor[HTML]FFF7E7 192 0.432 0.425 \cellcolor[HTML]F4CCCC0.991 \cellcolor[HTML]F4CCCC0.991 0.436 0.429 0.538 0.504 0.475 0.462 0.437 0.432 0.420 0.448 0.534 0.504 0.500 0.482 \cellcolor[HTML]FFF7E7 336 0.473 0.448 \cellcolor[HTML]F4CCCC0.963 \cellcolor[HTML]F4CCCC0.955 0.491 0.469 0.574 0.521 0.518 0.488 0.481 0.459 0.459 0.465 0.588 0.535 0.521 0.496 \cellcolor[HTML]FFF7E7 ETTh1 720 0.492 0.490 \cellcolor[HTML]F4CCCC0.944 \cellcolor[HTML]F4CCCC0.980 0.521 0.500 0.562 0.535 0.547 0.533 0.519 0.516 0.506 0.507 0.643 0.616 0.514 0.512 \cellcolor[HTML]FFF7E7 96 0.081 0.204 \cellcolor[HTML]F4CCCC0.757 \cellcolor[HTML]F4CCCC0.872 0.107 0.234 0.085 0.204 0.116 0.262 0.088 0.218 0.148 0.278 0.111 0.237 0.197 0.323 \cellcolor[HTML]FFF7E7 192 0.164 0.300 \cellcolor[HTML]F4CCCC0.726 \cellcolor[HTML]F4CCCC0.872 0.226 0.344 0.182 0.303 0.215 0.359 0.176 0.315 0.271 0.380 0.219 0.335 0.300 0.369 \cellcolor[HTML]FFF7E7 336 0.295 0.407 \cellcolor[HTML]F4CCCC0.804 \cellcolor[HTML]F4CCCC0.908 0.367 0.448 0.348 0.428 0.377 0.466 0.313 0.427 0.460 0.500 0.421 0.476 0.509 0.524 \cellcolor[HTML]FFF7E7 Exchange 720 0.535 0.587 \cellcolor[HTML]F4CCCC0.555 \cellcolor[HTML]F4CCCC0.787 0.964 0.746 1.025 0.774 0.831 0.699 0.839 0.695 1.195 0.841 1.092 0.769 1.447 0.941 \cellcolor[HTML]FFF7E7 96 0.197 0.282 \cellcolor[HTML]D0E0F91.173 \cellcolor[HTML]F5F9FE1.037 0.168 0.272 0.187 0.304 0.207 0.307 0.197 0.282 0.193 0.308 0.169 0.273 0.201 0.317 \cellcolor[HTML]FFF7E7 192 0.195 0.282 \cellcolor[HTML]EFF5FD1.060 \cellcolor[HTML]F4CCCC0.976 0.184 0.289 0.199 0.315 0.213 0.316 0.196 0.285 0.201 0.315 0.182 0.286 0.222 0.334 \cellcolor[HTML]FFF7E7 336 0.207 0.296 \cellcolor[HTML]F3F7FE1.045 \cellcolor[HTML]F4CCCC0.987 0.198 0.300 0.212 0.329 0.230 0.333 0.209 0.301 0.214 0.329 0.200 0.304 0.231 0.338 \cellcolor[HTML]FFF7E7 Electricity 720 0.242 0.229 \cellcolor[HTML]E4EDFC1.100 \cellcolor[HTML]F4CCCC0.716 0.220 0.320 0.233 0.345 0.265 0.360 0.245 0.333 0.246 0.355 0.222 0.321 0.254 0.361 \cellcolor[HTML]F1E5FF 96 0.196 0.294 \cellcolor[HTML]F2F7FE1.048 \cellcolor[HTML]E4EDFC1.101 0.187 0.267 0.189 0.280 0.209 0.308 0.193 0.292 0.203 0.287 0.192 0.274 0.255 0.339 \cellcolor[HTML]F1E5FF 192 0.266 0.342 \cellcolor[HTML]EDF3FD1.068 \cellcolor[HTML]E2ECFB1.107 0.249 0.309 0.253 0.319 0.311 0.382 0.284 0.362 0.269 0.328 0.280 0.339 0.281 0.340 \cellcolor[HTML]F1E5FF 336 0.365 0.412 \cellcolor[HTML]DAE7FA1.137 \cellcolor[HTML]D0E0F91.174 0.321 0.351 0.314 0.357 0.442 
0.466 0.369 0.427 0.325 0.366 0.334 0.361 0.339 0.372 \cellcolor[HTML]F1E5FF ETTm2 720 0.492 0.485 \cellcolor[HTML]C7DAF81.206 \cellcolor[HTML]C8DAF81.203 0.408 0.403 0.414 0.413 0.675 0.587 0.554 0.522 0.421 0.415 0.417 0.413 0.433 0.432 \cellcolor[HTML]F1E5FF 96 0.316 0.372 \cellcolor[HTML]F4CCCC0.929 \cellcolor[HTML]F4CCCC0.995 0.340 0.374 0.340 0.391 0.397 0.437 0.333 0.387 0.358 0.397 0.476 0.458 0.346 0.388 \cellcolor[HTML]F1E5FF 192 0.419 0.433 \cellcolor[HTML]F4F8FE1.042 \cellcolor[HTML]F3F7FE1.046 0.402 0.414 0.430 0.439 0.520 0.504 0.477 0.476 0.429 0.439 0.512 0.493 0.456 0.452 \cellcolor[HTML]F1E5FF 336 0.536 0.507 \cellcolor[HTML]CDDEF91.186 \cellcolor[HTML]DEE9FB1.122 0.452 0.452 0.485 0.479 0.626 0.559 0.594 0.541 0.496 0.487 0.552 0.551 0.482 0.486 \cellcolor[HTML]F1E5FF ETTh2 720 0.708 0.543 \cellcolor[HTML]6D9EEB1.532 \cellcolor[HTML]D4E2F91.160 0.462 0.468 0.500 0.497 0.863 0.672 0.831 0.657 0.463 0.474 0.562 0.560 0.515 0.511 \cellcolor[HTML]F1E5FF 96 0.664 0.410 \cellcolor[HTML]DFEAFB1.120 \cellcolor[HTML]B3CDF51.277 0.593 0.321 0.607 0.392 0.615 0.391 0.650 0.396 0.587 0.366 0.612 0.338 0.613 0.388 \cellcolor[HTML]F1E5FF 192 0.606 0.380 \cellcolor[HTML]F4CCCC0.982 \cellcolor[HTML]DCE8FB1.131 0.617 0.336 0.621 0.399 0.601 0.382 0.598 0.370 0.604 0.373 0.613 0.340 0.616 0.382 \cellcolor[HTML]F1E5FF 336 0.608 0.378 \cellcolor[HTML]F4CCCC0.967 \cellcolor[HTML]DDE9FB1.125 0.629 0.336 0.622 0.396 0.613 0.386 0.605 0.373 0.621 0.383 0.618 0.328 0.622 0.337 \cellcolor[HTML]F1E5FF Traffic 720 0.647 0.398 \cellcolor[HTML]FDFEFF1.011 \cellcolor[HTML]DAE7FA1.137 0.640 0.350 0.632 0.396 0.658 0.407 0.645 0.394 0.626 0.382 0.653 0.355 0.660 0.408 \cellcolor[HTML]F1E5FF 96 0.194 0.251 \cellcolor[HTML]DCE8FB1.128 \cellcolor[HTML]D9E6FA1.141 0.172 0.220 0.197 0.281 0.182 0.242 0.196 0.255 0.217 0.296 0.173 0.223 0.266 0.336 \cellcolor[HTML]F1E5FF 192 0.235 0.291 \cellcolor[HTML]EBF2FD1.073 \cellcolor[HTML]E0EBFB1.115 0.219 0.261 0.237 0.312 0.227 0.287 0.237 0.296 0.276 0.336 0.245 0.285 0.307 0.367 \cellcolor[HTML]F1E5FF 336 0.281 0.327 \cellcolor[HTML]FFFFFF1.004 \cellcolor[HTML]EDF3FD1.069 0.280 0.306 0.298 0.353 0.282 0.334 0.283 0.335 0.339 0.380 0.321 0.338 0.359 0.395 \cellcolor[HTML]F1E5FF Weather 720 0.344 0.368 \cellcolor[HTML]F4CCCC0.942 \cellcolor[HTML]F9FBFF1.025 0.365 0.359 0.352 0.288 0.352 0.386 0.345 0.381 0.403 0.428 0.414 0.410 0.419 0.428

(**) The red bold text indicates the best amongst all methods, and the blue underlined text indicates the second best method. The datasets within the pretrain collection are highlighted in yellow, while the others are highlighted in purple. In the Ratio column, the numbers highlighted in red indicate the settings where the finetuned model has lower (better) metrics than TimesNet. The blue color scale indicates the settings where the finetuned test metrics are worse than TimesNet.

4.4 Finetune and Results

For the finetune step, we estimate the probabilities using a batch of pretrain samples (512 samples) and update the model weights using the finetune loss on that batch. In this stage, the batch of pretrain samples has equal proportions from each dataset. On each finetune dataset, we finetune the model for 10 epochs using 50% of the training set used by the supervised approach (split into training, validation and test sets in chronological order by the ratio of 7:1:2). In Table 3, we report the test performance of the model which yields the best validation result, along with the corresponding ratio to TimesNet. Note that the test batch size is 32, consistent with prior practice (Wu et al., 2021).

Results. Table 3 shows that the finetuned model generalizes better than the supervised methods on the two datasets ETTh1 and Exchange-Rate. The Exchange-Rate data shows an impressive improvement, yielding better results than the supervised methods in all settings. This is consistent with our probabilistic result that the model predicts the Exchange-Rate and ETTh1 data best amongst the datasets. On the ETTm1 and Electricity datasets, the model is comparable with or only slightly worse than TimesNet, where the average differences are $0.9\%$ and $1.1\%$, respectively.

When finetuned on new datasets, the model shows competitive performance for the Traffic and Weather datasets (with $6.2\%$ and $9.4\%$ worse average differences). The ETTh2 and ETTm2 datasets follow, with average differences of $12.6\%$ and $13.0\%$. Given the fact that the model is trained on a different collection of datasets, this experiment shows promising results for our pretrain-finetune approach.

5 Further Analysis

5.1 Variations Analysis

In this section, we analyze some variations of our model. In the first variation, instead of applying the contrastive loss directly on the representation $z$, we use another decoder to transform it into a representation $y$, then apply the contrastive learning on $y$. We also use a linear layer for the decoder transforming $z$ to $y$. For the second variation, we do not use a decoder for the prediction output (i.e., using an identity layer in Figure 2(a) instead of the decoder). In the final variation, we replace the linear encoder and decoder by two-layer neural networks. We report the results in Table 4. This analysis shows that the linear encoder-decoder is essential for the good generalization performance of our method. The first variation with two decoders is also able to capture the dynamics; its performance differs by $6.2\%$ on average from applying contrastive learning directly on the representation $z$. The second variation shows that a decoder for the prediction output is needed. The full results are in the Appendix.

Table 4: Comparisons of the test performance from our finetuned model and the variations. We report the average MSE and MAE over four prediction lengths (96, 192, 336, 720) as in Table 3.
            Finetuned        Variation 1      Variation 2      Variation 3
Data        MSE    MAE       MSE    MAE       MSE    MAE       MSE    MAE
ETTm1       0.405  0.408     0.410  0.424     0.473  0.461     0.480  0.465
ETTh1       0.446  0.440     0.482  0.484     0.476  0.528     0.518  0.525
Exchange    0.269  0.375     0.269  0.396     0.281  0.414     0.282  0.423
Electricity 0.210  0.272     0.227  0.298     0.221  0.338     0.243  0.320
ETTm2       0.330  0.383     0.356  0.418     0.367  0.418     0.395  0.451
ETTh2       0.495  0.464     0.514  0.498     0.532  0.549     0.522  0.542
Traffic     0.631  0.392     0.643  0.412     0.693  0.431     0.773  0.428
Weather     0.264  0.309     0.282  0.344     0.271  0.368     0.310  0.354

5.2 Parameters Analysis

In this section, we analyze how the performance of the model changes when the model parameter $\lambda$ changes. A similar analysis for $\tau$ is deferred to the Appendix. Note that $\lambda$ and $\tau$ are the regularization factor and the temperature parameter of the model, respectively, which control the supervised contrastive loss term.

The choice of the regularization parameter $\lambda$ varies in $[0.01, 0.05, 0.1, 0.5, 1]$. We perform the test for eight datasets using the pretrained models and report the average errors (MAE and MSE) over the four prediction lengths in Figure 4. We observe that in most datasets, the choice $\lambda=0.1$ performs the best.

Figure 4: The average test performance of the pretrained model by parameter $\lambda$, for eight datasets. We plot both the $x$-axis and the $y$-axis on a logarithmic scale to highlight the differences in accuracy.

6 Conclusion

In conclusion, our approach aims to address the discrepancy in pretrain-finetune time series and enhance the knowledge within foundation model training. We employ a pretraining procedure incorporating contrastive learning to distinguish features from different pretraining datasets. This supports the development of a probabilistic similarity metric, allowing us to assess the likelihood of a univariate sample's similarity to one of the pretraining datasets. We introduce a fine-tuning procedure designed to leverage the estimated probability. Our experiments demonstrate that our approach yields favorable results, with accuracy levels comparable to, or in some cases outperforming, supervised models.

Future work in this direction offers promising problems. Addressing inaccurate probability estimation is one interesting question, which may require further study of the dynamics of the pretraining datasets. There could be many potential reasons for this phenomenon: two datasets may have similar dynamics that are difficult to distinguish, or the collective training with other datasets may make the model converge to some sub-optimal solutions. Another potential problem involves the discrepancy at a lower level: within each dataset. While we consider the simplified setting in which features from the same dataset should be closer than features from different datasets, it would still be beneficial to take into account the potential dynamic variations within each dataset, and further apply that knowledge to improve the models.

Appendix A Experiment Details

A.1 Pretrain Sample Collection

We consider the time series forecasting problem where the model has the information of the previous $I$ time steps and aims to predict the next $O$ future time steps. Thus every univariate sample has $I+O$ time steps. Note that we follow the standard experiment procedure as in (Wu et al., 2021), using the following real-world datasets: ETT, Traffic, Electricity, Weather, Exchange-Rate and ILI. The time series data is split into training, validation and test sets in chronological order by the ratio of 7:1:2 for all datasets. To ensure a fair comparison, our pretraining collection only contains samples from the training proportions of the original datasets.

We choose a collection of pretraining datasets including ETTh1, ETTm1, Electricity and Exchange-Rate. The remaining datasets (ETTh2, ETTm2, Weather, Traffic) are reserved for fine-tuning and/or further testing the generalization ability of our model. Note that in order to handle the discrepancy within the sizes of the datasets, we design a pretraining collection such that it has (approximately) the same number of data samples from each pretrain dataset.

The total number of (univariate) samples within each dataset scales proportionally with its number of features $d$ and total time steps $T$. ETTh1 and ETTm1 represent similar data; however, their granularities are different: ETTh1 is recorded hourly while ETTm1 is collected at 15-minute intervals. We choose ETTm1 to be the base dataset, and we sample the other datasets so that they have approximately the same number of samples as ETTm1. Since the total number of time steps of ETTh1 is 4 times smaller than that of ETTm1, we repeat the ETTh1 data 4 times. Similarly, we repeat the Exchange-Rate data 6 times since it has 8 features with a total time length of 7,588, compared to the total length of 69,680 and 7 features of ETTm1. Finally, since Electricity is too large, we choose a stride of 2 to reduce the total length by half, and we only choose a subset of features (27 features) because the Electricity data has many more features (321) than the other datasets. These 27 features have equally spaced indices among the original 321 features. Table 5 describes our sampling procedure along with the descriptions of all the datasets in our experiments.
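As an illustration of this balancing procedure, the sketch below builds such a collection from sliding windows, applying the per-dataset repetition, stride, and feature-subset choices of Table 5; the helper names and the exact interpretation of the stride (here, the step between consecutive windows) are our assumptions rather than the paper's implementation.

import numpy as np

def sliding_windows(series, window, stride=1):
    """All length-`window` (= I + O) univariate windows from a 1-D array."""
    starts = range(0, len(series) - window + 1, stride)
    return np.stack([series[s:s + window] for s in starts])

def build_pretrain_collection(datasets, window):
    """Roughly balance the number of univariate samples per pretrain dataset.
    `datasets` is a list of (array of shape (T, d), repetition, stride, feature indices)."""
    samples, labels = [], []
    for label, (data, repeat, stride, feat_idx) in enumerate(datasets):
        data = data[:, feat_idx]                  # e.g. 27 of the 321 Electricity features
        for _ in range(repeat):                   # repeat short datasets (ETTh1 x4, Exchange-Rate x6)
            for d in range(data.shape[1]):        # channel independence: one series per feature
                w = sliding_windows(data[:, d], window, stride)
                samples.append(w)
                labels.append(np.full(len(w), label))
    return np.concatenate(samples), np.concatenate(labels)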

Table 5: Description of the datasets within the pretrain collection and the other datasets used in our experiment.

Datasets        d     T        Granularity   Used in pretrain data   Stride   Repetition   Number of features in pretrain
ETTh1           7     17,420   1 hour        Yes                     1        4            7
ETTm1           7     69,680   15 min        Yes                     1        1            7
Electricity     321   26,304   1 hour        Yes                     2        1            27
Exchange-Rate   8     7,588    1 day         Yes                     1        6            8
ETTh2           7     17,420   1 hour        No
ETTm2           7     69,680   15 min        No
Traffic         862   17,544   1 hour        No
Weather         21    52,696   10 min        No
ILI             7     966      1 week        No

A.2 Other experiment details

In our experiment, we choose a simple linear layer (with bias) for the encoder and the decoder of the model. We choose this simple architecture to better analyze the pretrain-finetune process and also because linear models have shown impressive performance in time series forecasting (Zeng et al., 2022). We test different dimensions for the representation of our model; for example, we perform a grid search over $\{48, 96, 192, 384\}$ for outputs 96 and 192, while we search over $\{180, 360, 720, 1440\}$ for the larger output of 720. We report the results where the representation space is half the output space, as it performs well in our experiments. Table 6 describes this choice and reports our model size.

We pretrain our model using PyTorch (Paszke et al., 2019) with the ADAM optimizer (Kingma and Ba, 2014) for 10 training epochs. We note that the size of our pretraining sample collection is approximately four times the size of the ETTm1 (univariate) dataset. We choose the regularization parameter $\lambda=0.1$ and the temperature parameter $\tau=0.1$, and the pretrain batch size is 512. The experiments are repeated 3 times.

Table 6: Description of the model used in our pretrain and finetune experiments
Input Representation Dimension Output Model size
96 48 96 9360
96 96 192 27936
96 168 336 73080
96 360 720 294840
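As a sanity check on these sizes, with a linear encoder and a linear decoder that both include bias terms (Section A.2), an input length $I$, representation dimension $r$ and output length $O$ give

$$\text{model size} = (I\,r + r) + (r\,O + O).$$

For example, the first row gives $(96\cdot 48 + 48) + (48\cdot 96 + 96) = 4656 + 4704 = 9360$ parameters, matching Table 6.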

Evaluation metrics. We report the MAE and MSE on the test data, where lower metrics indicate better results. We note that our model applies the same function to every channel of the test data (i.e., channel independence) and returns a matrix output which contains $O$ time steps and $D$ features (the number of features of the corresponding test data). Let $P\in\mathbb{R}^{D\times O}$ be the predicted values of our model and $V\in\mathbb{R}^{D\times O}$ be the ground truth values. The metrics are defined as follows:

$$\text{MAE}(P,V) = \frac{1}{DO}\sum_{d=1}^{D}\sum_{t=1}^{O}\left|P_{t}^{d}-V_{t}^{d}\right|, \qquad \text{MSE}(P,V) = \frac{1}{DO}\sum_{d=1}^{D}\sum_{t=1}^{O}\left(P_{t}^{d}-V_{t}^{d}\right)^{2}.$$

In our test, we use the same metrics as the prior reference (Wu et al., 2021) with batch size 32. Note that we only implemented our own method; the test results of the other methods are taken from the TimesNet reference (Wu et al., 2023).

Appendix B Experiment Results

Table 7: Comparisons of the test performance from our finetuned model and the variations. We report the average MSE and MAE over four prediction lengths: 96, 192, 336, 720.
Models Standard Variation 1 Variation 2 Variation 3
Data Metric MSE MAE MSE MAE MSE MAE MSE MAE
\cellcolor[HTML]FFF7E7 96 0.347 0.373 0.354 0.393 0.414 0.400 0.423 0.433
\cellcolor[HTML]FFF7E7 192 0.384 0.393 0.389 0.409 0.446 0.447 0.458 0.449
\cellcolor[HTML]FFF7E7 336 0.414 0.415 0.418 0.429 0.480 0.473 0.488 0.471
\cellcolor[HTML]FFF7E7 720 0.473 0.451 0.479 0.464 0.554 0.523 0.550 0.509
\cellcolor[HTML]FFF7E7ETTm1 Avg. 0.405 0.408 0.410 0.424 0.473 0.461 0.480 0.465
\cellcolor[HTML]FFF7E7 96 0.385 0.398 0.428 0.448 0.412 0.484 0.456 0.481
\cellcolor[HTML]FFF7E7 192 0.432 0.425 0.473 0.474 0.461 0.512 0.504 0.510
\cellcolor[HTML]FFF7E7 336 0.473 0.448 0.512 0.494 0.503 0.537 0.546 0.533
\cellcolor[HTML]FFF7E7 720 0.492 0.490 0.515 0.522 0.527 0.580 0.566 0.576
\cellcolor[HTML]FFF7E7ETTh1 Avg. 0.446 0.440 0.482 0.484 0.476 0.528 0.518 0.525
\cellcolor[HTML]FFF7E7 96 0.081 0.204 0.080 0.220 0.082 0.236 0.082 0.227
\cellcolor[HTML]FFF7E7 192 0.164 0.300 0.165 0.318 0.168 0.333 0.169 0.324
\cellcolor[HTML]FFF7E7 336 0.295 0.407 0.296 0.434 0.314 0.440 0.315 0.438
\cellcolor[HTML]FFF7E7 720 0.535 0.587 0.535 0.614 0.561 0.646 0.560 0.703
\cellcolor[HTML]FFF7E7Exchange Avg. 0.269 0.375 0.269 0.396 0.281 0.414 0.282 0.423
\cellcolor[HTML]FFF7E7 96 0.197 0.282 0.210 0.294 0.204 0.330 0.227 0.309
\cellcolor[HTML]FFF7E7 192 0.195 0.282 0.215 0.299 0.206 0.331 0.229 0.313
\cellcolor[HTML]FFF7E7 336 0.207 0.296 0.228 0.313 0.220 0.338 0.242 0.328
\cellcolor[HTML]FFF7E7 720 0.242 0.229 0.256 0.288 0.255 0.352 0.274 0.331
\cellcolor[HTML]FFF7E7Electricity Avg. 0.210 0.272 0.227 0.298 0.221 0.338 0.243 0.320
\cellcolor[HTML]F1E5FF 96 0.196 0.294 0.237 0.342 0.200 0.310 0.257 0.359
\cellcolor[HTML]F1E5FF 192 0.266 0.342 0.304 0.386 0.286 0.371 0.336 0.416
\cellcolor[HTML]F1E5FF 336 0.365 0.412 0.384 0.440 0.401 0.448 0.419 0.475
\cellcolor[HTML]F1E5FF 720 0.492 0.485 0.498 0.504 0.581 0.545 0.568 0.555
\cellcolor[HTML]F1E5FFETTm2 Avg. 0.330 0.383 0.356 0.418 0.367 0.418 0.395 0.451
\cellcolor[HTML]F1E5FF 96 0.316 0.372 0.350 0.417 0.340 0.450 0.333 0.440
\cellcolor[HTML]F1E5FF 192 0.419 0.433 0.442 0.471 0.453 0.514 0.453 0.512
\cellcolor[HTML]F1E5FF 336 0.536 0.507 0.537 0.528 0.565 0.578 0.554 0.573
\cellcolor[HTML]F1E5FF 720 0.708 0.543 0.725 0.575 0.769 0.653 0.748 0.644
\cellcolor[HTML]F1E5FFETTh2 Avg. 0.495 0.464 0.514 0.498 0.532 0.549 0.522 0.542
\cellcolor[HTML]F1E5FF 96 0.664 0.410 0.665 0.419 0.706 0.440 0.791 0.421
\cellcolor[HTML]F1E5FF 192 0.606 0.380 0.624 0.408 0.670 0.421 0.754 0.423
\cellcolor[HTML]F1E5FF 336 0.608 0.378 0.631 0.405 0.677 0.422 0.758 0.429
\cellcolor[HTML]F1E5FF 720 0.647 0.398 0.651 0.415 0.719 0.442 0.790 0.439
\cellcolor[HTML]F1E5FFTraffic Avg. 0.631 0.392 0.643 0.412 0.693 0.431 0.773 0.428
\cellcolor[HTML]F1E5FF 96 0.194 0.251 0.210 0.304 0.197 0.301 0.239 0.302
\cellcolor[HTML]F1E5FF 192 0.235 0.291 0.251 0.333 0.240 0.346 0.280 0.340
\cellcolor[HTML]F1E5FF 336 0.281 0.327 0.304 0.368 0.290 0.378 0.330 0.379
\cellcolor[HTML]F1E5FF 720 0.344 0.368 0.363 0.370 0.357 0.446 0.390 0.396
\cellcolor[HTML]F1E5FFWeather Avg. 0.264 0.309 0.282 0.344 0.271 0.368 0.310 0.354

B.1 Variations Analysis

In this section, we analyze some variations of our model. In the first variation, instead of applying the contrastive loss directly on the representation $z$, we use another decoder to transform it into a representation $y$, then apply the contrastive learning on $y$. We also use a linear layer for the decoder transforming $z$ to $y$. We choose the same dimension for $z$ as in the original approach (as described in Table 6). For the dimension of $y$, we perform a grid search over three choices (half the dimension of $z$, the same dimension as $z$, and double the dimension of $z$) and report the best results among these choices.

For the second variation, we do not use a decoder for the prediction output (i.e., using an identity layer in Figure 2(a) instead of the decoder). In the final variation, we replace the linear encoder and decoder by two-layer neural networks. We keep the same representation dimension for $z$ as described in Table 6. We also perform a grid search on the number of hidden layers to find the best (training) model, then report the test result of that model.

We report the full results in Table 7. This analysis shows that the linear encoder-decoder is essential for the good generalization performance of our method. The first variation with two decoders is also able to capture the dynamics; its performance differs by $6.2\%$ on average from applying contrastive learning directly on the representation $z$. The second variation shows that a decoder for the prediction output is needed.

B.2 Parameters Analysis

In this section, we analyze how the performance of the model changes when the model parameters $\lambda$ and $\tau$ change. Note that $\lambda$ and $\tau$ are the regularization factor and the temperature parameter of the model, respectively, which control the supervised contrastive loss term.

B.2.1 Parameter $\lambda$

Here we set the temperature parameter of the model $\tau$ to be 0.1. The choice of the regularization parameter $\lambda$ varies in $[0.01, 0.05, 0.1, 0.5, 1]$. We perform the test for eight datasets using the pretrained models and report the average errors (MAE and MSE) over the four prediction lengths in Table 8 and Figure 5. We observe that in most datasets, the choice $\lambda=0.1$ performs the best.

Table 8: The average test performance of the pretrained model by parameter $\lambda$, for eight datasets. We report the average MSE and MAE over four prediction lengths: 96, 192, 336, 720.
MSE
$\lambda$ ETTm1 ETTh1 Exchange Electricity ETTm2 ETTh2 Traffic Weather
0.01 0.608 0.628 0.445 0.320 0.482 0.681 0.920 0.363
0.05 0.523 0.526 0.336 0.288 0.409 0.511 0.867 0.301
0.1 0.491 0.475 0.280 0.270 0.396 0.498 0.817 0.291
0.5 0.536 0.531 0.387 0.290 0.430 0.587 0.850 0.319
1 0.676 0.605 0.539 0.305 0.513 0.663 0.935 0.368
MAE
$\lambda$ ETTm1 ETTh1 Exchange Electricity ETTm2 ETTh2 Traffic Weather
0.01 0.589 0.576 0.495 0.420 0.483 0.544 0.585 0.386
0.05 0.479 0.490 0.401 0.385 0.426 0.488 0.514 0.340
0.1 0.470 0.463 0.393 0.358 0.420 0.478 0.494 0.335
0.5 0.516 0.482 0.434 0.389 0.448 0.544 0.537 0.343
1 0.581 0.548 0.519 0.415 0.513 0.572 0.550 0.408
Figure 5: The average test performance of the pretrained model by parameter λ, for eight datasets. We plot both the x-axis and y-axis in logarithmic scale to highlight the differences in accuracy.

B.2.2 Parameter τ

We set the regularization factor λ to 0.1. The temperature parameter τ varies over [0.05, 0.1, 0.5, 1, 5]. We perform the test for eight datasets using the pretrained models and report the average errors (MAE and MSE) over the four prediction lengths in Table 9 and Figure 6. Since the choice τ = 0.1 performs reasonably well in most of the datasets, we choose τ = 0.1 in our experiment.

Table 9: The average test performance of the pretrained model by parameter τ, for eight datasets. We report the average MSE and MAE over four prediction lengths: 96, 192, 336, and 720.
MSE
τ ETTm1 ETTh1 Exchange Electricity ETTm2 ETTh2 Traffic Weather
0.05 0.537 0.481 0.285 0.279 0.416 0.558 0.819 0.294
0.1 0.491 0.475 0.280 0.270 0.396 0.498 0.817 0.291
0.5 0.529 0.478 0.288 0.285 0.397 0.490 0.833 0.291
1 0.523 0.479 0.304 0.289 0.410 0.506 0.833 0.292
5 0.563 0.481 0.313 0.306 0.411 0.498 0.849 0.292
MAE
τ ETTm1 ETTh1 Exchange Electricity ETTm2 ETTh2 Traffic Weather
0.05 0.491 0.478 0.398 0.374 0.428 0.479 0.497 0.350
0.1 0.470 0.463 0.393 0.358 0.420 0.478 0.494 0.335
0.5 0.468 0.463 0.400 0.379 0.412 0.473 0.503 0.337
1 0.468 0.465 0.419 0.383 0.432 0.487 0.504 0.336
5 0.469 0.465 0.426 0.402 0.430 0.485 0.512 0.337
Figure 6: The average test performance of the pretrained model by parameter τ, for eight datasets. We plot both the x-axis and y-axis in logarithmic scale to highlight the differences in accuracy.
