
Scaling Law for Time Series Forecasting

Jingzhe Shi1*, Qinwei Ma1*, Huan Ma2, Lei Li3,4 🖂
1 Institute for Interdisciplinary Information Sciences, Tsinghua University
2 Zhili College, Tsinghua University
3 University of Copenhagen 4 University of Washington
1,2 {shi-jz21,mqw21,mah21}@mails.tsinghua.edu.cn, 3,4 [email protected]
* Equal Contribution.
Abstract

Scaling law that rewards large datasets, complex models and enhanced data granularity has been observed in various fields of deep learning. Yet, studies on time series forecasting have cast doubt on the scaling behaviors of deep learning methods for time series forecasting: while more training data improves performance, more capable models do not always outperform less capable models, and a longer input horizon may hurt performance for some models. We propose a theory of the scaling law for time series forecasting that can explain these seemingly abnormal behaviors. We take into account the impact of dataset size and model complexity, as well as time series data granularity, particularly focusing on the look-back horizon, an aspect that has been unexplored in previous theories. Furthermore, we empirically evaluate various models using a diverse set of time series forecasting datasets, which (1) verifies the validity of the scaling law for dataset size and model complexity within the realm of time series forecasting, and (2) validates our theoretical framework, particularly regarding the influence of the look-back horizon. We hope our findings may inspire new models targeting time series forecasting datasets of limited size, as well as large foundational datasets and models for time series forecasting in future work. Code for our experiments has been made public at: https://github.com/JingzheShi/ScalingLawForTimeSeriesForecasting.

1 Introduction

Because of the practical value of time series forecasting, recent years have seen rapid development of time series forecasting methods under the paradigm of neural network training. Neural networks with different architectures, including FFN-based [1, 2, 3], Transformer-based [4, 5, 6, 7, 8] and convolution-based [9, 10] models, have been proposed. Starting from around 2022, some previous work [1, 11, 7, 10] proposed that powerful models could be enhanced by extending the look-back horizon, because more historical information can be utilized. However, our experiments (Figure 3) show that this claim may not hold for practical datasets with a limited amount of training data: an optimal horizon does exist, and it increases as the amount of available training data increases. This calls for a more thorough understanding of the impact of the horizon and dataset size on forecasting loss.

In Natural Language Processing (NLP) [12, 13], Computer Vision (CV) [14] and other fields of deep learning, the impact of dataset size, model size and data granularity on performance is sometimes summarized as the Scaling Law: larger datasets, larger models and more detailed data granularity improve performance in these cases, and theories [15, 16] have been proposed to explain these behaviors. However, these theories do not lay emphasis on the horizon of time series, hence they cannot be used directly to explain the impact of the horizon.

In this work, we introduce a comprehensive scaling law theory for time series forecasting, with a particular emphasis on the impact of the horizon. This theory integrates dataset size and model size to optimize predictive performance based on the look-back horizon. We further conduct experiments to (1) verify that scaling behaviors for dataset size and model size do exist in time series forecasting and (2) validate our theory, especially about the influence of different horizons.

Our main contribution regarding time series forecasting includes:

  • 1.

    We introduce a novel theoretical framework that elucidates scaling behaviors from an intrinsic space perspective, highlighting the critical influence of the look-back horizon on model performance. Our theory identifies an optimal horizon, demonstrating that beyond this point, performance degrades due to the inherent limitations of dataset size.

  • 2.

    We conduct a comprehensive empirical investigation into the scaling behaviors of dataset size, model size, and look-back horizon across various models and datasets. Our research establishes a robust scaling law for time series forecasting, providing a foundational framework that adapts to diverse modeling contexts.

As a corollary to our conclusions, we point out that different models might have different optimal look-back horizons for the same dataset (Figure 1); therefore, we call for future work to compare different models using the optimal look-back horizon for each model rather than a fixed look-back horizon.

As a further result of our findings, showing that a model benefits from a longer horizon compared to baseline models, though widely used in previous work, is not by itself sufficient to prove its superiority over these baseline models.

We hope our work may inspire future work when designing forecasting models for specific datasets of limited size, as well as future work proposing large foundational datasets and models in the field of time series.

2 Related Work

2.1 Time Series Forecasting

The task of time series forecasting is to predict the next S frames of a time series with N variables (denoted as Y=\{y_{1},y_{2},\ldots,y_{N}\}\in\mathbb{R}^{N\times S}) given its previous observations of H frames (denoted as X=\{x_{1},x_{2},\ldots,x_{N}\}\in\mathbb{R}^{N\times H}). H is called the look-back horizon in some scenarios.

A Channel-Independent model predicts y_{i} by \hat{y}_{i}=f(x_{i}). Linear models and MLPs have been proven to be effective learners for time series forecasting. A series of works [1, 3, 17] utilizes linear layers and methods like low-pass filters and Ordinary Least Squares regression to learn linear relationships in the time series. Reversible MLP [18] proposes to use linear layers and MLPs with Reversible Instance Normalization [19] and obtains satisfying results. PatchTST [7] proposes to use patch embeddings for time series.

A Channel-Dependent model predicts y_{i} by \hat{y}_{i}=[f(x_{1},x_{2},\ldots,x_{N})]_{i}. A series of works based on Transformers and their variants have been proposed [4, 5, 6], as well as convolution-based methods built on TCN [20, 21]. More recently, iTransformer [8] proposes to use attention to capture the relationship between different variables. ModernTCN [10] proposes a new convolutional architecture enabling a wide effective receptive field.
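To make the two formulations concrete, below is a minimal sketch (our own illustration, not code from the paper's repository) that fits plain least-squares versions of a channel-independent and a channel-dependent forecaster on a toy series; all shapes, names and numbers are illustrative.

```python
# Minimal sketch (not the paper's code): least-squares channel-independent (CI)
# vs channel-dependent (CD) forecasters on a toy multivariate series.
import numpy as np

N, H, S, T = 4, 16, 8, 2000            # variables, look-back horizon, prediction length, series length
series = np.cumsum(np.random.randn(T, N), axis=0)   # toy random-walk series

# Sliding-window samples: X holds the past H frames, Y the next S frames.
X = np.stack([series[t:t + H] for t in range(T - H - S)])           # (D, H, N)
Y = np.stack([series[t + H:t + H + S] for t in range(T - H - S)])   # (D, S, N)

# Channel-Independent: one shared map f, so y_i_hat = f(x_i) for each variable i.
X_ci = X.transpose(0, 2, 1).reshape(-1, H)      # (D*N, H)
Y_ci = Y.transpose(0, 2, 1).reshape(-1, S)      # (D*N, S)
W_ci, *_ = np.linalg.lstsq(X_ci, Y_ci, rcond=None)

# Channel-Dependent: y_hat = f(x_1, ..., x_N), all variables modeled jointly.
X_cd = X.reshape(len(X), H * N)                 # (D, H*N)
Y_cd = Y.reshape(len(Y), S * N)                 # (D, S*N)
W_cd, *_ = np.linalg.lstsq(X_cd, Y_cd, rcond=None)

print("CI weights:", W_ci.shape, "CD weights:", W_cd.shape)
```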

Many works analyzed the mathematical properties of time series before the advent of machine learning; [22, 23] give good summaries of these early works from different perspectives.

Recently, there have been works proposing large foundational datasets and models for time series. Some propose foundational models that can perform zero-shot forecasting [24]. Others propose open-source foundational datasets and verify the transfer-learning ability of foundation models trained on these datasets [25, 26]. (Released in close timing with ours, [27] experimentally verifies the scaling law for the classic Transformer architecture on a large mixed dataset for time series forecasting.) There are also works utilizing LLMs for zero-shot prediction [28], or using LLM backbones pretrained on text to perform time series forecasting [29, 30, 31].

2.2 Scaling Law and related Theory

A plethora of research has been conducted to investigate scaling laws in various domains of deep learning, encompassing Natural Language Processing, Computer Vision tasks, and graph-based neural networks [12, 13, 14, 32]. These studies have not only observed the existence of scaling laws but also proposed theories to elucidate them. For instance, certain works have interpreted the scaling behavior from the standpoint of intrinsic space or the data manifold, providing a deeper understanding of the underlying mechanisms [15, 16]. In parallel, the properties of time series have been the subject of extensive theoretical and empirical studies. Notably, some research has established bounds for the quantization error of time series, contributing to the body of knowledge on time series analysis [33].

3 Theory for Scaling Law in Time Series Forecasting

3.1 Forecasting in Intrinsic Space

To represent the amount of information carried by a time series slice of length L, we consult the concepts of intrinsic dimension and intrinsic space. Consider all time series slices of length L in a particular scenario; these slices are hypothesized to reside within an intrinsic space that encapsulates the fundamental characteristics of the data, effectively representing its inherent features. Denote this intrinsic space as \mathcal{M}(L) and its intrinsic dimension as d_{I}(L).

It immediately follows that time series forecasting is equivalent to predicting a vector in \mathcal{M}(S) given its previous H frames, which can be represented by a vector in the space \mathcal{M}(H).

3.2 Formulation

Before studying the impact of horizon, dataset size and model on loss, we first formulate the intrinsic space, the data distribution and the form of the loss.

3.2.1 Intrinsic Space

For a time series s_{0,1,\ldots,L} of length L (each s_{i} is a unit element in the sequence that may contain a single variable or multiple variables, dependent or independent), we represent it using a vector x in \mathcal{M}(L).

We make the following assumptions on the spaces \{\mathcal{M}(1),\mathcal{M}(2),\ldots\} and the distribution of data in these spaces:

  • 1.

    Information-preserving: Intuitively speaking, we should be able to recover the real sequence (which might be a multi-variable sequence or a single-variable sequence) from its corresponding intrinsic vector with the error bounded by a small constant value. Formally, we can state this as follows:

    There exists a mapping \phi from the original length-L sequence space \mathcal{O}(L) to \mathcal{M}(L), an inverse mapping \phi^{-1}:\mathcal{M}(L)\to\mathcal{O}(L) and a constant e(L)\ll 1 related to L such that \mathbb{E}_{x\sim\mathcal{O}(L)}\|x-\phi^{-1}(\phi(x))\|_{2}^{2}\leq e(L).

  • 2.

    Inverse Lipschitz: \phi^{-1} should be K_{I}-Lipschitz under the L2 norm. That is:

    \forall x,y\in\mathcal{M}(L),\|\phi^{-1}(x)-\phi^{-1}(y)\|_{2}\leq K_{I}\|x-y\|_{2}
  • 3.

    Bounded: For simplicity, we assume the values in all dimensions of the intrinsic space are bounded, and thus we can scale the intrinsic space to fit it into \mathcal{M}(H)=[0,1]^{d_{I}(H)}.

  • 4.

    Isomorphism: \mathcal{M}(L_{1}) is isomorphic to a subspace of \mathcal{M}(L_{2}) for L_{1}\leq L_{2}. Moreover, the isomorphism should also preserve the data distribution in the space. Formally, we can state it as:

    Let P[H_{1},H_{2}] denote the linear projection from \mathcal{M}(H_{1}) to the subspace of it isomorphic to \mathcal{M}(H_{2}), and let Cov_{L} denote the covariance matrix of the data distribution in \mathcal{M}(L); then Cov_{L_{1}} should be congruent to P[L_{2},L_{1}](Cov_{L_{2}}) for any L_{1}<L_{2}.

  • 5.

    Linear Truncation: Truncation in the time series space is close to a linear projection mapping in \mathcal{M}(L). Formally, we can state it as:

    Define a truncating function t_{p}[H_{1},H_{2}](x_{0:H_{1}}) such that, for any sequence s, letting s_{h_{1}:h_{2}} denote the intrinsic vector for the subsequence from s[h_{1}] to s[h_{2}-1], we have t_{p}[H_{1},H_{2}](s_{0:H_{1}})=s_{0:H_{2}}; then t_{p}[H_{1},H_{2}]\approx P[H_{1},H_{2}].

  • 6.

    Causality: There should be an optimal model to predict the next S frames given the previous h\to\infty frames, so that the error only originates from the intrinsic noise. That is:

    \exists F[S]:\mathcal{M}\to\mathcal{M}(S),\ s.t.\ \lim\limits_{h\to\infty}\mathbb{P}(y\mid x_{-h:0})=(1-\eta)\delta(F[S](x_{-h:0}))+\eta\mathcal{N}(F[S](x_{-h:0}),\Sigma_{S}),

    where \eta stands for the noise ratio in the system, \delta(\cdot) stands for the Dirac delta function and \mathcal{N}(\mu,\Sigma) stands for a normal distribution with mean \mu and covariance \Sigma. (Notice that the noise distribution is not necessarily a normal distribution, and our result holds for any noise distribution with mean \mu and covariance \Sigma; however, we only consider the normal-distribution case here for simplicity.)

    Moreover, we assume F[S] is first-order K_{1}-Lipschitz and second-order K_{2}-Lipschitz in the L_{2} metric (as we assume S is fixed, we do not discuss how the Lipschitz coefficients vary with S and simply take them as constants).

  • 7.

    Uniform Sampling Noise: When drawing a length-L sample, we assume the sampling noise is uniform in each direction in \mathcal{M}(L).

  • 8.

    Zipf Distribution: We assume the data distribution in the intrinsic space follows a Zipf law over the different dimensions of the intrinsic space. That is, the eigenvalue spectrum of Cov_{L} satisfies \lambda_{i}\approx\lambda_{0}i^{-\alpha_{Z}}, where \lambda_{i} represents the i-th largest eigenvalue. This is shown by other work [34, 35] and is also verified in our experiments.

This result is asymptotic and does not suggest uniform intrinsic dimensions across sequence elements. Specifically, for a sequence element with v variables, d_{I}(H)\propto v approximately in channel-independent scenarios. In contrast, for channel-dependent cases, which are more common, the total intrinsic dimension is typically less than the sum of the dimensions for individual variables.

In the derivation we do not assume a specific relationship between d_{I}(H) and H. Some previous work [36] shows that in some cases d_{I}(H)\approx\Theta(H). In Appendix A.1 we discuss their relationship in more detail.

We formulate the intrinsic space and these assumptions more strictly and provide a brief construction of \{\mathcal{M}(1),\mathcal{M}(2),\ldots\} in Appendix A.2. Also, we discuss cases where these conditions are not strictly satisfied in Appendix A.3.

Moreover, under these assumptions, the loss in the original space L_{ori} can be linearly bounded by the loss in the intrinsic space L_{ins}; details can be found in Appendix A.4.

In the following derivation, we use L to denote L_{ins}.

3.2.2 Loss: Overall

In the following sections, we consider the task of predicting a vector in \mathcal{M}(S) given the vector corresponding to its previous H frames in \mathcal{M}(H), with a model m. For simplicity, we denote the operation of obtaining a vector in \mathcal{M}(t_{2}-t_{1}) by truncating the sequence s from time t_{1} to t_{2} as x[t_{1}:t_{2}], where x is a representation in \mathcal{M}(|s|).

L=E_{x\sim\mathcal{M}(H+S)}[(x[H:H+S]-m(x[0:H]))^{2}]

Let m^{*} denote the optimal Bayesian model; it should satisfy:

m^{*}(x[0:H])=E_{x\sim\mathcal{M}(H+S)}[x[H:H+S]|x[0:H]]

Thus:

L = E_{x\sim\mathcal{M}(H+S)}[(x[H:H+S]-m^{*}(x[0:H])+m^{*}(x[0:H])-m(x[0:H]))^{2}]
= E_{x\sim\mathcal{M}(H+S)}[(x[H:H+S]-m^{*}(x[0:H]))^{2}]+E_{x\sim\mathcal{M}(H)}[(m^{*}(x)-m(x))^{2}]
+2E_{x\sim\mathcal{M}(H+S)}[(x[H:H+S]-m^{*}(x[0:H]))(m^{*}(x)-m(x))]

Since x[H:H+S]-m^{*}(x[0:H]) and m^{*}(x)-m(x) are independent with respect to s[H:H+S] and E_{x\sim\mathcal{M}(H+S)}[x[H:H+S]-m^{*}(x[0:H])]=0, the loss is a sum of the previous two terms: one is decided by the capability of the optimal Bayesian model (the Bayesian Error), and the other by the model's ability to approximate the Bayes estimation: L=L_{Bayesian}+L_{approx}.

We then calculate each of the two terms as a function of the horizon, dataset size and model size.

3.2.3 Bayesian Loss

We consider the Bayesian error relative to \mathcal{M}(\infty), where the loss for predicting S frames originates from the inherent uncertainty of the system. That is, we evaluate the amount of information carried by \mathcal{M}(H) compared to \mathcal{M}(\infty). By assumptions 5 and 6 in Section 3.2.1, it can be verified that:

L_{Bayesian}\leq(1-\eta)K_{1}^{2}\mathbb{E}[var(P^{-1}[\infty,H](x))]+\eta\cdot tr(\Sigma_{S})

According to assumption 7, the noise in the i-th predicted frame would be proportional to \sqrt{i}, and the total noise in the predicted S frames would be proportional to S. Let \sigma_{M}^{2} denote the variance of the noise for a single frame in a single dimension; then tr(\Sigma_{S}) should be equal to \sigma_{M}^{2}S^{2}d_{I}(S).

According to assumptions 4 and 8, we can express the inverse projection term as:

\mathbb{E}[var(P^{-1}[H](x))]=\sum\limits_{d_{I}(H)\leq i<d_{I}}\lambda_{i}\approx\frac{\lambda_{0}}{(\alpha_{Z}-1)d_{I}(H)^{\alpha_{Z}-1}}

which indicates that:

L_{Bayesian}\approx K_{1}^{2}(1-\eta)\frac{\lambda_{0}}{(\alpha_{Z}-1)d_{I}(H)^{\alpha_{Z}-1}}+\eta\sigma_{M}^{2}S^{2}d_{I}(S)

3.2.4 Approximation Loss: Two Cases

The training data is sampled from the distribution E_{x\sim\mathcal{M}(H+S)}[x[H:H+S]|x[0:H]]. Following previous works [16, 37], we adopt the piece-wise linear assumption for deep learning models. That is, we assume the model partitions the intrinsic space into N subregions and makes a linear prediction for each subregion independently. Here we discuss two cases: the large-dataset limit and the small-dataset limit.

If the dataset is large and contains many samples in each piece of the model's partition, the model tends to learn the averaged behavior of the region. Intuitively, a larger dataset size leads to smaller averaged noise, and a larger model leads to smaller piece-wise linear pieces, reducing the error caused by the deviation of the data distribution from the linear model within the small blocks. Roughly speaking, the loss consists of two terms: one caused by the uncertainty within the subregions partitioned by the model, L_{r}, and one caused by the noise in the data that makes the model fail to learn the optimal model, L_{n}. If we assume that the model partitions the space uniformly, then the loss should satisfy:

L_{approx}\approx K_{2}^{2}\frac{d_{I}(H)^{2}N^{-\frac{4}{d_{I}(H)}}}{4\pi^{2}}+\frac{Nd_{I}(H)}{D}(\sigma_{M}^{2}S^{2}d_{I}(S)+\frac{K_{1}^{2}\lambda_{0}}{(\alpha_{Z}-1)d_{I}(H)^{\alpha_{Z}-1}}).

The uniform partitioning should be the most naive case for the model to learn, and we also analyze cases where more advanced partitionings are learned in Appendix D. Also, it is worth noticing that the noise consists of two terms: one is the systematic uncertainty as shown in Section 3.2.3, and the other is caused by the horizon limitation, so that the effect of the unseen dimensions looks exactly like noise to the model. Please refer to Appendix B.2 for a more detailed derivation.

Otherwise, if there are few data samples or the model is sufficiently large, the model cannot learn to average the noise of nearby samples, but rather learns to remember each sample in certain pieces. This gives a data-scaling loss determined by the nearest-neighbor distance. In this case:

L_{approx}\approx\frac{K_{1}^{2}}{4\pi}d_{I}(H)D^{-\frac{2}{d_{I}(H)}}.

Please refer to Appendix B.3 for a more detailed derivation of this scenario.

The boundary between the two phases is decided by dataset size and model size, as well as the horizon. It is worth noticing that in time series forecasting tasks, the sliding-window method for constructing data samples is usually used: data points between time [t-H:t+S] are considered one sample, and those between [t-H+1:t+S+1] are considered another sample in the dataset; these two samples are strongly correlated with each other. Therefore, the closest data points are always dependent on each other and strongly correlated. The effective number of mutually independent data fragments is approximately proportional to D/H rather than D, and \xi=D/NH serves as an approximate order parameter separating the previous two scenarios. Again, please refer to Appendix B for a more detailed derivation.
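As an illustration of this order parameter, the following toy helper (our own sketch; the threshold value is a hypothetical placeholder, since the paper treats \xi only as an approximate separator) computes \xi=D/(NH) and names the corresponding regime.

```python
# Toy helper (not from the paper's code): xi = D / (N * H) as a rough order
# parameter; the threshold below is a hypothetical placeholder.
def regime(D, N, H, threshold=1.0):
    """D: # training points, N: # linear pieces of the model, H: look-back horizon."""
    xi = D / (N * H)
    return xi, ("data-dense (averaging) limit" if xi > threshold
                else "few-data (nearest-neighbor) limit")

print(regime(D=34_000, N=64, H=336))     # moderate model, ETT-scale dataset
print(regime(D=34_000, N=4096, H=720))   # larger model and longer horizon
```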

3.3 Optimal Horizon under Each Circumstance

As stated in Section 3.2.2, the total loss is simply the sum of the two components deduced above. Analyzing the form of the total loss, we may obtain an optimal horizon H^{*} (or a corresponding optimal intrinsic dimension d_{I}^{*}) that minimizes the loss in different cases.

3.3.1 Optimal Horizon for a large amount of data

For the case with sufficient data, the total loss is:

L = L_{Bayesian}+L_{Approx}
\approx K_{2}^{2}\frac{d_{I}(H)^{2}N^{-\frac{4}{d_{I}(H)}}}{4\pi^{2}}+K_{1}^{2}(1-\eta)\frac{\lambda_{0}}{\alpha_{Z}-1}\frac{1}{d_{I}(H)^{\alpha_{Z}-1}}+\frac{Nd_{I}(H)}{D}(\sigma_{M}^{2}S^{2}d_{I}(S)+\frac{K_{1}^{2}\lambda_{0}}{(\alpha_{Z}-1)d_{I}^{\alpha_{Z}-1}(H)})
\approx K_{2}^{2}\frac{d_{I}(H)^{2}N^{-\frac{4}{d_{I}(H)}}}{4\pi^{2}}+K_{1}^{2}(1-\eta)\frac{\lambda_{0}}{\alpha_{Z}-1}\frac{1}{d_{I}(H)^{\alpha_{Z}-1}}+\frac{Nd_{I}(H)}{D}\sigma_{M}^{2}S^{2}d_{I}(S)
(since we always assume N\ll D)

We consider two cases. The detailed derivation can be found in Appendix C.1.

If the model size is too small compared to the dataset size, such that N=o(D^{\frac{d_{I}(H)}{d_{I}(H)+4}}), then the effect of dataset size on picking the optimal horizon can be neglected, thus:

d_{I}^{*}=\mathcal{W}(\frac{4}{\alpha_{Z}C_{0}^{\frac{1}{\alpha_{Z}}}}\ln^{1+\frac{1}{\alpha_{Z}}}N)\approx\frac{4}{\alpha_{Z}C_{0}^{\frac{1}{\alpha_{Z}}}}\ln^{1+\frac{1}{\alpha_{Z}}}N.

where \mathcal{W}(\cdot) is the Lambert W function (\mathcal{W}(x)\approx x) and C_{0}=\frac{K_{1}^{2}\pi^{2}(1-\eta)\lambda_{0}}{K_{2}^{2}}.

If N is not that small, i.e. N=\omega(D^{\frac{d_{I}(H)}{d_{I}(H)+4}}), then the noise effect becomes dominant in picking the optimal H, and the optimal d_{I}(H) would be:

d_{I}^{*}=(\frac{K_{1}^{2}(1-\eta)\lambda_{0}D}{N\sigma_{M}^{2}S^{2}d_{I}(S)})^{\frac{1}{\alpha_{Z}}}.

In this case, d_{I}^{*} grows noticeably as D increases and as N decreases.
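For concreteness, the sketch below (our own illustration; every constant K_1, K_2, \lambda_0, \alpha_Z, \eta, \sigma_M^2 and S^2 d_I(S) is a hypothetical placeholder) minimizes the large-data total loss above over d_I(H) by brute force, which is how one can numerically locate the optimal intrinsic dimension for a given D and N; whether d_I^* shifts mainly with D or with N depends on which of the two regimes dominates.

```python
# Brute-force sketch for Section 3.3.1; all constants are hypothetical
# placeholders chosen only to make the trade-off visible.
import numpy as np

def total_loss(d, N, D, K1=1.0, K2=1.0, lam0=1.0, alpha_z=1.5,
               eta=0.1, sigma_M2=0.01, S2_dIS=8.0):
    model_term = K2**2 * d**2 * N**(-4.0 / d) / (4 * np.pi**2)
    bayes_term = K1**2 * (1 - eta) * lam0 / ((alpha_z - 1) * d**(alpha_z - 1))
    noise_term = N * d / D * sigma_M2 * S2_dIS
    return model_term + bayes_term + noise_term

d_grid = np.arange(2, 400, dtype=float)
for D in [2_000, 20_000, 200_000]:
    for N in [64, 1024]:
        d_star = d_grid[np.argmin(total_loss(d_grid, N, D))]
        print(f"D={D:>7}, N={N:>5} -> optimal d_I(H) ~ {d_star:.0f}")
```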

3.3.2 Optimal Horizon for a relatively small amount of data

If data is scarce, the loss could be written as:

\text{loss}\approx K_{1}^{2}(1-\eta)\frac{\lambda_{0}}{\alpha_{Z}-1}\frac{1}{d_{I}(H)^{\alpha_{Z}-1}}+\frac{K_{1}^{2}}{4\pi}d_{I}(H)D^{-\frac{2}{d_{I}(H)}}

It is worth noticing that in this case, we assume the model is always large enough to find the nearest neighbor of a test sample, hence the loss is independent of N. We can estimate the optimal d_{I}^{*} as:

d_{I}^{*}=C_{s}\frac{\ln D}{\ln\ln D}

where C_{s} is a constant independent of D, N and H. The exact form is provided in Appendix C.2. Compared to the first scenario, the optimal d_{I} changes much less in this scenario.
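A quick numeric illustration (with C_s set to 1, since only the trend matters here) of how slowly this optimum grows with the dataset size D:

```python
# Quick numeric illustration with C_s = 1: the small-data optimum
# d_I* ~ ln(D) / ln(ln(D)) grows very slowly with dataset size D.
import math

for D in [10**3, 10**4, 10**5, 10**6]:
    print(f"D={D:>7} -> ln(D)/ln(ln(D)) = {math.log(D) / math.log(math.log(D)):.2f}")
```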

4 Experiment Results

4.1 Scaling Laws for Dataset Size and Model Width Do Exist for Time Series Forecasting

Figure 1: Data Scaling. The proposed formula loss(D)=A+B/D^{\alpha} fits well. More comparisons with other formulas can be found in Appendix I.
Figure 2: Width Scaling. When the model is not powerful enough, loss(W)=A+B/W^{\alpha} fits well for these situations. When data is scarce, a large model may lead to overfitting, as observed with ModernTCN on ETTm1.

As depicted in Figure 1 and Figure 2, we corroborate the scaling behaviors pertaining to data scaling and model-size scaling across a diverse range of datasets and models. This validation underscores the robustness and versatility of our proposed theoretical framework in the context of time series forecasting.

Here we mainly include NLinear [1]/MLP, ModernTCN [10] and iTransformer [8] as our models, covering Channel-Independent and Channel-Dependent methods as well as FFN-based, Transformer-based and convolution-based methods. For datasets, we mainly use ETTh1, ETTh2, ETTm1, ETTm2, Traffic and Weather [4, 5]. Detailed experiment settings can be found in Appendix E.

It can be seen that for all these models on all these datasets, the dataset-scaling formula loss=C_{D}+1/D^{\alpha_{D}} and the model width-scaling formula (with a large amount of data) loss=C_{W}+1/W^{\alpha_{W}} fit well, thus verifying the original understanding of the scaling law.
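The fits themselves are straightforward; the sketch below (illustrative numbers, not the paper's measurements) shows how such a fit of loss(D)=A+B/D^{\alpha_D} can be performed with scipy to read off the scaling exponent.

```python
# Sketch of the scaling-law fit (illustrative numbers, not the paper's results):
# fit loss(D) = A + B / D**alpha with scipy and read off the exponent alpha.
import numpy as np
from scipy.optimize import curve_fit

def scaling_form(D, A, B, alpha):
    return A + B / D**alpha

D = np.array([0.05, 0.1, 0.2, 0.4, 0.7, 1.0]) * 34_000   # hypothetical dataset sizes
loss = np.array([0.52, 0.47, 0.44, 0.42, 0.41, 0.405])   # hypothetical test MSE

(A, B, alpha), _ = curve_fit(scaling_form, D, loss, p0=[0.4, 10.0, 0.5], maxfev=10_000)
print(f"A={A:.3f}, B={B:.1f}, alpha={alpha:.2f}")
```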

However, in some special cases where the model is large enough to approximate the data, increasing model size does not improve performance, and may even hurt performance in some cases (if the added regularization is not strong enough), as with ModernTCN on ETTm1.

4.2 The Impact of the Horizon on the Final Loss

4.2.1 Optimal Horizon and Training Data Amount

Optimal Horizon grows with more available training data. We conduct experiments, fixing the available training data amount and model size while adjusting the horizon. As shown in Figure 3, an optimal horizon is observed at each data amount, and this optimal value grows with more available training data.

Figure 3: Loss vs. Horizon for a certain amount of training data, for different datasets and different models.

A dataset with larger feature degradation has a smaller optimal horizon. As our theory predicts, a larger \alpha_{Z} leads to a smaller optimal horizon. In the experiment in Figure 4, we use the same Linear model on two different datasets with a comparable amount of training data: the Exchange dataset has 70\% as much available training data as the ETTh1 dataset. We use the eigenvalues obtained by Principal Component Analysis on sampled series as an approximation of the feature variance in the intrinsic space. The detailed procedure and more PCA results can be found in Appendix G.

Figure 4: PCA results under the Channel-Independent and Instance Normalization setting (left); Loss vs. Horizon for a certain amount of training data on Exchange (middle) and ETTh1 (right). The Exchange dataset has 70\% as many training data points as ETTh1. However, since its feature degradation is stronger, the optimal horizon (<30) using 100\% of the Exchange dataset is much smaller than the optimal horizon of the ETTh1 dataset (>300) with only 11\% of the available training data.
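A rough sketch of this PCA check (our own illustration, not the exact procedure of Appendix G): slide length-H windows over a single variable, apply instance normalization, and fit the Zipf exponent \alpha_{Z} from the log-log slope of the covariance eigenvalue spectrum.

```python
# Rough sketch of the PCA check (our own illustration, not the exact Appendix G
# procedure): estimate alpha_Z from the eigenvalue spectrum of length-H windows.
import numpy as np

def zipf_exponent(series_1d, H=96, n_eigs=48):
    # channel-independent: slide a length-H window over a single variable
    windows = np.stack([series_1d[t:t + H] for t in range(len(series_1d) - H)])
    # instance normalization per window
    windows = (windows - windows.mean(axis=1, keepdims=True)) / (
        windows.std(axis=1, keepdims=True) + 1e-8)
    eigvals = np.linalg.eigvalsh(np.cov(windows.T))[::-1][:n_eigs]   # descending
    ranks = np.arange(1, n_eigs + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(eigvals), 1)
    return -slope   # lambda_i ~ lambda_0 * i**(-alpha_Z)

toy = np.sin(0.05 * np.arange(20_000)) + 0.3 * np.random.randn(20_000)
print("estimated alpha_Z:", round(zipf_exponent(toy), 2))
```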

Channel-Dependent and Channel-Independent models are sometimes in different states of the two cases. For CD models, d_{I}(H) is larger and less training data is available, hence they tend to be in the few-data limit. For CI models, d_{I}(H) is smaller and D is larger, hence they may reach the data-dense limit (where the scaling exponent for D is -1).

In Figure 5 below, iTransformer on the Traffic dataset is in the data-dense limit. For MLP on Traffic, when the horizon is small it is in the few-data limit. For the Linear model, since it is simply linear (rather than piece-wise linear), we expect it to be within the data-dense limit even for long horizons and when the dataset is relatively small (like ETTh1).

Figure 5: Data scaling behavior for iTransformer (Channel-Dependent model, left), Norm-MLP (Channel-Independent model, middle) and NLinear (CI, right).

The respective advantages of channel-independent and channel-dependent models can be explained from the perspective of our theory. For channel-dependent models, the horizon limitation is smaller, hence they perform better with plenty of training data. For channel-independent models, more training data is available; moreover, d is smaller, making the scaling exponent larger.

4.2.2 Optimal Horizon vs. Model Size

As predicted by our theory, there are two cases. (1) When N is small, the optimal horizon does not change much with N. (2) When N is large, the model-size scaling term no longer dominates; the coefficient of the noise term \frac{Nd_{I}}{D} dominates, thus a larger N leads to a smaller optimal H.

Figure 6: Loss vs. Horizon for models of different widths: MLP on ETTh1 (left), ModernTCN on ETTh1 (middle) and ModernTCN on Weather (right).

From the observations in Figure 6, it can be discerned that in the first scenario, where an increase in N results in a performance gain, the optimal horizon remains relatively invariant with respect to N. This is exemplified by the SingleLayerMLP model. However, in the second scenario, where N has reached its optimal value (i.e., for a certain horizon, a smaller N surpasses a larger N in performance), a larger N corresponds to a reduced optimal horizon. This phenomenon is evident in the instances involving the ModernTCN model.

As predicted by our theory, we see that dataset size has an impact on optimal horizon, while model size has a less significant influence on it.

5 Discussion and Conclusion

5.1 Limitation and Discussion

Our theory mainly covers time series forecasting, and our experiments verify the proposed theory on some widely used datasets of various sizes. However, these datasets are still small compared to some recently proposed large datasets [25]. The scaling behavior of the horizon predicted in our work remains to be experimentally verified and studied on these large datasets. Moreover, we mainly use popular time series forecasting models from recent work, and these models might be over-designed for a few time series datasets.

Moreover, our work mainly focuses on the in-domain setting, rather than on pretrained-then-finetuned models or foundation models trained on mixed datasets. We discuss whether the assumptions made and our theoretical deduction hold for mixed datasets in Appendix H.

Since we focus on the impact of the horizon, which is an adjustable hyper-parameter for forecasting tasks but a fixed one for other time series tasks, this work does not cover scaling behaviors for other tasks related to time series analysis. The theory of scaling laws in other areas of time series forecasting, as well as for foundational time series forecasting models, remains to be studied. Also, although our theory is compatible with analyzing the effect of the prediction length S, and our current experiments with fixed S verify the impact of H, it is still worth studying the impact of S on the optimal H. In addition, our theory only holds under the assumptions made in Section 3. We leave these to future work. (Meanwhile, NLP and CV tasks feature long-range dependency (low degradation with horizon) and large datasets, hence more detailed data granularity (e.g., context length, image resolution) produces better results in current work. However, since the community is using larger and larger models and data granularity (context length, etc.) while the amount of training data is limited, it is natural to ask: will NLP or CV tasks in the future show patterns similar to time series forecasting, where more detailed data granularity may hurt performance? This is also an interesting question for future work.)

We call for future work to compare different models by 'performance at the optimal look-back horizon' rather than 'performance at a certain look-back horizon' to improve robustness. This work further elucidates that, although widely used in previous work, showing that a model benefits from a longer horizon compared to baseline models does not necessarily prove its superiority over those baseline models.

5.2 Conclusion

In this work, we propose a theoretical framework that contemplates the influence of the horizon on scaling behaviors and the performance of time series forecasting models. We take into account the size of the dataset, the complexity of the model, and the impact of the horizon on the ultimate performance. An expanded horizon results in a diminished Bayesian Error, but it simultaneously complicates the task for a limited dataset to fully encompass the entire space and for a model of restricted size to learn effectively. Furthermore, our empirical investigations corroborate the existence of scaling behaviors in relation to dataset size and model size, and validate our proposed theory concerning the impact of the horizon.

Our theory provides some insight into the area of time series forecasting. For a given dataset, it would be beneficial to design the models and hyperparameters according to the dataset size and the feature degradation property of that particular dataset. Moreover, we think further experiments on larger foundational time series datasets, studying the optimal horizon with respect to the pretraining loss and the loss when transferring to specific datasets, may provide further insight as well.

In conclusion, we aim to provide insights to future work on time series forecasting, emphasizing the importance of the horizon and its potential impact on scaling behaviors in time series forecasting tasks.

Acknowledgments and Disclosure of Funding

We would like to express our great gratitude to Professor Robert M. Haralick for significantly insightful and invaluable help and discussion. We would also like to extend our deep gratitude to Jiaye Teng, whose invaluable help and advice have been instrumental to our progress. In addition, we are profoundly grateful to Professor Yang Yuan for the enlightening discussion we had, which provided significant inspiration and insight into our work.

We would also like to thank Professor Xiaolong Wang and Jiarui Xu at UCSD for inspiring discussions about scaling context length for NLP, and Yu Sun at Stanford for discussions about properties of time series.

We would also like to express our gratitude to CPHOS (https://cphos.cn), an academic non-profit organization dedicated to providing Physics Olympiad Simulations (targeting Physics Olympiads like IPhO) for high school students for free. At CPHOS, we communicate innovative ideas and get to know great academic collaborators, without which this work could not have been carried out.

We would also like to thank reviewers, ACs and PCs from NeurIPS 2024 who have provided valuable suggestions which have helped to improve this work.

References

  • [1] A. Zeng, M. Chen, L. Zhang, and Q. Xu, “Are transformers effective for time series forecasting?” 2022.
  • [2] S.-A. Chen, C.-L. Li, N. Yoder, S. O. Arik, and T. Pfister, “Tsmixer: An all-mlp architecture for time series forecasting,” 2023.
  • [3] Z. Xu, A. Zeng, and Q. Xu, “Fits: Modeling time series with 10k parameters,” 2024.
  • [4] H. Zhou, S. Zhang, J. Peng, S. Zhang, J. Li, H. Xiong, and W. Zhang, “Informer: Beyond efficient transformer for long sequence time-series forecasting,” 2021.
  • [5] H. Wu, J. Xu, J. Wang, and M. Long, “Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting,” 2022.
  • [6] T. Zhou, Z. Ma, Q. Wen, X. Wang, L. Sun, and R. Jin, “Fedformer: Frequency enhanced decomposed transformer for long-term series forecasting,” 2022.
  • [7] Y. Nie, N. H. Nguyen, P. Sinthong, and J. Kalagnanam, “A time series is worth 64 words: Long-term forecasting with transformers,” in International Conference on Learning Representations, 2023.
  • [8] Y. Liu, T. Hu, H. Zhang, H. Wu, S. Wang, L. Ma, and M. Long, “itransformer: Inverted transformers are effective for time series forecasting,” 2024.
  • [9] H. Wu, T. Hu, Y. Liu, H. Zhou, J. Wang, and M. Long, “Timesnet: Temporal 2d-variation modeling for general time series analysis,” 2023.
  • [10] L. donghao and wang xue, “ModernTCN: A modern pure convolution structure for general time series analysis,” in The Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/forum?id=vpJMJerXHU
  • [11] H. Wang, J. Peng, F. Huang, J. Wang, J. Chen, and Y. Xiao, “Micn: Multi-scale local and global context modeling for long-term series forecasting,” 2023.
  • [12] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling laws for neural language models,” 2020.
  • [13] J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. de Las Casas, L. A. Hendricks, J. Welbl, A. Clark, T. Hennigan, E. Noland, K. Millican, G. van den Driessche, B. Damoc, A. Guy, S. Osindero, K. Simonyan, E. Elsen, J. W. Rae, O. Vinyals, and L. Sifre, “Training compute-optimal large language models,” 2022.
  • [14] X. Zhai, A. Kolesnikov, N. Houlsby, and L. Beyer, “Scaling vision transformers,” 2022.
  • [15] Y. Bahri, E. Dyer, J. Kaplan, J. Lee, and U. Sharma, “Explaining neural scaling laws,” 2024.
  • [16] U. Sharma and J. Kaplan, “A neural scaling law from the dimension of the data manifold,” 2020.
  • [17] W. Toner and L. Darlow, “An analysis of linear time series forecasting models,” 2024.
  • [18] Z. Li, S. Qi, Y. Li, and Z. Xu, “Revisiting long-term time series forecasting: An investigation on linear mapping,” 2023.
  • [19] T. Kim, J. Kim, Y. Tae, C. Park, J.-H. Choi, and J. Choo, “Reversible instance normalization for accurate time-series forecasting against distribution shift,” in International Conference on Learning Representations, 2021. [Online]. Available: https://openreview.net/forum?id=cGDAkQo1C0p
  • [20] S. Bai, J. Z. Kolter, and V. Koltun, “An empirical evaluation of generic convolutional and recurrent networks for sequence modeling,” 2018.
  • [21] R. Sen, H.-F. Yu, and I. Dhillon, “Think globally, act locally: A deep neural network approach to high-dimensional time series forecasting,” 2019.
  • [22] E. Parzen, “An Approach to Time Series Analysis,” The Annals of Mathematical Statistics, vol. 32, no. 4, pp. 951 – 989, 1961. [Online]. Available: https://doi.org/10.1214/aoms/1177704840
  • [23] J. G. De Gooijer and R. J. Hyndman, “25 years of time series forecasting,” International journal of forecasting, vol. 22, no. 3, pp. 443–473, 2006.
  • [24] A. Garza, C. Challu, and M. Mergenthaler-Canseco, “Timegpt-1,” 2024.
  • [25] Y. Liu, H. Zhang, C. Li, X. Huang, J. Wang, and M. Long, “Timer: Transformers for time series analysis at scale,” 2024.
  • [26] A. Das, W. Kong, R. Sen, and Y. Zhou, “A decoder-only foundation model for time-series forecasting,” 2024.
  • [27] T. D. P. Edwards, J. Alvey, J. Alsing, N. H. Nguyen, and B. D. Wandelt, “Scaling-laws for large time-series models,” 2024. [Online]. Available: https://arxiv.org/abs/2405.13867
  • [28] N. Gruver, M. Finzi, S. Qiu, and A. G. Wilson, “Large language models are zero-shot time series forecasters,” 2023.
  • [29] C. Chang, W.-Y. Wang, W.-C. Peng, and T.-F. Chen, “Llm4ts: Aligning pre-trained llms as data-efficient time-series forecasters,” 2024.
  • [30] T. Zhou, P. Niu, X. Wang, L. Sun, and R. Jin, “One fits all: Power general time series analysis by pretrained lm,” 2023.
  • [31] M. Jin, S. Wang, L. Ma, Z. Chu, J. Y. Zhang, X. Shi, P.-Y. Chen, Y. Liang, Y.-F. Li, S. Pan, and Q. Wen, “Time-llm: Time series forecasting by reprogramming large language models,” 2024.
  • [32] J. Liu, H. Mao, Z. Chen, T. Zhao, N. Shah, and J. Tang, “Neural scaling laws on graphs,” 2024.
  • [33] P. Zador, “Asymptotic quantization error of continuous signals and the quantization dimension,” IEEE Transactions on Information Theory, vol. 28, no. 2, pp. 139–149, 1982.
  • [34] N. Levi and Y. Oz, “The underlying scaling laws and universal statistical structure of complex datasets,” 2024.
  • [35] S. Petrini, A. Casas-i Muñoz, J. Cluet-i Martinell, M. Wang, C. Bentz, and R. Ferrer-i Cancho, “Direct and indirect evidence of compression of word lengths. zipf’s law of abbreviation revisited,” Glottometrics, vol. 54, p. 58–87, 2023. [Online]. Available: http://dx.doi.org/10.53482/2023_54_407
  • [36] T. M. Buzug, J. von Stamm, and G. Pfister, “Characterising experimental time series using local intrinsic dimension,” Physics Letters A, vol. 202, no. 2-3, pp. 183–190, 1995.
  • [37] T. N. Chan, Z. Li, L. H. U, and R. Cheng, “Plame: Piecewise-linear approximate measure for additive kernel svm,” IEEE Transactions on Knowledge and Data Engineering, vol. 35, no. 10, pp. 9985–9997, 2023.
  • [38] K. D. Ba, P. Indyk, E. Price, and D. P. Woodruff, “Lower bounds for sparse recovery,” in Proceedings of the twenty-first annual ACM-SIAM symposium on Discrete Algorithms.   SIAM, 2010, pp. 1190–1197.
  • [39] X. Qiu, J. Hu, L. Zhou, X. Wu, J. Du, B. Zhang, C. Guo, A. Zhou, C. S. Jensen, Z. Sheng, and B. Yang, “Tfb: Towards comprehensive and fair benchmarking of time series forecasting methods,” 2024.
  • [40] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “Pytorch: An imperative style, high-performance deep learning library,” 2019.
  • [41] H. Liu, Z. Dai, D. R. So, and Q. V. Le, “Pay attention to mlps,” 2021.

Appendix A More about the intrinsic space

A.1 Relationship between intrinsic dimension and horizon

We claimed in Section 3.2.1 that d_{I}(H) should be asymptotically linear in H. Here we give a rough explanation for this. Since the reconstruction error is bounded by a constant value irrelevant to L, the relative error should be O(\frac{1}{L}). That is, the error reconstructing y\in\mathcal{O}(L) from x\in\mathcal{M}(L) should be O(\frac{1}{L}). On the other hand, this error can be viewed as an error caused by compressing a d_{O}(L)-dimensional vector into a d_{I}(L)-dimensional space. From a sparse-recovery perspective, if the covariance matrix in \mathcal{O}(L) has rank r, then the d_{O}(L)-dimensional vector would be equivalent to an r-sparse vector. [38] shows that, in this case, the minimal dimension that can recover it with relative error bounded by C is \tilde{\Omega}(r). Since in our case r\propto L due to intrinsic uncertainty, we have d_{I}(L)=\Omega(L); and since d_{I}(L)\leq d_{O}(L)=O(L), we conclude that d_{I}(L)=\Theta(L).

A.2 A Construction Method

We informally provide a specific method to construct \mathcal{M}(L) theoretically in a recursive manner. If \mathcal{M}(0),\mathcal{M}(1),\ldots,\mathcal{M}(L-1) have been defined, then we define \mathcal{M}(L) as follows: given a time series x_{0,1,\ldots,L-1} of length L, we represent it by concatenating the representation of x_{0,1,\ldots,L-2} in \mathcal{M}(L-1) and x_{L-1} in a space of dimension dim(\mathcal{M}(L-1))+1. Then we find a manifold in this space that these points lie on, to represent \mathcal{M}(L).

A.3 Non-linear truncation

We assumed that the truncating function is close to a linear projection in Section 3.2.1. But in fact the two functions might have some subtle differences. This is because, for a certain feature, estimating the feature with a sequence of finite length L might result in error even if each element in the sequence is accurate enough. (For example, if the feature comes directly from the DFT (Discrete Fourier Transform), there might be 'leakage error' if the feature frequency is not an exact multiple of the base frequency.) This is a systematic error caused by the 'precession', where the feature's characteristic frequency mismatches the measuring frequency.

For any sequence S, let x_{t_{1}:t_{2}}\in\mathcal{M}(t_{2}-t_{1}) denote the vector for the subsequence S[t_{1}:t_{2}]. We can define a 'truncating function' t_{p}[H_{1},H_{2}]:\mathcal{M}(H_{1})\to\mathcal{M}(H_{2}) for any H_{1}\geq H_{2}>0 with t_{p}[H_{1},H_{2}](x_{0:H_{1}})=x_{0:H_{2}}. Let t_{p}^{-1}[H_{1},H_{2}] denote the inverse image function, which maps x\in\mathcal{M}(H_{2}) to \{y\mid y\in\mathcal{M}(H_{1}),t_{p}[H_{1},H_{2}](y)=x\}. Notice that this truncating function is unrelated to the specific sequence. Now, assume that t_{p}[H_{1},H_{2}] is smooth; then from the implicit function theorem we know that, for any vector x\in\mathcal{M}(H_{1}), if t_{p}[H_{1},H_{2}] is locally differentiable in a neighborhood of x, then t_{p}^{-1}[H_{1},H_{2}](t_{p}[H_{1},H_{2}](x)) is a submanifold of \mathcal{M}(H_{1}) with dimension d_{I}(H_{1})-d_{I}(H_{2}), and from Sard's theorem we know that the targets of these locally non-differentiable points have measure zero. That is:

Let S=\{x\mid x\in\mathcal{M}(H_{1}),t_{p}^{-1}[H_{1},H_{2}](x)\notin M^{d_{I}(H_{1})-d_{I}(H_{2})}\}, where M^{d} denotes the set of all d-dimensional manifolds. Let \mu denote a measure defined on \mathcal{M}(H_{1}); then \mu(S)=0.

Hence the impact of such 'critical points' on the total loss is negligible. For simplicity, we can assume S=\emptyset. Then t_{p} can be written as:

t_{p}[H_{1},H_{2}](x)=P[H_{1},H_{2}]g[H_{1},H_{2}](x),

where P[H_{1},H_{2}] is the projection mapping from \mathcal{M}(H_{1}) to \mathcal{M}(H_{2}), and g[H_{1},H_{2}] is an invertible mapping in \mathcal{M}(H_{1}) that maps t_{p}^{-1}(y) to a (d_{I}(H_{1})-d_{I}(H_{2}))-dimensional subspace of \mathcal{M}(H_{1}) for each y\in\mathcal{M}(H_{2}).

We can naturally assume that the 'precession' effect on each subsequence of a certain length l is only related to l and intrinsic properties of the task itself. Then, considering a length-2l sequence, it can be viewed as two independent measurements of length-l subsequences; hence this error is always bounded by O(\frac{1}{\sqrt{L}}). We can formalize this condition as:

\frac{\|g[H_{1},H_{2}](x)-x\|_{2}}{\|g[H_{1},H_{2}](x)\|_{2}}\leq\frac{\kappa}{\sqrt{H_{1}}},\forall x\in\mathcal{M}(H_{1}),

where \kappa is a constant and \|\cdot\|_{2} represents the L2 norm. Also we should have \mathbb{E}[g(x)-x]=0. This results in an extra term in the Bayesian loss. We deduce it again formally from \mathbb{E}[var(t_{p}^{-1}[H](x))].

\mathbb{E}[var(t_{p}^{-1}[H](x))] = \mathbb{E}[var(g^{-1}[H]P^{-1}[H](x))]
\leq \mathbb{E}[\frac{\kappa^{2}}{H}\|y\|_{2}^{2}]+\mathbb{E}[var(P^{-1}(x))]
= \frac{\kappa^{2}}{H}\sum\limits_{0\leq i<d_{I}}\lambda_{i}+\mathbb{E}[var(P^{-1}(x))]

Similarly to what we have done in Section 3.2.3:

\mathbb{E}[var(P^{-1}(x))]=\sum\limits_{d_{I}(H)\leq i<d_{I}}\lambda_{i}

Hence:

\mathbb{E}[var(t_{p}^{-1}[H](x))] \leq \frac{\kappa^{2}}{H}\sum\limits_{0\leq i<d_{I}}\lambda_{i}+\sum\limits_{d_{I}(H)\leq i<d_{I}}\lambda_{i}
\leq \frac{\lambda_{0}}{\alpha_{Z}-1}(\frac{\kappa^{2}\alpha_{Z}}{H}+\frac{1}{d_{I}(H)^{\alpha_{Z}-1}})

which indicates that:

\text{loss}\leq K^{2}(1-\eta)\frac{\lambda_{0}}{\alpha_{Z}-1}(\frac{\kappa^{2}\alpha_{Z}}{H}+\frac{1}{d_{I}(H)^{\alpha_{Z}-1}})+\eta\sigma_{M}^{2}S^{2}d_{I}(S)

Since \kappa should be small, and typically \alpha_{Z}<2 (this holds for all our experiment results), the extra term can be neglected. Hence, it is reasonable to assume t_{p}[H_{1},H_{2}]\approx P[H_{1},H_{2}].

A.4 Reduction of Loss into Intrinsic Space

We prove that, under the assumptions made in Section 3.2.1, the loss in the original space can be linearly bounded by the loss in the intrinsic space.

Suppose we are predicting x[0:S] from x[-H:0]. Let y_{i}\in\mathcal{M}(H) be the intrinsic vector of x[-H:0] and y_{o}\in\mathcal{M}(S) be the intrinsic vector of x[0:S] (the true intrinsic vector). If we have a model m such that:

\mathbb{E}_{x[-H:S]\in\mathcal{O}(H+S)}[|m(y_{i})-y_{o}|_{2}]\leq L_{ins}

where L_{ins} represents the expected error (or loss) in the intrinsic space, then, from assumption 2, we have:

\mathbb{E}_{x[-H:S]\in\mathcal{O}(H+S)}[|\phi^{-1}(m(y_{i}))-\phi^{-1}(y_{o})|_{2}]\leq K_{I}L_{ins}

and from assumption 1 we know that \mathbb{E}_{x[0:S]\in\mathcal{O}(S)}[|\phi^{-1}(y_{o})-x[0:S]|_{2}]\leq e(S). Therefore:

L_{ori} = \mathbb{E}_{x[-H:S]\in\mathcal{O}(H+S)}[|\phi^{-1}(m(y_{i}))-x[0:S]|_{2}]
\leq 2\mathbb{E}_{x[-H:S]\in\mathcal{O}(H+S)}[|\phi^{-1}(m(y_{i}))-\phi^{-1}(y_{o})|_{2}]+2\mathbb{E}_{x[0:S]\in\mathcal{O}(S)}[|\phi^{-1}(y_{o})-x[0:S]|_{2}]
\leq 2K_{I}L_{ins}+2e(S).

Therefore, the loss in the original space is linearly bounded by the loss in the intrinsic space, so we may focus on studying the loss in the intrinsic space.

Appendix B Details about the loss deduction

B.1 Boundary

Recall that in Section 3.2.4 we mentioned that there are two scenarios in which the model works in totally different ways. The boundary between the two scenarios depends not only on the dataset size D but also on the model size and the prediction horizon. A larger model subdivides the intrinsic space into smaller subregions, each containing fewer training samples, with the number of samples per subregion inversely related to model size, i.e., proportional to N^{-1}.

In time series forecasting, a unique characteristic is that a single sequence, containing multiple length-H subsequences, counts as multiple samples. However, these subsequences are not fully independent due to their proximity in the intrinsic space. To quantify independence, we define a separation function f(H), assumed to be proportional to H, meaning that a sequence of total length L represents \frac{L}{f(H)} independent samples. Thus, the effective number of independent samples in the dataset scales with \frac{D}{H}.

Combining the above analysis, we can define a hyperparameter \xi=\frac{D}{NH}. If \xi is large enough, then we can assume data is sufficient, and the model tends to learn a linear approximation for each subregion; if \xi is small, we can assume data is scarce and the model tends to find the nearest neighbour for each test sample.

B.2 Sufficient Data

We formally derive the total loss in the scenario where we have a large amount of data. First recall that the Bayesian loss calculated in section 3.2.3 is:

L_{Bayesian}\approx K_{1}^{2}(1-\eta)\frac{\lambda_{0}}{(\alpha_{Z}-1)d_{I}(H)^{\alpha_{Z}-1}}+\eta\sigma_{M}^{2}S^{2}d_{I}(S)

As we assume S is fixed, \eta\sigma_{M}^{2}S^{2}d_{I}(S) is constant and hence can be ignored. Now let us consider the approximation loss; it comes from two sources: the limited model size results in uncertainty within the subregions, and the limited dataset size brings noise into the data. Let us consider these two causes separately.

B.2.1 Model Size Limitation

We assume that a model separates the intrinsic space into N regions and makes a linear prediction for each region. It is easy to verify that the optimal prediction from \mathcal{M}(H) to \mathcal{M}(S) should be first-order K_{1}-Lipschitz and second-order K_{2}-Lipschitz, too.

For any region with volume V, the loss can be estimated as:

L=\int_{V}\text{d}^{d_{I}(H)}x|f(x)-l(x)|^{2}

where l is the predicted linear function for the region. Expanding with Taylor's series, we have f(x)\approx l(x)+\frac{1}{2}(f^{\prime}(x)-f^{\prime}(\mu_{V}))(x-\mu_{V}). From the second-order Lipschitz condition, we can conclude that:

|f(x)-l(x)|^{2}\leq K_{2}^{2}(x-\mu_{V})^{4}

The total loss could be bounded as follows:

\text{loss}\leq K_{2}^{2}\int f(x)(x-Q(x))^{4}\text{d}x

where Q(x) represents the center of the region that x is in and f(x) represents the probability density. We can find that the integral in the formula is exactly the 4th-power distortion of the quantization. For an arbitrary quantization with code density g(x), [33] gives an exact value of the distortion equal to:

\int f(x)(x-Q^{*}(x))^{4}\text{d}x=\pi^{-2}N^{-\frac{4}{d_{I}(H)}}\Gamma(\frac{d_{I}(H)+4}{d_{I}(H)})(\Gamma(\frac{d_{I}(H)+2}{2}))^{\frac{4}{d_{I}(H)}}\int\frac{f(x)}{g(x)^{\frac{4}{d_{I}(H)}}}\text{d}x

Assume d_{I}(H) is sufficiently large; then \Gamma(\frac{d_{I}(H)+4}{d_{I}(H)})\approx 1. So the expression can be transformed into:

\int f(x)(x-Q^{*}(x))^{4}\text{d}x\approx\frac{d_{I}(H)^{2}N^{-\frac{4}{d_{I}(H)}}}{4\pi^{2}}\int\frac{f(x)}{g(x)^{\frac{4}{d_{I}(H)}}}\text{d}x

For a uniform distribution, g(x)=1 (as we assume \mathcal{M}(H)=[0,1]^{d_{I}(H)}). Hence:

\int\frac{f(x)}{g(x)^{\frac{4}{d_{I}(H)}}}\text{d}x=1

And the total loss could be bounded as:

\text{loss}\leq K_{2}^{2}\frac{d_{I}(H)^{2}N^{-\frac{4}{d_{I}(H)}}}{4\pi^{2}}

From this bound, we can see that a larger horizon implies a larger burden for the model, which might lead to worse performance for the model.

B.2.2 Dataset Limitation

Following the previous section, as we assume the mapping f represented by the model is a piecewise linear function, regionally it can be written in the usual form of the Gauss-Markov linear model, assuming there is homoscedastic uncorrelated noise \varepsilon for every single sample in the dataset. Now we show that this noise causes another term of loss even compared to the optimal performance of the model. We first analyze a single region containing D_{i} samples. Consider X, Y from the training set:

Y=X\beta+\varepsilon.

where X is of shape D_{i}\times d_{I}(H) and Y,\varepsilon are of shape D_{i}\times d_{I}(S). \beta is the variable to learn, of shape d_{I}(H)\times d_{I}(S). The BLUE estimator \hat{\beta} should be \hat{\beta}=(X^{T}X)^{-1}X^{T}Y.

Now consider X^{\prime},Y^{\prime} from the test set:

Y^{\prime} = X^{\prime}\beta+\varepsilon^{\prime}
L_{recon} = E[(Y^{\prime}-X^{\prime}\hat{\beta})^{2}]
= E[(\varepsilon^{\prime}-X^{\prime}(X^{T}X)^{-1}X^{T}\varepsilon)^{2}]
= E[\varepsilon^{\prime 2}]+E[(X^{\prime}(X^{T}X)^{-1}X^{T}\varepsilon)^{2}]

The first term goes to 0 if the test set is sufficiently large, and it is a constant given a fixed test set, so we can simply ignore it in our analysis. Considering the second term, we can show that:

\mathbb{E}[(X^{\prime}(X^{T}X)^{-1}X^{T}\varepsilon)^{2}]=D_{i}^{-1}d_{I}(H)\varepsilon_{y}^{2}
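This step can be sanity-checked with a quick Monte-Carlo simulation under the stated Gauss-Markov assumptions (our own sketch; sigma2 below plays the role of \varepsilon_{y}^{2}): the extra test error from fitting on noisy data scales like d_{I}(H)/D_{i} times the noise variance.

```python
# Monte-Carlo sanity check (our own sketch) of the Gauss-Markov step above:
# the extra test error scales like d_I(H) / D_i times the noise variance.
import numpy as np

rng = np.random.default_rng(0)
d, Di, n_test, sigma2, trials = 20, 400, 1000, 0.25, 200

extra = []
for _ in range(trials):
    X = rng.standard_normal((Di, d))
    eps = rng.normal(0.0, np.sqrt(sigma2), size=(Di, 1))
    beta_err = np.linalg.solve(X.T @ X, X.T @ eps)   # (X^T X)^{-1} X^T eps
    X_test = rng.standard_normal((n_test, d))
    extra.append(np.mean((X_test @ beta_err) ** 2))

print("simulated extra error :", np.mean(extra))
print("d_I(H)/D_i * sigma^2  :", d / Di * sigma2)
```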

Recall that the noise consists of two sources:

Y=X\beta+F[S](X^{*}-P_{c}[H]^{-1}(X))+\varepsilon_{B}

As we’ve done in section 3.2.3, we can derive that:

\text{var}(X^{*}-P_{c}[H]^{-1}(X)) \approx \sum\limits_{d_{I}(H)\leq i<d_{I}}\lambda_{i} \approx \frac{\lambda_{0}}{(\alpha_{Z}-1)d_{I}^{\alpha_{Z}-1}(H)}

As $F[S]$ is $K_{1}$-Lipschitz, this indicates that:

\varepsilon_{h}\leq\frac{K_{1}^{2}\lambda_{0}}{(\alpha_{Z}-1)d_{I}^{\alpha_{Z}-1}(H)}

As mentioned in section 3.2.3, $\text{var}(\varepsilon_{B})=\sigma_{M}^{2}S^{2}d_{I}(S)$; hence we can bound the loss associated with the region containing $D_{i}$ samples by:

\text{loss}_{i}\leq D_{i}^{-1}d_{I}(H)\left(\sigma_{M}^{2}S^{2}d_{I}(S)+\frac{K_{1}^{2}\lambda_{0}}{(\alpha_{Z}-1)d_{I}^{\alpha_{Z}-1}(H)}\right)

It is worth noting that, although different regions have their own distributions and hence possibly different eigenvalues, the loss expression does not depend on the eigenvalues, so we can ignore this issue. Summing over all regions of the dataset, the total loss should be:

\begin{aligned}
\text{loss} &\approx \frac{\sum_{i=1}^{N}\text{loss}_{i}\cdot D_{i}}{D}\\
&= \frac{Nd_{I}(H)}{D}\left(\sigma_{M}^{2}S^{2}d_{I}(S)+\frac{K_{1}^{2}\lambda_{0}}{(\alpha_{Z}-1)d_{I}^{\alpha_{Z}-1}(H)}\right)
\end{aligned}

This indicates that a larger model is more prone to being affected by noise in the dataset; in particular, when the data are noisy and the dataset size is limited, our theory suggests that a smaller model may be preferable.

B.2.3 Total loss

Combining the above approximation loss, the Bayesian loss, and the dataset-noise term, the total loss is:

\text{loss}\approx K_{2}^{2}\frac{d_{I}(H)^{2}N^{-\frac{4}{d_{I}(H)}}}{4\pi^{2}}+K_{1}^{2}(1-\eta)\frac{\lambda_{0}}{\alpha_{Z}-1}\frac{1}{d_{I}(H)^{\alpha_{Z}-1}}+\frac{Nd_{I}(H)}{D}\sigma_{M}^{2}S^{2}d_{I}(S)
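For intuition, the following minimal numerical sketch (illustrative constants, not values fitted to our experiments) evaluates this combined loss as a function of $d_{I}(H)$ and locates the minimizer by grid search, showing that an interior optimal horizon emerges:

import numpy as np

def total_loss(d, N, D, K1=1.0, K2=1.0, lam0=1.0, eta=0.5,
               alpha_Z=2.0, sigma_M=0.1, S=1.0, d_S=16):
    approx = K2**2 * d**2 * N**(-4.0 / d) / (4 * np.pi**2)                 # quantization term
    bayes = K1**2 * (1 - eta) * lam0 / ((alpha_Z - 1) * d**(alpha_Z - 1))  # Bayesian term
    noise = N * d / D * sigma_M**2 * S**2 * d_S                            # dataset-noise term
    return approx + bayes + noise

d_grid = np.arange(2, 500)
losses = total_loss(d_grid, N=1e4, D=1e6)
print("optimal d_I(H) on the grid:", d_grid[np.argmin(losses)])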

B.3 Scarce Data

As the model finds the nearest neighbor, the approximation loss in this scenario consists of two sources: the distance from the test sample to its nearest neighbor and the noise in that nearest neighbor. If the distribution is smooth, then locally around any point $x$ the distribution of other points can be viewed as uniform when $D$ is large.

Consider a ball of radius $r$ around a test sample; the probability that no training point lies in the ball is approximately:

P(r)\approx e^{-DV(r)}=e^{-DC_{d}r^{d_{I}(H)}}

where $C_{d}=\frac{\pi^{\frac{d_{I}(H)}{2}}}{\Gamma(d_{I}(H)/2+1)}$ denotes the volume of the unit ball.

From this, we can derive that:

\mathbb{E}_{x}[r^{2}]\approx\frac{(\Gamma(\frac{d_{I}(H)}{2}+1))^{\frac{2}{d_{I}(H)}}\Gamma(\frac{2}{d_{I}(H)})}{\pi d_{I}(H)}D^{-\frac{2}{d_{I}(H)}}f(x)^{\frac{2}{d_{I}(H)}}

When $d_{I}(H)$ is large, we obtain:

\mathbb{E}[r^{2}]\approx\frac{d_{I}(H)}{4\pi}D^{-\frac{2}{d_{I}(H)}}

which means:

\text{loss}\approx\frac{K_{1}^{2}d_{I}(H)}{4\pi}fD^{-\frac{2}{d_{I}(H)}}

Combining this with the Bayesian loss, as in the previous scenario, the total loss should be:

\text{loss}\approx K_{1}^{2}(1-\eta)\frac{\lambda_{0}}{\alpha_{Z}-1}\frac{1}{d_{I}(H)^{\alpha_{Z}-1}}+\frac{K_{1}^{2}}{4\pi}d_{I}(H)D^{-\frac{2}{d_{I}(H)}}

Appendix C Optimal Horizon deduction

Continuing from section B, we study the optimal horizon in two scenarios.

C.1 Sufficient Data

We first study the case where the dataset is large, i.e., the setting of section B.2.

C.1.1 Small Model

If the model size is small compared to the dataset size, i.e. $N=o(D^{\frac{d_{I}(H)}{d_{I}(H)+4}})$ (while we still assume $N$ is sufficiently large compared to the other variables), then the effect of the dataset size on the optimal horizon can be neglected. In this case, the optimal value of $d_{I}(H)$ is independent of the dataset size $D$ and should satisfy:

d_{I}^{*}=\left(\frac{K_{1}^{2}\pi^{2}(1-\eta)\lambda_{0}N^{\frac{4}{d_{I}^{*}}}}{K_{2}^{2}\ln N}\right)^{\frac{1}{\alpha_{Z}}}

Solving the formula we get the following result:

d_{I}^{*}=\mathcal{W}\left(\frac{4}{\alpha_{Z}C_{0}^{\frac{1}{\alpha_{Z}}}}\ln^{1+\frac{1}{\alpha_{Z}}}N\right)\approx\frac{4}{\alpha_{Z}C_{0}^{\frac{1}{\alpha_{Z}}}}\ln^{1+\frac{1}{\alpha_{Z}}}N

where $\mathcal{W}(\cdot)$ is the Lambert W function and $C_{0}=\frac{K_{1}^{2}\pi^{2}(1-\eta)\lambda_{0}}{K_{2}^{2}}$.

We can see that in this case $d_{I}^{*}$ is not only independent of $D$, but also changes very slowly with $N$.
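A minimal sketch (illustrative constants) of evaluating this closed form with SciPy's Lambert W implementation; it illustrates how slowly the predicted $d_{I}^{*}$ grows with $N$:

import numpy as np
from scipy.special import lambertw

def optimal_dI_small_model(N, K1=1.0, K2=1.0, eta=0.5, lam0=1.0, alpha_Z=2.0):
    C0 = K1**2 * np.pi**2 * (1 - eta) * lam0 / K2**2
    arg = 4.0 / (alpha_Z * C0**(1.0 / alpha_Z)) * np.log(N)**(1 + 1.0 / alpha_Z)
    return lambertw(arg).real            # Lambert-W closed form for d_I^*

for N in [1e4, 1e6, 1e8]:
    print(N, optimal_dI_small_model(N))  # grows only slowly with N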

C.1.2 Large Model

On the other hand, if we assume the model is large enough that $N=\omega(D^{\frac{d_{I}(H)}{d_{I}(H)+4}})$, then the noise effect dominates the choice of the optimal $H$. In this case, the optimal value of $d_{I}(H)$ should be:

d_{I}^{*}=\left(\frac{K_{1}^{2}(1-\eta)\lambda_{0}D}{N\sigma_{M}^{2}S^{2}d_{I}(S)}\right)^{\frac{1}{\alpha_{Z}}}

We can see that in this case the optimal $d_{I}$ changes rapidly with $D$ and $N$. Moreover, the optimal horizon increases with a larger dataset size $D$ and decreases with a larger model size $N$ (this is counter-intuitive, but it matches our experimental results surprisingly well).

C.2 Scarce Data

This is the case of section B.3. We can directly solve $\frac{\partial\text{loss}}{\partial d_{I}(H)}=0$, which is equivalent to:

\frac{4\pi(1-\eta)\lambda_{0}}{{d_{I}^{*}}^{\alpha_{Z}}}=D^{-\frac{2}{d_{I}^{*}}}\left(1+\frac{2\ln D}{d_{I}^{*}}\right)

We make the ansatz $d_{I}^{*}=\beta C^{\frac{1}{\alpha_{Z}-1}}\frac{\ln D}{\ln\ln D}$, where $C=4\pi(1-\eta)\lambda_{0}$ (a rough estimate is $d_{I}^{*}\approx\frac{2}{\alpha_{Z}+1}\frac{\ln D}{\ln\ln D}$). Substituting $d_{I}^{*}$ into the equation above, we get:

(\alpha_{Z}-1)\beta^{2}+\beta-\frac{2}{C^{\frac{1}{\alpha_{Z}-1}}}=0

Hence we have:

\beta=\frac{\sqrt{1+\frac{8(\alpha_{Z}-1)}{C^{\frac{1}{\alpha_{Z}-1}}}}}{2(\alpha_{Z}-1)}

and $d_{I}^{*}$ should be:

d_{I}^{*}=\frac{\sqrt{1+\frac{8(\alpha_{Z}-1)}{C^{\frac{1}{\alpha_{Z}-1}}}}}{2(\alpha_{Z}-1)}C^{\frac{1}{\alpha_{Z}-1}}\frac{\ln D}{\ln\ln D}\propto\frac{\ln D}{\ln\ln D}

Compared to the first scenario (sufficient data), the optimal $d_{I}$ changes very little here, growing only as $\ln D/\ln\ln D$.
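A minimal numerical check (illustrative constants) of this scenario: minimizing the scarce-data loss from section B.3 by grid search over $d_{I}(H)$ shows that the optimum indeed grows only very slowly with $D$:

import numpy as np

def scarce_loss(d, D, K1=1.0, eta=0.5, lam0=1.0, alpha_Z=2.0):
    bayes = K1**2 * (1 - eta) * lam0 / ((alpha_Z - 1) * d**(alpha_Z - 1))
    nearest = K1**2 / (4 * np.pi) * d * D**(-2.0 / d)   # nearest-neighbour term
    return bayes + nearest

d_grid = np.arange(2, 300)
for D in [1e4, 1e6, 1e8, 1e10]:
    print(D, d_grid[np.argmin(scarce_loss(d_grid, D))])  # optimal d_I(H) grows very slowly with D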

Appendix D Details about model’s partitioning

As mentioned in section 3.2.4, we assume the model partitions the intrinsic space uniformly, but better partitions exist. As [33] states, the optimal partition should satisfy $g(x)\propto f(x)^{\frac{d_{I}(H)}{d_{I}(H)+4}}$, and if the distribution is Gaussian then this loss term should satisfy:

L_{p}\approx K_{2}^{2}\frac{\lambda_{0}^{2}N^{-\frac{4}{d_{I}(H)}}}{e^{2}d_{I}(H)^{2\alpha-2}}

Replacing the corresponding term with this, we see that in the case of section C.1.2 this term does not matter, since the noise has the dominant impact. However, in the case of section C.1.1, if we assume the dataset size is infinite, then the loss is asymptotically monotonically decreasing in $d_{I}(H)$. Thus, the larger the horizon we use, the better the performance the model can potentially reach.

However, experiments show that an optimal horizon always exists in this case. This might suggest that the model structure prevents it from finding the best partitioning. Hence, we expect that with a good enough model structure (which may only exist theoretically) and enough data, we should be able to make full use of the horizon regardless of model size or computational resources.

Appendix E Experiment settings

E.1 Dataset

Dataset Dim Pred Len Dataset Size (Train, Val, Test) Frequency Information
ETTh1 7 192 (8545, 2881, 2881) Hourly Electricity
ETTh2 7 192 (8545, 2881, 2881) Hourly Electricity
ETTm1 7 192 (34465, 11521, 11521) 15 min Electricity
ETTm2 7 192 (34465, 11521, 11521) 15 min Electricity
Exchange 8 192 (5120, 665, 1422) Daily Economy
Weather 21 192 (36792, 5271, 10540) 10 min Weather
ECL 321 192 (18317, 2633, 5261) Hourly Electricity
Traffic 862 192 (12185, 1757, 3509) Hourly Transportation
Table 1: Dataset used for experiments

We conduct experiments on the 8 datasets listed here. ETTh1, ETTh2, ETTm1 and ETTm2 [4] contain 7 factors of electricity transformers from July 2016 to July 2018 at the listed frequencies. Exchange [5] includes panel data for daily exchange rates of 8 countries across 27 years. Weather [5] includes 21 meteorological factors from a weather station. ECL [5] records the electricity consumption of 321 clients. Traffic [5] records hourly road occupancy rates over 2 years on San Francisco Bay Area freeways. For all experiments, the prediction length is set to 192.

In our experiments, we run 3 iterations for ETTh1, ETTh2, ETTm1, ETTm2, Weather and Exchange, and draw graphs with error bars equal to the standard error over these iterations. For the Weather dataset the error is already minor (see Appendix E.3), and for larger datasets including Traffic and Electricity the error is even smaller. Therefore, we run 1 iteration for Traffic and ECL, since they are large and introduce only a minor standard error.

E.2 Drop-last issue

The drop-last issue has been reported by several researchers [3, 39]: in some previous work, evaluating the model on the test set with drop_last=True may cause additional errors that depend on the test batch size. In our experiments, we set drop_last to False to avoid this issue.
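A minimal sketch of the evaluation-loader setting (the dataset here is a random stand-in, not one of our benchmarks): with drop_last=True part of the test windows are silently skipped, while drop_last=False evaluates every window regardless of batch size.

import torch
from torch.utils.data import DataLoader, TensorDataset

# Random stand-in for a test split: 1000 windows, look-back 336, prediction length 192.
test_set = TensorDataset(torch.randn(1000, 336), torch.randn(1000, 192))
for drop_last in (True, False):
    loader = DataLoader(test_set, batch_size=32, shuffle=False, drop_last=drop_last)
    n_eval = sum(x.shape[0] for x, _ in loader)
    print(drop_last, n_eval)   # True -> 992 windows scored, False -> all 1000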

E.3 Time-related Domain Difference in Datasets: can be neglected to a certain extent

We use the first $p$ percent of the training dataset as the available training data. This may introduce an error caused by time-index-related domain differences in these datasets. To mitigate this we utilize Instance Normalization [19]. Our further experiments on the Weather dataset show that this error is minor compared to the randomness introduced by the training procedure. We use an NLinear model to forecast 192 frames based on a look-back window of size 336, using 10% of the training data taken from different slices of the training set. For each slice, we repeat the experiment 3 times. The results are:

MSE starting point = 20% 40% 60% 80%
Exp1 0.223 0.220 0.222 0.223
Exp2 0.221 0.221 0.223 0.224
Exp3 0.221 0.221 0.223 0.221
Avg 0.222±0.001 0.221±0.001 0.223±0.001 0.223±0.001
Table 2: MSE measured for weather dataset

It can be seen that the variance introduced by different slicing points is very small.

E.4 Models and Experiment settings

For all models we apply instance normalization, which normalizes an input sequence based on its mean and standard deviation. All deep learning networks are implemented in PyTorch [40] and trained on a cluster with NVIDIA RTX 3080, RTX 3090, RTX 4090D and A100 40GB GPUs. Although we use multiple types of GPUs, all experiments can be carried out on a single GPU with at least 24 GB of memory. The experiments in this work cost about $O(10^{3})$ RTX 3090 GPU hours in total.

E.4.1 Linear Models

For linear models, we apply a linear layer after instance normalization and de-normalize its output, equivalent to the NLinear model [1], for all datasets and all experiments (a minimal sketch follows). The batch size is chosen from {1024, 2048, 4096, 8192, 16384}, the learning rate from {0.003, 0.001} and the weight decay from {0.0005, 0.005, 0.001, 0.0001}; we decay the learning rate by a factor of {0.96, 0.97, 0.98} per epoch for at most 100 epochs with patience at most 30. For larger datasets including Electricity, Traffic and Weather, the learning rate and weight decay barely make a difference to the result. For relatively small datasets including ETTh1/ETTh2, ETTm1/ETTm2 and Exchange, larger weight decay may improve performance.
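A minimal sketch of this model (assuming channel-independent inputs and mean/std instance normalization as described above; this is not our exact training code):

import torch
import torch.nn as nn

class NormalizedLinear(nn.Module):
    """One linear layer wrapped in mean/std instance normalization."""
    def __init__(self, look_back: int, pred_len: int):
        super().__init__()
        self.proj = nn.Linear(look_back, pred_len)

    def forward(self, x):                        # x: (batch, look_back), channel-independent
        mean = x.mean(dim=-1, keepdim=True)
        std = x.std(dim=-1, keepdim=True) + 1e-5
        y = self.proj((x - mean) / std)          # forecast in the normalized space
        return y * std + mean                    # de-normalize the prediction

model = NormalizedLinear(look_back=336, pred_len=192)
print(model(torch.randn(8, 336)).shape)          # torch.Size([8, 192])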

E.4.2 MLP

We use two types of MLPs. For the data scaling experiments (including Horizon X Dataset), we use the gated MLP [41]: $\text{hidden}(x)=W_{1}(x)\odot\text{sigmoid}(W_{2}(x))$, where $\odot$ is the element-wise product, to improve the capability of our model. For the width scaling (and Horizon X Width) experiments, we use a plain MLP with leaky ReLU as the activation function.
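A minimal sketch of the gated block described above (dimensions illustrative, layer stacking and training omitted):

import torch
import torch.nn as nn

class GatedMLPBlock(nn.Module):
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim)
        self.w2 = nn.Linear(dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, dim)

    def forward(self, x):
        hidden = self.w1(x) * torch.sigmoid(self.w2(x))   # element-wise gating
        return self.out(hidden)

block = GatedMLPBlock(dim=768, hidden_dim=2 * 768)
print(block(torch.randn(4, 768)).shape)                   # torch.Size([4, 768])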

The gated MLP is used on the Traffic, Weather and Electricity datasets. On these datasets we use a 3-layer MLP with model dimension 768 and hidden dimension 2*768. The learning rate is set to 0.001 and the weight decay is chosen from {0.00001, 0.0005}.

For the single-layer MLP that we use to study the effect of model size scaling, we use the ReLU activation function, and the model width denotes the hidden dimension of the single hidden layer. For all datasets, the learning rate is set to 0.003 and the weight decay to 0.0005.

The batch size is set to 4096 or 8192 in these cases.

E.4.3 iTransformer

We follow the original code provided by iTransformer [8].

For the dataset size scaling and (Horizon X Dataset Size) experiments we use 4 iTransformer encoder blocks of dimension 512 for Traffic, Weather and Electricity. For the other, smaller datasets, we use 1 block of dimension 192.

For the model size scaling experiments we use 2 iTransformer encoder blocks for large datasets including Traffic, Weather and ECL, and 1 block for smaller datasets. The width refers to both the model dimension and the feedforward dimension.

The learning rate is set to 0.003 for small datasets and 0.0003 for large datasets. We decay the learning rate by a factor of 0.97 or 0.98 every epoch, and run the training procedure for at most 100 epochs with patience 10 for small datasets, and 40 epochs with patience 10 for larger datasets including Traffic, Weather and Electricity. The batch size is chosen from {12, 20, 32}, mainly to fit into GPU memory.

E.4.4 ModernTCN

We follow the original code provided by ModernTCN [10]. We use the settings recommended by the authors of ModernTCN except for the horizon, the channel dimension and the training procedure. Since ModernTCN has many hyperparameters that can be flexibly tuned, we do not repeat the unchanged ones here; please refer to the original ModernTCN paper and code for a more detailed description. Below we only list the hyperparameters that differ.

For the Dataset Size Scaling and (Horizon X Dataset Size) experiments, we only modify the look-back horizon. For the training procedure, we use a learning rate in {0.00005, 0.0001}. We train for 100 epochs with patience 20 for small datasets and 40 epochs with patience 10 for larger datasets.

For the Model Size Scaling and (Horizon X Model Size) experiments, we only modify the look-back horizon as well as the channel dimension and the depth-wise dimensions in the code. Again the learning rate is chosen from {0.00005, 0.0001}, and we train for 100 epochs with patience 20 for small datasets and 40 epochs with patience 10 for larger datasets. The batch size is chosen from {16, 64}, following the original hyperparameter settings given by the authors of ModernTCN.

Appendix F Downsampling May Improve Performance: the power of patching, low-pass filters, and downsampling

Previous work has demonstrated the success of patching [7] and low-pass filters [3] for time series prediction. These can be explained by our theory under the assumption that high-frequency features are the least important ones: by filtering out high-frequency features, we expose the model to the most important dimensions of the intrinsic space.

From a more detailed theoretical perspective, downsampling can be viewed as a projection onto a subspace of the intrinsic space and thus has an effect similar to decreasing the horizon. Experimental results (from previous works such as FITS [3] and PatchTST [7], and from Figure 7) suggest that the retained low-frequency subspace tends to align with the large-eigenvalue directions, i.e., the 'invisible' dimensions masked by the projection tend to be the less important ones. Although the precise effect of downsampling is unknown and may require further assumptions for an exact treatment, it is reasonable to approximate its effect as a projection onto the first $d_{eff}<d_{I}(H)$ dimensions of the intrinsic space. Under this assumption, the overall loss can be expressed as:

\text{loss}_{new}=L(d_{eff})\quad\text{where }d_{eff}<d_{I}(H).

Hence, if the original intrinsic dimension is larger than the (local) optimal $d^{*}$, so that $\partial\text{loss}/\partial d>0$ at $d_{I}(H)$, reducing $d$ from $d_{I}(H)$ to $d_{eff}$ helps reduce the loss. Otherwise, if $d_{I}(H)$ is already smaller than the (local) optimal $d^{*}$, meaning $\partial\text{loss}/\partial d<0$, we would expect no performance improvement from reducing $d$ from $d_{I}(H)$ to $d_{eff}$.

To validate this idea, we conduct a similar experiment: we downsample the input and use the downsampled sequence as the input sequence. We experiment on the Traffic and Weather datasets with a look-back horizon of 336 and a prediction length of 192, using a 4-layer 512-dim gated MLP for Traffic and a 2-layer 192-dim gated MLP for Weather. We can see that downsampling improves performance on the Weather dataset, and the optimal downsampling ratio also changes with the training dataset size. The explanation is similar to how we explain the growth of the optimal horizon with the training dataset: more training data leads to less noise on the higher dimensions, so seeing more of these dimensions (i.e., filtering out fewer dimensions by downsampling to a longer interpolate length) can improve performance.
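A minimal sketch (not the exact experiment code) of the downsampling step we describe: the look-back window is interpolated down to interpolate-to-len before being fed to the forecaster.

import torch
import torch.nn.functional as F

def downsample(x, target_len):      # x: (batch, channels, look_back)
    return F.interpolate(x, size=target_len, mode='linear', align_corners=False)

window = torch.randn(8, 1, 336)     # channel-independent look-back windows
print(downsample(window, target_len=168).shape)   # torch.Size([8, 1, 168])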

Meanwhile, for the Traffic dataset, downsampling may hurt performance. This is because Traffic has a large amount of training data, and 336 is smaller than the optimal horizon. In this regime, none of the dimensions in the intrinsic space are dominated by noise, hence filtering them out hurts performance, and we would expect performance to improve as the interpolate length increases.

Refer to caption
Figure 7: MSE loss vs. interpolate-to-len (length after downsampling) for the Traffic and Weather datasets for a certain amount of training data.

Appendix G PCA settings

We use PCA to obtain eigenvalues that approximate the importance of features in the intrinsic space. We conduct PCA on (all or part of) the training samples (after instance normalization), mainly for the Channel-Independent case. For the PCA used in Figure 4 and Figure 8, the horizon is set to 2000 for both datasets and we perform PCA on all training samples. We provide more PCA results in Figure 8: if we approximately view the PCA eigenvalues as feature importances, then they indeed follow a Zipf law approximately.
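A minimal sketch of this PCA procedure on synthetic stand-in windows (reduced sizes for speed; the real spectra are the ones reported in Figure 8): collect instance-normalized look-back windows, compute the covariance eigenvalues, and fit a power-law (Zipf) exponent on a log-log scale.

import numpy as np

rng = np.random.default_rng(0)
windows = rng.standard_normal((2000, 500))           # stand-in for instance-normalized windows
windows = (windows - windows.mean(1, keepdims=True)) / windows.std(1, keepdims=True)

centered = windows - windows.mean(0)
eigvals = np.linalg.svd(centered, compute_uv=False) ** 2 / len(windows)  # PCA eigenvalues

k = np.arange(1, 201)                                # fit the leading 200 components
slope, _ = np.polyfit(np.log(k), np.log(eigvals[:200]), 1)
print("estimated Zipf exponent alpha_Z:", -slope)    # slope of the log-log spectrum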

Refer to caption
Figure 8: PCA results for $L=2000$ under the Channel-Independent setting on some datasets.

We further conduct PCA on intermediate vectors of the iTransformer [8] model; the results further validate the Zipf assumption on the intrinsic space for non-linear, channel-dependent multivariate cases, as shown in Figure 9.

Refer to caption
Figure 9: PCA obtained from iTransformer features. Zipf-law fitting achieves $p>0.05$ in all cases. Note that it is hard for deep-learning methods (like iTransformer) to learn features with small covariance well; in particular it may be hard to learn the features of the Exchange dataset (whose feature importance degrades very fast), hence the seemingly poor fit for Exchange is more likely caused by under-training of the deep learning model.

Appendix H Foundation Models for Mixed Datasets

Our theory holds without modification as long as the dataset itself follows a Zipf distribution in the intrinsic space. This assumption is fairly natural given that the Zipf law is a common natural distribution, and the datasets we examined earlier that follow the Zipf law (like Traffic, ETT, etc.) are themselves composed of smaller datasets. Moreover, we provide further analysis, from both theoretical and experimental perspectives, of why it holds for mixed datasets.

H.1 Zipf distribution for Mixed Datasets: a toy model

Theoretically, suppose a large dataset is composed of $s$ sub-datasets of similar size, each following the Zipf law with degradation coefficients $\alpha_{1}<\alpha_{2}<\ldots<\alpha_{s}$: the $i$-th sub-dataset has size $S_{i}$ and eigenvalues $\lambda_{ij}=A_{i}/j^{\alpha_{i}}$, where $\lambda_{ij}$ denotes the $j$-th largest eigenvalue of the $i$-th dataset. Suppose the intrinsic dimensions are orthogonal to each other (hence the PCA components are orthogonal). A simple assumption is that the new intrinsic space is the direct product of the old intrinsic spaces, so its eigenvalues are the union of the old eigenvalues. In that case, an eigenvalue of value $S$ should be the $idx_{total}$-th largest eigenvalue, where:

idx_{total}=\sum_{i}idx_{i}=\sum_{i}(A_{i}/S)^{1/\alpha_{i}}.

When $S$ is small (correspondingly, when $idx_{total}$ is large), this sum is dominated by the small $\alpha$'s, and in the limit it is dominated by the $\alpha_{1}$ term: $idx\approx(A_{1}/S)^{1/\alpha_{1}}+C$, which again corresponds approximately to a Zipf distribution.
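A small numerical illustration of this argument (toy scales and exponents): merge the eigenvalue spectra of several Zipf sub-datasets and compare the rank of a given eigenvalue in the union with the predicted sum of per-dataset ranks.

import numpy as np

alphas = [1.2, 1.6, 2.0]            # degradation coefficients alpha_1 < alpha_2 < alpha_3
A = [1.0, 1.0, 1.0]                 # per-sub-dataset scales (illustrative)
spectra = [Ai / np.arange(1, 5001) ** ai for Ai, ai in zip(A, alphas)]
merged = np.sort(np.concatenate(spectra))[::-1]      # union of eigenvalues, descending

for S in [1e-2, 1e-3, 1e-4]:
    idx_true = int((merged > S).sum())                                 # rank of value S in the union
    idx_pred = sum(int((Ai / S) ** (1 / ai)) for Ai, ai in zip(A, alphas))
    print(S, idx_true, idx_pred)    # the ranks agree; the smallest alpha dominates for small S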

H.2 Experimental results for the Zipf distribution on Mixed Datasets

Experimentally, we use the mixed dataset of Traffic, Weather, ETTh1, ETTh2, ETTm1, ETTm2, Exchange and ECL to train a Channel-Independent 2-layer 512-dimension MLP and use the intermediate vector before the decoder layer as a feature vector for PCA analysis. We find that the result follows a Zipf distribution for the higher-ranked components, as shown in Figure 10.

Refer to caption
Figure 10: PCA obtained from MLP features trained on the mixed dataset. The 2-layer 512-dimension MLP is trained under the Channel-Independent setting on a mixed dataset of Traffic, Weather, Exchange, ETTh1, ETTh2, ETTm1, ETTm2 and ECL. The mixed features show a well-aligned Zipf law for the higher-ranked features.

Appendix I More comparison with other formulas

We compare the AIC and BIC values of our proposed formula $f(x)=A+B/x^{\alpha}$ with those of other candidate formulas (a minimal sketch of the scoring procedure follows the list):

  • $g_{1}(x)=A/x^{\alpha}$

  • $g_{2}(x)=A+B\log(x)$

  • $g_{3}(x)=A+Bx+Cx^{2}$
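A minimal sketch (synthetic loss curve; we assume the standard Gaussian-error forms AIC = 2k + n ln(RSS/n) and BIC = k ln(n) + n ln(RSS/n)) of how such scores can be computed: fit each candidate formula by least squares and evaluate both criteria.

import numpy as np
from scipy.optimize import curve_fit

def ic_scores(func, k, x, y, p0):
    params, _ = curve_fit(func, x, y, p0=p0, maxfev=20000)
    rss = np.sum((y - func(x, *params)) ** 2)
    n = len(x)
    return 2 * k + n * np.log(rss / n), k * np.log(n) + n * np.log(rss / n)

x = np.linspace(0.1, 1.0, 10)                          # e.g. fractions of the training data
y = 0.30 + 0.05 / x**0.8 + 0.002 * np.random.default_rng(0).standard_normal(10)

f  = lambda x, A, B, a: A + B / x**a
g1 = lambda x, A, a: A / x**a
g2 = lambda x, A, B: A + B * np.log(x)
g3 = lambda x, A, B, C: A + B * x + C * x**2

for name, func, k, p0 in [("f", f, 3, [0.3, 0.05, 1.0]), ("g1", g1, 2, [0.3, 1.0]),
                          ("g2", g2, 2, [0.3, 0.05]), ("g3", g3, 3, [0.3, 0.0, 0.0])]:
    print(name, ic_scores(func, k, x, y, p0))          # lower AIC/BIC indicates a better fit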

For ModernTCN:

AIC, BIC Traffic Weather ETTh1 ETTh2
f -103.9,-103.4 -87.2,-87.0 -69.5,-69.6 -64.7,-65.3
g1 -95.3,-94.9 -79.1,-79.0 -60.6,-60.7 -45.6,-46.0
g2 -94.6,-63.3 -71.1,-71.0 -59.7,-59.8 -45.4,-45.8
g3 -93.3,-94.9 -81.3,-83.1 -56.5,-56.7 -43.1,-43.8
Table 3: regression results for ModernTCN

For iTransformer:

AIC, BIC Traffic Weather ETTh1 ETTh2
f -71.6,-70.7 -74.8,-74.6 -50.7,-50.9 -56.9,-56.7
g1 -66.7,-66.1 -73.1,-72.9 -45.7,-45.8 -57.9,-57.8
g2 -63.3,-62.6 -71.1,-71.0 -41.7,-41.8 -57.9,-57.7
g3 -54.9,-56.3 -70.1,-72.0 -38.9,-39.7 -56.4,-56.2
Table 4: regression results for iTransformer

For MLP or Linear models:

AIC, BIC Traffic Weather ETTh1 ETTh2
f -91.9,-91.3 -91.5,-91.7 -89.1,-88.8 -62.8,-62.6
g1 -67.1,-66.7 -83.1,-83.2 -67.5,-67.4 -61.1,-60.9
g2 -66.0,-65.6 -82.1,-82.3 -65.6,-65.5 -59.9,-59.8
g3 -60.1,-61.2 -81.3,-83.4 -89.1,-88.8 -60.1,-59.9
Table 5: regression results for MLP or Linear models