
An Evaluation of Standard Statistical Models and LLMs on Time Series Forecasting

979-8-3503-7565-7/24/$31.00 ©2024 IEEE

Rui Cao
School of Information Science and Engineering
Southeast University
Nanjing, China
[email protected]

Qiao Wang (Corresponding Author)
School of Information Science and Engineering and School of Economics and Management
Southeast University
Nanjing, China
[email protected]

Abstract

This research examines the use of Large Language Models (LLMs) in predicting time series, with a specific focus on the LLMTIME model. Despite the established effectiveness of LLMs in tasks such as text generation, language translation, and sentiment analysis, this study highlights the key challenges that large language models encounter in the context of time series prediction. We assess the performance of LLMTIME across multiple datasets and introduce classical almost periodic functions as time series to gauge its effectiveness. The empirical results indicate that while large language models can perform well in zero-shot forecasting for certain datasets, their predictive accuracy diminishes notably when confronted with diverse time series data and traditional signals. The primary finding of this study is that the predictive capacity of LLMTIME, similar to other LLMs, significantly deteriorates when dealing with time series data that contain both periodic and trend components, as well as when the signal comprises complex frequency components.

Index Terms:
ARIMA, Large Language Model, Time Series Forecasting, Almost periodic functions.

I Introduction

Time series analysis is a highly practical and fundamental problem that holds significant importance in various real-world scenarios [1][2], such as predicting retail sales as discussed by Böse et al. [3], filling in missing data in economic time series according to Friedman [4], identifying anomalies in industrial maintenance per Gao et al. [5], and classifying time series data from different domains by Ismail Fawaz et al. [6]. Numerous statistical and machine learning techniques have been developed over time for time series analysis. Drawing inspiration from their remarkable success in natural language processing and computer vision, transformers [22] have been integrated into various tasks involving time series data (Wen et al. [8]), with promising outcomes particularly in time series forecasting (Lim et al. [9]; Nie et al. [10]).

Time series datasets frequently consist of sequences originating from various sources, each potentially characterized by distinct time scales, durations, and sampling frequencies. This diversity poses difficulties in both model training and data processing. Moreover, missing values are common in time series data, necessitating specific approaches to address these gaps to maintain the precision and resilience of the model. Furthermore, prevalent applications of time series forecasting, such as weather prediction or financial analysis, entail making predictions based on observations with limited available information, rendering forecasting a challenging task. Ensuring both accuracy and reliability is complex due to the necessity of handling unknown and unpredictable variables when extrapolating future data, underscoring the importance of estimating uncertainty. Consequently, it is crucial for the model to possess strong generalization capabilities and an effective mechanism for managing uncertainty to adeptly adapt to forthcoming data variations and obstacles.

Currently, large-scale pre-training has emerged as a crucial technique for training extensive neural networks in the domains of vision and text, leading to a significant enhancement in performance [11][12]. Nonetheless, the utilization of pre-training in time series modeling faces certain challenges. Time series data, unlike visual and textual data, lacks well-defined unsupervised targets, posing difficulties in achieving effective pre-training. Moreover, the scarcity of large-scale and coherent pre-trained datasets for time series further hinders the adoption of pre-training in this domain. Consequently, conventional time-series approaches (e.g., ARIMA [13] and linear models [14]) often outperform deep learning methods in widely used benchmarks [15]. With the advent of pre-trained models, many researchers have started to forecast time series by leveraging LLMs, such as the LLMTIME technique [16], illustrating how LLMs naturally bridge the gap between the simplistic biases of traditional methods and the intricate representational learning and generation capabilities of contemporary deep learning. The pre-trained LLMs are applied to tackle the challenge of continuous time series forecasting [33][34]. However, while this method demonstrates comparable performance to traditional processing methods on some datasets, extensive experimentation in this study reveals that the LLMTIME approach lacks the ability for zero-shot time series forecasting and even underperforms compared to the traditional time series method ARIMA when tested on a diverse set of datasets.

II Background

II-A Language Model

A language model is designed to estimate the probability that a given sequence of words forms a coherent sentence. Consider two examples: the sequence A = (never, too, late, to, learn) clearly forms a sentence ("never too late to learn"), and a proficient language model should assign it a high probability. Conversely, the sequence B = (ever, zoo, later, too, eat) is unlikely to form a coherent sentence, and a well-trained language model would assign it a low probability. The main objectives of language models are therefore to evaluate the probability that a word sequence forms a grammatically correct sentence and to predict the probability of the next word given the preceding words. To accomplish this, language models work under the fundamental assumption that the occurrence of a word depends only on the words that precede it, which relates the joint probability of a sentence to the conditional probabilities of its words. Language models are trained on a set of sequences $\mathcal{U}=\{U_{1},U_{2},\ldots,U_{i},\ldots,U_{N}\}$, where each sequence is written as $U_{i}=(u_{1},u_{2},\ldots,u_{j},\ldots,u_{n_{i}})$ and each token $u_{j}$ ($j\in\{1,2,\ldots,n_{i}\}$) belongs to a vocabulary $\mathcal{V}$. Large language models typically act as autoregressive distributions, meaning that the probability of each token depends only on the preceding tokens, $p_{\theta}(U_{i})=\prod_{j=1}^{n_{i}}p_{\theta}(u_{j}\mid u_{0:j-1})$. The model parameters $\theta$ are learned by maximizing the likelihood of the complete dataset, $p_{\theta}(\mathcal{U})=\prod_{i=1}^{N}p_{\theta}(U_{i})$.
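To make the factorization concrete, the following toy sketch (not part of the original paper) scores the two example sequences with a hypothetical bigram table standing in for $p_{\theta}(u_{j}\mid u_{0:j-1})$; a real LLM conditions on the entire prefix rather than only the previous token.

import math

# Toy autoregressive "language model": hypothetical bigram conditionals
# p(u_j | u_{j-1}) over a tiny vocabulary, used only to illustrate the
# chain-rule factorization p(U) = prod_j p(u_j | u_{0:j-1}).
cond_prob = {
    ("<bos>", "never"): 0.20,
    ("never", "too"):   0.30,
    ("too", "late"):    0.40,
    ("late", "to"):     0.50,
    ("to", "learn"):    0.25,
}

def sequence_log_prob(tokens):
    """Sum log p(u_j | u_{j-1}); unseen pairs receive a small floor probability."""
    logp = 0.0
    prev = "<bos>"
    for tok in tokens:
        logp += math.log(cond_prob.get((prev, tok), 1e-6))
        prev = tok
    return logp

seq_a = ["never", "too", "late", "to", "learn"]
seq_b = ["ever", "zoo", "later", "too", "eat"]
print(sequence_log_prob(seq_a))  # higher (less negative) log-probability
print(sequence_log_prob(seq_b))  # much lower, as expected for an implausible sentence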

II-B Large Language Model

LLMs represent a type of artificial intelligence model created to comprehend and produce human language. They are trained on extensive text corpora and can carry out various tasks, such as text generation [17, 18, 19], machine translation [20], and dialogue systems [21], among others. These models are distinguished by their immense size, encompassing billions of parameters that enable them to capture intricate patterns in language data. Typically, they are built on deep learning architectures such as the transformer [22], which contributes to their remarkable performance across a spectrum of NLP tasks. The development of Bidirectional Encoder Representations from Transformers (BERT) [11] and Generative Pre-trained Transformers (GPT) [12] has significantly propelled the NLP field, ushering in the widespread adoption of pre-training and fine-tuning techniques [23][24]. Pre-training leverages extensive datasets through generic objectives and is commonly referred to as unsupervised training (although technically supervised, it lacks human-annotated labels) [25][26]. It is then standard practice to fine-tune the pre-trained model to increase its utility for downstream tasks. These models outperform earlier deep learning approaches [27], which has focused attention on the careful design of pre-training and fine-tuning procedures, and research has increasingly treated pre-trained language models as intermediate components of self-supervised learning pipelines.

II-C Time Series Forecasting Tasks

The tasks associated with time series can be categorized into multiple types, including prediction, anomaly detection, clustering, change-point detection, and segmentation. Prediction and anomaly detection are among the most commonly used. Time series prediction examines the historical patterns and trends in the data to forecast future values using different prediction models. The forecast may be a single value or a sequence of values over a specific horizon. This task is frequently employed to analyze and predict time-dependent data such as stock prices, sales figures, temperature variations, and traffic volume. Given a sequence of values $Y=\{x_{1},x_{2},\ldots,x_{i},\ldots,x_{N}\}$, where the value corresponding to time stamp $t$ is $x_{t}$, the objective of prediction is to estimate $x_{t+1}$ from $\{x_{1},x_{2},\ldots,x_{t}\}$.
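As a concrete, if trivial, instance of this setup, a naive one-step-ahead forecaster simply repeats the last observed value. The sketch below is purely illustrative (it is not a method used in this paper) and only makes the notation explicit.

from typing import Sequence

def naive_forecast(history: Sequence[float]) -> float:
    """One-step-ahead forecast: estimate x_{t+1} from {x_1, ..., x_t}.
    The naive baseline simply repeats the last observed value x_t."""
    if not history:
        raise ValueError("history must contain at least one observation")
    return history[-1]

x = [112.0, 118.0, 132.0, 129.0, 121.0]   # x_1, ..., x_t
print(naive_forecast(x))                   # estimate of x_{t+1} -> 121.0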

Traditional time series modeling, such as the ARIMA model, is commonly employed for forecasting time series data. The ARIMA model consists of three key components: the autoregressive model (AR), the differencing process, and the moving average model (MA). Essentially, the ARIMA model leverages historical data to make future predictions. The value of a variable at a specific time is influenced by its past values and previous random occurrences. This implies that the ARIMA model assumes the variable fluctuates around a general time trend, where the trend is shaped by historical values and the fluctuations are influenced by random events within a timeframe. Moreover, the general trend itself may not remain constant. In essence, the ARIMA model aims to uncover hidden patterns within the time series data using autocorrelation and differencing techniques. These patterns are then utilized for future predictions. ARIMA models are adept at capturing both trends and temporary, sudden, or noisy data, making them effective for a wide range of time series forecasting tasks.

If we temporarily set aside the differencing step, the ARIMA model can be viewed as a straightforward combination of the AR and MA models. Formally, the ARIMA model can be written as:

x_{t}=c+\varphi_{1}x_{t-1}+\varphi_{2}x_{t-2}+\ldots+\varphi_{p}x_{t-p}+\theta_{1}\epsilon_{t-1}+\theta_{2}\epsilon_{t-2}+\ldots+\theta_{q}\epsilon_{t-q}+\epsilon_{t}. \qquad (1)

In the equation above, $x_{t}$ denotes the time series value at time $t$, while $\varphi_{1}$ to $\varphi_{p}$ are the coefficients of the autoregressive (AR) part, describing the relationship between the current value and the values at the previous $p$ time steps. Similarly, $\theta_{1}$ to $\theta_{q}$ are the coefficients of the moving average (MA) part, describing the relationship between the current value and the error terms at the last $q$ time steps. The term $\epsilon_{t}$ is the error at time $t$, and $c$ is a constant.
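As a minimal sketch of fitting such a model in practice, one might use the statsmodels package along the following lines; this is an assumption for illustration rather than the exact toolchain or model order used in our experiments.

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Synthetic series with a linear trend plus noise, standing in for real data.
rng = np.random.default_rng(0)
y = 0.5 * np.arange(200) + rng.normal(scale=2.0, size=200)

# ARIMA(p, d, q): p AR lags, d differences, q MA lags, matching Eq. (1)
# after d-th order differencing.
model = ARIMA(y, order=(2, 1, 1))
result = model.fit()

forecast = result.forecast(steps=10)  # point forecasts for the next 10 steps
print(forecast)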

II-D LLMTIME

LLMTIME [16] is a technique for predicting time series with large language models (such as GPT-3) by converting numerical data into text and generating potential future trajectories through text completion. Essentially, this approach encodes a time series as a string of digits and treats time series prediction as next-token prediction, leveraging pre-trained models together with probabilistic operations such as likelihood evaluation and sampling. Tokenization plays a crucial role, since it shapes the patterns within tokenized sequences and determines the operations that the language model can learn.

Common tokenization methods, such as Byte Pair Encoding (BPE), often break a number into chunks that do not align with its digits [28]. For instance, the GPT-3 tokenizer decomposes the number 42235630 into [422, 35, 630]. Newer language models such as LLaMA [29] instead tokenize each digit individually by default. LLMTIME adopts a scheme that separates the digits of each number with spaces, so that every digit is tokenized individually, and uses commas to separate successive time steps in a series. Since decimal points carry no information at a fixed precision, LLMTIME removes them during encoding to shorten the context. For example, with a precision of two digits, a time series is preprocessed by converting it from

0.789, 7.89, 78.9, 789.0 → "7 8 , 7 8 9 , 7 8 9 0 , 7 8 9 0 0".
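A simplified re-implementation of this encoding is sketched below; it is illustrative only, and the actual LLMTIME code handles precision and scaling more carefully.

from decimal import Decimal, ROUND_DOWN

def encode_series(values, prec=2):
    """Encode a numeric series LLMTIME-style (simplified): truncate to a fixed
    precision, drop the decimal point and leading zeros, separate the digits
    with spaces, and separate time steps with ' , '."""
    step = Decimal(1).scaleb(-prec)                      # e.g. 0.01 for prec=2
    pieces = []
    for v in values:
        q = Decimal(str(v)).quantize(step, rounding=ROUND_DOWN)
        digits = str(q).replace(".", "").lstrip("0") or "0"
        pieces.append(" ".join(digits))
    return " , ".join(pieces)

print(encode_series([0.789, 7.89, 78.9, 789.0]))
# -> 7 8 , 7 8 9 , 7 8 9 0 , 7 8 9 0 0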

LLMTIME rescales the inputs so that the $\alpha$-percentile of the rescaled time series equals 1. It deliberately avoids scaling by the maximum value, so that the LLM sees some fraction $(1-\alpha)$ of examples whose number of digits varies and can imitate this pattern in its outputs, generating values larger than any it has observed. Furthermore, LLMTIME applies an offset $\beta$ computed from a percentile of the input data, and both parameters are tuned according to the validation log-likelihood.
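A rough sketch of this rescaling step, under our reading of the method, is given below; in LLMTIME the percentile parameterization and the values of alpha and beta are tuned on the validation log-likelihood, so the defaults here are assumptions.

import numpy as np

def rescale(history, alpha=0.99, beta=0.3):
    """Shift by the beta-quantile and scale so that the alpha-quantile of the
    shifted history maps to 1.  Returns the forward transform and its inverse,
    so model outputs can be mapped back to the original units."""
    offset = np.quantile(history, beta)
    scale = np.quantile(history - offset, alpha)
    scale = scale if scale != 0 else 1.0
    to_model = lambda x: (x - offset) / scale
    from_model = lambda z: z * scale + offset
    return to_model, from_model

history = np.array([12.0, 15.0, 14.0, 30.0, 45.0, 60.0])
enc, dec = rescale(history)
print(enc(history))        # values above the alpha-quantile map to numbers > 1
print(dec(enc(history)))   # round-trips back to the original values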

II-E Our Work

Although LLMTIME may perform similarly to the traditional autoregressive model ARIMA on some datasets, its forecasting accuracy is lower than ARIMA when dealing with other datasets such as traditional and artificial signals that exhibit noisy almost periodic patterns. Moreover, as the data values increase over time, the forecasting performance of LLMTIME diminishes noticeably.

III Experiments

The zero-shot forecasting ability of LLMs is assessed by comparing LLMTIME with well-known time series baselines across several benchmark datasets. Because the 'text-davinci-003' model has been discontinued (see [30]), we exclusively employed the 'gpt-3.5-turbo-instruct' model in our assessments. In terms of deterministic metrics such as MAE, the LLM exhibits inferior performance compared to conventional forecasting techniques such as ARIMA.

(a) The effect of AirPassengersDataset based on LLMTIME.
(b) The effect of AirPassengersDataset based on ARIMA.
Figure 1: Figure (1a) illustrates the forecast of the AirPassengersDataset using the LLMTIME model. As the time series values increase gradually over time, LLMTIME continues accurately from the final observed value but its forecast then drops sharply, resulting in significantly poorer performance than the ARIMA model; the same discrepancy is particularly noticeable on the FredMd dataset from Monash. Figure (1b) displays the ARIMA forecast of the AirPassengersDataset. Regardless of the waveform or periodicity of the series, ARIMA consistently produces good results and significantly outperforms LLMTIME in terms of overall Mean Squared Error (MSE). Further experimental results are given in Appendix V-A.

III-A Darts

In Darts [31], the darts.datasets module offers a variety of pre-existing time series datasets suitable for showcasing, testing, and experimentation purposes. These datasets typically consist of traditional time series data, with Darts specializing in time series forecasting and boasting a comprehensive forecasting system. We utilized LLMTIME and ARIMA models to conduct forecasting on eight datasets (’AirPassengersDataset’, ’AusBeerDataset’, ’GasRateCO2Dataset’, ’MonthlyMilkDataset’, ’SunspotsDataset’, ’WineDataset’, ’WoolyDataset’, ’HeartRateDataset’) integrated into Darts. A portion of the experimental outcomes is depicted in Fig. 1. The AirPassengersDataset comprises classic airline data, documenting the total monthly count of international air passengers from 1949 to 1960, measured in thousands. All experimental findings are detailed in Appendix  V-A.
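For reference, the ARIMA half of this comparison can be reproduced with Darts' built-in datasets along the following lines; this is a sketch, and the exact train/test split and model orders used in our experiments may differ.

from darts.datasets import AirPassengersDataset
from darts.models import ARIMA
from darts.metrics import mse

series = AirPassengersDataset().load()   # monthly passenger counts, 1949-1960
train, test = series.split_before(0.8)   # 80/20 split (illustrative)

model = ARIMA()                          # default (p, d, q) orders
model.fit(train)
forecast = model.predict(len(test))

print("ARIMA MSE:", mse(test, forecast))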

The experimental results demonstrate that the predictive performance of the LLMTIME approach utilizing GPT-3.5 on the Darts dataset is notably inferior to that of the ARIMA method. It is evident that as the data values increase steadily over time, the forecasting accuracy of LLMTIME diminishes significantly, a pattern observed in other data experiments as well.

(a) The effect of the FredMd dataset based on LLMTIME.
(b) The effect of the CovidDeaths dataset based on LLMTIME.
Figure 2: When forecasting the final data points, the performance of LLMTIME declines significantly, as the graphs show.

III-B Monash

The Monash archive [32] comprises 30 openly accessible datasets together with baseline results for 12 forecasting models. Because of differences in sampling frequency and in the handling of missing values, there are several versions of the datasets, giving 58 dataset variations in total. The collection spans a diverse range of real-world and competition time series across different domains. Following the LLMTIME setup, we assessed GPT-3.5's zero-shot performance on the top 10 datasets with the most effective baseline model. A noticeable decline in LLMTIME's forecasting performance is evident on the FredMd dataset, a monthly database for macroeconomic research described in Working Paper 2015-012B by Michael W. McCracken and Serena Ng. This database aims to provide a comprehensive monthly macroeconomic dataset that facilitates empirical analysis leveraging "big data." (The outcomes for FredMd and CovidDeaths are depicted in Fig. 2, with additional details in Appendix V-B.)

(a) Forecasts of total UK exports based on LLMTIME.
(b) Forecasts of total UK exports based on ARIMA.
Figure 3: We forecast the total monetary value of the UK's exports from January 1989 to December 2023, measured in millions of dollars and recorded monthly. (Refer to Appendix V-C for more information.)
(a) Prediction of the sequence generated by $f(t)=\cos(2\pi t)+\cos(2t)+\text{noise}$ based on LLMTIME.
(b) Prediction of the sequence generated by the noisy almost periodic function $f(t)=\cos(2\pi t)+\cos(2t)+\text{noise}$ based on ARIMA.
Figure 4: Using the LLMTIME (4a) and ARIMA (4b) models, we forecast the time series of the artificial signal $f(t)=\cos(2\pi t)+\cos(2t)+\text{noise}$, where the noise is Gaussian with mean 0 and standard deviation 0.1. We select 500 data points of $f(t)$ on the interval from 0 to $8\pi$ to form a series and forecast the values of the final 100 points based on the first 400.

III-C Time series in economics

In this section, we examined economic time series, specifically the total exports of six countries and regions over the past few decades. The data were sourced from CEIC [35], a comprehensive database offering economic information on over 213 countries and regions, encompassing macroeconomic indicators, industry-specific economics, and specialized data. Figure 3 shows the total export value of the United Kingdom from January 1989 to December 2023, denominated in millions of dollars and recorded monthly.

III-D Data generated by noisy almost periodic functions

A range of artificial signals were forecasted, and data was produced by adding noise to the almost periodic function

f(t)=cos(2πt)+cos(2t)+noise,f(t)=\cos(2\pi t)+\cos(2t)+noise, (2)

where the noise term denotes Gaussian noise with variance $\sigma^{2}$. The almost periodic function has traditionally been significant in the development of harmonic analysis, leading to the formulation of Wiener's general harmonic analysis (GHA) [2], and subsequently influencing the study of the statistics of random processes.

The function (2) is an almost periodic function. In mathematics, an almost periodic function contains multiple frequency components whose ratios are irrational, so that no common period exists. Almost periodicity is a feature of dynamical systems that appear to retrace their paths through phase space, albeit not exactly. Several experiments were carried out with varying standard deviations $\sigma$. The outcomes of the experiments with $\sigma=0.1$ are depicted in Fig. 4.
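The synthetic series can be generated along the following lines; this sketch matches the setup described in Fig. 4 (500 samples of $f(t)$ on $[0, 8\pi]$, with the final 100 points held out), while the random seed is an illustrative assumption.

import numpy as np

rng = np.random.default_rng(42)   # fixed seed for reproducibility (illustrative)
sigma = 0.1                       # standard deviation of the Gaussian noise

t = np.linspace(0, 8 * np.pi, 500)
f = np.cos(2 * np.pi * t) + np.cos(2 * t) + rng.normal(0.0, sigma, size=t.size)

train, test = f[:400], f[400:]    # forecast the last 100 points from the first 400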

Furthermore, we conducted experiments with four standard deviations of the Gaussian noise using the LLMTIME and ARIMA methods, and computed their mean squared errors to generate Fig. 5.
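The ARIMA side of this sweep can be sketched as follows, again using the statsmodels ARIMA with an illustrative, untuned order; the LLMTIME side requires querying the language model API and is omitted here.

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def make_series(sigma, rng):
    t = np.linspace(0, 8 * np.pi, 500)
    return np.cos(2 * np.pi * t) + np.cos(2 * t) + rng.normal(0.0, sigma, t.size)

rng = np.random.default_rng(0)
for sigma in (0.1, 0.2, 0.3, 0.4):             # illustrative noise levels
    y = make_series(sigma, rng)
    train, test = y[:400], y[400:]
    fit = ARIMA(train, order=(8, 0, 2)).fit()  # illustrative order, not a tuned choice
    pred = fit.forecast(steps=len(test))
    print(f"sigma={sigma}: ARIMA MSE = {np.mean((pred - test) ** 2):.4f}")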

Figure 5: The x-axis shows the standard deviation of the Gaussian noise, and the y-axis shows the MSE. The black and blue bars correspond to LLMTIME and ARIMA, respectively. Across all four scenarios, the MSE of LLMTIME is significantly greater than that of ARIMA.

The comparison of mean squared errors indicates that the forecasts generated by the LLMTIME model generally exhibit higher errors than those produced by the ARIMA model. The forecasting plots likewise show that ARIMA significantly outperforms LLMTIME. For the remaining experimental outcomes, please refer to Appendix V-D.

Furthermore, we examined the sine wave and the combination of sinusoidal and linear functions. The findings indicate that LLMTIME is effective only for time series that exhibit complete periodicity, while it struggles to forecast non-periodic signals. Refer to Fig. 6 for more information.

(a) Prediction of the function sequence values based on LLMTIME.
(b) Prediction of the function sequence values based on ARIMA.
Figure 6: The upper panels show the true signal $f(t)=\sin(t)+\text{noise}$, and the lower panels show the true signal $g(t)=\sin(t)+0.2t+\text{noise}$. The LLMTIME prediction of $f(t)$ exhibits the expected periodic behavior, but it clearly fails to predict $g(t)$ accurately. The noise is Gaussian with mean zero and standard deviation 0.02. (See Appendix V-D for details.)
TABLE I: The MSE on some datasets based on LLMTIME and ARIMA.
DATASETS            LLMTIME MSE      ARIMA MSE
AirPassengers       10380.56         2441.71
AusBeer             454.15           754.21
GasRateCO2          32.71            8.25
MonthlyMilk         6023.25          1670.71
Sunspots            1485.97          1007.62
Wine                31683665.37      8826996.13
Wooly               628419.65        587217.59
HeartRate           87.95            53.22
Cif2016             98009.38         3173.05
CovidDeaths         492338.81        10747.88
ElectricityDemand   983958.23        720077.08
FredMd              11344708.41      574689.74
Hospital            18.79            22.57
Nn5Weekly           1964.31          3920.95
PedestrianCounts    306676.68        1082241.58
TourismMonthly      3033529.57       346055.66
TrafficWeekly       1.28             1.31
UsBirths            1021661.84       411167.05
$\sigma=0$          0.7762           0.1997
$\sigma=0.1$        0.7811           0.0385
$\sigma=0.2$        1.1289           0.1708
$\sigma=0.3$        1.4096           1.1824
$\sigma=0.4$        2.0037           0.5909
The synthetic signal is $f(t)=\cos(2\pi t)+\cos(2t)+\text{noise}$, where $\sigma$ denotes the standard deviation of the Gaussian noise.

Experimental outcomes obtained from the aforementioned datasets allow us to calculate the Mean Squared Error (MSE) between the predicted outcomes and the actual data. Our analysis indicates that, in most instances, the predictive accuracy of the LLMTIME model is inferior to that of the ARIMA model. Refer to Table I for details.

IV Conclusion

It has been observed that the LLMTIME model encounters difficulties in accurately predicting datasets that contain both trend and cyclical elements, as well as in cases where the signal consists of intricate frequency components. In comparison to the conventional ARIMA time series model, the LLMTIME model is less dependable. Despite the challenges and limitations associated with using LLMs as pre-trained models for time series prediction, there is potential for further investigation in this area. We are eager to explore improved methods for integrating LLMs research into time series forecasting.

Acknowledgment: We thank the referees of this paper for their very helpful comments. The code of this paper is available at: https://github.com/crSEU/llmtime_VS_arima

References

  • [1] George E. P. Box, Gwilym M. Jenkins, Gregory C. Reinsel, and Greta M. Ljung. Time Series Analysis: Forecasting and Control (5th Edition). John Wiley and Sons Inc., Hoboken, New Jersey, 2015.
  • [2] Norbert Wiener. Extrapolation, Interpolation, and Smoothing of Stationary Time Series: With Engineering Applications. The MIT Press, 1949.
  • [3] Böse, J.-H., Flunkert, V., Gasthaus, J., Januschowski, T., Lange, D., Salinas, D., Schelter, S., Seeger, M., and Wang, Y. Probabilistic demand forecasting at scale. Proceedings of the VLDB Endowment, 10(12):1694–1705, 2017.
  • [4] Friedman, M. The interpolation of time series by related series. J. Amer. Statist. Assoc, 1962.
  • [5] Gao, J., Song, X., Wen, Q., Wang, P., Sun, L., and Xu, H. RobustTAD: Robust time series anomaly detection via decomposition and convolutional neural networks. KDD Workshop on Mining and Learning from Time Series (KDD-MileTS’20), 2020.
  • [6] Ismail Fawaz, H., Forestier, G., Weber, J., Idoumghar, L., and Muller, P.-A. Deep learning for time series classification: a review. Data Mining and Knowledge Discovery, 33(4):917–963, 2019.
  • [7] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser Lukasz, and Polosukhin, I. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017.
  • [8] Wen, Q., Zhou, T., Zhang, C., Chen, W., Ma, Z., Yan, J., and Sun, L. Transformers in time series: A survey. In International Joint Conference on Artificial Intelligence(IJCAI), 2023.
  • [9] Lim, B., Arık, S. Ö., Loeff, N., and Pfister, T. Temporal fusion transformers for interpretable multi-horizon time series forecasting. International Journal of Forecasting, 2021.
  • [10] Nie, Y., Nguyen, N. H., Sinthong, P., and Kalagnanam, J. A time series is worth 64 words: Long-term forecasting with transformers. ArXiv, abs/2211.14730, 2022.
  • [11] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.
  • [12] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  • [13] George EP Box and Gwilym M Jenkins. Some recent advances in forecasting and control. Journal of the Royal Statistical Society. Series C (Applied Statistics), 17(2):91–109, 1968.
  • [14] Ailing Zeng, Muxi Chen, Lei Zhang, and Qiang Xu. Are transformers effective for time series forecasting? arXiv preprint arXiv:2205.13504, 2022.
  • [15] Hansika Hewamalage, Klaus Ackermann, and Christoph Bergmeir. Forecast evaluation for data scientists: common pitfalls and best practices. Data Mining and Knowledge Discovery, 37(2): 788–832, 2023.
  • [16] Nate Gruver, Marc Anton Finzi, Shikai Qiu, and Andrew Gordon Wilson. Large language models are zero-shot time series forecasters. Advances in Neural Information Processing Systems, 2023.
  • [17] Leigang Qu, Shengqiong Wu, Hao Fei, Liqiang Nie, and Tat-Seng Chua. Layoutllm-t2i: Eliciting layout guidance from llm for text-to-image generation. In Proceedings of the 31st ACM International Conference on Multimedia, pages 643–654, 2023.
  • [18] Zoie Zhao, Sophie Song, Bridget Duah, Jamie Macbeth, Scott Carter, Monica P Van, Nayeli Suseth Bravo, Matthew Klenk, Kate Sick, and Alexandre LS Filipowicz. More human than human: Llm-generated narratives outperform human-llm interleaved narratives. In Proceedings of the 15th Conference on Creativity and Cognition, pages 368–370, 2023.
  • [19] Zhuoyan Li, Hangxiao Zhu, Zhuoran Lu, and Ming Yin. Synthetic data generation with large language models for text classification: Potential and limitations. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 10443–10461, Singapore, December 2023. Association for Computational Linguistics.
  • [20] Shoetsu Sato, Jin Sakuma, Naoki Yoshinaga, Masashi Toyoda, and Masaru Kitsuregawa. Vocabulary adaptation for domain adaptation in neural machine translation. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4269–4279, 2020.
  • [21] Zhi Jing, Yongye Su, Yikun Han, Bo Yuan, Haiyun Xu, Chunjiang Liu, Kehai Chen, and Min Zhang. When large language models meet vector databases: A survey, 2024.
  • [22] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
  • [23] Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55(9):1–35, 2023.
  • [24] Yu Gu, Robert Tinn, Hao Cheng, Michael Lucas, Naoto Usuyama, Xiaodong Liu, Tristan Naumann, Jianfeng Gao, and Hoifung Poon. Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for Healthcare (HEALTH), 3(1):1–23, 2021.
  • [25] Chongyang Tao, Lili Mou, Dongyan Zhao, and Rui Yan. Ruber: An unsupervised method for automatic evaluation of open-domain dialog systems. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018.
  • [26] Jiashu Liao, Victor Sanchez, and Tanaya Guha. Self-supervised frontalization and rotation gan with random swap for pose-invariant face recognition. In 2022 IEEE International Conference on Image Processing (ICIP), pages 911–915. IEEE, 2022.
  • [27] Kangrui Ruan, Cynthia He, Jiyang Wang, Xiaozhou Joey Zhou, Helian Feng, and Ali Kebarighotbi. S2e: Towards an end-to-end entity resolution solution from acoustic signal. In ICASSP 2024, 2024.
  • [28] Tiedong Liu and Bryan Kian Hsiang Low. Goat: Fine-tuned llama outperforms gpt-4 on arithmetic tasks. arXiv preprint arXiv:2305.14201, 2023.
  • [29] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  • [30] https://platform.openai.com/docs/deprecations
  • [31] Julien Herzen, Francesco Lässig, Samuele Giuliano Piazzetta, Thomas Neuer, Léo Tafti, Guillaume Raille, Tomas Van Pottelbergh, Marek Pasieka, Andrzej Skrodzki, Nicolas Huguenin, et al. Darts: User-friendly modern machine learning for time series. The Journal of Machine Learning Research, 23(1):5442–5447, 2022.
  • [32] Rakshitha Godahewa, Christoph Bergmeir, Geoffrey I Webb, Rob J Hyndman, and Pablo Montero-Manso. Monash time series forecasting archive. arXiv preprint arXiv:2105.06643, 2021.
  • [33] Tian Zhou, Peisong Niu, Xue Wang, Liang Sun, and Rong Jin. One fits all: Power general time series analysis by pretrained lm. arXiv preprint arXiv:2302.11939, 2023.
  • [34] Ming Jin, Shiyu Wang, Lintao Ma, Zhixuan Chu, James Y Zhang, Xiaoming Shi, Pin-Yu Chen, Yuxuan Liang, Yuan-Fang Li, Shirui Pan, et al. Time-LLM: Time series forecasting by reprogramming large language models. arXiv preprint arXiv:2310.01728, 2023.
  • [35] https://www.ceicdata.com/en.

V Appendix

V-A Detailed experimental results of Darts dataset

(a) AirPassengersDataset (LLMTIME)
(b) AirPassengersDataset (ARIMA)
(c) AusBeerDataset (LLMTIME)
(d) AusBeerDataset (ARIMA)
(e) GasRateCO2Dataset (LLMTIME)
(f) GasRateCO2Dataset (ARIMA)
(g) HeartRateDataset (LLMTIME)
(h) HeartRateDataset (ARIMA)
Figure 7: Experimental results of 4 datasets in Darts.
(a) MonthlyMilkDataset (LLMTIME)
(b) MonthlyMilkDataset (ARIMA)
(c) SunspotsDataset (LLMTIME)
(d) SunspotsDataset (ARIMA)
(e) WineDataset (LLMTIME)
(f) WineDataset (ARIMA)
(g) WoolyDataset (LLMTIME)
(h) WoolyDataset (ARIMA)
Figure 8: Experimental results of 4 datasets in Darts.

V-B Detailed experimental results of Monash dataset

(a) Cif2016 (LLMTIME)
(b) Cif2016 (ARIMA)
(c) CovidDeaths (LLMTIME)
(d) CovidDeaths (ARIMA)
(e) ElectricityDemand (LLMTIME)
(f) ElectricityDemand (ARIMA)
(g) FredMd (LLMTIME)
(h) FredMd (ARIMA)
(i) Hospital (LLMTIME)
(j) Hospital (ARIMA)
Figure 9: Experimental results of 5 datasets in Monash.
(a) Nn5Weekly (LLMTIME)
(b) Nn5Weekly (ARIMA)
(c) PedestrianCounts (LLMTIME)
(d) PedestrianCounts (ARIMA)
(e) TourismMonthly (LLMTIME)
(f) TourismMonthly (ARIMA)
(g) TrafficWeekly (LLMTIME)
(h) TrafficWeekly (ARIMA)
(i) UsBirths (LLMTIME)
(j) UsBirths (ARIMA)
Figure 10: Experimental results of 5 datasets in Monash.

V-C Detailed experimental results of time series in economics

(a) American exports (LLMTIME)
(b) American exports (ARIMA)
(c) China exports (LLMTIME)
(d) China exports (ARIMA)
(e) France exports (LLMTIME)
(f) France exports (ARIMA)
(g) HongKong exports (LLMTIME)
(h) HongKong exports (ARIMA)
(i) Japan exports (LLMTIME)
(j) Japan exports (ARIMA)
Figure 11: Experimental results of the economics dataset.

V-D Detailed experimental results of Synthetic dataset

(a) $\sigma=0$ (LLMTIME)
(b) $\sigma=0$ (ARIMA)
(c) $\sigma=0.1$ (LLMTIME)
(d) $\sigma=0.1$ (ARIMA)
(e) $\sigma=0.2$ (LLMTIME)
(f) $\sigma=0.2$ (ARIMA)
(g) $\sigma=0.3$ (LLMTIME)
(h) $\sigma=0.3$ (ARIMA)
(i) $\sigma=0.4$ (LLMTIME)
(j) $\sigma=0.4$ (ARIMA)
Figure 12: Experimental results of $f(t)=\cos(2\pi t)+\cos(2t)+\text{noise}$.
(a) $\sigma=0$ (LLMTIME)
(b) $\sigma=0$ (ARIMA)
(c) $\sigma=0.05$ (LLMTIME)
(d) $\sigma=0.05$ (ARIMA)
(e) $\sigma=0.1$ (LLMTIME)
(f) $\sigma=0.1$ (ARIMA)
(g) $\sigma=0.2$ (LLMTIME)
(h) $\sigma=0.2$ (ARIMA)
Figure 13: Experimental results of $f(t)=\sin(t)+\text{noise}$ and $f(t)=0.2t+\sin(t)+\text{noise}$.