
LightWeather: Harnessing Absolute Positional Encoding for Efficient and Scalable Global Weather Forecasting

Yisong Fu1,2, Fei Wang1,2, Zezhi Shao1, Chengqing Yu1,2, Yujie Li1,2, Zhao Chen1,2, Zhulin An1,2, Yongjun Xu1,2
Abstract

Recently, Transformers have gained traction in weather forecasting for their capability to capture long-term spatial-temporal correlations. However, their complex architectures result in large parameter counts and extended training times, limiting their practical application and scalability to global-scale forecasting. This paper aims to explore the key factor for accurate weather forecasting and design more efficient solutions. Interestingly, our empirical findings reveal that absolute positional encoding is what really works in Transformer-based weather forecasting models, which can explicitly model the spatial-temporal correlations even without attention mechanisms. We theoretically prove that its effectiveness stems from the integration of geographical coordinates and real-world time features, which are intrinsically related to the dynamics of weather. Based on this, we propose LightWeather, a lightweight and effective model for station-based global weather forecasting. We employ absolute positional encoding and a simple MLP in place of other components of Transformer. With under 30k parameters and less than one hour of training time, LightWeather achieves state-of-the-art performance on global weather datasets compared to other advanced DL methods. The results underscore the superiority of integrating spatial-temporal knowledge over complex architectures, providing novel insights for DL in weather forecasting.

Introduction

Accurate weather forecasting is of great significance in a wide variety of domains such as agriculture, transportation, energy, and economics. Over the past decades, the number of automatic weather stations has grown exponentially, and these stations now play a pivotal role in modern meteorology (Sose and Sayyad 2016). They are cost-effective (Bernardes et al. 2023; Tenzin et al. 2017) and can be flexibly deployed almost anywhere in the world, collecting meteorological data at any desired resolution.

With the development of deep learning (DL), studies have embarked on exploring DL approaches for weather forecasting. The goal of data-driven DL methods is to fully leverage historical data to enhance the accuracy of forecasting (Schultz et al. 2021). The weather stations around the world are ideally positioned to provide a substantial amount of data for DL methods. However, the observations of worldwide stations exhibit intricate spatial-temporal patterns that vary across regions and periods, posing challenges for global-scale weather forecasting (Wu et al. 2023b).

Figure 1: Comparison of MSE, epoch time, and parameter count between LightWeather and mainstream Transformer-based methods on the global wind speed dataset. The area of each marker represents the parameter count of the corresponding model.

Recently, Transformers have become increasingly popular in weather forecasting due to their capability to capture long-term spatial-temporal correlations. When confronting the challenge of global-scale forecasting, Transformer-based methods employ ever more sophisticated architectures, leading to hundreds of millions of parameters and multiple days of training time. In the era of large models, this phenomenon has become particularly evident. Such costs limit their scalability to large numbers of stations and restrict their application in practical scenarios (Deng et al. 2024).

Despite the complexity of these architectures, we observe that the resulting improvements in performance are, in fact, quite limited. This motivates us to rethink the bottleneck of station-based weather forecasting and further design a model as effective as Transformer-based methods but more efficient and scalable. For this purpose, we delve deeper into the architecture of Transformer-based weather forecasting models and obtain an interesting conclusion: absolute positional encoding is what really works in Transformer-based weather forecasting models, and the reason lies in the principle of atmospheric dynamics.

Positional encoding is widely regarded as an adjunct to permutation-invariant attention mechanisms, providing positional information of tokens in sequence (Vaswani et al. 2017). However, we empirically find that absolute positional encoding can inherently model the spatial-temporal correlations of worldwide stations, even in the absence of attention mechanisms, by integrating 3D geographical coordinates (i.e., latitude, longitude, and elevation) and real-world temporal knowledge.

Furthermore, we will theoretically elucidate why absolute positional encoding is pivotal by applying principles of atmospheric dynamics. In the global weather system, the evolution of atmospheric states is closely related to absolute spatial and temporal conditions, resulting in complex correlations. Absolute positional encoding enables the model to capture these correlations explicitly rather than guess them blindly from historical observations; modeling these correlations is the key bottleneck in model performance.

Based on the aforementioned findings, we propose LightWeather, a lightweight and effective weather forecasting model that can collaboratively forecast for worldwide weather stations. It utilizes absolute positional encoding and replaces the main components of the Transformer with an MLP encoder. Benefiting from its simplicity, LightWeather significantly surpasses current Transformer-based models in terms of efficiency. Despite this, LightWeather also achieves state-of-the-art forecasting performance among 13 baselines. Figure 1 visualizes LightWeather's lead in both efficiency and performance. Moreover, it is worth noting that the computational complexity of LightWeather grows only linearly with the number of stations $N$, while its parameter count is independent of $N$. Therefore, LightWeather scales well to fine-grained data with a larger $N$.

Our contributions can be summarized as follows:

  • We innovatively highlight the importance of absolute positional encoding in Transformer-based weather forecasting models. Even in the absence of attention mechanisms, it helps the model explicitly capture spatial-temporal correlations by introducing spatial and temporal knowledge into the model.

  • We propose LightWeather, a lightweight and effective weather forecasting model. We utilize absolute positional encoding and replace the main components of the Transformer with an MLP. The concise structure endows it with high efficiency and scalability to fine-grained data.

  • LightWeather achieves collaborative forecasting for worldwide stations with state-of-the-art performance. Experiments on 5 datasets show that LightWeather outperforms 13 mainstream baselines.

Related Works

DL Methods for Station-based Weather Prediction

Although there has been a great success of radar- or reanalysis-based DL methods (Bi et al. 2023; Lam et al. 2023; Chen et al. 2023), they can only process gridded data and are incompatible with station-based forecasting.

For station-based forecasting, spatial-temporal graph neural networks (STGNNs) have proven effective in modeling the spatial-temporal patterns of weather data (Lin et al. 2022; Ni, Wang, and Fang 2022), but most of them only provide short-term forecasting (i.e., 6 or 12 steps), which limits their applicability.

Recently, Transformer-based approaches have gained more popularity for their capability to capture long-term spatial-temporal correlations. For instance, MGSFformer (Yu et al. 2024a) and MRIformer (Yu et al. 2024b) employ attention mechanisms to capture correlations from multi-resolution data obtained through downsampling. However, attention mechanisms incur quadratic computational complexity for both spatial and temporal correlation modeling, which is unaffordable in global-scale forecasting.

Several studies have attempted to enhance the efficiency of attention mechanisms. Typically, AirFormer (Liang et al. 2023) restricts attention to local information. Corrformer (Wu et al. 2023b) employs a more efficient multi-correlation mechanism to supplant attention. Nevertheless, these optimizations still entail considerable computation and parameters, resulting in limited improvements in efficiency.

Studies of the effectiveness of Transformers

The effectiveness of Transformers has been thoroughly discussed in the fields of computer vision (CV) (Yu et al. 2022; Lin et al. 2024) and natural language processing (NLP) (Bian et al. 2021). In time series forecasting (TSF), LTSF-Linear (Zeng et al. 2023) pioneered this exploration and outperformed a variety of Transformer-based methods with a linear model. Shao et al. (2023) posited that Transformer-based models face over-fitting problems on specific datasets. MTS-Mixers (Li et al. 2023) and MEAformer (Huang et al. 2024) further questioned the necessity of attention mechanisms in Transformers for TSF and replaced them with MLP-based information aggregations. These studies treat positional encoding as supplementary to attention mechanisms and consequently remove it along with them, yet none have recognized the importance of positional encoding itself.

Methodology

Figure 2: Architecture of LightWeather.

Preliminaries

Problem Formulation.

We consider $N$ weather stations, each collecting $C$ meteorological variables (e.g., temperature). The observed data at time $t$ can then be denoted as $\mathbf{X}_{t}\in\mathbb{R}^{N\times C}$. The 3D geographical coordinates of the stations are organized as a matrix $\mathbf{\Theta}\in\mathbb{R}^{3\times N}$, which is naturally accessible in station-based forecasting. Given the historical observations of all stations over the past $T_{h}$ time steps and optional spatial and temporal information, we aim to learn a function $\mathcal{F}(\cdot)$ to forecast the values of the future $T_{f}$ time steps:

\mathbf{Y}_{t:t+T_{f}}=\mathcal{F}(\mathbf{X}_{t-T_{h}:t};\mathbf{\Theta},t), (1)

where $\mathbf{X}_{t-T_{h}:t}\in\mathbb{R}^{T_{h}\times N\times C}$ is the historical data and $\mathbf{Y}_{t:t+T_{f}}\in\mathbb{R}^{T_{f}\times N\times C}$ is the future data.
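As a concrete instance (using the dataset and window sizes described later in the Experiments section, and assuming a single variable per dataset), GlobalWind has $N=3850$ stations, $C=1$, $T_h=48$, and $T_f=24$, so the model maps an input $\mathbf{X}_{t-T_h:t}\in\mathbb{R}^{48\times 3850\times 1}$ to a prediction $\mathbf{Y}_{t:t+T_f}\in\mathbb{R}^{24\times 3850\times 1}$.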

Overview of LightWeather

As illustrated in Figure 2, LightWeather consists of a data embedding layer, an absolute positional encoding layer, an MLP as encoder, and a regression layer. LightWeather replaces the redundant structures in Transformer-based models with a simple MLP, which greatly enhances the efficiency without compromising performance.

Data Embedding

Let $\mathbf{X}^{i,j}\in\mathbb{R}^{T_{h}}$ be the historical time series of station $i$ and variable $j$. The data embedding layer maps $\mathbf{X}^{i,j}$ to an embedding $\mathbf{E}^{i,j}\in\mathbb{R}^{d}$ in latent space:

\mathbf{E}^{i,j}=\mathrm{FC_{embed}}(\mathbf{X}^{i,j}), (2)

where $\mathrm{FC}(\cdot)$ denotes a fully connected layer.

Absolute Positional Encoding

Absolute positional encoding injects information about the absolute positions of tokens in a sequence and is widely regarded as an adjunct to permutation-invariant attention mechanisms. However, we find that it also helps capture spatial-temporal correlations by introducing additional geographical and temporal knowledge into the model.

In our model, absolute positional encoding includes two parts: spatial encoding and temporal encoding.

Spatial Encoding.

Spatial encoding provides the geographical knowledge of stations to the model, enabling it to explicitly model the spatial correlations among worldwide stations. Specifically, we encode the geographical coordinates of each station into the latent space with a simple fully connected layer, so the spatial encoding $\mathbf{S}^{i}\in\mathbb{R}^{d}$ can be denoted as:

\mathbf{S}^{i}=\mathrm{FC_{s}}(\mathbf{\Theta}^{i}), (3)

where $\mathbf{\Theta}^{i}\in\mathbb{R}^{3}$ represents the coordinates of station $i$.

Temporal Encoding.

Temporal encoding provides real-world temporal knowledge to the model. We utilize three learnable embedding matrices $\mathbf{T}\in\mathbb{R}^{24\times d}$, $\mathbf{D}\in\mathbb{R}^{31\times d}$, and $\mathbf{M}\in\mathbb{R}^{12\times d}$ to store the temporal encodings of all time steps (Shao et al. 2022). They represent weather patterns at three scales ($\mathbf{T}$ for the hours of a day, $\mathbf{D}$ for the days of a month, and $\mathbf{M}$ for the months of a year), which helps the model capture the multi-scale temporal correlations of weather. We add them, together with the data embedding and spatial encoding, to obtain $\mathbf{H}^{i,j}_{t}$:

\mathbf{H}^{i,j}_{t}=\mathbf{E}^{i,j}+\mathbf{S}^{i}+\mathbf{T}_{t}+\mathbf{D}_{t}+\mathbf{M}_{t}. (4)
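To make Eqs. (2)-(4) concrete, the following is a minimal PyTorch sketch of the data embedding plus the spatial and temporal encodings. The module name, tensor layout, and 0-based time indices are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn

class STEncoding(nn.Module):
    def __init__(self, t_h: int, d: int):
        super().__init__()
        self.data_embed = nn.Linear(t_h, d)     # FC_embed in Eq. (2)
        self.spatial_embed = nn.Linear(3, d)    # FC_s over (lat, lon, elev), Eq. (3)
        self.hour_embed = nn.Embedding(24, d)   # T: hour of the day
        self.day_embed = nn.Embedding(31, d)    # D: day of the month
        self.month_embed = nn.Embedding(12, d)  # M: month of the year

    def forward(self, x, coords, hour, day, month):
        # x: (N, C, T_h) historical windows; coords: (N, 3); hour/day/month: 0-based ints
        e = self.data_embed(x)                            # (N, C, d), Eq. (2)
        s = self.spatial_embed(coords).unsqueeze(1)       # (N, 1, d), Eq. (3)
        t = (self.hour_embed(torch.as_tensor([hour]))
             + self.day_embed(torch.as_tensor([day]))
             + self.month_embed(torch.as_tensor([month])))  # (1, d), temporal encoding
        return e + s + t                                  # H in Eq. (4), shape (N, C, d)

On GlobalWind, for example, x would have shape (3850, 1, 48) and coords shape (3850, 3).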

Encoder

We utilize an $L$-layer MLP as the encoder to learn the representation $\mathbf{Z}^{i,j}$ from the embedded data $\mathbf{H}^{i,j}$. The $l$-th MLP layer with a residual connection can be denoted as:

(\mathbf{Z}^{i,j})^{l+1}=\mathrm{FC}_{2}^{l}(\sigma(\mathrm{FC}_{1}^{l}((\mathbf{Z}^{i,j})^{l})))+(\mathbf{Z}^{i,j})^{l}, (5)

where $\sigma(\cdot)$ is the activation function and $(\mathbf{Z}^{i,j})^{0}=\mathbf{H}^{i,j}$.
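A hedged PyTorch sketch of one encoder layer in Eq. (5) is given below; the ReLU activation is an assumption, as $\sigma$ is not specified here.

import torch.nn as nn

class ResidualMLPLayer(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.fc1 = nn.Linear(d, d)   # FC_1^l
        self.fc2 = nn.Linear(d, d)   # FC_2^l
        self.act = nn.ReLU()         # sigma, assumed

    def forward(self, z):
        # z: (..., d) hidden representation; the residual keeps the same shape
        return self.fc2(self.act(self.fc1(z))) + z

Note that each layer contains two d-to-d linear maps, which is consistent with the $2Ld(d+1)$ term counted later in Theorem 2.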

Regression Layer

We employ a linear layer to map the representation $\mathbf{Z}\in\mathbb{R}^{d\times N\times C}$ to the specified dimension, yielding the prediction $\hat{\mathbf{Y}}\in\mathbb{R}^{T_{f}\times N\times C}$.

Figure 3: The essential difference that makes LightWeather more effective than other DL methods. (a) DL methods without spatial correlation modeling; (b) DL methods with spatial correlation modeling; (c) LightWeather makes predictions guided by geographical and temporal knowledge, while (a) and (b) are based solely on historical observations. For simplicity, all subscripts of $\boldsymbol{\nu}_{\tau-1:\tau-T_{h}}$ are omitted.

Loss Function

We adopt Mean Absolute Error (MAE) as the loss function for LightWeather. MAE measures the discrepancy between the prediction $\hat{\mathbf{Y}}$ and the ground truth $\mathbf{Y}$ by:

\mathcal{L}(\hat{\mathbf{Y}},\mathbf{Y})=\frac{1}{N\cdot C\cdot T_{f}}\sum_{i=1}^{N}\sum_{j=1}^{C}\sum_{k=1}^{T_{f}}|\hat{\mathbf{Y}}^{i,j}_{k}-\mathbf{Y}^{i,j}_{k}|. (6)
Dataset GlobalWind GlobalTemp Wind_CN Temp_CN Wind_US
Metric MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE
HI 7.285 1.831 14.89 2.575 90.08 6.576 42.45 4.875 5.893 1.733
ARIMA 4.479 1.539 20.93 3.267 57.66 5.292 27.62 3.825 5.992 1.801
Informer 4.716 1.496 33.29 4.415 59.24 5.186 20.39 3.041 3.095 1.300
FEDformer 4.662 1.471 11.05 2.405 53.16 5.039 19.55 3.252 3.363 1.362
DSformer 4.028 1.347 9.544 2.057 51.36 4.923 21.33 3.301 3.266 1.305
PatchTST 3.891 1.332 9.795 2.062 51.33 4.932 23.11 3.426 3.220 1.324
TimesNet 4.724 1.459 11.81 2.327 51.27 4.899 20.21 3.203 3.294 1.312
GPT4TS 4.686 1.456 12.01 2.336 50.87 4.883 20.87 3.253 3.230 1.296
Time-LLM -* -* -* -* 56.34 5.079 30.40 4.189 4.044 1.497
DLinear 4.022 1.350 9.914 2.072 51.18 4.931 22.18 3.400 3.455 1.339
AirFormer 3.810 1.314 31.30 4.127 53.69 5.008 21.49 3.389 3.116 1.272
Corrformer 3.889 1.304 7.709 1.888 51.31 4.908 21.20 3.248 3.362 1.339
MRIformer 3.906 1.318 9.516 1.999 51.48 4.934 22.98 3.408 3.300 1.310
LightWeather (ours) 3.734 1.295 7.420 1.858 50.15 4.869 17.04 2.990 3.047 1.270
±0.003 ±0.002 ±0.010 ±0.002 ±0.03 ±0.001 ±0.03 ±0.005 ±0.003 ±0.003
  * Dashes denote out-of-memory errors.

Table 1: Weather forecasting results on 5 datasets. The best results are in bold and the second best results are underlined.

Theoretical Analysis

In this part, we provide a theoretical analysis of LightWeather, focusing on the effectiveness of its spatial-temporal encoding and on its efficiency.

Effectiveness of LightWeather.

The effectiveness of LightWeather lies in the fact that absolute positional encoding integrates geographical coordinates and real-world time features into the model, which are intrinsically linked to the evolution of atmospheric states in the global weather system. Here we demonstrate this relationship theoretically.

Theorem 1.

Let $\{\lambda,\varphi,z\}$ be the longitude, latitude, and elevation of a weather station and let $\nu$ be a meteorological variable collected by the station. Then the time evolution of $\nu$ is a function of $\lambda,\varphi,z$ and time $t$:

\frac{\partial\nu}{\partial t}=F(\lambda,\varphi,z,t). (7)
Proof.

We provide the proof with zonal wind speed as an example; analogous methods can be applied to other meteorological variables.

According to the basic equations of atmospheric dynamics in spherical coordinates (Marchuk 2012), the zonal wind speed $u$ obeys the equation:

\frac{\mathrm{d}u}{\mathrm{d}t}=-\frac{1}{\rho}\frac{\partial p}{r\cos\varphi\,\partial\lambda}+fv+\frac{uv\tan\varphi}{r}+F_{\lambda}, (8)

where $\frac{\mathrm{d}}{\mathrm{d}t}=\frac{\partial}{\partial t}+u\frac{\partial}{a\cos\varphi\,\partial\lambda}+v\frac{\partial}{a\,\partial\varphi}+w\frac{\partial}{\partial r}$, $p$ is pressure, $\rho$ is atmospheric density, $f$ is the Coriolis parameter, $v$ and $w$ are the meridional and vertical wind speeds, $F_{\lambda}$ is the zonal friction force, and $r$ is the geocentric distance.

The geocentric distance can be further written as $r=a+z$, where $a$ is the radius of the Earth and $z$ is the elevation. Since $a$ is a constant and $a\gg z$, we have $\frac{\partial}{\partial r}=\frac{\partial}{\partial z}$ and can approximate $r$ with $a$.

Expanding the material derivative and moving the advection terms to the right-hand side leaves only the local time derivative on the left:

\begin{split}\frac{\partial u}{\partial t}=&-\left(u\frac{\partial u}{a\cos\varphi\,\partial\lambda}+v\frac{\partial u}{a\,\partial\varphi}+w\frac{\partial u}{\partial z}\right)\\ &-\frac{1}{\rho}\frac{\partial p}{a\cos\varphi\,\partial\lambda}+fv+\frac{uv\tan\varphi}{a}+F_{\lambda}.\end{split} (9)

Every term on the right-hand side is a field over the station's location and time, i.e., a function of $\lambda,\varphi,z$ and $t$. Therefore, we have

\frac{\partial u}{\partial t}=F(\lambda,\varphi,z,t). (10)
∎

Considering the use of historical data spanning $T_{h}$ steps for prediction, it is not difficult to draw the following corollary:

Corollary 1.1.
\nu_{\tau}=\boldsymbol{\alpha}\boldsymbol{\nu}_{\tau-1:\tau-T_{h}}+G(\lambda,\varphi,z,\tau), (11)

where $\boldsymbol{\nu}_{\tau-1:\tau-T_{h}}$ is the historical data and $\boldsymbol{\alpha}\in\mathbb{R}^{T_{h}}$.

The detailed proof is provided in Appendix A.1. According to Eq. (11), predictions of the future decompose into two components: a fit to the historical observations and a model of the function $G(\lambda,\varphi,z,\tau)$, which represents the spatial-temporal correlations. When the scale is small, e.g., in single-station forecasting, even a simple linear model can achieve strong performance (Zeng et al. 2023). However, as the scale expands to the global level, modeling $G$ becomes the key bottleneck of forecasting.

The majority of prior models are designed to fit historical observations more accurately by employing increasingly complex structures. As shown in Figure 3 (a) and (b), they implicitly regard $G$ as a function of historical values, and the complex structures may lead to over-fitting. In comparison, LightWeather can explicitly model $G$ with the $\lambda,\varphi,z,\tau$ introduced by absolute positional encoding, as shown in Figure 3 (c), thereby enhancing predictive performance.

Efficiency of LightWeather.

We theoretically analyze the efficiency of LightWeather from the perspectives of parameter count and computational complexity.

Theorem 2.

The total number of parameters required by LightWeather is $(2Ld+T_{h}+T_{f}+70)(d+1)$.

The proof is provided in Appendix A.2. According to Theorem 2, the parameter count of LightWeather is independent of the number of stations $N$. In addition, the computational complexity of LightWeather grows only linearly with $N$. Canonical spatial correlation modeling methods (Wu et al. 2019) incur quadratic complexity and a parameter count that grows linearly with $N$, which is unaffordable for large-scale station networks. In contrast, LightWeather effectively models the spatial correlations among worldwide stations with $N$-independent parameters and linear complexity, scaling well to fine-grained data.
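As a quick sanity check, plugging in the settings used in our experiments ($L=2$, $T_h=48$, $T_f=24$, and $d=64$ as in the hyperparameter study) gives

(2Ld+T_{h}+T_{f}+70)(d+1)=(256+48+24+70)\times 65=25{,}870,

which matches the roughly 25.8K parameters reported for LightWeather in Table 2. With $d=32$, the count drops to $270\times 33=8{,}910$, consistent with the sub-10k figure mentioned in the hyperparameter study.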

Experiments

Experimental Setup

Datasets.

We conduct extensive experiments on 5 datasets, covering both worldwide and nationwide stations:

  • GlobalWind and GlobalTemp (Wu et al. 2023b) contain the hourly averaged wind speed and temperature, respectively, of 3,850 stations around the world, spanning 2 years with 17,544 time steps.

  • Wind_CN and Temp_CN contain the daily averaged wind speed and temperature, respectively, of 396 stations in China, spanning 10 years with 3,652 time steps.

  • Wind_US (Wang et al. 2019) contains the hourly averaged wind speed of 27 stations in the US, spanning 62 months with 45,252 time steps.

We partition all datasets into training, validation and test sets in a ratio of 7:1:2. More details of the datasets are provided in Appendix B.1.

Baselines.

We compare our LightWeather with the following three categories of baselines:

  • Classic methods: HI (Cui, Xie, and Zheng 2021), ARIMA (Shumway, Stoffer, and Stoffer 2000).

  • Universal DL methods: Informer (Zhou et al. 2021), FEDformer (Zhou et al. 2022), DSformer (Yu et al. 2023), PatchTST (Nie et al. 2023), TimesNet (Wu et al. 2023a), GPT4TS (Zhou et al. 2023), Time-LLM (Jin et al. 2024), DLinear (Zeng et al. 2023).

  • Weather forecasting specialized DL methods: AirFormer (Liang et al. 2023), Corrformer (Wu et al. 2023b), MRIformer (Yu et al. 2024b).

In Appendix B.2, we provide a detailed introduction to the baselines.

Evaluation Metrics.

We evaluate the performances of all baselines by two commonly used metrics: Mean Absolute Error (MAE) and Mean Squared Error (MSE).

Implementation Details.

Consistent with prior studies (Wu et al. 2023b), we set the input length to 48 and the prediction length to 24. Our model supports larger input and output lengths, but many baselines would encounter out-of-memory errors, making a comparison infeasible. We adopt the Adam optimizer (Kingma and Ba 2014) to train our model. The number of MLP layers is 2, and the hidden dimension depends on the dataset, ranging from 64 to 2048. The batch size is set to 32 and the learning rate to 5e-4. All models are implemented with PyTorch 1.10.0 and tested on a single NVIDIA RTX 3090 24GB GPU.
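A hedged sketch of this optimization setup (Adam with learning rate 5e-4, batch size 32, and the MAE loss of Eq. (6)) is shown below. The stand-in model and random tensors are purely illustrative and do not reproduce our training script.

import torch
import torch.nn as nn

T_h, T_f, d, N, C = 48, 24, 64, 100, 1              # N and C here are placeholder sizes

model = nn.Sequential(nn.Linear(T_h, d), nn.ReLU(), nn.Linear(d, T_f))  # stand-in model
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
criterion = nn.L1Loss()                             # MAE, matching Eq. (6)

x = torch.randn(32, N, C, T_h)                      # one batch of input windows
y = torch.randn(32, N, C, T_f)                      # corresponding targets

optimizer.zero_grad()
loss = criterion(model(x), y)                       # forward pass and MAE
loss.backward()                                     # backpropagation
optimizer.step()                                    # one Adam update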

Main Results

Table 1 presents the performance comparison between LightWeather and the other baselines on all datasets. The results of LightWeather are averaged over 5 runs, with standard deviations included. Most Transformer-based models present limited performance, with some even outperformed by a simple linear model. In contrast, LightWeather consistently achieves state-of-the-art performance, surpassing all other baselines with a simple MLP-based architecture. Especially on Temp_CN, LightWeather shows a significant advantage over the second-best model (MSE 17.04 versus 19.55). This indicates that integrating geographical and temporal knowledge can significantly enhance performance, proving more effective than the complex architectures of Transformers. Additionally, we provide a comparison with numerical weather prediction (NWP) methods in Appendix B.3.

Efficiency Analysis

Earlier in the paper, we illustrated the performance-efficiency comparison with mainstream Transformer-based methods in Figure 1. In this part, we further conduct a comprehensive comparison between our model and the other baselines in terms of parameter count, epoch time, and GPU memory usage. Table 2 shows the results. Benefiting from its simple architecture, LightWeather surpasses the other DL methods in both performance and efficiency. Compared with the weather-forecasting-specialized methods, LightWeather demonstrates order-of-magnitude improvements across all three efficiency metrics, being about 6 to 6,000 times smaller, 100 to 300 times faster, and 10 times more memory-efficient, respectively.

Methods   Parameters   Epoch Time (s)   Max Mem. (GB)
Informer 23.94M 37 1.39
FEDformer 31.07M 50 1.63
DSformer 85.99M 250 13.6
PatchTST 424.1K 559 19.22
TimesNet 14.27M 55 3.32
GPT4TS 12.09M 36 1.43
AirFormer 148.7K 2986 14.01
Corrformer 148.7M 11739 18.41
MRIformer 11.66M 3431 12.69
LightWeather 25.8K 30 0.80
Table 2: Efficiency metrics of LightWeather and other Transformer-based methods on GlobalWind dataset. The results are averaged over 3 runs.

Ablation Study

Effects of Absolute Positional Encoding.

Absolute positional encoding is the key component of LightWeather. To study its effects, we first conduct experiments on variants with the spatial encoding and the temporal encoding removed, respectively. The results are shown as Exp. 1 and 2 in Table 3, where we find that removing either encoding leads to an increase in MSE. This indicates that both spatial and temporal encodings are beneficial.

We then compare relative and absolute positional encoding. Specifically, relative spatial encoding embeds the indices of the stations instead of their geographical coordinates. Kindly note that we project the temporal dimension of the data into the hidden space, so there is no need for a relative temporal encoding. The results are shown as Exp. 3 in Table 3. Notably, adopting relative positional encoding causes a degradation in performance. Moreover, comparing Exp. 1 and 3 reveals that relative positional encoding can even result in poorer performance than the absence of positional encoding, which further substantiates the effectiveness of absolute positional encoding.

Exp. ID   Spatial Encoding   Temporal Encoding   GlobalWind   GlobalTemp
1         ✗                  abs.                3.849        7.959
2         abs.               ✗                   3.852        7.907
3         rel.               abs.                3.844        8.410
4         abs.               abs.                3.734        7.420
Table 3: Ablation MSE results of the absolute positional encoding on global weather datasets. Abs. denotes absolute positional encoding, and rel. represents the relative positional encoding. ✗ denotes component removal.

Hyperparameter Study.

We investigate the effects of two important hyperparameters: the number of MLP layers $L$ and the hidden dimension $d$. As illustrated in Figure 4 (a), LightWeather achieves the best performance when $L=2$, whereas increasing $L$ beyond 2 results in over-fitting and a consequent decline in performance. Figure 4 (b) shows that the error metrics decrease as the hidden dimension grows and begin to converge when $d$ exceeds 1024. For efficiency, we chose $d=64$ in the previous experiments, although this setting does not yield the peak performance of our model. Moreover, it should be emphasized that LightWeather outperforms the other Transformer-based models even when $d=32$ and the parameter count is below 10k. This further substantiates that absolute positional encoding is more effective than the complex architectures of Transformer-based models.

Figure 4: Results of the hyperparameter analysis on the GlobalWind dataset. (a) Effect of the number of layers ($d=64$). (b) Effect of the hidden dimension ($L=2$).
Datasets GlobalWind Wind_CN
Metric MSE MAE MSE MAE
PatchTST Original 3.891 1.332 51.33 4.932
+Abs. PE 3.789 1.307 51.02 4.925
DSformer Original 4.028 1.347 51.36 4.923
+ Abs. PE 3.928 1.325 51.12 4.894
MRIformer Original 3.906 1.318 51.48 4.934
+Abs. PE 3.885 1.311 51.46 4.909
Table 4: Improvements obtained by the adoption of absolute positional encoding (Abs. PE).

Generalization of Absolute Positional Encoding

In this section, we further evaluate the effects of absolute positional encoding by applying it to Transformer-based models, with the results reported in Table 4. Only channel-independent (CI) Transformers (Nie et al. 2023) are selected due to our encoding strategy, i.e., we generate spatial embeddings for each station separately. It is evident that absolute positional encoding can significantly enhance the performance of Transformer-based models, enabling them to achieve nearly state-of-the-art results.
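As a hedged illustration of this idea, the sketch below grafts the spatial and hour-of-day encodings onto a generic channel-independent Transformer by adding them to the per-station token embeddings before the encoder. The backbone is a plain nn.TransformerEncoder rather than PatchTST, DSformer, or MRIformer, and the exact injection point used for each baseline is an assumption.

import torch
import torch.nn as nn

class CITransformerWithAbsPE(nn.Module):
    def __init__(self, t_h: int, t_f: int, d: int = 64, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        self.token_embed = nn.Linear(t_h, d)      # per-station series -> token
        self.spatial_embed = nn.Linear(3, d)      # (lat, lon, elev) -> d
        self.hour_embed = nn.Embedding(24, d)     # hour-of-day encoding
        layer = nn.TransformerEncoderLayer(d, n_heads, dim_feedforward=2 * d,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d, t_f)

    def forward(self, x, coords, hour):
        # x: (B, N, T_h) one variable per station; coords: (N, 3); hour: 0-based int
        tokens = (self.token_embed(x) + self.spatial_embed(coords)
                  + self.hour_embed(torch.as_tensor([hour])))    # (B, N, d)
        return self.head(self.encoder(tokens))                   # (B, N, T_f)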

Visualization

Visualization of Forecasting Results.

To intuitively demonstrate the collaborative forecasting capability of LightWeather for worldwide stations, we present a visualization of the forecasting results here. Kriging is employed to interpolate the discrete stations into a continuous surface, facilitating a clearer observation of the variations.

Figure 5: Forecasting results of averaged wind speed, from 11:00 on 15 August 2020 to 11:00 on 16 August 2020 with a 6-hour interval. The resolution is 5° (i.e., $64\times 32$).

The forecasting results are shown in the right column of Figure 5, while the ground-truth values are in the left. Overall, the forecasting results closely align with the ground-truth values, indicating that LightWeather can effectively capture the spatial-temporal patterns of global weather data and make accurate predictions.

Visualization of Positional Encoding.

To better interpret the effectiveness of our model, we visualize the learned embeddings of absolute positional encoding. Due to the high dimension of the embeddings, t-SNE (Van der Maaten and Hinton 2008) is employed to visualize them on two-dimensional planes. The results are shown in Figure 6.

Figure 6 (a) shows that the embeddings of the spatial encoding tend to cluster. Similar embeddings correspond to stations that are either spatially proximate or exhibit analogous climatic patterns. Besides, the embeddings in Figure 6 (b) and (d) form ring-like structures in temporal order, revealing the distinct daily and annual periodicities of weather, which is consistent with our common understanding.

Figure 6: Visualization of absolute positional encoding on the GlobalWind dataset. (a) Spatial encoding $\mathbf{S}$. (b) Temporal encoding of hours $\mathbf{T}$. (c) Temporal encoding of days in a month $\mathbf{D}$. (d) Temporal encoding of months in a year $\mathbf{M}$.

Conclusion

This work innovatively highlights the importance of absolute positional encoding in Transformer-based weather forecasting models. Even in the absence of attention mechanisms, absolute positional encoding can explicitly capture spatial-temporal correlations by integrating geographical coordinates and real-world time features, which are closely related to the evolution of atmospheric states in the global weather system. Subsequently, we present LightWeather, a lightweight and effective weather forecasting model that utilizes absolute positional encoding and replaces the main components of the Transformer with an MLP. Extensive experiments demonstrate that LightWeather achieves satisfactory performance on global weather datasets, and its simple structure endows it with high efficiency and scalability to fine-grained data. This work posits that incorporating geographical and temporal knowledge is more effective than relying on intricate model architectures. This insight is anticipated to have a substantial impact beyond the realm of weather forecasting, extending to prediction tasks that involve geographical information (e.g., air quality and marine hydrology) and illuminating a new direction for DL approaches in these domains. In future work, we plan to integrate the model more closely with physical principles to enhance its interpretability.

References

  • Bauer, Thorpe, and Brunet (2015) Bauer, P.; Thorpe, A.; and Brunet, G. 2015. The quiet revolution of numerical weather prediction. Nature, 525(7567): 47–55.
  • Bernardes et al. (2023) Bernardes, G. F.; Ishibashi, R.; Ivo, A. A.; Rosset, V.; and Kimura, B. Y. 2023. Prototyping low-cost automatic weather stations for natural disaster monitoring. Digital Communications and Networks, 9(4): 941–956.
  • Bi et al. (2023) Bi, K.; Xie, L.; Zhang, H.; Chen, X.; Gu, X.; and Tian, Q. 2023. Accurate medium-range global weather forecasting with 3D neural networks. Nature, 619(7970): 533–538.
  • Bian et al. (2021) Bian, Y.; Huang, J.; Cai, X.; Yuan, J.; and Church, K. 2021. On attention redundancy: A comprehensive study. In Proceedings of the 2021 conference of the north american chapter of the association for computational linguistics: human language technologies, 930–945.
  • Chen et al. (2023) Chen, L.; Du, F.; Hu, Y.; Wang, Z.; and Wang, F. 2023. Swinrdm: integrate swinrnn with diffusion model towards high-resolution and high-quality weather forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, 322–330.
  • Cui, Xie, and Zheng (2021) Cui, Y.; Xie, J.; and Zheng, K. 2021. Historical inertia: A neglected but powerful baseline for long sequence time-series forecasting. In Proceedings of the 30th ACM international conference on information & knowledge management, 2965–2969.
  • Deng et al. (2024) Deng, J.; Song, X.; Tsang, I. W.; and Xiong, H. 2024. Parsimony or Capability? Decomposition Delivers Both in Long-term Time Series Forecasting. arXiv preprint arXiv:2401.11929.
  • Huang et al. (2024) Huang, S.; Liu, Y.; Cui, H.; Zhang, F.; Li, J.; Zhang, X.; Zhang, M.; and Zhang, C. 2024. MEAformer: An all-MLP transformer with temporal external attention for long-term time series forecasting. Information Sciences, 669: 120605.
  • Jin et al. (2024) Jin, M.; Wang, S.; Ma, L.; Chu, Z.; Zhang, J. Y.; Shi, X.; Chen, P.-Y.; Liang, Y.; Li, Y.-F.; Pan, S.; et al. 2024. Time-LLM: Time Series Forecasting by Reprogramming Large Language Models. In The Twelfth International Conference on Learning Representations.
  • Kingma and Ba (2014) Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Lam et al. (2023) Lam, R.; Sanchez-Gonzalez, A.; Willson, M.; Wirnsberger, P.; Fortunato, M.; Alet, F.; Ravuri, S.; Ewalds, T.; Eaton-Rosen, Z.; Hu, W.; et al. 2023. Learning skillful medium-range global weather forecasting. Science, 382(6677): 1416–1421.
  • Li et al. (2023) Li, Z.; Rao, Z.; Pan, L.; and Xu, Z. 2023. Mts-mixers: Multivariate time series forecasting via factorized temporal and channel mixing. arXiv preprint arXiv:2302.04501.
  • Liang et al. (2023) Liang, Y.; Xia, Y.; Ke, S.; Wang, Y.; Wen, Q.; Zhang, J.; Zheng, Y.; and Zimmermann, R. 2023. Airformer: Predicting nationwide air quality in china with transformers. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, 14329–14337.
  • Lin et al. (2022) Lin, H.; Gao, Z.; Xu, Y.; Wu, L.; Li, L.; and Li, S. Z. 2022. Conditional local convolution for spatio-temporal meteorological forecasting. In Proceedings of the AAAI conference on artificial intelligence, volume 36, 7470–7478.
  • Lin et al. (2024) Lin, S.; Lyu, P.; Liu, D.; Tang, T.; Liang, X.; Song, A.; and Chang, X. 2024. MLP Can Be A Good Transformer Learner. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 19489–19498.
  • Marchuk (2012) Marchuk, G. 2012. Numerical methods in weather prediction. Elsevier.
  • Ni, Wang, and Fang (2022) Ni, Q.; Wang, Y.; and Fang, Y. 2022. GE-STDGN: a novel spatio-temporal weather prediction model based on graph evolution. Applied Intelligence, 52(7): 7638–7652.
  • Nie et al. (2023) Nie, Y.; Nguyen, N. H.; Sinthong, P.; and Kalagnanam, J. 2023. A Time Series is Worth 64 Words: Long-term Forecasting with Transformers. In The Eleventh International Conference on Learning Representations.
  • Schultz et al. (2021) Schultz, M. G.; Betancourt, C.; Gong, B.; Kleinert, F.; Langguth, M.; Leufen, L. H.; Mozaffari, A.; and Stadtler, S. 2021. Can deep learning beat numerical weather prediction? Philosophical Transactions of the Royal Society A, 379(2194): 20200097.
  • Shao et al. (2023) Shao, Z.; Wang, F.; Xu, Y.; Wei, W.; Yu, C.; Zhang, Z.; Yao, D.; Jin, G.; Cao, X.; Cong, G.; et al. 2023. Exploring progress in multivariate time series forecasting: Comprehensive benchmarking and heterogeneity analysis. arXiv preprint arXiv:2310.06119.
  • Shao et al. (2022) Shao, Z.; Zhang, Z.; Wang, F.; Wei, W.; and Xu, Y. 2022. Spatial-temporal identity: A simple yet effective baseline for multivariate time series forecasting. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, 4454–4458.
  • Shumway, Stoffer, and Stoffer (2000) Shumway, R. H.; Stoffer, D. S.; and Stoffer, D. S. 2000. Time series analysis and its applications, volume 3. Springer.
  • Sose and Sayyad (2016) Sose, D. V.; and Sayyad, A. D. 2016. Weather monitoring station: a review. Int. Journal of Engineering Research and Application, 6(6): 55–60.
  • Tenzin et al. (2017) Tenzin, S.; Siyang, S.; Pobkrut, T.; and Kerdcharoen, T. 2017. Low cost weather station for climate-smart agriculture. In 2017 9th international conference on knowledge and smart technology (KST), 172–177. IEEE.
  • Van der Maaten and Hinton (2008) Van der Maaten, L.; and Hinton, G. 2008. Visualizing data using t-SNE. Journal of machine learning research, 9(11).
  • Vaswani et al. (2017) Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. Advances in neural information processing systems, 30.
  • Wang et al. (2019) Wang, B.; Lu, J.; Yan, Z.; Luo, H.; Li, T.; Zheng, Y.; and Zhang, G. 2019. Deep uncertainty quantification: A machine learning approach for weather forecasting. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, 2087–2095.
  • Wu et al. (2023a) Wu, H.; Hu, T.; Liu, Y.; Zhou, H.; Wang, J.; and Long, M. 2023a. TimesNet: Temporal 2D-Variation Modeling for General Time Series Analysis. In The Eleventh International Conference on Learning Representations.
  • Wu et al. (2023b) Wu, H.; Zhou, H.; Long, M.; and Wang, J. 2023b. Interpretable weather forecasting for worldwide stations with a unified deep model. Nature Machine Intelligence, 5(6): 602–611.
  • Wu et al. (2019) Wu, Z.; Pan, S.; Long, G.; Jiang, J.; and Zhang, C. 2019. Graph wavenet for deep spatial-temporal graph modeling. arXiv preprint arXiv:1906.00121.
  • Yu et al. (2023) Yu, C.; Wang, F.; Shao, Z.; Sun, T.; Wu, L.; and Xu, Y. 2023. Dsformer: A double sampling transformer for multivariate time series long-term prediction. In Proceedings of the 32nd ACM international conference on information and knowledge management, 3062–3072.
  • Yu et al. (2024a) Yu, C.; Wang, F.; Wang, Y.; Shao, Z.; Sun, T.; Yao, D.; and Xu, Y. 2024a. MGSFformer: A Multi-Granularity Spatiotemporal Fusion Transformer for Air Quality Prediction. Information Fusion, 102607.
  • Yu et al. (2024b) Yu, C.; Yan, G.; Yu, C.; Liu, X.; and Mi, X. 2024b. MRIformer: A multi-resolution interactive transformer for wind speed multi-step prediction. Information Sciences, 661: 120150.
  • Yu et al. (2022) Yu, W.; Luo, M.; Zhou, P.; Si, C.; Zhou, Y.; Wang, X.; Feng, J.; and Yan, S. 2022. Metaformer is actually what you need for vision. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 10819–10829.
  • Zeng et al. (2023) Zeng, A.; Chen, M.; Zhang, L.; and Xu, Q. 2023. Are transformers effective for time series forecasting? In Proceedings of the AAAI conference on artificial intelligence, volume 37, 11121–11128.
  • Zhou et al. (2021) Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; and Zhang, W. 2021. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI conference on artificial intelligence, volume 35, 11106–11115.
  • Zhou et al. (2022) Zhou, T.; Ma, Z.; Wen, Q.; Wang, X.; Sun, L.; and Jin, R. 2022. Fedformer: Frequency enhanced decomposed transformer for long-term series forecasting. In International conference on machine learning, 27268–27286. PMLR.
  • Zhou et al. (2023) Zhou, T.; Niu, P.; Sun, L.; Jin, R.; et al. 2023. One fits all: Power general time series analysis by pretrained lm. Advances in neural information processing systems, 36: 43322–43355.

Appendix A Theoretical Proofs

A.1 Proof of Corollary 1.1

Proof.

According to Theorem 1, we have

\nu_{\tau}=\nu_{\tau-1}+\int_{(\tau-1)\Delta t}^{\tau\Delta t}F(\lambda,\varphi,z,t)\,\mathrm{d}t, (12)

where $\Delta t$ is the interval between time steps.

For brevity, $\int_{(\tau-1)\Delta t}^{\tau\Delta t}F(\lambda,\varphi,z,t)\,\mathrm{d}t$ is denoted as $I_{\tau}=I(\lambda,\varphi,z,\tau)$. We can split $\nu_{\tau-1}$ into two parts; then we have

\begin{split}\nu_{\tau}&=\alpha_{1}\nu_{\tau-1}+(1-\alpha_{1})\nu_{\tau-1}+I_{\tau}\\ &=\alpha_{1}\nu_{\tau-1}+(1-\alpha_{1})\left(\nu_{\tau-2}+I_{\tau-1}\right)+I_{\tau}.\end{split} (13)

By repeating this procedure for the subsequent values of $\nu$, we have

\begin{split}\nu_{\tau}=&\,\alpha_{1}\nu_{\tau-1}+\alpha_{2}\nu_{\tau-2}+\cdots+\alpha_{T_{h}-1}\nu_{\tau-T_{h}+1}\\ &+\left(1-\alpha_{1}-\alpha_{2}-\cdots-\alpha_{T_{h}-1}\right)\nu_{\tau-T_{h}}\\ &+I_{\tau}+\left(1-\alpha_{1}\right)I_{\tau-1}+\left(1-\alpha_{1}-\alpha_{2}\right)I_{\tau-2}\\ &+\cdots+\left(1-\alpha_{1}-\alpha_{2}-\cdots-\alpha_{T_{h}-1}\right)I_{\tau-T_{h}}.\end{split} (14)

Let $\alpha_{T_{h}}=1-\alpha_{1}-\cdots-\alpha_{T_{h}-1}$ and let $\boldsymbol{\alpha}$ denote $\{\alpha_{1},\alpha_{2},\cdots,\alpha_{T_{h}}\}$; then Eq. (14) can be expressed as

\begin{split}\nu_{\tau}=&\,\boldsymbol{\alpha}\boldsymbol{\nu}_{\tau-T_{h}:\tau-1}+I_{\tau}+\left(1-\alpha_{1}\right)I_{\tau-1}\\ &+\left(1-\alpha_{1}-\alpha_{2}\right)I_{\tau-2}+\cdots\\ &+\left(1-\alpha_{1}-\alpha_{2}-\cdots-\alpha_{T_{h}-1}\right)I_{\tau-T_{h}}.\end{split} (15)

The weighted sum of the $I$ terms is a function of $\lambda,\varphi,z,\tau$; thus we have

\nu_{\tau}=\boldsymbol{\alpha}\boldsymbol{\nu}_{\tau-T_{h}:\tau-1}+G(\lambda,\varphi,z,\tau). (16)
∎

A.2 Proof of Theorem 2

Proof.

The data embedding layer maps the input data into a latent space of dimension $d$, thereby introducing $T_{h}(d+1)$ parameters. Analogously, the regression layer introduces $T_{f}(d+1)$ parameters. For the positional encoding, the spatial encoding costs $3(d+1)$ parameters and the temporal encoding costs $(24+31+12)(d+1)$ parameters. The parameter count of an $L$-layer MLP with residual connections is $2Ld(d+1)$. Thus, the total number of parameters required by LightWeather is $(2Ld+T_{h}+T_{f}+70)(d+1)$. ∎

Appendix B Experimental Details

B.1 Datasets Description

In order to evaluate the comprehensive performance of the proposed model, we conduct experiments on 5 weather datasets with different temporal resolutions and spatial coverages including:

  • GlobalWind and GlobalTemp are collected from the National Centers for Environmental Information (NCEI) (https://www.ncei.noaa.gov/data/global-hourly/access). These datasets contain the hourly averaged wind speed and temperature of 3,850 stations around the world, spanning two years from 1 January 2019 to 31 December 2020. Kindly note that these datasets are rescaled (multiplied by ten) from the raw data.

  • Wind_CN and Temp_CN are also collected from NCEI (https://www.ncei.noaa.gov/data/global-summary-of-the-day/access). These datasets contain the daily averaged wind speed and temperature of 396 stations in China (382 stations for Temp_CN due to missing values), spanning 10 years from 1 January 2013 to 31 December 2022.

  • Wind_US is collected from Kaggle (https://www.kaggle.com/datasets/selfishgene/historical-hourly-weather-data). It contains the hourly averaged wind speed of 27 stations in the US, spanning 62 months from 1 October 2012 to 30 November 2017. The original dataset only provides the latitudes and longitudes of the stations, so we obtain their elevations through Google Earth (https://earth.google.com/web).

The statistics of datasets are shown in Table 5, and the distributions of stations are shown in Figure 7.

Table 5: Statistics of datasets.
Dataset   Coverage   Station Num   Sample Rate   Time Span   Length
GlobalWind Global 3850 1 hour 2 years 17,544
GlobalTemp Global 3850 1 hour 2 years 17,544
Wind_CN National 396 1 day 10 years 3,652
Temp_CN National 382 1 day 10 years 3,652
Wind_US National 27 1 hour 62 months 45,252
(a) GlobalWind and GlobalTemp
(b) Wind_CN and Temp_CN
(c) Wind_US
Figure 7: Distributions of the stations.

B.2 Introduction to the Baselines

  • HI (Cui, Xie, and Zheng 2021), short for historical inertia, is a simple baseline that adopts the most recent historical data as the prediction results.

  • ARIMA (Shumway, Stoffer, and Stoffer 2000), short for autoregressive integrated moving average, is a statistical forecasting method that uses a combination of historical values to predict future values.

  • Informer (Zhou et al. 2021) is a Transformer with a sparse self-attention mechanism.

  • FEDformer (Zhou et al. 2022) is a frequency enhanced Transformer combined with seasonal-trend decomposition to capture the overall trend of time series.

  • DSformer (Yu et al. 2023) utilizes double sampling blocks to model both local and global information.

  • PatchTST (Nie et al. 2023) divides the input time series into patches, which serve as input tokens of Transformer.

  • TimesNet (Wu et al. 2023a) transforms time series into 2D tensors and utilizes a task-general backbone that can adaptively discover multi-periodicity and extract complex temporal variations from transformed 2D tensors.

  • GPT4TS (Zhou et al. 2023) employs a pretrained GPT-2 backbone and a linear layer to obtain the output.

  • Time-LLM (Jin et al. 2024) reprograms the input time series with text prototypes before feeding it into the frozen LLM backbone to align time series with natural language modality.

  • DLinear (Zeng et al. 2023) is a linear model with time series decomposition.

  • AirFormer (Liang et al. 2023) employs a dartboard-like mapping and local windows to restrict attention to focusing solely on local information.

  • Corrformer (Wu et al. 2023b) utilizes a decomposition framework and replaces attention mechanisms with a more efficient multi-correlation mechanism.

  • MRIformer (Yu et al. 2024b) employs a hierarchical tree structure, stacking attention layers to capture correlations from multi-resolution data obtained by downsampling.

Table 6: Forecasting results from NWP methods and our model on global weather datasets.
Dataset GlobalWind GlobalTemp
Metric MSE MAE MSE MAE
ERA5 (0.5°) 6.793 1.847 28.07 3.270
GFS (0.25°) 9.993 2.340 14.93 2.287
LightWeather 3.734 1.295 7.420 1.858

B.3 Comparison with Numerical Methods

In this section, we compare our model to numerical weather prediction (NWP) methods. Conventional NWP methods use partial differential equations (PDEs) to describe atmospheric state transitions across grid points and solve them through numerical simulation. Currently, ERA5 from the European Centre for Medium-Range Weather Forecasts (ECMWF) and the Global Forecast System (GFS) from NOAA are among the most advanced global forecasting products. ERA5 provides gridded global forecasts at a 0.5° resolution, while GFS does so at a 0.25° resolution.

To make the comparison practical, we utilize bilinear interpolation with height correction to obtain results at the scattered stations, which is aligned with common practice in weather forecasting (Bauer, Thorpe, and Brunet 2015). The results are shown in Table 6.
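The sketch below illustrates one way such a grid-to-station interpolation with a simple height correction could be implemented; the 6.5 K/km lapse rate and the exact correction scheme are illustrative assumptions rather than the procedure used for Table 6.

import numpy as np
from scipy.interpolate import RegularGridInterpolator

def grid_to_stations(field, grid_lat, grid_lon, grid_elev, st_lat, st_lon, st_elev,
                     lapse_rate=6.5e-3):
    # field and grid_elev are 2D arrays over (lat, lon); st_* are 1D per-station arrays.
    interp_f = RegularGridInterpolator((grid_lat, grid_lon), field, method="linear")
    interp_z = RegularGridInterpolator((grid_lat, grid_lon), grid_elev, method="linear")
    pts = np.stack([st_lat, st_lon], axis=-1)        # (num_stations, 2) query points
    vals = interp_f(pts)                             # bilinear interpolation to stations
    dz = st_elev - interp_z(pts)                     # station height minus grid height
    return vals - lapse_rate * dz                    # lapse-rate correction (temperature only)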

Both ERA5 and GFS fail to provide accurate predictions, which indicates that grid-based NWP methods are inadequate for fine-grained station-based prediction. Conversely, LightWeather accurately forecasts the weather for worldwide stations, significantly outperforming the numerical methods.

Appendix C Overall Workflow

The overall workflow of LightWeather is shown in Algorithm 1.

Algorithm 1 Overall workflow of LightWeather.
Require: historical data $\mathbf{X}\in\mathbb{R}^{T_{h}\times N\times C}$, geographical coordinates $\boldsymbol{\Theta}\in\mathbb{R}^{N\times 3}$, the first time step $t$
Ensure: forecasting result $\mathbf{Y}\in\mathbb{R}^{T_{f}\times N\times C}$
1:  $\mathbf{X}=\mathbf{X}.\text{transpose}(1,-1)$  /* $\mathbf{X}\in\mathbb{R}^{C\times N\times T_{h}}$ */
2:  $\mathbf{E}=\text{FC}_{\text{embed}}(\mathbf{X})$  /* Data embedding, $\mathbf{E}\in\mathbb{R}^{C\times N\times d}$ */
3:  $\mathbf{S}=\text{FC}_{\text{S}}(\boldsymbol{\Theta})$  /* $\mathbf{S}\in\mathbb{R}^{N\times d}$ */
4:  $\mathbf{S}=\mathbf{S}.\text{repeat}(C,1,1)$  /* $\mathbf{S}\in\mathbb{R}^{C\times N\times d}$ */
5:  $hour, day, mon=\text{time\_feature}(t)$  /* Obtain the hour, day, and month from $t$ */
6:  $\mathbf{T}=\mathbf{T}[hour].\text{repeat}(C,N,1)$
7:  $\mathbf{D}=\mathbf{D}[day].\text{repeat}(C,N,1)$
8:  $\mathbf{M}=\mathbf{M}[mon].\text{repeat}(C,N,1)$
9:  $\mathbf{Z}_{0}=\mathbf{E}+\mathbf{S}+\mathbf{T}+\mathbf{D}+\mathbf{M}$  /* $\mathbf{Z}_{0}\in\mathbb{R}^{C\times N\times d}$ */ /* MLP encoder */
10:  for $l$ in $\{0,1,\cdots,L-1\}$ do
11:     $\mathbf{Z}_{l+1}=\text{FC}_{2}(\sigma(\text{FC}_{1}(\mathbf{Z}_{l})))+\mathbf{Z}_{l}$
12:  end for
13:  $\mathbf{Y}=\text{FC}(\mathbf{Z}_{L})$  /* Regression layer, $\mathbf{Y}\in\mathbb{R}^{C\times N\times T_{f}}$ */
14:  $\mathbf{Y}=\mathbf{Y}.\text{transpose}(1,-1)$  /* $\mathbf{Y}\in\mathbb{R}^{T_{f}\times N\times C}$ */
15:  return $\mathbf{Y}$