These authors contributed equally to this work.

Hao Li [2]

Xiaoming Shi [1,5]

[1] Division of Environment and Sustainability, The Hong Kong University of Science and Technology, Hong Kong, China

[2] Artificial Intelligence Innovation and Incubation Institute, Fudan University, Shanghai, China

[3] Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Hong Kong, China

[4] Hong Kong Observatory, Hong Kong, China

[5] Center for Ocean Research in Hong Kong and Macau, The Hong Kong University of Science and Technology, Hong Kong, China

Skillful Nowcasting of Convective Clouds With a Cascade Diffusion Model

Haoming Chen, Xiaohui Zhong, Qiang Zhai, Xiaomeng Li, Ying Wa Chan, Pak Wai Chan, Yuanyuan Huang

Abstract

Accurate nowcasting of convective clouds from satellite imagery is essential for mitigating the impacts of meteorological disasters, especially in developing countries and remote regions with limited ground-based observations. Recent advances in deep learning have shown promise in video prediction; however, existing models frequently produce blurry results and exhibit reduced accuracy when forecasting physical fields. Here, we introduce SATcast, a diffusion model that leverages a cascade architecture and multimodal inputs for nowcasting cloud fields in satellite imagery. SATcast incorporates physical fields predicted by FuXi, a deep-learning weather model, alongside past satellite observations as conditional inputs to generate high-quality future cloud fields. Through comprehensive evaluation, SATcast outperforms conventional methods on multiple metrics, demonstrating its superior accuracy and robustness. Ablation studies underscore the importance of its multimodal design and the cascade architecture in achieving reliable predictions. Notably, SATcast maintains predictive skill for up to 24 hours, underscoring its potential for operational nowcasting applications.

keywords:

nowcasting, satellite, deep learning, FuXi, cascade

1 Introduction

Small-scale deep convection and mesoscale convective systems (MCSs) frequently lead to severe meteorological disasters, such as flooding, wind gusts, and turbulence [1, 2]. The United Nations’ "Early Warnings for All" initiative aims to protect global populations from weather hazards by enhancing early warning systems capable of predicting the timing, location, and intensity of such convective events [3]. Satellite observations play an indispensable role in monitoring the formation and evolution of convective systems, offering significantly broader spatial coverage than traditional radar systems. This advantage is particularly critical in regions with limited radar infrastructure, such as over oceans and in developing countries, where advanced radar networks are sparse or nonexistent [4]. To address these gaps, satellite-based nowcasting techniques have been developed. For example, the World Meteorological Organization (WMO) recently issued guidelines for satellite-based nowcasting in Africa, emphasizing its importance in areas with limited observational resources [3].

Traditional methods for satellite and radar-based nowcasting, such as the Lucas-Kanade optical flow algorithm, efficiently track MCS movement but struggle to capture intensity variations because they assume brightness constancy [5]. Similarly, systems such as the Short-Term Ensemble Prediction System offer uncertainty estimates but often smooth out small-scale convective details, since they use stochastic noise to model the evolution of small-scale convection [6]. These limitations have catalyzed the adoption of deep learning models, which do not rely on these assumptions and can produce more reliable forecasts [7, 8, 9].

The evolution of deep learning in nowcasting has yielded promising advances, although certain challenges persist. Forecast horizons for satellite or radar-based predictions rarely exceed six hours due to cumulative errors, and deep learning models often produce blurred outputs [10, 11, 12, 13]. For instance, Shi et al. [14] introduced convolutional Long Short-Term Memory (ConvLSTM) networks for radar image time series forecasting, demonstrating superior performance in precipitation nowcasting with lead times of up to 2 hours. More recently, DaYu, which integrates transformer [15] and residual convolution layers, has extended the skillful forecast lead time for satellite-based nowcasting to 12 hours [13]. However, its outputs become overly smooth after 3 hours, particularly in representing typhoon structure and the eye’s position (see Figures 3 and 4). Wang et al. [16] used Generative Adversarial Networks (GANs) [17] to generate high-quality radar images, achieving skillful forecasts up to 2 hours. Kim et al. [18] incorporated European Centre for Medium-Range Weather Forecasts Reanalysis version 5 (ERA5) data into a deep learning model, improving precipitation forecasts within six hours compared to those generated using only radar image inputs. However, the production schedule of ERA5 data is not compatible with the requirements of nowcasting, and Kim et al. [18] suggest that numerical weather prediction (NWP) data could serve as a viable alternative. Nevertheless, NWP models face significant challenges, including high computational costs and uncertainties in parameterization schemes for gray-zone convection, which limit their ability to accurately forecast convective activity [19, 20]. As a result, data-driven nowcasting techniques that operate independently of NWP models offer a more cost-effective and practical solution [8, 9].

Physical variables generated by either deep learning or NWP models differ significantly in information content from radar and satellite observations, making them multimodal data sources. The integration of multimodal data is a key method for enhancing model performance, and further research is needed on how to effectively process and combine these diverse data types in deep learning models for atmospheric science. For example, the Contrastive Language-Image Pretraining model employs transformer and convolutional layers to encode text and images, enabling a more flexible classifier that serves as a foundation for subsequent multimodal models [21]. In atmospheric science, MetNet-3 [22] employs different ResNet blocks [23] to process multi-resolution data, thereby extending skillful forecast lead times up to 24 hours. Wang et al. [24] use a cross-attention module to merge NWP results and radar images, extending precipitation forecasts up to six hours. Therefore, integrating physical variables and satellite imagery into deep learning models is beneficial, but bridging the gap between these data types remains challenging.

In this study, we introduce SATcast for predicting convective clouds observed by the FengYun-4A (FY-4A) geostationary satellite. Diffusion models, which are state-of-the-art in text-to-image and video generation in computer vision [25, 26], are employed for this purpose. The FY-4A satellite plays a crucial role in monitoring weather patterns, climate variations, environmental changes, and natural disasters across East Asia [27]. Our physical-driven model innovatively incorporates atmospheric physical conditions predicted by FuXi [8], which has demonstrated remarkable accuracy in deterministic weather predictions, as conditioning inputs. Additionally, we employ a cascade structure to mitigate the error accumulation in nowcasting. This multimodal, cascade-based deep learning nowcasting model achieves remarkable performance in predicting convective clouds with forecast lead times of up to 24 hours, addressing the long-standing issues of blurriness in nowcasting.

2 Results

We train our nowcasting model using four years (2019-2022) of hourly FY-4A satellite data from the infrared channel (channel 12) of the Advanced Geosynchronous Radiation Imager (AGRI). Channel 12 is selected as it is specifically designed for water vapor detection and mid-troposphere monitoring, making it particularly effective for tracking cloud system development and convective processes [28, 29]. The diffusion model employs a U-Net structure enhanced with ConvNeXt modules [30] and self-attention modules [15] (Fig. 6).

The model combines FuXi forecasts of atmospheric dynamic and thermodynamic fields with historical satellite imagery to predict cloud evolution. Detailed information about the specific FuXi forecast variables used is provided in Supplementary Table 1. The primary model, SATcast, implements a cascade, autoregressive, and multimodal framework. Its first stage, referred to as SATcast-phase 1, generates four-hour forecasts (T+1 to T+4) of satellite imagery based on 12 hours (T$-$7 to T+4) of FuXi forecasts and eight hours (T$-$7 to T) of historical satellite imagery. To extend forecasts beyond the initial four-hour period, the model iteratively uses its outputs as inputs for subsequent predictions. To improve forecast accuracy for longer lead times, a separate model, SATcast-phase 2, which is fine-tuned specifically for lead times of 5 to 8 hours, is employed in a cascaded framework. This model shares the same structure as SATcast-phase 1 but uses independent parameters optimized for longer lead times. Further details regarding the model design, training process, and evaluation methodology are provided in the Methods section.

Furthermore, to assess the contributions of key components in the framework, we conduct ablation studies using three model variants: SATcast-NoC, which removes the cascade and autoregressive framework and generates eight-hour forecasts in a single step; SATcast-NoF, which excludes FuXi forecasts as conditioning inputs; and SATcast-NoT, which excludes the cascaded model for longer lead times. As baselines, we use a persistence model, which assumes that convective cloud patterns remain constant across all lead times [31], and optical flow, a popular method in practical nowcasting. The results from optical flow are compared with those of SATcast. Model performance is evaluated from multiple perspectives, including statistical metrics, case studies, and the clarity of weather system structures.

2.1 Overall performance of SATcast

Previous studies have employed multiple deep learning models, each tailored to specific lead time intervals, to mitigate the accumulation of prediction errors [8, 9]. In this study, constrained by computational resources, we train only two models within a cascade framework. SATcast-phase 1 predicts cloud evolution for lead times from T+1 to T+4, while SATcast-phase 2 utilizes outputs from SATcast-phase 1 to generate predictions for lead times from T+5 to T+8. For lead times beyond T+8, we evaluate model performance by iteratively applying SATcast-phase 1 to generate satellite image forecasts.

Figure 1: Comparison of the RMSE, correlation coefficient, MSSSIM, and CSI, spatially averaged over $86^{\circ}$E to $150^{\circ}$E in longitude and $1^{\circ}$N to $41^{\circ}$N in latitude, for the persistence model, SATcast, and optical flow. Results are computed on testing data from September to December 2022. The CSI is calculated with a threshold of 240 K. The red shading represents the one-standard-deviation range of SATcast.

Figure 1 compares the root mean square error (RMSE), correlation coefficient, multi-scale structural similarity (MSSSIM) [32], and critical success index (CSI) between SATcast, the persistence model, and the optical flow method. The CSI is calculated using a threshold of 240 K. Results are spatially averaged over the region of interest ($86^{\circ}$E to $150^{\circ}$E in longitude and $1^{\circ}$N to $41^{\circ}$N in latitude), based on 512 testing sequences, each spanning 32 hours (the first 8 hours serve as input, and the following 24 hours as the target), from September to December 2022. The red shading for SATcast denotes one standard deviation at each lead time, calculated across samples with different initialization times.
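The pointwise metrics above can be sketched in a few lines of NumPy. This is a minimal illustration, not the evaluation code used in the study: the function names are hypothetical, the fields are random toy data, and the CSI assumes that an event is a brightness temperature at or below the 240 K threshold (the text gives the threshold but not the direction of the comparison). MSSSIM is omitted because it requires a multi-scale filtering pipeline.

```python
import numpy as np

def rmse(pred, obs):
    """Root mean square error over all grid points."""
    return float(np.sqrt(np.mean((pred - obs) ** 2)))

def correlation(pred, obs):
    """Pearson correlation coefficient between two fields."""
    return float(np.corrcoef(pred.ravel(), obs.ravel())[0, 1])

def csi(pred, obs, threshold=240.0):
    """Critical success index: hits / (hits + misses + false alarms).

    Assumption: an event is a brightness temperature at or below `threshold`.
    """
    pred_event = pred <= threshold
    obs_event = obs <= threshold
    hits = np.sum(pred_event & obs_event)
    misses = np.sum(~pred_event & obs_event)
    false_alarms = np.sum(pred_event & ~obs_event)
    total = hits + misses + false_alarms
    return float(hits / total) if total > 0 else float("nan")

def persistence_forecast(last_frame, n_leads):
    """Persistence baseline: repeat the last observed frame for every lead time."""
    return np.repeat(last_frame[None, ...], n_leads, axis=0)

# Toy data standing in for brightness temperature fields (K) on a 160 x 256 grid.
rng = np.random.default_rng(0)
last_observed = 260.0 + 20.0 * rng.standard_normal((160, 256))
observed = 260.0 + 20.0 * rng.standard_normal((24, 160, 256))
forecast = persistence_forecast(last_observed, n_leads=24)

for lead in (1, 24):
    p, o = forecast[lead - 1], observed[lead - 1]
    print(f"T+{lead}: RMSE={rmse(p, o):.2f} K, r={correlation(p, o):.3f}, CSI={csi(p, o):.3f}")
```

Averaging these per-sample values over all test sequences at each lead time would give curves of the kind shown in Figure 1.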

As shown in Figure 1, all metrics for SATcast and the persistence model degrade sharply up to T+8 but stabilize after T+12. In contrast, the performance of the optical flow method decreases consistently. Notably, the persistence model exhibits slight performance improvements between T+20 and T+24, likely due to the influence of the diurnal cycle [33]. Overall, SATcast consistently outperforms both the persistence model and the optical flow method across all metrics. The correlation coefficient for SATcast stabilizes above 0.7, significantly surpassing the persistence baseline, which drops below 0.5 after T+12. Similarly, SATcast achieves a CSI of around 0.5, indicating reliable detection of convective clouds, while the persistence model’s CSI falls below 0.3. The MSSSIM for SATcast remains approximately 0.6, suggesting that the quality of predicted cloud fields remains high even as the lead time increases. In contrast, the MSSSIM for the persistence model is about 0.45, due to variations in weather systems. These results demonstrate that the cascade framework with single-round fine-tuning effectively reduces cumulative errors, resulting in minimal degradation of predictive performance beyond T+8. The stability of SATcast’s metrics beyond T+12 further underscores its reliability for longer-term cloud predictions.

Figure 2 demonstrates SATcast’s 24-hour forecasting capability using a typhoon case. The spatial distribution of brightness temperature from the optical flow method becomes very smooth after several time steps. We therefore focus on SATcast’s performance, with the optical flow results presented in Supplementary Figure S12. SATcast successfully captures the spatiotemporal evolution of a large-scale convection system moving from northern China to Japan. It also accurately predicts two tropical disturbances east of the Philippines, which later intensified into Typhoon Sonca and Typhoon Nesat. The model reproduces the complex intensity fluctuations, showing an initial weakening followed by intensification, due to interactions with cold eddies, within the 24-hour period. Predicted typhoon intensity and track closely align with observations, as shown in Supplementary Figure S1, with additional cases provided in Supplementary Figures S2–S3. Notably, SATcast achieves these high-fidelity predictions with a single round of fine-tuning, effectively mitigating forecast error accumulation. This continuous forecasting capability circumvents prediction discontinuities, suggesting SATcast as a reliable framework for convective cloud forecasting.

In addition, SATcast demonstrates strong generalization capability. Despite being trained exclusively on channel 12 data, it effectively predicts the brightness temperature on channel 9 by incorporating FuXi data. Relevant metrics and examples are provided in Supplementary Figures S9-S10.

2.2 Ablation experiments and nowcasting skills

In this subsection, we compare three variants of SATcast (see details in Table 1) for forecast lead times up to 8 hours, using 1,024 sequences, each spanning 16 hours (with the first 8 hours as input and the subsequent 8 hours as the target). The sequences, collected from September to December 2022, are used to evaluate the effectiveness of various modules within SATcast.

Figure 3 presents the metrics, spatially averaged over the region of interest ($86^{\circ}$E to $150^{\circ}$E in longitude and $1^{\circ}$N to $41^{\circ}$N in latitude), for the persistence model, SATcast, and the three SATcast variants. The CSI is again computed using a threshold of 240 K. Among all the variants, SATcast-NoC, which directly predicts the entire 8-hour sequence, performs the worst for lead times T+1 to T+3. This is likely due to the backward propagation of errors from frames beyond T+4, which hinders its optimization for earlier predictions. However, for lead times beyond T+4, SATcast-NoC outperforms SATcast-NoF, which excludes FuXi forecasts. This suggests that FuXi forecasts enhance performance, as satellite imagery alone is insufficient for predicting longer lead times.

SATcast-NoF performs well up to T+3 but deteriorates rapidly thereafter, likely due to its inability to capture the physical processes underlying convective cloud evolution. To further examine the impact of fine-tuning on SATcast predictions beyond the initial four hours, we evaluate a variant without fine-tuning, SATcast-NoT, which performs better than all other models except SATcast. Fine-tuning is essential in aligning SATcast’s predictions with the data distribution of satellite imagery, as SATcast-NoT often overestimates values beyond T+4 (Supplementary Figure S8).

Although the error differences among models appear subtle, Figure 4 and Supplementary Figures S4-S7 reveal significant variations in their ability to predict the spatial distribution of convection, with SATcast showing the best skill. Figure 4 also shows SATcast’s performance in predicting the intensification of Typhoon Nanmadol and the dissipation of Typhoon Muifa on September 13, 2022. Muifa, one of the most powerful typhoons on record to strike Shanghai, and Nanmadol, one of the strongest typhoons of 2022, caused substantial damage, underscoring the importance of accurate forecasts.

Starting at T+1 (2022-09-14-07:00 UTC), SATcast and its variants capture the textural features and central position of Nanmadol, characterized by a low cloud-top temperature and distinct core region. However, significant differences emerge by T+3 (2022-09-14-09:00 UTC). Without FuXi forecasts, SATcast-NoF demonstrates significant degradation in its ability to resolve the typhoon’s structure, and SATcast-NoC produces only a loosely organized system. In contrast, SATcast maintains robust skill in delineating both the eye of the typhoon and its structural integrity (further details are provided in Supplementary Figure S7). By T+8 (2022-09-14-14:00 UTC), these differences become more pronounced. SATcast-NoC and SATcast-NoF predict only fragmented and sporadic convection patterns, while SATcast successfully captures the intensification and movement of the tropical cyclone. Although SATcast exhibits some discrepancies in the detailed organization of convection compared to observations, it remains the most reliable model in maintaining the typhoon’s overall structure and dynamics.

As Typhoon Muifa approached the east coast of China, it gradually weakened and split into two cloud clusters by T+3. SATcast demonstrated superior skill in accurately capturing both the spatial evolution and intensity variations during this process. By T+8, SATcast’s predictions exhibit a slight southward bias in the location of convective clouds north of Shanghai compared to observations. Despite this, SATcast still accurately forecasts the variations in convection intensity and shape. In contrast, although SATcast-NoC and SATcast-NoF also predict a weakening system, their representations of convective cloud organization show significant morphological inaccuracies.

Notably, satellite observations indicate that scattered convective clouds over Southeast Asia and the South China Sea intensified over time. SATcast accurately predicts this strengthening at T+3, and its forecast at T+8 remains acceptable. In comparison, both SATcast-NoC and SATcast-NoF significantly underestimate the intensity of these convective clouds at T+3 and T+8. Additional cases, including those with and without tropical cyclones, are detailed in Supplementary Figures S4-S7. In these cases, SATcast consistently outperforms its variants in predicting changes in the intensity and organization of convective clouds, while the variants often erroneously predict weakening and dispersion of convective systems.

2.3 Physical interpretations

Permutation feature importance is a valuable technique for interpreting multimodal deep learning models, especially in atmospheric studies [34, 31]. This method quantifies the importance of each feature by selectively shuffling specific feature dimensions and measuring the resulting performance degradation. It helps identify key physical parameters for predicting satellite imagery and refining the model. To address the computational cost of repeated shuffling, we propose a threshold-based permutation strategy (detailed in Supplementary Information). This strategy enables effective shuffling with only one shuffle per batch, significantly reducing computational costs.

Figure 5 presents heatmaps of the normalized mean squared error (N-MSE), defined as the ratio of the MSE of the permuted predictions to that of the baseline predictions ($\mathrm{MSE}_{f}/\mathrm{MSE}_{m}$). Higher ratios correspond to greater feature importance, highlighting the most influential features for model predictions.
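As an illustration of the ratio, the following sketch computes a permutation-based importance score for one input channel by shuffling that channel across the samples of a batch and comparing the MSE before and after. It shows plain permutation importance only; the threshold-based strategy described in the Supplementary Information is not reproduced here. The toy model, the data, and all function names are hypothetical.

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two arrays."""
    return float(np.mean((a - b) ** 2))

def nmse_ratio(model, inputs, target, channel, rng):
    """Return MSE_f / MSE_m for one input channel.

    inputs has shape (B, C, H, W). MSE_m is the baseline error, and MSE_f is
    the error after shuffling `channel` across the batch dimension.
    """
    baseline = mse(model(inputs), target)
    permuted = inputs.copy()
    order = rng.permutation(inputs.shape[0])
    permuted[:, channel] = inputs[order, channel]
    return mse(model(permuted), target) / baseline

def toy_model(x):
    """Hypothetical stand-in for the network: uses channel 0, ignores channel 1."""
    return x[:, 0]

rng = np.random.default_rng(0)
inputs = rng.standard_normal((16, 2, 8, 8))
target = inputs[:, 0] + 0.1 * rng.standard_normal((16, 8, 8))

ratio_used = nmse_ratio(toy_model, inputs, target, channel=0, rng=rng)
ratio_ignored = nmse_ratio(toy_model, inputs, target, channel=1, rng=rng)
print(f"channel 0: {ratio_used:.1f}, channel 1: {ratio_ignored:.1f}")
```

A channel the model ignores yields a ratio of exactly 1, whereas a channel the model depends on yields a ratio well above 1, which is the reading applied to the heatmaps in Figure 5.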

Figure 5a shows the feature importance, measured by N-MSE variations, for the eight hours of historical satellite imagery and FuXi predicted variables. For T+1 to T+3, the N-MSE values from shuffled satellite imagery are significantly higher than those from the FuXi data, indicating the essential role of satellite observations during this period. However, as the lead time increases, the importance of satellite imagery gradually diminishes, while the significance of FuXi data grows, ultimately surpassing that of satellite data at T+8. This shift highlights the increasing relevance of FuXi data, which provides insights into future atmospheric patterns, for predictions over longer lead times. It also explains why SATcast-NoC outperforms SATcast-NoF after T+3.

Figure 5b examines the importance of individual FuXi variables, all of which exhibit relatively low N-MSE values (below 1.1), with several parameters emerging as particularly impactful. Variables across different pressure levels, including high (250 hPa), middle (500 and 700 hPa), and low (850 hPa), are selected to ensure comprehensive representation. Notably, the importance of these variables varies with forecast lead time. Key features with N-MSE values of about 1.05 include geopotential at 250 and 850 hPa (z250 and z850), specific humidity at 250 and 500 hPa (q250 and q500), near-surface temperature (t2m), and mean sea-level pressure (msl). z250 and z850 influence atmospheric motions, such as convergence and divergence patterns, through geostrophic balance and adjustment, thereby guiding cloud movement and intensity changes. q250 and q500, which are closely tied to clouds, exhibit a more pronounced influence from T+5 to T+8, whereas lower-level specific humidity (q850) has less impact. Surface temperature and pressure contribute thermodynamic and dynamic forcing to the atmosphere, affecting prediction accuracy. Furthermore, msl and geopotential (z250 and z850) serve as practical indicators of convection centers and tropical cyclones. These findings demonstrate that incorporating critical atmospheric fields significantly enhances the performance of the SATcast models.

The results obtained by SATcast on channel 9 (Supplementary Figures S9-S10) further demonstrate the model’s capacity to interpret physical information from a different perspective. Brightness temperature on channel 9, which represents high-level water vapor, exhibits greater stability, leading to superior performance across various metrics compared to channel 12. This distinction can be attributed to the fact that, for shorter lead times, SATcast operates more like a video prediction model, heavily relying on satellite imagery for prediction. However, for longer lead times, the model’s performance declines rapidly due to inconsistencies between the targets detected by channels 9 (high-level water vapor) and 12 (cloud). This mismatch hinders the model’s ability to accurately interpret the influence of physical variables on the target, resulting in a significant reduction in skill for channel 9 predictions.

3 Discussion

In this study, we introduce SATcast, a cascade, autoregressive, and multimodal deep learning model developed for nowcasting convective clouds. SATcast combines atmospheric physical conditions predicted by FuXi with FY-4A satellite observations to generate high-fidelity cloud forecasts using a diffusion model. SATcast effectively addresses long-standing challenges in nowcasting, such as image blurring and the rapid decline of forecast accuracy over time. The model produces high-quality cloud fields and accurately forecasts both the temporal and spatial characteristics of clouds up to 24 hours in advance.

SATcast adopts a cascade framework, with SATcast-phase 1 providing forecasts up to four hours ahead, while SATcast-phase 2 takes SATcast-phase 1’s outputs as input to extend predictions beyond the initial four hours. SATcast-phase 1 and SATcast-phase 2 are optimized for lead times of 1 to 4 hours and 5 to 8 hours, respectively. Remarkably, the model’s predictive capabilities extend beyond the optimized 8-hour forecast, as demonstrated in the 8-hour and 24-hour forecast tests, where it exhibits robustness and low cumulative error. Even with a 24-hour lead time, the forecasts remain valid and reliable. Moreover, SATcast can be applied to a different channel of the same satellite without significant performance degradation. The model’s robust performance probably benefits from its incorporation of atmospheric physical information, the cascade structure, and the stability and generative power of the diffusion model.

However, one limitation of this study is the loss of information due to downsampling satellite observations to a coarser resolution of $0.25^{\circ}$ to reduce computational complexity. In the future, we aim to develop latent-space diffusion models to mitigate these computational demands and enable nowcasting at the FY satellite’s native 4-km resolution, thereby capturing more detailed storm features. Another potential direction for future work is the generation of probabilistic forecasts. Convective systems exhibit relatively low predictability, and deterministic forecasts cannot quantify uncertainty, which is critical for making informed, contingent decisions. The recent development of GenCast by Google DeepMind highlights the potential of diffusion models for producing ensemble forecasts by introducing additional stochasticity [35]. Future convective cloud nowcasting could adopt a similar approach to provide more reliable probabilistic storm predictions based on satellite data.

To our knowledge, SATcast is the first model capable of accurately forecasting satellite imagery beyond 8 hours, and even up to 24 hours. SATcast can play a critical role in providing timely alerts to vulnerable regions that lack advanced observation networks and high-performance computing facilities, as it can be operated without relying on NWP forecasts. It is also vital for aviation, especially for flights over oceans and remote areas where ground-based radar observations are inaccessible, and it can effectively track typhoon paths over the ocean. Additionally, cloud coverage is crucial for managing power grids and optimizing energy storage in solar photovoltaic plants [36].

Methods

Data

Each sequence pairs FY-4A satellite imagery with 13 physical variables (detailed in Supplementary Table 1) from FuXi’s hourly predictions, aligned with the spatiotemporal coordinates of the FY-4A satellite data.

Our analysis utilizes FY-4A AGRI Level 1 data from channel 12, which has a central wavelength of 10.7 µm, a spatial resolution of 4 km, and a temporal resolution of 1 hour, covering the period from 2019 to 2022. This channel primarily captures cloud top temperature and surface temperature. The data is then interpolated to a $0.25^{\circ}$ resolution to reduce computational costs while maintaining resolution consistent with FuXi data.

These sequences are combined with the 13 physical variables predicted by FuXi every 12 hours, aligned with the latitude, longitude, and time information from the satellite data. The resulting dataset is organized in a T, C, H, W format (12, 14, 160, 256), where T represents time steps, C denotes channels (including both satellite data and physical variables), and H and W correspond to the spatial dimensions in the latitude and longitude directions. The training dataset for SATcast and SATcast-NoF comprises 19,904 sequences, each spanning 12 hours. Additionally, a separate dataset for training SATcast-NoC includes 17,408 sequences, each with a duration of 16 hours. To evaluate the performance of SATcast and assess the impact of its modules, distinct validation sequences lasting 16 and 32 hours were also prepared.
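The spatial dimensions quoted above follow from the study region and the $0.25^{\circ}$ grid. The short sketch below works through that arithmetic and shows one way an hourly record could be cut into fixed-length sequences. It rests on two assumptions not stated in the text: the 14 channels are read as one satellite channel plus the 13 FuXi variables, and the `stride` parameter is hypothetical because the spacing between consecutive sequences is not given.

```python
def grid_shape(lon_west, lon_east, lat_south, lat_north, resolution):
    """Number of grid cells (H, W) on a regular latitude-longitude grid."""
    width = round((lon_east - lon_west) / resolution)
    height = round((lat_north - lat_south) / resolution)
    return height, width

def window_starts(n_hours, window, stride=1):
    """Start indices of all complete windows in an hourly record.

    `stride` is a hypothetical parameter; the source does not state how
    consecutive sequences are spaced.
    """
    return list(range(0, n_hours - window + 1, stride))

# Region of interest: 86E to 150E and 1N to 41N at 0.25 degree resolution.
height, width = grid_shape(86.0, 150.0, 1.0, 41.0, 0.25)

# Assumption: one satellite channel plus the 13 FuXi variables.
n_channels = 1 + 13
sequence_shape = (12, n_channels, height, width)
print(sequence_shape)  # (12, 14, 160, 256)

# A 14-hour record contains three complete 12-hour windows at stride 1.
print(window_starts(14, 12))  # [0, 1, 2]
```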

Basic diffusion

In the forward process of the diffusion model, noise is incrementally added to the target. Let $P(x_{0})$ represent the sample distribution. The following discrete-time Gaussian process is defined:

$$q(x_{t}|x_{t-1})=\mathcal{N}(x_{t};\sqrt{1-\beta_{t}}\,x_{t-1},\beta_{t}\mathbf{I}) \tag{1}$$

where $x_{t}$ denotes the noisy sample obtained from $x_{0}$ after $t$ noising steps, calculated as $x_{t}=\sqrt{\bar{\alpha}_{t}}x_{0}+\sqrt{1-\bar{\alpha}_{t}}\epsilon$, with $\epsilon$ being noise sampled from $\mathcal{N}(0,\mathbf{I})$. Here, $\alpha_{t}$ follows a fixed schedule over $t$, and $\beta_{t}=1-\alpha_{t}$. The network is trained to predict the noise injected into the data. During sampling, the diffusion model reverses the forward diffusion process to recover $x_{t-1}$ from $x_{t}$, given by:

$$p(x_{t-1}|x_{t},x_{0})=\mathcal{N}(x_{t-1};\tilde{\mu}_{t}(x_{t},x_{0}),\tilde{\beta}_{t}\mathbf{I}) \tag{2}$$

where $\tilde{\mu}_{t}(x_{t},x_{0})=\frac{1}{\sqrt{\alpha_{t}}}\left(x_{t}-\frac{1-\alpha_{t}}{\sqrt{1-\bar{\alpha}_{t}}}\epsilon_{\theta}\right)$, $\bar{\alpha}_{t}=\prod_{s=1}^{t}\alpha_{s}$, and $\epsilon_{\theta}$ is the noise predicted by the network. Additionally, $\tilde{\beta}_{t}=\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_{t}}\beta_{t}$. Thus, $x_{t-1}$ can be sampled from a Gaussian with this mean and variance [26, 37].
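The forward and reverse formulas can be checked numerically with a short sketch. The linear $\beta_{t}$ schedule, its endpoints, and the number of steps below are assumptions for illustration, since the schedule used in the study is not stated here. Indices are zero-based, so `t = 0` in the code corresponds to the first diffusion step.

```python
import numpy as np

def make_schedule(n_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Assumed linear beta schedule; returns beta_t, alpha_t and alpha-bar_t."""
    betas = np.linspace(beta_start, beta_end, n_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def forward_sample(x0, t, alpha_bars, eps):
    """x_t = sqrt(alpha-bar_t) * x_0 + sqrt(1 - alpha-bar_t) * eps."""
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

def reverse_mean(x_t, t, eps_pred, alphas, alpha_bars):
    """Mean of the reverse step, using the predicted noise eps_pred."""
    coef = (1.0 - alphas[t]) / np.sqrt(1.0 - alpha_bars[t])
    return (x_t - coef * eps_pred) / np.sqrt(alphas[t])

def reverse_variance(t, betas, alpha_bars):
    """beta-tilde_t = (1 - alpha-bar_{t-1}) / (1 - alpha-bar_t) * beta_t."""
    prev = alpha_bars[t - 1] if t > 0 else 1.0  # empty product equals 1
    return (1.0 - prev) / (1.0 - alpha_bars[t]) * betas[t]

betas, alphas, alpha_bars = make_schedule()
rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(4, 8, 8))  # toy target scaled to [-1, 1]
eps = rng.standard_normal(x0.shape)

# With the true noise, one reverse step from the first noised sample recovers x_0.
x1 = forward_sample(x0, 0, alpha_bars, eps)
recovered = reverse_mean(x1, 0, eps, alphas, alpha_bars)
print(np.allclose(recovered, x0))  # True
```

In practice the true noise is unknown at sampling time and is replaced by the network output $\epsilon_{\theta}$, so the recovery is approximate and is repeated over all steps.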

Conditional diffusion for satellite image nowcasting

Consider a satellite image time series, where the nowcasting task predicts the future $y$ frames of satellite imagery based on the past $x$ frames of satellite imagery and FuXi variables from time steps $\mathrm{T}-x+1$ to $\mathrm{T}+y$. During training, noise is added to the future $y$ frames at level $t$, while the past frames and FuXi variables serve as the conditions. Therefore, the reverse process is modified as:

$$p(x_{t-1}|x_{t},x_{0},cond)=\mathcal{N}(x_{t-1};\tilde{\mu}_{t}(x_{t},x_{0},cond),\tilde{\beta}_{t}\mathbf{I}) \tag{3}$$

The network training objective is simplified as follows:

$$\mathcal{L}=\mathbb{E}_{x_{t},cond,t,\epsilon\sim\mathcal{N}(0,\mathbf{I})}\left\|\epsilon-\epsilon_{\theta}(x_{t},cond,t)\right\|_{2}^{2} \tag{4}$$

where $\epsilon_{\theta}$ represents the noise predicted by the network.
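A single Monte Carlo estimate of this objective can be sketched as follows. The `eps_model` argument stands in for the conditional network; here it is a hypothetical placeholder that always predicts zero noise, and the array shapes and linear noise schedule are toy assumptions. The sketch averages the squared error over elements, which differs from the squared norm in Equation (4) only by a constant factor.

```python
import numpy as np

def noise_prediction_loss(eps_model, x0, cond, alpha_bars, rng):
    """One-sample estimate of the conditional noise-prediction objective."""
    t = int(rng.integers(0, len(alpha_bars)))   # random diffusion step
    eps = rng.standard_normal(x0.shape)         # noise to be predicted
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    eps_pred = eps_model(x_t, cond, t)          # network sees x_t, conditions, and t
    return float(np.mean((eps - eps_pred) ** 2))

def zero_model(x_t, cond, t):
    """Hypothetical placeholder network that always predicts zero noise."""
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))  # assumed schedule
x0 = rng.uniform(-1.0, 1.0, size=(2, 4, 16, 16))    # toy future frames
cond = rng.uniform(-1.0, 1.0, size=(2, 8, 16, 16))  # toy conditioning inputs

loss = noise_prediction_loss(zero_model, x0, cond, alpha_bars, rng)
print(f"loss for the zero predictor: {loss:.3f}")
```

A predictor that always outputs zero gives a loss close to 1, the variance of the injected noise, so a trained network should fall well below that value.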

Model

The basic model, SATcast, is a cascade model that incorporates FuXi data and undergoes two-phase training, as illustrated in Figure 6b. In the first phase, the model learns to predict frames from T+1 to T+4 using past satellite frames from T$-$7 to T+0 and the corresponding FuXi predictions from T$-$7 to T+4. The second phase extends the prediction to frames T+5 to T+8, utilizing past satellite frames from T$-$3 to T+0, the previously generated frames from T+1 to T+4, and the corresponding FuXi predictions. To optimize computational efficiency, SATcast-phase 2 is initialized with weights from SATcast-phase 1. We also present results from SATcast-NoT, in which SATcast does not undergo fine-tuning in phase 2. The second variant, SATcast-NoC, adopts a direct prediction approach, predicting frames T+1 to T+8 simultaneously using the same input data as SATcast-phase 1. The third variant, SATcast-NoF, excludes FuXi predictions as conditions while maintaining autoregressive prediction for all eight frames.
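The input windows of the two phases, and the iterative use of phase 1 beyond T+8, can be made concrete with a scheduling sketch. The two models are hypothetical stand-ins that simply repeat the last input frame, and the function name and array layout are illustrative. The sketch also assumes that each call receives the 12 hours of FuXi fields spanning its 8 input frames and 4 output frames, which matches the phase 1 description and is an assumption for later calls.

```python
import numpy as np

def cascade_rollout(phase1, phase2, past_sat, fuxi, n_hours=24):
    """Roll a two-phase cascade forward in four-hour blocks.

    past_sat: (8, H, W) satellite frames for T-7..T.
    fuxi:     (8 + n_hours, C, H, W) FuXi fields for T-7..T+n_hours.
    Index i along the time axis corresponds to time T-7+i.
    """
    frames = list(past_sat)
    n_past = len(frames)
    block = 0
    while len(frames) - n_past < n_hours:
        start = len(frames) - 8                 # first of the 8 input frames
        sat_in = np.stack(frames[start:])       # 8 most recent frames
        fuxi_in = fuxi[start:start + 12]        # covers inputs and the next 4 hours
        # Block 0 gives T+1..T+4 (phase 1) and block 1 gives T+5..T+8 (phase 2).
        # Later blocks reuse phase 1 iteratively.
        model = phase2 if block == 1 else phase1
        frames.extend(model(sat_in, fuxi_in))
        block += 1
    return np.stack(frames[n_past:n_past + n_hours])

def stub_model(sat_in, fuxi_in):
    """Hypothetical stand-in for a trained phase: repeats the last input frame."""
    return [sat_in[-1].copy() for _ in range(4)]

rng = np.random.default_rng(0)
past_sat = rng.uniform(-1.0, 1.0, size=(8, 16, 16))
fuxi = rng.uniform(-1.0, 1.0, size=(32, 13, 16, 16))

forecast = cascade_rollout(stub_model, stub_model, past_sat, fuxi)
print(forecast.shape)  # (24, 16, 16)
```

In the second block the 8 input frames are T$-$3 to T+4, that is, four observed frames followed by the four frames generated in the first block, as described for phase 2.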

Table 1: Configurations of SATcast and its variants. NoC: no cascade (neither autoregressive prediction nor fine-tuning); NoF: no FuXi data; NoT: no fine-tuning.

Model	Autoregressive	Fine-Tuning	FuXi data
SATcast	✓	✓	✓
SATcast-NoC	✗	✗	✓
SATcast-NoT	✓	✗	✓
SATcast-NoF	✓	✗	✗

Cascade structure

We designed a cascade structure so that the model generates future convective clouds in multiple steps. Because frames at longer lead times are harder to predict, letting the model focus on the earlier frames within a single forecast makes training more effective. Fine-tuning can then be applied in subsequent iterations to reduce iterative errors [8]. Additionally, Supplementary Figure S11 shows a sample from the training of SATcast-NoC, which begins to capture the structure of the typhoon in the first four frames but fails to predict the subsequent four frames. This illustrates the rationale behind designing the cascade model to forecast four frames at a time.
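The two-phase rollout can be sketched in plain Python as below. The function and argument names are hypothetical, `phase1_model` and `phase2_model` stand in for the trained samplers, and the FuXi window used in phase 2 (T $-$ 3 to T+8) is an assumption, since the text only refers to "the corresponding FuXi predictions".

```python
def cascade_rollout(frames, fuxi, phase1_model, phase2_model):
    """Two-phase cascade forecast of frames T+1..T+8.

    frames: dict mapping time offset (-7..0) to an observed satellite frame.
    fuxi:   dict mapping time offset (-7..8) to FuXi fields.
    phase1_model / phase2_model: hypothetical callables (satellite, fuxi) -> 4 frames.
    """
    # Phase 1: observations T-7..T+0 and FuXi T-7..T+4 give frames T+1..T+4.
    sat1 = [frames[k] for k in range(-7, 1)]
    fx1 = [fuxi[k] for k in range(-7, 5)]
    first = list(phase1_model(sat1, fx1))

    # Phase 2: observations T-3..T+0 plus the generated T+1..T+4 give T+5..T+8.
    # Assumption: the FuXi window shifts accordingly to T-3..T+8.
    sat2 = [frames[k] for k in range(-3, 1)] + first
    fx2 = [fuxi[k] for k in range(-3, 9)]
    second = list(phase2_model(sat2, fx2))

    return first + second
```

Under this reading, both phases see eight satellite frames and twelve FuXi time steps, so the same network input layout can be reused when phase 2 is initialized from phase 1.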

Multimodal data

The physical-driven network is constructed by incorporating the physical variables from FuXi forecasts. We treat data representing the same variable at different altitudes as distinct modalities. Consequently, we normalize these disparate modalities by linearly rescaling their values to the range from -1 to 1. After temporal and spatial alignment, we obtain a five-dimensional dataset with dimensions B, T, C, H, W. For a sequence of 12 time steps, the dimensions are 16, 12, 14, 160, 256. Because all variables can be expressed within a system of atmospheric dynamic equations, and to limit both the information loss from encoding compression and the computational cost [38], the T and C dimensions of the five-dimensional data are merged before entering the model, resulting in a structure of B, T $\times$ C, H, W. This is followed by a convolutional layer with a larger kernel size to extract spatial information. We believe this approach enhances the model’s ability to extract the conditional information accurately, ensuring that the predicted satellite imagery maintains correct spatial characteristics rather than merely presenting enhanced image quality.
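The normalization and the merging of the T and C dimensions can be illustrated with a short NumPy sketch. The function names are hypothetical, and the choice of per-modality bounds `lo` and `hi` (for example, the training-set minimum and maximum) is an assumption not specified in the text.

```python
import numpy as np

def rescale_to_unit_range(x, lo, hi):
    """Linearly map values from [lo, hi] to [-1, 1] for one modality."""
    return 2.0 * (x - lo) / (hi - lo) - 1.0

def merge_time_and_channel(x):
    """Reshape (B, T, C, H, W) to (B, T*C, H, W).

    Channel index t * C + c of the output holds variable c at time step t.
    """
    b, t, c, h, w = x.shape
    return x.reshape(b, t * c, h, w)
```

For the shapes quoted above, an array of shape (16, 12, 14, 160, 256) becomes (16, 168, 160, 256) before the first convolutional layer.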

Network

We employed a U-Net2D architecture to predict noise, as shown in Figure 6a. This U-Net contains ConvNeXt modules [30] and self-attention modules [15]. Notably, the network does not include specific modules for processing the temporal correlations of the data. Instead, the temporal characteristics are implicitly learned by the model.

As previously described, after the initial convolutional layer, the data have dimensions 16, 224, 160, 256. The data then pass through four downsampling blocks, in each of which H and W are halved while the number of channels is doubled. Each downsampling block consists of two ConvNeXt and self-attention modules with residual connections. Following this, the data pass through four upsampling blocks with the same structure as the downsampling blocks, which restore the dimensions the data had on entering the U-Net. Finally, multiple convolutional layers progressively reduce the number of channels to T $\times$ C (12 $\times$ 14).
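To make the shape bookkeeping concrete, the sketch below traces the feature-map shapes through the encoder and decoder. It takes the description literally, assuming that every one of the four blocks halves H and W and doubles the channels; the released implementation may differ in detail, and the function name is hypothetical.

```python
def unet_shape_trace(channels=224, height=160, width=256, num_blocks=4):
    """Return (encoder, decoder) lists of (C, H, W) shapes.

    encoder[0] is the input to the first downsampling block; each later
    entry is the output of one downsampling block. decoder lists the
    output of each upsampling block, mirroring the encoder.
    """
    encoder = [(channels, height, width)]
    for _ in range(num_blocks):
        c, h, w = encoder[-1]
        encoder.append((c * 2, h // 2, w // 2))
    decoder = list(reversed(encoder))[1:]
    return encoder, decoder
```

Under this assumption the bottleneck has 3584 channels on a 10 by 16 grid, and the last upsampling block returns to 224 channels on the 160 by 256 grid.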

At the end of the network, the dimensions are restored, and only the noise-added portion, the target satellite imagery, is extracted to calculate the loss with the added noise.

Model training

The model is developed using PyTorch [39]. Training the model takes approximately 20 hours on a cluster of 8 Nvidia H800 GPUs. The model is trained for 300 epochs using a batch size of 16 per GPU. The AdamW [40] optimizer is used with parameters $\beta_{1}=0.9$ and $\beta_{2}=0.99$ , and the learning rate is initialized at $5.0\times 10^{-5}$ with a warm-up phase followed by cosine annealing. An exponential moving average is also applied during training [41]. The smooth L1 loss is used as the loss function.
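The learning-rate schedule, the exponential moving average, and the smooth L1 loss can be sketched in plain Python as follows. The warm-up length, the minimum learning rate, the EMA decay, and the smooth L1 threshold `beta` are placeholder values chosen for illustration; the paper does not state them.

```python
import math

def learning_rate(step, total_steps, base_lr=5.0e-5, warmup_steps=1000, min_lr=0.0):
    """Linear warm-up to base_lr, then cosine annealing down to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def ema_update(average, value, decay=0.999):
    """One exponential-moving-average update of a scalar weight."""
    return decay * average + (1.0 - decay) * value

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 loss for one element: quadratic near zero, linear beyond beta."""
    diff = abs(pred - target)
    if diff < beta:
        return 0.5 * diff * diff / beta
    return diff - 0.5 * beta
```

The quadratic region of the smooth L1 loss behaves like the squared error of Equation 4 for small residuals, while the linear region limits the influence of large residuals.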

To improve the model’s ability to predict severe cases, sequences containing category three typhoons are resampled once. Classifier-free guidance is employed during both training and sampling to improve image quality [42]. Additionally, offset noise is applied during sampling to reduce mismatches in data distribution [43].
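Classifier-free guidance [42] can be summarized with the sketch below, which follows the formulation of Ho and Salimans. Using an all-zero array as the null condition, the drop probability `p_uncond`, and the guidance weight `w` are assumptions for illustration and are not values reported for SATcast.

```python
import numpy as np

def maybe_drop_condition(cond, p_uncond, rng):
    """Training: replace the condition by a null condition with probability p_uncond.

    The null condition is taken to be zeros here (an assumption).
    """
    if rng.random() < p_uncond:
        return np.zeros_like(cond)
    return cond

def guided_noise(eps_cond, eps_uncond, w):
    """Sampling: combine conditional and unconditional noise predictions.

    w = 0 recovers the purely conditional prediction; larger w pushes the
    sample further towards the condition.
    """
    return (1.0 + w) * eps_cond - w * eps_uncond
```

During sampling, the network is evaluated twice per step, once with the condition and once with the null condition, and the combined prediction replaces $\epsilon_{\theta}$ in the reverse step.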

Evaluation method

For target regions spanning a wide range of latitude, we apply latitude-weighted RMSE, correlation coefficient and CSI, which are calculated as follows:

\mathbf{RMSE}(\tau)=\sqrt{\frac{1}{H\times W}\sum_{i=1}^{H}\sum_{j=1}^{W}a_{i}\left(\mathbf{\hat{X}}_{i,j}^{\mathrm{T_{0}}+\tau}-\mathbf{X}_{i,j}^{\mathrm{T_{0}}+\tau}\right)^{2}}

(5)

\rho(\tau)=\frac{1}{H\times W}\sum_{i=1}^{H}\sum_{j=1}^{W}a_{i}\frac{\mathbf{cov}(\mathbf{\hat{X}}^{\mathrm{T_{0}}+\tau}_{i,j},\mathbf{X}^{\mathrm{T_{0}}+\tau}_{i,j})}{\sigma_{\mathbf{\hat{X}}^{\mathrm{T_{0}}+\tau}_{i,j}}\sigma_{\mathbf{X}^{\mathrm{T_{0}}+\tau}_{i,j}}}

(6)

\mathbf{CSI}(\tau)=\frac{1}{H\times W}\sum_{i=1}^{H}\sum_{j=1}^{W}a_{i}\frac{TP^{\mathrm{T_{0}}+\tau}_{i,j}}{TP^{\mathrm{T_{0}}+\tau}_{i,j}+FN^{\mathrm{T_{0}}+\tau}_{i,j}+FP^{\mathrm{T_{0}}+\tau}_{i,j}}

(7)

where $\tau$ is the forecast lead time, $\mathrm{T_{0}}$ is the initial time, and H and W are the numbers of grid points in latitude and longitude, respectively. $a_{i}$ is the latitude weighting factor for the $i$th latitude index [44]:

a_{i}=\frac{\cos(\mathrm{lat}(i))}{\frac{1}{H}\sum_{i^{\prime}=1}^{H}\cos(\mathrm{lat}(i^{\prime}))}

(8)

CSI is commonly used for the evaluation of severe weather forecasts. TP refers to the number of correctly predicted positive pixels; FP is the number of negative pixels incorrectly predicted as positive; FN is the number of positive pixels incorrectly predicted as negative; and TN is the number of correctly predicted negative pixels.
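The latitude-weighted RMSE and CSI can be sketched with NumPy as below. The function names are hypothetical. For the CSI, this sketch sums latitude-weighted counts of hits, false alarms, and misses over the domain before taking the ratio, which is one possible reading of Equation 7; the event masks are passed in as boolean arrays so that the thresholding convention is left to the caller.

```python
import numpy as np

def latitude_weights(lat_deg):
    """Weights a_i = cos(lat_i) / mean(cos(lat)) for a 1-D array of latitudes in degrees."""
    w = np.cos(np.deg2rad(np.asarray(lat_deg, dtype=float)))
    return w / w.mean()

def weighted_rmse(pred, truth, lat_deg):
    """Latitude-weighted RMSE for fields of shape (H, W), rows indexed by latitude."""
    a = latitude_weights(lat_deg)[:, None]
    return float(np.sqrt(np.mean(a * (pred - truth) ** 2)))

def weighted_csi(pred_event, true_event, lat_deg):
    """Latitude-weighted CSI = TP / (TP + FN + FP) from boolean event masks."""
    pred_event = np.asarray(pred_event, dtype=bool)
    true_event = np.asarray(true_event, dtype=bool)
    a = latitude_weights(lat_deg)[:, None]
    tp = np.sum(a * (pred_event & true_event))    # hits
    fp = np.sum(a * (pred_event & ~true_event))   # false alarms
    fn = np.sum(a * (~pred_event & true_event))   # misses
    denom = tp + fp + fn
    return float(tp / denom) if denom > 0 else float("nan")
```

Correctly predicted negative pixels (TN) do not enter the CSI, which is why the score is suited to rare events such as deep convection.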

The multi-scale SSIM (MSSSIM) combines the Structural Similarity Index (SSIM) at multiple scales to yield more robust results. Latitude weighting is not applied here, as the image is treated as a whole. It is calculated by:

\mathbf{MSSSIM}(\tau)=\left[l_{M}(\mathbf{\hat{X}}^{\mathrm{T_{0}}+\tau},\mathbf{X}^{\mathrm{T_{0}}+\tau})\right]^{\alpha_{M}}\prod_{j=1}^{M}\left[c_{j}(\mathbf{\hat{X}}^{\mathrm{T_{0}}+\tau},\mathbf{X}^{\mathrm{T_{0}}+\tau})\right]^{\beta_{j}}\left[s_{j}(\mathbf{\hat{X}}^{\mathrm{T_{0}}+\tau},\mathbf{X}^{\mathrm{T_{0}}+\tau})\right]^{\gamma_{j}}

(9)

The three terms in Equation 9 compare luminance ( $l$ ), contrast ( $c$ ) and structure ( $s$ ), and $M$ is the number of image scales. The exponents $\alpha_{M}$ , $\beta_{j}$ and $\gamma_{j}$ adjust the relative importance of the different components. More details about the calculation can be found in Wang et al. (2003) [32].

Data availability

The Fengyun-4A satellite imagery can be downloaded from https://satellite.nsmc.org.cn/portalsite/default.aspx?currentculture=en-US.

Code availability

The scripts for the network, training, and inference are available at https://github.com/cd4tpcell/SATcast/tree/main [45].

Acknowledgements

The work described in this paper was substantially supported by a grant from the Research Grants Council (RGC) of the Hong Kong Special Administrative Region, China (Project Reference Number: AoE/P-601/23-N) and the Center for Ocean Research in Hong Kong and Macau (CORE), a joint research center between the Laoshan Laboratory and the Hong Kong University of Science and Technology (HKUST). HC, YH, and XS are additionally supported by RGC grant HKUST-16301322 and the Aviation Research and Development Project Phase 2 (AvRDP-2) of the World Meteorological Organization (WMO). The authors thank the HKUST Superpod Cluster for providing the computing platform used to train the model and the Artificial Intelligence Innovation and Incubation Institute of Fudan University for providing archives of the FuXi data.

Author contributions

X.S., H.L., and X.L. conceptualized the project. H.C. and X.Z. completed the model training and evaluation, as well as the design of the experiments. Y.C. and P.C. provided the satellite data and helped with relevant analysis. Q.Z. contributed to the data cleansing and tested the workflow for model training. Y.H. helped with data analysis and visualization. H.C., X.Z., and X.S. prepared the initial manuscript. All authors contributed to further improvements.

Competing interests

The authors declare no competing interests.

References

Chen et al. [2024] Chen, H., Shi, X., Nie, X., Wang, Y., Leung, C.Y.Y., Cheung, P., Chan, P.W.: Tropical aviation turbulence induced by the interaction between a jet stream and deep convection. Journal of Geophysical Research: Atmospheres 129(18), e2024JD040763 (2024) https://doi.org/10.1029/2024JD040763
Guo et al. [2022] Guo, Y., Zhong, M., Chen, X., Zhou, Z., Xu, G., Xu, G., Dong, L.: A thunderstorm gale forecast method based on the objective classification and continuous probability. Atmosphere 13(8), 1308 (2022)
World Meteorological Organization [2023] World Meteorological Organization: Guidelines for Satellite-based Nowcasting in Africa. https://library.wmo.int/records/item/58348-guidelines-for-satellite-based-nowcasting-in-africa. [Accessed 14-12-2024] (2023)
Schmit et al. [2017] Schmit, T.J., Griffith, P., Gunshor, M.M., Daniels, J.M., Goodman, S.J., Lebair, W.J.: A closer look at the abi on the goes-r series. Bulletin of the American Meteorological Society 98(4), 681–698 (2017)
Baker and Matthews [2004] Baker, S., Matthews, I.: Lucas-kanade 20 years on: A unifying framework. International journal of computer vision 56, 221–255 (2004)
Smith et al. [2024] Smith, J., Birch, C., Marsham, J., Peatman, S., Bollasina, M., Pankiewicz, G.: Evaluating pysteps optical flow algorithms for convection nowcasting over the maritime continent using satellite data. Natural Hazards and Earth System Sciences 24(2), 567–582 (2024)
Zhang et al. [2023] Zhang, Y., Long, M., Chen, K., Xing, L., Jin, R., Jordan, M.I., Wang, J.: Skilful nowcasting of extreme precipitation with nowcastnet. Nature 619(7970), 526–532 (2023)
Chen et al. [2023] Chen, L., Zhong, X., Zhang, F., Cheng, Y., Xu, Y., Qi, Y., Li, H.: Fuxi: A cascade machine learning forecasting system for 15-day global weather forecast. npj Climate and Atmospheric Science 6(1), 190 (2023)
Bi et al. [2023] Bi, K., Xie, L., Zhang, H., Chen, X., Gu, X., Tian, Q.: Accurate medium-range global weather forecasting with 3d neural networks. Nature 619(7970), 533–538 (2023)
Kumar et al. [2020] Kumar, A., Islam, T., Sekimoto, Y., Mattmann, C., Wilson, B.: Convcast: An embedded convolutional lstm based architecture for precipitation nowcasting using satellite data. PLoS ONE 15(3), e0230114 (2020)
Tran and Song [2019] Tran, Q.-K., Song, S.-k.: Computer vision in precipitation nowcasting: Applying image quality assessment metrics for training deep neural networks. Atmosphere 10(5), 244 (2019)
Ehsani et al. [2021] Ehsani, M.R., Zarei, A., Gupta, H.V., Barnard, K., Behrangi, A.: Nowcasting-nets: Deep neural network structures for precipitation nowcasting using imerg. arXiv preprint arXiv:2108.06868 (2021)
Wei et al. [2024] Wei, X., Zhang, F., Zhang, R., Li, W., Liu, C., Guo, B., Li, J., Fu, H., Tang, X.: DaYu: Data-Driven Model for Geostationary Satellite Observed Cloud Images Forecasting (2024). https://arxiv.org/abs/2411.10144
Shi et al. [2015] Shi, X., Chen, Z., Wang, H., Yeung, D.-Y., Wong, W.-K., Woo, W.-c.: Convolutional lstm network: A machine learning approach for precipitation nowcasting. Advances in neural information processing systems 28 (2015)
Vaswani [2017] Vaswani, A.: Attention is all you need. Advances in Neural Information Processing Systems (2017)
Wang et al. [2023] Wang, R., Su, L., Wong, W.K., Lau, A.K., Fung, J.C.: Skillful radar-based heavy rainfall nowcasting using task-segmented generative adversarial network. IEEE Transactions on Geoscience and Remote Sensing (2023)
Goodfellow et al. [2020] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020)
Kim et al. [2024] Kim, W., Jeong, C.-H., Kim, S.: Improvements in deep learning-based precipitation nowcasting using major atmospheric factors with radar rain rate. Computers & Geosciences 184, 105529 (2024) https://doi.org/10.1016/j.cageo.2024.105529
Trier et al. [2014] Trier, S.B., Davis, C.A., Ahijevych, D.A., Manning, K.W.: Use of the parcel buoyancy minimum (b min) to diagnose simulated thermodynamic destabilization. part i: Methodology and case studies of mcs initiation environments. Monthly Weather Review 142(3), 945–966 (2014)
Shi and Wang [2022] Shi, X., Wang, Y.: Impacts of cumulus convection and turbulence parameterizations on the convection-permitting simulation of typhoon precipitation. Monthly Weather Review 150 (2022) https://doi.org/10.1175/MWR-D-22-0057.1
Radford et al. [2021] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J.: Learning transferable visual models from natural language supervision (2021)
Andrychowicz et al. [2023] Andrychowicz, M., Espeholt, L., Li, D., Merchant, S., Merose, A., Zyda, F., Agrawal, S., Kalchbrenner, N.: Deep learning for day forecasts from sparse observations. ArXiv abs/2306.06079 (2023)
Huang et al. [2016] Huang, G., Sun, Y., Liu, Z., Sedra, D., Weinberger, K.: Deep networks with stochastic depth. In: Springer International Publishing (2016)
Wang et al. [2024] Wang, R., Fung, J.C.H., Lau, A.K.H.: Skillful precipitation nowcasting using physical-driven diffusion networks. Geophysical Research Letters 51(24) (2024)
Ho et al. [2020] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in neural information processing systems 33, 6840–6851 (2020)
Dhariwal and Nichol [2021] Dhariwal, P., Nichol, A.: Diffusion models beat gans on image synthesis. Advances in neural information processing systems 34, 8780–8794 (2021)
Yang et al. [2017] Yang, J., Zhang, Z., Wei, C., Lu, F., Guo, Q.: Introducing the new generation of chinese geostationary weather satellites, fengyun-4. Bulletin of the American Meteorological Society 98(8), 1637–1658 (2017)
Lu et al. [2017] Lu, F., Zhang, X.-H., Chen, B.-Y., Liu, H., Wu, R., Han, Q., Feng, X., Li, Y., Zhang, Z.: Fy-4 geostationary meteorological satellite imaging characteristics and its application prospects. J. Mar. Meteorol 37(2), 1–12 (2017)
Yan et al. [2024] Yan, C., Guang, J., Li, Z., Leeuw, G., Chen, Z.: A study on typhoon center localization based on an improved spatio-temporally consistent scale-invariant feature transform and brightness temperature perturbations. Remote Sensing 16(21), 4070 (2024)
Liu et al. [2022] Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022)
Lagerquist et al. [2021] Lagerquist, R., Stewart, J.Q., Ebert-Uphoff, I., Kumler, C.: Using deep learning to nowcast the spatial coverage of convection from himawari-8 satellite data. Monthly Weather Review 149(12), 3897–3921 (2021)
Wang et al. [2003] Wang, Z., Simoncelli, E.P., Bovik, A.C.: Multiscale structural similarity for image quality assessment, vol. 2, pp. 1398–1402 (2003). https://doi.org/10.1109/ACSSC.2003.1292216
Wallace [1975] Wallace, J.M.: Diurnal variations in precipitation and thunderstorm frequency over the conterminous united states. Monthly Weather Review 103(5), 406–419 (1975)
Joshi et al. [2021] Joshi, G., Walambe, R., Kotecha, K.: A review on explainability in multimodal deep neural nets. IEEE Access 9, 59800–59821 (2021)
Price et al. [2024] Price, I., Sanchez-Gonzalez, A., Alet, F., Andersson, T.R., El-Kadi, A., Masters, D., Ewalds, T., Stott, J., Mohamed, S., Battaglia, P., Lam, R., Willson, M.: Probabilistic weather forecasting with machine learning. Nature (2024) https://doi.org/10.1038/s41586-024-08252-9 . Accessed 2024-12-16
Xia et al. [2024] Xia, P., Zhang, L., Min, M., Li, J., Wang, Y., Yu, Y., Jia, S.: Accurate nowcasting of cloud cover at solar photovoltaic plants using geostationary satellite images. Nature Communications 15(1), 510 (2024)
Nichol and Dhariwal [2021] Nichol, A.Q., Dhariwal, P.: Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning, pp. 8162–8171 (2021). PMLR
Voleti et al. [2022] Voleti, V., Jolicoeur-Martineau, A., Pal, C.: Mcvd-masked conditional video diffusion for prediction, generation, and interpolation. Advances in neural information processing systems 35, 23371–23385 (2022)
Paszke et al. [2017] Paszke, A., et al.: Automatic differentiation in pytorch. In: NIPS 2017 Workshop on Autodiff (2017)
Loshchilov [2017] Loshchilov, I.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Morales-Brotons et al. [2024] Morales-Brotons, D., Vogels, T., Hendrikx, H.: Exponential moving average of weights in deep learning: Dynamics and benefits. Transactions on Machine Learning Research (2024)
Ho and Salimans [2022] Ho, J., Salimans, T.: Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598 (2022)
Lin et al. [2024] Lin, S., Liu, B., Li, J., Yang, X.: Common Diffusion Noise Schedules and Sample Steps are Flawed (2024). https://arxiv.org/abs/2305.08891
Rasp et al. [2020] Rasp, S., Dueben, P.D., Scher, S., Weyn, J.A., Mouatadid, S., Thuerey, N.: Weatherbench: A benchmark data set for data-driven weather forecasting. Journal of Advances in Modeling Earth Systems 12 (2020)
Chen et al. [2025] Chen, H., Zhong, X., Zhai, Q., Li, X., Chan, Y.W., Chan, P.W., Huang, Y., Li, H., Shi, X.: Skillful Nowcasting of Convective Clouds with a Cascade Diffusion Model. Zenodo (2025). https://doi.org/10.5281/zenodo.14643154