Operational Solar Flare Prediction Model Using Deep Flare Net
Abstract
We developed an operational solar flare prediction model using deep neural networks, named Deep Flare Net (DeFN). DeFN can issue probabilistic forecasts of solar flares in two categories, namely ≥M-class and <M-class events or ≥C-class and <C-class events, occurring in the next 24 h after observations, as well as the maximum class of flares occurring in the next 24 h. DeFN is set to run every 6 h and has been in operation since January 2019. The input database of solar observation images taken by the Solar Dynamics Observatory (SDO) is downloaded from the data archive operated by the Joint Science Operations Center (JSOC) of Stanford University. Active regions are automatically detected from magnetograms, and 79 features are extracted from each region in near real time using multiwavelength observation data. Flare labels are attached to the feature database, and the database is then standardized and input into DeFN for prediction. DeFN was pretrained using the datasets obtained from 2010 to 2015. The model was evaluated with the skill score of the true skill statistic (TSS) and achieved predictions with TSS = 0.80 for ≥M-class flares and TSS = 0.63 for ≥C-class flares. For comparison, we evaluated the operational forecast results from January 2019 to June 2020. We found that operational DeFN forecasts achieved TSS = 0.70 (0.83) for ≥C-class flares with the probability threshold of 50 (40) %, although there were very few M-class flares during this period and we should continue monitoring the results for a longer time. Here, we adopted a chronological split to divide the database into two parts for training and testing. The chronological split appears suitable for evaluating operational models. Furthermore, we proposed the use of time-series cross-validation. This procedure achieved TSS = 0.70 for ≥M-class flares and 0.59 for ≥C-class flares using the datasets obtained from 2010 to 2017. Finally, we discuss standard evaluation methods for operational forecasting models, such as the preparation of observation, training, and testing datasets, and the selection of verification metrics.
keywords:
solar flares, space weather forecasting, prediction, operational model, deep neural networks, verification
1. Introduction
The mechanism of solar flares is a long-standing puzzle. Solar flares emit X-rays, highly energetic particles, and coronal mass ejections (CMEs) into
interplanetary space in the heliosphere, making them one of the origins of space weather phenomena (e.g., Schwenn et al., 2005; Fletcher et al., 2011; Liu et al., 2014; Möstl et al., 2015). The prediction of flares is essential for reducing the damage to technological infrastructures on Earth. Solar flares are triggered by a
newly emerging magnetic flux or magnetohydrodynamic instability to release excess magnetic energy stored in the solar atmosphere (e.g., Aulanier et al., 2010; Shibata & Magara, 2011; Cheung & Isobe, 2014; Wang et al., 2017; Toriumi & Wang, 2019). Such phenomena are monitored by the Solar Dynamics Observatory (SDO; Pesnell et al., 2012) and the Geostationary Operational
Environmental Satellite (GOES), and the observation data are used for the prediction of flares.
Currently, flare prediction is tackled by the following four approaches: (i) empirical human forecasting (Crown, 2012; Devos et al., 2014; Kubo et al., 2017; Murray et al., 2017), (ii) statistical
prediction methods (Lee et al., 2012; Bloomfield et al., 2012; McCloskey et al., 2016; Leka et al., 2018), (iii) machine learning methods (e.g., Bobra & Couvidat, 2015; Muranushi et al., 2015; Nishizuka et al., 2017, and references therein), and
(iv) numerical simulations based on physics equations (e.g., Kusano et al., 2012, 2020; Inoue et al., 2018; Korsós et al., 2020). Some of the models have been made available for
community use at the Community Coordinated Modeling Center (CCMC) of NASA (e.g., Gallagher et al., 2002; Shih & Kowalsky, 2003; Colak & Qahwaji, 2008, 2009; Krista & Gallagher, 2009; Steward et al., 2011; Falconer et al., 2011, 2012).
It is useful to demonstrate the robust performance of each model, and prediction models have been evaluated for comparison in benchmark workshops, where
methods that include machine learning algorithms as part of their systems were also discussed (Barnes et al., 2016; Leka et al., 2019; Park et al., 2020).
Recently, the application of supervised machine learning methods, especially deep neural networks (DNNs), to solar flare prediction has been a hot
topic, and their successful application in research has been reported (Huang et al., 2018; Nishizuka et al., 2018; Park et al., 2018; Chen et al., 2019; Domijan et al., 2019; Liu et al., 2019; Zheng et al., 2019; Bhattacharjee et al., 2020; Jiao et al., 2020; Li et al., 2020; Panos & Kleint, 2020; Yi et al., 2020). However, there has been insufficient discussion on how to make these methods available for real-time operations in space weather forecasting offices,
including the methods for validation and verification of the models. Currently, new physical and geometrical (topological) features are applied to flare
prediction using machine learning (e.g., Wang et al., 2020a; Deshmukh et al., 2020), and it has been noted that training sets may be sensitive to which period in the solar
cycle they are drawn from (Wang et al., 2020b).
It has been one year since we started operating our flare prediction model using DNNs, which we named Deep Flare Net (DeFN: Nishizuka et al., 2018). Here, we evaluate the prediction results during real-time operations at the NICT space weather forecasting office in Tokyo, Japan. In this paper, we introduce the operational version of DeFN in sections 2 and 3, and we show the prediction results in section 4. We propose the use of time-series cross-validation (CV) to evaluate operational models in section 5. We summarize our results and discuss the selection of a suitable evaluation method for models used in operational settings in section 6.
2. Flare Forecasting Tool in Real-Time Operation
2.1. Procedures of Operational DeFN
DeFN is designed to predict solar flares occurring in the following 24 h after observing magnetogram images, in two
categories: (≥M-class and <M-class) or (≥C-class and <C-class). In the operational system of DeFN forecasts, observation images
are automatically downloaded, active regions (ARs) are detected, and 79 physics-based features are extracted for each region. Each feature is
standardized by the average value and standard deviation and is input into the DNN model, DeFN. The output is the flare occurrence probabilities
for the two categories. Finally, the maximum class of flares occurring in the following 24 h is forecast by taking the maximum probability of the
forecasts.
Operational DeFN was redesigned for automated real-time forecasting with operational redundancy. All the programs written in IDL and Python languages are driven by cron scripts at the prescribed forecast issuance time as scheduled. There are a few differences from the original DeFN used for research, as explained in the next subsection. A generalized flow chart of operational DeFN is shown in Figure 1.
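To make this sequence concrete, the following Python sketch outlines one forecast cycle. It is only an illustration of the steps described above: the helper callables (for download, detection, and feature extraction) and the model interface are hypothetical placeholders passed in as arguments, not the actual DeFN code, which is a set of IDL and Python programs driven by cron.

```python
# Minimal sketch of one operational DeFN forecast cycle (hypothetical helpers).
import numpy as np

def run_forecast_cycle(issue_time, download, detect, extract, model,
                       feature_mean, feature_std):
    """Produce one probabilistic forecast for the 24 h following issue_time."""
    images = download(issue_time)                      # SDO/HMI, AIA, GOES NRT data
    if images is None:
        return None                                    # "no data": the forecast is skipped
    regions = detect(images)                           # AR detection from magnetograms
    if not regions:
        return {"p_full_disk": 0.01, "mark": "Quiet"}  # "no sunspot": 1 % forecast
    per_region = []
    for ar in regions:
        x = np.asarray(extract(ar, images), dtype=float)  # 79 physics-based features
        x = (x - feature_mean) / feature_std              # standardization
        p_lt, p_ge = model.predict(x)                     # probabilities of the two categories
        per_region.append({"region": ar, "p_ge": float(p_ge)})
    # The maximum flare class in the next 24 h is forecast from the maximum probability
    return max(per_region, key=lambda f: f["p_ge"])
```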
2.2. NRT Observation Data
The first difference between the development DeFN and operational DeFN is the use of near real-time (NRT) observation data. We use the observation
data of line-of-sight magnetograms and vector magnetograms taken by the Helioseismic and Magnetic Imager (HMI; Scherrer et al., 2012; Schou et al., 2012; Hoeksema et al., 2014)
on board SDO; ultraviolet (UV) and extreme ultraviolet (EUV) images obtained by the Atmospheric Imaging Assembly (AIA; Lemen et al., 2012) through
the 1600 Å and 131 Å filters; and the full-disk integrated X-ray emission over the range of 1–8 Å observed by GOES. For visualization,
we also use white light images taken from HMI and EUV images obtained using AIA through 304 Å and 193 Å filters. The time cadence
of the vector magnetograms is 12 min, that of the line-of-sight magnetograms is 45 s, those of the 1600 Å and 131 Å filters are both
12 s, and that of GOES is less than 1 min.
The data product of SDO is provided by the Joint Science Operations Center (JSOC) of Stanford University. The HMI NRT data are generally
processed and available for transfer within 90 min after observations (Leka et al., 2018). This is why DeFN was designed to download the observation
dataset 1 h earlier. If the observation data are unavailable because of processing or transfer delays, the target of downloading is moved back in
time to 1 to 5 h earlier in the operational DeFN system. When no data can be found beyond 5 h earlier, it is considered that the data are missing.
Here, the time of 5 h was determined by trial and error. Forecasting takes 20–40 min for each prediction; thus, it is reasonable to set the forecasting
time to as often as once per hour. The 1 h cadence is comparable to the timescale of the evolution of the magnetic field configuration in active regions
due to flux emergence or to changes before and after a flare. However, DeFN started operating in the minimum phase of solar activity, so we started
forecasting with a 6 h cadence instead of a 1 h cadence.
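The roll-back logic for delayed NRT data described above can be sketched as follows. The callable `fetch_jsoc_nrt` is a hypothetical stand-in for the actual JSOC download step, so this is only an illustration of the 1–5 h fallback, not the operational code.

```python
from datetime import timedelta

def get_latest_nrt_data(target_time, fetch_jsoc_nrt, max_lag_hours=5):
    """Try the nominal observation time, then step back 1 h at a time up to 5 h.

    fetch_jsoc_nrt is a hypothetical callable returning the NRT data for a given
    time, or None if JSOC processing/transfer is delayed.
    """
    for lag in range(0, max_lag_hours + 1):
        data = fetch_jsoc_nrt(target_time - timedelta(hours=lag))
        if data is not None:
            if lag > 0:
                # The forecast window effectively becomes 24 + lag hours
                print(f"Using data observed {lag} h earlier than nominal")
            return data
    return None  # treated as missing data; the forecast is skipped
```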
The NRT vector magnetograms taken by HMI/SDO are used for operational forecasts, whereas the calibrated HMI ‘definitive’ series of vector
magnetograms are used for scientific research. The NRT vector magnetograms are accessed from the data series ‘hmi.bharp_720s_nrt’ with
segmentations of ‘field’, ‘inclination’, ‘azimuth’, and ‘ambig’. These segmentations indicate the components of field strength, inclination angle, azimuth
angle, and the disambiguation of magnetic field in the photosphere, respectively. Additionally, the NRT line-of-sight magnetograms are downloaded
from the data series ‘hmi.M_720s_nrt’, and the NRT white light images are from the ‘hmi.Ic_noLimbDark_720s_nrt’ (jsoc2) series. The NRT
data of the AIA 131 Å, 193 Å, 304 Å, and 1600 Å filters are retrieved from the ‘aia.lev1_nrt2’ (jsoc2) series.
Note that the HMI NRT vector magnetogram is not for the full disk, in contrast to the HMI definitive series data. HMI Active Region Patches (HARP)
are automatically detected in the pipeline of HMI data processing (Bobra et al., 2014), and the HMI NRT vector magnetogram is limited to the HARP areas
plus a buffer, on which we overlaid our active region frames detected by DeFN and extracted 79 physics-based features (Figure 2; also see the
detection algorithms and extracted features in Nishizuka et al. (2017, 2018), and details of the HMI NRT vector magnetogram in Leka et al., 2018).
Furthermore, the correlation between the HMI NRT data and the definitive data has not yet been fully characterized statistically. A future task is to reveal
how the difference between the HMI NRT and definitive series data affects the forecasting results. The same comments apply to the AIA
NRT and definitive series data.
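As an aside on the region-detection step mentioned above, a minimal sketch of finding candidate regions by thresholding a line-of-sight magnetogram at 140 G (the detection threshold used by DeFN, see section 3.1) is given below. This is not the actual DeFN detection algorithm of Nishizuka et al. (2017), which is more elaborate; the size cut `min_pixels` is an assumed illustrative parameter.

```python
import numpy as np
from scipy import ndimage

def detect_candidate_regions(blos, threshold_gauss=140.0, min_pixels=100):
    """Label contiguous patches where |B_los| exceeds the threshold.

    blos is a 2-D line-of-sight magnetogram in gauss. This is only a
    thresholding/labelling sketch, not the DeFN detection pipeline.
    """
    mask = np.abs(blos) > threshold_gauss
    labels, n_regions = ndimage.label(mask)   # connected-component labelling
    regions = []
    for i in range(1, n_regions + 1):
        ys, xs = np.where(labels == i)
        if ys.size >= min_pixels:             # drop tiny patches (assumed cut)
            regions.append({"bbox": (ys.min(), ys.max(), xs.min(), xs.max()),
                            "n_pixels": int(ys.size)})
    return regions
```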
2.3. Implementation of Operational DeFN
Operational DeFN runs autonomously every 6 h by default, forecasting at 03:00, 09:00, 15:00, and 21:00 UT. The forecasting time of 03:00 UT
was set to be before the daily forecasting meeting of NICT at 14:30 JST. The weights of the multi-layer perceptrons of DeFN were trained with the
2010–2014 observation datasets, and we selected representative hyperparameters using the 2015 observation dataset.
For the classification problem, parameters are optimized to minimize the cross-entropy loss function. However, since the flare occurrence ratio is imbalanced, we adopted a loss function with normalizations of the prior distributions, namely the weighted sum of the cross entropy:

$E = - \sum_{n} \sum_{k} w_k\, t_{nk} \log y_{nk}, \qquad (1)$

where $t_{nk}$ is the probability of the correct label, i.e., 1 or 0, and $y_{nk}$ is the estimated probability. The components of $t_{n}$ are 1 or 0; thus, $\sum_k t_{nk} = 1$. $w_k$ is the weight of each class and is the inverse of the class occurrence
ratio, i.e., $(w_1, w_2) = (1, 50)$ for ≥M-class flares and $(1, 4)$ for ≥C-class flares. Parameters are stochastically optimized by adaptive moment
estimation (Adam; Kingma & Ba, 2014) with learning rate $\alpha = 0.001$, $\beta_1 = 0.9$, and $\beta_2 = 0.999$. The batch size was set to 150 (for
details, see Nishizuka et al., 2018).
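As a minimal sketch of the loss in equation (1), the following NumPy function computes the class-weighted cross entropy for one-hot labels; the averaging over samples and the variable names are our own illustrative choices, not the exact DeFN implementation.

```python
import numpy as np

def weighted_cross_entropy(y_true, y_pred, class_weights):
    """Class-weighted cross entropy as in equation (1), up to overall normalization.

    y_true : (N, 2) one-hot labels t_nk (1 or 0)
    y_pred : (N, 2) predicted probabilities y_nk
    class_weights : length-2 weights w_k, e.g. (1, 50) for the >=M-class model
                    or (1, 4) for the >=C-class model (inverse class ratios)
    """
    eps = 1e-12  # numerical floor to avoid log(0)
    w = np.asarray(class_weights, dtype=float)
    return -np.mean(np.sum(w * y_true * np.log(y_pred + eps), axis=1))
```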
Theoretically, the positive and negative events, i.e., whether ≥M-class flares occur or not, are predicted in the following manner, which is commonly used in machine learning:

$\hat{y}_n = 1 \ \ \mathrm{if}\ \ y_{n2} \geq \theta, \qquad \hat{y}_n = 0 \ \ \mathrm{otherwise}, \qquad (2)$

where $\hat{y}_n$ is the prediction result and the threshold $\theta$ is usually fixed. For example, in the case of two-class classification, the events with
a probability greater than 50 % ($\theta = 0.5$) are output as positive. When we use the model as a probabilistic prediction model, we also tried smaller threshold values
for safety in operations, although this has no obvious theoretical justification.
Note that the loss function weights cannot be selected arbitrarily. The positive to negative event ratios of ≥M-class and <M-class
or ≥C-class and <C-class flares, which correspond to the occurrence frequency or the climatological base rate, were approximately 1:50 and 1:4, respectively, during
2010–2015. Only when the cross entropy is weighted by the inverse ratio of positive to negative events
does it become theoretically valid to output the prediction by equation (2) with the default threshold of 50 %. Therefore, we used the base rate as the weight of the cross entropy.
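The effect of this weighting on the decision threshold can be checked numerically. The short sketch below (our own illustration, not DeFN code) renormalizes a raw probability with the class weights and shows that the weighted probability crosses 50 % exactly where the raw probability crosses the climatological base rate (about 1/51 for ≥M-class flares).

```python
def reweighted_probability(p, w_pos, w_neg=1.0):
    """Renormalize a raw flare probability p with class weights (illustration only)."""
    return w_pos * p / (w_pos * p + w_neg * (1.0 - p))

# With the >=M-class weights (1, 50), i.e. a positive:negative base rate of ~1:50,
# the reweighted probability crosses 50 % exactly where the raw probability
# crosses the climatological rate 1/51:
base_rate = 1.0 / 51.0
for p in (0.5 * base_rate, base_rate, 2.0 * base_rate):
    print(p, reweighted_probability(p, w_pos=50.0))
# -> below 0.5, exactly 0.5, above 0.5, respectively
```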
The DNN model of the operational DeFN was developed as in Nishizuka et al. (2018). Because the full HMI and AIA datasets obtained from 2010
to 2015 were too large to save and analyze, the cadence was reduced to 1 h, although in general a larger amount of data is useful for better
predictions. We divided the feature dataset into two for training and validation with a chronological split: the dataset obtained from 2010 to 2014
for training and the 2015 dataset for validation. A main point of this paper is to contrast how well the DeFN model predicts solar flares in
real-time operations and in research settings using time-series CV methods (a simple shuffle-and-divide CV is insufficient). We then show that the
gap between the prediction accuracies in operations and in research using a time-series CV is small (see section 4.3).
The time-series CV is stricter than a K-fold CV on data split by active region. It is true that a K-fold CV on data split by active region
can also prevent data from a single active region from being used in both training and testing (e.g., Bobra & Couvidat, 2015). However, a K-fold CV on data split by
active region allows the training set to contain future samples from different active regions. This may affect the prediction results when there
is a long-term variation of solar activity. Moreover, the number of active regions that produced X-class and M-class flares is small, so
a K-fold CV on data split by active region may be biased and unbalanced.
Indeed, operational solar flare prediction is carried out under a very strict condition in which no future data are available. Our aim is not to reject
a K-fold CV on data split by active region, but to discuss more appropriate CVs in an operational setting.
The model was evaluated with a skill score, the true skill statistic (TSS; Hanssen & Kuipers, 1965), which is a metric of the discrimination performance.
Then, the model succeeded in predicting flares with TSS = 0.80 for ≥M-class and TSS = 0.63 for ≥C-class flares (Table 1). Note that the
data for 2016–2018 were not used, because there were fewer flares in this period than in the period between 2010 and 2015.
Flare labels were attached to the 2010–2015 feature database for supervised learning. We collected all the flare samples that occurred on the disk
from the flare event list. We visually checked the locations of the flares, compared them with NOAA numbers, and identified the corresponding active
regions in our database when there were two or more active regions. Then we attached flare labels to predict the maximum class of flares occurring
in the following 24 h. If ≥M-class flares are observed within 24 h after the observations, the data are attached with the label (0, 1)_{≥M};
otherwise, they are attached with the label (1, 0)_{≥M}. When two ≥M-class flares occur within 24 h of each other, the period with the label (0, 1)_{≥M} is
extended. Similarly, the labels (0, 1)_{≥C} and (1, 0)_{≥C} are separately attached for the ≥C-class prediction model. The training
was executed using these labels. On the other hand, in real-time operation, we do not know the true labels of flares, so we attached the NRT
feature database with dummy labels (1, 0), which are not used in the predictions. It is possible to update the model by retraining it using the latest
datasets if the prediction accuracy decreases. However, the pretrained operational model is currently fixed and has not been changed.
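The labeling rule can be sketched as follows. The function is a simplified illustration of attaching (0, 1) / (1, 0) labels within a 24 h window, not the actual labeling code used to build the database.

```python
from datetime import timedelta

def attach_flare_labels(sample_times, flare_times, window_hours=24):
    """Attach (0, 1) / (1, 0) labels for the >=M-class model (minimal sketch).

    sample_times : times of the feature samples for one active region
    flare_times  : start times of >=M-class flares observed in that region
    A sample gets label (0, 1) if at least one flare starts within the
    following 24 h; otherwise it gets (1, 0). Overlapping flares simply
    extend the period labelled (0, 1).
    """
    window = timedelta(hours=window_hours)
    labels = []
    for t in sample_times:
        flare_in_window = any(t < tf <= t + window for tf in flare_times)
        labels.append((0, 1) if flare_in_window else (1, 0))
    return labels
```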
3. Operation Forecasts Using DeFN
3.1. Graphical Output
The graphical output is automatically generated and displayed on a website (Figure 3). The website was designed to be easy to understand for professional space
weather forecasters, who are often not scientists. Prediction results for both the full-disk and region-by-region images are shown
on the website, and the risk level is indicated by one of three marks: “Danger flares”, “Warning”, and “Quiet”. Images are updated every 6 h as new data
are downloaded. Details of the DeFN website are described below:
• Solar full-disk images and detected ARs: Images obtained by multiwavelength observations, such as magnetograms and white light, 131 Å, 193 Å, 304 Å, and 1600 Å images taken by SDO, are shown along with the ARs detected by DeFN, where the threshold is set to 140 G in the line-of-sight magnetograms taken by HMI/SDO (see details of the detection method in Nishizuka et al., 2017).
• Probabilistic forecasts at each AR: Probabilistic forecasts of flare occurrence at each AR are shown for ≥M-class and <M-class flares or ≥C-class and <C-class flares by bar graphs, in analogy with the probabilistic forecasts of precipitation. Note that this forecasted probability does not indicate the real observation frequency, because the prior distributions are normalized to peak at 50 % by the weighted cross entropy, where the loss function weights are the inverse of the flare occurrence ratio (Nishizuka et al., 2020). Thus, operational DeFN is optimized for forecasting with the default probability threshold of 50 %. That is, operational DeFN forecasts flares if the real occurrence probability, which would be forecast by the non-weighted cross-entropy loss function, is greater than the climatological event rate, and it does not forecast flares if the real occurrence probability is less than the climatological event rate (see also Nishizuka et al., 2020). Therefore, the normalized forecasted probability of ≥M-class flares sometimes becomes larger than that of ≥C-class flares.
• Full-disk probability forecasts and alert marks: The full-disk flare occurrence probability of ≥M-class flares, $P_{\rm disk}$, is calculated using

$P_{\rm disk} = 1 - \prod_{i=1}^{N} (1 - P_i), \quad i \in R, \qquad (3)$

where $R$ is the set of ARs on the disk, $N$ is the number of ARs on the disk, element $i$ is a member of set $R$, and $P_i$ is the probabilistic forecast at each AR (Leka et al., 2018). The risk level is indicated by a mark based on $P_{\rm disk}$ and is categorized into three categories: “Danger flares” ($P_{\rm disk} \geq$ 80 %), “Warning” ($P_{\rm disk} \geq$ 50 %), and “Quiet” ($P_{\rm disk} <$ 50 %). This is analogous to weather forecasting, e.g., sunny, cloudy, and rainy. (A minimal code sketch of this combination is given after this list.)
• List of comments and remarks: Forecasted probabilities (percentages), comments, and remarks are summarized in a list.
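Assuming the reconstruction of equation (3) above and the alert thresholds listed in the third bullet, the full-disk probability and the alert marks can be computed as in the following sketch (an illustration, not the operational code):

```python
def full_disk_probability(region_probs):
    """Combine per-region probabilities P_i into a full-disk probability,
    equation (3): P_disk = 1 - prod_i (1 - P_i) over all detected ARs."""
    p = 1.0
    for p_i in region_probs:
        p *= (1.0 - p_i)
    return 1.0 - p

def alert_mark(p_disk):
    """Map the full-disk probability to the three alert marks of section 3.1."""
    if p_disk >= 0.8:
        return "Danger flares"
    if p_disk >= 0.5:
        return "Warning"
    return "Quiet"

# Example: two ARs with 30 % and 40 % probabilities give a 58 % full-disk value
print(alert_mark(full_disk_probability([0.3, 0.4])))  # -> "Warning"
```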
3.2. Details of Operations and Redundancy
Operational DeFN has provided stable forecasts since January 2019. In this subsection, we explain the redundancy and operational details of operational DeFN.
• Forecast outages: A forecast is the category with the maximum flare probability in each of the categories for the 24 h following the forecast issue time. A forecast is normally issued every 6 h. If problems occur when downloading or processing data, the forecast is skipped and handled as a forecast failure.
• Data outages of SDO/HMI and AIA: There are delays in the HMI/SDO data processing when no applicable NRT data are available for a forecast. In this case, the NRT data to be downloaded are moved back in time to 1 to 5 h earlier. In such a case, the forecasting target window effectively changes from 24 h to 25–29 h, though operational DeFN is not retrained. If no data can be found beyond 5 h earlier, the “no data” value is assigned and the forecast is skipped.
• No sunspots or ARs with strong magnetic field on the disk: If no sunspots or ARs are detected with the threshold of 140 G on the disk image of the line-of-sight magnetogram, feature extraction is skipped, a forecast of “no flare” with a probability of 1 % is issued, and the “no sunspot” value is assigned.
• Forecasts at ARs near/across the limb: DeFN is currently not applicable to limb events. If an AR is detected across the limb, it is excluded from the forecast targets.
• Flares not assigned to an active region: The active regions detected by operational DeFN are not exactly the same as the active regions registered by NOAA. There are cases where flares occur in decaying or emerging active regions that are not detected by DeFN with the threshold of 140 G. This occurs most often for C-class and lower intensity flares, for example, the C2.0 flare in NOAA 12741 on 2019 May 15. Such a flare is missed in real-time forecasts but included in the evaluations.
• Retraining: DeFN can be retrained on demand, and a newly trained model can be used for forecasting. Currently, the pretrained model is fixed and has not been changed so far.
• Alternative data input after the SDO era (option): Since DeFN is designed to detect ARs and extract features by itself, it can be revised and trained to include other space- and ground-based observation data, even when SDO data are no longer available.
4. Forecast Results and Evaluation
4.1. Operational Benchmark
The purpose of machine-learning techniques is to maximize the performance for unseen data. This is called generalization performance.
Because it is hard to measure generalization performance, it is usually approximated by test-set performance, where there is no overlap
between the training and test sets.
On the other hand, as the community continues to use a fixed test set, the performance of newly proposed models will appear to
improve year by year. In reality, generalization performance is not constantly improving; instead, there will be more and more models that are effective only
for that test set. This is partially because models with lower performance than the state of the art are not reported. In other words, there are
more and more models that are not necessarily valid for unseen datasets. It is essentially indistinguishable whether an improvement is due to better
generalization performance or to a method that is effective only for the test set.
The above facts are well known in the machine learning community, and evaluation conditions are mainly divided into two types: basic and strict.
Under the strict evaluation condition, only an independent evaluation body evaluates each model using the test set, and only once. The test set is
not published to the model builders (see, e.g., Ardila et al., 2019). Solar flare prediction is inherently a prediction of future solar activity using a
present observation dataset, and the only data available to researchers are past data. This is consistent with the strict evaluation
condition in the machine learning community.
In this section, we evaluate our operational forecasting results; we call this the “operational benchmark” in this paper. In the machine learning
community, a benchmark using a fixed test set is used only for basic benchmark tests. The basic approach is simple but is known to be insufficient,
because no one can guarantee that the test set is used only once. In strict machine learning benchmarks, evaluation with a completely
unseen test set is required. Only the organizer can see the “completely unseen test set”, which cannot be seen by individual researchers. This is because,
if researchers use the test set many times, they implicitly tend to select models that are effective only for that fixed test set.
We think that the evaluation methods of operational solar flare prediction models are not limited to evaluations using a fixed test set. However,
this paper does not deny performance evaluation using a fixed test set. The purpose of this paper is to show that operational evaluation
is important. From a fairness perspective, the strict benchmarking approach takes precedence over the basic approach. Our operational
evaluation is based on the strict benchmarking approach: we did not retrain our model after the deployment of our system.
4.2. Forecast Results and Evaluation
We evaluated the forecast results from January 2019 to June 2020, when we operated operational DeFN in real time. During this period, 24 C-class
flares and one M-class flare were observed. The M-class flare was observed on 6 May 2019 as M1.0, which was originally reported as C9.9 and
corrected to M1.0 later. The forecast results are shown in Table 2. Each contingency table shows the prediction results for ≥M-class and
≥C-class flares. Operational DeFN was originally trained with the probability threshold of 50 % to decide the classification, but in operations,
users can change it according to their purposes. In Table 2, we show three cases each for ≥M-class and ≥C-class predictions using different
probability thresholds, namely 50 %, 45 %, and 40 %, for reference.
Each skill score can be computed from the entries of the contingency table, but not vice versa; this is a well-known fact. No matter how many
skill scores are shown, they do not contain more information than the contingency table itself. The relative operating characteristic (ROC) curve and the
reliability diagram, which are shown in Leka et al. (2019), can also be reproduced from the contingency table when they refer to a deterministic
forecast (the type of forecast in this paper). The ROC curve is a curve or straight line made by plotting points on the probability of false detection (POFD) – probability
of detection (POD) plane. The ROC curve for a deterministic forecast is made by connecting the three points (0, 0), (POFD, POD) for the deterministic
forecast, and (1, 1) (see, e.g., Richardson, 2000; Jolliffe & Stephenson, 2012). For reference, the skill scores used in Leka et al. (2019) include the accuracy, Hanssen
& Kuipers skill score/Peirce skill score/true skill statistic (TSS/PSS), Appleman skill score (ApSS), equitable threat score (ETS), Brier skill score,
mean-square-error skill score (MSESS), Gini coefficient, and frequency bias (FB).
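To illustrate that the metrics reported in Table 3 follow directly from the contingency-table entries, the following Python sketch (our own, not part of the DeFN code) computes the accuracy, TSS, FAR, and HSS from the four counts; applying it to the ≥C-class table at the 50 % threshold in Table 2 reproduces the corresponding row of Table 3.

```python
def skill_scores(tp, fp, fn, tn):
    """Verification metrics used in Table 3, computed from a 2x2 contingency table."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    pod = tp / (tp + fn)                      # probability of detection
    pofd = fp / (fp + tn)                     # probability of false detection
    tss = pod - pofd                          # true skill statistic (Hanssen & Kuipers)
    far = fp / (tp + fp)                      # false alarm ratio
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n
    hss = ((tp + tn) - expected) / (n - expected)  # Heidke skill score
    return accuracy, tss, far, hss

# Example: operational >=C-class forecasts with the 50 % threshold (Table 2)
print(skill_scores(tp=27, fp=18, fn=11, tn=2177))
# -> approximately (0.99, 0.70, 0.40, 0.64), matching Table 3
```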
According to Table 2, the flare occurrence was very rare and imbalanced in the solar minimum phase. Most of the forecasts are true negative.
When we decrease the probability threshold, the number of forecast events increases. We evaluated our results with the four verification metrics
in Table 3: accuracy, TSS, false alarm ratio (FAR), and Heidke skill score (HSS) (Murphy, 1993; Barnes et al., 2009; Kubo et al., 2017). They show that operational DeFN
optimized for ≥C-class flare prediction achieved an accuracy of 0.99 and a TSS of 0.70 with the probability threshold of 50 %, whereas they
were 0.98 and 0.83 with the probability threshold of 40 %. DeFN optimized for ≥M-class flare prediction achieved an accuracy of 0.99, but the TSS
was only 0.24 because only a single M1.0 flare occurred. Operational DeFN did not predict this flare because it was at the boundary of the two
categories of ≥M-class and <M-class flares. This happens frequently in real operations and is a weakness of binary classification systems
used in operational settings.
The trends of the contingency tables are similar to those evaluated in the model development phase (Table 2). However, there are two differences.
First, the data used were the NRT data, whereas the definitive series was used for development; in this case, however, there was a negligible difference
between them. Second, the evaluation methods are different. Operational DeFN was evaluated on the actual data from 2019 to 2020, whereas
the development model was trained with the 2010–2014 datasets and tested with the 2015 dataset. It appears that the chronological split provides
more suitable evaluation results for operations than the common methods, namely, shuffle-and-split CV and K-fold CV.
4.3. Time-series CV
Here we propose the use of time-series CV for evaluations of operational forecasting models. In previous papers on flare predictions, we used
hold-out CV, where a subset of the data split chronologically was reserved for validation and testing, rather than the naïve K-fold CV. This is
because it is necessary to be careful when splitting time-series data to prevent data leakage (Nishizuka et al., 2018). To accurately evaluate prediction
models in an operational setting, the training set must not contain data from events that occur chronologically after the events used for testing.
The time-series CV is illustrated in Figure 4. In this procedure, there is a series of testing datasets, each consisting of a set of observations and
used to estimate the prediction error. The corresponding training dataset consists only of observations that occurred prior to the observations forming the testing
dataset and is used for parameter tuning; thus, the model is never tested on data that pre-date the training set. Furthermore, the training
dataset is divided into training and validation subsets. The model prediction accuracy is calculated by averaging over the testing datasets. This
procedure is called rolling forecasting origin-based CV (Tashman, 2000). In this paper, we call it time-series CV, and it provides an almost unbiased
estimate of the true error (Varma & Simon, 2006).
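The splitting scheme of Table 4 can be written down explicitly. The sketch below (our own illustration) generates the five expanding-window training/validation/testing splits for the 2010–2017 yearly datasets.

```python
def time_series_cv_splits(years):
    """Expanding-window splits used in Table 4 (rolling forecasting origin CV).

    For years [2010, ..., 2017], each split uses all data up to year Y-2 for
    training, year Y-1 for validation, and year Y for testing.
    """
    splits = []
    for i in range(3, len(years)):
        splits.append({"train": years[:i - 1],
                       "validation": years[i - 1],
                       "test": years[i]})
    return splits

for s in time_series_cv_splits(list(range(2010, 2018))):
    print(s)
# First split: train 2010-2011, validation 2012, test 2013 (as in Table 4);
# the skill score is then averaged over the five test years.
```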
Note that the time-series CV has the following advantages: (i) the time-series CV is the standard validation scheme in time-series prediction;
(ii) a single chronological split does not always reflect the generalization error reliably (Bishop, 2006). In other words, the trained model is not guaranteed
to work for an unseen test set. To avoid this, the time-series CV applies multiple chronological splits. The ability to correctly predict new examples
that differ from those used for training is known as generalization performance (Bishop, 2006). Therefore, the time-series CV is more generic and
appropriate.
The evaluation results obtained by time-series CV using the 2010–2017 datasets are summarized in Table 4. The datasets were chronologically
split to form the training, validation, and testing datasets. TSS is largest with the 2010–2014 datasets for training, the 2015
datasets for validation, and the 2016 datasets for testing. This is probably because it is not possible to obtain a reliable forecast
based on a small training dataset obtained from 2010 to 2012. By averaging over the five testing datasets, we found that the TSS is 0.70
for ≥M-class flares and 0.59 for ≥C-class flares. This procedure will become even more suitable as observation datasets covering
a longer time period accumulate.
5. Summary and Discussion
We developed an operational flare prediction model using DNNs, which was based on a research version of the DeFN model, for operational forecasts.
It can provide probabilistic forecasts of flares in two categories occurring in the next 24 h after observations: ≥M-class and <M-class flares
or ≥C-class and <C-class flares. DeFN has been continuously used for operational forecasting since January 2019, and we evaluated its
performance using the forecasts and actual flare occurrences between January 2019 and June 2020. We found that operational DeFN achieved an
accuracy of 0.99 and a TSS of 0.70 for ≥C-class flare predictions, whereas the accuracy was 0.99 but the TSS was only 0.24 for ≥M-class flare
prediction using a probability threshold of 50 %. Using a probability threshold of 40 %, the accuracy was 0.98 and the TSS was 0.83 for ≥C-class
flares, whereas they were 0.98 and 0.48 for ≥M-class flares.
Operational DeFN has the advantages of a large TSS, good discrimination performance, and a low probability of missing observed flares.
This makes it useful for operations that require that no flares be missed, such as human activities in space and critical operations of satellites.
On the other hand, it tends to over-forecast, and the false alarm ratio (FAR) increases. Because the number of true negatives is very large in an
imbalanced problem such as solar flare prediction, TSS is less sensitive to false positives than to false negatives. Currently, the prior distributions
of ≥M-class and <M-class flares are renormalized to increase TSS at a probability threshold of 50 %, but this results in an increase in FAR.
When we compared the evaluation results, we observed no significant difference between the pretrained and operational results. This means that,
at least during January 2019 – June 2020, the difference between NRT and definitive series science data did not greatly affect the forecasts. We
found that the TSS of 0.63 for the ≥C-class model evaluated with the pretrained model was maintained and even increased to 0.70 (0.83) for
operational forecasts with the probability threshold of 50 (40) %. This suggests that the chronological split is more suitable for the training and
validation of the operational model than shuffle-and-split CV.
Here, we discuss how to train and evaluate machine learning models for operational forecasting. For an exact comparison, it is desirable to use the
same datasets among participants. If this is not possible, there are three points that require attention.
(i) Observation Database: The ratio of positive to negative events should not be artificially changed, and datasets should not be selected artificially. The event rate in the data should follow the climatological event rate and be kept natural. This is because some metrics, especially HSS, are affected by controlling the positive to negative event ratio of the datasets, which will result in a difference from the operational evaluations. For operational evaluations, it is also desirable to include ARs near the limb, although they are excluded in most papers because the values of the magnetograms are unreliable owing to the projection effect. Currently, limb flares are not considered in machine learning models, but they also need to be considered in the near future, using GOES X-ray statistics as in human forecasting or magnetograms reproduced from STEREO EUV images (Kim et al., 2019).
(ii) Datasets for Training and Testing: We recommend that a chronological split or time-series CV be used for the training and evaluation of operational models. Although K-fold CV using random shuffling is common in solar flare prediction, it is problematic for a time-series dataset divided into two for training and testing when the time variation is very small, e.g., the time evolution of the magnetic field. If two neighboring samples, which are very similar, are divided between the training and testing sets, the model becomes biased to overpredict flares. It might be true that a K-fold CV on data split by active region can also prevent data from a single active region from being used in both training and testing. However, a K-fold CV on data split by active region allows the training set to contain future samples from different active regions. Therefore, from the point of view of generalization performance, a time-series CV is stricter and more suitable for operational evaluation.
(iii) Selection of Metrics: The ranking of models is easily affected by the selection of the metric. Depending on the purpose, users should select their preferred model by looking at the contingency tables and skill scores of each model. After understanding that each skill score evaluates one aspect of performance, verification methods should be discussed in the space weather community (see also Pagano et al., 2019; Cinto et al., 2020).
In this paper, we showed contingency tables of our prediction results. No matter how many skill scores are shown, they do not contain more information
than the contingency table itself. We evaluated our prediction results as a deterministic forecasting model. The ROC curve and the reliability diagram,
which are shown in Barnes et al. (2016) and Leka et al. (2019), can also be reproduced from the contingency table when they refer to a deterministic
forecast.
We demonstrated the performance of a machine learning model in an operational flare forecasting scenario. The same methods and discussion of
prediction using machine learning algorithms can be applied to other forecasting models of space weather in the magnetosphere and ionosphere.
Our future aim is to extend our model to predicting CMEs and social impacts on Earth by extending our database to include geoeffective phenomena
and technological infrastructures.
6 Declarations
7 Availability of data and materials
The code is available at https://github.com/komeisugiura/defn18. In the README file, we explain the architecture and the selected hyperparameters. The feature database of DeFN is available at the world data center of NICT (http://wdc.nict.go.jp/IONO/wdc/). The SDO data are available from the SDO data center (https://sdo.gsfc.nasa.gov/data/) and JSOC (https://jsoc.stanford.edu/). The GOES data are available at https://services.swpc.noaa.gov/json/goes/.
8 Competing interests
The authors declare that they have no competing interests.
9 Funding
This work was partially supported by JSPS KAKENHI Grant Number JP18H04451 and NEDO. A part of these research results was obtained within “Promotion of observation and analysis of radio wave propagation”, commissioned research of the Ministry of Internal Affairs and Communications, Japan.
10 Authors’ contributions
N.N., Y.K. and K.S. developed the model. N.N. analyzed the data and wrote the manuscript. M.D. and M.I. participated in discussing the results.
Acknowledgements.
We thank all members of JSOC of Stanford University for their support and allowing us to use the SDO NRT data. The data used here are courtesy of NASA/SDO, the HMI & AIA science teams, JSOC of Stanford University, and the GOES team.
References
- Ardila et al. (2019) Ardila D, Kiraly A, Bharadwaj S, Choi B, Reicher JJ, Peng L, Tse D, Etemadi M, Ye W, Corrado G, Naidich DP, Shetty S (2019) End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography. Nature Medicine, 25: 954-961. https://doi.org/10.1038/s41591-019-0447-x
- Aulanier et al. (2010) Aulanier G, Török T, Démoulin P, DeLuca EE (2010) Formation of torus-unstable flux ropes and electric currents in erupting sigmoids. Astrophys. J, 708: 314-333. https://doi.org/10.1088/0004-637X/708/1/314
- Barnes et al. (2009) Barnes LR, Schults DM, Gruntfest EC, Hayden MH, Benight CC (2009) Corrigendum: False alarm rate or false alarm ratio? Weather and Forecasting, 24: 1452-1454 https://doi.org/10.1175/2009WAF2222300.1
- Barnes et al. (2016) Barnes G, Leka KD, Schrijver CJ, Colak T, et al. (2016) A Comparison of Flare Forecasting Methods. I. Results from the “All-Clear” Workshop. Astrophys. J., 829: 89 (32pp) https://doi.org/10.3847/0004-637X/829/2/89
- Bhattacharjee et al. (2020) Bhattacharjee S, Alshehhi R, Dhuri DB, Hanasoge SM (2020) Supervised convolutional neural networks for classification of flaring and nonflaring active regions using line-of-sight magnetograms. Astrophysical J., 898: 98 https://doi.org/10.3847/1538-4357/ab9c29
- Bishop (2006) Bishop CM (2006) Pattern recognition and machine learning. Information Science and Statistics, ed. M. Jordan, J. Kleinberg, & M. Schölkopf (Springer-Verlag New York), 738
- Bloomfield et al. (2012) Bloomfield DS, Higgins PA, McAteer RTJ, Gallagher PT (2012) Toward Reliable Benchmarking of Solar Flare Forecasting Methods. Astrophys. J. Lett., 747: L41 (7pp) https://doi.org/10.1088/2041-8205/747/2/L41
- Bobra et al. (2014) Bobra MG, Sun X, Hoeksema JT, Turmon M, Liu Y, Hayashi K, Barnes G, Leka KD (2014) The Helioseismic and Magnetic Imager (HMI) vector magnetic field pipeline: SHARPs - Space-Weather HMI Active Region Patches. Solar Physics, 289 (9): 3549-3578 https://doi.org/10.1007/s11207-014-0529-3
- Bobra & Couvidat (2015) Bobra MG, Couvidat S (2015) Solar flare prediction using SDO/HMI vector magnetic field data with a machine-learning algorithm. Astrophys. J., 798: 135 (11pp) https://doi.org/10.1088/0004-637X/798/2/135
- Chen et al. (2019) Chen Y, Manchester WB, Hero AO, Toth G, DuFumier B, Zhou T, et al. (2019) Identifying solar flare precursors using time series of SDO/HMI Images and SHARP Parameters. Space Weather, 17: 1404-1426 https://doi.org/10.1029/2019SW002214
- Cheung & Isobe (2014) Cheung MCM, Isobe H (2014) Flux emergence (Theory). Living Reviews in Solar Physics, 11: 3 (128pp) https://doi.org/10.12942/lrsp-2014-3
- Cinto et al. (2020) Cinto T, Gradvohl A, Coelho GP, da Silva AEA (2020) A framework for designing and evaluating solar flare forecasting systems. Monthly Notices of the Royal Astronomical Society, 495: 3332-3349 https://doi.org/10.1093/mnras/staa1257
- Colak & Qahwaji (2008) Colak T, Qahwaji R (2008) Automated McIntosh-based classification of sunspot groups using MDI images. Solar Physics, 248: 277-296 https://doi.org/10.1007/s11207-007-9094-3
- Colak & Qahwaji (2009) Colak T, Qahwaji R (2009) Automated Solar Activity Prediction: A hybrid computer platform using machine learning and solar imaging for automated prediction of solar flares. Space Weather, 7 (6): S06001 https://doi.org/10.1029/2008SW000401
- Crown (2012) Crown MD (2012) Validation of the NOAA Space Weather Prediction Center’s solar flare forecasting look-up table and forecaster-issued probabilities. Space Weather, 10: S06006 (4pp) https://doi.org/10.1029/2011SW000760
- Deshmukh et al. (2020) Deshmukh V, Berger T, Bradley E, Meiss JD (2020) Leveraging the mathematics of shape for solar magnetic eruption prediction. J. Space Weather Space Climate, 10: 13 (16pp) https://doi.org/10.1051/swsc/2020014
- Devos et al. (2014) Devos A, Verbeeck C, Robbrecht E (2014) Verification of space weather forecasting at the Regional Warning Center in Belgium. J. Space Weather Space Clim., 4: A29 (15pp) https://doi.org/10.1051/swsc/2014025
- Domijan et al. (2019) Domijan K, Bloomfield DS, Pitié F (2019) Solar flare forecasting from magnetic feature properties generated by the Solar Monitor Active Region Tracker. Solar Physics, 294: 6 (19pp) https://doi.org/10.1007/s11207-018-1392-4
- Falconer et al. (2011) Falconer D, Barghouty AF, Khazanov I, Moore R (2011) A Tool for Empirical Forecasting of Major Flares, Coronal Mass Ejections, and Solar Particle Events from a Proxy of Active-Region Free Magnetic Energy. Space Weather, 9: S04003 (12pp) https://doi.org/10.1029/2009SW000537
- Falconer et al. (2012) Falconer DA, Moore RL, Barghouty AF, Khazanov I (2012) Prior Flaring as a Complement to Free Magnetic Energy for Forecasting Solar Eruptions. Astrophys. J., 757: 32 (6pp) https://doi.org/10.1088/0004-637X/757/1/32
- Fletcher et al. (2011) Fletcher L, et al. (2011) An observational overview of solar flares. Space Science Reviews, 159: 19-106 https://doi.org/10.1007/s11214-010-9701-8
- Gallagher et al. (2002) Gallagher PT, Moon Y-J, Wang H (2002) Active-region monitoring and flare forecasting I. Data processing and first results. Solar Physics, 209: 171-183 https://doi.org/10.1023/A:1020950221179
- Hanssen & Kuipers (1965) Hanssen AW, Kuipers WJA (1965) On the Relationship Between the Frequency of Rain and Various Meteorological Parameters: (with Reference to the Problem of Objective Forecasting) Mededelingen en verhandelingen, 81, Royal Netherlands Meteorological Institute (65pp)
- Hoeksema et al. (2014) Hoeksema JT, Liu Y, Hayashi K, Sun X, Schou J, Couvidat S, Norton A, Bobra M, Centeno R, Leka KD, Barnes G, Turmon M (2014) The Helioseismic and Magnetic Imager (HMI) vector magnetic field pipeline: overview and performance. Solar Physics, 289: 3483-3530 https://doi.org/10.1007/s11207-014-0516-8
- Huang et al. (2018) Huang X, Wang H, Xu L, Liu J, Li R, Dai X (2018) Deep learning based solar flare forecasting model. I. Results for line-of-sight magnetograms. Astrophysical J. 856: 7 (11pp) https://doi.org/10.3847/1538-4357/aaae00
- Inoue et al. (2018) Inoue S, Kusano K, Büchner J, Skála J (2018) Formation and dynamics of a solar eruptive flux tube. Nature Communications 9: 174 (11pp) https://doi.org/10.1038/s41467-017-02616-8
- Jiao et al. (2020) Jiao Z, Sun H, Wang X, Manchester W, Gombosi T, Hero A, Chen Y (2020) Solar flare intensity prediction with machine learning models. Space Weather, 18: e2020SW002440 https://doi.org/10.1029/2020SW002440
- Jolliffe & Stephenson (2012) Jolliffe IT, & Stephenson DB (2012) Forecast verification: a practitioner’s guide in atmospheric science. (2nd ed.; Hoboken, NJ: John Wiley & Sons, Ltd) https://doi.org/10.1002/9781119960003
- Kim et al. (2019) Kim T, Park E, Lee H, Moon Y-, Bae S-, Lim D, Jang S, Kim L, Cho I-H, Choi M, Cho K-S (2019) Solar farside magnetograms from deep learning analysis of STEREO/EUVI data. Nature Astronomy, 3: 397-400 https://doi.org/10.1038/s41550-019-0711-5
- Kingma & Ba (2014) Kingma DP, Ba J (2014) Adam: A method for stochastic optimization. International Conference on learning representations (ICLR) 2015. arXiv preprint arXiv:1412.6980.
- Korsós et al. (2020) Korsós MB, Georgoulis MK, Gyenge N, Bisoi SK, Yu S, Poedts S, Nelson CJ, Liu J, Yan Y, Erdélyi R (2020) Solar flare prediction using magnetic field diagnostics above the photosphere. Astrophysical J., 896: 119 https://doi.org/10.3847/1538-4357/ab8fa2
- Krista & Gallagher (2009) Krista LD, Gallagher PT (2009) Automated coronal hole detection using local intensity thresholding techniques. Solar Physics, 256: 87-100 https://doi.org/10.1007/s11207-009-9357-2
- Kubo et al. (2017) Kubo Y, Den M, Ishii M (2017) Verification of operational solar flare forecast: case of Regional Warning Center Japan. J. Space Weather Space Clim., 7: A20 (16pp) https://doi.org/10.1051/swsc/2017018
- Kusano et al. (2020) Kusano K, Iju T, Bamba Y, Inoue S (2020) A physics-based method that can predict imminent large solar flares. Science, 369 (6503): 587-591 https://doi.org/10.1126/science.aaz2511
- Kusano et al. (2012) Kusano K, Bamba Y, Yamamoto TT, Iida Y, Toriumi S, Asai A (2012) Magnetic field structures triggering solar flares and coronal mass ejections. Astrophys. J., 760: 31 (9pp) https://doi.org/10.1088/0004-637X/760/1/31
- Lee et al. (2012) Lee K, Moon Y-J, Lee J-Y, Lee K-S, Na H (2012) Solar flare occurrence rate and probability in terms of the sunspot classification supplemented with sunspot area and its changes. Solar Physics, 281: 639-650 https://doi.org/10.1007/s11207-012-0091-9
- Leka et al. (2018) Leka KD, Barnes G, Wagner E (2018) The NWRA classification infrastructure: description and extension to the Discriminant Analysis Flare Forecasting System (DAFFS). Space Weather Space Clim., 8: A25 (23pp) https://doi.org/10.1051/swsc/2018004
- Leka et al. (2019) Leka KD, et al. (2019) A comparison of flare forecasting methods. II. benchmarks, metrics, and performance results for operational solar flare forecasting systems. Astrophys. J. S., 243: 36 (15pp) https://doi.org/10.3847/1538-4365/ab2e12
- Lemen et al. (2012) Lemen J. et al. (2012) The Atmospheric Imaging Assembly (AIA) on the Solar Dynamics Observatory (SDO). Solar Physics, 275: 17-40 https://doi.org/10.1007/s11207-011-9776-8
- Li et al. (2020) Li X, Zheng Y, Wang X, Wang L (2020) Predicting solar flares using a novel deep convolutional neural network. Astrophys J., 891: 10 (11pp) https://doi.org/10.3847/1538-4357/ab6d04
- Liu et al. (2014) Liu YD, Luhmann JG, Kajdič P, Kilpua EKJ, Lugaz N, Nitta NV, Möstl C, Lavraud B, Bale SD, Farrugia CJ, Galvin AB (2014) Observations of an extreme storm in interplanetary space caused by successive coronal mass ejections. Nature Communications, 5: 3481 https://doi.org/10.1038/ncomms4481
- Liu et al. (2019) Liu H, Liu C, Wang JTL, Wang H (2019) Predicting solar flares using a long short-term memory network. Astrophys. J., 877: 121 (14pp) https://doi.org/10.3847/1538-4357/ab1b3c
- McCloskey et al. (2016) McCloskey AE, Gallagher PT, Bloomfield DS (2016) Flaring rates and the evolution of sunspot group McIntosh classifications. Solar Physics, 291: 1711-1738 https://doi.org/10.1007/s11207-016-0933-y
- Möstl et al. (2015) Möstl C, et al. (2015) Strong coronal channeling and interplanetary evolution of a solar storm up to Earth and Mars. Nature Communications, 6: 7135 https://doi.org/10.1038/ncomms8135
- Muranushi et al. (2015) Muranushi T, Shibayama T, Muranushi YH, Isobe H, Nemoto S, Komazaki K, Shibata K (2015) UFCORIN: A fully automated predictor of solar flares in GOES X-ray flux. Space Weather, 13 (11): 778-796 https://doi.org/10.1002/2015SW001257
- Murphy (1993) Murphy AH (1993) What is a good forecast? An essay on the nature of goodness in weather forecasting. Weather and Forecasting, 8: 281-293
- Murray et al. (2017) Murray SA, Bingham S, Sharpe M, Jackson DR (2017) Flare forecasting at the Met Office Space Weather Operations Centre. Space Weather, 15 (4): 577-588 https://doi.org/10.1002/2016SW001579
- Nishizuka et al. (2017) Nishizuka N, Sugiura K, Kubo Y, Den M, Watari S, Ishii M (2017) Solar flare prediction model with three machine-learning algorithms using ultraviolet brightening and vector magnetograms. Astrophys. J., 835 (2): 156 (10pp) https://doi.org/10.3847/1538-4357/835/2/156
- Nishizuka et al. (2018) Nishizuka N, Sugiura K, Kubo Y, Den M, Ishii M (2018) Deep Flare Net (DeFN) model for solar flare prediction. Astrophys. J., 858 (2): 113 (8pp) https://doi.org/10.3847/1538-4357/aab9a7
- Nishizuka et al. (2020) Nishizuka N, Kubo Y, Sugiura K, Den M, Ishii M (2020) Reliable probability forecast of solar flares: Deep Flare Net-Reliable (DeFN-R). Astrophys. J., 899: 150 (8pp) https://doi.org/10.3847/1538-4357/aba2f2
- Pagano et al. (2019) Pagano P, Mackay DH, Yardley SL (2019) A new space weather tool for identifying eruptive active regions. Astrophysical J., 886: 81 (11pp) https://doi.org/10.3847/1538-4357/ab4cf1
- Panos & Kleint (2020) Panos B, Kleint L (2020) Real-time flare prediction based on distinctions between flaring and non-flaring active region spectra. Astrophys. J., 891: 17 (18pp) https://doi.org/10.3847/1538-4357/ab700b
- Park et al. (2018) Park E, Moon Y-J, Shin S, Yi K, Lim D, Lee H, Shin G (2018) Application of the deep convolutional neural network to the forecast of solar flare occurrence using full-disk solar magnetograms. Astrophysical J. 869: 91 (6pp) https://doi.org/10.3847/1538-4357/aaed40
- Park et al. (2020) Park S-H et al. (2020) A comparison of flare forecasting methods. IV. Evaluating consecutive-day forecasting patterns. Astrophys. J., 890 (2): 124 (33pp) https://doi.org/10.3847/1538-4357/ab65f0
- Pesnell et al. (2012) Pesnell WD, Thompson BJ, Chamberlin PC (2012) The Solar Dynamics Observatory (SDO). Solar Physics, 275: 3-15 https://doi.org/10.1007/s11207-011-9841-3
- Richardson (2000) Richardson DS (2000) Skill and relative economic value of the ECMWF ensemble prediction system. Quarterly Journal of the Royal Meteorological Society, 126: 649. https://doi.org/10.1002/qj.49712656313
- Scherrer et al. (2012) Scherrer PH, et al. (2012) The Helioseismic and Magnetic Imager (HMI) investigation for the Solar Dynamics Observatory (SDO). Solar Physics, 275: 207-227 https://doi.org/10.1007/s11207-011-9834-2
- Schou et al. (2012) Schou J, et al. (2012) Design and Ground Calibration of the Helioseismic and Magnetic Imager (HMI) Instrument on the Solar Dynamics Observatory (SDO). Solar Physics, 275: 229-259 https://doi.org/10.1007/s11207-011-9842-2
- Schwenn et al. (2005) Schwenn R, dal Lago A, Huttunen E, Gonzalez WD (2005) The association of coronal mass ejections with their effects near the Earth. Annales Geophysicae, 23: 1033-1059 https://doi.org/10.5194/angeo-23-1033-2005
- Shibata & Magara (2011) Shibata K, Magara T (2011) Solar flares: magnetohydrodynamic processes. Living Rev. Solar Phys. 8: 6 (99pp) https://doi.org/10.12942/lrsp-2011-6
- Shih & Kowalsky (2003) Shih FY, Kowalsky AJ (2003) Automatic extraction of filaments in Hα solar images. Solar Physics, 218: 99-122 https://doi.org/10.1023/B:SOLA.0000013052.34180.58
- Steward et al. (2011) Steward GA, Lobzin VV, Wilkinson PJ, Cairns IH, Robinson PA (2011) Automatic recognition of complex magnetic regions on the sun in GONG magnetogram images and prediction of flares: techniques for the flare warning program Flarecast. Space Weather, 9: S11004 (11pp) https://doi.org/10.1029/2011SW000703
- Tashman (2000) Tashman LJ (2000) Out-of-sample tests of forecasting accuracy: an analysis and review. International Journal of Forecasting, 16 (4): 437–450 https://doi.org/10.1016/S0169-2070(00)00065-0
- Toriumi & Wang (2019) Toriumi S, Wang H (2019) Flare-productive active regions. Invited review for Living Reviews in Solar Physics, 16: 3 (128pp) https://doi.org/10.1007/s41116-019-0019-7
- Varma & Simon (2006) Varma S, Simon R (2006) Bias in error estimation when using cross-validation for model selection. BMC Bioinformatics, 7: 91 (8pp) https://doi.org/10.1186/1471-2105-7-91
- Wang et al. (2017) Wang H, Liu C, Ahn K, Xu Y, Jing J, Deng N, Huang N, Liu R, Kusano K, Fleishman GD, Gary DE, Cao W (2017) High-resolution observations of flare precursors in the low solar atmosphere. Nature Astronomy, 1: 0085 https://doi.org/10.1038/s41550-017-0085
- Wang et al. (2020a) Wang J, Zhang Y, Hess W, Shea A, Liu S, Meng X, Wang T (2020a) Solar flare predictive features derived from polarity inversion line masks in active regions using an unsupervised machine learning algorithm. Astrophysical J., 892: 140 https://doi.org/10.3847/1538-4357/ab7b6c
- Wang et al. (2020b) Wang X, Chen Y, Toth G, Manchester WB, Gombosi TI, Hero AO, Jiao Z, Sun H, Jin M, Liu Y (2020b) Predicting solar flares with machine learning: Investigating solar cycle dependence. Astrophysical J. 895: 3 https://doi.org/10.3847/1538-4357/ab89ac
- Yi et al. (2020) Yi K, Moon Y-J, Shin G, Lim D (2020) Forecast of major solar x-ray flare flux profiles using novel deep learning models. Astrophys. J. Lett., 890: L5 (7pp) https://doi.org/10.3847/2041-8213/ab701b
- Zheng et al. (2019) Zheng Y, Li X, Wang X (2019) Solar flare prediction with the hybrid deep convolutional neural network. Astrophys. J., 885: 73 (14pp) https://doi.org/10.3847/1538-4357/ab46bd




11 Tables
Table 1. Contingency tables of the pretrained DeFN model evaluated in the development phase (≥M-class and ≥C-class flare predictions).

| ≥M-class Flares | Observed: Yes | Observed: No |
|---|---|---|
| Forecast: Yes | 963 | 4382 |
| Forecast: No | 54 | 25937 |

| ≥C-class Flares | Observed: Yes | Observed: No |
|---|---|---|
| Forecast: Yes | 4967 | 4420 |
| Forecast: No | 1171 | 20778 |
Table 2. Contingency tables of the operational DeFN forecasts from January 2019 to June 2020 with probability thresholds of 50 %, 45 %, and 40 %.

Probability threshold 50 %:

| ≥M-class Flares | Observed: Yes | Observed: No |
|---|---|---|
| Forecast: Yes | 1 | 28 |
| Forecast: No | 3 | 2201 |

| ≥C-class Flares | Observed: Yes | Observed: No |
|---|---|---|
| Forecast: Yes | 27 | 18 |
| Forecast: No | 11 | 2177 |

Probability threshold 45 %:

| ≥M-class Flares | Observed: Yes | Observed: No |
|---|---|---|
| Forecast: Yes | 1 | 31 |
| Forecast: No | 3 | 2198 |

| ≥C-class Flares | Observed: Yes | Observed: No |
|---|---|---|
| Forecast: Yes | 30 | 27 |
| Forecast: No | 8 | 2168 |

Probability threshold 40 %:

| ≥M-class Flares | Observed: Yes | Observed: No |
|---|---|---|
| Forecast: Yes | 2 | 34 |
| Forecast: No | 2 | 2195 |

| ≥C-class Flares | Observed: Yes | Observed: No |
|---|---|---|
| Forecast: Yes | 32 | 34 |
| Forecast: No | 6 | 2161 |
Table 3. Verification metrics of the operational DeFN forecasts (January 2019 – June 2020) for ≥M-class (upper) and ≥C-class (lower) flare predictions.

≥M-class flares:

| Probability threshold | Accuracy | TSS | FAR | HSS |
|---|---|---|---|---|
| 50 % | 0.99 | 0.24 | 0.97 | 0.06 |
| 45 % | 0.98 | 0.24 | 0.97 | 0.05 |
| 40 % | 0.98 | 0.48 | 0.94 | 0.10 |

≥C-class flares:

| Probability threshold | Accuracy | TSS | FAR | HSS |
|---|---|---|---|---|
| 50 % | 0.99 | 0.70 | 0.40 | 0.64 |
| 45 % | 0.98 | 0.78 | 0.47 | 0.62 |
| 40 % | 0.98 | 0.83 | 0.52 | 0.61 |
Table 4. TSS obtained by the time-series CV using the 2010–2017 datasets.

| Datasets | TSS (≥M-class flares) | TSS (≥C-class flares) |
|---|---|---|
| Training (2010–2011), Validation (2012), Test (2013) | 0.49 | 0.53 |
| Training (2010–2012), Validation (2013), Test (2014) | 0.66 | 0.60 |
| Training (2010–2013), Validation (2014), Test (2015) | 0.77 | 0.66 |
| Training (2010–2014), Validation (2015), Test (2016) | 0.87 | 0.56 |
| Training (2010–2015), Validation (2016), Test (2017) | 0.72 | 0.61 |
| Average | 0.70 | 0.59 |