On the use of adversarial validation for quantifying dissimilarity in geospatial machine learning prediction

Yanwen Wang (a,*), Mahdi Khodadadzadeh (a), and Raúl Zurita-Milla (a). CONTACT Yanwen Wang. Email: [email protected]. (a) Faculty of Geo-Information Science and Earth Observation (ITC), University of Twente, 7522 NH Enschede, the Netherlands
Abstract

Recent geospatial machine learning studies have shown that the results of model evaluation via cross-validation (CV) are strongly affected by the dissimilarity between the sample data and the prediction locations. In this paper, we propose a method to quantify this dissimilarity on a scale from 0 to 100%, from the perspective of the data feature space. The proposed method is based on adversarial validation, an approach that checks whether sample data and prediction locations can be separated with a binary classifier. To study the effectiveness and generality of our method, we tested it in a series of experiments based on both synthetic and real datasets and with gradually increasing dissimilarities. Results show that the proposed method can successfully quantify dissimilarity across the entire range of values. Next, we studied how dissimilarity affects CV evaluations by comparing the results of random CV and of two spatial CV methods, namely block and spatial+ CV. Our results show that CV evaluations follow similar patterns in all datasets and predictions: when dissimilarity is low (usually lower than 30%), random CV provides the most accurate evaluation results. As dissimilarity increases, spatial CV methods, especially spatial+ CV, become increasingly accurate and eventually outperform random CV. When dissimilarity is high (≥90%), no CV method provides accurate evaluations. These results show the importance of considering feature space dissimilarity when working with geospatial machine learning predictions, and can help researchers and practitioners select more suitable CV methods for evaluating their predictions.

keywords:
Machine learning; geospatial regression; model evaluation; cross-validation; feature space.

1 Introduction

Machine learning (ML) is widely used in geospatial prediction to estimate unknown values at specific prediction locations (Hengl et al. 2018; Aguilar et al. 2018; Usman et al. 2023). These predictions are often done to create spatially continuous products, for example, mineral (Khodadadzadeh and Gloaguen 2019), health risk (Garcia-Marti et al. 2018), or phenological (Zurita-Milla, Laurent, and van Gijsel 2015) maps. In these and many other applications, predictions come from ML regression models trained on limited sample data, where the number of samples is typically much smaller than the number of prediction locations. This imbalance between samples and prediction locations is mostly due to practical limitations such as accessibility (Lamichhane, Kumar, and Wilson 2019) or sampling costs (Hengl et al. 2015). For similar reasons, collecting additional data for an independent evaluation of geospatial ML predictions is rarely feasible (Valavi et al. 2019). To address these operational constraints, the evaluation of geospatial ML models is mainly conducted by splitting the available sample data into training and validation subsets (de Bruin et al. 2022; Wang, Khodadadzadeh, and Zurita-Milla 2023). Random k-fold cross-validation (RDM-CV) stands out as the most popular evaluation method (Chen et al. 2018; Nesha et al. 2020; Guo et al. 2022). As the name indicates, RDM-CV randomly splits the sample data into k equal-size folds and then iteratively uses one of them as a validation subset and the remaining ones as a training subset. When sample data are collected by a probability sampling strategy, such as simple random sampling (Brus, Kempen, and Heuvelink 2011; Wang et al. 2012) or regular sampling (Lagacherie et al. 2020), RDM-CV can provide sufficiently accurate evaluation results (Wadoux et al. 2021; Milà et al. 2022). This is because, under these circumstances, the training and validation subsets are representative of the relationship between sample data and prediction locations. Specifically, probability sampling ensures that the sample data and prediction locations are similar in terms of data distribution, whilst the random split of RDM-CV also guarantees that the training and validation subsets are similar.

In practical situations, most geospatial ML predictions do not rely on probability sampling, potentially leading to significant differences between the sample data and the prediction locations. A representative case is large-scale prediction (Mussumeci and Codeço Coelho 2020; Chen et al. 2022; Ludwig et al. 2023), where sample data are often concentrated in a few developed and accessible regions (Meyer and Pebesma 2022); for instance, global soil maps are produced with sample data clustered in Europe and North America (Guerra et al. 2020). Another case is making predictions in a completely new area. For example, predictions of the affected area after an earthquake are so urgent that collecting samples is almost infeasible (Li et al. 2021a); other examples are predictions of landslides (Goetz et al. 2015; Li et al. 2021b; Zhao et al. 2017) or of invasive species diffusion (Cheng et al. 2018), where collecting samples in the study area is also impossible because the target phenomena have not occurred yet. In all the above cases, geospatial ML acts as an extrapolation model for predicting values that extend beyond the known data (i.e., training data).

In the scenarios discussed above, where the sample data and prediction locations are different, RDM-CV tends to be over-optimistic and is not suitable for evaluation (Brenning 2005; Wiens et al. 2008; Pohjankukka et al. 2017; Stock and Subramaniam 2022). Consequently, a series of spatial CV methods have been proposed with the core idea of avoiding excessive similarity between the training and validation subsets. Block CV (BLK-CV) and spatial+ CV (i.e., spatial-plus CV, SP-CV) are two representative methods in this regard. BLK-CV has a long history (Brenning 2012; Roberts et al. 2017; Valavi et al. 2019) and is widely used for evaluation (Wadoux et al. 2021; Wang, Khodadadzadeh, and Zurita-Milla 2023; Bueno, Macera, and Montoya 2023). As its name implies, BLK-CV divides the sample data into contiguous blocks and then randomly assigns blocks (instead of individual samples) to the k folds. SP-CV is a recently proposed spatial CV method (Wang, Khodadadzadeh, and Zurita-Milla 2023) that considers the feature space. In SP-CV, agglomerative hierarchical clustering (AHC) is first used to divide samples into improved blocks. Then, all blocks are split into folds by cluster ensembles based on their locations, covariates, and the target variable. As shown in Wang, Khodadadzadeh, and Zurita-Milla (2023), SP-CV provides outstanding evaluation results when sample data and prediction locations are substantially different.

From the above descriptions of RDM-CV and spatial CV, it can be seen that the dissimilarity (or similarity) between sample data and prediction locations is a decisive factor for the evaluation accuracy of CV methods. This has been confirmed by recent studies (Wadoux et al. 2021; Milà et al. 2022; de Bruin et al. 2022). Note that the transition from largely similar to substantially different is gradual. For example, varying degrees of sample clustering in the prediction area result in different degrees of dissimilarity (Milà et al. 2022). Therefore, here we treat dissimilarity as a continuous attribute that describes the relationship between sample data and prediction locations. Although a few studies have recognized this and considered dissimilarity when proposing new CV methods (e.g., Milà et al. (2022) and Linnenbrink et al. (2023)), they did not explicitly express or quantify the dissimilarity between the sample data and the prediction locations.

This paper is one of the pioneering attempts to propose a method for quantifying the dissimilarity needed to properly evaluate and interpret geospatial ML predictions. The introduced method is founded on Adversarial Validation (AV), a technique developed in the ML community (FastML 2016). AV primarily serves to construct an optimal validation subset for evaluating ML models when the feature distributions of samples and prediction data differ (Qian et al. 2022; Zhang et al. 2023; Montesinos-López, Montesinos-López, and Kismiantini 2023). AV has the potential to quantify dissimilarity based on the feature space by treating the separation of samples and prediction data as a binary classification task (Qian et al. 2022). In this paper, we, for the first time in the geoscience community, leverage this potential of AV for evaluating geospatial predictions by quantifying the dissimilarity between sample data and prediction locations. In addition, the proposed dissimilarity quantification method is based on the feature space, aligning with the data-driven nature of geospatial ML predictions.

The second contribution of this paper is an experimental comparison of the evaluation performance of CV methods on the basis of the proposed method. In the experiments, we set up numerous prediction tasks with gradually changing dissimilarity scenarios on both synthetic and real datasets. This allows us to investigate in detail the relationship between dissimilarity and the evaluations of random CV and spatial CV methods. Hence, the experimental results of this research are an important supplement to previous studies, which considered only a few dissimilarity scenarios.

The remainder of this paper is organized as follows. In section 2, we specify the proposed dissimilarity quantification method based on AV. In section 3, we describe and discuss our experiments and results. Finally, in section 4 we present the main conclusions of this study and provide recommendations for future research.

2 Methods

The ML classifier is the core part of AV and is what enables AV to measure the dissimilarity between samples and prediction data (Zhang et al. 2023). The AV classifier detects whether the sample and prediction data can be separated into two classes based on their covariates. If the classification accuracy is high, sample and prediction data can be easily distinguished in the feature space (i.e., they are different). Because the classification accuracy is a numerical value, the measured dissimilarity can also be expressed as a numerical value. Besides, because the AV classifier is built on an ML algorithm, AV can capture complex and nonlinear relationships between covariates better than other methods (such as directly calculating the Euclidean distance in the feature space). Hence, quantifying dissimilarity based on AV matches well with geospatial predictions based on ML models.

As figure 1 shows, the proposed method is composed of three stages.

Figure 1: The workflow of the proposed method for quantifying dissimilarity.

2.1 The 1st stage: Construct AV data.

The first stage of the proposed method focuses on constructing appropriate AV training and test data, acknowledging that the sample data are typically much fewer than the prediction locations. To avoid class imbalance, it is advisable to select the same number of prediction locations as the number of samples. When selecting a subset of the prediction locations, we employ a random selection approach to guarantee an unbiased representation of the original set (Brus, Kempen, and Heuvelink 2011; Wadoux et al. 2021). This is shown in the left part of figure 1 where we see eight samples (black points in figure 1) over the prediction area. Hence, we randomly select eight locations (hollow points in figure 1) distributed across the prediction area.

After this, the sample data and the subset of prediction locations are labeled as two classes. The sample data (black points in figure 1) are labeled as 1 and the selected prediction locations (hollow points in figure 1) are labeled as 0, following Qian et al. (2022). All samples and selected prediction locations, with their covariates as inputs and the given labels as outputs, compose the entire set of AV data. Next, the AV data are split into AV training and test data. To guarantee that the AV training and test data simultaneously represent the sample data and prediction locations well, we randomly split the AV data into two equal-size halves: one half becomes the AV training data and the other half the AV test data. As figure 1 shows, both AV training and test data have four black points and four hollow points.
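As an illustration of this stage, the sketch below constructs the AV data. It assumes NumPy arrays X_samples (covariates of the sample data) and X_pred (covariates at all prediction locations); these names, and the use of scikit-learn for the stratified split, are our own illustrative choices and not part of the released code.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def construct_av_data(X_samples, X_pred, random_state=0):
    """Stage 1: build balanced, labeled AV training and test sets."""
    rng = np.random.default_rng(random_state)
    n = X_samples.shape[0]
    # Randomly select as many prediction locations as there are samples
    # to avoid class imbalance in the adversarial classifier.
    idx = rng.choice(X_pred.shape[0], size=n, replace=False)
    X_av = np.vstack([X_samples, X_pred[idx]])
    y_av = np.concatenate([np.ones(n), np.zeros(n)])  # samples = 1, locations = 0
    # Random 50/50 split, stratified so both halves contain both classes equally.
    X_train, X_test, y_train, y_test = train_test_split(
        X_av, y_av, test_size=0.5, stratify=y_av, random_state=random_state)
    return X_train, y_train, X_test, y_test
```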

2.2 The 2nd stage: Build and apply AV classifier.

In this stage, an ML algorithm is selected as the AV classifier. Random forest (RF) is an ensemble ML algorithm (Breiman 2001) that has been successfully applied to both classification and regression problems. RF is popular in the geosciences (Roberts et al. 2017; Hengl et al. 2018) because of its robustness (Chen et al. 2018), stability (Garcia-Marti et al. 2018), and user-friendliness (Hengl et al. 2018). In particular, it is versatile and commonly used in various geospatial predictions (de Bruin et al. 2022; Milà et al. 2022; Wang, Khodadadzadeh, and Zurita-Milla 2023). Because the quantification method needs to work well across different types of predictions, RF is used in our method. Moreover, RF can produce probabilistic classification results by averaging the classifications of all decision trees (Belgiu and Drăguţ 2016). Following Wang, Khodadadzadeh, and Zurita-Milla (2023), we fix the number of decision trees to 500 and the maximum number of features in an individual tree to the square root of the number of covariates.

Now that we have fixed the AV classifier, the 2nd stage is simple. First, we use the AV training data to train the classifier, and then we apply it to the AV test data to obtain, for each test point, the probability of belonging to class 0 or 1. These probabilities are shown in various shades of grey in figure 1.
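A minimal sketch of this stage is shown below, reusing the arrays from the stage-1 sketch. The random forest settings (500 trees, square root of the number of covariates per split) follow the text; the scikit-learn implementation is an assumption.

```python
from sklearn.ensemble import RandomForestClassifier

def av_probabilities(X_train, y_train, X_test):
    """Stage 2: train the AV classifier and return class-1 probabilities."""
    clf = RandomForestClassifier(
        n_estimators=500,     # number of decision trees, fixed as in the text
        max_features="sqrt",  # square root of the number of covariates per split
        random_state=0,
    )
    clf.fit(X_train, y_train)
    # Probability that each AV test point belongs to class 1 (sample data).
    return clf.predict_proba(X_test)[:, 1]
```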

2.3 The 3rd stage: Calculate dissimilarity.

The final stage of the proposed method is to calculate the dissimilarity metric. Based on the probabilistic classification results for the AV test data, we quantify the accuracy of the AV classifier. For this we use the Area Under Curve (AUC) score, i.e., the area under the Receiver Operating Characteristic (ROC) curve. The AUC is a widely used metric for quantitatively describing the accuracy of binary classifiers (Wu et al. 2019), including in geospatial ML prediction (Hitouri et al. 2022; Chen et al. 2024). The larger the AUC, the higher the classifier accuracy and, thus, the larger the dissimilarity. The value range of the AUC is usually [0.5, 1], although sometimes the AUC can be slightly lower than 0.5. An AUC value of 0.5 means that the classifier is essentially guessing at random whether a point belongs to class 0 or 1, indicating that the sample data and prediction locations have the same data distribution.

Because using 0.5 as the minimum value of dissimilarity might be confusing, we normalize the AUC score and create a new metric, which we directly name dissimilarity (D). The normalization function is shown below. As with the AUC, the larger D is, the greater the dissimilarity.

D=\begin{cases}\dfrac{\mathrm{AUC\,score}-0.5}{1-0.5}\times 100\%, & \mathrm{AUC\,score}>0.5\\ 0\%, & \mathrm{AUC\,score}\leq 0.5\end{cases}
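This stage is summarized by the sketch below, which takes the test labels and class-1 probabilities from the stage-2 sketch and applies the normalization above (the use of scikit-learn's roc_auc_score is an assumption).

```python
from sklearn.metrics import roc_auc_score

def dissimilarity(y_test, proba):
    """Stage 3: AUC of the AV classifier, normalized to D in [0%, 100%]."""
    auc = roc_auc_score(y_test, proba)
    # AUC <= 0.5: samples and prediction locations are indistinguishable, so D = 0%.
    return max(0.0, (auc - 0.5) / 0.5) * 100.0
```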

3 Experiments

To study the effectiveness and versatility of our method, we designed a series of experiments using synthetic and real datasets. The following subsections describe the datasets, our experimental set up, and results.

3.1 Datasets

3.1.1 Synthetic dataset

We use a synthetic dataset developed by Roberts et al. (2017) as an ecological prediction case (figure 2(a)). This dataset contains seven covariates and a target variable (i.e., species abundance). All variables are generated over a 1000×1000 raster layer. Most covariates are created with Gaussian Random Fields (GRF) (Schlather et al. 2015) to simulate the spatial autocorrelation structures of actual variables (Le Rest et al. 2014; Sarafian et al. 2021). There is also a regional covariate generated by a Markov Random Field (MRF) to simulate the regional patterns of real geoscience covariates. Detailed information on generating the synthetic dataset is provided in Appendix 1.

3.1.2 Real dataset

A real dataset of above-ground biomass (AGB) from the Brazilian Amazon basin is adopted from Wadoux et al. (2021). This dataset has 28 covariates and the AGB target variable (figure 2(b)). All covariates and the target variable are available as raster layers with a resolution of 1×1 km. Detailed information on this dataset is also included in Appendix 1.

Figure 2: Datasets in the experiment. (a). Synthetic species abundance dataset. (b). Real Amazon AGB dataset.

3.2 Experiments and results

The main steps of our experiments are shown in figure 3. Step 1 deals with the construction of prediction tasks with gradually increasing dissimilarities. In step 2 we calculate all the dissimilarities and the corresponding CV evaluation performances. In step 3, the results from step 2 are put together as a scatter plot that reveals the relationship between dissimilarity and CV evaluations.

Figure 3: The workflow of the experiment for studying the relationship between dissimilarity and CV method evaluation performance.

3.2.1 Step 1: Construct predictions with gradually changing dissimilarities.

In the experiments, we adopted a commonly used approach to construct gradually changing dissimilarities by altering the spatial coverage of the samples (Wadoux et al. 2021; Milà et al. 2022; de Bruin et al. 2022; Linnenbrink et al. 2023). First, the number of samples in all predictions is kept constant; following other studies (Amato et al. 2020; Sarailidis, Wagener, and Pianosi 2023), it is set to 1000. Then, as in Wadoux et al. (2021), the study area is divided into 100 subregions by K-Means clustering based on the raster grid coordinates. Third, a number of subregions are randomly selected. Finally, sample data are equally and randomly selected only from the selected subregions. For each dataset, the number of selected subregions is increased from 1 to 100, ensuring coverage of a wide and comprehensive range of dissimilarities. To reduce random errors, the sampling for each specific number of selected subregions is repeated 10 times. Therefore, the total number of constructed predictions per dataset (the N in figure 3) is 100×10 = 1000. Figure 4 shows some examples of constructed predictions for the synthetic and real datasets.
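The sketch below illustrates how one such prediction task could be constructed. It assumes an array coords with the coordinates of all raster grid cells; the function name is hypothetical, and for brevity the samples are drawn by simple random selection from the union of the selected subregions rather than strictly equally across them.

```python
import numpy as np
from sklearn.cluster import KMeans

def sample_from_subregions(coords, n_selected, n_subregions=100,
                           n_samples=1000, random_state=0):
    """Step 1: draw sample locations restricted to randomly chosen subregions."""
    rng = np.random.default_rng(random_state)
    # Divide the study area into 100 subregions by K-Means on grid coordinates.
    labels = KMeans(n_clusters=n_subregions, n_init=10,
                    random_state=random_state).fit_predict(coords)
    # Randomly select a given number of subregions ...
    chosen = rng.choice(n_subregions, size=n_selected, replace=False)
    candidates = np.flatnonzero(np.isin(labels, chosen))
    # ... and draw the sample data only from those subregions.
    return rng.choice(candidates, size=n_samples, replace=False)

# Example: a prediction task whose samples cover only 5 of the 100 subregions.
# sample_idx = sample_from_subregions(coords, n_selected=5)
```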

Figure 4: Examples of constructed predictions in the experiment with different numbers of selected subregions. Synthetic dataset: (a) 5 subregions. (b) 25 subregions. (c) 90 subregions. Real dataset: (d) 40 subregions. (e) 70 subregions.

3.2.2 Step 2: Calculate dissimilarities and CV methods evaluation performances.

After constructing the predictions, it is still necessary to check whether the dissimilarities can be effectively quantified by the proposed method. Figure 5 shows the resulting dissimilarities. Figures 5(a0) and 5(b0) are scatter plots of the number of selected subregions vs the dissimilarity value, and figures 5(a1) and 5(b1) are histograms of dissimilarity. Figure 5 demonstrates that the proposed method successfully quantifies the gradually changing dissimilarities, because the dissimilarity values are distributed across the entire interval of possible values. Moreover, figures 5(a0) and 5(b0) also suggest that the dissimilarity in the feature space cannot be accurately represented by information in the geographic space. First, there are multiple dissimilarity values for the same spatial coverage (i.e., the same value on the x-axis), so a unique dissimilarity cannot be derived from a specific spatial coverage. Second, the scatter plots show points that follow natural-logarithm curves scaled by a negative constant (R² > 0.95 in both datasets) rather than straight lines; that is, changes in the geographic space do not accurately represent changes of dissimilarity in the feature space. Therefore, these results also support our argument that quantifying dissimilarity should be based on the feature space rather than on the geographic space.

Step 2 also deals with the calculation of the evaluation performances of the CV methods. This requires calculating the actual prediction error and the error estimated via CV (Wadoux et al. 2021; Milà et al. 2022; Wang, Khodadadzadeh, and Zurita-Milla 2023).

To calculate the actual prediction error, we build ML regression models that predict the target variable. Considering its advantages, RF was also chosen for this task. Once the model is trained on the sample data, values at all prediction locations can be predicted. Finally, these predicted values are compared with the actual values to obtain the actual prediction error. For this we use the root-mean-square error (RMSE), a widely used statistical metric for describing prediction error (Oliveira, Torgo, and Costa 2021), also in spatial CV studies (Roberts et al. 2017; Ploton et al. 2020). The RMSE is used not only for the actual prediction error, but also for the prediction errors estimated by RDM-CV and by the two spatial CV methods, BLK-CV and SP-CV.
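A sketch of the actual-prediction-error calculation is given below. It assumes arrays X_samples, y_samples (sample covariates and target) and X_pred, y_true (covariates and true target values at all prediction locations); the RF hyperparameters shown are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def actual_prediction_error(X_samples, y_samples, X_pred, y_true):
    """Train an RF regressor on the samples and compute the RMSE at all
    prediction locations (the actual prediction error)."""
    model = RandomForestRegressor(n_estimators=500, random_state=0)
    model.fit(X_samples, y_samples)
    predictions = model.predict(X_pred)
    return np.sqrt(np.mean((predictions - y_true) ** 2))
```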

The calculation of the estimated prediction error (i.e., the evaluation result) is the same for the three CV methods. First, we apply each CV method to the same sample data to produce the required split into k folds. Then, we use k-1 folds (i.e., the training subset) to train an RF regression model and apply this model to predict the remaining fold (i.e., the validation subset). This step is repeated k times to cover all the folds. Finally, every sample has a predicted value and an actual value, enabling the calculation of the RMSE of the CV estimated prediction error. In this study, k is set to 5 because it is a commonly used number of folds (Lyons et al. 2018). The CV "estimated prediction error" is precisely the evaluation of the prediction: in real-world predictions the actual prediction error is unattainable, and we can only estimate it by applying a given CV method to the available sample data. Finally, the CV evaluation performance (P in figure 3) is calculated by subtracting the CV estimated prediction error from the actual prediction error. The larger the absolute value of P, the worse the CV evaluation performance. Moreover, P values lower than 0 indicate that the corresponding CV method is pessimistic (i.e., the estimated prediction error is larger than the actual prediction error). Conversely, if P is greater than 0, the corresponding CV evaluation method is considered optimistic.
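The sketch below outlines the corresponding CV-estimated error and the performance measure P. It assumes an integer array folds assigning each sample to one of the k = 5 folds produced by RDM-CV, BLK-CV or SP-CV, plus the variables of the previous sketch; it is a simplified illustration, not the released implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def cv_estimated_error(X_samples, y_samples, folds, k=5):
    """Estimate the prediction error by k-fold CV on the sample data (RMSE)."""
    preds = np.empty_like(y_samples, dtype=float)
    for fold in range(k):
        val = folds == fold
        model = RandomForestRegressor(n_estimators=500, random_state=0)
        model.fit(X_samples[~val], y_samples[~val])   # train on k-1 folds
        preds[val] = model.predict(X_samples[val])    # predict the held-out fold
    return np.sqrt(np.mean((preds - y_samples) ** 2))

# Evaluation performance: P = actual error - CV-estimated error.
# P < 0 -> pessimistic evaluation; P > 0 -> optimistic evaluation.
# P = actual_prediction_error(...) - cv_estimated_error(...)
```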

3.2.3 Step 3: Plot the relationship of dissimilarity and CV methods evaluation performances.

In summary, following steps 1 and 2, we have 1000 predictions for each dataset. Each prediction corresponds to a unique dissimilarity value and is used to obtain the evaluation performances of three CV methods. By plotting the dissimilarity vs the evaluation performance, we analyze the relationship between dissimilarity and CV evaluation performance.

Our results are presented in figure 6. The y-axis represents the value of the CV method evaluation performance (P in figure 3), and the x-axis represents the value of dissimilarity (D in figure 3). Each scatter plot contains 3000 points, corresponding to the evaluation performances of the 1000 predictions for each of the three CV methods considered in this research. Points around the zero line (x-axis) correspond to accurate evaluations. Points below and above that line respectively represent pessimistic and optimistic evaluations.

Figure 6 confirms the results presented in recent studies (Wadoux et al. 2021; Milà et al. 2022; de Bruin et al. 2022): RDM-CV is over-optimistic when sample data and prediction locations are different, while spatial CV tends to be over-pessimistic when the samples cover almost the entire prediction area. In figure 6, the RDM-CV points are clearly above the zero line at large dissimilarity values, and it is also worth noting that the SP-CV points correspond to pessimistic evaluations at low dissimilarity values.

Furthermore, figure 6 provides new insights into the relationship between dissimilarity and CV evaluation performance. Unlike previous studies, which only analyzed a few dissimilarity levels, here we explore gradually changing dissimilarities. Firstly, we observe that over-optimistic RDM-CV and over-pessimistic spatial CV can occur simultaneously in intermediate dissimilarity scenarios. This finding further reinforces the argument put forth by Wadoux et al. (2021) that neither RDM-CV nor spatial CV is suitable for evaluating geospatial ML predictions, particularly in the presence of diverse dissimilarity scenarios. Secondly, the variations in CV evaluation performance are not uniform across all dissimilarities: as dissimilarity increases, the rate of change also increases. This finding is a significant addition to a comprehensive understanding of the relationship between dissimilarity and CV evaluation performance.

Because it is hard to read scatter plots with 3000 points, we binned all dissimilarities at a 1% resolution (i.e., we created 100 bins from the original experiments). After that, the CV evaluation performance of each bin is calculated by averaging the absolute values of all P values in the bin. The results of this operation are depicted in figure 6, where we see three rough intervals based on the dissimilarity values. The first one is [0%, 50%), shown in figures 6(a1)&(b1). In this interval, SP-CV is appreciably worse than the other CV methods and RDM-CV provides almost unbiased evaluations, especially in the first half of the interval. When the dissimilarity is larger than 30%, BLK-CV is slightly better than RDM-CV.
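For completeness, the binning used here can be sketched as follows, assuming 1-D arrays D (one dissimilarity per prediction task) and P (the matching evaluation performance of one CV method); both names are illustrative.

```python
import numpy as np

def bin_performance(D, P, bin_width=1.0):
    """Average |P| within each 1% dissimilarity bin."""
    bins = np.floor(np.asarray(D) / bin_width).astype(int)
    return {b: float(np.mean(np.abs(np.asarray(P)[bins == b])))
            for b in np.unique(bins)}
```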

The second interval is [50%, 90%). As figures 6(a2)&(b2) show, in this interval the performance of SP-CV gradually improves. However, when dissimilarity is between 50% and 80%, the evaluations of RDM-CV and of the spatial CV methods (BLK-CV and SP-CV) are all less than satisfactory, and it is not clear which method is more accurate. Once dissimilarity exceeds 80% (and remains below 90%), SP-CV becomes notably superior to the other CV methods. This suggests that the consideration of the feature space in SP-CV plays an important role, especially when there are substantial differences between sample data and prediction locations.

In the third and last interval (figures 6(a3)&(b3)), i.e., [90%, 100%], the dissimilarity between sample data and prediction locations is too large and none of the CV methods provides acceptable evaluation performance; they are all over-optimistic.

To gain a deeper understanding of how the evaluation performances of the CV methods change with dissimilarity, scatter plots of the actual prediction error and the CV estimated prediction errors are put together in figure 7. In this figure, it is clearly noticeable that the variations in the actual prediction error are much greater than the changes in any of the CV estimated prediction errors. Consequently, the differences in CV evaluation performance across dissimilarity scenarios are mainly due to the variations in the actual prediction error. RDM-CV and spatial CV are not capable of reflecting the changes in dissimilarity, which means that they cannot consistently provide accurate evaluations in diverse dissimilarity scenarios. Figure 7 also shows that the prediction errors of SP-CV are consistently higher than those of BLK-CV and RDM-CV, and that the errors of BLK-CV are slightly higher than those of RDM-CV. In other words, the spatial CV methods yield higher estimated prediction errors, reflecting that they are indeed better able to simulate the difference between sample data and prediction locations.

In addition, the pattern of SP-CV prediction errors in the dissimilarity range [60%, 90%) differs from that of RDM-CV and BLK-CV. In this range, SP-CV shows relatively stable behavior, while the evaluation results of RDM-CV and BLK-CV decrease rapidly. Another interesting pattern is observed in the dissimilarity range [90%, 100%], where the prediction error of SP-CV rapidly decreases. This is mainly because the spatial coverage of the samples in this context is too small, and the constructed sample data lack sufficient internal variation. As a result, SP-CV cannot completely reflect the dissimilarity in this range.

Although differences can be observed between the results of the two datasets (e.g., the exact dissimilarity thresholds at which SP-CV outperforms RDM-CV and BLK-CV differ), the relationship between dissimilarity and CV evaluations exhibits considerable commonalities. This is why the above discussion does not distinguish between the two datasets. These commonalities demonstrate the effectiveness and versatility of the proposed method for quantifying dissimilarity in different geospatial ML predictions. They also demonstrate that the impact of dissimilarity on CV method performance roughly follows similar patterns.

Figure 6: Final scatter plots with binned dissimilarities. (a0) Synthetic dataset. (b0) Real dataset.
Subplots 1: dissimilarity interval [0%, 50%). (a1) Synthetic dataset. (b1) Real dataset.
Subplots 2: dissimilarity interval [50%, 90%). (a2) Synthetic dataset. (b2) Real dataset.
Subplots 3: dissimilarity interval [90%, 100%] (x and y axes are interchanged for better presentation). (a3) Synthetic dataset. (b3) Real dataset.
Figure 7: Scatter plots of actual prediction error and CV estimated prediction error with binned dissimilarities. (a). Synthetic dataset. (b). Real dataset.

4 Conclusions & future research

With the advancement of geospatial ML predictions, researchers have recognized the importance of the dissimilarity between sample data and prediction locations and its crucial role in the evaluation of such predictions. However, there is a lack of methods to quantify this dissimilarity, which could also be used to help select a suitable CV evaluation method. Here we propose a method for quantifying dissimilarity based on adversarial validation and on the information contained in the feature space.

The method was tested using a series of prediction tasks with gradually changing dissimilarities and using both synthetic and real datasets. Results showed that the proposed method can successfully quantify dissimilarities. To further investigate how dissimilarity affects the performance of CV methods, we evaluated the traditional CV method, RDM-CV, and two spatial CV methods, BLK-CV and SP-CV. Our results indicate that the impact of dissimilarity is generally consistent in both datasets. When dissimilarity is low (e.g., lower than 30%), the evaluation performance of RDM-CV is excellent. As dissimilarity increases, the best CV method gradually transitions from RDM-CV to the spatial CV methods. However, when dissimilarity is extremely large, e.g., in [90%, 100%], all CV methods yield unreliable evaluations. Our results also show that the changes in CV method performance are mainly determined by the actual prediction error, which in turn increases exponentially with dissimilarity.

This study also reveals that the use of CV for evaluation is not straightforward. First, neither random CV nor spatial CV can provide satisfactory evaluations over a considerable intermediate range of dissimilarity. Second, neither random CV nor spatial CV can consistently provide accurate evaluations across diverse dissimilarity scenarios, which hampers studies that focus on the generalization ability of geospatial ML prediction models. Therefore, we suggest that future work concentrate on designing "self-adaptive" CV methods that provide accurate evaluations over a much wider dissimilarity range.

Data and code availability

All data (including Appendix 1, the file describing the datasets), code, and results of this research are available on the DANS (Dutch national centre of expertise and repository for research data) platform and can be accessed at https://doi.org/10.17026/PT/OPPCTP and https://doi.org/10.5281/zenodo.10460536.

Declaration of interest statement

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

  • Aguilar et al. (2018) Aguilar, Rosa, Raul Zurita-Milla, Emma Izquierdo-Verdiguier, and Rolf A. de By. 2018. “A Cloud-Based Multi-Temporal Ensemble Classifier to Map Smallholder Farming Systems.” Remote Sensing 10 (5): 729.
  • Amato et al. (2020) Amato, Federico, Fabian Guignard, Sylvain Robert, and Mikhail Kanevski. 2020. “A novel framework for spatio-temporal prediction of environmental data using deep learning.” Scientific Reports 10 (1): 1–11.
  • Belgiu and Drăguţ (2016) Belgiu, Mariana, and Lucian Drăguţ. 2016. “Random forest in remote sensing: A review of applications and future directions.” ISPRS Journal of Photogrammetry and Remote Sensing 114: 24–31.
  • Breiman (2001) Breiman, Leo. 2001. “Random forests.” Machine Learning 45 (1): 5–32.
  • Brenning (2005) Brenning, A. 2005. “Spatial prediction models for landslide hazards: review, comparison and evaluation.” Natural Hazards and Earth System Sciences 5 (6): 853–862.
  • Brenning (2012) Brenning, Alexander. 2012. “Spatial cross-validation and bootstrap for the assessment of prediction rules in remote sensing: The R package sperrorest.” In International Geoscience and Remote Sensing Symposium (IGARSS), 5372–5375.
  • Brus, Kempen, and Heuvelink (2011) Brus, D. J., B. Kempen, and G. B.M. Heuvelink. 2011. “Sampling for validation of digital soil maps.” European Journal of Soil Science 62 (3): 394–407.
  • Bueno, Macera, and Montoya (2023) Bueno, Marcelo, Briggitte Macera, and Nilton Montoya. 2023. “A Comparative Analysis of Machine Learning Techniques for National Glacier Mapping: Evaluating Performance through Spatial Cross-Validation in Perú.” Water 15 (24): 4214.
  • Chen et al. (2018) Chen, Gongbo, Yichao Wang, Shanshan Li, Wei Cao, Hongyan Ren, Luke D. Knibbs, Michael J. Abramson, and Yuming Guo. 2018. “Spatiotemporal patterns of PM10 concentrations over China during 2005–2016: A satellite-based estimation using the random forests approach.” Environmental Pollution 242: 605–613.
  • Chen et al. (2024) Chen, Jianhua, Kaihang Xu, Zheng Zhao, Xianxia Gan, and Huawei Xie. 2024. “A cellular automaton integrating spatial case-based reasoning for predicting local landslide hazards.” International Journal of Geographical Information Science 38 (1): 100–127.
  • Chen et al. (2022) Chen, Songchao, Dominique Arrouays, Vera Leatitia Mulder, Laura Poggio, Budiman Minasny, Pierre Roudier, Zamir Libohova, et al. 2022. “Digital mapping of GlobalSoilMap soil properties at a broad scale: A review.” Geoderma 409: 115567.
  • Cheng et al. (2018) Cheng, Yanchao, Nils Benjamin Tjaden, Anja Jaeschke, Renke Lühken, Ute Ziegler, Stephanie Margarete Thomas, and Carl Beierkuhnlein. 2018. “Evaluating the risk for Usutu virus circulation in Europe: Comparison of environmental niche models and epidemiological models.” International Journal of Health Geographics 17 (1): 1–14.
  • de Bruin et al. (2022) de Bruin, Sytze, Dick J. Brus, Gerard B.M. Heuvelink, Tom van Ebbenhorst Tengbergen, and Alexandre M.J-C. Wadoux. 2022. “Dealing with clustered samples for assessing map accuracy by cross-validation.” Ecological Informatics 69: 101665.
  • FastML (2016) FastML. 2016. “Adversarial validation.” http://fastml.com/adversarial-validation-part-one/.
  • Garcia-Marti et al. (2018) Garcia-Marti, Irene, Raul Zurita-Milla, Margriet G. Harms, and Arno Swart. 2018. “Using volunteered observations to map human exposure to ticks.” Scientific Reports 8 (1): 15435.
  • Goetz et al. (2015) Goetz, J. N., A. Brenning, H. Petschko, and P. Leopold. 2015. “Evaluating machine learning and statistical prediction techniques for landslide susceptibility modeling.” Computers & Geosciences 81: 1–11.
  • Guerra et al. (2020) Guerra, Carlos A., Anna Heintz-Buschart, Johannes Sikorski, Antonis Chatzinotas, Nathaly Guerrero-Ramírez, Simone Cesarz, Léa Beaumelle, et al. 2020. “Blind spots in global soil biodiversity and ecosystem function research.” Nature Communications 11 (1): 1–13.
  • Guo et al. (2022) Guo, Jiangang, Jinfeng Wang, Chengdong Xu, and Yongze Song. 2022. “Modeling of spatial stratified heterogeneity.” GIScience & Remote Sensing 59 (1): 1660–1677.
  • Hengl et al. (2015) Hengl, Tomislav, Gerard B. M. Heuvelink, Bas Kempen, Johan G. B. Leenaars, Markus G. Walsh, Keith D. Shepherd, Andrew Sila, et al. 2015. “Mapping Soil Properties of Africa at 250 m Resolution: Random Forests Significantly Improve Current Predictions.” Plos One 10 (6): e0125814.
  • Hengl et al. (2018) Hengl, Tomislav, Madlene Nussbaum, Marvin N. Wright, Gerard B.M. Heuvelink, and Benedikt Gräler. 2018. “Random forest as a generic framework for predictive modeling of spatial and spatio-temporal variables.” PeerJ 6: e5518.
  • Hitouri et al. (2022) Hitouri, Sliman, Antonietta Varasano, Meriame Mohajane, Safae Ijlil, Narjisse Essahlaoui, Sk Ajim Ali, Ali Essahlaoui, et al. 2022. “Hybrid Machine Learning Approach for Gully Erosion Mapping Susceptibility at a Watershed Scale.” ISPRS International Journal of Geo-Information 11 (7): 401.
  • Khodadadzadeh and Gloaguen (2019) Khodadadzadeh, Mahdi, and Richard Gloaguen. 2019. “Upscaling High-Resolution Mineralogical Analyses to Estimate Mineral Abundances in Drill Core Hyperspectral Data.” In International Geoscience and Remote Sensing Symposium (IGARSS) 2019, jul, 1845–1848. Institute of Electrical and Electronics Engineers Inc.
  • Lagacherie et al. (2020) Lagacherie, P., D. Arrouays, H. Bourennane, C. Gomez, and L. Nkuba-Kasanda. 2020. “Analysing the impact of soil spatial sampling on the performances of Digital Soil Mapping models and their evaluation: A numerical experiment on Quantile Random Forest using clay contents obtained from Vis-NIR-SWIR hyperspectral imagery.” Geoderma 375: 114503.
  • Lamichhane, Kumar, and Wilson (2019) Lamichhane, Sushil, Lalit Kumar, and Brian Wilson. 2019. “Digital soil mapping algorithms and covariates for soil organic carbon mapping and their implications: A review.” Geoderma 352: 395–413.
  • Le Rest et al. (2014) Le Rest, Kévin, David Pinaud, Pascal Monestiez, Joël Chadoeuf, and Vincent Bretagnolle. 2014. “Spatial leave-one-out cross-validation for variable selection in the presence of spatial autocorrelation.” Global Ecology and Biogeography 23 (7): 811–820.
  • Li et al. (2021a) Li, Boyi, Adu Gong, Tingting Zeng, Wenxuan Bao, Can Xu, and Zhiqing Huang. 2021a. “A Zoning Earthquake Casualty Prediction Model Based on Machine Learning.” Remote Sensing 14 (1): 30.
  • Li et al. (2021b) Li, Yao, Peng Cui, Chengming Ye, José Marcato Junior, Zhengtao Zhang, Jian Guo, and Jonathan Li. 2021b. “Accurate Prediction of Earthquake-Induced Landslides Based on Deep Learning Considering Landslide Source Area.” Remote Sensing 13 (17): 3436.
  • Linnenbrink et al. (2023) Linnenbrink, Jan, Carles Milà, Marvin Ludwig, and Hanna Meyer. 2023. “kNNDM: k-fold Nearest Neighbour Distance Matching Cross-Validation for map accuracy estimation.” EGUsphere [preprint] .
  • Ludwig et al. (2023) Ludwig, Marvin, Alvaro Moreno-Martinez, Norbert Hölzel, Edzer Pebesma, and Hanna Meyer. 2023. “Assessing and improving the transferability of current global spatial prediction models.” Global Ecology and Biogeography 32 (3): 356–368.
  • Lyons et al. (2018) Lyons, Mitchell B., David A. Keith, Stuart R. Phinn, Tanya J. Mason, and Jane Elith. 2018. “A comparison of resampling methods for remote sensing classification and accuracy assessment.” Remote Sensing of Environment 208: 145–153.
  • Meyer and Pebesma (2022) Meyer, Hanna, and Edzer Pebesma. 2022. “Machine learning-based global maps of ecological variables and the challenge of assessing them.” Nature Communications 13 (1): 1–4.
  • Milà et al. (2022) Milà, Carles, Jorge Mateu, Edzer Pebesma, and Hanna Meyer. 2022. “Nearest neighbour distance matching Leave-One-Out Cross-Validation for map validation.” Methods in Ecology and Evolution 13 (6): 1304–1316.
  • Montesinos-López, Montesinos-López, and Kismiantini (2023) Montesinos-López, Osval A., Abelardo Montesinos-López, and Kismiantini. 2023. “Designing optimal training sets for genomic prediction using adversarial validation with probit regression.” Plant Breeding 142 (5): 594–606.
  • Mussumeci and Codeço Coelho (2020) Mussumeci, Elisa, and Flávio Codeço Coelho. 2020. “Large-scale multivariate forecasting models for Dengue - LSTM versus random forest regression.” Spatial and Spatio-temporal Epidemiology 35: 100372.
  • Nesha et al. (2020) Nesha, Mst Karimon, Yousif Ali Hussin, Louise Marianne van Leeuwen, and Yohanes Budi Sulistioadi. 2020. “Modeling and mapping aboveground biomass of the restored mangroves using ALOS-2 PALSAR-2 in East Kalimantan, Indonesia.” International Journal of Applied Earth Observation and Geoinformation 91: 102158.
  • Oliveira, Torgo, and Costa (2021) Oliveira, Mariana, Luís Torgo, and Vítor Santos Costa. 2021. “Evaluation Procedures for Forecasting with Spatiotemporal Data.” Mathematics 9 (6): 691.
  • Ploton et al. (2020) Ploton, Pierre, Frédéric Mortier, Maxime Réjou-Méchain, Nicolas Barbier, Nicolas Picard, Vivien Rossi, Carsten Dormann, et al. 2020. “Spatial validation reveals poor predictive performance of large-scale ecological mapping models.” Nature Communications 11: 4540.
  • Pohjankukka et al. (2017) Pohjankukka, Jonne, Tapio Pahikkala, Paavo Nevalainen, and Jukka Heikkonen. 2017. “Estimating the prediction performance of spatial models via spatial k-fold cross validation.” International Journal of Geographical Information Science 31 (10): 2001–2019.
  • Qian et al. (2022) Qian, Hongyi, Baohui Wang, Ping Ma, Lei Peng, Songfeng Gao, and You Song. 2022. “Managing Dataset Shift by Adversarial Validation for Credit Scoring.” In PRICAI 2022: Trends in Artificial Intelligence, edited by S. Khanna, J. Cao, Q. Bai, and G. Xu, Vol. 13629 LNCS, 477–488. Springer, Cham.
  • Roberts et al. (2017) Roberts, David R., Volker Bahn, Simone Ciuti, Mark S. Boyce, Jane Elith, Gurutzeta Guillera-Arroita, Severin Hauenstein, et al. 2017. “Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure.” Ecography 40 (8): 913–929.
  • Sarafian et al. (2021) Sarafian, Ron, Itai Kloog, Elad Sarafian, Ian Hough, and Jonathan D. Rosenblatt. 2021. “A Domain Adaptation Approach for Performance Estimation of Spatial Predictions.” IEEE Transactions on Geoscience and Remote Sensing 59 (6): 5197–5205.
  • Sarailidis, Wagener, and Pianosi (2023) Sarailidis, Georgios, Thorsten Wagener, and Francesca Pianosi. 2023. “Integrating scientific knowledge into machine learning using interactive decision trees.” Computers & Geosciences 170: 105248.
  • Schlather et al. (2015) Schlather, Martin, Alexander Malinowski, Peter J. Menck, Marco Oesting, and Kirstin Strokorb. 2015. “Analysis, Simulation and Prediction of Multivariate Random Fields with Package RandomFields.” Journal of Statistical Software 63 (1): 1–25.
  • Stock and Subramaniam (2022) Stock, Andy, and Ajit Subramaniam. 2022. “Iterative spatial leave-one-out cross-validation and gap-filling based data augmentation for supervised learning applications in marine remote sensing.” GIScience & Remote Sensing 59 (1): 1281–1300.
  • Usman et al. (2023) Usman, Muhammad, Mahnoor Ejaz, Janet E. Nichol, Muhammad Shahid Farid, Sawaid Abbas, and Muhammad Hassan Khan. 2023. “A Comparison of Machine Learning Models for Mapping Tree Species Using WorldView-2 Imagery in the Agroforestry Landscape of West Africa.” ISPRS International Journal of Geo-Information 12 (4): 142.
  • Valavi et al. (2019) Valavi, Roozbeh, Jane Elith, José J. Lahoz‐Monfort, and Gurutzeta Guillera‐Arroita. 2019. “BlockCV : An R package for generating spatially or environmentally separated folds for k ‐fold cross‐validation of species distribution models.” Methods in Ecology and Evolution 10 (2): 225–232.
  • Wadoux et al. (2021) Wadoux, Alexandre M.J.C., Gerard B.M. Heuvelink, Sytze de Bruin, and Dick J. Brus. 2021. “Spatial cross-validation is not the right way to evaluate map accuracy.” Ecological Modelling 457: 109692.
  • Wang et al. (2012) Wang, Jin Feng, A. Stein, Bin Bo Gao, and Yong Ge. 2012. “A review of spatial sampling.” Spatial Statistics 2 (1): 1–14.
  • Wang, Khodadadzadeh, and Zurita-Milla (2023) Wang, Yanwen, Mahdi Khodadadzadeh, and Raúl Zurita-Milla. 2023. “Spatial+: A new cross-validation method to evaluate geospatial machine learning models.” International Journal of Applied Earth Observation and Geoinformation 121: 103364.
  • Wiens et al. (2008) Wiens, Trevor S., Brenda C. Dale, Mark S. Boyce, and G. Peter Kershaw. 2008. “Three way k-fold cross-validation of resource selection functions.” Ecological Modelling 212 (3-4): 244–255.
  • Wu et al. (2019) Wu, Wei, Qipo Yang, Jiake Lv, Aidi Li, and Hongbin Liu. 2019. “Investigation of Remote Sensing Imageries for Identifying Soil Texture Classes Using Classification Methods.” IEEE Transactions on Geoscience and Remote Sensing 57 (3): 1653–1663.
  • Zhang et al. (2023) Zhang, Wen, Zhengjiang Liu, Yan Xue, Ruibo Wang, Xuefei Cao, and Jihong Li. 2023. “An Improved Cross-Validated Adversarial Validation Method.” In Knowledge Science, Engineering and Management. KSEM 2023, edited by Z. Jin, Y. Jiang, R.A. Buchmann, Y. Bi, A.M. Ghiran, and W. Ma, 343–353. Springer, Cham.
  • Zhao et al. (2017) Zhao, Wei, Ainong Li, Pan Huang, He Juelin, and Ma Xianming. 2017. “Surface soil moisture relationship model construction based on random forest method.” In International Geoscience and Remote Sensing Symposium (IGARSS) 2017, Vol. 2017-July, jul, 2019–2022. IEEE.
  • Zurita-Milla, Laurent, and van Gijsel (2015) Zurita-Milla, R., V. C.E. Laurent, and J. A.E. van Gijsel. 2015. “Visualizing the ill-posedness of the inversion of a canopy radiative transfer model: A case study for Sentinel-2.” International Journal of Applied Earth Observation and Geoinformation 43: 7–18.