
Learning to detect cloud and snow in remote sensing images from noisy labels

Abstract

Detecting clouds and snow in remote sensing images is an essential preprocessing task for remote sensing imagery. Previous works draw inspiration from semantic segmentation models in computer vision, with most research focusing on improving model architectures to enhance detection performance. However, unlike natural images, the complexity of scenes and the diversity of cloud types in remote sensing images result in many inaccurate labels in cloud and snow detection datasets, introducing unnecessary noise into the training and testing processes. By constructing a new dataset and proposing a novel training strategy with the curriculum learning paradigm, we guide the model in reducing overfitting to noisy labels. Additionally, we design a more appropriate model performance evaluation method that alleviates the performance assessment bias caused by noisy labels. Through experiments on UNet- and Segformer-based models, we validate the effectiveness of our proposed method. This paper is the first to consider the impact of label noise on the detection of clouds and snow in remote sensing images.

Index Terms—  cloud and snow detection, noisy label learning, remote sensing image

1 Introduction

Remote sensing is an indispensable tool for earth observation, providing critical data for environmental monitoring, disaster response, and climate research. However, the presence of clouds and snow in satellite imagery often obscures surface features, posing significant challenges to data accuracy and utility. Clouds can mask underlying terrain, while snow cover, with its similar spectral properties, can be difficult to distinguish from clouds. Accurate detection and differentiation of these two elements are crucial for the integrity of remote sensing applications, from weather forecasting to resource management. Cloud detection is a crucial preprocessing task that assesses the usability of remote sensing images and evaluates the imaging quality of these remote sensing data [1]. During the cloud detection process, ground snow cover, with its extremely high reflective brightness, is often easily mistaken for clouds, leading to false detections. Consequently, the simultaneous detection of clouds and snow in remote sensing images has been widely studied [2].

To facilitate the effective automated detection of clouds and snow in remote sensing images, previous work has proposed a variety of methods, which can be primarily divided into traditional spectral threshold-based approaches and the more recently rapidly evolving deep learning-based methods [1]. This paper primarily focuses on deep learning-based approaches. Deep learning-based methods for cloud and snow detection predominantly employ semantic segmentation models from the field of computer vision [3] to perform pixel-wise classification on the input remote sensing images. Early studies mostly focused solely on the detection of clouds in remote sensing images and directly utilized existing models such as FCN (Fully Convolutional Networks) [4, 5] or UNet [6]. Subsequently, as highly reflective snow significantly impacts the effectiveness of cloud detection, many studies began to explore the task of simultaneously detecting clouds and snow in remote sensing imagery. Related research has also incorporated prior information such as altitude, latitude, and longitude as inputs to enhance the precision of cloud and snow detection [2]. In a manner akin to the evolution of semantic segmentation tasks, later improvements have predominantly focused on enhancements to model architectures. These advancements include improving feature fusion techniques to increase the precision of detecting cloud and snow edges [7], as well as reducing the need for extensive annotation through the use of self-supervised or weakly-supervised learning methods [8, 9]. While many improved methods have boosted the performance of cloud and snow detection, there has been scant work specifically addressing the unique characteristics and challenges of the cloud and snow detection task. One of the challenges, which has persisted yet remains largely unaddressed, is the significant issue of noisy labels within the cloud and snow detection task. 
Clouds in remote sensing images exhibit a variety of shapes, and the boundaries of many thin clouds are often difficult to discern. This results in a significant introduction of noisy labels during dataset annotation, particularly in scenes with thin clouds or mixed cloud-snow coverage, a problem not present in natural image scenes where the edges of objects are typically very distinct. Such noisy labels not only introduce noise during the training process; using them directly for testing can also lead to unreliable results, especially in scenarios where cloud pixels are difficult to distinguish. Therefore, designing specialized methods to mitigate the problems caused by noisy labels, which are particularly challenging in cloud and snow detection, is critically important.

To address the aforementioned issues, drawing on methods from the study of learning with noisy labels [10], we focus our improvements on the reconstruction of cloud and snow detection datasets and the incorporation of a curriculum learning-based training approach [11] to alleviate the adverse effects brought about by noisy labels. Specifically, we split the existing dataset into clean and noisy subsets based on the difficulty of distinguishing cloud and snow areas in the remote sensing images. Subsequently, during the training process, we start with the clean subset and gradually incorporate samples from the noisy subset into the training set, guiding the model to prioritize learning from the clean samples and reduce the impact of the noisy ones. In addition, we have developed a specialized model performance testing method for scenarios with noisy labels. The specialized design of the training and testing procedures described above enables us to obtain a more stable and better-generalizing model for the task of cloud and snow detection, which is plagued by a large number of noisy labels. We employed two mainstream networks, the CNN-based UNet [12] and the Transformer-based Segformer [13], to validate the effectiveness of our proposed method for learning with noisy labels.

The primary contributions of this paper include: (1) Introducing the label noise issue to the task of cloud and snow detection in remote sensing images, a problem that has been widely present but not studied within this domain. (2) Constructing a new dataset and evaluation method for cloud and snow detection with noisy labels, taking into account different cloud types and remote sensing scenarios. (3) Proposing a cloud and snow detection method with the curriculum learning paradigm, tailored to the characteristics of this task.

Refer to caption
Fig. 1: Our proposed curriculum learning-based learning paradigm for cloud and snow detection from noisy labels. The red boxes represent the position of the noisy labels.

2 METHODOLOGY

2.1 Overview

This section analyzes the issue of noisy labels in cloud and snow detection and proposes a method to mitigate their negative impacts. As illustrated in Fig. 1, we divide the dataset into a clean set and a noisy set based on the difficulty of annotation. During the curriculum learning-based [11] model training process, noisy samples are gradually introduced to alleviate overfitting to them. The following sections provide a detailed introduction to the construction of the dataset, improvements to the evaluation method, and the training process.

2.2 Dataset and Performance Evaluation with Noisy Labels

Based on our comprehensive study of publicly available cloud and snow detection datasets, we found that noisy labels are related to factors such as the type of clouds, the scenes in the remote sensing images, and whether clouds and snow are mixed. Therefore, based on these characteristics, we divided the original dataset into two parts: a clean set and a noisy set. The former primarily comprises samples with clear cloud boundaries, a homogeneous background, and greater separation between clouds and snow, which typically have more accurate labels. The latter mostly includes samples with fuzzy boundaries such as thin clouds, relatively complex backgrounds with more bright targets, or complicated situations with mixed clouds and snow. Specifically, to ensure that our samples encompass a wide range of scenes and cloud types, we selected 2434 remote sensing images from various satellite sources. These include high-resolution satellites (GF-1), environmental monitoring satellites (HJ-2A/B), and commercial remote sensing satellites (SV2-01/02). Table 1 provides detailed information about the dataset.

Subset Name        | Clean Set            | Noisy Set
                   | train & valid | test | train & valid | test
# Images           | 1185          | 450  | 599           | 200
Data Source        | Gaofen-1, SV2-01/02, HJ-2A/B (all subsets)
Evaluation Metrics | mIoU and OA          | error% = #images_error / #images_total
Table 1: Description of the dataset setting for cloud and snow detection with noisy labels.

Since we have divided the samples into two subsets, the labels in the noisy set contain a considerable amount of noise, meaning the labels themselves are not accurate. Therefore, it is challenging to accurately evaluate the model's performance through specific numerical indicators during the testing phase. Furthermore, in practical applications, minor improvements in metrics are often insignificant; what is more critical is to avoid extensive false positives and false negatives at the scale of the entire image. In light of the issue with noisy labels, we have rethought the way we evaluate model performance. Specifically, as shown in Table 1, we conduct different tests on the clean set and the noisy set. For the clean set, since the labels are relatively more accurate, we can directly utilize widely-used metrics such as Overall Accuracy (OA) and mean Intersection over Union (mIoU) [1] as the primary evaluation criteria. For the noisy set, we manually tallied the number of samples in the test results that exhibited extensive omissions and misclassifications, covering three situations: (1) misidentifying large areas of clouds as snow; (2) overlooking large areas that should have been detected; (3) wrongly detecting large areas where there should not be any detection. By calculating the ratio of the number of samples that failed detection to the total number of samples, we can approximately evaluate the model's generalization performance in complex scenarios.
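The two evaluation tracks above can be sketched in a few lines. This is a minimal NumPy illustration; the helper names are ours, not from a released codebase:

```python
import numpy as np

def oa_and_iou(pred, label, cls):
    """Overall accuracy and IoU for one class (e.g. cloud or snow),
    computed from integer label maps of the same shape."""
    pred, label = pred.ravel(), label.ravel()
    oa = float(np.mean(pred == label))               # overall accuracy
    inter = np.sum((pred == cls) & (label == cls))   # intersection for this class
    union = np.sum((pred == cls) | (label == cls))   # union for this class
    iou = inter / union if union else 1.0
    return oa, iou

def noisy_error_rate(n_failed, n_total):
    """Noisy-set metric: percentage of test images that show a
    large-area miss, false detection, or cloud/snow confusion."""
    return 100.0 * n_failed / n_total
```

Per-class IoU values would then be averaged to obtain mIoU on the clean set, while the noisy set reports only the image-level failure percentage.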

2.3 Cloud-Snow Detection with Curriculum Learning

Based on the dataset we constructed, we incorporate the curriculum learning paradigm into the training process of our cloud and snow detection model. To be specific, we are given training pairs from the two subsets $\mathcal{D}_{clean}$ and $\mathcal{D}_{noisy}$, and the trainable parameters $\Phi$ of the segmentation model. As shown in Fig. 1, at the beginning of the training phase, we use only the data from the clean set $\mathcal{D}_{clean}$ to train the model by:

\Phi_{epoch=0}^{m} := \mathrm{TrainStage1}(\forall\{x_{i},y_{i}\}\in\mathcal{D}_{clean})   (1)

which allows it to learn effectively under the supervision of accurate labels. As training progresses, at each epoch we introduce $\frac{epoch-m}{n-m}\cdot\mathrm{len}(\mathcal{D}_{noisy})$ samples from the noisy set into the training set and continue the training process by:

\Phi_{epoch=m}^{n} := \mathrm{TrainStage2}(\forall\{x_{i},y_{i}\}\in\mathcal{D}_{clean}\cup\mathcal{D}_{noisy}^{sub})   (2)

where $m$ and $n$ represent the epoch numbers at which the integration of the noisy dataset begins and ends, respectively. This gradual infusion of noisy data is designed to help the model learn to cope with and generalize from the imperfections in the data. Finally, we train the model on the entire dataset, which includes both clean and noisy data, by:

\Phi_{epoch=n}^{N} := \mathrm{TrainStage3}(\forall\{x_{i},y_{i}\}\in\mathcal{D}_{clean}\cup\mathcal{D}_{noisy})   (3)

This three-stage training approach allows the model to gradually adapt to noise while maintaining its generalization performance and benefiting from the increased sample volume.
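The three-stage schedule in Eqs. (1)-(3) amounts to a per-epoch selection of the training subset. Below is a minimal Python sketch with hypothetical names; the order in which noisy samples are introduced (here, a simple prefix of the noisy set) is our assumption, not specified by the paper:

```python
import math

def curriculum_subset(clean, noisy, epoch, m, n):
    """Return the training samples for one epoch under the three-stage
    curriculum: clean only (epoch < m); clean plus a growing fraction
    (epoch - m) / (n - m) of the noisy set (m <= epoch < n); then the
    full dataset (epoch >= n)."""
    if epoch < m:                        # Stage 1: clean set only, Eq. (1)
        return list(clean)
    if epoch < n:                        # Stage 2: gradually add noisy samples, Eq. (2)
        k = math.floor((epoch - m) / (n - m) * len(noisy))
        return list(clean) + list(noisy[:k])
    return list(clean) + list(noisy)     # Stage 3: full dataset, Eq. (3)
```

In practice this would feed a standard data loader, with the subset recomputed at the start of every epoch.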

3 Experiment

3.1 Implementation Details

3.1.1 Model Architecture

We employed two distinct models to ascertain the general applicability of our method. For the UNet model [12], we utilized a ResNet-18 architecture as the encoder and connected features from its four stages to the decoder. The upsampling in the decoder is accomplished through a combination of bilinear interpolation and transposed convolution. Regarding the Segformer model [13], we employed a simplified version with a 2-layer decoder of 256 dimensions. Since the output dimensions of Segformer are a quarter of the input dimensions in both width and height, we performed a 4x upsampling on the output features using bilinear interpolation to ensure that the input and output dimensions are consistent.
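The shape alignment described for Segformer can be illustrated as follows. Note that the paper uses bilinear interpolation; this dependency-free sketch substitutes nearest-neighbor repetition purely to show the 4x size restoration:

```python
import numpy as np

def upsample4x(feat):
    """Upsample an (H/4, W/4, C) feature map back to (H, W, C).
    Nearest-neighbor repetition stands in for the bilinear
    interpolation used in the paper, to keep the sketch minimal."""
    return np.repeat(np.repeat(feat, 4, axis=0), 4, axis=1)
```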

3.1.2 Training Details

We used the Adam optimizer for both models and applied a step decay learning rate strategy. Considering the characteristics of CNN and Transformer models, we set the initial learning rates to 0.001 for the CNN-based UNet and 0.00006 for the Transformer-based Segformer. The learning rate is reduced by a factor of 10 every 10 epochs. For the loss function, we employed the standard cross-entropy loss and trained the models for a total of 150 epochs. Due to the limited number of samples in the dataset, we combined the training and validation sets and selected the model with the best mIoU on the training set for testing.
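The step decay schedule above can be written as a one-line function; this is a sketch of the schedule as described, not the authors' code:

```python
def step_decay_lr(base_lr, epoch, step=10, factor=0.1):
    """Learning rate under step decay: reduced by a factor of 10
    every `step` epochs, starting from `base_lr`."""
    return base_lr * factor ** (epoch // step)
```

With the reported settings, the UNet starts at 0.001 and the Segformer at 0.00006, each decaying tenfold every 10 epochs over the 150-epoch run.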

3.2 Results

Table 2 shows the results on the clean and noisy test set with our proposed curriculum learning-based training method compared to the original approach. We adhere to the evaluation approach described earlier in Section 2.2. Specifically, due to the substantial presence of noise in the labels of the noisy set, we directly computed the percentage ratio of the number of large-area detection errors to the total number of samples, instead of calculating specific numerical segmentation performance metrics.

Method           | Clean Set                               | Noisy Set
                 | OA_c ↑ | OA_s ↑ | IoU_c ↑ | IoU_s ↑     | error% ↓
UNet [12]        | 88.89  | 81.47  | 83.33   | 72.49       | 35
Segformer [13]   | 89.43  | 81.44  | 79.63   | 70.48       | 25.5
UNet (Ours)      | 92.56  | 85.86  | 87.50   | 78.77       | 23.5
Segformer (Ours) | 89.44  | 78.81  | 84.84   | 73.94       | 18.5
Table 2: The comparison results between the original method and our proposed noisy label learning method on two different model architectures. Subscripts c and s denote the cloud and snow classes.

The comparative results indicate that after incorporating our proposed curriculum learning-based method, the metrics improve on both the clean and noisy test sets. It is important to note that a small improvement in numerical metrics on the clean test set does not necessarily translate to better generalization performance on the noisy test set. This implies that in practical applications, it is shortsighted to pursue numerical improvements alone; it is crucial to ensure that the model generalizes well across different scenarios so as to minimize the occurrence of large-area detection errors. For instance, although the Segformer's performance metrics on the clean set are somewhat lower than those of UNet, it exhibits stronger generalization in complex scenarios and across various cloud types on the noisy set, indicating a higher tolerance to noise. Moreover, integrating our proposed training strategy further enhances the model's robustness to noisy labels.

4 Conclusion

This paper is the first to highlight the problem of noisy labels in cloud and snow detection in remote sensing images. Starting from the curriculum learning paradigm, we constructed a dataset that separates clean data from noisy data. By progressively integrating noisy samples during the training process, we mitigated the issue of overfitting to noisy labels. In addition, we established a new model performance evaluation method suited to scenarios with noisy labels, aligning it more closely with the needs of practical applications. Experiments conducted on both a CNN-based UNet model and a Transformer-based Segformer model confirmed the effectiveness of our approach. This work lays the foundation for future research on learning with noisy labels in cloud and snow detection tasks.

References

  • [1] Zili Liu, Jiajun Yang, Wenjing Wang, and Zhenwei Shi, “Cloud detection methods for remote sensing images: a survey,” Chinese Space Science and Technology, vol. 43, no. 1, pp. 1, 2023.
  • [2] Xi Wu, Zhenwei Shi, and Zhengxia Zou, “A geographic information-driven method and a new large scale dataset for remote sensing cloud/snow detection,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 174, pp. 87–104, 2021.
  • [3] Shervin Minaee, Yuri Boykov, Fatih Porikli, Antonio Plaza, Nasser Kehtarnavaz, and Demetri Terzopoulos, “Image segmentation using deep learning: A survey,” IEEE transactions on pattern analysis and machine intelligence, vol. 44, no. 7, pp. 3523–3542, 2021.
  • [4] Xi Wu and Zhenwei Shi, “Utilizing multilevel features for cloud detection on satellite imagery,” Remote Sensing, vol. 10, no. 11, pp. 1853, 2018.
  • [5] Alistair Francis, Panagiotis Sidiropoulos, and Jan-Peter Muller, “Cloudfcn: Accurate and robust cloud detection for satellite imagery with deep learning,” Remote Sensing, vol. 11, no. 19, pp. 2312, 2019.
  • [6] Marc Wieland, Yu Li, and Sandro Martinis, “Multi-sensor cloud and cloud shadow segmentation with a convolutional neural network,” Remote Sensing of Environment, vol. 230, pp. 111203, 2019.
  • [7] Jingyu Yang, Jianhua Guo, Huanjing Yue, Zhiheng Liu, Haofeng Hu, and Kun Li, “Cdnet: Cnn-based cloud detection for remote sensing imagery,” IEEE Transactions on Geoscience and Remote Sensing, vol. 57, no. 8, pp. 6195–6211, 2019.
  • [8] Zhengxia Zou, Wenyuan Li, Tianyang Shi, Zhenwei Shi, and Jieping Ye, “Generative adversarial training for weakly supervised cloud matting,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 201–210.
  • [9] Yansheng Li, Wei Chen, Yongjun Zhang, Chao Tao, Rui Xiao, and Yihua Tan, “Accurate cloud detection in high-resolution remote sensing imagery by weakly supervised deep learning,” Remote Sensing of Environment, vol. 250, pp. 112045, 2020.
  • [10] Hwanjun Song, Minseok Kim, Dongmin Park, Yooju Shin, and Jae-Gil Lee, “Learning from noisy labels with deep neural networks: A survey,” IEEE Transactions on Neural Networks and Learning Systems, 2022.
  • [11] Xin Wang, Yudong Chen, and Wenwu Zhu, “A survey on curriculum learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 9, pp. 4555–4576, 2021.
  • [12] Olaf Ronneberger, Philipp Fischer, and Thomas Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18. Springer, 2015, pp. 234–241.
  • [13] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo, “Segformer: Simple and efficient design for semantic segmentation with transformers,” Advances in Neural Information Processing Systems, vol. 34, pp. 12077–12090, 2021.