GloSoFarID: Global multispectral dataset for Solar Farm IDentification in satellite imagery

Zhiyuan Yang and Ryan Rad Khoury College of Computer Science, Northeastern University
Vancouver, Canada
{yang.zhiyu, r.rad}@northeastern.edu

Abstract

Solar Photovoltaic (PV) technology is increasingly recognized as a pivotal solution in the global pursuit of clean and renewable energy. This technology addresses the urgent need for sustainable energy alternatives by converting solar power into electricity without greenhouse gas emissions. It not only curtails global carbon emissions but also reduces reliance on finite, non-renewable energy sources. In this context, monitoring solar panel farms becomes essential for understanding and facilitating the worldwide shift toward clean energy. This study contributes to this effort by developing the first comprehensive global dataset of multispectral satellite imagery of solar panel farms. This dataset is intended to form the basis for training robust machine learning models, which can accurately map and analyze the expansion and distribution of solar panel farms globally. The insights gained from this endeavor will be instrumental in guiding informed decision-making for a sustainable energy future. https://github.com/yzyly1992/GloSoFarID

Index Terms:

photovoltaic, solar panel farm, multispectral images, dataset, sustainability

I Introduction

Solar Photovoltaic (PV) technology has emerged as a key solution for meeting the global demand for clean and renewable energy. By harnessing solar power for electricity generation, solar PV plays a significant role in mitigating climate change and reducing pollution. Its ability to replace traditional power generation methods reduces global carbon emissions and lessens dependence on finite resources, paving the way for a sustainable future [1].

A critical aspect of the global transition to clean energy is the precise monitoring of solar panel farms, which are integral to renewable energy infrastructure worldwide. Understanding the growth and distribution of these farms is essential for assessing progress towards a sustainable energy landscape [2]. Therefore, monitoring these farms is a pivotal step in comprehensively understanding the global adoption of clean energy resources.

In line with this goal, our project aims to make a significant contribution by creating a comprehensive dataset of multispectral satellite images of solar panel farms. This dataset will be a foundational resource for developing effective machine learning models for monitoring solar panel farms globally. The focus on using these models highlights the importance of accurate and advanced monitoring techniques in promoting a sustainable and cleaner energy future worldwide. This dataset is a pioneering effort, representing the first global compilation of solar panel farm data, and includes comprehensive multispectral satellite information. Our dataset enhances existing resources in three ways:

•

It employs mid-resolution data and includes rural areas globally, addressing previous limitations in geographic coverage.
•

It provides multispectral satellite images with $13$ multispectral bands, offering more comprehensive information that comes in an easy-to-use format with a complementary how-to-use guide and script.
•

The dataset is updated with recent data from $2021$ to $2023$ , doubling the size of the existing data sources and ensuring current and relevant insights into the global solar energy landscape.

Refer to caption — Figure 1: Examples from our GloSoFarID dataset. Each row represents one sample with all $13$ bands along with the ground truth mask where the gray color indicates the solar farm area. See TABLE II for more details.

II Dataset Construction

To construct this dataset, we devised a methodical three-step process that incorporates modern techniques. As illustrated in Fig. 2, our approach begins with the creation of a robust initial training set, proceeds with the training of state-of-the-art (SOTA) models, and concludes by processing the combined outputs of the top-performing models to generate a novel dataset.

II-A Gather Initial Training Dataset

Our first step involved aggregating data from diverse sources to assemble an initial dataset. This dataset comprises three primary segments: Historical Global Ground Truth from $2017$ - $2018$ [3], Manually Annotated US Ground Truth from $2022$ - $2023$ by our team using Google Earth Engine (GEE), and Sentinel-2 Multispectral Satellite Imagery spanning $2017$ - $2023$ [4]. The integration of these datasets yielded a comprehensive initial training set of $10,230$ multispectral images, each with a resolution of $256\times 256$ pixels and encompassing $13$ spectral bands. This rich and varied dataset lays the groundwork for our subsequent analytical endeavors.

II-B Train SOTA Models

With the initial training set in place, we engaged several SOTA models, renowned for their performance, for training. Utilizing a batch size of $16$ over $50$ epochs, our models demonstrated exceptional accuracy in identifying solar panel farms from multispectral images. The top-performing model achieved an outstanding $96.47\%$ Intersection over Union (IoU) and a $98.2\%$ F-score, demonstrating its high accuracy in precisely segmenting solar panel areas within the imagery.

II-C Generate New Dataset

Following the identification of the top three models based on predictive accuracy, we employed them to predict and label data for the global dataset covering $2021$ - $2023$ . The combination (logical OR) of predictions from these models resulted in a unified dataset. To ensure the dataset’s quality and reliability, we implemented a rigorous quality control process. This included applying stringent filtering techniques to remove artifacts and isolated clusters and systematically eliminating any erroneous or untrustworthy data points. The resulting dataset, refined through these procedures and careful manual inspection, represents the latest and most accurate collection of data. Our comprehensive methodology ensures the delivery of a dataset distinguished by its precision and reliability, suitable for a broad spectrum of applications and analyses.

III Dataset Overview

This section presents a detailed analysis of the proposed dataset, including comprehensive statistical information. It aims to give users a clear insight into the dataset’s characteristics and strengths. Additionally, our Github repository contains an easy-to-use guide for working with this dataset regardless of the machine learning framework.

III-A Size, Region, and Time Span

The proposed dataset is comprised of a total of $13,703$ data samples, spanning various global regions and capturing information from the years $2021$ to $2023$ . Fig. 3 displays the location and density of all data points on a global map. Furthermore, Table I provides a detailed view of the dataset’s composition and distribution

TABLE I: Dataset Composition

Year & Region	Positive Sample	Negative Sample	Sum
2023 US^∗	706	969	1675
2022 US^∗	723	968	1691
2023 Global	1774	1689	3463
2022 Global	1803	1655	3458
2021 Global	1787	1629	3416
Total	6793	6910	13703
^∗Samples from the US region have been manually annotated.

III-B Resolution and Dimension

This dataset maintains a spatial resolution of $10$ meters per pixel, consistent with the resolution of Sentinel-2 satellite imagery. Each data sample within the dataset is standardized to dimensions of $256\times 256$ pixels, and encompasses a rich spectral profile with $13$ bands [4]. Table II summarizes the spectral information and Fig. 1 presents two samples visualizing the $13$ bands along with the ground truth mask.

TABLE II: Dataset Bands Information

Name	Wavelength	Description
B1	443.9nm (S2A) / 442.3nm (S2B)	Aerosols
B2	496.6nm (S2A) / 492.1nm (S2B)	Blue
B3	560nm (S2A) / 559nm (S2B)	Green
B4	664.5nm (S2A) / 665nm (S2B)	Red
B5	703.9nm (S2A) / 703.8nm (S2B)	Red Edge 1
B6	740.2nm (S2A) / 739.1nm (S2B)	Red Edge 2
B7	782.5nm (S2A) / 779.7nm (S2B)	Red Edge 3
B8	835.1nm (S2A) / 833nm (S2B)	NIR
B8A	864.8nm (S2A) / 864nm (S2B)	Red Edge 4
B9	945nm (S2A) / 943.2nm (S2B)	Water vapor
B10	1373.5nm (S2A) / 1376.9nm (S2B)	Cirrus
B11	1613.7nm (S2A) / 1610.4nm (S2B)	SWIR 1
B12	2202.4nm (S2A) / 2185.7nm (S2B)	SWIR 2

III-C Compare to Existing Datasets

Numerous studies are dedicated to examining datasets featuring high-resolution images of solar panels. SolarDK, a publicly available benchmark dataset, focuses on the geographical area of Denmark [5]. This dataset encompasses human-annotated information for both classification and segmentation tasks. Furthermore, the UC Berkeley team developed another high-resolution dataset utilizing Google Maps to collect satellite images from various states in the U.S. [6]. Although these openly accessible datasets are valuable resources, they lack diversity in covering non-urban regions and are restricted to specific geographic locations and countries. To address these shortcomings, our dataset overcomes these limitations by employing mid-resolution data and encompassing rural areas across global regions on various continents.

In $2021$ , L. Kruitwagen created a dataset that offers a global inventory of photovoltaic solar energy generating units, presenting a comprehensive overview of solar panel distribution worldwide [3]. This dataset involved an extensive analysis, processing $550$ terabytes of imagery spanning approximately $2$ months in real-time. While the outcome is an expansive and insightful resource, it has certain limitations. It exclusively provides shape files for solar panel areas, lacking crucial input source data. Moreover, it captures information within an outdated timeframe, spanning from $2016$ to $2018$ . In response to these drawbacks, our proposed dataset aims to address these issues comprehensively. It will include multispectral input satellite images, offering a richer and more detailed dataset. Additionally, our dataset will cover an up-to-date timeframe, ranging from $2021$ to $2023$ , ensuring a more current and relevant depiction of the global solar energy landscape.

IV Benchmark Models and Results

Running benchmark segmentation models on our dataset is essential to establish a performance baseline and enable a fair comparative evaluation. This not only enhances our understanding of the dataset but also provides a standardized reference for the broader research community, fostering transparency, consistency, and credibility in future studies. To analyze and benchmark the dataset, we employed two widely recognized semantic segmentation models and one simplified model, namely the Fully Convolutional Network (FCN), U-Net, and Half-UNet depicted in Figure 4.

TABLE III: Benchmark Model Results

Model	Params	IoU (%)	F-score (%)
FCN[7]	1,996,737	71.81	82.87
Half-UNet[8]	224,609	71.48	82.17
U-Net[9]	31,129,377	79.31	87.80

The FCN, introduced by Long et al. [7], is renowned for its ability to perform end-to-end pixel-wise prediction through the extensive utilization of convolutional layers, allowing it to maintain spatial information throughout the network. This makes FCN particularly effective for semantic segmentation tasks.On the other hand, the U-Net architecture, proposed by Ronneberger et al. [9], features a U-shaped structure that incorporates both encoder and decoder components. The encoder captures contextual information by progressively downsampling the input, while the decoder facilitates precise localization by upsampling and merging features from different resolutions. The Half-UNet [8], a simplified model, offers a more lightweight alternative. While retaining the essence of U-Net, Half-UNet reduces computational complexity by utilizing only half of the U-shaped architecture. This streamlining allows for faster processing and lower resource requirements, making it an attractive choice for scenarios with limited computational resources.

For the training of three models, a batch size of $16$ was selected to balance computational efficiency and model convergence, and training was conducted over $100$ epochs. To mitigate the risk of overfitting, we implemented early stopping with a patience value set to $10$ based on validation loss. These hyperparameter choices were made to strike a balance between model complexity and training efficiency. The summary of the performance metrics and model size are presented in TABLE III. Additionally, Fig. 5 illustrates the qualitative results of the three models.

V Conclusion

This study presented a comprehensive global dataset of multispectral satellite images of solar panel farms. Our dataset enhances existing resources in three ways: by incorporating mid-resolution data from rural areas globally, using multispectral satellite images for richer detail, and updating the timeframe of data to $2021$ - $2023$ for a current view of the global solar energy landscape. The evaluation of this dataset using benchmark models has established its robustness and set a baseline for future improvements. It has proven effective in accurately identifying solar panel areas on a global scale. This dataset is a significant asset for the solar energy research community, potentially driving advancements in renewable energy development.

References

Fthenakis [2009] V. Fthenakis, “Sustainability of photovoltaics: The case for thin-film solar cells,” Renewable and Sustainable Energy Reviews, vol. 13, no. 9, pp. 2746–2750, 2009.
Choudhary and Srivastava [2019] P. Choudhary and R. K. Srivastava, “Sustainability perspectives-a review for solar photovoltaic trends and growth opportunities,” Journal of Cleaner Production, vol. 227, pp. 589–612, 2019.
Kruitwagen et al. [2021] L. Kruitwagen, K. Story, J. Friedrich, L. Byers, S. Skillman, and C. Hepburn, “A global inventory of photovoltaic solar energy generating units,” Nature, vol. 598, no. 7882, pp. 604–610, 2021.
Drusch et al. [2012] M. Drusch, U. Del Bello, S. Carlier, O. Colin, V. Fernandez, F. Gascon, B. Hoersch, C. Isola, P. Laberinti, P. Martimort et al., “Sentinel-2: Esa’s optical high-resolution mission for gmes operational services,” Remote sensing of Environment, vol. 120, pp. 25–36, 2012.
Khomiakov et al. [2022] M. M. Khomiakov, J. Radzikowski, C. Schmidt, M. Sørensen, M. Andersen, M. Andersen, and J. Frellsen, “Solardk: A high-resolution urban solar panel image classification and localization dataset,” in NeurIPS 2022 Workshop on Tackling Climate Change with Machine Learning, 2022. [Online]. Available: https://www.climatechange.ai/papers/neurips2022/3
Parhar et al. [2021] P. Parhar, R. Sawasaki, A. Todeschini, C. Reed, H. Vahabi, N. Nusaputra, and F. Vergara, “Hyperionsolarnet: Solar panel detection from aerial images,” in NeurIPS 2021 Workshop on Tackling Climate Change with Machine Learning, 2021. [Online]. Available: https://www.climatechange.ai/papers/neurips2021/41
Long et al. [2015] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.
Lu et al. [2022] H. Lu, Y. She, J. Tie, and S. Xu, “Half-unet: A simplified u-net architecture for medical image segmentation,” Frontiers in Neuroinformatics, vol. 16, p. 911679, 2022.
Ronneberger et al. [2015] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18. Springer, 2015, pp. 234–241.