
Automated Identification and Segmentation of Hi Sources in CRAFTS Using Deep Learning Method

Zihao Song1, Huaxi Chen1, Donghui Quan1, Di Li2, Yinghui Zheng2, Shulei Ni1, Yunchuan Chen1, Yun Zheng1
1Zhejiang Lab, Hangzhou, Zhejiang 311121, China
2National Astronomical Observatories, Chinese Academy of Sciences, Beijing 100101, People’s Republic of China
E-mail: [email protected]; [email protected]
(Accepted XXX. Received YYY; in original form ZZZ)
Abstract

Identifying neutral hydrogen (Hi) galaxies from observational data is a significant challenge in Hi galaxy surveys. With the advancement of observational technology, especially the advent of large-scale telescope projects such as FAST and SKA, the significant increase in data volume presents new challenges for the efficiency and accuracy of data processing. To address this challenge, we present a machine learning-based method for extracting Hi sources from the three-dimensional (3D) spectral data obtained from the Commensal Radio Astronomy FAST Survey (CRAFTS). We have carefully assembled a specialized dataset, HISF, rich in Hi sources and specifically designed to enhance the detection process. Our model, Unet-LK, builds on the 3D-Unet segmentation architecture and employs an elongated convolution kernel to effectively capture the intricate structures of Hi sources. This strategy ensures reliable identification and segmentation of Hi sources, achieving notable performance metrics with a recall of 91.6% and a precision of 95.7%. These results substantiate the robustness of our dataset and the effectiveness of our proposed network architecture in the precise identification of Hi sources. Our code and dataset are publicly available at https://github.com/fishszh/HISF.

keywords:
methods: data analysis – techniques: image processing – methods: observational

1 Introduction

Neutral hydrogen (Hi) is a crucial constituent of the interstellar medium. Via the 21 cm emission line of Hi, researchers can study the evolution of galaxies and the distribution of matter in the universe (Cheng et al., 2020). Hi emission lines provide vital information on the density and velocity structure of neutral gas within galaxies (Springob et al., 2005). Consequently, over the past few decades, numerous Hi surveys have been conducted to detect Hi in the local universe. Key surveys include the Hi Parkes All Sky Survey (HIPASS) (Barnes et al., 2001), which identified over 5,000 galaxies across approximately 30,000 deg², and ALFALFA (Giovanelli et al., 2005), which covered an area of approximately 7,000 deg², cataloging 31,502 galaxies. The FAST All Sky Hi Survey (FASHI) (Zhang et al., 2024) is a comprehensive endeavor to map the entire celestial sphere observable by the Five-hundred-meter Aperture Spherical radio Telescope (FAST), with a focus on the declination span from −14° to +66°. Its initial data release cataloged 41,741 extra-galactic Hi sources. In parallel, the Commensal Radio Astronomy FAST Survey (CRAFTS) (Li et al., 2018) leverages the same observational parameters as FAST but with an extended purview that encompasses a variety of radio astronomy targets. This includes not only Hi sources but also pulsars and fast radio bursts (FRBs), showcasing the versatility and depth of FAST's observational capabilities.

With the progression of observational technologies and equipment upgrades, a substantial volume of high-quality astronomical observation data has been generated through various sky surveys. However, the processing of these vast datasets imposes stringent requirements on both efficiency and accuracy, which conventional methodologies struggle to fulfill.

In response to this challenge, scientists have embarked on exploring the integration of machine learning into the data processing of astronomical observations (Baron, 2019). A variety of machine learning-driven methods have been employed in diverse applications within astronomy, such as detecting tidal features (Desmons et al., 2023), light curve classification (Cui et al., 2023; Tey et al., 2023), source detection (Liang et al., 2023), spectrum classification (Tan et al., 2022), and Radio Frequency Interference (RFI) mitigation (Akeret et al., 2017; Sun et al., 2022; Xiao et al., 2022).

In this study, we utilized CRAFTS observational data to systematically organize and construct a dedicated Hi source dataset, HISF. To achieve high-precision identification and segmentation of Hi sources, we deployed a 3D Unet deep learning model, referred to as Unet-LK, featuring an elongated convolution kernel that effectively extracts and segments Hi sources from complex spectral data cubes. The primary objective of this work is to enhance the accuracy and efficiency of Hi source detection by utilizing deep learning technology. This endeavor aims to validate the application potential of deep learning in astronomical data processing. Furthermore, it provides new insights and approaches for future astronomical observations and data analysis.

The paper is structured as follows: Section 2 introduces the related Hi survey and Hi source finding works. Section 3 describes the HISF dataset selection and preparation. Section 4 details our model pipeline and experiment results. Our summary is outlined in Section 5.


Figure 1: Data processing pipeline for Hi source identification. (a) The CRAFTS 3D spectral cube data in our possession are meticulously refined from the raw data, involving critical processing steps such as RFI flagging and Doppler correction that ensure the accuracy and reliability of our observations; the details of these steps are beyond the scope of this paper. (b) Upon obtaining the 3D spectral cubes, our methodology commences with expert source identification, followed by manual annotation facilitated by 3D Slicer. Utilizing the resulting labeled dataset, we then proceed with model training for Hi source recognition. The "Fore-ground" subplot illustrates the distribution of Hi source signals within the cube, including an inset that magnifies the details of these signals. The "Mask" subplot delineates the regions of the annotated Hi sources.

2 Related Work

In previous Hi surveys, for instance HIPASS, ALFALFA and FASHI, research teams conventionally developed their own automated algorithms or employed software such as SoFiA (https://github.com/SoFiA-Admin/SoFiA-2) to identify Hi sources. These detections were subjected to further manual analysis and verification, culminating in the release of comprehensive Hi source catalogs. These catalogs encompassed essential attributes, including spatial coordinate ranges, frequency ranges, redshifts, and signal-to-noise ratios (SNR), among other key parameters. Based on this spatial and frequency domain data, researchers were generally able to conduct effective analyses of the characteristics of Hi emission lines.

In the realm of Hi source detection, extensive research efforts have been conducted. In the SKA Science Data Challenge 2, several teams devised a range of methods to identify Hi sources within a simulated dataset (Hartley et al., 2023). These approaches not only encompass conventional methodologies such as SoFiA, but also integrate machine learning techniques, such as 3D Unet for segmentation, CNNs for classification, and object detection algorithms like YOLO for Hi source characterization. Liang et al. (2023) tentatively employed the Mask R-CNN model and the PointRend approach to identify Hi signals, revealing encouraging outcomes when applied to a simulated 2D dataset. These exploratory works hint at the potential for such advanced deep learning frameworks to make a valuable contribution to the refinement and streamlining of Hi source detection.

Furthermore, 3D segmentation algorithms have proven to be highly applicable in the medical imaging domain, particularly with CT and MRI scans. The field of medical imaging has seen extensive application and development of these algorithms, leading to the establishment of several state-of-the-art (SOTA) models. Notable examples include the 3D Unet (Çiçek et al., 2016), nn-Unet (Isensee et al., 2020), FracNet (Jin et al., 2020), UX-Net (Lee et al., 2022) and Swin-UNETR (Hatamizadeh et al., 2022), which have set benchmarks in medical image segmentation. The success of these models in delineating complex anatomical structures in 3D medical images offers valuable insights for the identification of Hi sources, suggesting that the principles and techniques refined in the medical context could be adapted to enhance Hi source detection methodologies.

3 Data

Previous studies have commonly relied either on conventional algorithms coupled with manual identification or on simulated data alone, without validation against authentic observational datasets. Furthermore, a dearth of openly accessible annotated observational datasets has hindered advancements. To address this deficiency, we utilize the observational data from CRAFTS to systematically compile a novel Hi source dataset, HISF, featuring accompanying masks. This effort is intended to provide a much-needed benchmark for evaluating and enhancing Hi source identification techniques within an empirical context.

The construction of Hi source spectral data cubes from CRAFTS raw data follows a meticulously designed pipeline depicted in Figure 1, including a series of critical steps such as RFI flagging, ripple removal, baseline removal, and other essential processing measures. Owing to the inherent challenges in RFI eradication, only the most evident instances have been addressed; consequently, the generated spectral data cubes still contain a substantial amount of residual RFI, which poses a significant challenge for identifying Hi sources.

To confirm the Hi sources within the CRAFTS spectral data cube, we integrate expert verification and cross-validation methods utilizing other Hi surveys and observations in different wavebands. Currently, we have annotated data for two sky regions, as depicted in Figure 2.



Figure 2: The region depicted between the red lines reveals the overall sky coverage of CRAFTS. Notably, the regions designated as R1 and R2 are the distinct areas where we have performed data annotation.

In Region R1, we analyzed 646 3D spectral data cubes and confirmed 2050 Hi sources through expert verification and ALFALFA cross-validation. Among these, 1749 Hi sources correspond to those detected by ALFALFA. Nonetheless, these cubes still contained unprocessed RFI signals. Because of differences in frequency coverage and sensitivity between ALFALFA and CRAFTS, we also referred to the FASHI Hi source catalog.

For Region R2, drawing on the experience accumulated during data processing, we meticulously eliminated additional RFI signals commonly known to originate from electronic devices, civil aviation, and navigation satellite communications, resulting in comparatively cleaner cubes. In the process of identifying Hi sources, we initially performed a manual identification of potential signals. These signals were then scrutinized through a voting system involving five experts, with a consensus of at least three affirmative votes confirming their reliability. By cross-referencing this expert validation with counterparts from other wavebands, we successfully cataloged 469 Hi sources.

Following the approximate coordinates (frequency, R.A., Dec.) of Hi sources from the identification process, we utilized 3D Slicer (Fedorov et al., 2012) to visualize the Hi signals on three orthogonal planes (see Figure 3), and we annotated each Hi source on the R.A.-frequency plane while checking the other two planes. Figure 1 (b) displays an illustrative example of a 3D spectral cube, with the "Fore-ground" subplot showcasing the Hi source signals, including an inset that magnifies the details of these signals. The "Mask" subplot provides a clear visualization of the regions corresponding to the annotated Hi sources. The signal boundaries were not strictly defined: we focused on regions with distinct signal characteristics, and a minor inclusion of non-Hi signal areas was tolerated. As depicted in the three orthogonal planes of Figure 3, red contours are presented to accentuate the details of the annotated mask.



Figure 3: 3D Slicer is a free, open-source software package for visualization and image analysis. This is an example of visualizing 3D spectral cubes and annotating Hi sources. The Segment Editor panel is utilized for manually creating and refining (paint, draw, …) segmentations from the orthogonal planes of the 3D spectral cube. Additionally, the top right panel allows for the examination of 3D segmentations.

It is noteworthy that, due to the difference in frequency coverage between ALFALFA and CRAFTS, and despite our meticulous manual verification, the dataset might still contain a small number of instances where Hi sources are either falsely identified or inadvertently omitted.

Table 1: HISF dataset overview. Region R1 contains 646 spectral cubes and 2050 Hi sources, with a spatial resolution of 0.0167 degrees/pixel and a frequency resolution of 7.6 kHz/pixel. Each cube spans 3930 to 3932 pixels in the frequency direction, encompasses 23 pixels in the R.A. direction, and ranges from 231 to 261 pixels in the Dec. direction. Region R2 includes 157 spectral cubes and 469 Hi sources, maintaining the same spatial and frequency resolutions. Each cube in Region R2 extends from 3275 to 4325 pixels in the frequency direction, varies from 158 to 181 pixels in the R.A. direction, and spans from 191 to 248 pixels in the Dec. direction.

Region | No. Cube (Train / Valid / Test) | No. Source | Shape in pixels (Freq, R.A., Dec.)
R1     | 540 / 36 / 70                   | 2050       | (3930-3932, 23, 231-261)
R2     | 100 / 15 / 42                   | 469        | (3275-4325, 158-181, 191-248)

We conducted a basic statistical analysis of the sizes of all Hi sources (see Figure 4). From the figure, it is evident that the Hi sources occupy a significantly greater number of pixels along the frequency axis than in the spatial dimensions, thus exhibiting an elongated shape.



Figure 4: The distribution of Hi source extents across the R.A., Dec. and frequency axes, measured in pixel units, with a spatial resolution of 0.0167 degrees/pixel and a frequency resolution of 7.6 kHz/pixel. The Hi sources exhibit a pronounced elongation, with their frequency pixel span markedly exceeding the spatial dimensions.

The HISF dataset comprises 3D spectral cubes from Regions R1 and R2. For Region R1, the dataset is partitioned into 540 cubes for training, 36 for validation, and 70 for testing. Region R2 contributes 100 cubes to the training set, 15 to the validation set, and 42 to the test set, see Table 1.

In accordance with the data release policy of FAST, the data from Region R1 are anticipated to be made publicly available on HIverse (https://hiverse.alkaidos.cn/) in the near future, while the data from Region R2 will be released at a later date. The HISF dataset is formatted as pairs consisting of 3D spectral cubes and their corresponding labels.

4 Model and Experiments

In the CRAFTS spectral data cube, identifying Hi sources is regarded as a target detection problem within deep learning. Given Unet's outstanding performance in 3D image segmentation and object detection tasks (Çiçek et al., 2016; Jin et al., 2020; Isensee et al., 2020), we employ a 3D-Unet architecture as the fundamental framework for this task, as its precise segmentation facilitates subsequent Hi emission line analysis. As illustrated in Figure 5, our model pipeline consists of three stages: (a) pre-processing, (b) model training, and (c) post-processing.



Figure 5: The model pipeline for identifying Hi sources, including data pre-processing, model training, and post-processing to refine the results. Two strategies, rebin and crop, can be applied either individually or in combination, as shown in Table 2.

4.1 Model pipeline

(a) Pre-processing: Since a whole-volume 3D spectral cube can be too large to fit in regular GPU memory, we implement two strategies: (1) rebin: apply an average pooling layer with a kernel size of (6,1,1) and a stride of (4,1,1) to reduce the dimensionality of the frequency axis; (2) crop: conduct segmentation in a sliding-window manner, adopting a patch size of (1024, 32, 64) with a stride equal to half the patch size. These two strategies permit an increased batch size, thereby enhancing the efficiency of the training process.

The intensities of the input cubes were clipped to the window [-15, 35] and then normalized to the range [0, 1].
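For concreteness, a minimal sketch of these pre-processing steps is given below. It assumes the cube is held as a NumPy array ordered as (frequency, R.A., Dec.) and uses PyTorch for the pooling; the axis ordering, edge handling, and function names are our illustrative assumptions rather than the authors' implementation.

```python
import numpy as np
import torch
import torch.nn.functional as F

def rebin_frequency(cube: np.ndarray) -> np.ndarray:
    """Average-pool along the frequency axis with a (6, 1, 1) kernel and (4, 1, 1) stride."""
    x = torch.from_numpy(cube).float()[None, None]            # shape (1, 1, Freq, RA, Dec)
    x = F.avg_pool3d(x, kernel_size=(6, 1, 1), stride=(4, 1, 1))
    return x[0, 0].numpy()

def crop_patches(cube: np.ndarray, patch=(1024, 32, 64)):
    """Yield sliding-window patches with a stride of half the patch size.
    Edge windows flush to the cube boundary are omitted for brevity."""
    stride = tuple(p // 2 for p in patch)
    nf, nr, nd = cube.shape
    for f in range(0, max(nf - patch[0], 0) + 1, stride[0]):
        for r in range(0, max(nr - patch[1], 0) + 1, stride[1]):
            for d in range(0, max(nd - patch[2], 0) + 1, stride[2]):
                yield (f, r, d), cube[f:f + patch[0], r:r + patch[1], d:d + patch[2]]

def clip_and_normalize(cube: np.ndarray, lo: float = -15.0, hi: float = 35.0) -> np.ndarray:
    """Clip intensities to the window [-15, 35] and rescale linearly to [0, 1]."""
    return (np.clip(cube, lo, hi) - lo) / (hi - lo)
```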

Simultaneously, for data augmentation, we employ random flipping across the R.A., Dec. and frequency axes with a probability of 0.5 along each axis. We also add random Gaussian noise, with $\mu = 0$ and $\sigma \sim U(2.8, 3.8)$, statistically derived from the 3D spectral cube.

To improve the recognition of faint Hi signals, we employed a cut-mix-style technique that randomly degrades the intensity of high-SNR Hi sources to mimic weaker signals: sources are extracted via the labeled mask and their intensity is scaled to 30-80% of the original. This augments our dataset with a variety of signal strengths.
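A minimal sketch of these augmentations is shown below. How often the intensity degradation is applied, and whether it acts per source or over the whole mask, are our assumptions; only the flip probability, noise range, and 30-80% scaling follow the text.

```python
import numpy as np

rng = np.random.default_rng()

def augment(cube: np.ndarray, mask: np.ndarray):
    """Random flips, additive Gaussian noise, and degradation of labelled sources."""
    # Flip each axis (frequency, R.A., Dec.) independently with probability 0.5.
    for axis in range(3):
        if rng.random() < 0.5:
            cube, mask = np.flip(cube, axis=axis), np.flip(mask, axis=axis)
    # Additive Gaussian noise with mu = 0 and sigma drawn from U(2.8, 3.8).
    cube = cube + rng.normal(0.0, rng.uniform(2.8, 3.8), size=cube.shape)
    # Mimic fainter signals: scale voxels inside the labelled mask to 30-80% of the original.
    if rng.random() < 0.5:                      # application probability is an assumption
        cube = np.where(mask > 0, cube * rng.uniform(0.3, 0.8), cube)
    return cube.copy(), mask.copy()
```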

(b) Model training: Because the frequency resolution of the CRAFTS spectral cube data is significantly higher than the spatial resolution, Hi sources span approximately ten pixels in space but encompass hundreds of pixels along the frequency axis (see Figure 4). To address this disparity, we employ a four-layer 3D Unet network, named Unet-LK (Figure 6), utilizing an elongated convolution kernel of size (7,3,3) along the frequency axis to capture additional contextual information.



Figure 6: The neural network architecture of Unet-LK is based on a four-layer 3D Unet, featuring an elongated convolution kernel of size (7, 3, 3) along the frequency axis.
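To make the elongated kernel concrete, a sketch of one such convolution block is given below in PyTorch. The channel widths, normalization, and activation choices are illustrative assumptions; only the (7, 3, 3) kernel with (3, 1, 1) "same" padding along the frequency axis follows the paper. Stacking four such blocks with down- and up-sampling in a standard encoder-decoder yields the overall layout of Figure 6.

```python
import torch
import torch.nn as nn

class ElongatedConvBlock(nn.Module):
    """One convolution block using the (7, 3, 3) kernel elongated along the frequency axis."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, kernel_size=(7, 3, 3), padding=(3, 1, 1)),
            nn.InstanceNorm3d(out_ch),          # normalization choice is an assumption
            nn.LeakyReLU(inplace=True),
            nn.Conv3d(out_ch, out_ch, kernel_size=(7, 3, 3), padding=(3, 1, 1)),
            nn.InstanceNorm3d(out_ch),
            nn.LeakyReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (N, C, Freq, R.A., Dec.)
        return self.block(x)

# The padding keeps the patch size unchanged, e.g. (1, 1, 256, 32, 64) -> (1, 16, 256, 32, 64).
print(ElongatedConvBlock(1, 16)(torch.randn(1, 1, 256, 32, 64)).shape)
```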

Throughout the training process, we utilize the Adam optimizer with a learning rate decaying exponentially from 0.01 to 0.0005, and combine the dice loss ($L_{Dice}$) and binary cross-entropy loss ($L_{Cross}$) functions (Equation 1) to train the network. The model is trained for a total of 600 epochs with a batch size of 2 on an NVIDIA A40 GPU.

$Loss = L_{Dice} + 0.5\,L_{Cross}$ (1)
$Dice = \frac{2}{N}\frac{\sum_{i}^{N} y_{i}\cdot\hat{y}_{i}+\epsilon}{\sum_{i}^{N} y_{i}+\sum_{i}^{N}\hat{y}_{i}+\epsilon}$ (2)
$L_{Dice} = 1 - Dice$ (3)

where $N$ is the number of spectral cubes, $y_{i}$ is the masked label, and $\hat{y}_{i}$ is the model prediction output.
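A sketch of this combined loss in PyTorch follows. The per-sample flattening, the sigmoid applied to the network logits, and the value of ε are our assumptions; the weighting of 0.5 on the cross-entropy term follows Equation 1.

```python
import torch
import torch.nn.functional as F

def dice_loss(prob: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Soft Dice loss averaged over the batch (Equations 2 and 3)."""
    prob, target = prob.flatten(1), target.flatten(1)        # (N, voxels)
    inter = (prob * target).sum(dim=1)
    dice = (2.0 * inter + eps) / (prob.sum(dim=1) + target.sum(dim=1) + eps)
    return 1.0 - dice.mean()

def combined_loss(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Loss = L_Dice + 0.5 * L_Cross (Equation 1); target is a float {0, 1} mask."""
    bce = F.binary_cross_entropy_with_logits(logits, target)
    return dice_loss(torch.sigmoid(logits), target) + 0.5 * bce
```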

(c) Post-processing: To generate the segmentation outcomes, we average the predictions over areas of overlap and binarize the result using a threshold of 0.5. To efficiently reduce false positives, predictions smaller than 300 voxels are filtered out.
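A sketch of this post-processing is shown below, assuming the per-window probabilities have been accumulated into a running sum array and a count array; the connectivity used for component labelling is an assumption.

```python
import numpy as np
from scipy import ndimage

def postprocess(prob_sum: np.ndarray, counts: np.ndarray, min_voxels: int = 300) -> np.ndarray:
    """Average overlapping window predictions, binarize at 0.5, and drop tiny detections."""
    prob = prob_sum / np.maximum(counts, 1)      # mean probability where windows overlap
    binary = prob >= 0.5
    labels, n_comp = ndimage.label(binary)       # connected components (default face connectivity)
    for comp in range(1, n_comp + 1):
        component = labels == comp
        if component.sum() < min_voxels:         # filter predictions smaller than 300 voxels
            binary[component] = False
    return binary
```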

4.2 Experiments

To enhance the comprehensiveness and depth of our comparative analysis, we also ran SoFiA on the HISF dataset with the following setup: a detection threshold of 5σ; smoothing kernels kernelsXY = 0, 3, 6 and kernelsZ = 0, 3, 7, 15; a minimum source size of 5 pixels in both XY and Z space; and a maximum size of 50 pixels in XY space, with no limit in Z space.

In addition, we employed two state-of-the-art (SOTA) frameworks from 3D medical image segmentation, namely Swin-UNETR and UX-Net. Both native and rebinned resolutions were considered for the input volume data. We maintain a balanced 1:1 ratio of positive and negative samples, ensuring each class is adequately represented during training. Specifically, positive samples are cropped in a manner that ensures they encapsulate at least half of the Hi source, whereas negative samples are randomly extracted within the confines of the spectral data cube. This strategy allows the model to learn more effectively from the target areas and enhances its segmentation performance.
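As an illustration of this balanced sampling, the sketch below centres positive patches on a randomly chosen labelled voxel (a simplification of the "at least half the source" rule) and draws negatives uniformly; the function names and details are ours, not the authors'.

```python
import numpy as np

rng = np.random.default_rng()

def sample_patch(cube: np.ndarray, mask: np.ndarray, patch=(1024, 32, 64), positive=True):
    """Draw one training patch; positives are centred on a randomly chosen labelled voxel."""
    if positive:
        zs, ys, xs = np.nonzero(mask)            # assumes the cube contains labelled sources
        k = rng.integers(len(zs))
        centre = (int(zs[k]), int(ys[k]), int(xs[k]))
    else:
        centre = tuple(int(rng.integers(s)) for s in cube.shape)
    origin = [int(np.clip(c - p // 2, 0, max(s - p, 0)))
              for c, p, s in zip(centre, patch, cube.shape)]
    sl = tuple(slice(o, o + p) for o, p in zip(origin, patch))
    return cube[sl], mask[sl]                    # padding to the full patch size omitted

# A balanced batch alternates positive and negative draws to keep the 1:1 ratio.
```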


Figure 7: A visualization of the predicted segmentation for one Hi source from the test dataset. The top panels compare the segmentation masks of all methods in a 3D view. The bottom left panels show the segmentation details on three frequency-position slices. The bottom right panel presents a comparison of the smoothed Hi emission lines. The presence of negative flux in the spectra is primarily attributed to insufficient baseline removal during the batch processing phase; Figure 8 illustrates three examples that are free from this issue.


Figure 8: Same as Figure 7, but for three additional segmentation examples in which SoFiA failed to detect all three Hi sources, together with two examples of SoFiA false positives.

4.3 Results

We employ a segmentation algorithm to accomplish the detection task, hence we utilize both the Dice (Equation 2) and Intersection over Union (IoU, Equation 4) metrics to evaluate the model's performance. Given the high threshold configuration within SoFiA, which predominantly identifies Hi sources with high SNRs, we have adopted an IoU of at least 0.2 as the criterion for a successful detection.

$IoU = \frac{\mathrm{Area\ of\ Intersection}}{\mathrm{Area\ of\ Union}}$ (4)
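Concretely, the voxel-wise IoU and the detection criterion can be computed as in the sketch below; how predicted components are paired with catalogued sources before this comparison is our simplification.

```python
import numpy as np

def voxel_iou(pred: np.ndarray, truth: np.ndarray) -> float:
    """Voxel-wise Intersection over Union between two binary masks (Equation 4)."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    union = np.logical_or(pred, truth).sum()
    return float(np.logical_and(pred, truth).sum() / union) if union else 0.0

def is_detected(pred_mask: np.ndarray, truth_mask: np.ndarray, threshold: float = 0.2) -> bool:
    """A source counts as detected when its IoU with the annotation reaches 0.2."""
    return voxel_iou(pred_mask, truth_mask) >= threshold
```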

As illustrated in Figure 7 and Table 2, our method attains a recall of 91.6% and a precision of 95.7%, which distinctly surpasses the performance of the commonly employed SoFiA. Notably, our approach delivers not only high recognition precision but also an acceptable level of segmentation quality. The dice coefficients for our method reach 78.4% on the training set, 74.3% on the validation set, and 72.6% on the test set.

Relative to the performances achieved by Swin-UNETR and UX-Net, our proposed method displays enhanced results in both recall and precision, possibly because the elongated morphology of Hi sources necessitates a larger global receptive field. By addressing this need, our approach appears more adept at handling such structural intricacies, leading to improved recognition and segmentation outcomes. These results substantiate the stability and generalization capabilities of our network, which exhibits strengths in both precise source identification and effective data segmentation.

It is worth noting that the preliminary data processing, from raw data to cube data, significantly impacts the detection performance of SoFiA, with RFI being a primary cause of low precision. Since SoFiA can only identify sources with positive values, inadequate baseline removal can greatly affect its recall rate. Nonetheless, deep learning models are equipped to mitigate these issues to a considerable extent.

Table 2: A performance comparison of SoFiA, Unet-LK, Swin-UNETR and UX-Net on the test set, which contains 368 Hi sources. Due to limitations in GPU memory, crop and/or rebin techniques were employed during the pre-processing phase for Unet-LK, UX-Net, and Swin-UNETR. Given the high threshold configuration within SoFiA, we adopt IoU (Equation 4) ≥ 0.2 as the detection criterion.
Method                  | Param | IoU   | Dice  | Recall | Precision | TP (Number) | FP (Number)
SoFiA                   |  --   | 1.5%  | 2.9%  | 64.2%  | 2.3%      | 236         | 10036
UX-Net (rebin)          | 22.5M | 56.0% | 71.8% | 89.4%  | 93.5%     | 329         | 23
UX-Net (crop)           | 22.5M | 49.7% | 66.4% | 89.7%  | 78.9%     | 330         | 88
Swin-UNETR (crop)       | 15.5M | 45.6% | 62.7% | 90.5%  | 51.2%     | 333         | 317
Swin-UNETR (rebin+crop) | 15.5M | 47.8% | 64.7% | 85.9%  | 90.3%     | 316         | 34
Unet-LK (rebin)         | 7.2M  | 59.1% | 74.3% | 91.6%  | 95.7%     | 337         | 15
(IoU and Dice are segmentation metrics; Recall, Precision, TP and FP are detection metrics.)

5 Summary

In this work, we propose a novel method for Hi source detection that harnesses a 3D-Unet segmentation network to accurately identify and segment Hi sources. Experimental results demonstrate strong performance on our test set, achieving high recall (91.6%) and precision (95.7%), while maintaining good consistency across different datasets.

Compared to the SoFiA software, our proposed method demonstrates a significant improvement in recognition precision and attains satisfactory segmentation outcomes on the HISF dataset. It exhibits superior generalization capabilities, effectively mitigating the impact of RFI and other data processing artifacts to a certain extent. Comparative analysis with state-of-the-art network architectures such as Swin-UNETR and UX-Net indicates that customizing the network architecture according to the specific attributes of the data and target features is a critical factor in optimizing the overall performance of the model. This not only validates the efficacy of our adopted method but also highlights the value of our tailored HISF dataset in enhancing the precision and efficiency of Hi source detection tasks.

Additionally, the meticulously constructed and annotated HISF dataset will play a pivotal role in future Hi source identification tasks. The dataset offers a comprehensive collection of Hi source instances, covering a wide range of observing conditions and signal strengths, with a particular emphasis on cases where Hi sources are difficult to discern amidst complex background noise and low-SNR environments. Our meticulous manual annotation process guarantees the authenticity and completeness of each source within the HISF dataset, a critical factor for the training and validation of Hi source identification algorithms.

Despite its promising performance, the proposed method has potential for refinement. Improving the model's sensitivity to low-SNR Hi sources is one notable direction. Additionally, noise and data variability in Hi datasets might affect generalizability across diverse environments. Future work could thus focus on refining pre-processing techniques to handle these complexities and on enhancing network resilience to SNR variations.

Furthermore, given the success with our custom HISF dataset and architecture, future directions include expanding the dataset diversity, developing adaptive learning strategies, and exploring ways to integrate extra contextual information to boost the accuracy of Hi source identification and segmentation.

In conclusion, the promising outcomes of this research have not only made a substantial contribution to the advancement of Hi source detection methodologies, but also revealed an expanded scope of potential applications within the critical task of extracting and analyzing complex sources in the realms of radio astronomy and its associated domains.

Acknowledgements

This work was supported by the National Key R&D Program of China (Nos. 2023YFE0110500 and 2022YFB4501405), the National Natural Science Foundation of China (grant No. 12373026), and the Leading Innovation and Entrepreneurship Team of Zhejiang Province of China (grant No. 2023R01008).

Data Availability

The labeled CRAFTS Data used in this paper will be available in the near future, and the data access URLs will be synchronized on HIverse and GitHub https://github.com/fishszh/HISF.

References