
Offline data processing in the First JUNO Data Challenge

Tao Lin¹, Weiqing Yin¹,² (on behalf of the JUNO Collaboration)
¹ Institute of High Energy Physics, Beijing, China
² University of Chinese Academy of Sciences, Beijing, China
[email protected]
Abstract

The Jiangmen Underground Neutrino Observatory (JUNO) is currently under construction, and the installation of the detector will be completed by the end of 2024. A series of JUNO Data Challenges are proposed to evaluate and validate the complete data processing chain in advance. In this contribution, the offline data processing in the first JUNO Data Challenge (DC-1) is presented. The primary goal of DC-1 is to process one week of data using the conditions database and multi-threaded reconstruction. The workflow involves the production of simulated data and its reconstruction. To achieve these goals, a JUNO Hackathon was organized. The software performance is measured and the results are presented.

1 Introduction

The Jiangmen Underground Neutrino Observatory (JUNO) is a multipurpose neutrino experiment with the primary goals of determining the neutrino mass ordering and precisely measuring oscillation parameters [1, 2, 3]. Currently under construction in Southern China, it comprises a central detector (CD) for neutrino detection, and a water pool (WP) and a top tracker (TT) for cosmic ray muon measurement. The innermost part of the CD is 20 kton of liquid scintillator (LS), surrounded by 17,612 20-inch and 25,600 3-inch photomultiplier tubes (PMTs). The WP is equipped with 2,400 20-inch PMTs and serves as a veto system for cosmic ray muons. On top of the WP, the TT also measures muons. The installation of all the PMTs and readout electronics will be completed by the end of 2024; afterwards, during detector filling, JUNO will start commissioning all the readout channels to test the full DAQ and data processing chains.

Figure 1 shows a schematic view of data processing in JUNO. When data taking starts, the event rate is about 1 kHz. The detector produces waveforms from thousands of channels at a sampling rate of 1 GHz. To reduce the huge data volume, an additional system named Online Event Classification (OEC) is applied after the trigger to reduce the event size according to the event type. Unlike the trigger system, which discards events, the OEC retains all events. Approximately 60 MB/s of byte-stream RAW data, amounting to 2 PB per year, is expected to be produced. The RAW data is transferred from the experiment site to the IHEP data center via a dedicated network, where it is preprocessed and converted to ROOT-based RAW (RTRAW) data using the JUNO Event Data Model [4] and ROOT I/O [5]. Both types of data are replicated to the other data centers through the Distributed Computing Infrastructure (DCI) [6]. To minimize disk volume, the RAW data is archived to a tape library, while the RTRAW data is stored on disk. After event reconstruction of the RTRAW data, the output is stored in ESD (Event Summary Data) format. The data processing involves several critical components, including data quality monitoring (DQM), keep-up reconstruction (KUP) and physics production (PP).

Refer to caption
Figure 1: Baseline scheme of offline data processing
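
As a rough illustration of the RAW-to-RTRAW conversion described above, the following sketch writes decoded events into a ROOT tree with PyROOT. The record layout, branch names and the flattened waveform storage are illustrative assumptions and do not reproduce the actual JUNO Event Data Model.

```python
# A minimal sketch of converting decoded RAW records into a ROOT-based
# RTRAW file with PyROOT.  Branch names, the record layout and the
# decoding step are assumptions for illustration only.
from array import array
import ROOT

def convert_raw_to_rtraw(raw_records, out_name="run0001.rtraw.root"):
    fout = ROOT.TFile(out_name, "RECREATE")
    tree = ROOT.TTree("Events", "RTRAW events")

    trigger_time = array("d", [0.0])           # event trigger time [s]
    channel_id = ROOT.std.vector("int")()      # fired channel IDs
    waveform = ROOT.std.vector("int")()        # ADC samples (flattened here for simplicity)
    tree.Branch("triggerTime", trigger_time, "triggerTime/D")
    tree.Branch("channelId", channel_id)
    tree.Branch("waveform", waveform)

    for rec in raw_records:                    # rec: dict decoded from the byte stream
        trigger_time[0] = rec["trigger_time"]
        channel_id.clear()
        waveform.clear()
        for ch, samples in rec["waveforms"].items():
            channel_id.push_back(ch)
            for s in samples:                  # 1 GHz samples within the readout window
                waveform.push_back(s)
        tree.Fill()

    fout.Write()
    fout.Close()
```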

2 The first JUNO Data Challenge (DC-1)

A series of JUNO Data Challenges (DC) have been proposed to evaluate and validate the complete data processing chain in advance. The JUNO DC not only tests the event reconstruction software, but also serves as a system test for the database, the Kafka-based [7] data pipeline, DQM, KUP, PP, etc. The estimates of computing and storage capacities are also validated through the JUNO DC.

JUNO DC-1 focuses on data processing within the central detector. About one week of inclusive data is produced and then reconstructed. Radioactivity, cosmic ray muon and neutrino events are simulated in advance. The rates of radioactivity and muon events are set to those expected in real data, while the neutrino rate is increased from 60 events per day to 4 Hz. Increasing the neutrino event rate allows the reconstruction algorithms to be tested with higher statistics. In order to test the conditions database later, seven sets of time offsets are added to the channels during simulation.
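
The time-based mixing of the pre-simulated samples can be pictured as merging several Poisson processes, one per event source. The sketch below is a simplified stand-alone example rather than the JUNO mixing code: it draws event timestamps for each source and merges them in time order. The radioactivity and muon rates are placeholders, while the 4 Hz neutrino rate is the increased DC-1 value quoted above.

```python
# A minimal sketch of rate-based event mixing: each source is modelled as a
# Poisson process, and the timestamps are merged in time order.  The rates
# below are placeholders except the 4 Hz neutrino rate used in DC-1.
import heapq, random

RATES_HZ = {
    "radioactivity": 1000.0,   # placeholder, not the official value
    "muon":          4.0,      # placeholder, not the official value
    "neutrino":      4.0,      # increased rate used in DC-1
}

def draw_event_times(duration_s, rates=RATES_HZ, seed=42):
    """Return time-ordered (time, source) pairs for one mixed period."""
    rng = random.Random(seed)
    streams = []
    for source, rate in rates.items():
        t, times = 0.0, []
        while True:
            t += rng.expovariate(rate)    # exponential gaps -> Poisson process
            if t > duration_s:
                break
            times.append((t, source))
        streams.append(times)
    return list(heapq.merge(*streams))    # merge already-sorted streams by time

if __name__ == "__main__":
    mixed = draw_event_times(60.0)        # one minute, for illustration
    print(mixed[:5])
```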

As shown in Figure 2, there are two major steps in the workflow. The first step uses the simulation software to generate RTRAW files. The subsequent step reconstructs these RTRAW files on the different systems and produces ESD files. The conditions database is also exercised during this step.

Refer to caption
Figure 2: Workflow in DC-1

The JUNO simulation software [8, 9] incorporates Geant4-based [10, 11, 12] detector simulation and electronics simulation with OEC. As previously mentioned, the existing detector simulation datasets, which serve as inputs to the electronics simulation, are produced in advance and are mixed according to the event rates. The electronics simulation generates pulses for all detector channels, which are subsequently digitized into waveforms after the trigger. Unlike in collider experiments, time correlation is crucial, which requires that events not be discarded. To reduce the data volume, the OEC is employed to classify the event types using fast reconstruction results and to select a storage strategy for each event. Multiple events within a given time window are used for event type classification, a process known as time correlation analysis. As shown in Figure 3, the waveforms are first reconstructed to time and charge (t/q) information in the OEC. This t/q information is then calibrated, and the events are reconstructed with the calibrated t/q information. The OEC determines whether the waveform or the t/q information should be stored in the final output file. For instance, waveforms are stored for neutrino events, while only the t/q information is stored for the other types. For the t/q stream, only the uncorrected data is stored in the file, permitting offline correction with the conditions database at a later stage.

Refer to caption
Figure 3: Workflow in ElecSim and OEC
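
The storage-strategy decision at the end of the OEC chain can be summarized by a simple rule. The sketch below is an illustrative stand-in for that logic, with assumed event categories and data structures: neutrino candidates (or events time-correlated with one) keep their waveforms, while all other events keep only the uncalibrated t/q pairs.

```python
# A minimal sketch of the storage-strategy step in OEC, under assumed event
# categories and data structures; not the actual JUNO OEC implementation.
from dataclasses import dataclass, field

@dataclass
class OECEvent:
    trigger_time: float                 # seconds
    event_type: str                     # from fast reconstruction, e.g. "nu", "muon", "bkg"
    waveforms: dict = field(default_factory=dict)   # channel -> ADC samples
    tq: dict = field(default_factory=dict)          # channel -> (time, charge), uncalibrated

def storage_payload(event, events_in_window):
    """Choose what to persist for one event, given its time-correlation window."""
    # Illustrative rule: keep waveforms for neutrino candidates (or anything
    # time-correlated with one); keep only the raw t/q pairs otherwise.
    correlated_nu = any(e.event_type == "nu" for e in events_in_window)
    if event.event_type == "nu" or correlated_nu:
        return {"stream": "waveform", "data": event.waveforms}
    return {"stream": "tq", "data": event.tq}
```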

A JUNO Hackathon was organized with the aim of migrating the reconstruction algorithms and the conditions database from a serial version to a multi-threaded version. Issues encountered during testing were addressed and resolved. Profiling tools, such as Intel VTune [13], were used to identify bottlenecks. Figure 4 illustrates one of the issues, related to low CPU usage, as well as the CPU usage after optimization. A significant issue was the internal use of locks during event processing, which resulted in multiple threads being blocked when attempting to acquire them.

Refer to caption
Refer to caption
Figure 4: (a) Low CPU usage due to an issue in multi-threaded track reconstruction. (b) Execution time before and after the issue was fixed.
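
The lock issue shown in Figure 4 follows a common pattern: a shared scratch buffer protected by a single lock serializes all worker threads. The real algorithms are C++ and run within the JUNO framework; the Python sketch below only illustrates the pattern and its typical fix, replacing the shared locked buffer with thread-local storage.

```python
# A minimal sketch of lock contention and its removal; illustrative only,
# not the JUNO reconstruction code.
import threading

# Before: every event must take the same lock to use a shared scratch buffer.
shared_lock = threading.Lock()
shared_buffer = []

def process_event_locked(event):
    with shared_lock:                 # all worker threads serialize here
        shared_buffer.clear()
        shared_buffer.extend(event)   # stand-in for expensive per-event work
        return sum(shared_buffer)

# After: each thread keeps its own scratch buffer, so no lock is needed.
tls = threading.local()

def process_event_threadlocal(event):
    if not hasattr(tls, "buffer"):
        tls.buffer = []
    tls.buffer.clear()
    tls.buffer.extend(event)          # same work, without contention
    return sum(tls.buffer)
```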

Dedicated computing resources are employed in JUNO DC-1. Since data taking has not yet started, the dedicated DQM cluster is used for testing purposes. This cluster comprises 36 computing nodes with a total of 2304 cores, managed by the HTCondor system [14]. A total of 576 job slots are allocated, each equipped with 4 cores and 15 GB of memory.
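
For reference, a job matching the 4-core, 15 GB slot layout could be described with the HTCondor Python bindings as in the sketch below. The wrapper script, input file and log paths are placeholders, and the modern Schedd.submit interface (HTCondor 9 or later) is assumed; this is not the actual DC-1 submission script.

```python
# A minimal sketch (not the DC-1 submission script) of describing one
# 4-core / 15 GB reconstruction job with the HTCondor Python bindings.
import htcondor

job = htcondor.Submit({
    "executable":     "run_reco.sh",            # placeholder wrapper script
    "arguments":      "dc1_0001.rtraw.root",    # placeholder RTRAW input file
    "request_cpus":   "4",
    "request_memory": "15GB",
    "output":         "log/reco_$(Cluster)_$(Process).out",
    "error":          "log/reco_$(Cluster)_$(Process).err",
    "log":            "log/reco_$(Cluster).log",
})

schedd = htcondor.Schedd()
result = schedd.submit(job, count=1)            # in practice, one job per RTRAW file
print("Submitted cluster", result.cluster())
```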

3 Software performance

3.1 RTRAW production

One week of simulated RTRAW data has been generated in DC-1. Each RTRAW file contains about 851 events within a 6-second interval. The time interval of each job is determined based on the memory usage and CPU time. One of the challenges encountered in RTRAW production is memory consumption: since the electronics simulation and OEC run together, multiple events are cached in memory for event classification. These tasks are therefore executed on large-memory computing nodes. Figure 5 shows the performance of generating RTRAW data. On average, each job consumes 1302 seconds of CPU time and 5.4 GB of memory. Muon shower events lead to higher memory usage.

Refer to caption
Refer to caption
Figure 5: Performance of generating 6s RTRAW data. (a) Execution time of the simulation. (b) The maximum memory usage during simulation.

3.2 Serial reconstruction

The performance of the serial reconstruction is evaluated as a benchmark. As shown in Figure 6, the average CPU time required to reconstruct 6-second RTRAW data is 5553 s, with an average memory usage of 2.4 GB. The mean reconstruction speed is about 6.53 seconds per event. At 1 kHz, reconstructing an 80-second interval of real data would therefore take about 6 days of CPU time (6.53 s/event × 80,000 events ≈ 5.2 × 10⁵ s). Consequently, it is necessary to develop multi-threaded reconstruction algorithms to reduce the processing time.

Refer to caption
Refer to caption
Figure 6: Performance of serial reconstruction. (a) Execution time. (b) Memory usage.

3.3 Multi-threaded reconstruction

The performance of the multi-threaded reconstruction is evaluated using 4 CPU cores. As shown in Figure 7, the execution time is reduced to a quarter of that required for serial reconstruction. The second peak in the figure is due to the scheduling of jobs on nodes with different CPU types within the computing center. The total memory usage is less than 8 GB, which is lower than the combined memory usage of four separate serial processes.

Given the variability in processing times for different event types and energies, output can be delayed while an earlier event is still being processed. Therefore, the jobs are configured with two output modes. In the “global output” mode, events are cached in memory in the correct time order, and the data is saved sequentially into a single file. In contrast, the “output in thread” mode first saves the processed events from different threads into separate files, which are then merged and sorted at the end of the job. Figure 7 presents the results for both scenarios. The time consumption in the “output in thread” mode is lower than in the “global output” mode; therefore, the “output in thread” mode is chosen for official data production.

Refer to caption
Refer to caption
Figure 7: Performance of multi-threaded reconstruction with 4 CPU cores. (a) Execution time. (b) Memory usage. Note about the second peak in Figure 7 (a): some jobs are scheduled to other computing nodes with different CPU models.
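
The difference between the two output modes can be sketched with plain Python containers; the event dictionaries and the in-memory merge below are illustrative stand-ins for the actual ROOT-based output. In the “global output” mode a single time-ordered buffer is written sequentially, while in the “output in thread” mode each thread's output is merged and time-sorted only at the end of the job.

```python
# A minimal sketch of the two output modes described above; plain lists
# stand in for the ROOT-based output files used in the real jobs.
import heapq

def global_output(reconstructed_events):
    """Cache all events in one buffer, keep time order, write sequentially."""
    return sorted(reconstructed_events, key=lambda e: e["trigger_time"])

def output_in_thread(per_thread_events):
    """Each thread produced its own output; merge and time-sort at job end."""
    ordered_streams = [sorted(evts, key=lambda e: e["trigger_time"])
                       for evts in per_thread_events]
    return list(heapq.merge(*ordered_streams,
                            key=lambda e: e["trigger_time"]))
```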

4 Conclusions and plans

JUNO DC-1 is the first test of the data processing chain, beginning with RTRAW data and mimicking real data processing. Multi-threaded algorithms have been developed and tested. The conditions database has also been used in both the local cluster and the DCI. All primary goals have been successfully met, and checks on the produced data are ongoing.

However, certain aspects of DC-1 still require enhancement. For instance, only the reconstruction algorithms for the central detector have been tested. These aspects will be addressed in upcoming rounds of the JUNO DC.

Acknowledgements

This work is supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant No. XDA10010900), the National Natural Science Foundation of China (Grant No. 12375195), and the Youth Innovation Promotion Association, CAS.

References

  • [1] An F et al. (JUNO) 2016 J. Phys. G43 030401 (Preprint 1507.05613)
  • [2] Djurcic Z et al. (JUNO) 2015 (Preprint 1508.07166)
  • [3] Abusleme A et al. (JUNO) 2022 Prog. Part. Nucl. Phys. 123 103927 (Preprint 2104.02565)
  • [4] Li T, Xia X, Huang X, Zou J, Li W, Lin T, Zhang K and Deng Z 2017 Chin. Phys. C 41 066201 (Preprint 1702.04100)
  • [5] Brun R and Rademakers F 1997 Nucl. Instrum. Meth. A 389 81–86
  • [6] Zhang X (JUNO) 2024 EPJ Web Conf. 295 04030
  • [7] Garg N 2013 Apache Kafka (Packt Publishing) ISBN 1782167935
  • [8] Lin T et al. 2023 Eur. Phys. J. C 83 382 [Erratum: Eur.Phys.J.C 83, 660 (2023)] (Preprint 2212.10741)
  • [9] Lin T, Zou J, Li W, Deng Z, Fang X, Cao G, Huang X and You Z (JUNO) 2017 J. Phys. Conf. Ser. 898 042029 (Preprint 1702.05275)
  • [10] Agostinelli S et al. (GEANT4) 2003 Nucl. Instrum. Meth. A 506 250–303
  • [11] Allison J et al. 2006 IEEE Trans. Nucl. Sci. 53 270
  • [12] Allison J et al. 2016 Nucl. Instrum. Meth. A 835 186–225
  • [13] Intel® VTune™ Profiler https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler.html
  • [14] HTCondor Team HTCondor https://github.com/htcondor/htcondor