Hierarchical Roofline Analysis:
How to Collect Data using Performance Tools
on Intel CPUs and NVIDIA GPUs
Abstract
This paper surveys a range of methods for collecting the performance data needed for hierarchical Roofline analysis on Intel CPUs and NVIDIA GPUs. As of mid-2020, two vendor performance tools, Intel Advisor and NVIDIA Nsight Compute, have integrated Roofline analysis into their supported feature sets. This paper fills the gap for when these tools are not available, or when users would like a more customized workflow for certain analyses. Specifically, we discuss how to use Intel Advisor, RRZE LIKWID, Intel SDE and Intel VTune on Intel architectures, and nvprof, Nsight Compute metrics, and Nsight Compute section files on NVIDIA architectures. These tools are used to collect information for as many levels of the memory hierarchy as possible, in order to provide insights into an application's data reuse and cache locality characteristics.
Index Terms:
hierarchical Roofline analysis, performance data collection, performance tools, Intel CPUs, NVIDIA GPUs
I Introduction
The Roofline performance model [1] offers an insightful and intuitive way to extract key computational characteristics of applications in high-performance computing (HPC). Its ability to abstract away the complexity of modern memory hierarchies and guide performance analysis and optimization efforts has gained it popularity in recent years.
Roofline is a throughput-oriented model centered around the interplay between computational capability, memory bandwidth, and data locality. Data locality refers to the reuse of data once it has been loaded from main memory, and it is commonly expressed as the arithmetic intensity (AI), the ratio between the floating-point operations performed and the data moved (FLOPs/Byte). The attainable performance (GFLOP/s) is bounded by the following two terms:
$$
\mathrm{GFLOP/s} \leq \min
\begin{cases}
\text{Peak GFLOP/s}\\
\text{Peak GB/s} \times \text{Arithmetic Intensity}
\end{cases}
\tag{1}
$$
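For illustration, Eq. (1) can be evaluated directly once the two ceilings are known; the following minimal Python sketch uses placeholder peak numbers, not measured machine characteristics (those should come from machine characterization, e.g. the Empirical Roofline Toolkit [12]).

```python
def roofline_bound(ai_flops_per_byte, peak_gflops, peak_gbs):
    """Attainable GFLOP/s per Eq. (1): min of compute peak and bandwidth * AI."""
    return min(peak_gflops, peak_gbs * ai_flops_per_byte)

# Hypothetical peaks, for illustration only.
peak_gflops = 2000.0   # double-precision compute ceiling (GFLOP/s)
peak_gbs = 400.0       # memory bandwidth ceiling (GB/s)

for ai in [0.5, 5.0, 50.0]:
    bound = roofline_bound(ai, peak_gflops, peak_gbs)
    print(f"AI = {ai:5.1f} FLOPs/Byte -> bound = {bound:.1f} GFLOP/s")
```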
Conventionally, the Roofline model focuses on one level of the memory system, but in recent years it has been extended to the entire memory hierarchy, giving rise to the hierarchical Roofline model. The hierarchical Roofline helps in understanding cache reuse and data locality, and provides additional insight into how efficiently an application utilizes the memory subsystem. It has been integrated into Intel Advisor [2, 3] and NVIDIA Nsight Compute [4, 5]. Even though these should be the go-to methods for Roofline analysis, in this paper we present a few other tools and workflows for the sake of flexibility and generality.
We discuss the use of Intel Advisor [2], RRZE LIKWID [6], Intel SDE [7] and Intel VTune [8] on Intel CPUs, and nvprof [9], Nsight Compute metrics, and Nsight Compute section files [5] on NVIDIA GPUs. A mini-application called General Plasmon Pole (GPP) [11], extracted from the Materials Science code BerkeleyGW [10], is used for demonstration and validation purposes. Architecture-wise, we focus on the Intel Knights Landing (KNL) CPU and the NVIDIA V100 GPU.
To facilitate Roofline studies, a range of other tools has emerged as well, for example, the Empirical Roofline Toolkit (ERT) for more accurate machine characterization [12, 13], and [14, 15, 16, 17, 18] for more streamlined data collection. Beyond tool development, there are many studies applying the Roofline model in traditional HPC [19, 17, 18, 20, 21] and Machine Learning [17, 18, 22, 23], as well as extensions and refinements of the model, such as the instruction Roofline [24], the time-based Roofline [23], Roofline scaling trajectories [25], Roofline-based performance portability analysis [13], and power and energy Rooflines [26, 27].
II Application and Machine Setup
II-A Mini-Application General Plasmon Pole (GPP)
The GPP mini-application [11] is extracted from the Materials Science code BerkeleyGW [10], and it represents the work typically done on a single MPI rank. It is written in C++, and parallelized with OpenMP on the CPU and CUDA on the GPU. The computation involved in this mini-app is tensor-contraction-like: several pre-calculated complex double-precision arrays are multiplied, summed over certain dimensions, and collapsed into a small vector. The problem used in this paper is medium sized, comprising 512 electrons and 32768 plane-wave basis elements. The pseudo code for this mini-app is as follows.
do band = 1, nbands
  do igp = 1, ngpown
    do ig = 1, ncouls
      do iw = 1, nw
        load wtilde_array(ig,igp)
        load aqsntemp(ig,band)
        load eps(ig,igp)
        compute wdiff, delw, sch_array, ssx_array
        reduce on achtemp(iw), asxtemp(iw)
The real code, job scripts and results are available at [11].
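For readers who prefer a runnable form, the following is a schematic NumPy sketch of the loop structure above, with made-up array shapes and a simplified arithmetic body; it is not the actual BerkeleyGW kernel, which is written in C++/CUDA and available at [11].

```python
import numpy as np

# Toy sizes; the real problem uses 512 electrons and 32768 plane-wave basis elements.
nbands, ngpown, ncouls, nw = 8, 4, 32, 3

rng = np.random.default_rng(0)
wtilde_array = rng.random((ncouls, ngpown)) + 1j * rng.random((ncouls, ngpown))
aqsntemp     = rng.random((ncouls, nbands)) + 1j * rng.random((ncouls, nbands))
eps          = rng.random((ncouls, ngpown)) + 1j * rng.random((ncouls, ngpown))

achtemp = np.zeros(nw, dtype=np.complex128)   # small output vector (reduction target)

for band in range(nbands):
    for igp in range(ngpown):
        for ig in range(ncouls):
            for iw in range(nw):
                # Placeholder body: complex multiply/divide standing in for the
                # wdiff/delw/sch_array computation of the real kernel.
                wdiff = wtilde_array[ig, igp] - (iw + 1)
                sch = aqsntemp[ig, band] * eps[ig, igp] / wdiff
                achtemp[iw] += sch            # reduce into the small vector

print(achtemp)
```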
II-B Machine Setup
This study is conducted on the Cori supercomputer at the National Energy Research Scientific Computing Center (NERSC) at Lawrence Berkeley National Laboratory (LBNL).
Cori has three main partitions, Haswell, KNL and GPU; this study uses its KNL partition [28] and GPU chassis [29]. Each KNL node has a single-socket, 68-core Intel Xeon Phi 7250 (Knights Landing) processor, 96 GB of DDR4 memory and 16 GB of MCDRAM (HBM), with the MCDRAM configured in 'cache' mode by default. The GPU chassis is deployed primarily for the NERSC Exascale Science Applications Program (NESAP). It has 18 GPU nodes in total, and each node contains two 20-core Intel Xeon Gold 6148 (Skylake) CPUs, 384 GB of DDR4 memory, and 8 NVIDIA V100 (Volta) GPUs. Each GPU has 80 Streaming Multiprocessors (SMs) and 16 GB of HBM2 memory, and is connected to the others in a 'hybrid cube-mesh' topology.
III Methods and Results
III-A Roofline Data Collection on Intel CPUs
Intel Advisor [2] provides production-quality, fully integrated hierarchical Roofline analysis on Intel CPUs with very little user effort required. Compared to LIKWID [6], it has a higher profiling overhead due to its static instruction analysis and cache simulation. LIKWID [6] is an open-source package developed at the Regional Computing Center Erlangen (RRZE) in Germany. It provides several 'performance groups' for easier and more streamlined performance analysis, and in this paper we have identified a few of them for hierarchical Roofline data collection. LIKWID uses metrics that are based on micro-ops rather than instructions, and in some cases it does not distinguish between different vector widths, such as scalar versus AVX2/AVX-512 instructions, or masked versus unmasked vector lanes. This may cause some inaccuracy and require extra care; however, its low overhead makes it a very attractive option for large-scale application analysis. Another way to collect hierarchical Roofline data is to use Intel SDE [7] together with VTune [8]. SDE has a very high profiling overhead, but it provides the most accurate instruction counts and can produce L1 data-movement information as well. VTune, on the other hand, can be used to collect DDR/MCDRAM information to complement SDE. In the following subsections, we detail the command lines used to collect Roofline data on KNL and the resulting data.
III-A1 Intel Advisor
Advisor can be invoked as follows for Roofline analysis, and Fig. 1 shows that in GPP, the most significant function takes 2 s of 'Self-Time' and achieves 398 GFLOP/s of double-precision performance on 64 OpenMP threads. Advisor naturally provides details at the level of functions and loops, while the methods discussed below may require some code instrumentation in order to focus on particular code regions of interest.
module load advisor/2020
advixe-cl --collect=roofline --project-dir=<dir> -- ./gpp 512 2 32768 20 0
III-A2 RRZE LIKWID
LIKWID [6] is an open-source software package, and here we use its 'performance groups' FLOPS_DP, HBM_CACHE, L2 and DATA (for L1) for hierarchical Roofline data collection. Each of these groups bundles a set of raw hardware counters and derived performance metrics, without the user having to dive into the nitty-gritty of micro-architecture specifications and hardware counter details. The following commands can be used to profile with LIKWID:
module load likwid/4.3.0
groups=('FLOPS_DP' 'HBM_CACHE' 'L2' 'DATA')
for gs in ${groups[@]}
do
    likwid-perfctr -c 0-271 -g $gs ./gpp 512 2 32768 20 0
done
The raw results for GPP are as follows, and Fig. 2 shows that LIKWID produces a Roofline chart similar to Advisor's, with close arithmetic intensities and performance. The DDR-level arithmetic intensity is extremely high in Fig. 2 because the data set (1.5-2 GB) fits well into the HBM cache, leaving very little data traffic between DDR and HBM.
Time: 10.2243 secs
GFLOPS: 5051.923
MCDRAM Bytes: 742.8158 GB
DDR Bytes: 0.8883 GB
L2 Bytes: 1387.739 GB
L1 Bytes: 6456.799 GB
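The Roofline coordinates in Fig. 2 can be derived from these totals with a few lines of arithmetic. The following is a minimal post-processing sketch, assuming the GFLOPS line reports the total GFLOP executed over the run (so that dividing by the runtime yields GFLOP/s) and that the counts have already been summed across cores:

```python
# Post-processing sketch for the LIKWID totals printed above (GFLOP and GB).
time_s   = 10.2243
gflop    = 5051.923          # assumed total GFLOP from the FLOPS_DP group
bytes_gb = {"HBM": 742.8158, "DDR": 0.8883, "L2": 1387.739, "L1": 6456.799}

perf_gflops = gflop / time_s                              # GFLOP/s
ai = {lvl: gflop / gb for lvl, gb in bytes_gb.items()}    # FLOPs/Byte (G cancels)

print(f"Performance: {perf_gflops:.1f} GFLOP/s")
for lvl, x in ai.items():
    print(f"AI ({lvl}): {x:.2f} FLOPs/Byte")
```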
III-A3 Intel VTune and Intel SDE
This methodology was developed a few years before the full integration of Roofline into Advisor, and it may still be valuable to users who would like to investigate the underlying details. In this approach, SDE is used to collect the FLOPs count and L1 data movement, while VTune is used to collect uncore data movement. The commands and results for the GPP analysis in this paper are listed below, and Fig. 3 presents the combined data, which is highly consistent with the results in Fig. 2 (albeit with the L2 data missing).
# commands for SDE
sde64 -knl -d -iform 1 -omix result.sde -global_region \
      -start_ssc_mark 111:repeat -stop_ssc_mark 222:repeat \
      -- ./gpp 512 2 32768 20 0
# results from SDE
GFLOPS: 5839.811
L1 Bytes: 3795.623
# commands for VTune
module load vtune/2020
vtune -start-paused -r my-vtune.knl -collect memory-access \
      -finalization-mode=none -data-limit=0 -- ./gpp 512 2 32768 20 0
vtune -report hw-events -group-by=package -r my-vtune.knl/ \
      -format csv -csv-delimiter comma > advisor.html
# results from VTune
DDR Bytes: 0.735
MCDRAM Bytes: 594.562
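The DDR and MCDRAM byte counts above are not reported directly by VTune; they are derived from the uncore memory-controller counts in the hw-events report. A hedged sketch of that conversion, following the NERSC methodology [14] and assuming 64 bytes per memory transaction (with KNL-specific event names such as UNC_M_CAS_COUNT.RD/.WR for DDR and UNC_E_RPQ_INSERTS/UNC_E_WPQ_INSERTS for MCDRAM in cache mode), is shown below; the event names and counter values are assumptions to be verified against your VTune version.

```python
# Convert KNL uncore event counts (from the VTune hw-events report) into bytes.
# 64 bytes per transaction is the cache-line size assumed by the NERSC
# SDE/VTune Roofline methodology [14].
BYTES_PER_TRANSACTION = 64

def ddr_bytes(cas_count_rd, cas_count_wr):
    """DDR traffic from the integrated memory controller read/write CAS counts."""
    return (cas_count_rd + cas_count_wr) * BYTES_PER_TRANSACTION

def mcdram_bytes(rpq_inserts, wpq_inserts):
    """MCDRAM (cache-mode) traffic from EDC read/write pending-queue inserts."""
    return (rpq_inserts + wpq_inserts) * BYTES_PER_TRANSACTION

# Example with made-up counter values:
print(ddr_bytes(6.0e6, 5.5e6) / 1e9, "GB DDR")
print(mcdram_bytes(5.0e9, 4.3e9) / 1e9, "GB MCDRAM")
```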
III-B Roofline Data Collection on NVIDIA GPUs
On NVIDIA GPUs, an nvprof [9] based methodology was first proposed in [17], followed by an Nsight Compute [5] metrics based one developed in [22, 30]. These methodologies require a dozen metrics to be collected for hierarchical Roofline analysis, and can incur significant profiling overhead when the number of kernels in the code is high. With nvprof being phased out of the developer toolchain, Nsight Compute has become the focus of Roofline data collection development. A simplified set of metrics was identified and validated in [30, 31], and it has since been integrated into Nsight Compute 2020 (the CUDA 11 release) [4]. The default Roofline feature shipped in Nsight Compute 2020 only includes HBM-level analysis, but it can be extended to hierarchical Roofline analysis using custom section files and/or job scripts such as those in [30, 31].
III-B1 Custom Section Files in Nsight Compute 2020
Nsight Compute uses Google Protocol Buffer messages for its section files, which allows users to quickly create custom section files for their own tailored analysis. The following is an example from [11] that can be used to collect hierarchical double-precision Roofline data for GPP, and its results are shown in Fig. 4. The arithmetic intensity of 13 FLOPs/Byte shows that this kernel is well into the compute-bound region at the HBM level, so particular attention should be paid to the utilization of compute resources such as threads and instructions, rather than to the memory system.
module load nsight-compute/2020.1.0
ncu -k NumBandNgpown_kernel -o ncu.prof \
    --section-folder ./ncu-section-files \
    --section SpeedOfLight_HierarchicalDoubleRooflineChart \
    ./gpp 512 2 32768 20 0
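As a quick sanity check of the compute-bound conclusion, the HBM-level arithmetic intensity can be compared against the machine balance of the V100. The sketch below uses nominal vendor specifications (roughly 7.8 TFLOP/s FP64 and 900 GB/s HBM2), not ERT-measured ceilings:

```python
# Nominal V100 peaks (approximate vendor specifications, not measured ceilings).
peak_fp64_gflops = 7800.0    # ~7.8 TFLOP/s double precision
peak_hbm_gbs     = 900.0     # ~900 GB/s HBM2 bandwidth

machine_balance = peak_fp64_gflops / peak_hbm_gbs   # ~8.7 FLOPs/Byte
ai_hbm = 13.0                                       # GPP's HBM-level AI from Fig. 4

print(f"Machine balance: {machine_balance:.1f} FLOPs/Byte")
print("Compute bound" if ai_hbm > machine_balance else "Memory bound")
```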
Table I: nvprof metrics for hierarchical Roofline data collection.

| | Commands/Metrics |
|---|---|
| Time | nvprof --print-gpu-summary ./gpp 512 2 32768 20 0 |
| FP64 FLOPs | nvprof --metrics flop_count_dp |
| FP32 FLOPs | flop_count_sp |
| FP16 FLOPs | flop_count_hp |
| Tensor Core | tensor_precision_fu_utilization |
| L1 Cache | gld_transactions, gst_transactions, atomic_transactions |
| | local_load_transactions, local_store_transactions |
| | shared_load_transactions, shared_store_transactions |
| L2 Cache | l2_read_transactions, l2_write_transactions |
| HBM | dram_read_transactions, dram_write_transactions |
III-B2 The nvprof Profiler
Many developers start their GPU optimization with the nvprof profiler, and our initial Roofline methodology was also based on nvprof metrics. Tab. I lists a set of metrics that can be used for hierarchical Roofline analysis, grouped into three categories: runtime, FLOPs count, and data movement (in bytes) between the different memory/cache levels. These metrics are based on CUPTI and can be mapped to the PerfWorks framework in Nsight Compute through [32], with some validation. The following command was used for the GPP data collection, and the results are shown in Fig. 5, with arithmetic intensities at the L1, L2 and HBM levels and GFLOP/s performance very similar to those in Fig. 4.
module load cuda/10.2.89
metrics='flop_count_dp,...'   # see Tab. I
nvprof --kernels NumBandNgpown_kernel --metrics $metrics ./gpp 512 2 32768 20 0
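Once collected, the nvprof transaction counts must be converted into bytes before computing arithmetic intensities. The sketch below illustrates this for the L2 and HBM levels under the common assumption of 32 bytes per transaction, as in [17]; the counter values are placeholders, not GPP measurements:

```python
# Sketch: convert nvprof metrics (Tab. I) into Roofline coordinates for one kernel.
BYTES_PER_TRANSACTION = 32     # assumed L2/HBM transaction size, following [17]

flop_count_dp           = 1.0e12   # from --metrics flop_count_dp
l2_read_transactions    = 3.0e9
l2_write_transactions   = 1.0e9
dram_read_transactions  = 1.5e9
dram_write_transactions = 0.5e9
kernel_time_s           = 0.5      # from nvprof --print-gpu-summary

l2_bytes  = (l2_read_transactions + l2_write_transactions) * BYTES_PER_TRANSACTION
hbm_bytes = (dram_read_transactions + dram_write_transactions) * BYTES_PER_TRANSACTION

print("GFLOP/s: ", flop_count_dp / kernel_time_s / 1e9)
print("AI (L2): ", flop_count_dp / l2_bytes, "FLOPs/Byte")
print("AI (HBM):", flop_count_dp / hbm_bytes, "FLOPs/Byte")
```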
III-B3 Metrics in Nsight Compute 2019
As nvprof is phased out, we have developed a data collection methodology based on Nsight Compute 2019. These metrics, listed in Tab. II, are more detailed than those in nvprof, and they produce comparable results, as seen by comparing Fig. 6 and Fig. 4. The commands used to collect Roofline data for GPP are as follows.
module load cuda/10.2.89
metrics='sm__cycles_elapsed.avg,...'   # see Tab. II
nv-nsight-cu-cli -k NumBandNgpown_kernel --metrics $metrics ./gpp 512 2 32768 20 0
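Turning the Tab. II counters into Roofline coordinates involves a few conventions used in [30]: runtime is derived from the elapsed cycles and the cycles-per-second rate, an FMA instruction counts as two FLOPs, and L2/HBM sectors are 32 bytes each. A sketch under those assumptions, with placeholder counter values, is:

```python
# Sketch: derive time, FLOPs and bytes from the Nsight Compute 2019 metrics in Tab. II.
cycles_elapsed     = 1.0e9       # sm__cycles_elapsed.avg
cycles_per_second  = 1.3e9       # sm__cycles_elapsed.avg.per_second
dadd, dmul, dfma   = 2.0e10, 1.0e10, 4.0e11   # sm__sass_thread_inst_executed_op_*_pred_on.sum
lts_read_sectors   = 3.0e9       # lts__t_sectors_op_read.sum
lts_write_sectors  = 1.0e9       # lts__t_sectors_op_write.sum
dram_read_sectors  = 1.5e9       # dram__sectors_read.sum
dram_write_sectors = 0.5e9       # dram__sectors_write.sum

time_s    = cycles_elapsed / cycles_per_second
flops     = dadd + dmul + 2 * dfma             # FMA counts as 2 FLOPs
l2_bytes  = (lts_read_sectors + lts_write_sectors) * 32    # 32 bytes per sector
hbm_bytes = (dram_read_sectors + dram_write_sectors) * 32

print("GFLOP/s: ", flops / time_s / 1e9)
print("AI (L2): ", flops / l2_bytes, "FLOPs/Byte")
print("AI (HBM):", flops / hbm_bytes, "FLOPs/Byte")
```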
Table II: Nsight Compute 2019 metrics for hierarchical Roofline data collection.

| | Metrics |
|---|---|
| Time | sm__cycles_elapsed.avg |
| | sm__cycles_elapsed.avg.per_second |
| FP64 FLOPs | sm__sass_thread_inst_executed_op_dadd_pred_on.sum |
| | sm__sass_thread_inst_executed_op_dmul_pred_on.sum |
| | sm__sass_thread_inst_executed_op_dfma_pred_on.sum |
| FP32 FLOPs | sm__sass_thread_inst_executed_op_fadd_pred_on.sum |
| | sm__sass_thread_inst_executed_op_fmul_pred_on.sum |
| | sm__sass_thread_inst_executed_op_ffma_pred_on.sum |
| FP16 FLOPs | sm__sass_thread_inst_executed_op_hadd_pred_on.sum |
| | sm__sass_thread_inst_executed_op_hmul_pred_on.sum |
| | sm__sass_thread_inst_executed_op_hfma_pred_on.sum |
| Tensor Core | sm__inst_executed_pipe_tensor.sum |
| L1 Cache | l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum |
| | l1tex__t_bytes_pipe_lsu_mem_global_op_st.sum |
| | l1tex__t_set_accesses_pipe_lsu_mem_global_op_atom.sum |
| | l1tex__t_set_accesses_pipe_lsu_mem_global_op_red.sum |
| | l1tex__t_set_accesses_pipe_tex_mem_surface_op_atom.sum |
| | l1tex__t_set_accesses_pipe_tex_mem_surface_op_red.sum |
| | l1tex__t_sectors_pipe_lsu_mem_local_op_ld.sum |
| | l1tex__t_sectors_pipe_lsu_mem_local_op_st.sum |
| | l1tex__data_pipe_lsu_wavefronts_mem_shared_op_ld.sum |
| | l1tex__data_pipe_lsu_wavefronts_mem_shared_op_st.sum |
| L2 Cache | lts__t_sectors_op_read.sum |
| | lts__t_sectors_op_write.sum |
| | lts__t_sectors_op_atom.sum |
| | lts__t_sectors_op_red.sum |
| HBM | dram__sectors_read.sum |
| | dram__sectors_write.sum |
Table III: Nsight Compute 2020 metrics for hierarchical Roofline data collection.

| | Metrics |
|---|---|
| Time | sm__cycles_elapsed.avg |
| | sm__cycles_elapsed.avg.per_second |
| FP64 FLOPs | sm__sass_thread_inst_executed_op_dadd_pred_on.sum |
| | sm__sass_thread_inst_executed_op_dfma_pred_on.sum |
| | sm__sass_thread_inst_executed_op_dmul_pred_on.sum |
| FP32 FLOPs | sm__sass_thread_inst_executed_op_fadd_pred_on.sum |
| | sm__sass_thread_inst_executed_op_ffma_pred_on.sum |
| | sm__sass_thread_inst_executed_op_fmul_pred_on.sum |
| FP16 FLOPs | sm__sass_thread_inst_executed_op_hadd_pred_on.sum |
| | sm__sass_thread_inst_executed_op_hfma_pred_on.sum |
| | sm__sass_thread_inst_executed_op_hmul_pred_on.sum |
| Tensor Core | sm__inst_executed_pipe_tensor.sum |
| L1 Cache | l1tex__t_bytes.sum |
| L2 Cache | lts__t_bytes.sum |
| HBM | dram__bytes.sum |
III-B4 Metrics in Nsight Compute 2020
As Nsight Compute evolves, we have also developed a simplified data collection methodology with fewer metrics to collect (see Tab. III). These metrics are equivalent to the ones used in the section files in Sec. III-B1, and scripts based on them [30] can be used for easier integration with users' other job submission workflows, and for more customized Roofline presentation (using Matplotlib). The commands we used to collect Roofline information for GPP in this paper are as follows.
module load nsight-compute/2020.1.0
metrics='sm__cycles_elapsed.avg,...'   # see Tab. III
ncu -k NumBandNgpown_kernel --metrics $metrics ./gpp 512 2 32768 20 0
Fig. 7 shows that this methodology produces results consistent with those in the previous subsections, with only marginal differences in arithmetic intensity and GFLOP/s throughput.
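For the more customized Roofline presentation mentioned above, a minimal Matplotlib sketch along the lines of the scripts in [30] is given below; the ceilings and kernel data points are placeholders, not the measured GPP results or ERT ceilings.

```python
import numpy as np
import matplotlib.pyplot as plt

# Minimal Roofline chart sketch; all numbers below are placeholders.
peak_gflops = 7800.0                                        # FP64 compute ceiling (GFLOP/s)
bandwidths  = {"L1": 14000.0, "L2": 4200.0, "HBM": 900.0}   # bandwidth ceilings (GB/s)
ai = np.logspace(-2, 3, 200)                                # FLOPs/Byte axis

for level, bw in bandwidths.items():
    plt.loglog(ai, np.minimum(peak_gflops, bw * ai), label=f"{level} ceiling")

# One kernel, plotted once per memory level as (AI, GFLOP/s) -- placeholder values.
kernel = {"L1": (1.0, 3000.0), "L2": (4.0, 3000.0), "HBM": (13.0, 3000.0)}
for level, (x, y) in kernel.items():
    plt.scatter(x, y, label=f"GPP ({level})")

plt.xlabel("Arithmetic Intensity (FLOPs/Byte)")
plt.ylabel("Performance (GFLOP/s)")
plt.legend()
plt.savefig("roofline.png", bbox_inches="tight")
```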
IV Summary
In this paper, we have presented a range of methods that use a variety of performance tools to collect data for hierarchical Roofline analysis. Even though the Roofline model has been integrated into production tools such as Intel Advisor and NVIDIA Nsight Compute, we expect this paper to fill the gap for developers who do not have access to those tools, or who would like to investigate the underlying details, and to serve the purpose of flexibility and generality in Roofline data collection.
References
- [1] S. Williams, A. Waterman, and D. Patterson, “Roofline: An Insightful Visual Performance Model for Multicore Architectures,” Commun. ACM, vol. 52, no. 4, 2009.
- [2] Intel Advisor Roofline Analysis. [Online]. Available: https://software.intel.com/content/www/us/en/develop/documentation/advisor-user-guide/top/survey-trip-counts-flops-and-roofline-analyses/roofline-analysis.html
- [3] T. Koskela, Z. Matveev, C. Yang, A. Adedoyin, R. Belenov, P. Thierry, Z. Zhao, R. Gayatri, H. Shan, L. Oliker, J. Deslippe, R. Green, and S. Williams, “A Novel Multi-Level Integrated Roofline Model Approach for Performance Characterization,” in International Conference on High Performance Computing. Springer, 2018, pp. 226–245.
- [4] Nsight Compute Roofline Analysis. [Online]. Available: https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#roofline
- [5] “NVIDIA Nsight Compute Profiling Tool,” https://docs.nvidia.com/nsight-compute/NsightCompute/index.html.
- [6] T. Röhl, J. Treibig, G. Hager, and G. Wellein, “Overhead analysis of performance counter measurements,” pp. 176–185, 9 2014.
- [7] (2017) Intel Software Development Emulator. [Online]. Available: https://software.intel.com/en-us/articles/intel-software-development-emulator
- [8] (2017) Intel VTune Amplifier. [Online]. Available: https://software.intel.com/en-us/intel-vtune-amplifier-xe
- [9] “NVIDIA Profiler nvprof,” https://docs.nvidia.com/cuda/profiler-users-guide/index.html.
- [10] “BerkeleyGW,” http://www.berkeleygw.org.
- [11] “Example Scripts for Plotting Roofline,” https://github.com/cyanguwa/nersc-roofline.
- [12] Empirical Roofline Toolkit. [Online]. Available: https://bitbucket.org/berkeleylab/cs-roofline-toolkit/src/master/
- [13] C. Yang, R. Gayatri, T. Kurth, P. Basu, Z. Ronaghi, A. Adetokunbo, B. Friesen, B. Cook, D. Doerfler, L. Oliker, J. Deslippe, and S. Williams, “An Empirical Roofline Methodology for Quantitatively Assessing Performance Portability,” in 2018 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC). IEEE, 2018, pp. 14–23.
- [14] NERSC Roofline Model Documentation. [Online]. Available: https://docs.nersc.gov/development/performance-debugging-tools/roofline/
- [15] C. Yang, B. Friesen, T. Kurth, B. Cook, and S. Williams, “Toward Automated Application Profiling on Cray Systems,” in Cray User Group (CUG), 2018.
- [16] J. R. Madsen, M. G. Awan, H. Brunie, J. Deslippe, R. Gayatri, L. Oliker, Y. Wang, C. Yang, and S. Williams, “Timemory: Modular Performance Analysis for HPC,” in International Conference on High Performance Computing. Springer, 2020, pp. 434–452.
- [17] C. Yang, T. Kurth, and S. Williams, “Hierarchical Roofline Analysis for GPUs: Accelerating Performance Optimization for the NERSC-9 Perlmutter System,” Concurrency and Computation: Practice and Experience. [Online]. Available: https://doi.org/10.1002/cpe.5547
- [18] C. Yang, T. Kurth, and S. Williams, “Hierarchical Roofline Analysis for GPUs: Accelerating Performance Optimization for the NERSC-9 Perlmutter System,” in Cray User Group (CUG) 2019.
- [19] D. Doerfler, J. Deslippe, S. Williams, L. Oliker, B. Cook, T. Kurth, M. Lobet, T. Malas, J.-L. Vay, and H. Vincenti, “Applying the Roofline performance model to the Intel Xeon Phi Knights Landing processor,” International Conference on High Performance Computing, pp. 339–353, 2016.
- [20] M. Del Ben, C. Yang, S. Louie, and J. Deslippe, “Accelerating Large-Scale GW Calculations on Hybrid GPU-CPU Systems,” Bulletin of the American Physical Society, vol. 65, 2020.
- [21] R. Gayatri, C. Yang, T. Kurth, and J. Deslippe, “A Case Study For Performance Portability Using OpenMP 4.5,” in International Workshop on Accelerator Programming Using Directives. Springer, 2018, pp. 75–95.
- [22] Y. Wang, C. Yang, S. Farrel, T. Kurth, and S. Williams, “Hierarchical Roofline Performance Analysis for Deep Learning Applications,” in 2020 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS). [Online]. Available: https://arxiv.org/abs/2009.05257
- [23] Y. Wang, C. Yang, S. Farrel, Y. Zhang, T. Kurth, and S. Williams, “Time-Based Roofline for Deep Learning Performance Analysis,” in 2020 IEEE/ACM Deep Learning on Supercomputers Workshop, 2020. [Online]. Available: https://arxiv.org/abs/2009.04598
- [24] N. Ding and S. Williams, “An Instruction Roofline Model for GPUs,” in 2019 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS). IEEE, 2019, pp. 7–18.
- [25] K. Z. Ibrahim, S. Williams, and L. Oliker, “Performance analysis of GPU programming models using the roofline scaling trajectories,” in International Symposium on Benchmarking, Measuring and Optimization. Springer, 2019, pp. 3–19.
- [26] J. W. Choi, D. Bedard, R. Fowler, and R. Vuduc, “A Roofline Model of Energy,” in 2013 IEEE 27th International Symposium on Parallel and Distributed Processing, 2013, pp. 661–672.
- [27] A. Lopes, F. Pratas, L. Sousa, and A. Ilic, “Exploring GPU Performance, Power And Energy-Efficiency Bounds with Cache-aware Roofline Modeling,” in 2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2017, pp. 259–268.
- [28] National Energy Research Scientific Computing Center (NERSC) KNL Partition. [Online]. Available: https://docs.nersc.gov/systems/cori/
- [29] National Energy Research Scientific Computing Center (NERSC) GPU Chassis. [Online]. Available: https://docs-dev.nersc.gov/cgpu/hardware/
- [30] Data Collection Methodology for Roofline Analysis on NVIDIA GPUs. [Online]. Available: https://gitlab.com/NERSC/roofline-on-nvidia-gpus/-/tree/arxiv-paper
- [31] C. Yang, “8 Steps to 3.7 TFLOP/s on NVIDIA V100 GPU: Roofline Analysis and Other Tricks,” arXiv preprint arXiv:2008.11326, 2020.
- [32] Metrics Comparison between nvprof and Nsight Compute. [Online]. Available: https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html#nvprof-metric-comparison