High-performance Computation of Kubo Formula with Vectorization of Batched Linear Algebra Operations
Abstract
We propose a method to accelerate the computation of the Kubo formula, optimized for vector processors. The key concept is the parallel evaluation of multiple integration points, enabled by batched linear algebra operations. Through benchmark comparisons of node performance between the vector-based NEC SX-Aurora TSUBASA and scalar-based Xeon machines, we verified that the vectorized implementation is up to approximately 2.2 times faster than the baseline. We also show the performance improvement due to padding, indicating that avoiding memory-bank conflicts is critically important in this type of task.
1 Introduction
One of the main challenges in data-driven material development is the small-data problem. It is not uncommon for data-scientific methodologies to fail due to a lack of sufficient data in a target domain. To overcome this problem, computational approaches based on high-throughput first-principles calculations have attracted much attention and are actively used in data generation [29, 24, 25, 41]. Compared with experimental approaches, the computational approach has advantages in terms of data production, allowing us to construct a large-scale dataset or to run an exhaustive virtual screening.
Currently, almost all high-throughput calculation projects are built on the Kohn–Sham density functional theory (KS-DFT), which is an established and generic method for predicting the electronic structure of materials without assuming empirical parameters. DFT fundamentally characterizes the ground state of electrons. It enables the computation of static quantities such as electron density, total energy, force fields, and magnetic moments, thus providing useful insights into structural stability, chemical reactions, and various practical applications [31].
External-field responses such as transport and optical properties not only give helpful knowledge of materials under driving forces but also allow direct comparison between calculations and measurements. However, unlike the static quantities, these properties are not included in the standard DFT procedure because they involve electron excitations, which are beyond the scope of DFT. To compute response functions at the first-principles level, we have to consider theoretical extensions of DFT, such as time-dependent DFT, or additional post-process calculations depending on the specific task.
Recently, a post-process scheme based on the Kubo formula with an effective tight-binding model has been widely used in the context of high-throughput calculations. The Kubo formula is a general formula for linear response phenomena, which can be evaluated in a single framework regardless of the system and target [30]. Usually, the electronic structure obtained by DFT is downfolded to a tight-binding (TB) model, such as a Wannier TB model, to make the evaluation tractable [32]. On the basis of such a scheme, several studies have successfully produced datasets of transport properties, for example, the charge conductivity, the thermoelectric conductivity, and the spin conductivity [38, 44, 39, 43].
Although the Kubo formula is an attractive method for high-throughput calculations due to its versatility and stability, it is computationally expensive, limiting the yield of a workflow even when combined with TB models. Roughly estimated, the evaluation of the Kubo formula typically costs 10 times as much as the DFT calculation. Moreover, in contrast to DFT core programs, for which implementations utilizing general-purpose graphics processing units (GPGPUs) exist, speeding up the Kubo formula remains a major challenge because it involves overcoming multiple computational bottlenecks. Modern high-performance computing (HPC) implementations of the Kubo formula can play a vital role in increasing the yield of data generation.
In this study, we develop an efficient method for evaluating the Kubo formula by optimizing it for vector processors. As a practical example, we perform benchmark tests on the computation of the anomalous Hall conductivity (AHC) [33] and show performance comparisons between the vectorized method and an existing CPU implementation.
2 Numerical method
2.1 Kubo formula
Within the framework of the Kubo formula, a linear response of an observable can be generally expressed as
$$\chi_{AB}(\omega) = \frac{i}{\hbar V} \int_{0}^{\infty} dt\, e^{i\omega t} \left\langle \left[\hat{A}(t), \hat{B}(0)\right] \right\rangle \qquad (1)$$
where $V$ is the volume, the hatted symbols are operators, and $\chi_{AB}$ is the response function of an observable $\hat{A}$ under an external field giving a perturbation $\hat{B}$. $\langle \cdots \rangle$ is an equilibrium average with respect to the unperturbed Hamiltonian. Considering a single-electron system, this average can be reduced to a Brillouin-zone integral for a periodic system, or otherwise to a real-space integral.
From the perspective of numerical calculation, Eq.(1) consists of a three-dimensional integration whose integrand is the trace of a matrix resulting from multiple linear algebra operations. Depending on the specification, the integrand mostly involves matrix products combined with either a full diagonalization or a matrix inverse. The computational task is therefore a nested loop over the numerical integration and the matrix operations, hence the time complexity is approximately $O(N_k N^3)$, where $N_k$ is the number of integration points and $N$ is the dimension of the matrix, corresponding to the number of single-electron bands.
Here, let us focus on the intrinsic AHC, $\sigma_{xy}$, in a periodic, single-electron system at zero temperature for demonstration. Eq.(1) can be rewritten as
$$\sigma_{xy} = \frac{1}{V} \sum_{\mathbf{k}} \sigma_{xy}(\mathbf{k}), \qquad (2)$$
$$\sigma_{xy}(\mathbf{k}) = -2\hbar \sum_{n \neq m} \theta(E_F - \varepsilon_{n\mathbf{k}})\, \frac{\mathrm{Im}\left[\langle n\mathbf{k}|\hat{J}_x|m\mathbf{k}\rangle \langle m\mathbf{k}|\hat{J}_y|n\mathbf{k}\rangle\right]}{(\varepsilon_{n\mathbf{k}} - \varepsilon_{m\mathbf{k}})^{2}}, \qquad (3)$$
where $\hat{J}_\alpha$ is a charge current operator, $\varepsilon_{n\mathbf{k}}$ is an eigenvalue, $E_F$ is the Fermi energy, and $\theta$ is a unit step function. In this formula, all operators are represented as $N$-dimensional matrices. Note that Eq.(3) is valid for a periodic system, whereas for a system with impurities, we have to compute Green's functions, i.e., matrix inverses, instead of eigenvalues.
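As an illustration, a minimal NumPy sketch of the band sum in Eq.(3) at a single $\mathbf{k}$-point is shown below; it is not the production implementation, and the function name, argument conventions, and the overall prefactor are illustrative assumptions.

```python
import numpy as np

def ahc_integrand_k(H, Jx, Jy, e_fermi, hbar=1.0):
    """Evaluate the band sum of Eq.(3) for one k-point (prefactor illustrative).

    H, Jx, Jy : (N, N) complex Hermitian matrices in the original basis.
    """
    eps, U = np.linalg.eigh(H)            # full diagonalization of H(k)
    jx = U.conj().T @ Jx @ U              # rotate the current operators
    jy = U.conj().T @ Jy @ U              # into the eigenbasis
    occ = eps < e_fermi                   # zero-temperature occupations
    num = np.imag(jx * jy.T)              # Im[<n|Jx|m><m|Jy|n>]
    de = eps[:, None] - eps[None, :]      # energy denominators eps_n - eps_m
    sel = occ[:, None] & ~occ[None, :]    # occupied n, unoccupied m terms
    return -2.0 * hbar * np.sum(num[sel] / de[sel] ** 2)
```

Restricting the sum to occupied-unoccupied pairs is equivalent to the step-function weight in Eq.(3), since the occupied-occupied terms cancel pairwise.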
2.2 Baseline implementation
A minimal code evaluating Eqs.(2,3) is summarized as the pseudo code in List 1, which consists of an integral over (1) construction of the Hamiltonian and current matrices, (2) full diagonalization of the Hamiltonian, (3) conversion of the current matrices into the eigenbasis, and (4) evaluation of Eq.(3). The integral requires $N_k$ evaluations of (1)-(4), each taking $O(N^3)$, resulting in the total time complexity $O(N_k N^3)$. In particular, the most time-consuming part is the full diagonalization (2).
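A minimal sketch of this baseline loop, reusing `ahc_integrand_k` from the previous sketch, reads as follows; the helpers `build_H` and `build_J` are placeholders for the model-specific construction of the Hamiltonian and current matrices in step (1).

```python
def ahc_baseline(kpoints, build_H, build_J, e_fermi, volume):
    """Plain loop over N_k integration points, evaluating Eqs.(2,3)."""
    sigma_xy = 0.0
    for k in kpoints:                                    # N_k iterations
        H = build_H(k)                                   # step (1)
        Jx, Jy = build_J(k, "x"), build_J(k, "y")
        sigma_xy += ahc_integrand_k(H, Jx, Jy, e_fermi)  # steps (2)-(4), O(N^3)
    return sigma_xy / volume                             # normalization of Eq.(2)
```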
To accelerate the calculation, a parallel programming technique (for instance, MPI parallelization) can be applied to either the integral or the matrix calculations. In high-throughput computation, the former is more effective than the latter. This is because parallelizing the matrix calculations is not effective in the common case where the material contains only a few chemical species per unit cell and the matrix dimension is accordingly small. Moreover, the integrand is spiky in momentum space and requires a very fine computational mesh, incurring a critical amount of overhead when MPI is used inside the integral.
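For concreteness, a minimal mpi4py sketch of the integration-point parallelization reads as follows; it simply distributes the $\mathbf{k}$-loop of the baseline sketch round-robin over processes and reduces the partial sums, and is an illustration of the strategy rather than the actual baseline code.

```python
from mpi4py import MPI

def ahc_mpi(kpoints, build_H, build_J, e_fermi, volume):
    """Flat-MPI parallelization over the integration points."""
    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
    local = 0.0
    for k in kpoints[rank::size]:                        # round-robin distribution
        H = build_H(k)
        Jx, Jy = build_J(k, "x"), build_J(k, "y")
        local += ahc_integrand_k(H, Jx, Jy, e_fermi)     # from the previous sketch
    return comm.allreduce(local, op=MPI.SUM) / volume    # sum the partial results
```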
2.3 Vectorized implementation
In this section, we propose an alternative, namely a vectorized implementation of the Kubo formula, which is optimized for vector processors. The key idea is the batch evaluation of multiple integration points, as depicted in Fig.1. Let $N_B$ be the number of batches and $L$ be the batch size, taken to be equal to the vector length. Here, $N_k = N_B L$ holds. We first split the $\mathbf{k}$-integral loop of the baseline code into two loops: an inner one over the batches of length $L$ and an outer one over the sum of them, that is,
$$\sigma_{xy} = \frac{1}{V} \sum_{\mathbf{k}}^{N_k} \sigma_{xy}(\mathbf{k}) = \sum_{b=1}^{N_B} F_b, \qquad (4)$$
where $F_b = \frac{1}{V} \sum_{j=1}^{L} \sigma_{xy}(\mathbf{k}_{(b-1)L+j})$ is the partial sum over the $b$-th batch. In other words, a part of the original loop is pushed into the vectorized integrand $F_b$. The benefit of vectorization becomes apparent when the loop is sufficiently longer than the vector length. In our case, $N_k \gg L$ is well satisfied, and we can expect a sufficient benefit from vectorization.
List 2 represents a pseudo code of this implementation. Compared with List 1, the procedure is exactly equivalent, but the loop cycles are executed per batch, and each matrix operation is performed on a whole batch. Here, we prepare batched versions of the matrix product and the diagonalization; the latter is obtained by modifying the EISPACK subroutine [40]. Such a batched linear algebra scheme is known as Batched-BLAS in the recent literature [27, 26, 23].
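To make the batching concrete, a minimal NumPy sketch of the vectorized scheme is shown below; the stacked calls to np.linalg.eigh and np.einsum play the role of the batched diagonalization and matrix products on the vector processor. For clarity, the matrices of all $N_k$ points are assumed to be precomputed in $(N_k, N, N)$ arrays, whereas in practice they would be constructed batch by batch; names and the prefactor are again illustrative.

```python
import numpy as np

def ahc_batched(H_all, Jx_all, Jy_all, e_fermi, volume, batch_size):
    """Batched evaluation of Eq.(4): process `batch_size` k-points per iteration."""
    n_k = H_all.shape[0]
    sigma_xy = 0.0
    for start in range(0, n_k, batch_size):              # outer loop over batches
        sl = slice(start, start + batch_size)
        eps, U = np.linalg.eigh(H_all[sl])                # batched diagonalization
        # batched rotation of the current operators into the eigenbasis
        jx = np.einsum("bji,bjk,bkl->bil", U.conj(), Jx_all[sl], U)
        jy = np.einsum("bji,bjk,bkl->bil", U.conj(), Jy_all[sl], U)
        num = np.imag(jx * np.swapaxes(jy, 1, 2))         # Im[<n|Jx|m><m|Jy|n>]
        de = eps[:, :, None] - eps[:, None, :]            # eps_n - eps_m per k-point
        occ = eps < e_fermi
        sel = occ[:, :, None] & ~occ[:, None, :]          # occupied n, unoccupied m
        sigma_xy += -2.0 * np.sum(num[sel] / de[sel] ** 2)
    return sigma_xy / volume
```

When $N_k$ is not a multiple of the batch size, the final slice simply yields a shorter batch; on the vector machine, the batch size is instead fixed to the vector length $L$, as described above.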
In addition, to avoid degradation due to memory-bank conflicts, the matrix size is adjusted by padding. The benefit of the padding will be demonstrated in the next section.
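As a rough sketch of the idea, the following embeds a batch of $N \times N$ matrices into arrays with a padded dimension; the padding rule used here (making the dimension odd) is only an illustrative assumption, and the appropriate amount of padding depends on the memory-bank structure of the target machine.

```python
import numpy as np

def pad_batch(matrices, n_pad=None):
    """Zero-pad a (B, N, N) stack of matrices to (B, N_pad, N_pad).

    Changing the stored matrix dimension alters the memory stride between
    rows, which is what avoids the bank conflicts discussed in the text.
    """
    b, n, _ = matrices.shape
    if n_pad is None:
        n_pad = n + 1 if n % 2 == 0 else n     # illustrative rule: make N odd
    padded = np.zeros((b, n_pad, n_pad), dtype=matrices.dtype)
    padded[:, :n, :n] = matrices               # original data in the top-left block
    return padded
```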
To end this section, we comment on the case of a large-scale system involving hundreds of atoms. In such a case, the dimension of the matrix is sufficiently large, and parallelization of the matrix operations becomes effective. Because the required $N_k$ is in a tradeoff relationship with the unit-cell size, we can choose the best parallelization strategy depending on the problem. As large-scale systems are beyond the scope of this paper, we do not discuss them further.
3 Benchmark results
To examine the performance of the vectorized implementation, we conduct a demonstration on a vector machine, the NEC SX-Aurora TSUBASA, whose vector length is 256 [36]. For comparison, the baseline implementation is also tested on a CPU machine, Xeon Gold, with 32-process parallelization by flat MPI. We are aware that a comparison between different architectures is not straightforward, but here we compare single-node performance as a compromise measure. The computing and compiling environments are listed in Table 1.
In this demonstration, we measure the computation time to evaluate the AHC of bcc-Fe and its supercell structures. Wannier TB models are constructed from the DFT calculation using the Quantum Espresso code [28] and the Wannier90 code [37]. We use the WANNIER-LINEAR-RESPONSE code [42] as the baseline implementation of the Kubo formula and vectorize it by the method introduced in Sec. 2.3.
Let us begin with the vector processing rate, a performance indicator defined as the proportion of vectorized processing in the total processing. According to the profiler, our vectorized code achieves sufficiently high scores, reaching 98-99% in all demonstrations.
Fig.2 shows the computation times as a function of $N_k$. Here, we use the [1,1,2]-supercell, where $N = 48$, as it has 4 atoms with 2 spins and 6 Wannier projections per atom. From this data, we can see that the results on the SX-Aurora are about twice as fast as those of the 32-process parallel execution on the Xeon Gold, indicating a significant acceleration due to the vectorized implementation. This trend becomes more pronounced as $N_k$ increases. We again emphasize that this comparison is only a single aspect and does not imply general superiority of the vector processor over the CPU. Rather, it is clear evidence of the usefulness of the vectorization technique for this task.
Next, Fig.3 shows the computation times as a function of $N$ for a fixed $N_k$. Although the performance ratio varies with the matrix size, in all cases the SX-Aurora is more than 1.7 times faster than the Xeon Gold, and 2.17 times at maximum. Note that the lower performance ratio at one of the matrix sizes reflects a condition particularly favorable to the Xeon Gold rather than a defect of the SX-Aurora. Together with the results in Fig.2, we claim that the vectorized implementation shows significant speedups in a wide range of tasks on small-scale systems, which is a desirable feature in high-throughput computations.
Finally, we examine the effect of the padding on the performance. Fig.4 compares the vectorized implementation with and without the padding, showing that the padding improves the performance by up to 1.5 times. The improvement becomes notable from the 8-atom cell ($N = 96$) onward; that is, memory-bank conflicts cause serious performance degradation except in very small systems. Hence, padding is critically important in practical use.
4 Conclusion
In this study, we have proposed a method to accelerate the computation of the Kubo formula by using batched matrix operations optimized for vector processors. Our findings have led to the following insights:
First, the introduction of vectorization was demonstrated to accelerate the computation of the Kubo formula. Through benchmark comparisons between the vector-based NEC SX-Aurora TSUBASA and the scalar-based Xeon machines, we verified that the SX-Aurora achieves up to approximately 2.2 times higher node performance. This result indicates that our proposed method is effective and leads to increased data production in the context of high-throughput calculations.
Second, a computational performance improvement due to padding was observed. This result underscores the importance of avoiding memory-bank conflicts to improve computational efficiency.
Furthermore, our approach is not limited to the Kubo formula; it can be extended to various similar tasks that rely on batched linear algebra operations. The vectorization approach could become a powerful alternative to MPI in such cases. In addition, the same strategy can be applied to GPGPUs as long as the memory consumption is acceptable. A GPGPU implementation is an interesting direction for future work.
Acknowledgement
Y.Y. and T.K. are grateful to Dr. Taizo Shibuya at NEC Corporation for his valuable suggestions.
Table 1: Computing and compiling environments.

| | Vector machine | Scalar machine |
|---|---|---|
| Hardware (1 node) | VE Type-20B, 8 cores [35] | Xeon Gold 6362, 2 sockets × 16 cores |
| Compiler | NEC SDK 4.0.0 [34] | Intel ifort 19.1.2.254 |
| MPI | NEC MPI 3.0.0 | Intel MPI ver. 2019 |
| Numerical library | NLC 2.3 | Intel MKL |
References
- [23] Ahmad Abdelfattah et al. “A Set of Batched Basic Linear Algebra Subprograms and LAPACK Routines” In ACM Transactions on Mathematical Software 47 Association for Computing Machinery, 2021 DOI: 10.1145/3431921
- [24] Stefano Curtarolo et al. “AFLOWLIB.ORG: A distributed materials properties repository from high-throughput ab initio calculations” In Computational Materials Science 58, 2012, pp. 227–235 DOI: 10.1016/j.commatsci.2012.02.002
- [25] Stefano Curtarolo et al. “The high-throughput highway to computational materials design” In Nature Materials 12, 2013, pp. 191–201 DOI: 10.1038/nmat3568
- [26] Jack Dongarra et al. “Optimized Batched Linear Algebra for Modern Architectures” In Euro-Par 2017: Parallel Processing 10417 Springer International Publishing, 2017, pp. 511–522 DOI: 10.1007/978-3-319-64203-1_37
- [27] Jack Dongarra et al. “The Design and Performance of Batched BLAS on Modern High-Performance Computing Systems” In Procedia Computer Science 108 Elsevier B.V., 2017, pp. 495–504 DOI: 10.1016/j.procs.2017.05.138
- [28] Paolo Giannozzi et al. “QUANTUM ESPRESSO: a modular and open-source software project for quantum simulations of materials” In Journal of Physics: Condensed Matter 21, 2009, pp. 395502 DOI: 10.1088/0953-8984/21/39/395502
- [29] Anubhav Jain et al. “A high-throughput infrastructure for density functional theory calculations” In Computational Materials Science 50, 2011, pp. 2295–2310 DOI: 10.1016/j.commatsci.2011.02.023
- [30] Ryogo Kubo “Statistical-Mechanical Theory of Irreversible Processes. I. General Theory and Simple Applications to Magnetic and Conduction Problems” In Journal of the Physical Society of Japan 12, 1957, pp. 570–586 DOI: 10.1143/JPSJ.12.570
- [31] Richard M. Martin “Electronic Structure” Cambridge University Press, 2004 DOI: 10.1017/CBO9780511805769
- [32] Nicola Marzari et al. “Maximally localized Wannier functions: Theory and applications” In Reviews of Modern Physics 84 American Physical Society, 2012, pp. 1419–1475 DOI: 10.1103/RevModPhys.84.1419
- [33] Naoto Nagaosa et al. “Anomalous Hall effect” In Reviews of Modern Physics 82, 2010, pp. 1539–1592 DOI: 10.1103/RevModPhys.82.1539
- [34] “NEC SDK” URL: https://www.nec.com/en/global/solutions/hpc/sx/tools.html?
- [35] “NEC SX VE Type-20B” URL: https://www.nec.com/en/global/solutions/hpc/sx/Vector-Engine-Types.html#20B
- [36] “NEC SX Vector Supercomputer” URL: https://www.nec.com/en/global/solutions/hpc/sx/
- [37] Giovanni Pizzi et al. “Wannier90 as a community code: New features and applications” In Journal of Physics Condensed Matter 32 Institute of Physics Publishing, 2020 DOI: 10.1088/1361-648X/ab51ff
- [38] Akito Sakai et al. “Iron-based binary ferromagnets for transverse thermoelectric conversion” In Nature 581 Nature Research, 2020, pp. 53–57 DOI: 10.1038/s41586-020-2230-z
- [39] Ilias Samathrakis et al. “Enhanced anomalous Nernst effects in ferromagnetic materials driven by Weyl nodes” In Journal of Physics D: Applied Physics 55 IOP Publishing Ltd, 2022 DOI: 10.1088/1361-6463/ac3351
- [40] B. T. Smith et al. “Matrix Eigensystem Routines — EISPACK Guide” Springer Berlin Heidelberg, 1976 DOI: 10.1007/3-540-07546-1
- [41] Xiaoyu Yang et al. “MatCloud: A high-throughput computational infrastructure for integrated management of materials simulation, data and resources” In Computational Materials Science 146 Elsevier B.V., 2018, pp. 319–333 DOI: 10.1016/j.commatsci.2018.01.039
- [42] Jakub Železný “zeleznyj/wannier-linear-response” URL: https://bitbucket.org/zeleznyj/wannier-linear-response/src/master/
- [43] Jakub Železný et al. “High-throughput study of the anomalous Hall effect” In npj Computational Materials 9, 2023, pp. 151 DOI: 10.1038/s41524-023-01113-5
- [44] Yang Zhang et al. “Different types of spin currents in the comprehensive materials database of nonmagnetic spin Hall effect” In npj Computational Materials 7, 2021 DOI: 10.1038/s41524-021-00635-0