
High-performance Computation of Kubo Formula with Vectorization of Batched Linear Algebra Operation

Yuta Yahagi (Secure System Platform Laboratory, NEC Corporation) and Toshihiro Kato (HPC Department, NEC Corporation)
(July 27, 2023)
Abstract

We propose a method to accelerate the computation of the Kubo formula, optimized for vector processors. The key concept is the parallel evaluation of multiple integration points, enabled by batched linear algebra operations. Through single-node benchmark comparisons between the vector-based NEC SX-Aurora TSUBASA and scalar-based Xeon machines, we verified that the vectorized implementation is up to approximately 2.2 times faster than the baseline. We also show the performance improvement due to padding, indicating that avoiding memory-bank conflicts is critically important in this type of task.

1 Introduction

One of the main challenges in data-driven material development is the small-data problem. It is not uncommon for data-scientific methodologies to fail due to a lack of sufficient data in a target domain. To overcome this problem, computational approaches based on high-throughput first-principles calculations have attracted much attention and are actively used for data generation [29, 24, 25, 41]. Compared with experimental approaches, the computational approach has advantages in terms of data production, allowing us to construct large-scale datasets or to run exhaustive virtual screenings.

Currently, almost all high-throughput calculation projects build on the Kohn–Sham density functional theory (KS-DFT), an established and generic method for predicting the electronic structure of materials without assuming empirical parameters. DFT fundamentally characterizes the ground state of electrons. It enables the computation of static quantities such as electron density, total energy, force fields, and magnetic moments, thus providing useful insights into structural stability, chemical reactions, and various practical applications [31].

External-field responses such as transport and optical properties not only give helpful knowledge of materials under a driving force but also allow direct comparison between calculations and measurements. However, unlike the static quantities, these properties are not included in the standard DFT procedure because they involve electron excitations, which are out of the range of DFT. To compute response functions at the first-principles level, we have to consider theoretical extensions of DFT such as time-dependent DFT, or an additional post-process calculation depending on the specific task.

Recently, a post-process scheme based on the Kubo formula with an effective tight-binding model has been widely used in the context of high-throughput calculations. The Kubo formula is a general formula for linear response phenomena, which can be evaluated in a single framework regardless of the system and target [30]. Usually, the electronic structure obtained by DFT is downfolded to a tight-binding (TB) model such as a Wannier TB model to make its evaluation easier [32]. On the basis of such a scheme, several studies have successfully produced datasets of transport properties, for example, the charge conductivity, the thermoelectric conductivity, and the spin conductivity [38, 44, 39, 43].

Although the Kubo formula is an attractive method for high-throughput calculations due to its versatility and stability, it is computationally expensive, limiting the yield of a workflow even when combined with TB models. Roughly estimated, the evaluation of the Kubo formula typically costs 10 times as much as the DFT calculation itself. Moreover, in contrast to DFT core programs, for which implementations utilizing general-purpose graphics processing units (GPGPU) exist, speeding up the Kubo formula remains a major challenge because it involves multiple computational bottlenecks. Modern high-performance computing (HPC) implementations of the Kubo formula can play a vital role in increasing the yield of data generation.

In this study, we develop an efficient method of evaluating the Kubo formula by optimizing it for vector processors. As a practical example, we perform benchmark tests on the computation of the anomalous Hall conductivity (AHC) [33] and show performance comparisons between the vectorized method and an existing CPU implementation.

2 Numerical method

2.1 Kubo formula

Within the framework of the Kubo formula, a linear response of an observable $\hat{B}$ can be generally expressed as

$$\chi_{BA}(\omega)=\frac{\mathrm{i}}{\hbar V}\int_{0}^{\infty}\mathrm{d}t\,e^{\mathrm{i}\omega t}\langle[\hat{B}(t),\hat{A}(0)]\rangle_{\mathrm{eq}},\tag{1}$$

where $V$ is the volume, the hatted symbols are operators, and $\chi_{BA}(\omega)$ is the response function of the observable $\hat{B}$ under an external field $F(t)=Fe^{\mathrm{i}\omega t}$ giving a perturbation $-\hat{A}F(t)$. $\langle\cdots\rangle_{\mathrm{eq}}$ is an equilibrium average with respect to the unperturbed Hamiltonian. For a single-electron system, this average reduces to a Brillouin-zone integral $\langle\hat{X}\rangle_{\mathrm{eq}}\to(2\pi)^{-3}V\int_{\mathrm{B.Z.}}\mathrm{d}\bm{k}\,\mathrm{Tr}\,\hat{X}(\bm{k})$ for a periodic system, and otherwise to a real-space integral $\langle\hat{X}\rangle_{\mathrm{eq}}\to\int\mathrm{d}\bm{r}\,\mathrm{Tr}\,\hat{X}(\bm{r})$.

From the perspective of numerical calculation, Eq. (1) consists of a three-dimensional integration whose integrand is the trace of a matrix resulting from multiple linear algebra operations. Depending on the specification, the integrand mostly involves matrix products together with either a full diagonalization or a matrix inverse. The computational task is therefore a nested loop over the numerical integration and the matrix operations, and the time complexity is approximately $O(N_{k}N_{d}^{3})$, where $N_{k}$ is the number of integration points and $N_{d}$ is the dimension of the matrices, corresponding to the number of single-electron bands.

Here, let us focus on the intrinsic AHC, $\sigma_{yx}$, in a periodic, single-electron system at zero temperature for demonstration. Eq. (1) can be rewritten as

$$\sigma_{yx}=\sum_{\bm{k}}^{N_{k}}f(\bm{k}),\tag{2}$$
$$f(\bm{k})=\frac{\hbar}{\pi}\sum_{n}^{N_{d}}\sum_{m\neq n}^{N_{d}}\mathrm{Im}\frac{[\hat{J}_{y}(\bm{k})]_{nm}[\hat{J}_{x}(\bm{k})]_{mn}}{(E_{m}(\bm{k})-E_{n}(\bm{k}))^{2}}\,\theta(E_{\mathrm{F}}-E_{n}(\bm{k})),\tag{3}$$

where $\hat{J}_{\alpha}(\bm{k})$ is a charge current operator, $E_{n}(\bm{k})$ is an eigenvalue, $E_{\mathrm{F}}$ is the Fermi energy, and $\theta(x)$ is the unit step function. In this formula, all operators are represented as $N_{d}$-dimensional matrices. Note that Eq. (3) is valid for a periodic system, whereas for a system with impurities, we have to compute Green's functions, i.e., matrix inverses, instead of eigenvalues.
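To make the structure of the integrand concrete, below is a minimal NumPy sketch of $f(\bm{k})$ at a single $\bm{k}$-point, written directly against Eq. (3). The function name and arguments are illustrative choices rather than the interface of any existing code, the $\hbar/\pi$ prefactor is omitted, and the regularization of near-degenerate bands that production codes need is left out.

    import numpy as np

    def f_k(H_k, Jx_k, Jy_k, E_F):
        """Evaluate Eq. (3) at one k-point (prefactor hbar/pi omitted).

        H_k, Jx_k, Jy_k: (Nd, Nd) complex Hermitian matrices; E_F: Fermi energy.
        """
        E, U = np.linalg.eigh(H_k)            # eigenvalues E_n and eigenvectors
        Jx = U.conj().T @ Jx_k @ U            # current operators in the eigenbasis
        Jy = U.conj().T @ Jy_k @ U
        dE2 = (E[None, :] - E[:, None])**2    # dE2[n, m] = (E_m - E_n)^2
        np.fill_diagonal(dE2, np.inf)         # exclude the m = n terms
        num = np.imag(Jy * Jx.T)              # num[n, m] = Im(Jy_nm * Jx_mn)
        occ = E < E_F                         # theta(E_F - E_n)
        return (num / dE2)[occ, :].sum()      # sum over occupied n and all m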

2.2 Baseline implementation

A minimal code evaluating Eqs. (2,3) is summarized as the pseudocode in List 1, which consists of a $\bm{k}$ integral over (1) construction of the matrices $\hat{H}(\bm{k})$ and $\hat{J}_{\alpha}(\bm{k})$, (2) full diagonalization of $\hat{H}(\bm{k})$, (3) conversion of $\hat{J}_{\alpha}(\bm{k})$ into the eigenbasis, and (4) evaluation of Eq. (3). The integral requires $N_{k}$ evaluations of steps (1)-(4), each of which takes $O(N_{d}^{3})$, resulting in the total time complexity $O(N_{k}N_{d}^{3})$. In particular, the most time-consuming part is the full diagonalization in step (2).

To accelerate the calculation, a parallel programming technique (for instance, MPI parallelization) can be applied to either the $\bm{k}$ integral or the matrix calculations. In high-throughput computation, the former is more effective than the latter. This is because parallelizing the matrix calculations pays off poorly in the common case where the material contains only a few chemical species per unit cell, so that the matrix dimension $N_{d}$ is small. Moreover, the integrand is spiky in momentum space and requires a very fine computational mesh of approximately $10^{6}$-$10^{7}$ points, incurring a critical amount of overhead when MPI is used inside the integral.
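As a hedged sketch of the former strategy, the $\bm{k}$ loop can be distributed over MPI ranks in a round-robin fashion using standard mpi4py calls. Here `f_k` is the illustrative per-point function from Sec. 2.1, and `build_matrices` stands in for a hypothetical constructor of $\hat{H}(\bm{k})$ and the current operators; neither is the interface of the actual code.

    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    k_points = np.random.rand(1000, 3)   # placeholder k mesh, identical on every rank
    E_F = 0.0                            # placeholder Fermi energy

    local = 0.0
    for i in range(rank, len(k_points), size):        # round-robin split of the k mesh
        H_k, Jx_k, Jy_k = build_matrices(k_points[i]) # hypothetical constructor
        local += f_k(H_k, Jx_k, Jy_k, E_F)

    sigma_yx = comm.allreduce(local, op=MPI.SUM)      # aggregate partial sums over ranks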

2.3 Vectorized implementation

In this section, we propose an alternative, namely a vectorized implementation of the Kubo formula, which is optimized for vector processors. The key idea is the batch evaluation of multiple integration points, as depicted in Fig. 1. Let $N_{B}$ be the number of batches and $l_{B}$ be the batch size, taken to be the same as the vector length, so that $N_{k}=N_{B}l_{B}$ holds. We first split the $\bm{k}$-integral loop of the baseline code into two loops: an inner one over the batches of length $l_{B}$ and an outer one over the sum of them, that is,

$$\sigma_{yx}=\sum_{\bm{k}}^{N_{k}}f(\bm{k})\xrightarrow{\mathrm{Batched}}\sum_{b=1}^{N_{B}}\sum_{l=1}^{l_{B}}f(\bm{k}_{(b-1)l_{B}+l})\equiv\sum_{b=1}^{N_{B}}\tilde{f}(\bm{k}_{[(b-1)l_{B}+1:bl_{B}]}),\tag{4}$$

where $\bm{k}_{[n:m]}\equiv[\bm{k}_{n},\bm{k}_{n+1},\dots,\bm{k}_{m}]$. In other words, part of the original loop is pushed into the vectorized integrand $\tilde{f}$. The benefit of vectorization becomes apparent when the loop is sufficiently longer than $l_{B}$. In our case, where $l_{B}\ll N_{k}$ is well satisfied, we can expect a substantial benefit from vectorization.
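In NumPy-like terms, a sketch of this loop splitting under the same illustrative assumptions as above (with `f_batch` the vectorized integrand, defined in the next sketch) reads:

    l_B = 256                                  # batch size = hardware vector length
    sigma_yx = 0.0
    for b in range(0, len(k_points), l_B):     # outer loop over the N_B batches
        batch = k_points[b:b + l_B]            # k_{[(b-1)l_B+1 : b*l_B]} in Eq. (4)
        sigma_yx += f_batch(batch, E_F)        # vectorized integrand f~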

List 2 shows a pseudocode of this implementation. Compared with List 1, the procedure is exactly equivalent, but the loop cycles are executed per batch, and each matrix operation is performed on a batch. Here, we prepare batched versions of the matrix product and the diagonalization; the latter is obtained by modifying the EISPACK subroutine [40]. Such a batched linear algebra scheme is known as Batched-BLAS in recent literature [27, 26, 23].
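NumPy accepts stacks of matrices in its linear algebra routines, so a compact sketch of the batched integrand $\tilde{f}$ might read as follows. The batched `eigh` and `einsum` calls merely stand in for the modified EISPACK kernel and the Batched-BLAS products of the actual implementation, and `build_matrices_batch` is a hypothetical constructor returning $(l_B, N_d, N_d)$ stacks.

    import numpy as np

    def f_batch(k_batch, E_F):
        """Vectorized integrand f~ of Eq. (4) (prefactor hbar/pi omitted)."""
        H, Jx, Jy = build_matrices_batch(k_batch)      # (l_B, Nd, Nd) stacks, hypothetical
        E, U = np.linalg.eigh(H)                       # batched diagonalization
        Ud = U.conj().transpose(0, 2, 1)
        Jx = np.einsum('bij,bjk,bkl->bil', Ud, Jx, U)  # batched basis conversion
        Jy = np.einsum('bij,bjk,bkl->bil', Ud, Jy, U)
        dE2 = (E[:, None, :] - E[:, :, None])**2       # (E_m - E_n)^2 per batch member
        i = np.arange(E.shape[1])
        dE2[:, i, i] = np.inf                          # exclude the m = n terms
        num = np.imag(Jy * Jx.transpose(0, 2, 1))      # Im(Jy_nm * Jx_mn) per batch member
        occ = (E < E_F)[:, :, None]                    # theta(E_F - E_n)
        return float(np.sum(num / dE2 * occ))          # sum over batch, n, m

The batch axis is the one mapped onto the hardware vector registers, which is why $l_{B}$ is chosen equal to the vector length.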

In addition, to avoid degradation due to memory-bank conflicts, the matrix size is adjusted by padding. The benefits of the padding will be demonstrated in the next section.
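A conceptual sketch of the padding, under the assumption (ours, not stated in the original) that strides mapping repeatedly onto the same memory bank, typically powers of two, are what must be avoided: the batched arrays are allocated with a slightly enlarged leading dimension, and only the physical $N_{d}\times N_{d}$ block is used in the computation.

    import numpy as np

    Nd, l_B = 128, 256
    Nd_pad = Nd + 1 if Nd % 2 == 0 else Nd    # illustrative rule: make the stride odd so
                                              # consecutive rows land in different banks
    H_buf = np.zeros((l_B, Nd_pad, Nd_pad), dtype=np.complex128)  # padded storage
    H = H_buf[:, :Nd, :Nd]                    # computational view; the padded row and
                                              # column are never touched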

To end this section, we comment on the case of a large-scale system involving hundreds of atoms. In such a case, the matrix dimension $N_{d}$ is sufficiently large that parallelization of the matrix operations becomes effective. Because the required $N_{k}$ is in a tradeoff relationship with the unit-cell size, we can choose the best parallelization strategy depending on the problem. As large-scale systems are beyond the scope of this paper, we will not discuss them further.

3 Benchmark results

To examine the performance of the vectorized implementation, we conduct a demonstration on a vector machine, the NEC SX-Aurora TSUBASA, whose vector length is 256 [36]. For comparison, the baseline implementation is also tested on a CPU machine, a Xeon Gold, with 32-process parallelization by flat MPI. We are aware that comparison between different architectures is not straightforward, but here we compare single-node performance as a compromise measure. The computing and compiling environments are listed in Table 1.

In this demonstration, we measure the computation times to evaluate the AHC of bcc-Fe and its supercell structures. Wannier TB models are constructed from DFT calculations using the Quantum Espresso code [28] and the Wannier90 code [37]. We take the WANNIER-LINEAR-RESPONSE code [42] as the baseline implementation of the Kubo formula and vectorize it by the method introduced in Sec. 2.3.

Let us begin with the vector processing rate, a performance indicator defined as the proportion of vectorized processing in the total processing. According to the profiler, our vectorized code achieves sufficiently high scores, reaching 98-99% in all demonstrations.

Fig. 2 shows the computation times as a function of $N_{k}$. Here, we use the [1,1,2]-supercell, where $N_{d}=48$ since it has 4 atoms with 2 spins and 6 Wannier projections per atom. From this data, we can see that the results on the SX-Aurora are about twice as fast as those of the 32-process parallel execution on the Xeon Gold, indicating a significant acceleration due to the vectorized implementation. This trend becomes more pronounced as $N_{k}$ increases. We again emphasize that this comparison is only a single aspect and does not imply general superiority of the vector processor over the CPU. Rather, it is clear evidence of the usefulness of the vectorization technique for this task.

Next, Fig. 3 shows the computation times for the $N_{k}=150^{3}$ case as a function of $N_{d}$. Although the performance ratio varies with the matrix size, in all cases the SX-Aurora is more than 1.7 times faster than the Xeon Gold, and 2.17 times faster at maximum. Note that the performance ratio at $N_{d}=192$ is lower because this size happens to be especially favorable for the Xeon Gold, not because of a defect of the SX-Aurora. Together with the results in Fig. 2, we claim that the vectorized implementation shows significant speedups over a wide range of tasks on small-scale systems, which is a desirable feature in high-throughput computations.

Finally, we examine the effect of the padding on the performance. Fig. 4 compares the vectorized implementation with and without the padding, showing that the padding improves the performance by up to 1.5 times. The improvement becomes notable for $N_{d}\geq 96$ (8 atoms); that is, memory-bank conflicts cause serious performance degradation except in very small systems. Hence, padding is critically important in practical use.

4 Conclusion

In this study, we have proposed a method to accelerate the computation of the Kubo formula by using batched matrix operations optimized for vector processors. Our findings have led to the following insights:

First, the introduction of vectorization was demonstrated to accelerate the computation of the Kubo formula. Through benchmark comparisons between the vector-based NEC SX-Aurora TSUBASA and the scalar-based Xeon machines, we verified that the SX-Aurora delivers up to approximately 2.2 times faster node performance. This result indicates that our proposed method is effective and leads to increased data production in the context of high-throughput calculations.

Second, we observed a computational performance improvement due to padding. This result underscores the importance of avoiding memory-bank conflicts to improve computational efficiency.

Furthermore, our approach is not limited to the Kubo formula; it can be extended to various similar tasks relying on batched linear algebra operations. The vectorization approach could become a powerful alternative to MPI in such cases. In addition, the same strategy can be applied to GPGPUs as long as the memory consumption is acceptable. A GPGPU implementation is an interesting topic for future work.

Acknowledgement

Y.Y. and T.K. are grateful to Dr. Taizo Shibuya at NEC Corporation for his valuable suggestions.


Figure 1: Schematics of (left) the baseline implementation and (right) the vectorized implementation of the Kubo formula. Here, $f$ is a function of $\bm{k}$ and $\tilde{f}$ is a function of a list of $\bm{k}$, corresponding to those in Eq. (4).
Figure 2: (Color online) Computation times on the Xeon and the SX-Aurora versus the number of $\bm{k}$-points, $N_{k}$. The performance ratio is defined as (Xeon computing time)/(SX-Aurora computing time). Here, the [1,1,2]-supercell of bcc-Fe is used.
Figure 3: (Color online) Computation times on the Xeon and the SX-Aurora versus the matrix dimension, $N_{d}$. The performance ratio is defined as in Fig. 2. Here, $N_{k}=150^{3}$ is used.
Figure 4: (Color online) Computation times on the SX-Aurora with and without the padding versus the matrix dimension, $N_{d}$. The improvement ratio is defined as (computing time without padding)/(computing time with padding). Here, $N_{k}=150^{3}$ is used.
List 1: Baseline implementation

    ! $\bm{k}$-loop (MPI parallelization may be applied here)
    for $\bm{k}$ in {$\bm{k}_{1},\bm{k}_{2},\dots,\bm{k}_{N_{k}}$}
        ! Define Hamiltonian and current operators
        $\hat{H},\hat{J}_{\alpha},\hat{J}_{\beta}$ = define_matrices($\bm{k}$)
        ! Diagonalize the Hamiltonian
        ! ($\bm{E}$: eigenvalues, $\hat{U}$: eigenvectors)
        $\bm{E},\hat{U}$ = diagonalize($\hat{H}$)
        ! Convert $\hat{J}_{\alpha}$ into the eigenbasis as $\hat{J}^{\prime}=\hat{U}\hat{J}\hat{U}^{\dagger}$
        $\hat{J}^{\prime}_{\alpha}$ = convert($\hat{J}_{\alpha}$, $\hat{U}$); $\hat{J}^{\prime}_{\beta}$ = convert($\hat{J}_{\beta}$, $\hat{U}$)
        ! Evaluate AHC by Eq. (3)
        $\sigma_{\alpha\beta}$ += evaluate($\bm{E}$, $\hat{J}^{\prime}_{\alpha}$, $\hat{J}^{\prime}_{\beta}$)
    end for
List 2: Vectorized implementation

    ! Separate the $\bm{k}$-loop into batches of length $l_{B}$
    for $\bm{k}[:]$ in {$\bm{k}_{[1:l_{B}]},\bm{k}_{[l_{B}+1:2l_{B}]},\dots,\bm{k}_{[(N_{B}-1)l_{B}+1:N_{k}]}$}
        ! Define lists of Hamiltonian and current operators
        $\hat{H}[:],\hat{J}_{\alpha}[:],\hat{J}_{\beta}[:]$ = define_matrices($\bm{k}[:]$)
        ! Batched diagonalization of the Hamiltonians
        $\bm{E}[:],\hat{U}[:]$ = diagonalize($\hat{H}[:]$)
        ! Convert $\hat{J}_{\alpha}[:]$ into the eigenbasis by batched matrix products
        $\hat{J}^{\prime}_{\alpha}[:]$ = convert($\hat{J}_{\alpha}[:]$, $\hat{U}[:]$); $\hat{J}^{\prime}_{\beta}[:]$ = convert($\hat{J}_{\beta}[:]$, $\hat{U}[:]$)
        ! Evaluate AHC by Eq. (3) and aggregate the batch
        $\sigma_{\alpha\beta}$ += evaluate($\bm{E}[:]$, $\hat{J}^{\prime}_{\alpha}[:]$, $\hat{J}^{\prime}_{\beta}[:]$)
    end for
Table 1: Computing and compiling environments

                        Vector machine              Scalar machine
    Hardware (1 node)   VE Type-20B, 8 cores [35]   Xeon Gold 6362, 2 sockets * 16 cores
    Compiler            NEC SDK 4.0.0 [34]          Intel ifort 19.1.2.254
    MPI                 NEC MPI 3.0.0               Intel MPI ver. 2019
    Numerical library   NLC 2.3                     Intel MKL


References

  • [23] Ahmad Abdelfattah et al. “A Set of Batched Basic Linear Algebra Subprograms and LAPACK Routines” In ACM Transactions on Mathematical Software 47 Association for Computing Machinery, 2021 DOI: 10.1145/3431921
  • [24] Stefano Curtarolo et al. “AFLOWLIB.ORG: A distributed materials properties repository from high-throughput ab initio calculations” In Computational Materials Science 58, 2012, pp. 227–235 DOI: 10.1016/j.commatsci.2012.02.002
  • [25] Stefano Curtarolo et al. “The high-throughput highway to computational materials design” In Nature Materials 12, 2013, pp. 191–201 DOI: 10.1038/nmat3568
  • [26] Jack Dongarra et al. “Optimized Batched Linear Algebra for Modern Architectures” In Euro-Par 2017: Parallel Processing 10417 Springer International Publishing, 2017, pp. 511–522 DOI: 10.1007/978-3-319-64203-1_37
  • [27] Jack Dongarra et al. “The Design and Performance of Batched BLAS on Modern High-Performance Computing Systems” In Procedia Computer Science 108 Elsevier B.V., 2017, pp. 495–504 DOI: 10.1016/j.procs.2017.05.138
  • [28] Paolo Giannozzi et al. “QUANTUM ESPRESSO: a modular and open-source software project for quantum simulations of materials” In Journal of Physics: Condensed Matter 21, 2009, pp. 395502 DOI: 10.1088/0953-8984/21/39/395502
  • [29] Anubhav Jain et al. “A high-throughput infrastructure for density functional theory calculations” In Computational Materials Science 50, 2011, pp. 2295–2310 DOI: 10.1016/j.commatsci.2011.02.023
  • [30] Ryogo Kubo “Statistical-Mechanical Theory of Irreversible Processes. I. General Theory and Simple Applications to Magnetic and Conduction Problems” In Journal of the Physical Society of Japan 12, 1957, pp. 570–586 DOI: 10.1143/JPSJ.12.570
  • [31] Richard M. Martin “Electronic Structure” Cambridge University Press, 2004 DOI: 10.1017/CBO9780511805769
  • [32] Nicola Marzari et al. “Maximally localized Wannier functions: Theory and applications” In Reviews of Modern Physics 84 American Physical Society, 2012, pp. 1419–1475 DOI: 10.1103/RevModPhys.84.1419
  • [33] Naoto Nagaosa et al. “Anomalous Hall effect” In Reviews of Modern Physics 82, 2010, pp. 1539–1592 DOI: 10.1103/RevModPhys.82.1539
  • [34] “NEC SDK” URL: https://www.nec.com/en/global/solutions/hpc/sx/tools.html?
  • [35] “NEC SX VE Type-20B” URL: https://www.nec.com/en/global/solutions/hpc/sx/Vector-Engine-Types.html#20B
  • [36] “NEC SX Vector Supercomputer” URL: https://www.nec.com/en/global/solutions/hpc/sx/
  • [37] Giovanni Pizzi et al. “Wannier90 as a community code: New features and applications” In Journal of Physics Condensed Matter 32 Institute of Physics Publishing, 2020 DOI: 10.1088/1361-648X/ab51ff
  • [38] Akito Sakai et al. “Iron-based binary ferromagnets for transverse thermoelectric conversion” In Nature 581 Nature Research, 2020, pp. 53–57 DOI: 10.1038/s41586-020-2230-z
  • [39] Ilias Samathrakis et al. “Enhanced anomalous Nernst effects in ferromagnetic materials driven by Weyl nodes” In Journal of Physics D: Applied Physics 55 IOP Publishing Ltd, 2022 DOI: 10.1088/1361-6463/ac3351
  • [40] B. T. Smith et al. “Matrix Eigensystem Routines — EISPACK Guide” Springer Berlin Heidelberg, 1976 DOI: 10.1007/3-540-07546-1
  • [41] Xiaoyu Yang et al. “MatCloud: A high-throughput computational infrastructure for integrated management of materials simulation, data and resources” In Computational Materials Science 146 Elsevier B.V., 2018, pp. 319–333 DOI: 10.1016/j.commatsci.2018.01.039
  • [42] Jakub Železný “zeleznyj/wannier-linear-response” URL: https://bitbucket.org/zeleznyj/wannier-linear-response/src/master/
  • [43] Jakub Železný et al. “High-throughput study of the anomalous Hall effect” In npj Computational Materials 9, 2023, pp. 151 DOI: 10.1038/s41524-023-01113-5
  • [44] Yang Zhang et al. “Different types of spin currents in the comprehensive materials database of nonmagnetic spin Hall effect” In npj Computational Materials 7, 2021 DOI: 10.1038/s41524-021-00635-0