
ANKH: A Generalized \mathcal{O}(N) Interpolated Ewald Strategy for Molecular Dynamics Simulations

Igor Chollet LAGA, Université Sorbonne Paris Nord, UMR 7539, Villetaneuse, France and LCT, Sorbonne Université, UMR 7616, Paris, France. [email protected]    Louis Lagardère LCT, Sorbonne Université, UMR 7616, Paris, France. [email protected]    Jean-Philip Piquemal LCT, Sorbonne Université, UMR 7616, Paris, France. [email protected]
Abstract

To evaluate electrostatic interactions, molecular dynamics (MD) simulations rely on the Particle Mesh Ewald (PME) method, an \mathcal{O}(N\log(N)) algorithm that uses Fast Fourier Transforms (FFTs), or, alternatively, on \mathcal{O}(N) Fast Multipole Method (FMM) approaches. However, the limited scalability of the FFTs remains a strong bottleneck for large-scale PME simulations on supercomputers. In contrast, FFT-free FMM techniques can deal efficiently with such systems, but they fail to reach PME performance on small- to medium-size systems, limiting their real-life applicability. We propose ANKH, a strategy grounded in interpolated Ewald summations and designed to remain efficient and scalable for systems of any size. The method is generalized to distributed point multipoles, and hence to induced dipoles, which makes it suitable for high-performance simulations using new-generation polarizable force fields towards exascale computing.

1 Introduction

Performing Molecular Dynamics (MD) simulations requires numerically solving Newton’s equations of motion for a system of interacting particles. Modern simulations rely on Molecular Mechanics to evaluate the different physical interactions acting on an ensemble of particles. These approaches are very diverse and range from classical force fields (FFs) [10, 8, 44] to more evolved polarizable approaches embodying many-body effects (PFFs) [36, 29, 24]. In any case, one needs to compute the electrostatic interactions that are associated with the Coulomb energy and are an essential contribution to the system’s total potential energy. To efficiently compute these quantities, MD software mainly exploits grid-based methods such as the Particle Mesh Ewald (PME) approach, an \mathcal{O}(N\log(N)) algorithm that uses Fast Fourier Transforms (FFTs). While PME is extremely efficient for small to medium system sizes, for very large ensembles of particles the limited scalability of the FFTs is a major bottleneck for large-scale simulations on supercomputers. Historically, Fast Multipole Methods (FMM) have been considered as good \mathcal{O}(N) candidates to overcome such limitations. The FMM strategy is FFT-free and capable of efficiently dealing with such very large systems. Still, PME performance is yet to be reached for small- to medium-size systems, limiting the real-life applicability of FMMs. In this paper, we introduce ANKH, a new general strategy for the fast evaluation of the electrostatic energy using periodic boundary conditions and a density of charge due to point multipoles:

\mathcal{E}:=\frac{1}{2}\sum_{\mathbf{t}\in 2r_{B}\mathbb{Z}^{3}}\sum_{\mathbf{x}\in X}\sum_{\mathbf{y}\in Y}\mathfrak{D}_{\mathbf{x}}\mathfrak{D}_{\mathbf{y}}\left(\frac{1}{|\mathbf{x}-\mathbf{y}+\mathbf{t}|}\right), (1)

where X,Y\subset B\subset\mathbb{R}^{3} are two point clouds (of atoms) that can be equal, \mathfrak{D}_{\mathbf{u}}:=q_{\mathbf{u}}+\mu_{\mathbf{u}}\cdot\nabla+\Theta_{\mathbf{u}}:\nabla^{2}, q_{\mathbf{u}} is the charge of the atom \mathbf{u}, \mu_{\mathbf{u}} its dipole moment, \Theta_{\mathbf{u}} its quadrupole moment, \nabla and \nabla^{2} respectively denote the gradient and Hessian operators, and B is the simulation box of radius r_{B}. Each point cloud X and Y is composed of N atoms, named particles to fit the usual notation of hierarchical methods [20]. Notice that we restrict ourselves to charges, dipoles and quadrupoles for the sake of clarity, but all the theory presented in this paper can be straightforwardly extended to any multipole order and therefore to induced dipoles, providing solutions for polarizable force fields.
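As a sanity check on the operator notation, restricting \mathfrak{D}_{\mathbf{u}} to charges (\mu_{\mathbf{u}}=0, \Theta_{\mathbf{u}}=0) recovers the familiar periodic Coulomb energy:

\mathcal{E}_{qq}=\frac{1}{2}\sum_{\mathbf{t}\in 2r_{B}\mathbb{Z}^{3}}\sum_{\mathbf{x}\in X}\sum_{\mathbf{y}\in Y}\frac{q_{\mathbf{x}}q_{\mathbf{y}}}{|\mathbf{x}-\mathbf{y}+\mathbf{t}|},

while the dipolar and quadrupolar contributions are obtained by letting \mu_{\mathbf{u}}\cdot\nabla and \Theta_{\mathbf{u}}:\nabla^{2} act on the kernel 1/|\mathbf{x}-\mathbf{y}+\mathbf{t}| before evaluation.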

A naive computation of the energy through direct implementation of Eq. (1) faces numerical convergence issues, requiring reformulation. The literature widely exploits Ewald summation techniques [14, 17, 43, 33, 41] to provide numerically convergent expressions for Eq. (1). The previously mentioned limitations of 3D-FFT-based techniques motivate the development of alternatives [39, 38, 23], also based on Ewald summation, such as fast summation schemes applied to Eq. (1) directly [31, 27, 25] and to molecular dynamics, mainly exploiting the Fast Multipole Method [20] (FMM) and cutoff approaches [7]. The latter has many advantages since it provides linear complexity and high scalability, despite the error that may occur when applying a cutoff on images of B. Convergent alternatives [35, 11] also extend the FMM for the Coulomb potential to the periodic case when \mathfrak{D}_{\mathbf{x}} and \mathfrak{D}_{\mathbf{y}} are restricted to charges. Other kernels also benefit from periodic extensions [32].

Our ANKH approach aims at providing a theoretical framework well-suited for linear-complexity and scalable FMM-based energy computations, built on Ewald summation. The methodology presented here introduces various novelties, realized in two variants that exploit different ways of solving the N-body problems appearing in Ewald summation. Among them, we introduce:

  • a new interpolated Ewald summation, leading to numerical schemes to handle differential operators of multipolar (polarizable) molecular dynamics, accelerated through two different numerical methods,

  • alternative techniques to account for periodicity,

  • explicit formulae to handle the mutual interactions.

Due to the positioning of our work, this article deals with mathematical and computer science topics in the precise context of molecular dynamics. For the sake of clarity, we provide both the theory and the algorithms, which should minimize the reader’s effort to implement our method. However, efficient implementations rely on various optimizations and some details are beyond the scope of this paper. This is why we also provide our code, which was used to generate the results presented in the following.

The paper is organized as follows. We first briefly review in Sect. 2 the interpolation-based Fast Multipole Method (referred to as IBFMM in the remainder), then we introduce in Sect. 3 ANKH-FMM, our new method for fast Ewald summation. In Sect. 4 we introduce an entirely new approach named ANKH-FFT that overcomes some limitations of ANKH-FMM regarding modern High Performance Computing (HPC) architectures (at the cost of a slightly higher complexity than ANKH-FMM). We then present in Sect. 5 numerical results demonstrating the performance of our method as well as a comparison with a production code for molecular dynamics (namely Tinker-HP [26]).

2 Interpolation-based FMM

This section is dedicated to the presentation of the interpolation-based FMM. We first recall in Sect. 2.1 the reasons for the efficiency of the original 3D FMM (for the Coulomb potential) as well as the algorithm in the case of quasi-uniform particle distributions. Then, in Sect. 2.2, we briefly describe one of the main kernel-independent FMMs, namely the interpolation-based FMM, providing the corresponding operator formulas.

2.1 Generalities of FMMs

Fast Multipole Methods [20, 21, 18] (FMM) are a family of fast divide-and-conquer strategies to compute N-body problems in linear or linearithmic complexity. Important efforts have been made in the literature to incorporate the 3D FMM (i.e. for the Coulomb potential) into molecular dynamics simulation codes [37, 25, 27]. Such a FMM scheme is based on a separable spherical harmonic expansion of the Coulomb potential \frac{1}{|\mathbf{x}-\mathbf{y}|} written as:

\sum_{l=0}^{+\infty}\frac{4\pi r^{l}_{<}}{(2l+1)r^{l+1}_{>}}\sum_{m=-l}^{l}Y^{m}_{l}(\theta_{\mathbf{x}},\phi_{\mathbf{x}})Y^{-m}_{l}(\theta_{\mathbf{y}},\phi_{\mathbf{y}}), (2)

where \mathbf{x}=(r_{\mathbf{x}},\theta_{\mathbf{x}},\phi_{\mathbf{x}}), \mathbf{y}=(r_{\mathbf{y}},\theta_{\mathbf{y}},\phi_{\mathbf{y}}) in spherical coordinates, r_{<}:=\min\{r_{\mathbf{x}},r_{\mathbf{y}}\}, r_{>}:=\max\{r_{\mathbf{x}},r_{\mathbf{y}}\} and Y^{m}_{l} refers to the spherical harmonic of order (l,m). Restricting to subsets of particles \mathbf{x}\in t\subset\mathbb{R}^{3},\mathbf{y}\in s\subset\mathbb{R}^{3} such that \frac{radius(t)+radius(s)}{distance(t,s)}\leq\nu, with \nu>0 sufficiently small, one may approximate Eq. (2) by truncating this series at a finite order P. This allows one to derive a fast summation scheme for any \mathbf{x}\in X, for problems consisting in computing \phi such that

\phi(\mathbf{x}):=\sum_{\mathbf{y}\in Y}\frac{q(\mathbf{y})}{|\mathbf{x}-\mathbf{y}|} (3)

where q(\mathbf{y}) is a charge associated with the particle \mathbf{y}. The problem in Eq. (3) is referred to as an N-body problem and such t and s are named well-separated target (t) and source (s) cells, where a cell denotes in this paper a cubical subset of \mathbb{R}^{3}. These cells are obtained in practice through a recursive splitting of the computational box B into smaller boxes. Assuming that B is a cube, one may divide it into 8 smaller cubes and repeat this procedure until the induced cubes (the cells) are sufficiently small. A tree whose nodes are cells and such that the daughters of a given cell are the non-empty cubes coming from its decomposition is named an octree. The root cell is considered to belong to octree level 0 and any non-root cell belongs to level E+1, where E denotes the level of its mother. In the case of perfect octrees (i.e. with the maximal number of daughters per non-leaf node, that is 8 daughters), we consider that two cells at the same tree level with strictly positive distance (i.e. non-neighbor cells) are well-separated.
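As an illustration of this recursive splitting, the following minimal C++ sketch (our own simplified illustration with hypothetical types, not the production code) builds such an octree by assigning particles to octants until each leaf holds few enough particles:

#include <array>
#include <memory>
#include <vector>

struct Vec3 { double x, y, z; };

struct Cell {
    Vec3 center;                              // center of the cubical cell
    double radius;                            // half side length
    int level;                                // 0 for the root
    std::vector<int> particles;               // indices into the point cloud
    std::array<std::unique_ptr<Cell>, 8> daughters;
};

// Recursively split a cell into its (at most) 8 non-empty daughters until it
// holds few enough particles; on quasi-uniform distributions this yields a
// perfect octree.
void subdivide(Cell& c, const std::vector<Vec3>& pts, std::size_t maxPerLeaf) {
    if (c.particles.size() <= maxPerLeaf) return; // c becomes a leaf
    for (int d = 0; d < 8; ++d) {
        Vec3 ctr{c.center.x + ((d & 1) ? 0.5 : -0.5) * c.radius,
                 c.center.y + ((d & 2) ? 0.5 : -0.5) * c.radius,
                 c.center.z + ((d & 4) ? 0.5 : -0.5) * c.radius};
        auto child = std::make_unique<Cell>();
        child->center = ctr;
        child->radius = 0.5 * c.radius;
        child->level  = c.level + 1;
        for (int i : c.particles) {           // assign particles by octant
            const Vec3& p = pts[i];
            bool in = ((p.x >= c.center.x) == bool(d & 1)) &&
                      ((p.y >= c.center.y) == bool(d & 2)) &&
                      ((p.z >= c.center.z) == bool(d & 4));
            if (in) child->particles.push_back(i);
        }
        if (!child->particles.empty()) {
            subdivide(*child, pts, maxPerLeaf);
            c.daughters[d] = std::move(child);
        }
    }
    c.particles.clear();                      // particles now live in the leaves
}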

Due to the separability of Eq. (2) in the variables \mathbf{x} and \mathbf{y}, for any \mathbf{x}\in X_{|t} the sum over \mathbf{y}\in Y_{|s} (the restriction of Y to the particles in s) can be switched to the terms depending only on \mathbf{y}. The expression in Eq. (2) thus results [21] in a three-step summation scheme expressing Eq. (3) in the form

\sum_{k}U_{k}[t]^{*}(\mathbf{x})\sum_{l}C_{k,l}[t,s]\sum_{\mathbf{y}\in Y_{|s}}V_{l}[s](\mathbf{y})q(\mathbf{y}) (4)

or equivalently in matrix form U^{*}[t]C[t,s]V[s]q_{|s}, where q_{|s} refers to the vector of charges of particles in s. The result \mathcal{M}_{s} of the product V[s]q_{|s} is named the multipole expansion of the source cell s. The matrix V[s] involved in this product is named the Particle-To-Multipole operator (shortly denoted by P2M) since it maps information associated with particles (i.e. here the charges associated with atoms) to the multipole expansion of the cell they belong to. This multipole expansion is transformed into a local one \mathcal{L}_{t} by means of a product with the Multipole-To-Local operator (shortly denoted by M2L) C[t,s]. Similarly, the product with U^{*} is referred to as the Local-To-Particle operator (L2P). The efficiency of the FMM algorithm is further guaranteed by two additional operators based on the idea that the multipole expansion of a given non-leaf cell can be approximated by combining its daughters’ multipole expansions by means of a Multipole-To-Multipole operator (M2M), and that the local expansion in a given non-root cell can be approximated from its mother’s local expansion thanks to a Local-To-Local operator (L2L). Finally, two leaf cells that are not well-separated still need to interact, but without expansions. This is done by applying the Particle-To-Particle operator (P2P), i.e. nothing else than a direct computation involving only the particles of these two cells. The relations between all these operators along the tree structure are depicted in Fig. 1.

Figure 1: 2D schematic representation of FMM operators. A source cell s at level 2 (orange) interacts with well-separated target cells t at level 2 (purple) through M2L operators (green arrows). At level 3, the daughters (red) of s aggregate their multipole expansions through M2M operators (orange arrows) to form the multipole expansion in s. Each t transfers the influence of s to its daughters through 4 L2L operators (purple arrows). Red and blue arrows in source and target cells respectively represent the action of the P2M and L2P operators in the corresponding cell. Grey cells at level 3 correspond to the daughters of cells that are not well-separated from s at level 2.

In the case of quasi-uniform particle distributions (i.e. with approximately the same density of particles in the whole simulation box), octrees can be considered as perfect. In this situation, the FMM algorithm is easy to state (it is usually referred to as the uniform FMM algorithm in the literature), according to the definition of the different FMM operators. This is summarized in Alg. 1, where we use the notation V[s,s^{\prime}] to denote the M2M operator between the source cell s and one of its daughters s^{\prime}; U^{*}[t^{\prime},t] to denote the L2L operator between the target cell t and one of its daughters t^{\prime}; for any target cell t, \Lambda(t) the interaction list of t, that is the set of source cells s that are well-separated from t and whose ancestors in the octree are not well-separated from any ancestor of t; \mathscr{T} refers to the octree; \mathcal{L}(\mathscr{T}) refers to the set of leaves of \mathscr{T}; and K(\mathbf{x},\mathbf{y})=\frac{1}{|\mathbf{x}-\mathbf{y}|}.

Algorithm 1 Uniform FMM algorithm
// Step 1: compute multipole expansions
for each leaf cell s\in\mathcal{L}(\mathscr{T}) do
     // Compute multipole expansion \mathcal{M}_{s} in s using P2M
     \mathcal{M}_{s}=V[s]q_{|s}
end for
for each non leaf level e starting from the deepest one do
     for each cell s at level e do
          // Aggregate multipole expansions through M2M
          \mathcal{M}_{s}=\mathbf{0}
          for each daughter s^{\prime} of s do
               \mathcal{M}_{s}=\mathcal{M}_{s}+V[s,s^{\prime}]\mathcal{M}_{s^{\prime}}
          end for
     end for
end for
// Step 2: compute local expansions
for each target cell t do
     // Add far contribution using M2L
     \mathcal{L}_{t}=\mathbf{0}
     for each source cell s\in\Lambda(t) do
          \mathcal{L}_{t}=\mathcal{L}_{t}+C[t,s]\mathcal{M}_{s}
     end for
end for
for each level e starting from the root do
     // Scatter local expansions using L2L
     for each target cell t at level e do
          for each daughter t^{\prime} of t do
               \mathcal{L}_{t^{\prime}}=\mathcal{L}_{t^{\prime}}+U^{*}[t^{\prime},t]\mathcal{L}_{t}
          end for
     end for
end for
// Step 3: convert local expansions
for each leaf cell t\in\mathcal{L}(\mathscr{T}) do
     // Apply L2P
     \phi_{t}=U^{*}[t]\mathcal{L}_{t}
end for
// Step 4: add near contribution
for each leaf cell t\in\mathcal{L}(\mathscr{T}) do
     // Apply all P2P involving t
     for each source cell s adjacent to t do
          for \mathbf{x}\in X_{|t} do
               \phi(\mathbf{x})=\phi_{t}(\mathbf{x})
               for \mathbf{y}\in Y_{|s} do
                    \phi(\mathbf{x})=\phi(\mathbf{x})+K(\mathbf{x},\mathbf{y})q(\mathbf{y})
               end for
          end for
     end for
end for

Alg. 1 results in \mathcal{O}(N) complexity [21] with a prefactor depending on the user-required precision. The notations of Alg. 1 directly provide a way of approximating the influence \phi_{t^{\prime},s^{\prime}} of s^{\prime}, a daughter of s, on t^{\prime}, a daughter of t, provided that t and s are well-separated, in matrix form:

\phi_{t^{\prime},s^{\prime}}\approx\underbrace{U^{*}[t^{\prime}]}_{L2P}\underbrace{U^{*}[t^{\prime},t]}_{L2L}\underbrace{C[t,s]}_{M2L}\underbrace{V[s,s^{\prime}]}_{M2M}\underbrace{V[s^{\prime}]}_{P2M}q_{|s^{\prime}}. (5)

The latter can be interpreted as a low-rank matrix approximation when seeing Eq. (3) as a matrix-vector product whose matrix entries correspond to evaluations of the Coulomb potential on pairs of target and source particles. Notice that the interaction between equally located particles is not well-defined; we consider in this article that such interactions are simply equal to 0.

In terms of parallel implementation, the 3D FMM has shown its ability to scale in a distributed-memory context. Combined with its linear complexity, this makes the method particularly attractive for MD simulations. However, a few limitations remain: the presentation we made only takes charges into account and considers the non-periodic case. The latter can be addressed by means of the periodic FMM [11].

2.2 Interpolation-based kernel-independent scheme

Actually, a similar scheme can be derived for any kernel K(\mathbf{x},\mathbf{y}) (not only the Coulomb one) that satisfies a particular assumption, namely asymptotically smooth behavior [9]. There exist different approaches to obtain a fast separable formula in a kernel-independent way [46, 18]. Among them, we are especially interested in the interpolation-based FMM (IBFMM) because of its flexibility, its well-documented error bounds, as well as the particular form of the approximated expansion of K it provides. Indeed, denoting by S_{k}[t] (resp. S_{l}[s]) a multivariate Lagrange polynomial associated with the interpolation node \mathbf{x}_{k}^{t} (resp. \mathbf{y}_{l}^{s}) in t (resp. s), a Lagrange interpolation of K on t\times s writes:

K(\mathbf{x},\mathbf{y})\approx\sum_{k}S_{k}[t](\mathbf{x})\sum_{l}K(\mathbf{x}_{k}^{t},\mathbf{y}_{l}^{s})S_{l}[s](\mathbf{y}) (6)

for any (\mathbf{x},\mathbf{y})\in t\times s. Eq. (6) clearly exhibits a separable expansion of K in the \mathbf{x} and \mathbf{y} variables. Hence, the restriction of Eq. (3) to particles in t\times s (i.e. to X_{|t} and Y_{|s}, replacing the Coulomb kernel by a generic kernel K) can be approximated by:

\underbrace{\sum_{k}S_{k}[t](\mathbf{x})}_{\text{L2P on }t}\underbrace{\sum_{l}K(\mathbf{x}_{k}^{t},\mathbf{y}_{l}^{s})}_{\text{M2L between }t\text{ and }s}\underbrace{\sum_{\mathbf{y}\in Y_{|s}}S_{l}[s](\mathbf{y})q(\mathbf{y})}_{\text{P2M on }s} (7)

by analogy with Eq. (4). This identifies most of the FMM operators. Moreover, interpreting multipole expansions generated this way as charges associated with interpolation nodes (now seen as particles), one obtains a new N-body problem with the same kernel K. Hence, this procedure can be repeated all along the tree structure in order to approximate multipole expansions in the non-leaf cells (which is the analogue of the M2M operator). In the same way, evaluating the interpolant of a non-leaf cell at the interpolation nodes of its daughters allows one to derive an analogue of the L2L operator. To be more precise, we assume that the Lagrange interpolation applied on a cell c uses the interpolation grid \Xi_{c}=\{\mathbf{y}^{c}_{0},...,\mathbf{y}^{c}_{P-1}\} with P interpolation nodes over c. In practice, \Xi_{c} is chosen as a product of one-dimensional interpolation nodes (such as Chebyshev nodes [18]). Hence, for any source cell s, the IBFMM M2M operator between s and s^{\prime}, with s^{\prime} a daughter of s, is defined as

V[s,s^{\prime}]:=\begin{bmatrix}S_{0}[s](\mathbf{y}^{s^{\prime}}_{0})&\ldots&S_{0}[s](\mathbf{y}^{s^{\prime}}_{P-1})\\ \vdots&\ddots&\vdots\\ S_{P-1}[s](\mathbf{y}^{s^{\prime}}_{0})&\ldots&S_{P-1}[s](\mathbf{y}^{s^{\prime}}_{P-1})\end{bmatrix}. (8)

Using Eq. (6), we can identify that the IBFMM L2P operator on a leaf cell c is the adjoint of the IBFMM P2M operator on this same cell. Hence, for any non-leaf cell c, the IBFMM L2L operator between c and a daughter c^{\prime} of c is the adjoint of the IBFMM M2M operator between c and c^{\prime}, that is:

U^{*}[c]=V[c]^{*},\quad U^{*}[c^{\prime},c]=V[c,c^{\prime}]^{*}. (9)

As for the classical FMM algorithm, the IBFMM reaches linear complexity with respect to the number of particles in the system. In the remainder of this article, we consider that K is translationally and rotationally invariant, which is always verified in our applications. A schematic view of the IBFMM is provided in Fig. 2.
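To fix ideas, the P2M operator of Eq. (7) can be sketched in a few lines. The following self-contained C++ illustration (hypothetical helper names, charges only, particle coordinates assumed already rescaled to [-1,1] in the cell) evaluates \mathcal{M}_{s}=V[s]q_{|s} from products of 1D Lagrange polynomials:

#include <vector>

// 1D Lagrange basis S_j evaluated at x, for nodes r[0..L-1]:
// S_j(x) = prod_{i != j} (x - r[i]) / (r[j] - r[i]).
double lagrange1d(const std::vector<double>& r, int j, double x) {
    double v = 1.0;
    for (std::size_t i = 0; i < r.size(); ++i)
        if ((int)i != j) v *= (x - r[i]) / (r[j] - r[i]);
    return v;
}

// P2M on a source cell: M[l] = sum_y S_l[s](y) q(y), where the multivariate
// polynomial S_{iL^2+jL+k} is the product of 1D polynomials per direction.
std::vector<double> p2m(const std::vector<double>& r,     // 1D nodes, size L
                        const std::vector<double>& xs,
                        const std::vector<double>& ys,
                        const std::vector<double>& zs,
                        const std::vector<double>& q) {
    const int L = (int)r.size();
    std::vector<double> M(L * L * L, 0.0);
    for (std::size_t n = 0; n < q.size(); ++n)
        for (int i = 0; i < L; ++i)
            for (int j = 0; j < L; ++j)
                for (int k = 0; k < L; ++k)
                    M[(i * L + j) * L + k] += lagrange1d(r, i, xs[n])
                                            * lagrange1d(r, j, ys[n])
                                            * lagrange1d(r, k, zs[n]) * q[n];
    return M;
}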

Figure 2: Schematic representation of low-rank approximations obtained through interpolation over 1D quasi-uniform particle distributions. Perfect trees (binary trees here) are represented above (sources) and on the left (targets) of the matrix representation of the far-field compression. There are admissible cell pairs (t,s) only at levels 2 (big dark grey blocks) and 3 (small dark grey blocks). In each of the corresponding blocks, we draw its low-rank approximation using red for the source polynomials (i.e. the V[s] matrix), blue for the target polynomials (i.e. the U^{*}[t] matrix) and purple for the small square matrices C[t,s]. Non-admissible cell pairs are represented in light grey.

3 ANKH-FMM

Numerical issues arise when dealing with Eq. (1) since the series is only conditionally convergent, possibly preventing practical convergence. The Ewald summation technique [14] transforms Eq. (1) into \mathcal{E}=\mathcal{E}_{real}+\mathcal{E}_{rec}+\mathcal{E}_{self} where

\mathcal{E}_{real} := \sum_{\mathbf{t}}\sum_{\mathbf{x}}\sum_{\mathbf{y}} \mathfrak{D}_{\mathbf{x}}\mathfrak{D}_{\mathbf{y}} \left(\frac{\mathrm{erfc}\left(\xi|\mathbf{x}-\mathbf{y}+\mathbf{t}|\right)}{|\mathbf{x}-\mathbf{y}+\mathbf{t}|}\right), (10)
\mathcal{E}_{rec} := \frac{1}{2\pi|B|(2r_{B})^{2}} \sum_{\mathbf{m}\neq\mathbf{0}} \frac{e^{-2\left(\frac{\pi r_{B}}{\xi}\right)^{2}\mathbf{m}\cdot\mathbf{m}}}{\mathbf{m}\cdot\mathbf{m}} S(\mathbf{m})S(-\mathbf{m}),
S(\mathbf{m}) := \sum_{\mathbf{x}} e^{i\frac{\pi}{r_{B}}\mathbf{m}\cdot\mathbf{x}} \Big(q_{\mathbf{x}} + 2i\pi r_{B}\,\mu_{\mathbf{x}}\cdot\mathbf{m} - \left(2\pi r_{B}\right)^{2}\,\Theta_{\mathbf{x}}:\left(\mathbf{m}\mathbf{m}^{T}\right)\Big),
\mathcal{E}_{self} := -\frac{\xi}{\sqrt{\pi}} \sum_{\mathbf{x}} \Big(q_{\mathbf{x}}^{2} + \frac{2\xi^{2}}{3}\mu_{\mathbf{x}}\cdot\mu_{\mathbf{x}} + \frac{8\xi^{4}}{5}\Theta_{\mathbf{x}}:\Theta_{\mathbf{x}}\Big),

in which all the series are absolutely convergent, \xi\in\mathbb{R}^{+*} and \mathrm{erfc}(x)=1-\frac{2}{\sqrt{\pi}}\int_{0}^{x}e^{-t^{2}}dt denotes the complementary error function. The main point is that \mathrm{erfc}(\xi r) quickly decays in r for sufficiently large \xi, implying that only a small number of images (i.e. of different \mathbf{t}'s) in \mathcal{E}_{real} have to be considered to reach a given accuracy. Moreover, applying a cutoff of radius \delta on the sum \sum_{\mathbf{x}}\sum_{\mathbf{y}} transforms it into \sum_{\mathbf{x}}\sum_{\mathbf{y}\in\mathcal{B}(\mathbf{x},\delta)}, accelerating its evaluation, using the notation \mathcal{B}(\mathbf{x},\delta):=\{\mathbf{z}\in\mathbb{R}^{3}\,|\,|\mathbf{z}-\mathbf{x}|\leq\delta\}. \mathcal{E}_{rec} is also a quickly converging sum and only a limited number of \mathbf{m}'s have to be considered to reach a given accuracy, provided that \xi is sufficiently small. Usually, \xi is chosen to minimize the evaluation cost according to the possible cutoffs, balancing the computation costs of \mathcal{E}_{real} and \mathcal{E}_{rec}. The Particle Mesh Ewald [14] (PME) algorithm uses Fast Fourier Transforms (FFTs) in the computation of the S(\mathbf{m})'s, further reducing the complexity.

3.1 A new alternative to PME

On the one hand, the parallelization of the PME algorithm in distributed memory can be a practical bottleneck due to the scalability of the FFT [4]. On the other hand, there exist fast scalable distributed-memory methods for the evaluation of N-body problems involving asymptotically smooth kernels, such as the (IB)FMM (see Sect. 2.2). We thus propose a new alternative to PME, based on a very simple idea. Indeed, the kernel

H(\mathbf{x},\mathbf{y}):=\frac{\mathrm{erfc}\left(\xi|\mathbf{x}-\mathbf{y}|\right)}{|\mathbf{x}-\mathbf{y}|}, (11)

is asymptotically smooth [5], as are all its derivatives. For sufficiently large p, thanks to the absolute convergence of the terms in Eq. (10), we have

\mathcal{E}_{real} \approx \sum_{\mathbf{t}\in 2r_{B}Z_{p}}\sum_{\mathbf{x}}\sum_{\mathbf{y}} \mathfrak{D}_{\mathbf{x}}\mathfrak{D}_{\mathbf{y}} \left(\frac{\mathrm{erfc}\left(\xi|\mathbf{x}-\mathbf{y}+\mathbf{t}|\right)}{|\mathbf{x}-\mathbf{y}+\mathbf{t}|}\right) = \sum_{\mathbf{t}\in 2r_{B}Z_{p}}\sum_{\mathbf{x}}\sum_{\mathbf{y}} \mathfrak{D}_{\mathbf{x}}\mathfrak{D}_{\mathbf{y}} H(\mathbf{x},\mathbf{y}+\mathbf{t}) =: \mathcal{\tilde{E}}_{real}, (12)

with Z_{p}:=\big\{(z_{0},z_{1},z_{2})\in\mathbb{Z}^{3}\,|\,|z_{k}|\leq p,\ k\in[\![0,2]\!]\big\} and p a parameter fixed by the user.

Our approach is thus very simple: we consider a small \xi (typically \xi=0.01), so that \mathcal{E}_{rec}\approx 0 and can be neglected for our targeted accuracies. In this case, the counterpart lies in the computation of \mathcal{\tilde{E}}_{real}, which requires a relatively large number of periodic images (i.e. a large p) to reach numerical convergence. Since \mathcal{E}_{self} involves \mathcal{O}(N) flops (and is embarrassingly parallel), only \mathcal{\tilde{E}}_{real} has a correspondingly intensive computational cost. Fortunately, Eq. (12) allows the computation of \tilde{\mathcal{E}}_{real} to be handled through fast methods for N-body problems with kernel \mathfrak{D}_{\mathbf{x}}\mathfrak{D}_{\mathbf{y}}H(\mathbf{x},\mathbf{y}).
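For illustration, the screened kernel H of Eq. (11) is direct to implement with the standard library; a minimal sketch (our own, not the ANKH code) reads:

#include <cmath>

// Screened Coulomb kernel H(x, y) = erfc(xi * r) / r of Eq. (11).
// Returns 0 for coincident points, following the convention of Sect. 2.1.
double kernelH(const double x[3], const double y[3], double xi) {
    const double dx = x[0] - y[0], dy = x[1] - y[1], dz = x[2] - y[2];
    const double r = std::sqrt(dx * dx + dy * dy + dz * dz);
    if (r == 0.0) return 0.0;          // self-interaction set to zero
    return std::erfc(xi * r) / r;      // erfc decay controls the real-space range
}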

3.2 Reformulation

In this section, we consider that p=0. The extension to other values of p is discussed in Sect. 3.3. Let \mathcal{\tilde{E}}_{real}(t,s):=\sum_{\mathbf{x}\in t}\sum_{\mathbf{y}\in s}\mathfrak{D}_{\mathbf{x}}\mathfrak{D}_{\mathbf{y}}H(\mathbf{x},\mathbf{y}), where t,s\in\mathcal{L}(\mathscr{T}). We can now divide \mathcal{\tilde{E}}_{real} into two parts:

\underbrace{\sum_{\substack{t,s\in\mathcal{L}(\mathscr{T})\\ dist(t,s)=0}}\mathcal{\tilde{E}}_{real}(t,s)}_{=:\mathcal{N}_{real}}+\underbrace{\sum_{\substack{t,s\in\mathcal{L}(\mathscr{T})\\ dist(t,s)\neq 0}}\mathcal{\tilde{E}}_{real}(t,s)}_{=:\mathcal{F}_{real}}

where dist(t,s) denotes the distance between t and s. Hence, by increasing the octree depth, one decreases the number of interactions computed inside \mathcal{N}_{real} (near field) while increasing the number of interactions in \mathcal{F}_{real} (far field).

Each pair (t,s)\in\mathcal{L}(\mathscr{T}) such that dist(t,s)\neq 0 is actually a well-separated pair of cells. Hence, by means of multivariate polynomial interpolation on t and s as in Eq. (6), one may accurately approximate \mathcal{\tilde{E}}_{real}(t,s) as:

\sum_{\mathbf{x}\in t}\sum_{\mathbf{y}\in s}\mathfrak{D}_{\mathbf{x}}\mathfrak{D}_{\mathbf{y}}\left(\sum_{k}S_{k}[t](\mathbf{x})\sum_{l}H(\mathbf{x}_{k}^{t},\mathbf{y}_{l}^{s})S_{l}[s](\mathbf{y})\right) (13)
= \sum_{k}\underbrace{\left(\sum_{\mathbf{x}\in t}\mathfrak{D}_{\mathbf{x}}S_{k}[t](\mathbf{x})\right)}_{=:\mathcal{Q}_{t}(\mathbf{x}_{k}^{t})}\sum_{l}H(\mathbf{x}_{k}^{t},\mathbf{y}_{l}^{s})\underbrace{\left(\sum_{\mathbf{y}\in s}\mathfrak{D}_{\mathbf{y}}S_{l}[s](\mathbf{y})\right)}_{=:\mathcal{Q}_{s}(\mathbf{y}_{l}^{s})}
= \sum_{k}\mathcal{Q}_{t}(\mathbf{x}_{k}^{t})\sum_{l}H(\mathbf{x}_{k}^{t},\mathbf{y}_{l}^{s})\mathcal{Q}_{s}(\mathbf{y}_{l}^{s}).

These \mathcal{Q}_{c}(\mathbf{u}), with \mathbf{u} any interpolation node in c, can be interpreted as modified charges associated with this interpolation node, which can itself be seen as a particle. Hence, one can perform an IBFMM in which the multipole/local expansions are computed using the modified charges, while the near field (i.e. \mathcal{N}_{real}) is evaluated using the explicit formula for the complete interaction between particles with respect to the kernel H (i.e. involving charges, dipoles and quadrupoles here). In other words, this method (that we named ANKH) consists in the following steps:

  1. Compute the tree decomposition \mathscr{T} of the simulation box B,

  2. For each leaf c\in\mathcal{L}(\mathscr{T}), compute the modified charges in c (using the same interpolation grid up to a translation to the center of c),

  3. Compute \mathcal{N}_{real} using direct computation,

  4. Compute \mathcal{F}_{real} by solving the N-body problem defined on interpolation nodes and modified charges,

  5. Compute \mathcal{E}_{self} using Eq. (10),

  6. Return \mathcal{N}_{real}+\mathcal{F}_{real}+\mathcal{E}_{self}.

The N-body problem mentioned in step 4 can be written this way:

\sum_{\substack{t\in\mathcal{L}(\mathscr{T}),\\ \mathbf{u}\text{ interpolation node on }t}}\mathcal{Q}_{t}(\mathbf{u})\underbrace{\left(\sum_{\substack{s\in\mathcal{L}(\mathscr{T}),\\ \mathbf{v}\text{ interpolation node on }s}}H_{t,s}(\mathbf{u},\mathbf{v})\mathcal{Q}_{s}(\mathbf{v})\right)}_{N\text{-body problem as in Eq. (3)}} (14)

where H_{t,s}=H if dist(t,s)>0 and 0 otherwise. According to Eq. (14), the quantity in parentheses can be efficiently computed in linear time by means of an IBFMM applied to the kernel H in which we simply skip the P2P operators. We name this approach ANKH-FMM.

In the case of quasi-uniform particle distributions (actually the kind of particle distributions we are targeting), a perfect octree with depth \sim\log(N) generates leaves with \mathcal{O}(1) particles. Hence, in this situation, step 3 of the ANKH method costs \mathcal{O}(N) flops. For such perfect trees, the tree construction (step 1) is also \mathcal{O}(N), and the computation of \mathcal{E}_{self} is, of course, also linear in complexity (step 5). Since step 6 is obviously \mathcal{O}(1), the overall complexity is \mathcal{O}(N).

Two last questions remain: what about periodicity, and how are the modified charges computed? The first question is the purpose of Sect. 3.3 and the second one is the purpose of Sect. 3.4.

3.3 Efficient handling of periodicity

Various formulations of the FMM provide efficient ways of dealing with periodicity [35, 11, 32]. We consider a slightly different approach in our method, well-suited for the IBFMM. One idea is to perform a classical IBFMM on the computational box as well as on the 3^{3}-1 directly adjacent images of it. Then, all other, distant, images are well-separated from the computational box and their interaction can be approximated using only the multipole expansion \mathcal{M}_{B} of this box. More precisely, denoting by C[B,I] the M2L matrix between the computational box B and its image I, one wants to quickly compute:

\sum_{\substack{I\in Z_{p}\\ p\geq|I|>1}}C[B,I]\mathcal{M}_{B}=\left(\sum_{\substack{I\in Z_{p}\\ p\geq|I|>1}}C[B,I]\right)\mathcal{M}_{B}. (15)

The remaining task is thus to efficiently sum the M2L matrices between the images of B and B itself. This can be done by means of interpolation over an equispaced grid [6, 13].

Indeed, such tools allow one to express C[B,I] as a product \chi^{*}\mathbb{F}^{*}D[B,I]\mathbb{F}\chi, where \chi is a zero-padding matrix that increases the size of the inner matrix by slightly less than a factor 8, \mathbb{F} denotes the discrete Fourier matrix (which can be applied through FFT) and D[B,I] is a diagonal matrix that can be computed in linearithmic time with respect to the size of C[B,I].

More precisely, let \Xi_{B}:=\{\Xi_{B}(i,j,k):=(-r_{B}+2r_{B}i/(L-1),-r_{B}+2r_{B}j/(L-1),-r_{B}+2r_{B}k/(L-1))\,|\,i,j,k\in[\![0,L-1]\!]\} be an equispaced grid over B with L nodes in each direction (i.e. with L^{3} nodes in total) and let \Xi_{I} be its translation from the origin to the center of the image I of B. The matrix C[B,I], in this case, has a block-Toeplitz structure [6, 13]. Any such block-Toeplitz matrix can be embedded into a circulant one of size (2L-1)^{3} whose first row is composed of the (2L-1)^{3} different entries of C[B,I]. \chi\in\{0,1\}^{(2L-1)^{3}\times L^{3}} and D[B,I]\in\mathbb{C}^{(2L-1)^{3}\times(2L-1)^{3}} are then defined as follows:

\chi_{q,p}:=\begin{cases}1\text{ if }i_{0}=i_{1},\ j_{0}=j_{1},\ k_{0}=k_{1},\ i_{1},j_{1},k_{1}<L\\ 0\text{ otherwise}\end{cases}, (16)
D[B,I]:=diag(\mathbb{F}R[B,I]),

where R[B,I] is the first row of the circulant matrix formed by embedding of the Toeplitz matrix C[B,I], i.e.

R[B,I]_{i(2L-1)^{2}+j(2L-1)+k}:=H\left(\mathbf{z}^{(I)}_{i,j,k},\mathbf{0}\right), (17)
\mathbf{z}^{(I)}_{ijk}:=\Xi_{B}(\tilde{i},\tilde{j},\tilde{k})-\Xi_{I}(\hat{i},\hat{j},\hat{k}),
\tilde{i}:=\begin{cases}0\text{ if }i<L\\ 2L-1-i\text{ otherwise}\end{cases}
\hat{i}:=\begin{cases}i\text{ if }i<L\\ 0\text{ otherwise}\end{cases}
(with \tilde{j},\tilde{k} and \hat{j},\hat{k} defined analogously).

Since 𝔽\mathbb{F} is practically performed through FFT, computation of D[B,I]D[B,I] results in 𝒪(L3log(L))\mathcal{O}(L^{3}log(L)) flops.

We can thus reformulate Eq. (15) as

\chi^{*}\mathbb{F}^{*}\underbrace{\left(\sum_{\substack{I\in Z_{p}\\ p\geq|I|>1}}D[B,I]\right)}_{=:D_{B}}\mathbb{F}\chi\mathcal{M}_{B}. (18)

As a sum of diagonal matrices, D_{B} is clearly diagonal. We can move the sum in Eq. (18) onto the vectors R[B,I] thanks to Eq. (16), resulting in

D_{B}=diag\left(\mathbb{F}\sum_{\substack{I\in Z_{p}\\ p\geq|I|>1}}R[B,I]\right). (19)

Since the image I of B is a translation of B by \mathbf{t}=2r_{B}I, we have \Xi_{I}(\hat{i},\hat{j},\hat{k})=\Xi_{B}(\hat{i},\hat{j},\hat{k})+\mathbf{t}, so that, thanks to Eq. (17) and the translational invariance of H,

R[B,I]_{i(2L-1)^{2}+j(2L-1)+k}=H\left(\mathbf{z}^{(\mathbf{0})}_{i,j,k}+\mathbf{t},\mathbf{0}\right)=H\left(\mathbf{z}^{(\mathbf{0})}_{i,j,k},-\mathbf{t}\right). (20)

Notice that -I also lies in the set of images of B, so one can eliminate the minus sign in Eq. (20). Also, \{\mathbf{z}^{(\mathbf{0})}_{i,j,k}\,|\,i,j,k\in[\![0,2L-2]\!]\} can be interpreted as a particle distribution over B and its direct neighbors. Moreover, applying the sum over I of Eq. (18) to Eq. (20) gives

\sum_{\substack{I\in Z_{p}\\ p\geq|I|>1}}H\left(\mathbf{z}^{(\mathbf{0})}_{i,j,k},\mathbf{t}\right) (21)

which can be interpreted as an N-body problem with all charges set to 1 over two particle distributions: the set P_{target} of the (2L-1)^{3} different \mathbf{z}^{(\mathbf{0})}_{ijk}'s and the set P_{source} of the \mathbf{t}'s. For large p, using a hierarchical method for such a problem, one may compute these quantities very efficiently since the first of these particle distributions is already well-separated from each \mathbf{t}. In addition, H, P_{target} and P_{source} are all invariant under the action of the rotation group that preserves the cube. This can be exploited to speed up the computation [13]. See Fig. 3 for a schematic representation.

Figure 3: 1D schematic representation of the handling of far periodic images through hierarchical methods. The octree \mathscr{T} in the simulation box B is represented in grey. Its two direct (non well-separated) neighbor images (in red) directly interact with \mathscr{T}. Red and blue nodes in these trees respectively correspond to non well-separated and well-separated nodes of the images with regard to nodes of \mathscr{T} at the same tree level. On the root of \mathscr{T} (i.e. on B), interpolation nodes over an equispaced grid of size 5 are represented (in black). P_{target} is drawn in green. Blue nodes at the same tree level as B are elements of P_{source}. Upper nodes can be interpreted as aggregations of some of them during a hierarchical procedure to solve the N-body problem of Eq. (21).

A critical point here (which actually motivates this entire subsection) is that D_{B} depends neither on the particle distribution nor on its modified charges, but only on the size of the simulation box B. Hence, the computation of D_{B} can be done only once, inside a precomputation step, and not at each IBFMM call. This is of particular interest since in day-to-day molecular dynamics applications such IBFMM calls on the same B are numerous, while our method needs only a single precomputation of D_{B}.

In terms of complexity, at runtime, D[B,I] is already computed, so the cost of handling the periodicity this way is that of two FFTs on a small grid of size (2L-1)^{3}. Since L is a (small) constant, these FFTs cost \mathcal{O}(L^{3}\log(L))=\mathcal{O}(1) flops.

3.4 Numerical differentiation

The idea of switching derivatives to the interpolation polynomials has already found applications in the literature [30]. In our case, things are slightly more difficult because we consider differential operators instead of simple derivatives. To do so, we consider a Lagrange interpolation over products of 1D Chebyshev grids in each leaf cell, as well as all along the tree structure, with a fixed order L in each direction. Notice that this interpolation is different from that of Sect. 3.3, which operates over an equispaced grid and only at the root. However, this does not impact the method at all since the root is only involved in interactions between non-adjacent images of B (see Fig. 3) and because the definition of the M2M/L2L operators is valid regardless of the interpolation grid type.

The choice of multivariate Lagrange interpolation over products of 1D Chebyshev nodes is motivated by the numerical instability that arises at large interpolation orders and fixed arithmetic precision when dealing with equispaced grids. Moreover, this choice of interpolation is known to be particularly good in the context of multilevel summation schemes.

As presented in Sect. 3.2, in the ANKH approach the differential operators apply directly on the interpolation polynomials. Any Lagrange polynomial over a product grid is the product of the Lagrange polynomials on each one-dimensional grid. Assuming for the sake of clarity that the one-dimensional grids are defined on [-1,1] (one only has to introduce a scaling function in order to recover the general case [30, 6]), we thus have the following decomposition of such a polynomial S:

S_{iL^{2}+jL+k}(\mathbf{x})=S_{i}(x_{0})S_{j}(x_{1})S_{k}(x_{2}) (22)

where S_{iL^{2}+jL+k} can be either S_{k}[t] or S_{l}[s] following the notations of Eq. (6), the S_{i}'s are one-dimensional Lagrange polynomials over Chebyshev nodes (i.e. Lagrange-Chebyshev polynomials) and \mathbf{x}=(x_{0},x_{1},x_{2})\in\mathbb{R}^{3}. Explicit forms [18] are known for these polynomials S_{i}:

S_{k}(x)=\frac{1}{L}+\frac{2}{L}\sum_{m=1}^{L-1}T_{m}(r_{k})T_{m}(x) (23)

where T_{m} is the m^{th} Chebyshev polynomial and r_{k}:=\cos\left(\frac{(2k-1)\pi}{2L}\right) is the k^{th} Chebyshev node. T_{m}(x) can be computed using the recurrence relation:

T_{m+2}(x)=2xT_{m+1}(x)-T_{m}(x),\quad T_{0}(x)=1,\ T_{1}(x)=x. (24)

The problem is that explicit differentiation of Eq. (23) leads to

S_{k}^{\prime}(x)=\frac{2}{L\sqrt{1-x^{2}}}\sum_{m=1}^{L-1}mT_{m}(r_{k})\sin(m\arccos(x)) (25)

which is numerically unstable near the bounds of [-1,1]. This could be solved by using Chebyshev polynomials of the second kind in the differentiation, but the same kind of problem appears when differentiating S_{k} a second time (which is needed to take quadrupoles into account). We thus opted for another way of differentiating these polynomials: there exists [3] a triangular matrix \mathcal{D} such that:

\mathbf{T}^{\prime}(x)=\mathcal{D}\mathbf{T}(x),\quad\mathbf{T}(x):=\begin{bmatrix}T_{0}(x)\\ \vdots\\ T_{L-1}(x)\end{bmatrix}. (26)

This can trivially be generalized to any derivation order, giving:

\mathbf{T}^{(n)}(x)=\mathcal{D}^{n}\mathbf{T}(x). (27)

This last form does not suffer from numerical instability near the bounds of [-1,1]. To exploit it in ANKH, we start by reformulating the set of Lagrange-Chebyshev polynomials of Eq. (23) in matrix form

\mathbf{S}(x)=\mathbb{H}\mathbf{T}(x),\quad\mathbb{H}:=\frac{1}{L}\begin{bmatrix}1&2T_{1}(r_{0})&\ldots&2T_{L-1}(r_{0})\\ \vdots&\vdots&\ddots&\vdots\\ 1&2T_{1}(r_{L-1})&\ldots&2T_{L-1}(r_{L-1})\end{bmatrix}, (28)

where \mathbb{H} does not depend on x and can be precomputed once and for all. We can now combine Eq. (27) and Eq. (28) in order to obtain our general formula:

\mathbf{S}^{(n)}(x)=\mathbb{H}\mathcal{D}^{n}\mathbf{T}(x). (29)

To extend this technique to the three-dimensional case, a direct application of Eq. (22) provides the expressions of the multivariate polynomials.
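To make Eqs. (26)-(29) concrete, here is a minimal self-contained C++ sketch (our own illustration, not the ANKH code) that builds the Chebyshev differentiation matrix \mathcal{D} from the classical identity T'_n = nU_{n-1}, builds \mathbb{H} of Eq. (28) (with 0-based node indexing, r_k = \cos((2k+1)\pi/(2L))), and evaluates \mathbf{S}^{(n)}(x)=\mathbb{H}\mathcal{D}^{n}\mathbf{T}(x):

#include <cmath>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>; // dense row-major matrix

// T(x) = [T_0(x), ..., T_{L-1}(x)] via the recurrence of Eq. (24).
Vec chebT(double x, int L) {
    Vec T(L);
    T[0] = 1.0;
    if (L > 1) T[1] = x;
    for (int m = 2; m < L; ++m) T[m] = 2.0 * x * T[m - 1] - T[m - 2];
    return T;
}

// Differentiation matrix of Eq. (26): row n holds the coefficients of T'_n on
// (T_0, ..., T_{L-1}), using T'_n = n U_{n-1} and the expansion of U_{n-1} on
// first-kind Chebyshev polynomials (the T_0 coefficient is halved).
Mat chebD(int L) {
    Mat D(L, Vec(L, 0.0));
    for (int n = 1; n < L; ++n)
        for (int k = n - 1; k >= 0; k -= 2) // k and n have opposite parities
            D[n][k] = (k == 0) ? n : 2.0 * n;
    return D;
}

// Matrix H of Eq. (28), rows indexed by the Chebyshev nodes r_k.
Mat chebH(int L) {
    const double pi = std::acos(-1.0);
    Mat H(L, Vec(L));
    for (int k = 0; k < L; ++k) {
        Vec T = chebT(std::cos((2.0 * k + 1.0) * pi / (2.0 * L)), L);
        H[k][0] = 1.0 / L;
        for (int m = 1; m < L; ++m) H[k][m] = 2.0 / L * T[m];
    }
    return H;
}

Vec matvec(const Mat& A, const Vec& v) {
    Vec w(A.size(), 0.0);
    for (std::size_t i = 0; i < A.size(); ++i)
        for (std::size_t j = 0; j < v.size(); ++j) w[i] += A[i][j] * v[j];
    return w;
}

// S^{(n)}(x) = H D^n T(x), Eq. (29): all L Lagrange-Chebyshev polynomials
// differentiated n times at x, stable near the bounds of [-1, 1].
Vec lagrangeDeriv(const Mat& H, const Mat& D, double x, int n) {
    Vec t = chebT(x, (int)H.size());
    for (int p = 0; p < n; ++p) t = matvec(D, t);
    return matvec(H, t);
}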

We summarize in Alg. 2, using C++-like pseudocode, the steps needed to compute the modified charges for multipoles up to quadrupoles (straightforward generalizations can be extrapolated from this pseudocode), using numerical differentiation through the derivatives of Chebyshev polynomials. Notice that the different \mathbf{T}'s computed in Alg. 2 could in practice be stacked inside a matrix with 9 columns before applying the product by \mathbb{H}, which allows one to benefit from matrix-matrix products instead of matrix-vector ones (i.e. BLAS3 instead of BLAS2), which should perform better.

Algorithm 2 Compute \mathcal{Q}_{s} in leaf cell s
// Initialize modified charges with zero
\mathcal{Q}_{s}=0;
// Loop over particles in s (order does not matter)
for \mathbf{y}\in Y_{|s} do
     // Notation: \mathbf{y}=(y_{0},y_{1},y_{2})
     // Allocate arrays storing Lagrange-Chebyshev polynomial values
     S_{0}^{(0)},S_{0}^{(1)},S_{0}^{(2)}=\text{new }\mathbb{R}[L];
     S_{1}^{(0)},S_{1}^{(1)},S_{1}^{(2)}=\text{new }\mathbb{R}[L];
     S_{2}^{(0)},S_{2}^{(1)},S_{2}^{(2)}=\text{new }\mathbb{R}[L];
     \mathbf{T}=\text{new }\mathbb{R}[L];
     // Decompose the computation per direction
     for k\in[\![0,2]\!] do\triangleright \mathcal{O}(L^{2}) flops
          // Initialize the first two polynomials, see Eq. (24)
          \mathbf{T}[0]=1;
          \mathbf{T}[1]=y_{k};
          // Compute the recurrence following Eq. (24)
          for int i=2; i<L; i++ do \triangleright \mathcal{O}(L) flops
               \mathbf{T}[i]=2y_{k}\mathbf{T}[i-1]-\mathbf{T}[i-2];\triangleright \mathcal{O}(1) flops
          end for
          // Evaluate S_{k}^{(j)}=\mathbb{H}\mathcal{D}^{j}\mathbf{T} following Eqs. (26) and (29)
          for int j=0; j<3; j++ do \triangleright \mathcal{O}(L^{2}) flops
               S_{k}^{(j)}=\mathbb{H}\cdot\mathbf{T}; \triangleright \mathcal{O}(L^{2}) flops
               \mathbf{T}=\mathcal{D}\cdot\mathbf{T}; \triangleright \mathcal{O}(L^{2}) flops
          end for
     end for
     // Add all possible combinations scaled with charge…
     \mathcal{Q}_{s}=\mathcal{Q}_{s}+S^{(0)}_{0}S^{(0)}_{1}S^{(0)}_{2}q_{\mathbf{y}};
     // … dipoles…
     \mathcal{Q}_{s}=\mathcal{Q}_{s}+S^{(1)}_{0}S^{(0)}_{1}S^{(0)}_{2}\left(\mu_{\mathbf{y}}\right)_{0};
     \mathcal{Q}_{s}=\mathcal{Q}_{s}+S^{(0)}_{0}S^{(1)}_{1}S^{(0)}_{2}\left(\mu_{\mathbf{y}}\right)_{1};
     \mathcal{Q}_{s}=\mathcal{Q}_{s}+S^{(0)}_{0}S^{(0)}_{1}S^{(1)}_{2}\left(\mu_{\mathbf{y}}\right)_{2};
     // … and quadrupoles, exploiting symmetry
     \mathcal{Q}_{s}=\mathcal{Q}_{s}+S^{(2)}_{0}S^{(0)}_{1}S^{(0)}_{2}\left(\Theta_{\mathbf{y}}\right)_{0,0};
     \mathcal{Q}_{s}=\mathcal{Q}_{s}+S^{(1)}_{0}S^{(1)}_{1}S^{(0)}_{2}\left(\Theta_{\mathbf{y}}\right)_{0,1}\times 2;
     \mathcal{Q}_{s}=\mathcal{Q}_{s}+S^{(1)}_{0}S^{(0)}_{1}S^{(1)}_{2}\left(\Theta_{\mathbf{y}}\right)_{0,2}\times 2;
     \mathcal{Q}_{s}=\mathcal{Q}_{s}+S^{(0)}_{0}S^{(2)}_{1}S^{(0)}_{2}\left(\Theta_{\mathbf{y}}\right)_{1,1};
     \mathcal{Q}_{s}=\mathcal{Q}_{s}+S^{(0)}_{0}S^{(1)}_{1}S^{(1)}_{2}\left(\Theta_{\mathbf{y}}\right)_{1,2}\times 2;
     \mathcal{Q}_{s}=\mathcal{Q}_{s}+S^{(0)}_{0}S^{(0)}_{1}S^{(2)}_{2}\left(\Theta_{\mathbf{y}}\right)_{2,2};
end for

Regarding the total complexity, the computation of \mathcal{Q}_{s} inside a leaf cell s has a cost of \mathcal{O}(L^{3}N_{s}), where N_{s} denotes the number of particles in s. Since L is a constant, one may drop it from the big \mathcal{O} and the total complexity becomes:

\sum_{s\in\mathcal{L}(\mathscr{T})}\mathcal{O}\left(N_{s}\right)=\mathcal{O}\left(\sum_{s\in\mathcal{L}(\mathscr{T})}N_{s}\right)=\mathcal{O}(N). (30)

3.5 Overall complexity and advantages

In the end, ANKH-FMM defines a linear method since the overall complexity, thanks to all previous sections, breaks down as follows:

\underbrace{\mathcal{O}(N)}_{\text{step }1}+\underbrace{\mathcal{O}(N)}_{\text{step }2}+\underbrace{\mathcal{O}(N)}_{\text{step }3}+\underbrace{\mathcal{O}(N)}_{\text{step }4}+\underbrace{\mathcal{O}(N)}_{\text{step }5}+\underbrace{\mathcal{O}(1)}_{\text{step }6} (31)

where the steps are those presented in Sect. 3.2. This has to be compared with the linearithmic complexity of PME. Hence, our method is asymptotically less costly than PME. However, the FMM (and so the IBFMM) is known to suffer from a large prefactor, while the FFT has a (very) small one and may almost reach CPU peak performance. This means that the theoretical complexities are not sufficient to compare the two approaches, and numerical tests (provided in the following) are needed to assess the efficiency of ANKH.

However, as we mentioned in Sect. 3.1, there is another strong advantage of ANKH-FMM: its potential for distributed parallelism. Indeed, the computation of the modified charges, being a local operation, is easy to parallelize in a distributed-memory context. Moreover, the IBFMM has shown impressive parallel performance in this context and general parallel strategies aiming at reaching exascale have already been developed for FMMs. The parallel scaling of FMMs can be compared with that of the FFTs on which PME relies. In addition, efficient implementations of the IBFMM on GPU [42] have been proposed in the literature for perfect trees (i.e. exactly our application case). Hence, the ANKH strategy appears as a serious theoretical candidate for exascale hybrid computations.

We may already mention that GPU implementations of the IBFMM are technical. This is why we wanted to devise a simple and portable scheme running on GPU while keeping the parallel structure of the FMM. This is the purpose of Sect. 4.

3.6 Mutual interactions

In the particular context of energy computation, the global algorithm discussed in Sect. 3.5 is still suboptimal. Indeed, coming back to Eq. (5), we locally want to compute q_{|t^{\prime}}^{T}\phi_{t^{\prime},s^{\prime}} instead of \phi_{t^{\prime}} only (the latter would rather be suited for force computations). Hence, we would have

q_{|t^{\prime}}^{T}\phi_{t^{\prime},s^{\prime}} \approx \left(q_{|t^{\prime}}^{T}U^{*}[t^{\prime}]\right)U^{*}[t^{\prime},t]C[t,s]V[s,s^{\prime}]\underbrace{\left(V[s^{\prime}]q_{|s^{\prime}}\right)}_{=\mathcal{M}_{s^{\prime}}} (32)
= \left(q_{|t^{\prime}}^{T}V[t^{\prime}]^{*}\right)V[t,t^{\prime}]^{*}C[t,s]V[s,s^{\prime}]\mathcal{M}_{s^{\prime}}
= \left(V[t^{\prime}]q_{|t^{\prime}}\right)^{T}V[t,t^{\prime}]^{*}C[t,s]V[s,s^{\prime}]\mathcal{M}_{s^{\prime}}
= \mathcal{M}_{t^{\prime}}^{T}V[t,t^{\prime}]^{*}C[t,s]V[s,s^{\prime}]\mathcal{M}_{s^{\prime}}
= \underbrace{\left(\mathcal{M}_{t^{\prime}}^{T}V[t,t^{\prime}]^{T}\right)}_{=\left(V[t,t^{\prime}]\mathcal{M}_{t^{\prime}}\right)^{T}=\mathcal{M}_{t}^{T}}C[t,s]\underbrace{\left(V[s,s^{\prime}]\mathcal{M}_{s^{\prime}}\right)}_{=\mathcal{M}_{s}}

since U^{*}[t^{\prime}]=V[t^{\prime}]^{*}=V[t^{\prime}]^{T} because the IBFMM L2P and P2M operators are adjoint matrices with real entries (as are the IBFMM L2L and M2M operators). This provides a new formula for \mathcal{F}_{real}:

\mathcal{F}_{real} \approx \sum_{\text{target cell }t}\mathcal{M}_{t}^{T}\sum_{s\in\Lambda(t)}C[t,s]\mathcal{M}_{s} = \sum_{\text{target cell }t}\sum_{s\in\Lambda(t)}\left(\mathcal{M}_{t}^{T}C[t,s]\mathcal{M}_{s}\right). (33)

This does not reduce the theoretical complexity but strongly simplifies the method. In addition, many optimizations may be derived from this formula.

First, when performing the M2L, we can directly compute the quantity in parentheses in Eq. (33), i.e. adding the left product by \mathcal{M}_{t}^{T}. Second, if two cells t,s\subseteq B are such that s\in\Lambda(t), then t\in\Lambda(s), meaning that both interactions would be computed. This can be simplified because C[t,s]=C[s,t]^{T}:

\mathcal{M}_{t}^{T}C[t,s]\mathcal{M}_{s}+\mathcal{M}_{s}^{T}C[s,t]\mathcal{M}_{t} = \mathcal{M}_{t}^{T}C[s,t]^{T}\mathcal{M}_{s}+\mathcal{M}_{s}^{T}C[s,t]\mathcal{M}_{t} = \left(\mathcal{M}_{s}^{T}C[s,t]\mathcal{M}_{t}\right)^{T}+\mathcal{M}_{s}^{T}C[s,t]\mathcal{M}_{t} = 2\mathcal{M}_{s}^{T}C[s,t]\mathcal{M}_{t}. (34)

This implies that only half of the M2L interactions need to be computed. In addition, since the M2L matrices are approximated in practice by low-rank ones [18] of the form C[t,s]\approx\mathbf{L}[t,s]^{*}\mathbf{R}[t,s] with \mathbf{L}[t,s],\mathbf{R}[t,s]\in\mathbb{R}^{k\times L^{3}}, k\ll L^{3}, we can exploit this factorization to further accelerate the M2L evaluation in Eq. (34). This takes the following form:

\mathcal{M}_{s}^{T}C[s,t]\mathcal{M}_{t} \approx \mathcal{M}_{s}^{T}\mathbf{L}[t,s]^{*}\mathbf{R}[t,s]\mathcal{M}_{t} = \underbrace{\underbrace{\left(\mathbf{L}[t,s]\mathcal{M}_{s}\right)^{T}}_{\mathcal{O}(kL^{3})}\underbrace{\left(\mathbf{R}[t,s]\mathcal{M}_{t}\right)}_{\mathcal{O}(kL^{3})}}_{\mathcal{O}(kL^{3})}. (35)

Different techniques can be used to obtain such low-rank approximations. We compared two of them: the partial Adaptive Cross Approximation [2] (pACA) and the Singular Value Decomposition (SVD). pACA has linear algorithmic complexity while the SVD has cubic cost. However, the M2L matrices are relatively small and, thanks to the use of symmetries [13], it is possible to compute only a few dozen such approximations per level (and only once, during the precomputation step, not at runtime). Since the numerical ranks obtained using pACA are higher than those obtained through SVD (which are optimal), we measured a slightly higher cost when evaluating the FMM using pACA approximations. For these reasons, we decided to rely on SVDs for the performance tests.
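As an illustration of Eqs. (34)-(35), the following short sketch (our own, with hypothetical names) evaluates the mutual contribution 2\mathcal{M}_{s}^{T}C[s,t]\mathcal{M}_{t} from precomputed low-rank factors by projecting both expansions onto the rank-k space and taking a dot product:

#include <vector>

// Contribution 2 * M_s^T C[s,t] M_t with C[s,t] ~ L^T R, L and R of size k x P
// (P = L^3 interpolation nodes); cost O(kP) instead of O(P^2).
double mutualM2L(const std::vector<std::vector<double>>& Lf, // k x P factor
                 const std::vector<std::vector<double>>& Rf, // k x P factor
                 const std::vector<double>& Ms,              // multipole of s
                 const std::vector<double>& Mt) {            // multipole of t
    double acc = 0.0;
    for (std::size_t r = 0; r < Lf.size(); ++r) {
        double ls = 0.0, rt = 0.0;
        for (std::size_t p = 0; p < Ms.size(); ++p) {
            ls += Lf[r][p] * Ms[p];   // (L M_s)_r
            rt += Rf[r][p] * Mt[p];   // (R M_t)_r
        }
        acc += ls * rt;               // (L M_s)^T (R M_t)
    }
    return 2.0 * acc;                 // mutual factor of Eq. (34)
}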

Similarly, because our kernels K are translationally invariant, we have

𝔇𝐱𝔇𝐲K(𝐱,𝐲)+𝔇𝐲𝔇𝐱K(𝐲,𝐱)=2𝔇𝐱𝔇𝐲K(𝐱,𝐲),\mathfrak{D}_{\mathbf{x}}\mathfrak{D}_{\mathbf{y}}K(\mathbf{x},\mathbf{y})+\mathfrak{D}_{\mathbf{y}}\mathfrak{D}_{\mathbf{x}}K(\mathbf{y},\mathbf{x})=2\mathfrak{D}_{\mathbf{x}}\mathfrak{D}_{\mathbf{y}}K(\mathbf{x},\mathbf{y}),

meaning that only half of the direct interactions have to be computed.

3.7 Global ANKH-FMM algorithm

For the sake of clarity, we summarize all the ANKH-FMM steps in Alg. 3. More precisely, the first ANKH-FMM step consists in computing the modified charges in each leaf cell, once the (perfect) octree space decomposition is known. These modified charges are interpreted as terms of an IBFMM multipole expansion associated with interpolation nodes in these cells. This then leads to the computation of the other multipole expansions in the non-leaf tree levels. Notice that this step is quite easy to parallelize in a shared-memory context due to the independence of the modified charges of two different leaves. The second ANKH-FMM step corresponds to the M2L evaluation involved in the IBFMM, except that in our case mutual interactions give a simpler expression of it (following Sect. 3.6). Since we are focused on energy computation, and thanks to the adjoint form of the L2L/M2M and L2P/P2M operators, the result of each mutual M2L can be added to a scalar variable. Thus, no L2L/L2P application is needed in practice.

An important point to mention at this stage is that, even if the IBFMM part of ANKH-FMM also involves cells of the direct neighbor images of the simulation box B, there is no need to compute their multipole expansions. Indeed, they are equal to the multipole expansions in the original box up to a translation, which only appears in the M2L formula since all other formulas are in practice taken relatively to cell centers [12].

Then, the third step of ANKH-FMM is just a direct computation of the near field \mathcal{N}_{real}. To achieve it in practice, we used explicit derivatives [40] of the kernel K. The fourth step corresponds to the handling of periodicity (i.e. here the influence on B of the non-direct neighbor images of B) using the technique presented in Sect. 3.3. Finally, the two last steps are just the direct computation of the self energy in Eq. (10) and the simple sum of all the computed quantities.

Algorithm 3 ANKH-FMM
// Step 1: get modified charges, multipole expansions
for each leaf cell s(T)s\in\mathcal{L}(T) do
     Compute 𝒬s\mathcal{Q}_{s} using Alg. 2
     s=𝒬s\mathcal{M}_{s}=\mathcal{Q}_{s}
end for
for each non leaf level ee starting from the deepest one do
     for each cell ss at level ee do
         s=𝟎\mathcal{M}_{s}=\mathbf{0}
         for each daughter ss^{\prime} of ss do
              s=s+V[s,s]s\mathcal{M}_{s}=\mathcal{M}_{s}+V[s,s^{\prime}]\mathcal{M}_{s^{\prime}}
         end for
     end for
end for
// Step 2: get real\mathcal{F}_{real} using mutual M2L
real=𝟎\mathcal{F}_{real}=\mathbf{0}
for each target cell tt do
     for each source cell sΛ(t)s\in\Lambda(t) do
if the interaction of s and t has not yet been computed then
              real=real+2tTC[t,s]s\mathcal{F}_{real}=\mathcal{F}_{real}+2\mathcal{M}_{t}^{T}C[t,s]\mathcal{M}_{s}
         end if
     end for
end for
// Step 3: get 𝒩real\mathcal{N}_{real} through mutual P2P
𝒩real=𝟎\mathcal{N}_{real}=\mathbf{0}
for each leaf cell t(𝒯)t\in\mathcal{L}(\mathscr{T}) do
     for each source cell ss adjacent to tt do
if the interaction of s and t has not yet been computed then
              for 𝐱X|t\mathbf{x}\in X_{|t} do
                  for 𝐲Y|s\mathbf{y}\in Y_{|s} do
                       𝒩real=𝒩real+2𝔇𝐱𝔇𝐲G(𝐱,𝐲)\mathcal{N}_{real}=\mathcal{N}_{real}+2\mathfrak{D}_{\mathbf{x}}\mathfrak{D}_{\mathbf{y}}G(\mathbf{x},\mathbf{y})
                  end for
              end for
         end if
     end for
end for
// Step 4: get periodic influence
Let B\mathcal{M}_{B} be the root multipole expansion
per=BTχ𝔽DB𝔽χB\mathcal{F}_{per}=\mathcal{M}_{B}^{T}\chi^{*}\mathbb{F}^{*}D_{B}\mathbb{F}\chi\mathcal{M}_{B}
// Step 5: compute self energy
self=0\mathcal{E}_{self}=0
for each 𝐱X\mathbf{x}\in X do
\mathcal{E}_{self}=\mathcal{E}_{self}+q_{\mathbf{x}}^{2}+\frac{2\xi^{2}}{3}\mu_{\mathbf{x}}\cdot\mu_{\mathbf{x}}+\frac{8\xi^{4}}{5}\Theta_{\mathbf{x}}:\Theta_{\mathbf{x}}
end for
// Step 6: return energy \mathcal{E}
=ξπself+real+𝒩real+per\mathcal{E}=-\frac{\xi}{\sqrt{\pi}}\mathcal{E}_{self}+\mathcal{F}_{real}+\mathcal{N}_{real}+\mathcal{F}_{per}
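As an illustration of Steps 5 and 6 of Alg. 3, the following numpy sketch accumulates the self-energy contribution; the array shapes (q of shape (N,), mu of shape (N,3), theta of shape (N,3,3)) are our assumptions for storing the atomic multipoles.

import numpy as np

def self_energy_term(q, mu, theta, xi):
    # Per-atom sum q^2 + (2 xi^2 / 3) mu.mu + (8 xi^4 / 5) Theta:Theta,
    # scaled by -xi / sqrt(pi) as in Step 6 of Alg. 3.
    e_self = (q**2
              + (2.0 * xi**2 / 3.0) * np.einsum('ij,ij->i', mu, mu)
              + (8.0 * xi**4 / 5.0) * np.einsum('ijk,ijk->i', theta, theta)).sum()
    return -xi / np.sqrt(np.pi) * e_self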

3.8 A first conclusion

We presented in this section a new method, based on both the Ewald summation and the interpolation-based FMM, aiming at efficiently treating the N-body problems arising in the energy computation context in molecular dynamics, and in a way that should perform efficiently in a parallel (mainly distributed memory) context. Since FMMs benefit from highly scalable strategies [45, 28, 22], this new ANKH-FMM approach may pave the way for a linear-complexity and scalable family of methods suited for large-scale molecular dynamics computations. We proved, under molecular dynamics hypotheses, that our method is actually \mathcal{O}(N), even when exploiting periodicity with a large number of images, thanks to a new way of taking into account the influence of the far images. We introduced mutual interactions in the context of M2L evaluations for this application case, and we presented a numerical differentiation through polynomial interpolation in order to deal with the differential operators arising when considering dipoles and quadrupoles (our methodology can be extended to any multipole order with low effort). Finally, we provided an explicit algorithm detailing our method.

Among the direct perspectives, the ideas behind ANKH-FMM can be used to compute forces, even if some optimizations have to be adapted for this purpose (especially in the mutual M2L interactions). The parallel implementation of ANKH-FMM, as well as real tests on supercomputers, also counts among the important and relatively short-term perspectives.

4 ANKH-FFT

Despite the sequential and distributed efficiency of the interpolation-based FMM, limitations arise in the context of GPU computing. Indeed, while the near-field part is well suited to GPUs, the M2L evaluation suffers from data loading in shared memory that mitigates the performance on this type of architecture. However, the fact that the 2^d-trees are perfect in molecular dynamics allows to derive efficient schemes, such as those proposed in the literature [42]. In this section, we introduce a new method, also based on interpolation, designed to bypass the limitations arising in the GPU context.

Starting from the interpolation of the particle multipoles at leaves on equispaced grids (instead of Chebyshev ones), one may construct a single-level FMM as the evaluation of the matrix concatenating all the M2L matrices between non-adjacent cells. This matrix can be proved to be a block-Toeplitz matrix (we will focus on the theory in a forthcoming mathematical article). Such a matrix can be embedded in a larger circulant matrix. Because any circulant matrix can be explicitly diagonalized in the Fourier domain, and because the product by a discrete Fourier basis can be efficiently processed through FFTs, one may compute this diagonalization in linearithmic time. Doing so, we obtain a fast linearithmic scheme to deal with the far field of N-body problems on quasi-uniform particle distributions (this assumption guaranteeing perfect trees). Of course, the near field is treated as in the FMM, i.e. by performing local direct computations between non-separated cells.
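The following self-contained 1D sketch illustrates this mechanism on a generic Toeplitz matrix (the 6D block version used by ANKH-FFT is detailed in Sect. 4.2): the matrix is embedded into a circulant one, whose eigenvalues are the DFT of its first column, so that a matrix-vector product reduces to FFTs.

import numpy as np

def toeplitz_matvec_fft(col, row, x):
    # T @ x through circulant embedding: col / row are the first column / row
    # of the Toeplitz matrix T (col[0] == row[0]); costs O(n log n).
    n = len(col)
    c = np.concatenate([col, row[:0:-1]])        # first column of the circulant embedding
    eigs = np.fft.fft(c)                          # diagonalization in the Fourier basis
    x_pad = np.concatenate([x, np.zeros(n - 1)])  # zero-padding to the embedded size
    return np.fft.ifft(eigs * np.fft.fft(x_pad))[:n].real

# Check against the dense product on a small random example:
n = 8
rng = np.random.default_rng(0)
col, row = rng.random(n), rng.random(n)
row[0] = col[0]
T = np.array([[col[i - j] if i >= j else row[j - i] for j in range(n)] for i in range(n)])
x = rng.random(n)
assert np.allclose(T @ x, toeplitz_matvec_fft(col, row, x))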

However, in our case, we want to perform the interpolation on Chebyshev polynomials in order to mitigate the precision loss of the multipole operator evaluation without numerical stability issues. We thus start by performing such a Chebyshev interpolation at leaves, directly followed by a re-interpolation of the data, now defined at Chebyshev nodes, onto a local equispaced grid, using the same grid (up to translation) for every leaf (since they all have the same radius).

Once in the Fourier domain, the energy computation problem is nothing more than a scalar product between two vectors: the diagonal of the Fourier diagonalization of the circulant matrix, and the squared modulus of the Fourier transform of the zero-padded modified charges vector (padded to the size of the circulant matrix). Since the circulant matrix has \mathcal{O}(N) rows and columns (\mathcal{O}(N) leaves with \mathcal{O}(1) interpolation nodes in each), this scalar product costs \mathcal{O}(N) floating point operations (flops) while the FFT costs \mathcal{O}(N\log N) flops. Since the interpolation phase is analogous to the application of all the P2M operators in ANKH-FMM, this step can also be performed in \mathcal{O}(N) flops. Hence, we end up with a linearithmic method. This new method is named ANKH-FFT.

ANKH-FFT is theoretically more costly than ANKH-FMM, but ANKH-FFT benefits from highly optimized FFT implementations, also available on GPU, and its interpolation phase is highly parallel. Hence, ANKH-FFT seems to be a better candidate than ANKH-FMM in a GPU context. The performance of these two methods still has to be compared, which is done in the following. Their performances are comparable, and ANKH-FFT behaves as a linear method in the tested range (i.e. from 1000 to 100 million atoms). This leads us to the following conclusion about this new approach: ANKH-FFT appears as an entirely new and efficient way to deal with local N-body problems, with an algorithmic structure well-suited for GPU programming. It should thus be an excellent alternative to the local interpolation-based FMM used on one core of the distributed memory framework of the parallel interpolation-based FMM.

We may insist here on the fact that ANKH-FFT requires a strong mathematical background to be entirely presented; hence, we postpone this complete presentation, including proofs, to a forthcoming mathematical paper. ANKH-FFT is based on a matrix factorization presented in Sect. 4.1. We then show how this factorization leads to a diagonalization in Sect. 4.2, and we present the fast handling of periodic conditions in Sect. 4.3. The important role of mutual interactions in ANKH-FFT is presented in Sect. 4.4. Finally, we summarize the main algorithm in Sect. 4.5, followed in Sect. 4.6 by a complexity analysis. The numerical comparison between ANKH-FFT and ANKH-FMM is provided in Sect. 4.7.

4.1 Matrix factorization

We start by considering all the leaf cells \mathcal{L}(\mathscr{T}) of \mathscr{T} (which are all at the same tree level since \mathscr{T} is perfect). Let us consider the extended interaction list \Lambda^{+} of any leaf cell t, defined as the set of leaf cells at a strictly positive distance from t:

Λ+(t):={s(𝒯)|dist(t,s)>0}.\Lambda^{+}(t):=\{s\in\mathcal{L}(\mathscr{T})\hskip 2.84544pt|\hskip 2.84544ptdist(t,s)>0\}.

We then decompose ~real\tilde{\mathcal{E}}_{real} of Eq. 12 as

\tilde{\mathcal{E}}_{real}=\underbrace{\sum_{t\in\mathcal{L}(\mathscr{T})}\sum_{\substack{\mathbf{x}\in t,\,\mathbf{y}\in s\\ s\in\Lambda^{+}(t)}}\mathfrak{D}_{\mathbf{x}}\mathfrak{D}_{\mathbf{y}}H(\mathbf{x},\mathbf{y})}_{=:\mathcal{F}}+\underbrace{\sum_{t\in\mathcal{L}(\mathscr{T})}\sum_{\substack{\mathbf{x}\in t,\,\mathbf{y}\in s\\ s\notin\Lambda^{+}(t)}}\mathfrak{D}_{\mathbf{x}}\mathfrak{D}_{\mathbf{y}}H(\mathbf{x},\mathbf{y})}_{=:\mathcal{N}}. (36)

As in our ANKH-FMM algorithm, we propose to compute 𝒩\mathcal{N} directly and to approximate \mathcal{F} by means of interpolation. Indeed, following Sect. 3.6, we can write

\mathcal{F}\approx\sum_{t\in\mathcal{L}(\mathscr{T})}\sum_{s\in\Lambda^{+}(t)}\mathcal{M}_{t}^{T}C[t,s]\mathcal{M}_{s}. (37)

Now, let \mathbf{E}_{c}, for any c\in\mathcal{L}(\mathscr{T}), be a re-interpolation matrix (see the M2M definition in Eq. 8) from the (Chebyshev) interpolation nodes of c onto an equispaced grid \mathcal{J}_{L} defined by

\mathcal{J}_{L}:=\left\{r_{B}2^{-(1+E)}\left(\begin{bmatrix}-1\\-1\\-1\end{bmatrix}+\frac{2\mathbf{k}}{L-1}\right)\ \Big|\ \mathbf{k}\in[\![0,L-1]\!]^{3}\right\}, (38)

i.e. \mathbf{E}_{c} interpolates the modified charges associated to the Chebyshev nodes of a cell c to new modified charges associated to the nodes of an equispaced grid in c. This interpolation grid \mathcal{J}_{L} is centered at each leaf cell's center. Said differently, the same equispaced interpolation grid \mathcal{J}_{L} is used in each leaf cell, up to a translation (see Fig. 4).
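For concreteness, a possible numpy construction of the grid \mathcal{J}_{L} of Eq. (38), to be translated to each leaf-cell center, could read as follows (the lexicographic ordering of the nodes is our choice):

import numpy as np

def equispaced_grid(r_B, E, L):
    # L x L x L equispaced lattice of Eq. (38), centered at the origin.
    h = r_B * 2.0 ** (-(1 + E))
    t = h * (-1.0 + 2.0 * np.arange(L) / (L - 1))   # 1D nodes in [-h, h]
    kx, ky, kz = np.meshgrid(t, t, t, indexing='ij')
    return np.stack([kx, ky, kz], axis=-1).reshape(-1, 3)  # (L**3, 3) nodes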

Figure 4: Schematic representation of 2D interpolation on tensorized equispaced grids (as in Eq. 38) with L=5 on four different leaf cells (in red, orange, green and grey). Interpolation nodes belonging to a single cell are represented in black on the top figure, nodes belonging to two cells are represented in purple, and the node belonging to all four cells is represented in cyan. In the bottom figure, we represent the actual grid used in ANKH-FFT, with a slight node translation in order to exhibit the duplications: purple nodes are duplicated once and the cyan one is duplicated four times. Assuming that we only have these four leaf cells (i.e. a tree depth of only one), this cyan node is indexed (following the multi-indexing of Sect. 4.2) as (0,4,0,0) in the red cell, as (1,0,0,0) in the grey cell, as (0,4,0,4) in the orange cell and as (1,0,1,4) in the green cell.

Each M2L matrix C[t,s]C[t,s] can be written as

C[t,s]𝐄tTG[t,s]𝐄sC[t,s]\approx\mathbf{E}^{T}_{t}G[t,s]\mathbf{E}_{s} (39)

where \left(G[t,s]\right)_{k,l}:=H(ctr(t)+\tilde{\mathbf{x}}_{k},ctr(s)+\tilde{\mathbf{x}}_{l}), ctr(c) denoting the center of the cell c and \tilde{\mathbf{x}}_{k} referring to the k-th element of \mathcal{J}_{L}. In other terms, Eq. 39 simply switches between the Chebyshev nodes and the equispaced nodes of a cell.

Remark 1

In practice, the order of the Chebyshev rule is chosen so that the derivation of the associated polynomials results in the targeted error. Nevertheless, once the modified charges are computed, the equispaced interpolation does not have to use the same interpolation order. This second order can be chosen without any consideration of the derivatives, which in practice allows lower orders. Hence, another level of compression may be induced by this re-interpolation, provided that the interpolation order on equispaced grids is chosen smaller than the one on Chebyshev grids.
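A minimal 1D sketch of such a re-interpolation, built from Lagrange polynomials, is given below; the actual \mathbf{E}_{c} is the tensorization of three such 1D matrices and acts on modified charges, so this is only an illustration of the principle.

import numpy as np

def lagrange_matrix(src, dst):
    # Row i holds the Lagrange basis of the nodes `src` evaluated at dst[i],
    # so that M @ f re-expresses data f known at `src` on the nodes `dst`.
    M = np.ones((len(dst), len(src)))
    for j in range(len(src)):
        for m in range(len(src)):
            if m != j:
                M[:, j] *= (dst - src[m]) / (src[j] - src[m])
    return M

# Re-interpolate from L_c Chebyshev nodes to L_u < L_c equispaced nodes on [-1, 1]:
Lc, Lu = 8, 6
cheb = np.cos((2 * np.arange(Lc) + 1) * np.pi / (2 * Lc))
equi = np.linspace(-1.0, 1.0, Lu)
E1d = lagrange_matrix(cheb, equi)   # (Lu x Lc) 1D analogue of E_c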

Now, we aim at efficiently computing the interactions between a cell t and all the cells in its extended interaction list. Let us extend the definition of G[t,s] in order to fill the blanks induced by the pairs of leaf cells (t,s) such that s\notin\Lambda^{+}(t), introducing

G~[t,s]={G[t,s] if sΛ+(t),𝟎 otherwise.\tilde{G}[t,s]=\begin{cases}G[t,s]\textit{ if }s\in\Lambda^{+}(t),\\ \mathbf{0}\textit{ otherwise}.\end{cases} (40)

We thus obtain

\displaystyle\mathcal{F} t(𝒯)s(𝒯)tT𝐄tTG~[t,s]𝐄ss\displaystyle\approx\sum_{t\in\mathcal{L}(\mathscr{T})}\sum_{s\in\mathcal{L}(\mathscr{T})}\mathcal{M}_{t}^{T}\mathbf{E}^{T}_{t}\tilde{G}[t,s]\mathbf{E}_{s}\mathcal{M}_{s} (41)

that corresponds to the product of a matrix \mathbb{A}, formed by the concatenation of the blocks \tilde{G}[t,s], with left and right vectors formed by the concatenation of the \mathbf{E}_{c}\mathcal{M}_{c}'s, c\in\mathcal{L}(\mathscr{T}). The blocks are ordered in 3D (in both rows and columns) following a lexicographic order on their center differences:

c<c^{\prime}\Leftrightarrow\Big[r_{0}<0\ \lor\ \left(r_{0}=0\land r_{1}<0\right)\ \lor\ \left(r_{0}=0\land r_{1}=0\land r_{2}<0\right)\Big]

for any c,c(𝒯)c,c^{\prime}\in\mathcal{L}(\mathscr{T}) and with (r0,r1,r2):=ctr(c)ctr(c)(r_{0},r_{1},r_{2}):=ctr(c)-ctr(c^{\prime}). The result of this concatenation is a vector 𝐯8EL3\mathbf{v}\in\mathbb{R}^{8^{E}L^{3}}, where 8E8^{E} corresponds to the number of leaf cells in a perfect octree with depth EE. Using these notations, we obtain the following approximation:

𝐯T𝔸𝐯.\mathcal{F}\approx\mathbf{v}^{T}\mathbb{A}\mathbf{v}. (42)

We are thus now interested in an efficient computation of the product by \mathbb{A}. This is the purpose of Sect. 4.2.

Remark 2

In 𝔸\mathbb{A}, a given target cell tt may interact with various source cells ss that spatially share interpolation nodes. However, each equispaced interpolation grid and each multipole expansion s\mathcal{M}_{s} associated to it has to be considered independently because, depending on tt, s\mathcal{M}_{s} may be masked or used, thanks to possible products by zeros in Eq. 40. This means that we do not see the concatenation of equispaced grids as a global equispaced grid, but rather as a collection of such small grids with nodes duplication on the cell’s edges. This consideration is illustrated on Fig. 4

4.2 Diagonalization

In this section, we provide an explicit diagonalization of an embedding of 𝔸\mathbb{A} in a Fourier basis. Let 𝐤,𝐥\mathbf{k},\mathbf{l} be multi-indices in

\left([\![0,n(E)]\!]\times[\![0,L-1]\!]\right)^{3},

where, along each of the x-, y- and z-axes, the first factor indexes the cell and the second one the interpolation node within that cell, and where n(E):=2^{E}-1 (see Fig. 4 for a graphical 2D representation). Let \mathbf{u}=(u_{0},\dots,u_{5})\in\{0,1\}^{6} be a boolean vector and let \tilde{\mathbf{k}}=(\tilde{k}_{0},\dots,\tilde{k}_{5}), \tilde{\mathbf{l}}=(\tilde{l}_{0},\dots,\tilde{l}_{5}) be defined as

k~p={kp+up if p is even and kp2E1,kp+up if p is odd and kpL1,kp otherwise.\tilde{k}_{p}=\begin{cases}k_{p}+u_{p}\textit{ if }p\textit{ is even and }k_{p}\neq 2^{E}-1,\\ k_{p}+u_{p}\textit{ if }p\textit{ is odd and }k_{p}\neq L-1,\\ k_{p}\textit{ otherwise}.\end{cases} (43)

In our case, \mathbb{A} satisfies a strong property, stated in Thm. 1.

Theorem 1

If HH is a radial kernel, then 𝔸𝐤,𝐥=𝔸𝐤~,𝐥~\mathbb{A}_{\mathbf{k},\mathbf{l}}=\mathbb{A}_{\tilde{\mathbf{k}},\tilde{\mathbf{l}}}, where 𝔸𝐤,𝐥\mathbb{A}_{\mathbf{k},\mathbf{l}} refers to the element of 𝔸\mathbb{A} at row k0(2EL3)2+k12EL3+k2k_{0}(2^{E}L^{3})^{2}+k_{1}2^{E}L^{3}+k_{2} and column l0(2EL3)2+l12EL3+l2l_{0}(2^{E}L^{3})^{2}+l_{1}2^{E}L^{3}+l_{2}.

Actually, Thm. 1 exhibits a Toeplitz structure for \mathbb{A} in six dimensions (the dimensions of the indexing at the beginning of this section). The direct consequence of Thm. 1 is summarized in Cor. 1.

Corollary 1

Let χ{0,1}(((2E+11)×(2L1))3)×((2E×L)3)\chi\in\{0,1\}^{\left(\left((2^{E+1}-1)\times(2L-1)\right)^{3}\right)\times\left(\left(2^{E}\times L\right)^{3}\right)} be defined as

χ𝐤,𝐥={1 if 𝐤=𝐥,0 otherwise.\chi_{\mathbf{k},\mathbf{l}}=\begin{cases}1\textit{ if }\mathbf{k}=\mathbf{l},\\ 0\textit{ otherwise}.\end{cases} (44)

for any \mathbf{k}\in\left([\![0,2^{E}-1]\!]\times[\![0,L-1]\!]\right)^{3},\ \mathbf{l}\in\left([\![0,2^{E+1}-2]\!]\times[\![0,2L-2]\!]\right)^{3}, and let \mathbb{F}_{6} be the discrete Fourier transform matrix of dimensions (2^{E+1}-1,2L-1,2^{E+1}-1,2L-1,2^{E+1}-1,2L-1), i.e. acting on the embedding space. We have

𝔸=χ𝔽6𝐃(𝔸)𝔽6χ,\mathbb{A}=\chi^{*}\mathbb{F}_{6}^{*}\mathbf{D}(\mathbb{A})\mathbb{F}_{6}\chi, (45)

where 𝐃(𝔸)\mathbf{D}(\mathbb{A}) is a diagonal matrix.

The main idea is that \mathbb{A} is a block-Toeplitz matrix according to Thm. 1. Such a matrix can be embedded into a block-circulant one. Any block-circulant matrix can be diagonalized in a Fourier basis (here \mathbb{F}_{6}), giving the factorization of Cor. 1. In practice, this block-circulant matrix is never built, because the diagonal of \mathbf{D}(\mathbb{A}) can be computed through a single application of \mathbb{F}_{6} to a vector of correctly ordered elements of the first row and column of \mathbb{A}. Similar tools were used in Sect. 3.3, in a much simpler context.

The expression (45) allows to derive a fast scheme for the evaluation of \mathcal{F} in Eq. (42) since, thanks to Thm. 1:

𝐯Tχ𝔽6𝐃(𝔸)𝔽6χ𝐯,\mathcal{F}\approx\mathbf{v}^{T}\chi^{*}\mathbb{F}_{6}^{*}\mathbf{D}(\mathbb{A})\mathbb{F}_{6}\chi\mathbf{v}, (46)

in which the products by \chi^{*},\chi can be evaluated with linear complexity (because of their sparse structure) and the products by \mathbb{F}_{6}^{*},\mathbb{F}_{6} can be processed with linearithmic complexity (with respect to the size of \mathbf{v}) through FFTs.

We refer to Alg. 4 for the explicit computation of the diagonal of \mathbf{D}(\mathbb{A}). In this algorithm, we use the notations \mathbf{I}=(I_{0},\dots,I_{5})^{T}, \mathbf{J}=(J_{0},\dots,J_{5})^{T}, \mathbf{R}=(R_{0},\dots,R_{5})^{T}. Essentially, this algorithm generates a vector composed of the (correctly ordered for the circulant embedding) elements of the first row and column of \mathbb{A}, and then applies a discrete Fourier transform to this vector. This results in the diagonal elements of the diagonalization of \mathbb{A} in the Fourier domain, i.e. \mathbf{D}(\mathbb{A}).

Algorithm 4 Get ANKH-FFT diagonal matrix
u=(2E1)rB2Eu=\frac{(2^{E}-1)r_{B}}{2^{E}}
v=rB2Ev=\frac{r_{B}}{2^{E}}
Pcell=2E+11P_{cell}=2^{E+1}-1
Pntrp=2L1P_{ntrp}=2L-1
\mathbf{Z}:=\begin{bmatrix}\frac{2u}{2^{E}-1}&\frac{2v}{L-1}&0&0&0&0\\ 0&0&\frac{2u}{2^{E}-1}&\frac{2v}{L-1}&0&0\\ 0&0&0&0&\frac{2u}{2^{E}-1}&\frac{2v}{L-1}\end{bmatrix}
for 𝐑([[0,Pcell1]]×[[0,Pntrp1]])3\mathbf{R}\in\left([\![0,P_{cell}-1]\!]\times[\![0,P_{ntrp}-1]\!]\right)^{3} do
     for k=0,1,2k=0,1,2 do
         if R2k<2ER_{2k}<2^{E} then
              I2k=R2kI_{2k}=R_{2k} and J2k=0J_{2k}=0
         else
              I2k=0I_{2k}=0 and J2k=PcellR2kJ_{2k}=P_{cell}-R_{2k}
         end if
         if R2k+1<LR_{2k+1}<L then
I_{2k+1}=R_{2k+1} and J_{2k+1}=0
         else
I_{2k+1}=0 and J_{2k+1}=P_{ntrp}-R_{2k+1}
         end if
         𝐱=𝐙𝐈\mathbf{x}=\mathbf{Z}\mathbf{I}
         𝐲=𝐙𝐉\mathbf{y}=\mathbf{Z}\mathbf{J}
     end for
     t=0t=0
     n=dn=d
     for k=0,1,2k=0,1,2 do
         t=t+|I2kJ2k|t=t+|I_{2k}-J_{2k}|
         if (I2kJ2k)0(I_{2k}-J_{2k})\neq 0 then
              n=n1n=n-1
         end if
     end for
     i=k=02(R2kPntrp+R2k+1)Pcell2kPntrp2ki=\sum_{k=0}^{2}(R_{2k}P_{ntrp}+R_{2k+1})P_{cell}^{2-k}P_{ntrp}^{2-k}
     if t>nt>n then
         𝐯i=𝐭2rBZpH(𝐱,𝐲+𝐭)\mathbf{v}_{i}=\sum_{\mathbf{t}\in 2r_{B}Z_{p}}H\left(\mathbf{x},\mathbf{y}+\mathbf{t}\right)
     else
         𝐯i=0\mathbf{v}_{i}=0
     end if
end for
diag(𝐃(𝔸))=𝔽6𝐯diag\left(\mathbf{D}(\mathbb{A})\right)=\mathbb{F}_{6}\mathbf{v}

4.3 Efficient handling of the periodicity

When a large number of periodic images is considered, the computation of \sum_{\mathbf{t}\in 2r_{B}Z_{p}}H\left(\mathbf{x},\mathbf{y}+\mathbf{t}\right) involved in Alg. 4 becomes too costly. Fortunately, this step can be accelerated through another level of interpolation. Exploiting the translational invariance of H, one gets:

𝐭2rBZpH(𝐱,𝐲+𝐭)\displaystyle\sum_{\mathbf{t}\in 2r_{B}Z_{p}}H(\mathbf{x},\mathbf{y}+\mathbf{t}) =𝐭2rBZpH(𝐱𝐲=:𝐳,𝐭)\displaystyle=\sum_{\mathbf{t}\in 2r_{B}Z_{p}}H(\underbrace{\mathbf{x}-\mathbf{y}}_{=:\mathbf{z}},\mathbf{t}) (47)

that can be separated into two sums

(𝐭2rBZ2H(𝐳,𝐭))=:𝒩per(𝐳)+(𝐭2rBZp\Z2H(𝐳,𝐭))per(𝐳).\underbrace{\left(\sum_{\mathbf{t}\in 2r_{B}Z_{2}}H(\mathbf{z},\mathbf{t})\right)}_{=:\mathcal{N}_{per}(\mathbf{z})}+\underbrace{\left(\sum_{\mathbf{t}\in 2r_{B}Z_{p}\backslash Z_{2}}H(\mathbf{z},\mathbf{t})\right)}_{\mathcal{F}_{per}(\mathbf{z})}. (48)

Since both \mathbf{x},\mathbf{y}\in B lie inside the (zero-centered) simulation box of radius r_{B}, \mathbf{z} lies inside a box \tilde{B} of radius 2r_{B}. Each \mathbf{t}\in 2r_{B}Z_{p}\backslash Z_{2} is at distance at least 2r_{B} from \tilde{B}; hence, each such \mathbf{t} is well-separated from \tilde{B} in the sense of a treecode-like criterion [15]. Therefore, performing an interpolation over an equispaced grid on \tilde{B}, we obtain

per(𝐳)kUk[B~](𝐳)per(𝐳𝐤),\mathcal{F}_{per}(\mathbf{z})\approx\sum_{k}U_{k}[\tilde{B}](\mathbf{z})\mathcal{F}_{per}(\mathbf{z_{k}}), (49)

UkU_{k} being the Lagrange polynomial associated to the kthk^{th} node of the equispaced interpolation grid over B~\tilde{B}.

Combining Eq. (48) and Eq. (49), we get

per(𝐳)\displaystyle\mathcal{F}_{per}(\mathbf{z}) kUk[B~](𝐳)(𝐭2rBZp\Z2H(𝐳k,𝐭))=:hk\displaystyle\approx\sum_{k}U_{k}[\tilde{B}](\mathbf{z})\underbrace{\left(\sum_{\mathbf{t}\in 2r_{B}Z_{p}\backslash Z_{2}}H(\mathbf{z}_{k},\mathbf{t})\right)}_{=:h_{k}} (50)

in which the sum in parentheses corresponds to an N-body problem on two point clouds: the Cartesian grid over \tilde{B} (with L^{3} nodes) and the set 2r_{B}Z_{p}\backslash Z_{2} (with all charges equal to 1). This N-body problem can be solved efficiently using a treecode or an IBFMM. Then, once all the h_{k}'s are known, any \mathcal{F}_{per}(\mathbf{z}) can be retrieved in \mathcal{O}(L^{3})=\mathcal{O}(1) operations. In addition, these h_{k}'s can be computed once during the precomputation step, since they do not depend on the modified charges but only on the kernel H.

The term \mathcal{N}_{per}(\mathbf{z}) only involves a sum over 5^{3}=\mathcal{O}(1) elements, meaning that, using this interpolation, one may deduce the result of Eq. (48) in \mathcal{O}(1) flops.
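The following sketch summarizes this strategy under simplifying assumptions (a placeholder 1/r kernel standing in for H, and a direct double sum where a treecode or an IBFMM should be used): the h_k's of Eq. (50) are tabulated once, then any \mathcal{F}_{per}(\mathbf{z}) is obtained in \mathcal{O}(L^{3}) operations by tensorized Lagrange interpolation.

import numpy as np
from itertools import product

def lagrange_basis(nodes, x):
    # 1D Lagrange basis of `nodes`, evaluated at the scalar x.
    out = np.ones(len(nodes))
    for j in range(len(nodes)):
        for m in range(len(nodes)):
            if m != j:
                out[j] *= (x - nodes[m]) / (nodes[j] - nodes[m])
    return out

def precompute_h(H, r_B, p, L):
    # h_k of Eq. (50): lattice sums of H over the far images 2 r_B (Z_p \ Z_2),
    # tabulated at the L^3 nodes of an equispaced grid over the box of radius 2 r_B.
    g = np.linspace(-2.0 * r_B, 2.0 * r_B, L)
    far = [2.0 * r_B * np.array(t) for t in product(range(-p, p + 1), repeat=3)
           if max(abs(c) for c in t) > 2]
    return np.array([sum(H(np.array(z), t) for t in far) for z in product(g, repeat=3)])

def F_per(z, h, r_B, L):
    # O(L^3) evaluation of Eq. (49) at z = x - y by tensorized interpolation.
    g = np.linspace(-2.0 * r_B, 2.0 * r_B, L)
    ux, uy, uz = (lagrange_basis(g, z[d]) for d in range(3))
    return np.einsum('i,j,k,ijk->', ux, uy, uz, h.reshape(L, L, L))

H = lambda x, y: 1.0 / np.linalg.norm(x - y)     # placeholder kernel
h = precompute_h(H, r_B=1.0, p=5, L=6)           # done once per simulation box
print(F_per(np.array([0.3, -0.2, 0.1]), h, 1.0, 6))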

4.4 Mutual interactions

The evaluation of Eq. (46) involves two FFTs if computed naively. Since these FFTs are the most time-consuming part of the evaluation, we may wonder how to mitigate their cost. This can be done by exploiting, once again as in Sect. 3.6, the mutual interactions. Our idea is based on the following result:

Corollary 2

If HH has real values, then 𝐃(𝔸)[(2E+11)(2L1)]3×[(2E+11)(2L1)]3\mathbf{D}(\mathbb{A})\in\mathbb{R}^{[(2^{E+1}-1)(2L-1)]^{3}\times[(2^{E+1}-1)(2L-1)]^{3}}.

This can be proved from Thm. 1 by noticing that the radial property of H implies the symmetry of \mathbb{A}. Hence, the spectrum of \mathbb{A} is real (as for any real symmetric matrix).

As a direct consequence, following Eq. (46), we have

𝐯Tχ𝔽6𝐃(𝔸)𝔽6χ𝐯\displaystyle\mathbf{v}^{T}\chi^{*}\mathbb{F}_{6}^{*}\mathbf{D}(\mathbb{A})\mathbb{F}_{6}\chi\mathbf{v} =(𝔽6χ𝐯)𝐃(𝔸)𝔽6χ𝐯=:𝐰,\displaystyle=\left(\mathbb{F}_{6}\chi\mathbf{v}\right)^{*}\mathbf{D}(\mathbb{A})\underbrace{\mathbb{F}_{6}\chi\mathbf{v}}_{=:\mathbf{w}}, (51)

with 𝐰=(w0,,w[(2E+11)(2L1)]31)T\mathbf{w}=(w_{0},...,w_{[(2^{E+1}-1)(2L-1)]^{3}-1})^{T}, and which reformulates as

\mathbf{w}^{*}\mathbf{D}(\mathbb{A})\mathbf{w}=\sum_{k=0}^{[(2^{E+1}-1)(2L-1)]^{3}-1}\overline{w_{k}}\mathbf{D}(\mathbb{A})_{k,k}w_{k}=\sum_{k=0}^{[(2^{E+1}-1)(2L-1)]^{3}-1}\mathbf{D}(\mathbb{A})_{k,k}|w_{k}|^{2}. (52)

Hence, only \mathbf{w} has to be computed, meaning that a single FFT needs to be performed during the evaluation of Eq. (46). As another important consequence of Cor. 2, once this FFT has been performed, one can keep only the real part of the output (the imaginary part being equal to 0). In our implementation, we even rely on the real-to-complex FFTs of FFTW3 [19], since the input vector of the FFT producing \mathbf{w} is real provided that the charges, dipoles and quadrupoles also are.
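A minimal numpy transcription of this single-FFT evaluation could read as follows; the 6D shapes and the realization of \chi as zero-padding follow our reading of Cor. 1, and the spectrum diag_D is a random placeholder here (in practice it comes from Alg. 4).

import numpy as np

def far_field_energy(diag_D, v):
    # Eq. (52): F ~ sum_k D(A)_{k,k} |w_k|^2 with w = F_6 chi v; a single FFT.
    padded = np.zeros(diag_D.shape)
    padded[tuple(slice(0, s) for s in v.shape)] = v   # chi v, i.e. zero-padding
    w = np.fft.fftn(padded)                           # F_6 chi v (a real-to-complex
                                                      # FFT would suffice, as in FFTW3)
    return float(np.sum(diag_D * np.abs(w) ** 2))

E, L = 2, 4
grid_shape = (2**E, L) * 3                            # (2^E, L, 2^E, L, 2^E, L)
emb_shape = (2**(E + 1) - 1, 2 * L - 1) * 3           # circulant embedding shape
rng = np.random.default_rng(0)
v = rng.random(grid_shape)                            # modified charges (real)
diag_D = rng.random(emb_shape)                        # placeholder real spectrum
print(far_field_energy(diag_D, v))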

4.5 Global algorithm

It is now possible to provide the overall algorithm of the method derived from our ANKH methodology and exploiting the diagonalization of the interpolated Ewald summation over local equispaced grids. Namely, this corresponds to ANKH-FFT, and Alg. 5 summarizes all its steps.

ANKH-FMM and ANKH-FFT share strong similarities: both use the same computation of the modified charges, both compute the near-field part (\mathcal{N}_{real}) using direct interactions, and both compute the self energy in the same way. However, the computation of the far-field part (\mathcal{F}_{real}) is strongly simplified in ANKH-FFT.

Among the differences, we decided to take the influence of the periodicity into account within the diagonalization process of ANKH-FFT, while we opted for an external handling in ANKH-FMM. This means that the evaluation step of ANKH-FFT does not involve any explicit periodic influence (it is already contained in \mathbf{D}(\mathbb{A})). The reason behind this choice is that, in ANKH-FFT, the far-field computation is performed directly in the Fourier domain, in which the particles cannot be interpreted as easily as in the cartesian domain, while in ANKH-FMM, the multilevel structure provides a way of efficiently measuring the influence of the far images on-the-fly. This points out the main difference between the two approaches: ANKH-FFT is not multilevel, but is built on a global operation (the FFT).

Algorithm 5 ANKH-FFT
// Step 1: get modified cartesian multipole expansions
for each leaf cell s(T)s\in\mathcal{L}(T) do
     Compute 𝒬s\mathcal{Q}_{s} using Alg. 2
     s=𝐄s𝒬s\mathcal{M}_{s}=\mathbf{E}_{s}\mathcal{Q}_{s}
end for
// Step 2: get real\mathcal{F}_{real} using FFT
Let 𝐯\mathbf{v} be defined as in Eq. (46)
𝐰=𝔽6χ𝐯\mathbf{w}=\mathbb{F}_{6}\chi\mathbf{v}
\mathcal{F}_{real}=\mathbf{w}^{*}\mathbf{D}(\mathbb{A})\mathbf{w} using Eq. (52)
// Step 3: get 𝒩real\mathcal{N}_{real} through mutual P2P
𝒩real=𝟎\mathcal{N}_{real}=\mathbf{0}
for each leaf cell t(𝒯)t\in\mathcal{L}(\mathscr{T}) do
     for each source cell ss adjacent to tt do
if the interaction of s and t has not yet been computed then
              for 𝐱X|t\mathbf{x}\in X_{|t} do
                  for 𝐲Y|s\mathbf{y}\in Y_{|s} do
                       𝒩real=𝒩real+2𝔇𝐱𝔇𝐲G(𝐱,𝐲)\mathcal{N}_{real}=\mathcal{N}_{real}+2\mathfrak{D}_{\mathbf{x}}\mathfrak{D}_{\mathbf{y}}G(\mathbf{x},\mathbf{y})
                  end for
              end for
         end if
     end for
end for
// Step 4: compute self energy
self=0\mathcal{E}_{self}=0
for each 𝐱X\mathbf{x}\in X do
\mathcal{E}_{self}=\mathcal{E}_{self}+q_{\mathbf{x}}^{2}+\frac{2\xi^{2}}{3}\mu_{\mathbf{x}}\cdot\mu_{\mathbf{x}}+\frac{8\xi^{4}}{5}\Theta_{\mathbf{x}}:\Theta_{\mathbf{x}}
end for
// Step 5: return energy \mathcal{E}
=ξπself+real+𝒩real\mathcal{E}=-\frac{\xi}{\sqrt{\pi}}\mathcal{E}_{self}+\mathcal{F}_{real}+\mathcal{N}_{real}

4.6 Complexity

There are two complexities to account for in ANKH-FFT: the precomputation one and the evaluation one. The precomputation step involves solving the N-body problem of Eq. (50), which may be done in linear time with respect to the number of far images, i.e. in \mathcal{O}(p^{3}). In practice, p=\mathcal{O}(1) is small. Then, the diagonal of \mathbf{D}(\mathbb{A}) is computed using Alg. 4. It is easy to see from this algorithm that, thanks to Sect. 4.3, its complexity is linearithmic with respect to the size of \mathbf{v}.

When considering quasi-uniform particle distributions, the total number of leaf cells can be considered as \mathcal{O}(N). In this case, the size of \mathbf{v} is \mathcal{O}(NL^{3}), where the constant L, corresponding to the number of interpolation nodes in one direction of the equispaced grid, is small. Hence, the size of \mathbf{v} is \mathcal{O}(N) and the FFTs perform in \mathcal{O}(N\log N).

The precomputation cost of the modified charges being linear with respect to the number of particles in the simulation box, the overall precomputation complexity is

\mathcal{O}(N)+\mathcal{O}(N)+\mathcal{O}(N\log N)=\mathcal{O}(N\log N). (53)

Since the product by a diagonal matrix is also \mathcal{O}(N), the overall complexity of the evaluation of \mathcal{F} through Eq. (46) is \mathcal{O}(N\log N) (because of the FFT).

4.7 Comparison between ANKH-FMM and ANKH-FFT and partial conclusions

This section is dedicated to the comparison between ANKH-FMM and ANKH-FFT. As already discussed, the theoretical complexities (respectively linear and linearithmic) are not sufficient to conclude on the practical efficiency. One of the main reasons is the arithmetic intensity of state-of-the-art FFT implementations [19], which should be higher than that of our IBFMM. Hence, we compared the running times of the evaluation of both methods (i.e. excluding precomputations) on the same particle distributions, using charges only (since the near-field part is the same in the two methods) and no periodic condition (since its handling has a very small impact on the timings and we want to compare the main core of these methods). Timings are provided in Fig. 5. For these tests, we ran our (sequential) codes on a single Intel(R) Xeon(R) Gold 6152 CPU.

Figure 5: Timing comparison of the two ANKH variants with respect to the number of particles (atoms) in the simulation box.

The performances of the two methods are comparable over the (large) tested range of systems, i.e. from 1000 atoms to 10M atoms: ANKH-FFT performs slightly better on the case with 10000 atoms, but not in a significant way. Since ANKH-FMM follows the theoretical estimate of linear complexity, ANKH-FFT also behaves as a linear method in practice.

We may still insist on the fact that the precomputations involved in ANKH-FFT are more costly than those of ANKH-FMM (but of the same order of magnitude as the evaluation step). These precomputation timings are, however, quite relative to the application: in the MD context, the same simulation box is used for a massive number of evaluations, hence these precomputations are performed only once for many evaluations, which mitigates their cost.

To summarize, we list the main differences, advantages and drawbacks of the two ANKH approaches:

  • ANKH-FMM is a truly linear method while ANKH-FFT is only a linearithmic one. Both achieve fast computations with comparable running times (on CPU) and both can take into account periodic boundary conditions as well as distributed point multipoles in a convergent way.

  • ANKH-FFT is easier to implement than ANKH-FMM, especially with the GPU context in mind, while ANKH-FMM, as an IBFMM-based approach, directly benefits from the HPC literature on this kind of method.

  • Since ANKH-FFT relies mainly on FFTs, it may be easier to provide portable codes for it (most modern architectures provide efficient FFT implementations).

  • The drawback, however, is that the precomputation timings of ANKH-FFT are higher than the ANKH-FMM ones.

One of the main motivations behind the ANKH approach is to design methods able to scale efficiently on modern HPC architectures and on (very) large systems of particles. Hence, it is important to realize that state-of-the-art techniques for parallel FMM computations (mainly relying on Local Essential Trees [34, 45, 16, 28, 1]) can be used in MD through the ANKH framework. More specifically, Local Essential Trees may also be combined with the ANKH-FFT approach, since local sub-problems can be treated with the latter (only the handling of the periodic boundary conditions has to be slightly adapted). We believe that this last idea, exploiting a GPU implementation of ANKH-FFT, can lead to highly scalable MD simulations on modern and hybrid HPC architectures.

5 Numerical results

In this section, we provide numerical experiments aiming at validating our ANKH approach. Following Sect. 4.7, since the two ANKH variants (ANKH-FMM and ANKH-FFT) provide almost the same timing results, the numerical timing results in this section are based on ANKH-FFT only. Our code is sequential and runs on a single Intel(R) Xeon(R) Silver 4116 CPU. We used the Intel oneAPI Math Kernel Library for BLAS calls and FFTW3 for FFT computations. All the code uses double precision arithmetic. The Ewald summation parameter \xi (see Eq. 10) is always fixed at 0.01.

5.1 Interpolation errors

First, we want to verify that our way of treating the point multipole expansions results in admissible errors. To do so, for each partial derivative of H (as in Eq. 11), in situations corresponding to well-separated pairs of cells (t,s), we numerically measured the relative error of the interpolation:

\partial_{\mathbf{x}}^{\alpha}\partial_{\mathbf{y}}^{\beta}\mathcal{I}(\mathbf{x},\mathbf{y}):=\sum_{k}\left(\partial_{\mathbf{x}}^{\alpha}S_{k}[t]\right)\sum_{l}H(\mathbf{x}_{k},\mathbf{y}_{l})\left(\partial_{\mathbf{y}}^{\beta}S_{l}[s]\right),

\frac{\displaystyle\max_{\substack{100\text{ randomly generated}\\ (\mathbf{x},\mathbf{y})\in t\times s}}\{\partial_{\mathbf{x}}^{\alpha}\partial_{\mathbf{y}}^{\beta}\left(H(\mathbf{x},\mathbf{y})-\mathcal{I}(\mathbf{x},\mathbf{y})\right)\}}{\displaystyle\max_{\substack{100\text{ randomly generated}\\ (\mathbf{x},\mathbf{y})\in t\times s}}\{|H(\mathbf{x},\mathbf{y})|\}}, (54)

where we considered the maximum error over the same set of 100 test points in both the numerator and the denominator. The important point is that these errors are given relative to the maximum magnitude of the (non-differentiated) kernel H. This is motivated by the form of our differential operators \mathfrak{D}_{\mathbf{x}}, \mathfrak{D}_{\mathbf{y}}, which are linear combinations of partial derivatives and hence influenced by the maximal magnitude of these partial derivatives, itself attained by the kernel evaluation (without derivatives) directly. In addition, because we do not consider the scaling by the charge, dipole and quadrupole moments, and because in our applications the magnitudes of these moments are far smaller than those of the charges, the results of Fig. 6 should overestimate the numerical error induced by the numerical differentiation in practical cases. Notice that similar results are obtained using more test points (but this prohibits the tests for high orders). We tested all possible partial derivatives; since the errors using \partial^{\alpha_{0}}_{\mathbf{x}}\partial_{\mathbf{y}}^{\beta_{0}} and \partial^{\alpha_{1}}_{\mathbf{x}}\partial_{\mathbf{y}}^{\beta_{1}} have the same order whenever |\alpha_{0}|=|\alpha_{1}| and |\beta_{0}|=|\beta_{1}|, we only report one of them in the figure.

Figure 6: Relative error (according to Eq. 54) of kernel interpolation and numerical differentiation. q: charges, d: dipoles, T: quadrupoles. Error type i-j corresponds to the derivatives used for i and j applied to the kernel H.

According to Fig. 6, the numerical differentiation described in Sect. 3.4 results in errors on the derivatives that are always smaller than the error on the kernel interpolation itself (relative to the direct kernel evaluation). The numerical error of this interpolation process on the (non-differentiated) kernel follows a \mathcal{O}(\nu^{L}) shape, with 1D interpolation order L and \nu\in(0,1), and so do all the tested partial derivatives (with the error defined in Eq. 54).

As a sum of these (1+3+6=10, using the quadrupole matrix symmetry) partial derivatives, the numerical differentiation error using the operators \mathfrak{D}_{\mathbf{x}} and \mathfrak{D}_{\mathbf{y}} (as in Sect. 1) should also behave as \mathcal{O}(\nu^{L}). This last point is investigated in Sect. 5.2.
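As a 1D illustration of the error measure of Eq. (54), the following sketch interpolates a placeholder kernel (standing in for H) between two well-separated intervals using Chebyshev nodes and reports the relative error over 100 random test pairs; increasing L should exhibit the geometric decay observed in Fig. 6.

import numpy as np

def basis(nodes, x):
    # 1D Lagrange basis of `nodes`, evaluated at the scalar x.
    out = np.ones(len(nodes))
    for j in range(len(nodes)):
        for m in range(len(nodes)):
            if m != j:
                out[j] *= (x - nodes[m]) / (nodes[j] - nodes[m])
    return out

L = 8
h = lambda x, y: 1.0 / np.abs(x - y)                               # placeholder kernel
tn = 0.5 + 0.5 * np.cos((2 * np.arange(L) + 1) * np.pi / (2 * L))  # Chebyshev nodes in t = [0, 1]
sn = 3.5 + 0.5 * np.cos((2 * np.arange(L) + 1) * np.pi / (2 * L))  # Chebyshev nodes in s = [3, 4]
K = h(tn[:, None], sn[None, :])                                    # kernel at interpolation nodes

rng = np.random.default_rng(0)
xs, ys = rng.uniform(0, 1, 100), rng.uniform(3, 4, 100)
interp = np.array([basis(tn, x) @ K @ basis(sn, y) for x, y in zip(xs, ys)])
rel_err = np.max(np.abs(h(xs, ys) - interp)) / np.max(np.abs(h(xs, ys)))
print(rel_err)   # decays geometrically as L grows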

5.2 Overall error in non-periodic configuration

In this section, we measure the relative error r_{ANKH} of the ANKH-FFT result \mathcal{E}_{ANKH} (relative to the exact solution) on a small water test case without periodic boundary conditions, with respect to the 1D (Chebyshev) interpolation order. This relative error is given by

\mathcal{E}_{exa}=\frac{1}{2}\sum_{\mathbf{x}\in X}\sum_{\mathbf{y}\in Y}\mathfrak{D}_{\mathbf{x}}\mathfrak{D}_{\mathbf{y}}\left(\frac{1}{|\mathbf{x}-\mathbf{y}|}\right),\qquad r_{ANKH}=\frac{|\mathcal{E}_{exa}-\mathcal{E}_{ANKH}|}{|\mathcal{E}_{exa}|}. (55)

According to Rem. 1, the 1D interpolation order L_{u} on equispaced grids does not have to match the Chebyshev one L_{c}; hence, we considered different L_{u} for each tested L_{c} in order to check the compression error that may be induced by choosing L_{u}<L_{c}. Results are provided in Fig. 7.

Figure 7: Relative error on a water box with 648 atoms (without periodic boundary conditions) with respect to the 1D Chebyshev interpolation order L_{c} and for various 1D equispaced interpolation orders L_{u}.

First, we observe that the error decreases geometrically with respect to the 1D Chebyshev interpolation order, provided that the equispaced one equals this order. Hence, the overall ANKH-FFT approach numerically converges in accordance with the theory. Regarding this geometric convergence, when fixing the equispaced order to 4, the relative error does not decrease below \approx 10^{-5}, while it stagnates at \approx 10^{-6} and \approx 10^{-7} for equispaced orders fixed to 6 and 8 respectively. Moreover, when the equispaced 1D interpolation order is fixed to a constant, the convergence is still observed up to the precision allowed by the latter (and then stagnates). This validates the trick we presented in Rem. 1: we can further compress the Chebyshev interpolation process using possibly smaller equispaced interpolation orders.

5.3 Overall error in periodic boundary configuration

The tested relative error is the same as in Eq. 55, except that periodic boundary conditions are used for \mathcal{E}_{exa}, fitting with the target quantity defined in Eq. 1:

exa\displaystyle\mathcal{E}_{exa} =12𝐭2rB3𝐱X𝐲Y𝔇𝐱𝔇𝐲(1|𝐱𝐲+𝐭|),\displaystyle=\frac{1}{2}\sum_{\mathbf{t}\in 2r_{B}\mathbb{Z}^{3}}\sum_{\mathbf{x}\in X}\sum_{\mathbf{y}\in Y}\mathfrak{D}_{\mathbf{x}}\mathfrak{D}_{\mathbf{y}}\left(\frac{1}{|\mathbf{x}-\mathbf{y}+\mathbf{t}|}\right), (56)
rANKH\displaystyle r_{ANKH} =|exaANKH||exa|.\displaystyle=\frac{|\mathcal{E}_{exa}-\mathcal{E}_{ANKH}|}{|\mathcal{E}_{exa}|}.

The corresponding results are provided in Fig. 8, in which we considered 31^{3}-1 images of the simulation box, a 1D equispaced interpolation order equal to the Chebyshev one L_{c}, and an interpolation order for the handling of far images (see Sect. 4.3) equal to L_{c}+1 (which was fixed empirically and may not be optimal). The test case is the same as in Sect. 5.2. The choice of the number of images is also empirical: we observed in our tests that this value allowed full convergence in most cases.

Figure 8: Relative error of ANKH-FFT with respect to the 1D Chebyshev interpolation order.

According to these errors, our ANKH approach converges in the periodic boundary condition configuration, still geometrically (as in the non-periodic case of Sect. 5.2). The relative error value for L_{c}=8 is surprisingly low, but we believe that this is just a particular case in which more digits happen to be correct than expected, for this particular test case and parameter choice. This validates our ANKH approach when using periodic boundary conditions.

Unfortunately, due to the cost of the direct computation of an exact reference solution with a large number of far images, we cannot directly check our code against the exact solution for large numbers of atoms. However, to reinforce the conclusion of this section, as discussed (and done) in Sect. 5.4, we can verify that our ANKH approach reaches the same accuracy as a provably convergent code on larger cases. Notice that we also retrieve the geometric convergence under periodic boundary conditions when using a small number of images (to keep the direct computation feasible) on test cases with \approx 100000 particles; since the conclusions are the same, we do not report these results in this paper.

5.4 Performance test

In this section, we compare the performance of ANKH-FFT and the Smooth Particle Mesh Ewald (SPME) of Tinker-HP on various systems. Their sizes range from a small water box of 216 water molecules (648 atoms) up to the Satellite Tobacco Mosaic Virus in water (1066624 atoms), covering the typical scales of systems of interest in biochemistry, even if both methods could naturally handle larger cases. In more detail, these are:

  • water box of 648 atoms, size 18.643³ Å³,

  • dhfr protein in water, 23558 atoms, size 62.23³ Å³,

  • water box of 96000 atoms, size 98.65³ Å³,

  • water box of 288000 atoms, size 142.17³ Å³,

  • water box of 864000 atoms, size 205.19³ Å³,

  • Satellite Tobacco Mosaic Virus in water, 1066624 atoms, size 223.0³ Å³.

Most of these test cases can be found in the Tinker-HP git repository: https://github.com/TinkerTools/tinker-hp/tree/master/v1.2/example

Let us recall that we compare sequential executions of both applications, without including precomputations (see Sect. 4.7). We use the Tinker-HP results with default SPME parameters as a reference, which we observed to be of order 10^{-4} accuracy compared to a pure Ewald summation. Thus, we tuned the ANKH-FFT parameters to reach similar relative errors. For these tests, we considered a 1D equispaced interpolation order equal to 2.

Figure 9: Comparison of application timings (in seconds) for ANKH-FFT (red) and SPME (purple) with regard to the number of atoms in B.

On small test cases (between a few hundred and a few thousand atoms), ANKH-FFT and SPME show similar timings. Medium test cases appear to be the most favorable to SPME here, where it performs 20% faster than ANKH-FFT (for 96000 atoms). Notice that this number of particles is quite small for hierarchical methods and that the performances still remain very similar (of the same order of magnitude each time), which allows us to claim that our method performs almost as well as state-of-the-art approaches on this type of particle distribution. On larger test cases, ANKH-FFT outperforms SPME, providing timings up to 40% smaller than the latter.

Hence, ANKH-FFT shows strong performance over a wide range of particle distribution sizes and for various types of typical systems arising in biochemistry. This direct code performance comparison does not, however, allow us to conclude on the interest of the optimizations we provided. We thus focus on ANKH-FFT in Sect. 5.5.

5.5 Split timings and details

In this section, we provide additional details on the tests of Sect. 5.4. First, we present in Fig. 10 split timings in order to compare the running times of the various evaluation steps. There are mainly three of them:

  • the computation of interpolation polynomials at leaves and transformation to modified charges,

  • the far field computation (using FFT),

  • the near field computation (using direct evaluation).

Figure 10: Relative timings (with respect to the entire running time of each test case) of the different steps of the ANKH-FFT application. The relative error (compared to the results of Tinker-HP) is indicated in black and the number of atoms in each corresponding system is given on the abscissa.

Clearly, the most time-consuming part is the near-field computation. This step, as well as the polynomial handling, has the strong advantage of being local. In addition, the tree structure localizes the particles in an efficient way, avoiding the computation of neighbor lists. We thus expect these two steps to fully benefit from parallelization.

Since the far-field computation time does not exceed \approx 6\% of the overall application time, these results tend to validate our fast handling of the far interactions. Notice that this far field still involves most of the interactions (at least for the medium and large test cases). However, these relatively small timings are mainly due to the strong recompression, using a 1D equispaced interpolation order equal to 2. This was sufficient to reach the precision of the Tinker-HP SPME implementation (see the relative errors in Fig. 10). For higher required precisions, the balance between near and far fields is better preserved in the overall timings. Nevertheless, this recompression allows to strongly mitigate the prefactor of hierarchical methods.

As stated in Sect. 4.6, the ANKH-FFT precomputation complexity is linearithmic. These costs are negligible in applications, according to Sect. 4.7. We still present in Fig. 11 the precomputation timings of ANKH-FFT on the different test cases of Sect. 5.4.

Figure 11: Precomputation timings with respect to the number of atoms, for a fixed 10^{-4} output precision and using 31^{3}-1 images of the simulation box B (in red), compared to a \mathcal{O}(N\log N) curve (in black).

According to these timings, the precomputation step follows our theoretical linearithmic estimate. This is a consequence of the interpolation used for the handling of the far-field images, since we also compared with a non-interpolated version that was already unable to run on the medium test cases. In terms of order of magnitude, the ANKH-FFT precomputation timings reach a few hundred seconds on the larger test cases. Regarding the entire cost of a MD simulation, since these precomputations are performed only once, the conclusion proposed in Sect. 4.7 is numerically verified.

6 Conclusion

In this paper, we introduced ANKH, a new polynomial interpolation-based accelerated Ewald summation for energy computations in molecular dynamics. Our method mainly exploits the (absolutely convergent) real part of the Ewald summation in order to express the energy computation problem as a generalized N-body problem on which fast kernel-independent hierarchical methods can perform. We then proposed two fast approaches: the first one, based on IBFMM (accordingly named ANKH-FMM), is well-suited for distributed memory parallelization, and the second, built on a fast matrix diagonalization using FFTs (named ANKH-FFT), is a good candidate for GPU computations. ANKH-FMM has a linear complexity while ANKH-FFT has a linearithmic one, but the latter practically behaves as a linear complexity approach. We demonstrated the numerical convergence of our interpolation-based method and we compared our implementation to a Smooth Particle Mesh Ewald production code (Tinker-HP), exhibiting comparable performance on small and medium test systems and superior performance on large test cases. A public code is provided at https://github.com/IChollet/ankh-fft and a version of it will also be available within a future Tinker-HP release (https://github.com/TinkerTools/tinker-hp).

We believe that ANKH is an interesting approach for MD on modern HPC architectures, especially because it theoretically solves the scalability issues of Particle-Mesh methods in the distributed memory context. However, our current implementation uses neither shared nor distributed memory parallelism, and our intuition still has to be verified in practice. Hence, our priority is now to develop a scalable ANKH-based code (MPI + GPU) and to test it on early exascale architectures.

Among the other short-term perspectives of the work presented here, we are considering the (almost) straightforward extension to force computations. Indeed, the main difference between the algorithms we presented and a general force algorithm lies in the mutual interactions, which cannot be exploited in the same way. We would also like to look at applications of ANKH-FFT to other scientific fields in which N-body problems on quasi-uniform particle distributions have to be solved.

Considering long-term perspectives, the adaptation of the algorithms we presented to non-cubic simulation boxes may also find applications in materials science.

Acknowledgements

This work has also received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (grant agreement No 810367), project EMC2 (JPP).

References

  • [1] Mustafa Abduljabbar, Mohammed Al Farhan, Noha Al-Harthi, Rui Chen, Rio Yokota, Hakan Bagci, and David Keyes. Extreme scale fmm-accelerated boundary integral equation solver for wave scattering. SIAM Journal on Scientific Computing, 41(3):C245–C268, 2019.
  • [2] A. Aimi, L. Desiderio, and G. Di Credico. Partially pivoted aca based acceleration of the energetic bem for time-domain acoustic and elastic waves exterior problems. Computers & Mathematics with Applications, 119:351–370, 2022.
  • [3] Amirhossein Amiraslani, Robert M Corless, and Madhusoodan Gunasingam. Differentiation matrices for univariate polynomials. Numerical Algorithms, 83:1–31, 2019.
  • [4] Alan Ayala, Stanimire Tomov, Miroslav Stoyanov, and Jack Dongarra. Scalability issues in fft computation. In Victor Malyshkin, editor, Parallel Computing Technologies, pages 279–287, Cham, 2021. Springer International Publishing.
  • [5] Siwar Badreddine, Igor Chollet, and Laura Grigori. Factorized structure of the long-range two-electron integrals tensor and its application in quantum chemistry. working paper or preprint, October 2022.
  • [6] Pierre Blanchard. Fast hierarchical algorithms for the low-rank approximation of matrices, with applications to materials physics, geostatistics and data analysis. Theses, Université de Bordeaux, February 2017.
  • [7] John Board, Christopher Humphres, Christophe Lambert, William Rankin, and Abdulnour Toukmaji. Ewald and multipole methods for periodic n-body problems. 03 1997.
  • [8] Bernard R Brooks, Charles L Brooks III, Alexander D Mackerell Jr, Lennart Nilsson, Robert J Petrella, Benoît Roux, Youngdo Won, Georgios Archontis, Christian Bartels, Stefan Boresch, et al. Charmm: the biomolecular simulation program. Journal of computational chemistry, 30(10):1545–1614, 2009.
  • [9] Steffen Börm. Efficient Numerical Methods for Non-local Operators: 2\mathcal{H}^{2}-Matrix Compression, Algorithms and Analysis. 12 2010.
  • [10] David A Case, Thomas E Cheatham III, Tom Darden, Holger Gohlke, Ray Luo, Kenneth M Merz Jr, Alexey Onufriev, Carlos Simmerling, Bing Wang, and Robert J Woods. The amber biomolecular simulation programs. Journal of computational chemistry, 26(16):1668–1688, 2005.
  • [11] Matt Challacombe, Chris White, and Martin Head-Gordon. Periodic boundary conditions and the fast multipole method. The Journal of Chemical Physics, 107, 12 1997.
  • [12] Igor Chollet. Symmetries and Fast Multipole Methods for Oscillatory Kernels. Theses, Sorbonne Université, March 2021.
  • [13] Igor Chollet, Xavier Claeys, Pierre Fortin, and Laura Grigori. A Directional Equispaced interpolation-based Fast Multipole Method for oscillatory kernels. working paper or preprint, February 2022.
  • [14] Tom Darden, Darrin York, and Lee Pedersen. Particle mesh ewald: An n log(n) method for ewald sums in large systems. The Journal of Chemical Physics, 98(12):10089–10092, 1993.
  • [15] Walter Dehnen. A hierarchical (n) force calculation algorithm. Journal of Computational Physics, 179(1):27–42, jun 2002.
  • [16] John Dubinski. A parallel tree code. New Astronomy, 1(2):133–147, oct 1996.
  • [17] Ulrich Essmann, Lalith Perera, Max Berkowitz, Tom Darden, Hsing Lee, and Lee Pedersen. A smooth particle mesh ewald method. J. Chem. Phys., 103:8577, 11 1995.
  • [18] William Fong and Eric Darve. The black-box fast multipole method. Journal of Computational Physics, 228(23):8712–8725, 2009.
  • [19] Matteo Frigo and Steven G. Johnson. The design and implementation of FFTW3. Proceedings of the IEEE, 93(2):216–231, 2005. Special issue on “Program Generation, Optimization, and Platform Adaptation”.
  • [20] L Greengard and V Rokhlin. A fast algorithm for particle simulations. Journal of Computational Physics, 73(2):325–348, 1987.
  • [21] L. Greengard and V. Rokhlin. The rapid evaluation of potential fields in three dimensions. In Christopher Anderson and Claude Greengard, editors, Vortex Methods, pages 121–141, Berlin, Heidelberg, 1988. Springer Berlin Heidelberg.
  • [22] Tsuyoshi Hamada, Tetsu Narumi, Rio Yokota, Kenji Yasuoka, Keigo Nitadori, and Makoto Taiji. 42 tflops hierarchical n-body simulations on gpus with applications in both astrophysics and turbulence. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC ’09, New York, NY, USA, 2009. Association for Computing Machinery.
  • [23] David J. Hardy, Zhe Wu, James C. Phillips, John E. Stone, Robert D. Skeel, and Klaus Schulten. Multilevel summation method for electrostatic force evaluation. Journal of Chemical Theory and Computation, 11(2):766–779, 2015. PMID: 25691833.
  • [24] Zhifeng Jing, Chengwen Liu, Sara Y. Cheng, Rui Qi, Brandon D. Walker, Jean-Philip Piquemal, and Pengyu Ren. Polarizable force fields for biomolecular simulations: Recent advances and applications. Annual Review of Biophysics, 48(1):371–394, 2019. PMID: 30916997.
  • [25] Bartosz Kohnke, Carsten Kutzner, and Helmut Grubmüller. A gpu-accelerated fast multipole method for gromacs: Performance and accuracy. Journal of Chemical Theory and Computation, 16(11):6938–6949, 2020. PMID: 33084336.
  • [26] Louis Lagardère, Luc-Henri Jolly, Filippo Lipparini, Félix Aviat, Benjamin Stamm, Zhifeng F Jing, Matthew Harger, Hedieh Torabifard, G Andrés Cisneros, Michael J Schnieders, et al. Tinker-hp: a massively parallel molecular dynamics package for multiscale simulations of large complex systems with advanced point dipole polarizable force fields. Chemical science, 9(4):956–972, 2018.
  • [27] Christophe G. Lambert, Thomas A. Darden, and John A. Board Jr. A multipole-based algorithm for efficient calculation of forces and potentials in macroscopic periodic assemblies of particles. Journal of Computational Physics, 126(2):274–285, 1996.
  • [28] Ilya Lashuk, Aparna Chandramowlishwaran, Harper Langston, Tuan-Anh Nguyen, Rahul Sampath, Aashay Shringarpure, Richard Vuduc, Lexing Ying, Denis Zorin, and George Biros. A massively parallel adaptive fast multipole method on heterogeneous architectures. Commun. ACM, 55(5):101–109, may 2012.
  • [29] Josef Melcr and Jean-Philip Piquemal. Accurate biomolecular simulations account for electronic polarization. Frontiers in Molecular Biosciences, 6, 2019.
  • [30] Matthias Messner. Fast Boundary Element Methods in Acoustics. PhD thesis, Technische Universität Graz, 2011.
  • [31] Yousuke Ohno, Rio Yokota, Hiroshi Koyama, Gentaro Morimoto, Aki Hasegawa, Gen Masumoto, Noriaki Okimoto, Yoshinori Hirano, Huda Ibeid, Tetsu Narumi, and Makoto Taiji. Petascale molecular dynamics simulation using the fast multipole method on K computer. Computer Physics Communications, 185(10):2575–2585, 2014.
  • [32] V. Rokhlin and S. Wandzura. The fast multipole method for periodic structures. In Proceedings of IEEE Antennas and Propagation Society International Symposium and URSI National Radio Science Meeting, volume 1, pages 424–426, 1994.
  • [33] Celeste Sagui, Lee G. Pedersen, and Thomas A. Darden. Towards an accurate representation of electrostatics in classical force fields: Efficient implementation of multipolar interactions in biomolecular simulations. The Journal of Chemical Physics, 120(1):73–87, 2004.
  • [34] John K. Salmon. Parallel hierarchical N-body methods. PhD thesis, California Institute of Technology, 1990.
  • [35] Kevin E. Schmidt and Michael A. Lee. Implementing the fast multipole method in three dimensions. Journal of Statistical Physics, 63:1223–1235, 1991.
  • [36] Yue Shi, Pengyu Ren, Michael Schnieders, and Jean-Philip Piquemal. Polarizable Force Fields for Biomolecular Modeling, chapter 2, pages 51–86. John Wiley & Sons, Ltd, 2015.
  • [37] Jiro Shimada, Hiroki Kaneko, and Toshikazu Takada. Performance of fast multipole methods for calculating electrostatic interactions in biomacromolecular simulations. Journal of Computational Chemistry, 15(1):28–43, 1994.
  • [38] Andrew Simmonett, Bernard Brooks, and Thomas Darden. Efficient and scalable electrostatics via spherical grids and treecode summation, September 2022.
  • [39] Andrew C. Simmonett and Bernard R. Brooks. A compression strategy for particle mesh ewald theory. The Journal of Chemical Physics, 154(5):054112, 2021.
  • [40] W. Smith. Point multipoles in the Ewald summation (revisited). CLRC Daresbury Laboratory, 2007.
  • [41] Benjamin Stamm, Louis Lagardère, Étienne Polack, Yvon Maday, and Jean-Philip Piquemal. A coherent derivation of the Ewald summation for arbitrary orders of multipoles: The self-terms. The Journal of Chemical Physics, 149(12):124103, 2018.
  • [42] Toru Takahashi, Cris Cecka, William Fong, and Eric Darve. Optimizing the multipole-to-local operator in the fast multipole method for graphical processing units. International Journal for Numerical Methods in Engineering, 89(1):105–133, 2012.
  • [43] Abdulnour Toukmaji, Celeste Sagui, John Board, and Tom Darden. Efficient particle-mesh Ewald based approach to fixed and induced dipolar interactions. The Journal of Chemical Physics, 113(24):10913–10927, 2000.
  • [44] David Van Der Spoel, Erik Lindahl, Berk Hess, Gerrit Groenhof, Alan E. Mark, and Herman J. C. Berendsen. GROMACS: Fast, flexible, and free. Journal of Computational Chemistry, 26(16):1701–1718, 2005.
  • [45] M. S. Warren and J. K. Salmon. Astrophysical N-body simulations using hierarchical tree data structures. In Proceedings of the 1992 ACM/IEEE Conference on Supercomputing, Supercomputing ’92, pages 570–576, Washington, DC, USA, 1992. IEEE Computer Society Press.
  • [46] Lexing Ying, George Biros, and Denis Zorin. A kernel-independent adaptive fast multipole algorithm in two and three dimensions. Journal of Computational Physics, 196(2):591–626, 2004.