
RLEKF: An Optimizer for Deep Potential with Ab Initio Accuracy

Siyu Hu1,2 (equal contribution), Wentao Zhang3 (equal contribution), Qiuchen Sha1,2, Feng Pan3,
Lin-Wang Wang4, Weile Jia1, Guangming Tan1, Tong Zhao1
Corresponding author
Abstract

It is imperative to accelerate the training of neural network force fields such as Deep Potential, which usually requires thousands of images based on first-principles calculations and a couple of days to generate an accurate potential energy surface. To this end, we propose a novel optimizer named reorganized layer extended Kalman filtering (RLEKF), an optimized version of global extended Kalman filtering (GEKF) with a strategy of splitting big and gathering small layers to overcome the $O(N^2)$ computational cost of GEKF. This strategy approximates the dense weights error covariance matrix of GEKF with a sparse diagonal block matrix. We implement both RLEKF and the baseline Adam in our αDynamics package, and numerical experiments are performed on 13 unbiased datasets. Overall, RLEKF converges faster with slightly better accuracy. For example, a test on a typical system, bulk copper, shows that RLEKF converges faster in both the number of training epochs (×11.67) and wall-clock time (×1.19). Besides, we theoretically prove that the weight updates converge and thus guard against the gradient exploding problem. Experimental results verify that RLEKF is not sensitive to the initialization of weights. RLEKF sheds light on other AI-for-science applications where training a large neural network (with tens of thousands of parameters) is a bottleneck.

Introduction

Ab initio molecular dynamics (AIMD) has been the method of choice for modeling physical phenomena at the microscopic scale, for example, water (Rahman and Stillinger 1971; Stillinger and Rahman 1974), alloys (Wang et al. 2009), nanotubes (Raty, Gygi, and Galli 2005), and even proteins (Karplus and Kuriyan 2005). However, the cubic scaling of first-principles methods (Meier, Laino, and Curioni 2014) has limited both the spatial and temporal scales of AIMD packages to thousands of atoms and picoseconds on modern supercomputers. To overcome the “scaling wall” of AIMD, two types of machine-learned MD (MLMD) methods are adopted. The first is based on classical ML methods. In 1992, Ercolessi and Adams first introduced ML for describing potentials with an accuracy comparable to that obtained by ab initio methods (Ercolessi and Adams 1992, 1994), and to date, methods like ACE (Drautz 2019; Lysogorskiy et al. 2021), SNAP (Thompson et al. 2015), and GPR (Bartók et al. 2010) have been developed and are widely used for physical systems such as copper, silicon, tantalum, and bulk crystals. The second approach is based on neural networks (NNs) and was first introduced in 2007 by Behler and Parrinello (Behler and Parrinello 2007). The neural network MD (NNMD) method approximates both atomic energy ($E_i$) and force ($F_i$) from a local neighboring environment, and the atomic potential energy surface is trained on large amounts of data generated from first-principles calculations. One current state of the art is the Deep Potential (DP) model, which combines a large NN with physical symmetries (translation, rotation, and permutation invariance) to accurately describe the high-dimensional configuration space of interatomic potentials. Although the corresponding package DeePMD-kit can reach 10 billion atoms in model inference when scaling to the top supercomputers (Jia et al. 2020; Guo et al. 2022), the training procedure of an individual model can still take from hours to days and is the bottleneck.

The two most commonly used training methods in NNMD packages are Adam (Kingma and Ba 2014) and stochastic gradient descent (SGD) (Saad 1998), owing to their integration in NN frameworks such as TensorFlow and PyTorch. For example, many NNMD packages such as HDNNP (Behler 2014) and SIMPLE-NN (Lee et al. 2019) adopt Adam for training interatomic potentials. Yet these optimizers have a slow convergence rate in searching for the optimal solution on the landscape and can take up to hundreds of epochs to train one NNMD model with thousands of training samples. Moreover, SGD suffers from gradient exploding without proper control of the learning rate.

Global extended Kalman filtering (GEKF) (Chui, Chen et al. 2017) is a good choice in terms of both convergence and robustness. For example, RuNNer (Behler 2011) adopts GEKF as the optimizer for training a simple three-layer fully connected NN with roughly 1000 parameters. As shown in (Singraber et al. 2019), RuNNer can achieve an energy RMSE of 0.69 meV/atom and a force RMSE of 35.5 meV/Å for the H₂O system. However, since the error covariance matrix $\mathbf{P}$ is updated globally (Fig. 2), GEKF can be computationally expensive when NNs with tens of thousands of parameters are applied.

Our main contribution is the reorganized layer extended Kalman filtering (RLEKF) method, which approximates the dense weights error covariance matrix of GEKF with a sparse diagonal block matrix to reduce the computational cost. Technically, the layers are reorganized by splitting big layers and gathering adjacent small ones. For a fair comparison, both RLEKF and Adam are implemented in our NNMD package αDynamics. Compared to Adam, our testing results show that RLEKF reaches the same or higher accuracy for both force (better than Adam in 7 out of 13 testing cases) and energy (better than Adam in 11 out of 13 testing cases). For a typical copper system, the time-to-solution of RLEKF is ×11.67 and ×1.19 faster in terms of the number of epochs and wall-clock time, respectively. In particular, RLEKF can reduce the number of epochs needed to reach a reasonable accuracy (1.2× the RMSE of the best accuracy possible) to only 2-3. We theoretically prove that the weight updates converge, which protects RLEKF from gradient exploding. Our work also sheds light on other NN-based applications where training NNs with a relatively large number of parameters is a bottleneck.

Figure 1: Overview of the DP NN with RLEKF. (a) Structure of the DP NN. The two bold arrows correspond to the embedding net and the fitting net. (b) Macro-structure of training: feed a sample into the network, compute the gradient of the energy with respect to the weights through backpropagation, and update $\boldsymbol{w},\mathbf{P},\lambda$ with the Kalman filter; then feed the data into the network again to obtain the energy as an intermediate, derive the force from that energy via the formula the arrow points to, compute the gradient of the force with respect to the weights through backpropagation, and use Kalman filtering to update $\boldsymbol{w},\mathbf{P},\lambda$; the same procedure then starts for the next sample. (c) Weights and gradients flow of the DP NN. The gather OP is applied to consecutive layers whose accumulated number of parameters is less than $N_b$, whereas the split OP is applied to a layer whose number of parameters exceeds $N_b$. (In the following experiments, the gather works on the first three layers of the embedding net and the last two layers of the fitting net, while the split works on the first layer of the fitting net.) (d) Error covariance matrix and splitting strategy, as applied in most of the following experiments. For example, with the default block size $N_b$ set to 5120 or 10240 for efficiency, every layer except the first layer of the fitting net has fewer weights than the block size and thus does not need to be split; the first fitting-net layer, containing 20000 weights, is split into 2 parts ($20000=1\times 10240+1\times 9760$) if $N_b=10240$. (e) Updating strategy (Alg. 2).

Related Work

NNMD Packages. Fully connected NNs are the most widely used in NNMD packages. For example, HDNNP (Behler and Parrinello 2007; Behler 2017), a three-layered fully connected NN, is implemented in RuNNer (Behler 2011). Other packages such as BIM-NN (Yao et al. 2017), Simple-NN (Lee et al. 2019), CabanaMD-NNP (Desai, Reeve, and Belak 2022), SPONGE (Huang et al. 2022), and DeePMD-kit (Wang and Weinan 2018) are also implemented with fully connected NNs. Many physical systems such as water and Cu₂S (Singraber et al. 2019), chemical molecules (Yao et al. 2017), SiO₂ (Lee et al. 2019), and organic molecules (C₁₀H₂, C₁₀H₃⁺) (Lysogorskiy et al. 2021) are studied with the packages above.

Recent progress on NNMD packages implemented with graph NNs (GNNs) is gaining momentum. DimeNet++ (Gasteiger, Groß, and Günnemann 2020; Gasteiger et al. 2020) and NequIP (Batzner et al. 2022) have shown great potential in describing organic molecules (the QM7 and QM9 datasets) at high accuracy. We remark that NNMD packages also differ in how they employ physical symmetries for the NN inputs (“features”), which are not discussed here as they are out of the scope of this paper.

Training Methods in NNMD Packages. Nearly all NNMD packages mentioned above use Adam or SGD for training because of their ease of use in NN frameworks. One challenge of these methods is their time-consuming model training. For example, it usually takes more than 100 epochs to systematically train an individual DP model in the DeePMD-kit software.

As an alternative, Kalman filtering (KF) (Kalman et al. 1960; Welch, Bishop et al. 1995) estimates the states of a linear process, described by a state-measurement model with Gaussian noise, using an estimator that is optimal in the sense of minimizing the mean square error of the predicted states given the noisy measurements. It is widely used in autonomous driving, navigation, and interactive computer graphics due to its fast convergence and noise filtering. However, the formulation of KF confines it to linear optimization problems. Extended Kalman filtering (EKF) (Smith, Schmidt, and McGee 1962) was introduced to solve nonlinear optimization problems through Taylor expansion, after which the linearized problem is solved by KF. In practice, EKF has many variants such as NDEKF (Murtuza and Chorian 1994), ONDEKF, LDEKF, and FDEKF. Note that the performance of EKF and its variants is compared in Ref. (Heimes 1998).

We remark that in training an NNMD model with more than tens of thousands of parameters, such as the DP mentioned in this paper, both Adam and EKF (and its currently known variants) are either not effective or not efficient.

Problem Setup and Formulation

DP. The key steps of DP training are shown in Fig. 1(a). For each atom $i$, physical symmetries such as translational, rotational, and permutational invariance are integrated into the descriptor $\mathcal{D}_i$ through a three-layer fully connected network named the embedding net. Then $\mathcal{D}_i$ is fed into a three-layered fitting net.

  1. Every snapshot of the molecular system consists of each atom's 3D Cartesian coordinates $\mathbf{r}_i=(x_i,y_i,z_i)\in\mathbb{R}^3$, $i=1,2,\dots,N_a$, which are translated into the neighbor list of atom $i$, $\mathcal{R}_i=\{\mathbf{r}_{ij}\in\mathbb{R}^3 \mid |\mathbf{r}_j-\mathbf{r}_i|<r_c\}$, as the input of the DP NN. Then we gather it into its smooth version $\tilde{\mathcal{R}}_i\in\mathbb{R}^{N_m\times 4}$, $(\tilde{\mathcal{R}}_i)_j=s(|\mathbf{r}_{ij}|)(1,\mathbf{r}_{ij}/|\mathbf{r}_{ij}|)\in\mathbb{R}^4$, where $s(x)=1/x$ when $x<r_{cs}$, $s(x)=0$ when $x>r_c$, decaying smoothly between the two thresholds; $N_m$ is the maximum length of all neighbor lists and $j$ is the neighbor index of atom $i$.

  2. Define the embedding net $\mathcal{G}_i\in\mathbb{R}^{N_m\times M}$, $\mathcal{G}_i=\mathcal{G}(s(|\mathbf{r}_{i\cdot}|))$, where $M$ is called the symmetry order, $\mathcal{G}=\mathcal{E}_2\circ\mathcal{E}_1\circ\mathcal{E}_0$, $\mathcal{E}_l(X)=X+\tanh(XW_l+\boldsymbol{1}\otimes w_l)$ for $l\in\{1,2\}$, and $\mathcal{E}_0(\mathbf{x})=\tanh(\mathbf{x}\otimes W_0+\boldsymbol{1}\otimes w_0)$, with $\boldsymbol{1}\in\mathbb{R}^{N_m\times 1}$, $w_l\in\mathbb{R}^{1\times M}$, $W_0\in\mathbb{R}^{1\times M}$, $W_l\in\mathbb{R}^{M\times M}$ for $l\in\{1,2\}$, and the functions $s$ and $\tanh$ applied element-wise.

  3. Multiplying $\tilde{\mathcal{R}}_i$ by $\mathcal{G}_i$ yields the so-called descriptor $\mathcal{D}_i:=\mathcal{G}_i^{\textsf{T}}\tilde{\mathcal{R}}_i\tilde{\mathcal{R}}_i^{\textsf{T}}\mathcal{G}_i^{<}\in\mathbb{R}^{M\times M^{<}}$, where $\mathcal{G}_i^{<}$ consists of the first $M^{<}$ columns of $\mathcal{G}_i$ and $M^{<}<M$.

  4. The fitting net is $E_i=\mathcal{F}(\mathcal{D}_i)=\mathcal{F}_3\circ\mathcal{F}_2\circ\mathcal{F}_1\circ\mathcal{F}_0(\mathcal{D}_i)$, where $\mathcal{D}_i$ is reshaped into a vector in $\mathbb{R}^{MM^{<}\times 1}$, $\mathcal{F}_l(\mathbf{x})=\mathbf{x}+\tanh(\tilde{W}_l\mathbf{x}+\tilde{w}_l)$ with $\tilde{w}_l\in\mathbb{R}^{d\times 1}$, $\tilde{W}_l\in\mathbb{R}^{d\times d}$ for $l\in\{1,2\}$, $\mathcal{F}_3(\mathbf{x})=\tilde{W}_3\mathbf{x}+\tilde{w}_3$ with $\tilde{w}_3\in\mathbb{R}$, $\tilde{W}_3\in\mathbb{R}^{1\times d}$, and $\mathcal{F}_0(\mathbf{x})=\tanh(\tilde{W}_0\mathbf{x}+\tilde{w}_0)$ with $\tilde{w}_0\in\mathbb{R}^{d\times 1}$, $\tilde{W}_0\in\mathbb{R}^{d\times MM^{<}}$.

  5. The output is $E:=\sum_i E_i$ and $F_i:=-\nabla_{\mathbf{r}_i}E$.
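To make the five steps above concrete, the following is a minimal NumPy sketch of the DP forward pass on toy data. The cutoff radii, layer widths, random neighbor vectors, and random weights are illustrative assumptions rather than the αDynamics implementation, and the force evaluation $F_i=-\nabla_{\mathbf{r}_i}E$ is omitted since it requires automatic differentiation.

```python
# A minimal NumPy sketch of the DP forward pass (steps 1-5 above) on toy data.
# All sizes, cutoffs, and random weights below are assumed for illustration.
import numpy as np

rng = np.random.default_rng(0)
Na, Nm, M, M_sub, d = 8, 16, 25, 4, 50      # atoms, max neighbors, M, M^<, fitting width
rc, rcs = 6.0, 5.8                          # assumed cutoff radii

def s(x):
    """Smooth switching function: 1/x below rcs, 0 above rc, smooth in between."""
    u = np.clip((x - rcs) / (rc - rcs), 0.0, 1.0)
    return np.where(x < rc, (1.0 / x) * (1 - 3 * u**2 + 2 * u**3), 0.0)

def embedding_net(svals, W, w):
    """G_i = E_2(E_1(E_0(s))), with residual connections on E_1 and E_2."""
    X = np.tanh(svals[:, None] @ W[0] + w[0])          # E_0: (Nm, 1) -> (Nm, M)
    for l in (1, 2):
        X = X + np.tanh(X @ W[l] + w[l])               # E_l with skip connection
    return X

# toy embedding-net and fitting-net weights
W = [rng.normal(size=(1, M)), rng.normal(size=(M, M)), rng.normal(size=(M, M))]
w = [rng.normal(size=(1, M)) for _ in range(3)]
Wf = [rng.normal(size=(d, M * M_sub)), rng.normal(size=(d, d)),
      rng.normal(size=(d, d)), rng.normal(size=(1, d))]
wf = [rng.normal(size=(d,)), rng.normal(size=(d,)), rng.normal(size=(d,)), rng.normal(size=(1,))]

E = 0.0
for i in range(Na):
    rij = rng.uniform(1.0, rc + 1.0, size=(Nm, 3))     # toy neighbor vectors r_ij
    dij = np.linalg.norm(rij, axis=1)
    R_tilde = s(dij)[:, None] * np.hstack([np.ones((Nm, 1)), rij / dij[:, None]])  # (Nm, 4)
    G = embedding_net(s(dij), W, w)                    # (Nm, M)
    D = G.T @ R_tilde @ R_tilde.T @ G[:, :M_sub]       # descriptor, (M, M^<)
    x = np.tanh(Wf[0] @ D.reshape(-1) + wf[0])         # F_0
    for l in (1, 2):
        x = x + np.tanh(Wf[l] @ x + wf[l])             # F_l with skip connection
    E += (Wf[3] @ x + wf[3]).item()                    # F_3: atomic energy E_i
print("toy total energy E =", E)
```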

EKF with Memory Factor for NNs. For neural networks, the model of interest

\begin{cases}\boldsymbol{\theta}_{t}=\boldsymbol{\theta}_{t-1}=\boldsymbol{w},\\ y_{t}=h(\boldsymbol{\theta}_{t},x_{t})+\eta_{t},\quad\eta_{t}\sim\mathcal{N}(0,R_{t}),\end{cases}   (1)

is formulated in stochastic language as an EKF problem targeting $\hat{\boldsymbol{\theta}}_t$, where $\boldsymbol{w}$ is the vector of all trainable parameters in the network $h(\cdot,\cdot)$, $\{(x_t,y_t)\}_{t\in\mathbb{N}}$ are pairs of features and labels, $\{y_t\}_{t\in\mathbb{N}}$ can also be seen as the measurements of the EKF, $\{\eta_t\}_{t\in\mathbb{N}}$ are noise terms subject to normal distributions with mean 0 and variances $\{R_t\}_{t\in\mathbb{N}}$, and $\forall t\in\mathbb{N}$, $\hat{\boldsymbol{\theta}}_{t|t-1}:=\mathbb{E}[\boldsymbol{\theta}_t|y_{t-1},y_{t-2},\dots,y_1]$. With fixed $\boldsymbol{\theta}_t$ and bounded $x_t$, $h(\boldsymbol{\theta}_t,x_t)$ is well approximated by its linearization at $\hat{\boldsymbol{\theta}}_{t|t-1}$, omitting a term of order $\mathcal{O}((\boldsymbol{\theta}_t-\hat{\boldsymbol{\theta}}_{t|t-1})^2)$:

y_{t}\approx h(\hat{\boldsymbol{\theta}}_{t|t-1},x_{t})+\mathbf{H}_{t}^{\textsf{T}}(\boldsymbol{\theta}_{t}-\hat{\boldsymbol{\theta}}_{t|t-1})+\eta_{t},\qquad \mathbf{H}_{t}=\frac{\partial h(\boldsymbol{\theta},x_{t})}{\partial\boldsymbol{\theta}}\bigg|_{\boldsymbol{\theta}=\hat{\boldsymbol{\theta}}_{t|t-1}}.   (2)

Setting $m_t=y_t-h(\hat{\boldsymbol{\theta}}_{t|t-1},x_t)+\mathbf{H}_t^{\textsf{T}}\hat{\boldsymbol{\theta}}_{t|t-1}$ and rewriting (2), we obtain the following KF problem

\boldsymbol{\theta}_{t}=\boldsymbol{\theta}_{t-1}=\boldsymbol{w},   (3)
m_{t}\approx\mathbf{H}_{t}^{\textsf{T}}\boldsymbol{\theta}_{t}+\eta_{t}.   (4)

At the beginning of training, the estimator $\hat{\boldsymbol{\theta}}_{t|t-1}$ is far from $\boldsymbol{w}$, so less attention should be paid to the data fed to the network at an earlier stage of training than to those fed at a later stage. By multiplying by a factor $\alpha_t:=\Pi_{i=1}^{t}\lambda_i^{-1/2}$ with $\alpha_0:=1$, where $0<\lambda_i\leq 1$ and $\lambda_i\rightarrow 1$, the last problem takes the better-behaved form below

\begin{cases}\boldsymbol{\theta}_{t}=\lambda_{t}^{-1/2}\boldsymbol{\theta}_{t-1},\quad\boldsymbol{\theta}_{1}=\boldsymbol{w},\\ \tilde{m}_{t}=\alpha_{t}m_{t}\approx\mathbf{H}_{t}^{\textsf{T}}\boldsymbol{\theta}_{t}+\alpha_{t}\eta_{t}=\mathbf{H}_{t}^{\textsf{T}}\boldsymbol{\theta}_{t}+\tilde{\eta}_{t},\end{cases}   (5)

where $\lambda_t$ is called the memory factor. The greater $\lambda_t$ is, the more weight, or attention, is paid to previous data. According to basic KF theory (Haykin and Haykin 2001), we obtain

\begin{split}\mathbf{a}_{t}&=\lambda_{t}^{-1}\mathbf{H}^{\textsf{T}}_{t}\mathbf{P}_{t-1}\mathbf{H}_{t}+\alpha_{t}^{2}R_{t},\\ \mathbf{K}_{t}&=\lambda_{t}^{-1}\mathbf{P}_{t-1}\mathbf{H}_{t}\mathbf{a}_{t}^{-1},\\ \mathbf{P}_{t}&=\left(\mathbf{I}-\mathbf{K}_{t}\mathbf{H}_{t}^{\textsf{T}}\right)\lambda_{t}^{-1}\mathbf{P}_{t-1},\\ \hat{\boldsymbol{\theta}}_{t}&=\hat{\boldsymbol{\theta}}_{t\mid t-1}+\mathbf{K}_{t}\tilde{\epsilon}_{t},\\ \tilde{\epsilon}_{t}&=\tilde{m}_{t}-\mathbf{H}_{t}^{\textsf{T}}\hat{\boldsymbol{\theta}}_{t\mid t-1}=\alpha_{t}(y_{t}-h(\alpha_{t}^{-1}\hat{\boldsymbol{\theta}}_{t|t-1},x_{t})).\end{split}

Finally, we recover the estimator of $\boldsymbol{w}$ from that of $\boldsymbol{\theta}$ by dividing by the factor $\alpha_t$: defining $\forall t\in\mathbb{N}$, $\hat{\boldsymbol{w}}_{t\mid t-1}:=\alpha_t^{-1}\hat{\boldsymbol{\theta}}_{t\mid t-1}$ and $\boldsymbol{w}_t:=\alpha_t^{-1}\hat{\boldsymbol{\theta}}_t$, we find $\hat{\boldsymbol{w}}_{t\mid t-1}=\boldsymbol{w}_{t-1}$ and then obtain our weights-updating strategy

\begin{split}\epsilon_{t}&=y_{t}-h(\boldsymbol{w}_{t-1},x_{t}),\\ \boldsymbol{w}_{t}&=\boldsymbol{w}_{t-1}+\mathbf{K}_{t}\epsilon_{t}.\end{split}
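As a sanity check of these update equations, the following is a minimal NumPy sketch of EKF with a memory factor on a toy linear model with a scalar measurement; the problem size, the noiseless labels, and the choice $\alpha_t^2 R_t=1$ are assumptions for illustration only.

```python
# One-measurement-at-a-time EKF with memory factor lambda_t on a toy linear
# model h(w, x) = w^T x; the estimate converges to the target weights.
import numpy as np

rng = np.random.default_rng(1)
n = 10                                   # number of trainable weights
w_true = rng.normal(size=n)              # toy target weights
w = np.zeros(n)                          # current estimate w_{t-1}
P = np.eye(n)                            # weights error covariance (up to alpha_t^2)
lam, nu = 0.98, 0.9987                   # lambda_1 and forgetting rate nu

for t in range(2000):
    x = rng.normal(size=n)
    y = w_true @ x                       # noiseless label for the toy problem
    H = x                                # H_t = dh/dw evaluated at w_{t-1}
    a = 1.0 / (lam + H @ P @ H)          # scalar a_t with alpha_t^2 R_t taken as 1
    K = P @ H * a                        # Kalman gain K_t
    eps = y - w @ x                      # innovation epsilon_t
    w = w + K * eps                      # weights update
    P = (P - np.outer(K, H) @ P) / lam   # P_t = (I - K_t H_t^T) P_{t-1} / lambda_t
    lam = lam * nu + 1 - nu              # memory factor lambda_t -> 1

print("final weight error:", np.linalg.norm(w - w_true))
```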
Systems Structure Time Step (fs) # Snapshots Energy (trn) Energy (tst) Force (trn) Force (tst)
Cu FCC 1 1646 0.250/0.451 0.327/0.442 40.6/39.7 45.2/44.4
Si DC 2.5-3.5 3000 0.148/0.165 0.186/0.181 22.3/21.9 24.1/23.4
C Graphene 2-3.5 4000 0.0856/0.133 0.267/0.278 24.2/27.6 34.7/35.5
Ag FCC 2.5-3 2015 0.142/0.159 0.243/0.265 11.3/11.0 12.4/12.5
Mg HCP 0.5-2 4000 0.111/0.160 0.169/0.189 13.0/14.8 16.1/17.1
NaCl FCC 2-3.5 3193 0.0403/0.0631 0.0435/0.0514 6.06/5.78 6.69/6.47
Li BCC,FCC,HCP 0.5 2494 0.0482/0.126 0.312/0.332 22.9/14.0 24.8/26.4
Al FCC 2-3.5 4000 0.359/0.534 0.473/1.24 54.1/56.3 58.6/55.8
MgAlCu Alloy 3 2530 0.165/0.227 0.218/0.233 44.1/37.6 45.2/42.6
H2O Liquid 0.5 4000 0.297/0.584 0.545/1.00 60.8/20.6 68.1/87.6
S S8 3-5 7000 0.210/0.401 0.628/0.628 51.5/45.9 60.4/58.5
CuO FCC 3 1000 2.203/2.47 2.04/3.78 424/414 442/438
Cu+C SA 3-3.2 2000 0.458/1.27 1.07/2.52 176/143 190/195
Table 1: Structures, time steps (the intervals at which snapshots are generated), numbers of snapshots, and the root mean square errors (RMSE) of the training and testing of RLEKF (before slashes) and Adam (after slashes) in terms of energy (meV) and forces (meV/Å) after 30 epochs for various systems. The RMSEs of the energies are normalized by the number of atoms in the system. For all sub-systems, the former 80% and the latter 20% of the snapshot dataset are used for training and testing, respectively. Better results are in bold. Acronyms: FCC (face-centered cubic), BCC (body-centered cubic), HCP (hexagonal close-packed), DC (diamond cubic), SA (surface adsorption).
Algorithm 1 High-level Structure of Training with RLEKF
0:  Input: $\boldsymbol{w}_{0}$, $\mathbf{P}_{0}$, $\lambda_{1}$
1:  for $t=0,1,2,\dots,T-1$ do
2:     $\hat{E}=h_{E}(\boldsymbol{w}_{t},x_{t})$
3:     $\boldsymbol{w}_{+},\mathbf{P}_{+},\lambda_{+}=\text{RLEKF}(\hat{E},E_{\text{DFT}},\boldsymbol{w}_{t},\mathbf{P}_{t},\lambda_{t})$
4:     $\hat{F}=h_{F}(\boldsymbol{w}_{+})$
5:     $\boldsymbol{w}_{t+1},\mathbf{P}_{t+1},\lambda_{t+1}=\text{RLEKF}(\hat{F},F_{\text{DFT}},\boldsymbol{w}_{+},\mathbf{P}_{+},\lambda_{+})$
6:  end for
7:  Output: $\boldsymbol{w}_{T}$, $\mathbf{P}_{T}$, $\lambda_{T}$
Algorithm 2 RLEKF($\hat{Y},Y^{\text{DFT}},\boldsymbol{w}^{in},\mathbf{P}^{in},\lambda^{in}$)
1:  for $i=1,2,\dots,\text{length}(Y^{\text{DFT}})$ do
2:     if $\hat{Y}_{i}\geq Y_{i}^{\text{DFT}}$ then
3:        $\hat{Y}_{i}=-\hat{Y}_{i}$
4:     end if
5:  end for
6:  $\hat{Y}=\frac{\sum_{i}\hat{Y}_{i}}{\text{length}(Y^{\text{DFT}})}$, $errY=\frac{\sum_{i}|Y^{\text{DFT}}_{i}-\hat{Y}_{i}|}{\text{length}(Y^{\text{DFT}})}$
7:  $H=\nabla_{w}\hat{Y}\mid_{w=\boldsymbol{w}^{in}}$
8:  $a=1/(L\lambda^{in}+H^{\textsf{T}}\mathbf{P}^{in}H)$
9:  for $l=1,2,\dots,L$ do
10:     $K=\text{split}(\mathbf{P}^{in},l)\times\text{split}(H,l)$
11:     $\mathbf{P}^{out}_{l}=(1/\lambda^{in})\times(\text{split}(\mathbf{P}^{in},l)-aKK^{\textsf{T}})$
12:     $\boldsymbol{w}^{out}_{l}=\text{split}(\boldsymbol{w}^{in},l)+aK\,errY$
13:  end for
14:  $\lambda^{out}=\lambda^{in}\nu+1-\nu$
15:  $\boldsymbol{w}^{out}=\text{collect}(\boldsymbol{w}^{out}_{l}\mid l=1,\dots,L)$
16:  $\mathbf{P}^{out}=\text{collect}(\mathbf{P}^{out}_{l}\mid l=1,\dots,L)$
17:  Output: $\boldsymbol{w}^{out}$, $\mathbf{P}^{out}$, $\lambda^{out}$
Figure 2: The two subfigures on the left are energy (eV) (above) and force (eV/Å) (below) RMSE boxplots on the test dataset under different weight-splitting strategies. The four subfigures on the right are the weights error covariance matrices corresponding to GEKF, SEKF, LEKF, and RLEKF, in text order.
Figure 3: RLEKF performance contrasted with DFT (ground truth) on every snapshot in terms of energy (meV) and force (meV/Å) for various systems; the closer the dots approach the diagonals, the better the performance. The RMSEs of $E_{\text{tot}}$, $E_{\text{atom}}$, and $F$ on the test set, the structure of a unit cell, and the extensive structure are shown in the corners of each subfigure.

Method

In this section, we introduce RLEKF and its splitting strategy (Fig. 1) and then compare RLEKF with GEKF.

The Workflow of Training with RLEKF. The overview of the DP NN with RLEKF is shown in Fig. 1, and the corresponding training procedure is detailed in Alg. 1. Specifically, Alg. 1 and Fig. 1.b show the training of RLEKF for both energy (Alg. 1, lines 2 & 3) and force (Alg. 1, lines 4 & 5). The forward prediction is shown in Alg. 1 and Fig. 1.c, the backward propagation in Alg. 2 and Fig. 1.c, and the updating of weights based on Kalman filtering with a forgetting factor in Alg. 2 (lines 8 to 14) and Fig. 1.e. Alg. 1 shows the iterative update of $\boldsymbol{w}_t$ together with $\mathbf{P}_t,\lambda_t$ (also shown in Fig. 1.b). After the initialization of $\boldsymbol{w}_0$, $\mathbf{P}_0$, and $\lambda_1$, the energy process starts and fits the predicted energy to the energy label of a sample; the force process then updates the weights under the supervision of the sample's force label. In the force process, for each image we randomly choose $x$ atoms and concatenate their network-predicted force vectors into an effective predicted force $\hat{F}$, which is then used to update the weights; this process (i.e., lines 4 & 5 of Alg. 1) is repeated $y$ times in a loop. Here, we recommend the relatively universal setting $x=6$, $y=4$. After these two kinds of alternating weight updates move in the direction of minimizing the trace of $\mathbf{P}_t$ for this sample, the same process repeats for the next sample until time step $T$. The RLEKF routine invoked in lines 3 & 5 of Alg. 1 is unfolded below.
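A schematic, runnable sketch of this alternating energy/force control flow is given below. The linear toy model, the random labels, and the simplified rlekf_update stub (a single global Kalman step standing in for Alg. 2) are assumptions used only to illustrate the loop structure with $x=6$ and $y=4$.

```python
# Alternating energy/force weight updates per sample, mirroring Alg. 1 and
# Fig. 1(b); all models and labels are random toys, not alphaDynamics code.
import numpy as np

rng = np.random.default_rng(2)
n_w, n_atoms, x_atoms, y_repeats = 20, 32, 6, 4      # x = 6, y = 4 as recommended

def rlekf_update(y_hat, y_dft, w, P, lam, H, nu=0.9987):
    """Stand-in for Alg. 2: one global Kalman step on the averaged error."""
    err = float(np.mean(np.atleast_1d(y_dft) - np.atleast_1d(y_hat)))
    a = 1.0 / (lam + H @ P @ H)
    K = P @ H * a
    return w + K * err, (P - np.outer(K, H) @ P) / lam, lam * nu + 1 - nu

w, P, lam = np.zeros(n_w), np.eye(n_w), 0.98
for sample in range(100):                            # one "image" per iteration
    feat_E = rng.normal(size=n_w)                    # toy per-image energy feature
    w, P, lam = rlekf_update(w @ feat_E, rng.normal(), w, P, lam, feat_E)
    for _ in range(y_repeats):                       # force process, repeated y times
        idx = rng.choice(n_atoms, size=x_atoms, replace=False)   # x chosen atoms
        # (their concatenated forces would form F_hat in the real code)
        feat_F = rng.normal(size=n_w)                # toy feature for those atoms
        w, P, lam = rlekf_update(w @ feat_F, rng.normal(), w, P, lam, feat_F)
```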

RLEKF. Alg. 2 shows the layerwise weights-updating strategy of RLEKF, which is an optimized version of EKF for training the DP model. From line 1 to line 7, we transform any vector $\hat{Y}$ into a number, which has two advantages. First, it averages all the updating information in $Y^{\text{DFT}}_i-\hat{Y}_i$ and turns a vector into a scalar, which avoids an otherwise intractable problem, namely inverting the overwhelmingly many large matrices introduced later. Second, once Alg. 2 is invoked, the gradient needs to be calculated only once. More specifically, lines 1 to 7 adjust the gradient of $\hat{Y}_i$ with respect to the weights so that it points in the direction of decreasing the difference between $\hat{Y}_i$ and $Y^{\text{DFT}}_i$. To keep $\mathbf{P}^{out}$ strictly symmetric whenever $\mathbf{P}^{in}$ is, the definition of the Kalman gain in Alg. 2 differs slightly from that of traditional Kalman filtering. In addition, $\mathbf{a}$ is usually a matrix, but in line 8 it degenerates into a number, which greatly reduces the computational complexity; here $L$ is the number of blocks in $\mathbf{P}^{in}$, and we take $\alpha_t^2 R_t=L\mathbf{I}$ as a suitable hyperparameter. RLEKF independently updates the weights and $\mathbf{P}_t$ of different layers by KF theory and its splitting strategy introduced in the next subsection. Finally, all updated parameters are collected for the prediction of the next turn (Fig. 1.e, Alg. 2, lines 8 to 16, where $\nu$ is the forgetting rate, a hyperparameter describing the varying rate of $\lambda_t$, and split is a function returning the $l$-th part after splitting).
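The following is a minimal NumPy sketch of Alg. 2, assuming the layer blocks have already been formed by the splitting strategy of the next subsection and that the gradient $H$ of the averaged prediction is supplied by backpropagation; variable names follow the pseudocode, and the toy usage at the end is purely illustrative.

```python
# Blockwise RLEKF update (Alg. 2): sign flip and averaging (lines 1-6), scalar
# a over the block-diagonal P (line 8), then independent per-block updates of
# P_l and w_l (lines 10-12) and the memory-factor update (line 14).
import numpy as np

def rlekf(Y_hat, Y_dft, w_blocks, P_blocks, lam, H_blocks, nu=0.9987):
    Y_hat = np.asarray(Y_hat, dtype=float).copy()
    Y_dft = np.asarray(Y_dft, dtype=float)
    Y_hat[Y_hat >= Y_dft] *= -1.0                  # lines 1-5: flip signs
    err = np.mean(np.abs(Y_dft - Y_hat))           # line 6: averaged error errY
    L = len(P_blocks)
    P_H = [P @ H for P, H in zip(P_blocks, H_blocks)]
    a = 1.0 / (L * lam + sum(H @ PH for H, PH in zip(H_blocks, P_H)))   # line 8
    new_w, new_P = [], []
    for wl, P, K in zip(w_blocks, P_blocks, P_H):  # K = P_l H_l (line 10)
        new_P.append((P - a * np.outer(K, K)) / lam)                    # line 11
        new_w.append(wl + a * K * err)                                  # line 12
    lam = lam * nu + 1 - nu                                             # line 14
    return new_w, new_P, lam                       # lines 15-16: collected blocks

# toy usage: two blocks of sizes 3 and 2, with assumed gradients
w_blocks = [np.zeros(3), np.zeros(2)]
P_blocks = [np.eye(3), np.eye(2)]
H_blocks = [0.1 * np.ones(3), 0.1 * np.ones(2)]
w_new, P_new, lam_new = rlekf([0.1, -0.2], [0.0, 0.0], w_blocks, P_blocks, 0.98, H_blocks)
print("updated lambda:", lam_new)
```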

The Splitting Strategy of RLEKF. With its splitting strategy, RLEKF approximates GEKF with a reduced weights error covariance matrix, trading exactness for the efficiency and stability required by large-scale applications. Unlike in GEKF, the weights error covariance matrix at every time step is, up to a scalar factor, $\mathbf{P}=\{P_1,\dots,P_L\}$, an $N\times N$ block diagonal matrix with blocks of shape $\{n_1\times n_1,n_2\times n_2,\dots,n_L\times n_L\}$, where $n_i$ is the number of weights of the $i$th block and $N:=\sum_i n_i$ is that of the whole NN. This means some weights error covariances are forced to 0, so $\mathbf{P}$ is no longer the real weights error covariance matrix but at most an approximation; fortunately, there exist good enough approximations that carry most of the "big values" and information in the diagonal blocks on which $\mathbf{P}$ concentrates. As shown in Fig. 2, GEKF accounts for correlations between all parameters. However, it is computationally expensive when applied to large NNs. As a seemingly good choice, SEKF is a load-balanced approximation of GEKF, but it inappropriately ignores the heavy correlations between parameters within a layer. Overcoming this drawback, LEKF fully accounts for the internal correlation between parameters in the same layer, yet the computation is still unbearable if a layer contains tens of thousands of parameters, and decoupling every small neighboring layer deviates from the load-balance idea. Here, we adopt two heuristic principles for choosing a suitable strategy:

  1. Put weights with high correlation (in the same layer) in the same block if possible.

  2. The block must not be too large, since that burdens computers with too much computation (split a layer if its number of parameters exceeds the threshold $N_b$), nor too small, since that wastes computational power and loses massive information (gather neighboring small layers). The threshold $N_b$ aims at splitting the matrix $\mathbf{P}$ as evenly as possible for load balance and linear computational complexity.

Therefore, following these principles, we derive the splitting strategy of RLEKF (Fig. 1.d) and reorganize the weights of the original layers, called parameter parts, into several more appropriate layers. The trainable parameters are decomposed into a series of $l$ parts of sizes $p_1,p_2,\dots,p_l$. Starting from the first part of size $p_1$, we execute the commands below repeatedly until $z=l$ (a code sketch of the reorganization follows after this list):

  1. For a given part with $p_s$ weights, split it into $\lfloor\frac{p_s}{N_b}\rfloor$ layers of size $N_b$ and a new part of size $p_s\leftarrow p_s-N_b\lfloor\frac{p_s}{N_b}\rfloor$.

  2. Then consider the subsequent parts with $p_s,p_{s+1},\dots,p_z$ weights; gather them together into one layer of size $\sum_{i=s}^{z}p_i$ and set $s\leftarrow z+1$, if $\sum_{i=s}^{z}p_i\leq N_b$ and ($\sum_{i=s}^{z+1}p_i>N_b$ or $z=l$).

Empirically, Fig. 2 validates the efficiency of RLEKF, which is comparably or even more accurate than GEKF.
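For concreteness, the following is a small sketch of this split-big/gather-small reorganization; it follows the two rules above literally, and the exact tie-breaking in αDynamics may differ. The part sizes in the second example are assumed toy values.

```python
# Reorganize per-layer parameter counts p_1..p_l into blocks no larger than Nb:
# split any part above Nb into full-size blocks, then gather small remainders
# and subsequent small parts into one block while their sum stays <= Nb.
def reorganize(part_sizes, Nb):
    blocks, buffer = [], 0
    for p in part_sizes:
        if p > Nb:
            if buffer:                   # close the pending gathered block first
                blocks.append(buffer)
                buffer = 0
            while p > Nb:                # rule 1: peel off full-size blocks
                blocks.append(Nb)
                p -= Nb
        if buffer + p > Nb:              # rule 2: gather while the sum fits
            blocks.append(buffer)
            buffer = p
        else:
            buffer += p
    if buffer:
        blocks.append(buffer)
    return blocks

# the Fig. 1(d) example: a 20000-weight fitting-net layer with Nb = 10240
print(reorganize([20000], 10240))        # -> [10240, 9760]
# gathering example with assumed small part sizes
print(reorganize([50, 650, 650], 5120))  # -> [1350]
```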

Theoretical Analysis

In this section, we prove the convergence of the weight updates and the stability of the training process (avoidance of gradient exploding). For simplicity, we analyze the GEKF case, which has the same asymptotic features as RLEKF (Witkoskie and Doren 2005).

Theorem 1

In the EKF problem (5), setting $\alpha_t^2 R_t=L\mathbf{I}$ and assuming the components of $\mathbf{H}_i$ are independent and identically distributed with mean 0 and variance $\sigma^2$, we have

\epsilon_{t}\mathbf{K}_{t}\sim\mathcal{O}\left(\frac{1}{t}\right)

with probability arbitrarily close to 1 as $t\to\infty$, which implies the convergence of the weight updates and thus that the algorithm avoids gradient exploding.

The following is a brief proof of Theorem 1. Based on basic KF theory, we obtain

\begin{split}\mathbf{P}_{t}&=\lambda_{t}^{-1}\mathbf{P}_{t-1}-\lambda_{t}^{-2}\mathbf{P}_{t-1}\mathbf{H}_{t}\mathbf{a}_{t}^{-1}\mathbf{H}_{t}^{\textsf{T}}\mathbf{P}_{t-1}L^{-1},\\ \mathbf{a}_{t}&=\lambda_{t}^{-1}\mathbf{H}^{\textsf{T}}_{t}\mathbf{P}_{t-1}\mathbf{H}_{t}L^{-1}+1=\mathbb{E}[\epsilon_{t}^{2}]\alpha_{t}^{2},\\ \lambda_{t}&=1-(1-\lambda_{1})\nu^{t-1},\\ \mathbf{P}_{t}&=\mathbb{E}[\tilde{\boldsymbol{w}}_{t}\tilde{\boldsymbol{w}}_{t}^{\textsf{T}}]\alpha_{t}^{2},\end{split}

and

\mathbf{P}_{t}^{-1}=\lambda_{t}\mathbf{P}^{-1}_{t-1}+\mathbf{H}_{t}\mathbf{H}_{t}^{\textsf{T}}L^{-1}

by using the Woodbury matrix identity, where $\tilde{\boldsymbol{w}}_t=\boldsymbol{w}-\boldsymbol{w}_t$ and the weights error covariance matrix is $\mathbb{E}[\tilde{\boldsymbol{w}}_t\tilde{\boldsymbol{w}}_t^{\textsf{T}}]=(\mathbf{P}_0^{-1}+\sum_{i=1}^{t}\alpha_i^2\mathbf{H}_i\mathbf{H}_i^{\textsf{T}}L^{-1})^{-1}$. In the following experiments, we assume the components of $\mathbf{H}_i$ are independent and identically distributed with mean 0 and variance $\sigma^2$. So the covariance matrix of $\mathbf{H}_i$ is $\sigma^2\mathbf{I}$ and $\mathbf{P}_0=\mathbf{I}$. Hence

\mathbb{E}[\mathbf{P}_{t}^{-1}]=\mathbf{I}+\sum_{k=1}^{t}\alpha_{k}^{2}\alpha_{t}^{-2}L^{-1}\sigma^{2}\mathbf{I}.

Using the $q$-Pochhammer symbol, we find that

\lim_{t\to\infty}\alpha_{t}^{2}=\Pi_{i=0}^{\infty}(1-(1-\lambda_{1})\nu^{i})=((1-\lambda_{1});\nu)_{\infty}=\alpha

exists. Therefore, $S(t):=\sum_{k=1}^{t}\alpha_{k}^{2}\alpha_{t}^{-2}$ is of order $\mathcal{O}(t)$. According to the law of large numbers, we get

\lim_{t\to\infty}\frac{\mathbf{P}_{t}^{-1}}{S(t)}\xrightarrow{a.s.}\sigma^{2}L^{-1}\mathbf{I},

i.e.

\lim_{t\to\infty}\mathbf{P}_{t}\stackrel{a.s.}{=}\lim_{t\to\infty}\frac{L\mathbf{I}}{\sigma^{2}S(t)}.

Hence,

\mathbf{K}_{t}\stackrel{a.s.}{\sim}\mathcal{O}\left(\frac{1}{t}\right),

if $t$ is large enough. Further, $\epsilon_t$ is bounded with probability

\mathbb{P}(|\epsilon_{t}|\leq B)\geq 1-\frac{\mathbb{E}[\epsilon_{t}^{2}]}{B^{2}}=1-\frac{\alpha_{t}^{-2}(\lambda_{t}^{-1}\mathbf{H}^{\textsf{T}}_{t}\mathbf{P}_{t-1}\mathbf{H}_{t}+1)}{B^{2}},

arbitrarily close to 1, where Markov's inequality is used, provided the initialization $\boldsymbol{w}_0$ is close enough to some local minimum $\boldsymbol{w}^{*}$ of the landscape for the Taylor approximation (1) to hold. Therefore,

\epsilon_{t}\mathbf{K}_{t}\sim\mathcal{O}\left(\frac{1}{t}\right)

with probability arbitrarily close to 1.
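As an informal numerical check of Theorem 1 (not part of the proof), the short simulation below iterates the covariance and gain recursions with i.i.d. Gaussian gradients; the dimension and σ are assumed toy values, and $t\,\lVert\mathbf{K}_t\rVert$ should approach a constant, consistent with $\lVert\mathbf{K}_t\rVert\sim\mathcal{O}(1/t)$.

```python
# Monte-Carlo check of the 1/t decay of the Kalman gain under i.i.d. H_t.
import numpy as np

rng = np.random.default_rng(3)
n, sigma, lam, nu, L = 50, 1.0, 0.98, 0.9987, 1
P = np.eye(n)
for t in range(1, 5001):
    H = rng.normal(0.0, sigma, size=n)          # i.i.d. components, mean 0, var sigma^2
    a = 1.0 / (L * lam + H @ P @ H)             # scalar a_t (Alg. 2, line 8)
    K = P @ H * a                               # Kalman gain K_t
    P = (P - a * np.outer(K, K)) / lam          # covariance update (Alg. 2, line 11)
    lam = lam * nu + 1 - nu                     # lambda_t = 1 - (1 - lambda_1) nu^(t-1)
    if t in (100, 1000, 5000):
        print(t, t * np.linalg.norm(K))         # tends toward ~sqrt(n)/sigma
```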

Figure 4: Speed ratio of RLEKF reaching an accuracy baseline relative to Adam in terms of energy and force, measured by epochs, iterations (weight updates), and wall-clock time, where 1000 means Adam cannot reach the baseline. The baseline is 1.2× the lower RMSE between Adam and RLEKF on the test datasets in Tab. 1.
Figure 5: Convergence of the RMSE for bulk copper. The boxplots show the relationship between the test RMSE and the training epoch of Adam. For the energy case, each box at $x$ represents the statistics (median, first quartile, and third quartile) of the RMSE between epoch $(x-50)$ and epoch $(x+50)$ (replacing 50 with 20 for force). The RMSEs of RLEKF after several epochs are shown as lines. For the remaining systems, refer to the supplementary material.
Figure 6: Accuracy (energy and force RMSE on a test dataset) of Cu models with different skip-connection weight initializations $\mathcal{N}(\mu,\sigma^2)$. The red star represents the setup adopted in the above experiments, to whose results Fig. 1 refers. The 3 blue dots and the red star show similar accuracy; the 4 green dots show lower precision but are still very reasonable.

Experiments and Results

Bulk systems are challenging AIMD tasks due to their extensiveness (periodic boundary conditions) and complexity (many different phases and atomic components). Our experiments are conducted on several representative and challenging bulk systems: simple metals (Cu, Ag, Mg, Al, Li), an alloy (MgAlCu), nonmetals (S, C), a semiconductor (Si), a simple compound (H₂O), an electrolyte (NaCl), and some challenging systems (CuO, Cu+C). The effectiveness of RLEKF is shown by a comparison with the SOTA benchmark Adam in terms of the RMSEs of the predicted total energy of the whole system $E_{\text{tot}}$ and the forces $\{(F_{x,i},F_{y,i},F_{z,i})\mid i=1,\dots,n\}$, where $x,y,z$ correspond to the directions of the Cartesian coordinate system and $i$ is the index of the atom (Tab. 1).

Experiment Setting. We set $\lambda_1=0.98$, $\mathbf{P}_0=\mathbf{I}$, $\nu=0.9987$, and $\boldsymbol{w}_0$ consistent with DeePMD-kit (Wang and Weinan 2018). The network configuration is [1, 25, 25, 25] (embedding net) and [400, 50, 50, 50, 1] (fitting net).
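Gathered into one place, a hypothetical configuration for these settings might look as follows; the key names are illustrative and do not reflect the actual αDynamics input format.

```python
# Illustrative hyperparameter collection for the experiments above.
config = {
    "embedding_net": [1, 25, 25, 25],      # embedding-net layer widths
    "fitting_net": [400, 50, 50, 50, 1],   # fitting-net layer widths
    "lambda_1": 0.98,                      # initial memory factor
    "nu": 0.9987,                          # forgetting rate
    "P0": "identity",                      # initial error covariance P_0 = I
    "block_size_Nb": 10240,                # splitting threshold (Fig. 1(d))
    "force_atoms_x": 6,                    # atoms per force update
    "force_repeats_y": 4,                  # force updates per image
}
```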

Data Description. We choose 13 unbiased and representative datasets of the aforementioned systems with specific structures (second column of Tab. 1). For each dataset, snapshots are generated by solving ab initio molecular dynamics trajectories via PWmat (Jia et al. 2013). During this process, to enlarge the sampling span of the configuration space, we rapidly generate a long sequence of snapshots with a small time step (third column of Tab. 1) and keep one snapshot out of every fixed number of them, at a temperature of 300 K.

Main Results. Both RLEKF and Adam yield good results except for the CuO and Cu+C systems (Tab. 1). From an accuracy perspective, RLEKF reaches higher energy accuracy in 12 cases, the exception being Si, where it is slightly less accurate than Adam. As for force, RLEKF is more accurate than Adam in 7 cases, while in the remainder it achieves very comparable precision. Furthermore, Fig. 3 shows how far the fitting results of RLEKF deviate from those of DFT (ground truth). From a speed perspective, RLEKF generally converges to a reasonable RMSE much faster than Adam in most of the 13 cases in terms of energy (Fig. 4). We take bulk copper as an example to demonstrate how to read Fig. 4. The lower energy and force RMSEs between Adam and RLEKF are 0.327 meV and 44.4 meV/Å. Therefore, the 1.2× energy baseline is 0.327×1.2 = 0.392 meV; it is unreachable for Adam, which is denoted as 1000 (Fig. 4). The 1.2× force baseline is 44.4×1.2 = 52.6 meV/Å. Adam spends 35 epochs to reach this baseline while RLEKF costs only 3 epochs (Fig. 5). The force speed ratio is therefore 35/3 = 11.67 in terms of epochs. For each sample, Adam updates the weights 1 time and RLEKF updates them 7 times (1 time in the energy process and 6 times in the force process), so the iteration speed ratio is 35/(3×7) = 1.66. On a Tesla V100, the time cost of each epoch is 60 s (Adam) and 587 s (RLEKF). Thus, the wall-clock time speed ratio is 35×60/(3×587) = 1.19.
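The bulk-copper arithmetic above can be reproduced with the following short calculation; all numbers come from the text.

```python
# Speed ratios for bulk copper, using the epoch counts and per-epoch times above.
adam_epochs, rlekf_epochs = 35, 3
adam_updates, rlekf_updates = 1, 7          # updates per sample: 1 vs 1 energy + 6 force
adam_epoch_sec, rlekf_epoch_sec = 60, 587   # Tesla V100 wall-clock time per epoch

print("epoch ratio:     ", adam_epochs / rlekf_epochs)                       # ~11.67
print("iteration ratio: ", adam_epochs * adam_updates
      / (rlekf_epochs * rlekf_updates))                                      # ~1.66
print("wall-clock ratio:", adam_epochs * adam_epoch_sec
      / (rlekf_epochs * rlekf_epoch_sec))                                    # ~1.19
```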

Robustness Analysis: Distribution of Hyperparameters for Weight Initialization. RLEKF is very stable as an optimizer; it keeps NNs from gradient exploding and consequently imposes very loose initialization constraints, almost none at all, as shown in Fig. 6.

Computational Complexity. RLEKF also benefits computationally by reducing the amount of computation compared with GEKF. There are 3 computationally intensive parts in Alg. 2: calculating $a$ (line 8), $K$ (line 10), and $P^{out}$ (line 11). Due to the even splitting strategy, the number of floating-point operations for each block is of order $\mathcal{O}(N_b^2)$, and the number of blocks is of order $\mathcal{O}(N/N_b)$. Hence the total computational complexity of RLEKF is of order $\mathcal{O}((N/N_b)N_b^2)=\mathcal{O}(NN_b)$.
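As a rough back-of-the-envelope comparison (with an assumed total parameter count $N$, since the exact value depends on the network configuration), the block-diagonal update touches far fewer covariance entries than the global one:

```python
# Per-update covariance work: GEKF touches N x N entries, RLEKF only the
# diagonal blocks, i.e. roughly (N / Nb) blocks of Nb x Nb entries each.
N, Nb = 30_000, 10_240                     # assumed parameter count and block size
gekf_entries = N**2                        # O(N^2)
rlekf_entries = -(-N // Nb) * Nb**2        # ceil(N / Nb) blocks -> O(N * Nb)
print(f"GEKF ~ {gekf_entries:.2e}, RLEKF ~ {rlekf_entries:.2e}, "
      f"ratio ~ {gekf_entries / rlekf_entries:.1f}x")
```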

Conclusions

We proposed the optimizer RLEKF for the DP NN and tested it on several typical bulk systems (simple metals, insulators, and semiconductors) with diverse structures (FCC, BCC, HCP). RLEKF beats the SOTA benchmark Adam 11-1 (7-6) in precision and 12-1 (7-6) in wall-clock time speed in terms of energy (force). Besides, the convergence of the weight updates is proved theoretically, and RLEKF is robust to weight initialization. To sum up, RLEKF is an accurate, efficient, and stable optimizer, which paves another path toward training general large NNs.

Acknowledgments

This work is supported by the following funding: National Key Research and Development Program of China (2021YFB0300600), National Science Foundation of China (T2125013, 62032023, 61972377), CAS Project for Young Scientists in Basic Research (YSBR-005), Network Information Project of Chinese Academy of Sciences (CASWX2021SF-0103), the Key Research Program of the Chinese Academy of Sciences (grant No. ZDBS-SSW-WHC002), Soft Science Research Project of Guangdong Province (No. 2017B030301013), and Huawei Technologies Co., Ltd. We thank Dr. Haibo Li for helpful discussions.

References

  • Bartók et al. (2010) Bartók, A. P.; Payne, M. C.; Kondor, R.; and Csányi, G. 2010. Gaussian approximation potentials: The accuracy of quantum mechanics, without the electrons. Physical review letters, 104(13): 136403.
  • Batzner et al. (2022) Batzner, S.; Musaelian, A.; Sun, L.; Geiger, M.; Mailoa, J. P.; Kornbluth, M.; Molinari, N.; Smidt, T. E.; and Kozinsky, B. 2022. E(3)-equivariant graph neural networks for data-efficient and accurate interatomic potentials. Nature Communications, 13(1): 2453.
  • Behler (2011) Behler, J. 2011. Atom-centered symmetry functions for constructing high-dimensional neural network potentials. The Journal of Chemical Physics, 134(7): 074106.
  • Behler (2014) Behler, J. 2014. Representing potential energy surfaces by high-dimensional neural network potentials. Journal of Physics: Condensed Matter, 26(18): 183001.
  • Behler (2017) Behler, J. 2017. First principles neural network potentials for reactive simulations of large molecular and condensed systems. Angewandte Chemie International Edition, 56(42): 12828–12840.
  • Behler and Parrinello (2007) Behler, J.; and Parrinello, M. 2007. Generalized Neural-Network Representation of High-Dimensional Potential-Energy Surfaces. Phys. Rev. Lett., 98: 146401.
  • Chui, Chen et al. (2017) Chui, C. K.; Chen, G.; et al. 2017. Kalman filtering. Springer.
  • Desai, Reeve, and Belak (2022) Desai, S.; Reeve, S. T.; and Belak, J. F. 2022. Implementing a neural network interatomic model with performance portability for emerging exascale architectures. Computer Physics Communications, 270: 108156.
  • Drautz (2019) Drautz, R. 2019. Atomic cluster expansion for accurate and transferable interatomic potentials. Phys. Rev. B, 99: 014104.
  • Ercolessi and Adams (1992) Ercolessi, F.; and Adams, J. B. 1992. Interatomic potentials from first-principles calculations. MRS Online Proceedings Library (OPL), 291.
  • Ercolessi and Adams (1994) Ercolessi, F.; and Adams, J. B. 1994. Interatomic potentials from first-principles calculations: the force-matching method. EPL (Europhysics Letters), 26(8): 583.
  • Gasteiger et al. (2020) Gasteiger, J.; Giri, S.; Margraf, J. T.; and Günnemann, S. 2020. Fast and Uncertainty-Aware Directional Message Passing for Non-Equilibrium Molecules. In Machine Learning for Molecules Workshop, NeurIPS.
  • Gasteiger, Groß, and Günnemann (2020) Gasteiger, J.; Groß, J.; and Günnemann, S. 2020. Directional Message Passing for Molecular Graphs. In International Conference on Learning Representations (ICLR).
  • Guo et al. (2022) Guo, Z.; Lu, D.; Yan, Y.; Hu, S.; Liu, R.; Tan, G.; Sun, N.; Jiang, W.; Liu, L.; Chen, Y.; Zhang, L.; Chen, M.; Wang, H.; and Jia, W. 2022. Extending the Limit of Molecular Dynamics with Ab Initio Accuracy to 10 Billion Atoms. In Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’22, 205–218. New York, NY, USA: Association for Computing Machinery. ISBN 9781450392044.
  • Haykin and Haykin (2001) Haykin, S. S.; and Haykin, S. S. 2001. Kalman filtering and neural networks, volume 284. Wiley Online Library.
  • Heimes (1998) Heimes, F. 1998. Extended Kalman filter neural network training: experimental results and algorithm improvements. In SMC’98 Conference Proceedings. 1998 IEEE International Conference on Systems, Man, and Cybernetics (Cat. No.98CH36218), volume 2, 1639–1644 vol.2.
  • Huang et al. (2022) Huang, Y.-P.; Xia, Y.; Yang, L.; Wei, J.; Yang, Y. I.; and Gao, Y. Q. 2022. SPONGE: A GPU-Accelerated Molecular Dynamics Package with Enhanced Sampling and AI-Driven Algorithms. Chinese Journal of Chemistry, 40(1): 160–168.
  • Jia et al. (2013) Jia, W.; Fu, J.; Cao, Z.; Wang, L.; Chi, X.; Gao, W.; and Wang, L.-W. 2013. Fast plane wave density functional theory molecular dynamics calculations on multi-GPU machines. Journal of Computational Physics, 251: 102–115.
  • Jia et al. (2020) Jia, W.; Wang, H.; Chen, M.; Lu, D.; Lin, L.; Car, R.; Weinan, E.; and Zhang, L. 2020. Pushing the Limit of Molecular Dynamics with Ab Initio Accuracy to 100 Million Atoms with Machine Learning. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, 1–14.
  • Kalman et al. (1960) Kalman, R. E.; et al. 1960. Contributions to the theory of optimal control. Bol. soc. mat. mexicana, 5(2): 102–119.
  • Karplus and Kuriyan (2005) Karplus, M.; and Kuriyan, J. 2005. Molecular dynamics and protein function. Proceedings of the National Academy of Sciences, 102(19): 6679–6685.
  • Kingma and Ba (2014) Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Lee et al. (2019) Lee, K.; Yoo, D.; Jeong, W.; and Han, S. 2019. SIMPLE-NN: An efficient package for training and executing neural-network interatomic potentials. Computer Physics Communications, 242: 95–103.
  • Lysogorskiy et al. (2021) Lysogorskiy, Y.; Oord, C. v. d.; Bochkarev, A.; Menon, S.; Rinaldi, M.; Hammerschmidt, T.; Mrovec, M.; Thompson, A.; Csányi, G.; Ortner, C.; and Drautz, R. 2021. Performant implementation of the atomic cluster expansion (PACE) and application to copper and silicon. npj Computational Materials, 7(1): 97.
  • Meier, Laino, and Curioni (2014) Meier, K.; Laino, T.; and Curioni, A. 2014. Solid-state electrolytes: revealing the mechanisms of Li-ion conduction in tetragonal and cubic LLZO by first-principles calculations. The Journal of Physical Chemistry C, 118(13): 6668–6679.
  • Murtuza and Chorian (1994) Murtuza, S.; and Chorian, S. 1994. Node decoupled extended Kalman filter based learning algorithm for neural networks. In Proceedings of 1994 9th IEEE International Symposium on Intelligent Control, 364–369.
  • Rahman and Stillinger (1971) Rahman, A.; and Stillinger, F. H. 1971. Molecular dynamics study of liquid water. The Journal of Chemical Physics, 55(7): 3336–3359.
  • Raty, Gygi, and Galli (2005) Raty, J.-Y.; Gygi, F.; and Galli, G. 2005. Growth of carbon nanotubes on metal nanoparticles: a microscopic mechanism from ab initio molecular dynamics simulations. Physical review letters, 95(9): 096103.
  • Saad (1998) Saad, D. 1998. Online algorithms and stochastic approximations. Online Learning, 5(3): 6.
  • Singraber et al. (2019) Singraber, A.; Morawietz, T.; Behler, J.; and Dellago, C. 2019. Parallel multistream training of high-dimensional neural network potentials. Journal of chemical theory and computation, 15(5): 3075–3092.
  • Smith, Schmidt, and McGee (1962) Smith, G. L.; Schmidt, S. F.; and McGee, L. A. 1962. Application of statistical filter theory to the optimal estimation of position and velocity on board a circumlunar vehicle. National Aeronautics and Space Administration.
  • Stillinger and Rahman (1974) Stillinger, F. H.; and Rahman, A. 1974. Improved simulation of liquid water by molecular dynamics. The Journal of Chemical Physics, 60(4): 1545–1557.
  • Thompson et al. (2015) Thompson, A.; Swiler, L.; Trott, C.; Foiles, S.; and Tucker, G. 2015. Spectral neighbor analysis method for automated generation of quantum-accurate interatomic potentials. Journal of Computational Physics, 285: 316–330.
  • Wang and Weinan (2018) Wang, H.; Zhang, L.; Han, J.; and Weinan, E. 2018. DeePMD-kit: A deep learning package for many-body potential energy representation and molecular dynamics. Computer Physics Communications, 228: 178–184.
  • Wang et al. (2009) Wang, S.; Kramer, M.; Xu, M.; Wu, S.; Hao, S.; Sordelet, D.; Ho, K.; and Wang, C. 2009. Experimental and ab initio molecular dynamics simulation studies of liquid Al60Cu40 alloy. Physical Review B, 79(14): 144205.
  • Welch, Bishop et al. (1995) Welch, G.; Bishop, G.; et al. 1995. An introduction to the Kalman filter.
  • Witkoskie and Doren (2005) Witkoskie, J. B.; and Doren, D. J. 2005. Neural Network Models of Potential Energy Surfaces: Prototypical Examples. Journal of Chemical Theory and Computation, 1(1): 14–23. PMID: 26641111.
  • Yao et al. (2017) Yao, K.; Herr, J. E.; Brown, S. N.; and Parkhill, J. 2017. Intrinsic Bond Energies from a Bonds-in-Molecules Neural Network. The Journal of Physical Chemistry Letters, 8(12): 2689–2694. PMID: 28573865.