
Using Restricted Boltzmann Machines to Model Molecular Geometries

Peter Nekrasov, Jessica Freeze, Victor Batista
Yale University, Department of Chemistry
[email protected]
Abstract

Precise physical descriptions of molecules can be obtained by solving the Schrödinger equation; however, these calculations are intractable for most systems, and even approximate methods can be cumbersome. Force fields, which estimate interatomic potentials based on empirical data, are also time-consuming. This paper proposes a new methodology for modeling a set of physical parameters by taking advantage of the restricted Boltzmann machine's fast learning capacity and representational power. By training the machine on ab initio data, we can generate new molecular configurations that match the ab initio distribution. In this paper we introduce a new RBM based on the Tanh activation function and compare RBMs with different activation functions, including sigmoid, Gaussian, and (Leaky) ReLU. Finally, we demonstrate the ability of Gaussian RBMs to model small molecules such as water and ethane.

I.   Introduction

Recent innovations in data science have led to a proliferation of machine learning techniques, many of which are used to find patterns and categorize data. Among these models, the restricted Boltzmann machine (RBM) has shown special promise for its ability to quickly learn the probability distribution of a training set and extract its key “features.” The RBM is a versatile tool used in many practical applications ranging from social media services to product recommendations. While RBMs are becoming increasingly popular for conventional problems in data science, they are underutilized within the field of chemistry. Although a few recent studies have used RBMs to model self-avoiding walks of polymers [23] and to perform quantum electronic structure calculations [21], no systematic approach has been taken to develop such methods for a broad range of continuous systems.

There are several features which make the use of RBMs conducive to the field of chemistry. For one, the restricted Boltzmann machine falls under a family of energy-based models, which means it associates an energy value to any given state of the machine. Because chemistry calculations constantly utilize energy terms, the RBM can be adapted to complex physical computations. In this sense, a trained RBM can serve as a Hamiltonian for a physical system [21]. Furthermore, the RBM’s strength lies in its simplicity: with a simple structure and sampling algorithm, the RBM is straightforward to use and efficient to implement.

Imagine the following predicament: a set of molecular geometries or quantum states is generated either through experimental data or complex ab initio calculations. While this dataset is thought to be an accurate representation of the overall distribution, the number of samples is insufficient to perform any meaningful analysis on the system. Furthermore, one would like to extend the given representation to a complete ensemble within the phase space. In this case, RBMs can enable us to learn the overall distribution of physical parameters and predict points not included in the given dataset (Figure 1).

Figure 1: Given a sparse set of data points (blue), our aim is to train the model to predict the shape of the overall distribution and extract new data points (red) which could have been part of the original distribution.

The advantage of this approach is that it requires no prior information about the system in question. Most approximations of the quantum wave function require an expression of the constituent energies of the system. Likewise, molecular dynamics force fields require an understanding of interatomic potentials such as electrostatic and van der Waals forces. Instead, the RBM simply utilizes the statistical frequency of the training configurations to learn an internal representation of their energies. While quantum calculations are laborious to run, the RBM adopts a simple training and sampling algorithm, providing us with resounding representational power at a low computational cost.

In this paper we describe different types of RBMs and criteria for assessing their performance. We then show how the Gaussian RBM (GRBM) learns diverse sets of training data and reproduces various distributions. Finally, we use GRBMs to represent molecular systems with multiple degrees of freedom to show how they can learn bond and angle energies of small molecules such as H2O and ethane.

II.   Methods

Model overview

The RBM consists of two layers of neurons, a visible layer and a hidden layer (shown in Figure 2). Every visible node $v_i$ is connected to every hidden node $h_j$ by a set of weights, and each node has its own offset, or bias. The state or value of a given node is dependent on the state of the nodes it is connected to, as well as its bias. For ease of computation, the values of the weights are stored in a weight matrix $W$, where $W_{ij}$ represents the connection from $v_i$ to $h_j$. Meanwhile, the values of the visible and hidden biases are stored in bias vectors $a$ and $b$, respectively. The RBM is “restricted” in the sense that there are no connections between nodes in the same layer, which simplifies learning [16].

Figure 2: A sample RBM with three visible nodes and eight hidden nodes (3-8-RBM). Every visible node is connected to every hidden node, and each node has its own bias (not shown).

The visible layer serves as an input to the machine, where the number of visible nodes corresponds to the number of variables that make up the data. Values for the hidden nodes are then calculated by multiplying the visible nodes by the weights, adding the biases, and applying an activation function. This inference step can be viewed as a transformation from the space of observable parameters to the space represented by the hidden nodes. The method used for calculating between layers is formalized by a set of conditional probabilities given in the sections below. In this paper we refer to an $n$-$k$-RBM as an RBM with $n$ visible nodes and $k$ hidden nodes.
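To make the inference step concrete, the following minimal sketch (written with NumPy, not taken from the authors' package) computes the hidden activation probabilities of a 3-8-RBM from a visible vector; the random weights and the input vector are placeholder values used only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# a 3-8-RBM, as used later for the H2O examples
n_visible, n_hidden = 3, 8
W = 0.01 * rng.standard_normal((n_visible, n_hidden))  # weight matrix W
a = np.zeros(n_visible)                                # visible biases
b = np.zeros(n_hidden)                                 # hidden biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hidden_activation(v):
    """Inference step: p(h = 1 | v) = sigmoid(b + W^T v)."""
    return sigmoid(b + W.T @ v)

v = rng.standard_normal(n_visible)   # placeholder input configuration
print(hidden_activation(v))          # eight probabilities in (0, 1)
```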

A notable feature of the RBM, in contrast with other machine learning models, is that it does not have an “output” in the normal sense. Depending on the situation, the output of an RBM can be the visible layer, hidden filters, or energy. RBMs are trained in an unsupervised fashion (without labelling the data), so the hidden nodes identify their own labels during the course of training [5]. In a famous application of RBMs by Netflix to provide movie recommendations (see the Netflix prize [15]), the RBM was trained on a large set of movie ratings obtained from individual users, where the value of each visible node corresponded to the rating of a given movie. Based on the simple pattern that users who like movies from a certain director or genre are likely to enjoy other movies from that same category, the RBM came to associate hidden nodes with movie genres or directors. In this way, each hidden node elucidates a connection or correlation between visible nodes, which is true for continuous data as well. Note that understanding what each hidden node represents requires additional post hoc analysis.

Energy based models

As an energy based model, the RBM associates a scalar measure of energy to each state of the machine. Usually the energy equation is a combination of each layer times its respective biases and a term which relates the two layers. The goal of training is to minimize the overall energy of the RBM with respect to the training data. Once trained, the RBM associates lower energies with inputs that fall within the training distribution and higher energies with those that fall outside the training distribution. This becomes useful when generating new configurations because the RBM estimates the energy of a proposed configuration based on where it fits with the training distribution.

Figure 3: The sampling algorithm for a 4-6-RBM (four visible nodes, six hidden nodes) consists of using conditional probabilities to alternate between layers. This technique, known as Gibbs sampling, is relevant to many aspects of using RBMs, including estimating the gradient and calculating the error.

Binary RBM

The simplest and most widespread version of the RBM is the binary-binary RBM (BBRBM), which has binary visible and binary hidden units. In the binary-binary RBM, an active unit is represented as 1 while inactive units are represented with 0. For a BBRBM, the energy is given by:

$$E(v,h) = -a^{T}v - b^{T}h - v^{T}Wh$$

where $v$ is the vector of visible states, $h$ is the vector of hidden states, $a$ is the vector of visible biases, $b$ is the vector of hidden biases, and $W$ is the weight matrix representing the connections between visible and hidden nodes. The visible states $v$ serve as an input to the RBM, whereas $a$, $b$, and $W$ are all parameters learned during training.

The joint probability of states $v$ and $h$ is taken from the Boltzmann distribution:

$$p(v,h) = \frac{1}{Z}e^{-E(v,h)}$$

where $Z$ is the partition function, defined as:

$$Z = \sum_{v}\sum_{h}e^{-E(v,h)}$$

which is a sum over all possible states of the machine. The partition function $Z$ serves to normalize the joint probability distribution so that probabilities sum to 1. From this, we can deduce the marginal probability of a visible configuration $v$, given by the sum of probabilities over all possible hidden configurations:

$$p(v) = \sum_{h}p(v,h) \qquad (1)$$
$$p(v) = \frac{1}{Z}\sum_{h}e^{-E(v,h)} \qquad (2)$$

Because the state of the hidden layer depends on the state of the visible layer, we must use a conditional probability $p(h|v)$ to calculate the hidden states. This conditional probability is derived directly from the energy equation [20], giving us the activation of the hidden states:

$$p(h=1|v) = \sigma(b + W^{T}v)$$

where $\sigma(x)$ is the sigmoid activation function. Performing this operation only gives the probabilities that each hidden neuron is active; one must then sample a Bernoulli distribution with the given probabilities to obtain the actual binary states of the hidden layer. The visible states are then computed again from the hidden states using:

$$p(v=1|h) = \sigma(a + Wh)$$

which gives the probabilities of the visible neurons adopting a value of one. Figure 3 shows how conditional probabilities are used to sample back and forth between layers.
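As a concrete illustration of these two conditional probabilities, the sketch below implements one back-and-forth Gibbs step for a binary-binary RBM. It is a minimal NumPy example under placeholder parameters, not the authors' implementation; the Bernoulli sampling is done by thresholding uniform random numbers against the activation probabilities.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_h_given_v(v, W, b):
    """Sample binary hidden states from p(h = 1 | v) = sigmoid(b + W^T v)."""
    p_h = sigmoid(b + W.T @ v)
    return (rng.random(p_h.shape) < p_h).astype(float)

def sample_v_given_h(h, W, a):
    """Sample binary visible states from p(v = 1 | h) = sigmoid(a + W h)."""
    p_v = sigmoid(a + W @ h)
    return (rng.random(p_v.shape) < p_v).astype(float)

# one full Gibbs step (v -> h -> v') for a 4-6-RBM with placeholder parameters
W = 0.01 * rng.standard_normal((4, 6))
a, b = np.zeros(4), np.zeros(6)
v0 = rng.integers(0, 2, size=4).astype(float)
h = sample_h_given_v(v0, W, b)
v1 = sample_v_given_h(h, W, a)
```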

A major drawback of the binary RBM is that it is only able to represent binary states or bit strings. As most empirical data is real-valued, the Gaussian RBM was developed to model continuous variables.

Gaussian RBM

The Gaussian RBM (GRBM), also called the Gaussian-binary RBM, is an effective tool for modeling real-valued data. While the hidden layer remains binary, the visible layer can now adopt real values, with an additional parameter $\sigma$ representing the standard deviation of each visible node. The equation for the joint energy becomes:

$$E_{GRBM}(v,h) = \frac{\lVert v-a\rVert^{2}}{2\sigma^{2}} - b^{T}h - \frac{v^{T}}{\sigma^{2}}Wh \qquad (3)$$

where $\lVert\cdot\rVert$ is the Euclidean norm. This equation is similar to the binary energy except that the visible states are divided by $\sigma^{2}$ as a form of normalization, and the first term is replaced by $\frac{\lVert v-a\rVert^{2}}{2\sigma^{2}}$, which serves as a parabolic containment of the visible states: the overall energy increases the further the visible states $v$ are from the visible biases $a$. Whereas the states in a binary RBM are bounded by 0 and 1, the visible states of a Gaussian RBM must be restrained by this parabolic term to prevent them from drifting off to arbitrarily large values.

Similar to the binary RBM, the conditional probability of the hidden states given the visible states is:

$$p(h=1|v) = \sigma\!\left(b + W^{T}\frac{v}{\sigma^{2}}\right)$$

whereas the conditional probability of the visible states given the hidden states is:

$$p(v\,|\,h) = \mathcal{N}(a + Wh,\ \sigma^{2})$$

where $\mathcal{N}(\mu,\sigma^{2})$ is a Gaussian distribution with mean $\mu$ and variance $\sigma^{2}$. In practice this amounts to adding Gaussian noise after calculating the visible states. A detailed derivation of the conditional distributions can be found in [20].
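A minimal NumPy sketch of these two sampling rules is given below; it is illustrative only (not the authors' code), with $\sigma$ taken as a single scalar for simplicity.

```python
import numpy as np

rng = np.random.default_rng(2)

def grbm_sample_h(v, W, b, sigma=1.0):
    """Binary hidden units: p(h = 1 | v) = sigmoid(b + W^T (v / sigma^2))."""
    p_h = 1.0 / (1.0 + np.exp(-(b + W.T @ (v / sigma**2))))
    return (rng.random(p_h.shape) < p_h).astype(float)

def grbm_sample_v(h, W, a, sigma=1.0):
    """Real-valued visible units: v | h ~ N(a + W h, sigma^2)."""
    mean = a + W @ h
    return mean + sigma * rng.standard_normal(mean.shape)
```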

While initially there appears to be nothing inherently Gaussian about this energy equation, the probability starts to take on the form of a Gaussian function when substituted into the Boltzmann distribution:

$$p(v,h) = \frac{1}{Z}\,e^{-\frac{\lVert v-a\rVert^{2}}{2\sigma^{2}} + b^{T}h + \frac{v^{T}}{\sigma^{2}}Wh}$$

If we ignore the second and third terms of the energy equation, we see that the distribution of visible states follows a Gaussian distribution centered at the visible bias a\displaystyle a with variance σ2\displaystyle\sigma^{2}. Moreover, it has previously been shown using these equations that the RBM is equivalent to a mixture of Gaussian (MoG) model, with the locations of each Gaussian represented by the column vectors of the weight matrix (for a detailed derivation see [11]).

The energy and sampling equations are usually simplified by normalizing the data to zero mean and unit variance and then setting $\sigma = 1$. Alternatively, one can calculate the $\sigma$ of the data in advance, or treat $\sigma$ as a separate parameter that is optimized during training. While the training data need not follow a Gaussian distribution, the GRBM works best when modeling this type of distribution. Fortunately, many distributions found in nature are approximately Gaussian, in keeping with the central limit theorem.

Log Likelihood Estimates

The goal of training the RBM is to maximize the likelihood of the data under the model. By tweaking the weights and biases, the RBM can associate higher probabilities with configurations found in the training data. For practical purposes, we work with the logarithm of the likelihood, which allows us to write products as sums of logarithms. Since $\log(x)$ is a monotonically increasing function, maximizing the log-likelihood is the same as maximizing the likelihood. The log-likelihood for a training sample $x$ is given by:

$$\begin{aligned}
\mathcal{L}(x) &= \log p(x)\\
&= \log \sum_{h} p(x,h)\\
&= \log \sum_{h} \frac{e^{-E(x,h)}}{Z}\\
&= \log \sum_{h} e^{-E(x,h)} - \log Z
\end{aligned}$$

using (1) and (2). By optimizing the log likelihood of the training data, we maximize the likelihood of the training samples over all possible samples in our visible space.

In the case where there is an entire set of training samples, we take the average log-likelihood $\hat{\ell}$ by computing the expectation of the log-likelihood over all the samples:

$$\hat{\ell} = \left\langle \log\sum_{h} e^{-E(x,h)} \right\rangle - \log Z$$

The average log-likelihood is a rigorous way of monitoring the training and convergence of an RBM and of demonstrating its modeling capacity in statistical terms. The most difficult part of estimating this likelihood is calculating the partition function $Z$, which is intractable in most cases. In this study we use importance sampling to estimate $Z$ whenever calculating the log-likelihood (see [8] for more details).
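For binary hidden units, the data-dependent term $\log\sum_h e^{-E(x,h)}$ has a closed form because the sum over hidden configurations factorizes. The sketch below (a minimal NumPy illustration, not the authors' code) shows this term for a GRBM with $\sigma = 1$; the $\log Z$ term still requires a separate estimator such as the importance-sampling approach of [8], which is not sketched here.

```python
import numpy as np

def log_unnormalized_likelihood(x, W, a, b):
    """log sum_h exp(-E(x, h)) for a GRBM with sigma = 1 and binary hidden units."""
    quadratic = 0.5 * np.sum((x - a) ** 2)        # parabolic visible term
    softplus = np.logaddexp(0.0, b + W.T @ x)     # log(1 + exp(.)) per hidden unit, numerically stable
    return -quadratic + np.sum(softplus)
```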

Training algorithm

The gradient of $\hat{\ell}$, given by the derivative of $\hat{\ell}$ with respect to the model parameters $\theta$, comes out to be the difference between the data-based distribution and the distribution given by the entire model:

$$\frac{d\hat{\ell}}{d\theta} \propto \left\langle \frac{dE(x,h)}{d\theta} \right\rangle_{x} - \left\langle \frac{dE(v,h)}{d\theta} \right\rangle_{v}$$

where $x$ are explicit training samples and $v$ are samples from the model distribution.

Computing the partial derivatives of the energy function with respect to each parameter provides us with the gradient approximations for the weights and biases:

$$\frac{\partial\hat{\ell}}{\partial a} \propto \langle x - a\rangle_{x} - \langle v - a\rangle_{v}$$
$$\frac{\partial\hat{\ell}}{\partial b} \propto \langle p(h=1|x)\rangle_{x} - \langle p(h=1|v)\rangle_{v}$$
$$\frac{\partial\hat{\ell}}{\partial W} \propto \langle x\,p(h=1|x)^{T}\rangle_{x} - \langle v\,p(h=1|v)^{T}\rangle_{v}$$
$$\frac{\partial\hat{\ell}}{\partial \sigma} \propto \left\langle \frac{\lVert x-a\rVert^{2} - 2x^{T}Wp(h=1|x)}{\sigma^{3}} \right\rangle_{x} - \left\langle \frac{\lVert v-a\rVert^{2} - 2v^{T}Wp(h=1|v)}{\sigma^{3}} \right\rangle_{v}$$

These gradients are used as the update rules for the RBM. Typically training is done in batches, and the gradients are computed after each batch. The gradients are then added to the existing RBM parameters which gives rise to a new set of weights and biases. While the first term for each gradient can be calculated directly from the training data, the second term is almost always intractable, as it requires independent samples from an unknown model distribution.

Several algorithms are currently available for estimating the second term. The most widely used is Contrastive Divergence (CD). In Contrastive Divergence, sampling between the visible and hidden layers is used to create a Markov chain (as shown in Figure 3). The Markov chain is initialized at a training point, and the conditional probabilities are used to obtain the visible and hidden states after $k$ iterations, in a process known as Gibbs sampling. The value of these layers after sampling serves as a sufficient estimate of the model's expectation. Typically $k$ is set to 1, which provides a good estimate of the gradient while minimizing computation time.
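The following is a minimal sketch of a CD-1 update for a GRBM with $\sigma = 1$, written with NumPy for illustration rather than taken from the authors' package: the positive phase uses hidden probabilities driven by the data, and the negative phase uses one Gibbs reconstruction.

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(batch, W, a, b, lr=0.01):
    """One CD-1 step for a GRBM (sigma = 1). batch has shape (B, n_visible)."""
    # positive phase: hidden probabilities driven by the data
    ph_data = sigmoid(b + batch @ W)
    h_data = (rng.random(ph_data.shape) < ph_data).astype(float)
    # one Gibbs step: reconstruct visibles (Gaussian), then hidden probabilities again
    v_model = a + h_data @ W.T + rng.standard_normal(batch.shape)
    ph_model = sigmoid(b + v_model @ W)
    B = batch.shape[0]
    # gradient estimates: <data term> - <model term>
    dW = (batch.T @ ph_data - v_model.T @ ph_model) / B
    da = (batch - v_model).mean(axis=0)
    db = (ph_data - ph_model).mean(axis=0)
    return W + lr * dW, a + lr * da, b + lr * db
```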

Persistent Contrastive Divergence (PCD) is another algorithm which has been proposed for estimating the model distribution. Instead of restarting the Markov chain for each data point, one persistent Markov chain is retained in memory and extended after each batch. While its success has mostly been reported in binary RBMs [18], some have had success in using PCD to train Gaussian RBMs [3]. A reported improvement on PCD is the Parallel Tempering (PT) algorithm, which uses multiple Markov chains at different temperatures with a certain probability that states from different chains will swap [2]. The reasoning behind PT is that having different temperatures ensures that the Markov chain fully explores high-energy states. Though these algorithms have all been successfully implemented in binary RBMs, there is little evidence that they work in Gaussian RBMs. In practice, we found that both PCD and PT led to divergence unless a very small learning rate was used ($\alpha = 0.0001$), which led to long training times with marginal improvements in likelihood. Overall, CD-1 was simplest to use and almost never led to divergence. Our results match those in [11], which conclude that CD is the algorithm currently best suited for training GRBMs.

Monte Carlo sampling of an RBM

Because the RBM provides us with a measure of energy, we can sample new configurations from a trained RBM using the Metropolis Monte Carlo algorithm. This is done by initializing a random configuration and evaluating its energy. Then, at each iteration, a new sample $x'$ is generated by adding a random displacement vector to the previous sample $x$, and the energy of the new sample is evaluated. If the energy is lower than the previous energy, the new sample is accepted as part of the simulation. If not, the acceptance ratio is calculated using the Boltzmann distribution:

$$\frac{p(x')}{p(x)} = \exp\!\left(\frac{E(v,h) - E(v',h')}{k_{B}T}\right)$$

Finally, we generate a random number between 0 and 1, and if our number falls below this acceptance ratio, we keep the configuration as part of our ensemble. The Boltzmann constant $k_{B}$ and temperature parameter $T$ are usually set to one, though they may be useful in simulating higher temperatures or exploring higher-energy states.

The benefit of this Metropolis Monte Carlo method is that the normalization of the RBM need not be known. Whereas the true probability density of the joint states requires knowledge of the partition function $Z$, which is intractable in most cases, this term cancels when calculating the acceptance ratio at each step of the simulation. The result is a Monte Carlo simulation which follows the probability density given by the model.
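A minimal sketch of this sampler is given below. For simplicity it accepts or rejects moves using the marginal free energy $F(v) = -\log\sum_h e^{-E(v,h)}$ of a GRBM with $\sigma = 1$, which leaves the stationary distribution over visible configurations unchanged; the step size and chain length are arbitrary illustrative choices, and this is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(4)

def free_energy(v, W, a, b):
    """Marginal free energy of a GRBM with sigma = 1 and binary hidden units."""
    return 0.5 * np.sum((v - a) ** 2) - np.sum(np.logaddexp(0.0, b + W.T @ v))

def metropolis_sample(W, a, b, n_steps=10_000, step=0.05, kT=1.0):
    v = a.copy()                                   # start near the visible bias
    samples = []
    for _ in range(n_steps):
        v_new = v + step * rng.standard_normal(v.shape)
        dF = free_energy(v_new, W, a, b) - free_energy(v, W, a, b)
        if dF < 0 or rng.random() < np.exp(-dF / kT):
            v = v_new                              # accept the proposed move
        samples.append(v.copy())
    return np.array(samples)
```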

Computational complexity

As we can see, the energy calculations for an RBM are much simpler than the energy calculations for a quantum system. A typical Hartree-Fock calculation has computational complexity $\mathcal{O}(n^{4})$, where $n$ is the total number of basis functions, since the number of two-electron integrals required to build the Fock matrix scales as $n^{4}$ [4]. In practice this often reduces to $\mathcal{O}(n^{3})$, as most programs ignore integrals that are close to zero. Depending on the choice of basis set and the size of the atoms, one can have anywhere from 3 to 100 basis functions per atom. The complexity therefore becomes $\mathcal{O}(b^{3}a^{3})$, where $b$ is the average number of basis functions per atom and $a$ is the number of atoms. Because the complexity scales at least cubically with system size, these calculations become quite expensive as one increases either the number of atoms or the number of basis functions.

Meanwhile, RBM energy computations have complexity $\mathcal{O}(NH)$, where $N$ is the number of visible nodes and $H$ is the number of hidden nodes. In the proposed method, only three visible nodes are needed per additional atom, as compared to the large number of basis functions needed in Hartree-Fock calculations. Thus the complexity is $\mathcal{O}(3aH)$, expressed in terms of the number of atoms $a$. If we take the number of hidden nodes to be the same as the number of visible nodes, then the computational complexity grows quadratically with system size, and increasing the size of the atoms has no impact on the complexity of the calculations. Moreover, these calculations are performed using matrix multiplication, which is far simpler than integration.
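As a small illustration of this scaling, the joint energy of a whole batch of configurations reduces to a handful of matrix operations; the sketch below assumes a GRBM with $\sigma = 1$ and is not the authors' implementation.

```python
import numpy as np

def joint_energy(v, h, W, a, b):
    """GRBM joint energy E(v, h) with sigma = 1 for a batch: rows of v and h are configurations."""
    return (0.5 * np.sum((v - a) ** 2, axis=1)   # parabolic visible term, per row
            - h @ b                              # hidden bias term
            - np.einsum('bi,ij,bj->b', v, W, h)) # v^T W h for each configuration
```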

Binary visible, binary hidden:
  $p(h|v) = \sigma(b + W^{T}v)$;  $p(v|h) = \sigma(a + Wh)$
  $E(v,h) = -a^{T}v - b^{T}h - v^{T}Wh$
  References: [15], [5], [9]

Gaussian visible, binary hidden:
  $p(h|v) = \sigma(b + W^{T}\frac{v}{\sigma})$;  $p(v|h) = \mathcal{N}(a + Wh,\ \sigma^{2})$
  $E(v,h) = \frac{\lVert v-a\rVert^{2}}{2\sigma^{2}} - b^{T}h - \frac{v^{T}}{\sigma}Wh$
  References: [20], [24], [1]

Gaussian visible, Gaussian hidden:
  $p(h|v) = \mathcal{N}(b + \sigma_{h}W^{T}\frac{v}{\sigma_{v}},\ \sigma_{h}^{2})$;  $p(v|h) = \mathcal{N}(a + \sigma_{v}W\frac{h}{\sigma_{h}},\ \sigma_{v}^{2})$
  $E(v,h) = \frac{\lVert v-a\rVert^{2}}{2\sigma_{v}^{2}} + \frac{\lVert h-b\rVert^{2}}{2\sigma_{h}^{2}} - \frac{v^{T}}{\sigma_{v}}W\frac{h}{\sigma_{h}}$
  References: [7], [13]

Gaussian visible, ReLU hidden:
  $p(h|v) = \max(0,\ \eta + \mathcal{N}(0, \sigma(\eta)))$;  $p(v|h) = \mathcal{N}(a + Wh,\ \sigma^{2})$
  $E(v,h)$: N/A
  References: [12]

Gaussian visible, Leaky ReLU hidden:
  $p(h|v) = \max(\mathcal{N}(c\eta, c),\ \mathcal{N}(\eta, 1))$, where $c \in (0,1)$;  $p(v|h) = \mathcal{N}(a + Wh,\ \sigma^{2})$
  $E(v,h) = \frac{\lVert v-a\rVert^{2}}{2\sigma^{2}} - b^{T}h - \frac{v^{T}}{\sigma}Wh + \sum_{h_{j}>0}\left(\frac{h_{j}^{2}}{2} + \log\sqrt{2\pi}\right) + \sum_{h_{j}\leq 0}\left(\frac{h_{j}^{2}}{2c} + \log\sqrt{2c\pi}\right)$
  References: [10]

Gaussian visible, (Noisy) Tanh hidden:
  $p(h|v) = \mathcal{N}(\tanh(\eta),\ 1)$;  $p(v|h) = \mathcal{N}(a + Wh,\ \sigma^{2})$
  $E(v,h) = \frac{\lVert v-a\rVert^{2}}{2\sigma^{2}} - b^{T}h - \frac{v^{T}}{\sigma}Wh + \sum_{j=1}^{J}\left(h_{j}\tanh^{-1}(h_{j}) + \frac{1}{2}\ln(1-h_{j}^{2})\right)$
  References: see appendix

Table 1: Different versions of RBMs found in the literature, where $\eta = b + W^{T}\frac{v}{\sigma}$. This table is intended to represent four main activation functions which are prominent in neural networks: linear (Gaussian), sigmoid (binary), ReLU, and Tanh. In the design of a restricted Boltzmann machine, one typically starts with a function that defines the overall energy $E(v,h)$, from which the conditional probabilities are derived.

Advantages of different activation functions

Refer to caption

Figure 4: Comparison of RBM performance on natural image patches (left) and H2O Hartree-Fock data (right). Learning on natural image patches was performed using 196-196-RBMs, while H2O learning was performed using 3-8-RBMs. In both cases the Tanh RBM achieved the highest overall log-likelihood; however, it has trouble maintaining convergence on datasets with few parameters. Though the GRBM achieved only the second-best performance on both datasets, it trains more slowly but shows better consistency across different data.

Refer to caption

Figure 5: Training and simulation of a GRBM using two different toy sets of data. The top row was trained using a GRBM-2-2 (two visible nodes and two hidden nodes), while the bottom row was trained using a GRBM-2-4 (two visible nodes and four hidden nodes). Each hidden node represents an independent component in the data, shown by the arrows on the model distributions in the third column. The independent components of the simulated data (shown by the arrows on the second column) align with the independent components on the training data (first column). The RBM parameters were averaged over 100 trials of training.

Refer to caption

Figure 6: Bond and angle energies evaluated using Hartree-Fock (top row) and a GRBM-3-6 (bottom row). The GRBM is able to provide a close approximation of the original energy contour based on a limited set of points sampled from the original distribution.

Refer to caption

Figure 7: Ethane geometries generated by Hartree-Fock and GRBM. An ethane molecule has 18 internal coordinates, represented by the entries in the z-matrix above. That specific set of coordinates results in the ethane molecule drawn in the top right corner. This figure displays the posterior distributions for each pair of parameters for the ethane molecule generated by Hartree-Fock (black) and GRBM (red). The GRBM is able to reproduce the molecular geometries by fitting Gaussians to the original data.

Generating training datasets

To assess the validity of the methodology and the accuracy of training over a wide range of distributions, we trained the RBM on a set of toy datasets. In the first toy dataset, random samples were taken from a Laplacian distribution and fitted to the line $y = x$, with Gaussian noise of unit variance added in both dimensions. The second dataset was generated by sampling from two independent Laplacian distributions and using a mixing matrix to merge the sources. An observable $x$ is generated from independent Laplacian sources $s = (s_{1}, s_{2})$ using a mixing matrix $A$:

$$x = As$$

where $A$ is a linear transformation from the source space to the observable space. Given a set of observables, the RBM is capable of estimating the mixing matrix and recovering the source data [11]. For this task, independent component analysis is the standard computational method for separating observed data into source components. For this reason, the independent components of the original data were compared with the independent components of the data generated by the RBM. Independent component analysis was performed using the FastICA algorithm, as sketched below.
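The sketch below illustrates how such a toy dataset can be generated and analyzed, using NumPy for the Laplacian sources and scikit-learn's FastICA; the particular mixing matrix is an arbitrary choice for illustration and does not reproduce the exact dataset used in this study.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(5)
s = rng.laplace(size=(10_000, 2))        # independent Laplacian sources
A = np.array([[1.0, 0.5],
              [0.5, 1.0]])               # mixing matrix (assumed, for illustration)
x = s @ A.T                              # observables x = A s

ica = FastICA(n_components=2, random_state=0)
s_est = ica.fit_transform(x)             # recovered source estimates
print(ica.mixing_)                       # estimated mixing matrix
```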

Moving to more complex data, the different RBMs were also evaluated on natural image patches taken from [11]. These image patches come from randomly sampling the greyscale images found in the van Hateren natural image database [19]. Each patch consists of 14 × 14 pixels, and for our purposes we used only 10,000 of these patches during training.

Meanwhile, molecular training data for H2O and ethane were generated by performing molecular Monte Carlo simulations and evaluating the energy of each proposed sample using the restricted Hartree-Fock method. The Hartree-Fock calculations used the cc-pVDZ basis set because it gave the best agreement with literature values for the bond angle and bond length of H2O. These calculations were run using the PySCF module for quantum chemistry.
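For reference, a single restricted Hartree-Fock energy evaluation of this kind can be set up in PySCF as sketched below; the water geometry shown is an arbitrary near-equilibrium configuration, not one of the Monte Carlo samples used in this study.

```python
from pyscf import gto, scf

# build a water molecule with the cc-pVDZ basis (coordinates in Angstrom)
mol = gto.M(
    atom="""O  0.000  0.000  0.000
            H  0.757  0.586  0.000
            H -0.757  0.586  0.000""",
    basis="cc-pvdz",
    unit="Angstrom",
)

energy = scf.RHF(mol).kernel()   # total restricted Hartree-Fock energy in Hartree
print(energy)
```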

Different RBMs

There are many types of RBMs apart from those already discussed. By altering the equation for the overall energy, a variety of hidden and visible units can be implemented. Table 1 summarizes the energy equations and conditional probabilities for a few different types of RBMs. These RBMs are mostly defined by the activation functions used to calculate the hidden layer, including sigmoid, Gaussian, and ReLU.

Noticing the absence of an RBM with a Tanh activation function, we derived a new energy function and conditional probabilities for an RBM with Tanh hidden units using the method outlined in [14] (see the appendix for the derivation). After developing the Tanh RBM, we implemented the full range of RBMs defined in Table 1, which use common activation functions from machine learning. This implementation can be found at https://github.com/peter1255/RBM_chem.

Depending on the correlations found within the data, different activation functions may be better for modeling different data. It has been shown that non-linear hidden units expand the capabilities of the RBM by allowing the model to represent non-linear correlations between visible units [14]. The type of non-linearity chosen may have an effect on the model’s representation of the data, though a more detailed comparison of RBMs is needed.

III.   Results and Discussion

Comparison of RBMs

The RBMs found in Table 1 were implemented and trained on both natural image data and H2O geometries. Over 20 epochs of training, the performance of each RBM was compared by calculating the log-likelihood. These results are shown in Figure 4.

First proposed by [12], the ReLU RBM trains well but lacks an appropriate joint energy equation [17], making it impossible to calculate the log-likelihood or generate simulations. Though we tried using the Gaussian joint energy equation (3) in test simulations not shown here, values quickly diverged as large hidden terms outweighed the parabolic containment term. A slight modification of ReLU into Leaky ReLU gives a viable energy equation which can be successfully trained and used for both sampling and calculating the log-likelihood [10]. While this version of the RBM has not yet reached widespread use, we show in Figure 4 that it performs on a level comparable with the previously established GRBM.

Out of all the RBMs tested, the Tanh RBM achieved the best maximum log-likelihood on both natural image data and H2O molecular geometries. However, when trained on data sets with few parameters, like H2O data, the Tanh RBM has trouble maintaining convergence and tends to diverge after reaching its maximum. While the Tanh RBM shows great promise for representing a variety of data, additional research must be done to learn how to prevent divergence in general cases.

The model that achieves the second-best performance in terms of log-likelihood is the Gaussian RBM, which comes quite close to the Tanh RBM in the case of molecular data (Figure 4). For this reason, we selected this version of the model to test the proposed simulation method on molecular data.

Modeling two-dimensional mixtures

Figure 5 compares toy datasets with the datasets generated by a trained GRBM using the Monte Carlo method outlined above. The left column displays the original datasets, while the middle column displays the data generated by RBM. The two datasets appear quite similar to one another, though some of the pointedness in the original data is lost through the RBM’s Gaussian probability distribution. After performing independent component analysis on both training and simulated data, we can see that the independent components (shown using arrows) from the simulated data (middle column) match those from the original data (left column).

Then, evaluating the energy of each point in the visible space using equation (3), we graphed the energy contour represented by the GRBM (right column). By moving along this energy contour we are able to generate the states represented in the middle column. Because the Monte Carlo method favors lower energy states, most of the samples lie within the middle of the contour. Furthermore, the arrows represented by each hidden node of the RBM match the independent components of the reconstructed data, showing that the Monte Carlo sampling method preserves the original components of the data.

Furthermore, very few hidden nodes were required to reproduce the data. In the case of the linear distribution, which has only one independent component, only two hidden nodes were needed. In the case of the cross-shaped distribution, only four hidden nodes were needed. In general, two hidden nodes are needed for each independent component in the data: because the hidden nodes are binary, they can only represent the positive span of their weight vector, so an additional hidden node is needed to represent the opposite direction. This also provides a useful heuristic for determining the number of hidden nodes to include in an RBM. Using any more hidden nodes would be redundant, as the corresponding arrows begin to overlap.

Modeling molecular geometries

After learning toy distributions, the RBM was trained on molecular geometries of H2O generated by Hartree-Fock. Since the geometry of H2O is defined by three parameters (two bonds and one angle), three visible nodes were included in the GRBM. The GRBM was trained on a set of 10,000 geometries. Figure 6 displays the overall bond and angle energies evaluated using Hartree-Fock (top row) and the GRBM (bottom row).

Not only is the energy contour of the RBM similar to the energy determined by quantum methods, but the RBM also picks up on subtleties, including the rotated parabola which defines the bond and angle energies (first and second columns). Looking at the bond and angle energies given by the RBM, the left arm of the parabola shows the energy associated with atom-atom repulsion, while the right side of the graph shows the energy associated with atom-atom attraction. Both methods match the literature values for the bond length and angle of H2O.

Similar to the previous figure, energy contours (third and fourth columns) were generated by fixing one parameter at its minimum and evaluating the energy with respect to the other two parameters. As we can see, the energy contours inferred by the GRBM are quite similar to the original contours given by Hartree-Fock. Though the stratification of the contour layers is slightly different (a fundamental limitation of the RBM model structure), the overall shape of the contours is quite similar, and the GRBM accurately represents the relationship between the different data parameters.

Finally, we use the RBM to model a larger molecule: ethane. Though it consists of only 6 atoms, ethane requires 18 internal coordinates to represent all of its bonds, angles, and dihedrals. In general, a molecule with $n$ atoms requires $3n-6$ unique internal coordinates, which can be stored in the form of a z-matrix. We therefore train an RBM with 18 visible nodes on the internal ethane coordinates, and then sample from it to generate new geometries.

Figure 7 shows the difference between the original dataset (black) and the GRBM-generated data (red). The reconstructed data approximates the molecular geometries by fitting Gaussians to the original data. For simple unimodal data, the GRBM does a good job of reproducing the existing shape of the data. However, for more complex distributions (e.g. the bimodal dihedral angle found in coordinate 15 of Figure 7), the GRBM is unable to fit the pattern and instead regresses to the mean between the two modes. Nevertheless, the rest of the marginal distributions are well represented, and the GRBM captures the essence of the joint distributions. Through our Monte Carlo sampling algorithm, the GRBM provides an accurate picture of the dynamics of an ethane molecule.

Using the RBM, the energy of any configuration can be evaluated quickly and easily. By scaling the energy units of the RBM to proper Hartree units, we could use the RBM to represent bond and angle energies without having to perform any quantum calculations. Moreover, because of the low computational cost of generating states of a water molecule, the technique presented here could also be extended to model an entire body of liquid water in aqueous solution.

IV.   Conclusion

In this paper, we have demonstrated the usefulness of RBMs in modeling complex molecular systems. Because the model is adaptable to a wide variety of data, the proposed methodology can be used for a variety of problems in chemistry. Furthermore, the relative simplicity and efficiency of the model should make it accessible to a wider scientific audience. The RBM is a useful method for generating a complete molecular ensemble given a sparse set of data. We hope that the RBM will allow for a new cycle of experiment and theory, where samples generated from experiments are treated with RBMs to get more information about ensemble systems.

Note

The software package used for this study along with a tutorial for how to use it to model molecular data is available on Github at https://github.com/peter1255/RBM_chem.

References

  • [1] Cho, K.H., Raiko, T., Ilin, A. Gaussian-Bernoulli deep Boltzmann machine. Proceedings of the International Joint Conference on Neural Networks (IJCNN) 1–7 (2013).
  • [2] Cho, K., Raiko, T., & Ilin, A. Parallel tempering is efficient for learning restricted Boltzmann machines. Proceedings of the International Joint Conference on Neural Networks (IJCNN) 3246-3253 (2010).
  • [3] Courville, A., Bergstra, J., & Bengio, Y. A spike and slab restricted Boltzmann machine. Proceedings of the Society for Artificial Intelligence and Statistics (2011).
  • [4] Echenique, P. & Alonso, J. L. A mathematical and computational review of Hartree-Fock SCF methods in quantum chemistry. Molecular Physics 105:3057-3098 (2007).
  • [5] Hinton, G. E. & Salakhutdinov, R. R. Reducing the Dimensionality of Data with Neural Networks. Science 313;5786: 504–507 (2006).
  • [6] Hinton, G.E. Scholarpedia, 2(5):1668 (2007).
  • [7] Karakida, R., Okada, M., Amari, S. Dynamical analysis of contrastive divergence learning: Restricted Boltzmann machines with Gaussian visible units. Neural Netw. 79:78-87 (2016).
  • [8] Krause, O., Fischer, A., & Igel, C. Algorithms for estimating the partition function of restricted Boltzmann machines. Artificial Intelligence 278 (2020).
  • [9] Larochelle, H.; Bengio, Y. Classification using discriminative restricted Boltzmann machines. Proceedings of the 25th international conference on Machine learning - ICML:536 (2008).
  • [10] Li, C.L., Ravanbakhsh, S., & Poczos, B. Annealing Gaussian into ReLU: a new sampling strategy for leaky-ReLU RBM. arXiv preprint arXiv:1611.03879 (2016).
  • [11] Melchior, J., Wang, N., Wiskott, L. Gaussian-binary restricted Boltzmann machines for modeling natural image statistics. PLOS ONE 12;3 (2017).
  • [12] Nair, V. & Hinton, G. E. Rectified linear units improve restricted boltzmann machines. In ICML (2010).
  • [13] Ogawa, S., & Mori, H., A Gaussian-Gaussian-restricted-Boltzmann-machine-based deep neural network technique for photovoltaic system generation forecasting, IFAC-PapersOnLine, 52;4:87-92 (2019).
  • [14] Ravanbakhsh, S. et al. Stochastic neural networks with monotonic activation functions. Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (2016).
  • [15] Salakhutdinov, R., Mnih, A., & Hinton, G.E. Restricted Boltzmann machines for collaborative filtering. Proceedings of the 24th International Conference on Machine Learning (2007).
  • [16] Smolensky, P. Information processing in dynamical systems: Foundations of harmony theory. Parallel Distributed Processing, 1:194-281 (1986).
  • [17] Su, Q. et al. Unsupervised Learning with Truncated Gaussian Graphical Models. Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (2017).
  • [18] Tieleman, T. Training restricted boltzmann machines using approximations to the likelihood gradient. In ICML 25:1064–1071 (2008).
  • [19] van Hateren, J. H., & van der Schaaf, A. Independent Component Filters of Natural Images Compared with Simple Cells in Primary Visual Cortex. In Proceedings of Biological Sciences 359-366 (1998).
  • [20] Wang, N., Melchior, J., & Wiskott, L. Gaussian-binary Restricted Boltzmann Machines on Modeling Natural Image Statistics. CoRR. (2014).
  • [21] Xia, R. & Kais, S. Quantum machine learning for electronic structure calculations. Nature Communications 9:4195 (2018).
  • [22] Yang, E., Ravikumar, P., Allen, G.I., and Liu, Z. Graphical models via generalized linear models. In NIPS (2012).
  • [23] Yu, W. et al. Generating the conformational properties of a polymer by the restricted Boltzmann machine. J. Chem. Phys. 151, 031101 (2019).
  • [24] Ji Zhang, Hongjun Wang, Jielei Chu, Shudong Huang, Tianrui Li, Qigang Zhao, Improved Gaussian–Bernoulli restricted Boltzmann machine for learning discriminative representations. Knowledge-Based Systems 185 (2019).

V.   Appendix

Derivation of Tanh energy

First we define the activation function $f(\eta_{j}) = \tanh(\eta_{j})$, where $\eta_{j} = b_{j} + \sum_{i=1}^{I} W_{ij} v_{i}$, assuming the data are normalized to unit variance. The inverse function is $f^{-1}(h_{j}) = \tanh^{-1}(h_{j})$. The corresponding antiderivatives are:

$$\begin{aligned}
F(\eta_{j}) &= \int f(\eta_{j})\, d\eta = \ln\cosh\eta_{j} + C\\
F^{*}(h_{j}) &= \int f^{-1}(h_{j})\, dh = h_{j}\tanh^{-1}(h_{j}) + \tfrac{1}{2}\ln(1 - h_{j}^{2}) + C
\end{aligned}$$

Meanwhile, the activation for the visible units is linear, so that $\overline{f}(\nu_{i}) = \nu_{i}$ where $\nu_{i} = a_{i} + \sum_{j=1}^{J} W_{ij} h_{j}$. Following the same steps we get $\overline{F}(\nu_{i}) = \frac{1}{2}\nu_{i}^{2}$ and $\overline{F}^{*}(v_{i}) = \frac{1}{2}v_{i}^{2}$.

From Ravanbakhsh et al. [14], the conditional distribution is defined as:

$$p(h_{j}|\eta_{j}) = \exp\left(h_{j}\eta_{j} - F(\eta_{j}) - F^{*}(h_{j}) + g(h_{j})\right)$$

Replacing the values of $F(\eta_{j})$ and $F^{*}(h_{j})$ we get the conditional probability:

$$\begin{aligned}
p(h_{j}|\eta_{j}) &= \exp\left(h_{j}\eta_{j} - \ln\cosh\eta_{j} - h_{j}\tanh^{-1}(h_{j}) - \tfrac{1}{2}\ln(1-h_{j}^{2}) + g(h_{j})\right)\\
&= \exp\left(\tanh(\eta_{j})\,\eta_{j} - \ln\cosh\eta_{j} - \tanh(\eta_{j})\tanh^{-1}(\tanh(\eta_{j})) - \tfrac{1}{2}\ln(1-h_{j}^{2}) + g(h_{j})\right)\\
&= \exp\left(-\ln\cosh\eta_{j} - \tfrac{1}{2}\ln(1-h_{j}^{2}) + g(h_{j})\right)\\
&= \frac{1}{\cosh\eta_{j}}\,\frac{1}{\sqrt{1-\tanh^{2}\eta_{j}}}\,\exp(g(h_{j}))\\
&= \frac{1}{\cosh\eta_{j}\,\operatorname{sech}\eta_{j}}\,\exp(g(h_{j}))\\
&= \exp(g(h_{j}))
\end{aligned}$$

For simplicity we define $g(h_{j}) = -\frac{1}{2}h_{j}^{2} - \log\sqrt{2\pi}$ so that the conditional probability distribution becomes a Gaussian with mean $\tanh(\eta_{j})$ and unit variance, meaning that Gaussian noise is added to the hidden units after applying the Tanh activation. Similarly we get that $p(v|\nu) = \mathcal{N}(\nu, 1)$, as expected for the Gaussian visible layer.
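In code, this conditional reduces to applying the Tanh activation and adding unit-variance Gaussian noise, as in the minimal NumPy sketch below (assuming data normalized to unit variance, so $\sigma = 1$; this is an illustration, not the authors' implementation).

```python
import numpy as np

rng = np.random.default_rng(6)

def tanh_sample_h(v, W, b):
    """Noisy-Tanh hidden units: h | v ~ N(tanh(eta), 1) with eta = b + W^T v."""
    eta = b + W.T @ v
    return np.tanh(eta) + rng.standard_normal(eta.shape)
```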

The joint energy equation was generalized in Yang et al. [22] as follows:

$$E(v,h) = -v^{T}Wh + \overline{F}^{*}(v) + F^{*}(h)$$

Replacing $\overline{F}^{*}$ and $F^{*}$ we get the energy of a joint configuration of the Tanh RBM:

$$E(v,h) = -v^{T}Wh + \frac{1}{2}\lVert v\rVert^{2} + \sum_{j=1}^{J}\left(h_{j}\tanh^{-1}(h_{j}) + \tfrac{1}{2}\ln(1-h_{j}^{2})\right)$$

From the joint energy we can derive the marginal probability of a given visible configuration $v$:

$$\begin{aligned}
p(v) &= \frac{1}{Z}\int_{h} e^{-E(v,h)}\, dh\\
&= \frac{1}{Z}\int_{h} e^{v^{T}Wh - \frac{1}{2}\lVert v\rVert^{2} - \sum_{j=1}^{J}\left(h_{j}\tanh^{-1}(h_{j}) + \frac{1}{2}\ln(1-h_{j}^{2})\right)}\, dh\\
&= \frac{1}{Z}e^{-\frac{1}{2}\lVert v\rVert^{2}}\int_{h} e^{v^{T}Wh - \sum_{j=1}^{J}\left(h_{j}\tanh^{-1}(h_{j}) + \frac{1}{2}\ln(1-h_{j}^{2})\right)}\, dh\\
&= \frac{1}{Z}e^{-\frac{1}{2}\lVert v\rVert^{2}}\int_{h} e^{\sum_{j=1}^{J}\left(\eta_{j}h_{j} - h_{j}\tanh^{-1}(h_{j}) - \frac{1}{2}\ln(1-h_{j}^{2})\right)}\, dh\\
&= \frac{1}{Z}e^{-\frac{1}{2}\lVert v\rVert^{2}}\int_{h} \prod_{j}^{J} e^{\eta_{j}h_{j} - h_{j}\tanh^{-1}(h_{j}) - \frac{1}{2}\ln(1-h_{j}^{2})}\, dh\\
&= \frac{1}{Z}e^{-\frac{1}{2}\lVert v\rVert^{2}}\prod_{j}^{J}\int_{h_{j}} e^{\eta_{j}h_{j} - h_{j}\tanh^{-1}(h_{j}) - \frac{1}{2}\ln(1-h_{j}^{2})}\, dh_{j}\\
&= \frac{1}{Z}e^{-\frac{1}{2}\lVert v\rVert^{2}}\prod_{j}^{J} e^{\eta_{j}\tanh(\eta_{j}) - \tanh(\eta_{j})\tanh^{-1}(\tanh(\eta_{j})) - \frac{1}{2}\ln(1-\tanh^{2}(\eta_{j}))}\\
&= \frac{1}{Z}e^{-\frac{1}{2}\lVert v\rVert^{2}}\prod_{j}^{J} e^{-\frac{1}{2}\ln(1-\tanh^{2}(\eta_{j}))}\\
&= \frac{1}{Z}e^{-\frac{1}{2}\lVert v\rVert^{2}}\prod_{j}^{J} e^{-\frac{1}{2}\ln\left(1-\tanh^{2}\left(b_{j} + \sum_{i=1}^{I} W_{ij} v_{i}\right)\right)}
\end{aligned}$$

where Z is the partition function.

Since the marginal energy is given by

$$E(v) = -\log p(v) = F(v) + \log Z$$

where $F(v)$ is the free energy of the given configuration, we have the following expression for the free energy:

$$F(v) = \frac{\lVert v\rVert^{2}}{2} + \sum_{j}^{J}\frac{1}{2}\log\left(1-\tanh^{2}\left(b_{j} + \sum_{i=1}^{I} W_{ij} v_{i}\right)\right)$$