Dynamic Molecular Graph-based Implementation for Biophysical Properties Prediction
Abstract
Graph Neural Networks (GNNs) have revolutionized molecular discovery by uncovering patterns and identifying previously unknown features that aid in predicting biophysical properties and protein-ligand interactions. However, current models typically rely on 2-dimensional molecular representations as input, and while the use of 2D/3D structural data has gained deserved traction in recent years, many of these models are still limited to static graph representations. We propose a novel transformer-based approach utilizing GNNs for characterizing the dynamic features of protein-ligand interactions. Our message passing transformer pre-trains on a set of molecular dynamics data derived from physics-based simulations to learn coordinate construction, and makes binding probability and affinity predictions as a downstream task. Through extensive testing against existing models, our MDA-PLI model outperformed comparable molecular interaction prediction models with an RMSE of 1.2958. The geometric encodings enabled by our transformer architecture and the addition of time-series data add a new dimensionality to this form of research.
1 Introduction
Machine learning (ML) has greatly advanced protein structure prediction and molecular design processes; however, functional and reliable molecular representation remains a fundamental problem, and molecular structural constraints are a major contributor to this challenge. Understanding protein-ligand interactions (PLIs) is vital for drug candidate design and discovery. Many machine learning models rely on 2-dimensional sequence information, such as simplified molecular-input line-entry system (SMILES) strings or protein sequences, or on 3-dimensional information haphazardly extrapolated from the aforementioned 2D strings (Torng & Altman, 2019; Öztürk et al., 2018; Ragoza et al., 2017). Reliance on 2D data creates an environment in which designed or discovered molecules are syntactically aligned with their predecessors but lack semantic validity. 3D data extrapolated from a purely 2D representation greatly lacks spatial accuracy, a core requirement for characterizing PLIs. 3D data is expensive and time-consuming to obtain through traditional or computational methods; it is therefore critical to utilize available data to the fullest extent.
Graph Neural Networks (GNNs) brought graph-structured data to deep learning (Veličković et al., 2017; Gao et al., 2021; Kipf & Welling, 2016), and graph-based molecular design and discovery models have gained significant traction in recent years. Graphs are a natural choice for molecular representation in machine learning, as atoms and bonds translate easily to nodes and edges, and they have shown improvement over earlier methods and data formats (Duvenaud et al., 2015; Shlomi et al., 2020). However, top-performing PLI-focused GNNs, including those that handle 3D molecular data, are still limited to translation-invariant, static models (Lim et al., 2019; Knutson et al., 2022; Gonczarek et al., 2016). This greatly hinders molecular graph representation and learning by depriving the model of important physical/geometric properties.
Geometric and dynamic functionalities are a fairly recent and natural development of the GNN. There are multiple ways to represent the geometric data associated with each atom's 3D coordinate set. In this context, Ashby & Bilbrey (2021) created a dynamic model in which edge prediction is coupled with the Euclidean distance between atoms. Representing atoms by their Cartesian coordinates enables the model of Xu et al. (2022) to predict molecular conformations. More recently, message passing transformer models have gained popularity for the inclusion of geometric data within molecular learning models (Liu et al., 2021b; Schütt et al., 2017; Vignac et al., 2020; Brandstetter et al., 2021).
In this work, we propose a novel approach to PLI exploration by utilizing message passing transformers to learn from a limited set of dynamic time-series PLI data derived from MD simulations, which we call MD-Assisted-PLI (MDA-PLI). This model is unique in its ability to make biophysical property predictions on general static data downstream after pre-training on a limited MD dataset. Our approach was inspired by recent pre-train/fine-tune pipelines (Devlin et al., 2018; Liu et al., 2021a; Wu et al., 2022). We combine the geometric benefits afforded to message passing transformers with the consideration of time-series data to create a truly dynamic molecular representation model.
2 Methods
2.1 Architecture
The ligand and protein structures are represented in terms of two separate graphs, $\mathcal{G}_\ell = (\mathcal{V}_\ell, \mathcal{E}_\ell)$ and $\mathcal{G}_p = (\mathcal{V}_p, \mathcal{E}_p)$, where $\mathcal{V}$ and $\mathcal{E}$ correspond to the vertices and edges of each graph. Each node is associated with 116 features. Our model is based on the equivariant graph neural network architectures of Satorras et al. (2021) and Ganea et al. (2021). The model consists of initial node-embedding message updates for the ligand and protein graphs, followed by a cross-graph attention mechanism that conducts the coordinate updates, as shown in Figure 1.
The node features are updated using the message passing mechanism. During the individual graph encoding phase, the message that is sent from a source node to a target node is constructed as shown in Equation 1. The node features of the target ($h_i$) and source ($h_j$), and the squared relative distance between them ($\lVert x_i - x_j \rVert^2$) are first concatenated and then transformed using a multilayer perceptron $\phi_e$.
$$m_{ij} = \phi_e\!\left(h_i,\; h_j,\; \lVert x_i - x_j \rVert^2\right) \qquad (1)$$
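For concreteness, a minimal PyTorch sketch of this message construction is given below. The hidden width, the SiLU activation, and the name `EdgeMLP` are illustrative assumptions rather than details of our implementation.

```python
import torch
import torch.nn as nn

class EdgeMLP(nn.Module):
    """Edge message of Equation 1: phi_e(h_i, h_j, ||x_i - x_j||^2)."""

    def __init__(self, node_dim: int = 116, hidden_dim: int = 128):
        super().__init__()
        # Input: target features, source features, and one squared-distance scalar.
        self.phi_e = nn.Sequential(
            nn.Linear(2 * node_dim + 1, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, h_i, h_j, x_i, x_j):
        # Squared relative distance between target and source coordinates.
        d2 = ((x_i - x_j) ** 2).sum(dim=-1, keepdim=True)
        return self.phi_e(torch.cat([h_i, h_j, d2], dim=-1))
```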
What is unique in our model is the use of multiple aggregation methods. We use sum, mean, and max aggregation to form three messages, $m_i^{\mathrm{sum}}$, $m_i^{\mathrm{mean}}$, and $m_i^{\mathrm{max}}$:
$$m_i^{\mathrm{sum}} = \sum_{j \in \mathcal{N}(i)} m_{ij}, \qquad m_i^{\mathrm{mean}} = \frac{1}{\lvert\mathcal{N}(i)\rvert}\sum_{j \in \mathcal{N}(i)} m_{ij}, \qquad m_i^{\mathrm{max}} = \max_{j \in \mathcal{N}(i)} m_{ij} \qquad (2)$$
Then, we form the final message by linearly transforming the concatenation of $m_i^{\mathrm{sum}}$, $m_i^{\mathrm{mean}}$, and $m_i^{\mathrm{max}}$. That is, $m_i$ is given by
$$m_i = W\!\left[m_i^{\mathrm{sum}} \,\Vert\, m_i^{\mathrm{mean}} \,\Vert\, m_i^{\mathrm{max}}\right] \qquad (3)$$
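A sketch of the multi-aggregation step of Equations 2 and 3 follows; the scatter-based implementation and the 128-dimensional hidden width are assumptions made for illustration.

```python
import torch
import torch.nn as nn

def aggregate_messages(m_ij: torch.Tensor, target: torch.Tensor, num_nodes: int):
    """Sum, mean, and max aggregation of edge messages (Equation 2).

    m_ij:   [num_edges, dim] edge messages
    target: [num_edges] index of each message's target node
    """
    idx = target.unsqueeze(-1).expand(-1, m_ij.size(-1))
    zeros = m_ij.new_zeros(num_nodes, m_ij.size(-1))
    m_sum = zeros.scatter_reduce(0, idx, m_ij, reduce="sum", include_self=False)
    m_mean = zeros.scatter_reduce(0, idx, m_ij, reduce="mean", include_self=False)
    m_max = zeros.scatter_reduce(0, idx, m_ij, reduce="amax", include_self=False)
    return m_sum, m_mean, m_max

class MessageCombiner(nn.Module):
    """Linear map over the concatenated aggregations (Equation 3)."""

    def __init__(self, hidden_dim: int = 128):
        super().__init__()
        self.W = nn.Linear(3 * hidden_dim, hidden_dim)

    def forward(self, m_sum, m_mean, m_max):
        return self.W(torch.cat([m_sum, m_mean, m_max], dim=-1))
```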
The ligand and protein messages are used in the graph interaction layer to update the respective node embeddings. For each structure, we update the coordinates according to Equation 4 after transforming the messages using the multilayer perceptron $\phi_x$.
$$x_i \leftarrow x_i + \sum_{j \in \mathcal{N}(i)} \left(x_i - x_j\right)\phi_x\!\left(m_{ij}\right) \qquad (4)$$
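This update follows the equivariant formulation of Satorras et al. (2021), on which our architecture is based; a sketch is shown below, with illustrative layer names and dimensions.

```python
import torch
import torch.nn as nn

class CoordinateUpdate(nn.Module):
    """Equivariant coordinate update (Equation 4):
    x_i <- x_i + sum_j (x_i - x_j) * phi_x(m_ij)."""

    def __init__(self, hidden_dim: int = 128):
        super().__init__()
        # phi_x maps each edge message to a scalar weight.
        self.phi_x = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, x, m_ij, source, target):
        rel = x[target] - x[source]            # relative vectors, [num_edges, 3]
        delta = rel * self.phi_x(m_ij)         # weighted by transformed messages
        return x.index_add(0, target, delta)   # out-of-place coordinate update
```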
The remaining steps of the mathematical formulation of our architecture are given in the Appendix.
Figure 1: Overview of the MDA-PLI architecture: per-graph message passing for the ligand and protein graphs, followed by cross-graph attention and coordinate updates.
2.2 Data
Pre-training is conducted on graphs created from the MD simulation frames of 33 diverse, carefully selected protein-ligand complexes from the RCSB Protein Data Bank (Burley et al., 2020). Simulations were run for 50 ns with snapshots captured every 10 ps, resulting in roughly 5,000 frames per target. Targets were divided at random into a standard 80-10-10 training, validation, and test split. Some samples could not be run or processed in their entirety; see Table 2 in the Appendix for an explicit breakdown of the samples included in each pre-training set. Downstream fine-tuning and affinity prediction tasks were performed on 7,695 experimental samples pulled from PDBBind2016, using the same division splits as the pre-training sets. For computational efficiency, protein graphs are cropped through a k-hop proximity method to incorporate only the pocket in which the ligand is bound (a hypothetical sketch of this step follows below). Pre-training and downstream tasks run with the same hyperparameters for 30 epochs, with early stopping at a patience of 25.
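A hypothetical sketch of the pocket-cropping step is shown below; the 5 Å seed cutoff, the two-hop expansion, and the name `crop_pocket` are illustrative assumptions rather than the exact parameters of our pipeline.

```python
import numpy as np

def crop_pocket(protein_xyz, ligand_xyz, protein_edges, cutoff=5.0, k_hops=2):
    """Keep protein atoms near the bound ligand (hypothetical sketch).

    Seeds the pocket with protein atoms within `cutoff` angstroms of any
    ligand atom, then expands the selection k hops along the protein
    graph's edges (a list of (i, j) index pairs).
    """
    # Pairwise protein-ligand distances.
    d = np.linalg.norm(protein_xyz[:, None, :] - ligand_xyz[None, :, :], axis=-1)
    keep = set(np.where((d < cutoff).any(axis=1))[0])

    # k-hop expansion along protein edges.
    for _ in range(k_hops):
        keep |= {j for i, j in protein_edges if i in keep}
        keep |= {i for i, j in protein_edges if j in keep}
    return sorted(keep)
```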
3 Results and Discussion
As a first step, we evaluate our model on the widely used PDBBind2016 dataset, specially prepared for our model, to assess its general performance. The distribution of the affinity values in the train and test sets is shown in Figure 3(a), and the predicted versus actual affinity values in Figure 3(b), both in the Appendix. The model generally finds it challenging to make predictions for structures associated with low affinity values, as indicated by the residual values shown in Figure 2(a), also located in the Appendix.
In Table 1, we present our results in comparison to the experiments on similar datasets reported by Wu et al. (2022) and Feng et al. (2021). The top portion shows the results from two sequence-based models: DeepDTA by Öztürk et al. (2018) and DeepAffinity by Karimi et al. (2019). Structure-based models are listed afterwards, with our results in the final two rows. As shown in Table 1, we achieve the best results among the compared structure-based models.
Table 1: Affinity prediction performance on PDBBind2016.

Model | RMSE | Pearson | Spearman
---|---|---|---
DeepDTA (Öztürk et al., 2018) | 1.565 | 0.573 | 0.574
DeepAffinity (Karimi et al., 2019) | 1.893 | 0.415 | 0.436
3DCNN (Townshend et al., 2020) | 1.429 ± 0.042 | 0.541 ± 0.029 | 0.532 ± 0.033
IEConv (Hermosilla et al., 2020) | 1.554 ± 0.016 | 0.414 ± 0.053 | 0.428 ± 0.032
ProtMD (Wu et al., 2022), LP | 1.413 ± 0.032 | 0.572 ± 0.047 | 0.569 ± 0.051
ProtMD (Wu et al., 2022), FT | 1.367 ± 0.014 | 0.601 ± 0.036 | 0.587 ± 0.042
GCAT (Feng et al., 2021) | 1.382 | 0.585 | 0.592
Pafnucy (Stepniewska-Dziubinska et al., 2018) | 1.489 | 0.539 | 0.537
MDA-PLI without pre-training (Current Work) | 1.3462 | 0.5955 | 0.5444
MDA-PLI with pre-training (Current Work) | 1.2958 | 0.6181 | 0.5698

LP: linear probing; FT: fine-tuning (the two transfer settings reported by Wu et al. (2022)).
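For reference, the three metrics reported in Table 1 can be computed as in the following sketch (using NumPy and SciPy; this is not our evaluation code).

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def affinity_metrics(y_true, y_pred):
    """RMSE, Pearson r, and Spearman rho between predicted and actual affinities."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    rmse = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
    return rmse, pearsonr(y_true, y_pred)[0], spearmanr(y_true, y_pred)[0]
```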
Multiple observations can be made from the presented results. DeepDTA and DeepAffinity, the only sequence-based models in the table, exhibit the worst performance. The performance increase produced by structure-based models is clear. While many of the reported structure-based models have additional years of technological advancement afforded to them, Pafnucy (Stepniewska-Dziubinska et al., 2018) is a prime comparison in the argument for the inclusion of 3D structural data.
Many of the models reported are still limited to static protein-ligand representations, even when based on 3D crystal structures. In this context, ProtMD by Wu et al. (2022) and GCAT by Feng et al. (2021) can be noted as the top-performing publications that inspired our current work. Like ProtMD, we also utilize MD data in a pre-training stage before the downstream affinity prediction. Both previous models perform predictions through some form of geometrically informed GNN, in the case of GCAT with an attention mechanism. We also implement cross-graph attention to bolster our affinity predictions and lay the foundation for further investigation of the mechanisms behind protein-ligand binding affinity.
In Figure 2(b), we show how the Mean Absolute Error (MAE) varies across affinity regions. The blue-colored numbers in Figure 2(b) give the number of training data points in each affinity region. We can see that a lack of training data points is a major factor affecting prediction accuracy. However, pre-training has been beneficial in improving predictions for affinity regions with relatively few training data points. The model trained from scratch performs slightly better than the pre-trained model in the high-affinity region.
4 Conclusion
In this work, we devised a model that uses MD data to pre-train an equivariant graph neural network-based architecture in order to improve the modeling of structure-property relationships in protein-ligand interactions. We employ multiple aggregation methods along with cross-graph attention to achieve higher affinity prediction accuracy than previous models. In particular, our pre-training has been instrumental in improving predictions for affinity regions with low amounts of training data.
5 Acknowledgements
This work was supported by Laboratory Directed Research and Development (LDRD) funding, Mathematics of Artificial Reasoning for Science (MARS) Initiative, at the Pacific Northwest National Laboratory (PNNL). PNNL is a multiprogram national laboratory operated by Battelle for the DOE under Contract DE-AC05-76RLO 1830. This research used computational resources provided by Research Computing at the Pacific Northwest National Laboratory.
References
- Ashby & Bilbrey (2021) Michael Hunter Ashby and Jenna A Bilbrey. Geometric learning of the conformational dynamics of molecules using dynamic graph neural networks. arXiv preprint arXiv:2106.13277, 2021.
- Brandstetter et al. (2021) Johannes Brandstetter, Rob Hesselink, Elise van der Pol, Erik Bekkers, and Max Welling. Geometric and physical quantities improve E(3) equivariant message passing. arXiv preprint arXiv:2110.02905, 2021.
- Burley et al. (2020) Stephen K Burley, Charmi Bhikadiya, Chunxiao Bi, Sebastian Bittrich, Li Chen, Gregg V Crichlow, Cole H Christie, Kenneth Dalenberg, Luigi Di Costanzo, Jose M Duarte, Shuchismita Dutta, Zukang Feng, Sai Ganesan, David S Goodsell, Sutapa Ghosh, Rachel Kramer Green, Vladimir Guranović, Dmytro Guzenko, Brian P Hudson, Catherine L Lawson, Yuhe Liang, Robert Lowe, Harry Namkoong, Ezra Peisach, Irina Persikova, Chris Randle, Alexander Rose, Yana Rose, Andrej Sali, Joan Segura, Monica Sekharan, Chenghua Shao, Yi-Ping Tao, Maria Voigt, John D Westbrook, Jasmine Y Young, Christine Zardecki, and Marina Zhuravleva. RCSB Protein Data Bank: powerful new tools for exploring 3D structures of biological macromolecules for basic and applied research and education in fundamental biology, biomedicine, biotechnology, bioengineering and energy sciences. Nucleic Acids Research, 49(D1):D437–D451, 11 2020. doi: 10.1093/nar/gkaa1038. URL https://doi.org/10.1093/nar/gkaa1038.
- Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- Duvenaud et al. (2015) David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. Advances in neural information processing systems, 28, 2015.
- Feng et al. (2021) Xianbing Feng, Jingwei Qu, Tianle Wang, Bei Wang, Xiaoqing Lyu, and Zhi Tang. Attention-enhanced graph cross-convolution for protein-ligand binding affinity prediction. In 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 1299–1302. IEEE, 2021.
- Ganea et al. (2021) Octavian-Eugen Ganea, Xinyuan Huang, Charlotte Bunne, Yatao Bian, Regina Barzilay, Tommi Jaakkola, and Andreas Krause. Independent se(3)-equivariant models for end-to-end rigid protein docking. 2021. doi: 10.48550/ARXIV.2111.07786. URL https://arxiv.org/abs/2111.07786.
- Gao et al. (2021) Hongyang Gao, Yi Liu, and Shuiwang Ji. Topology-aware graph pooling networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(12):4512–4518, 2021.
- Gonczarek et al. (2016) Adam Gonczarek, Jakub M Tomczak, Szymon Zaręba, Joanna Kaczmar, Piotr Dąbrowski, and Michał J Walczak. Learning deep architectures for interaction prediction in structure-based virtual screening. arXiv preprint arXiv:1610.07187, 2016.
- Hermosilla et al. (2020) Pedro Hermosilla, Marco Schäfer, Matěj Lang, Gloria Fackelmann, Pere Pau Vázquez, Barbora Kozlíková, Michael Krone, Tobias Ritschel, and Timo Ropinski. Intrinsic-extrinsic convolution and pooling for learning on 3d protein structures. arXiv preprint arXiv:2007.06252, 2020.
- Karimi et al. (2019) Mostafa Karimi, Di Wu, Zhangyang Wang, and Yang Shen. Deepaffinity: interpretable deep learning of compound–protein affinity through unified recurrent and convolutional neural networks. Bioinformatics, 35(18):3329–3338, 2019.
- Kipf & Welling (2016) Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
- Knutson et al. (2022) Carter Knutson, Mridula Bontha, Jenna A Bilbrey, and Neeraj Kumar. Decoding the protein–ligand interactions using parallel graph neural networks. Scientific reports, 12(1):1–14, 2022.
- Lim et al. (2019) Jaechang Lim, Seongok Ryu, Kyubyong Park, Yo Joong Choe, Jiyeon Ham, and Woo Youn Kim. Predicting drug–target interaction using a novel graph neural network with 3d structure-embedded graph representation. Journal of chemical information and modeling, 59(9):3981–3988, 2019.
- Liu et al. (2021a) Shengchao Liu, Hanchen Wang, Weiyang Liu, Joan Lasenby, Hongyu Guo, and Jian Tang. Pre-training molecular graph representation with 3d geometry. arXiv preprint arXiv:2110.07728, 2021a.
- Liu et al. (2021b) Yi Liu, Limei Wang, Meng Liu, Yuchao Lin, Xuan Zhang, Bora Oztekin, and Shuiwang Ji. Spherical message passing for 3d molecular graphs. In International Conference on Learning Representations, 2021b.
- Öztürk et al. (2018) Hakime Öztürk, Arzucan Özgür, and Elif Ozkirimli. Deepdta: deep drug–target binding affinity prediction. Bioinformatics, 34(17):i821–i829, 2018.
- Ragoza et al. (2017) Matthew Ragoza, Joshua Hochuli, Elisa Idrobo, Jocelyn Sunseri, and David Ryan Koes. Protein–ligand scoring with convolutional neural networks. Journal of Chemical Information and Modeling, 57(4):942–957, 2017.
- Satorras et al. (2021) Victor Garcia Satorras, Emiel Hoogeboom, and Max Welling. E(n) equivariant graph neural networks, 2021. URL https://arxiv.org/abs/2102.09844.
- Schütt et al. (2017) Kristof Schütt, Pieter-Jan Kindermans, Huziel Enoc Sauceda Felix, Stefan Chmiela, Alexandre Tkatchenko, and Klaus-Robert Müller. Schnet: A continuous-filter convolutional neural network for modeling quantum interactions. Advances in neural information processing systems, 30, 2017.
- Shlomi et al. (2020) Jonathan Shlomi, Peter Battaglia, and Jean-Roch Vlimant. Graph neural networks in particle physics. Machine Learning: Science and Technology, 2(2):021001, 2020.
- Stepniewska-Dziubinska et al. (2018) Marta M Stepniewska-Dziubinska, Piotr Zielenkiewicz, and Pawel Siedlecki. Development and evaluation of a deep learning model for protein–ligand binding affinity prediction. Bioinformatics, 34(21):3666–3674, 2018.
- Torng & Altman (2019) Wen Torng and Russ B Altman. Graph convolutional neural networks for predicting drug-target interactions. Journal of chemical information and modeling, 59(10):4131–4149, 2019.
- Townshend et al. (2020) Raphael JL Townshend, Martin Vögele, Patricia Suriana, Alexander Derry, Alexander Powers, Yianni Laloudakis, Sidhika Balachandar, Bowen Jing, Brandon Anderson, Stephan Eismann, et al. Atom3d: Tasks on molecules in three dimensions. arXiv preprint arXiv:2012.04035, 2020.
- Veličković et al. (2017) Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks, 2017. URL https://arxiv.org/abs/1710.10903.
- Vignac et al. (2020) Clement Vignac, Andreas Loukas, and Pascal Frossard. Building powerful and equivariant graph neural networks with structural message-passing. Advances in Neural Information Processing Systems, 33:14143–14155, 2020.
- Wu et al. (2022) Fang Wu, Qiang Zhang, Dragomir Radev, Yuyang Wang, Xurui Jin, Yinghui Jiang, Stan Z Li, and Zhangming Niu. Pre-training protein models with molecular dynamics simulations for drug binding. 2022.
- Xu et al. (2022) Minkai Xu, Lantao Yu, Yang Song, Chence Shi, Stefano Ermon, and Jian Tang. Geodiff: A geometric diffusion model for molecular conformation generation. arXiv preprint arXiv:2203.02923, 2022.
Appendix A Appendix
A.1 Mathematics
We employ cross-graph attention to account for the interaction between the neighboring atoms of the ligand and the protein. Only the nearest inter-graph neighbors, those within a fixed cutoff distance of one another, are considered when computing attention coefficients. The attention mechanism we use is the one introduced by Veličković et al. (2017). The cross-graph message between node $i$ in the ligand and node $j$ in the protein (or vice versa) is given by
$$\mu_{ij} = \alpha_{ij}\, W h_j \qquad (5)$$
The attention coefficients are defined as,
$$\alpha_{ij} = \frac{\exp\!\left(\mathrm{LeakyReLU}\!\left(a^{\top}\left[W h_i \,\Vert\, W h_j\right]\right)\right)}{\sum_{k}\exp\!\left(\mathrm{LeakyReLU}\!\left(a^{\top}\left[W h_i \,\Vert\, W h_k\right]\right)\right)} \qquad (6)$$
where $a$ and $W$ are weight matrices. The cross-graph messages are aggregated by summation according to
$$\mu_i = \sum_{j} \mu_{ij} \qquad (7)$$
The final node feature update can now be carried out using a linear transformation function $\phi_h$ according to Equation 8.
$$h_i' = \phi_h\!\left(\left[h_i \,\Vert\, m_i \,\Vert\, \mu_i\right]\right) \qquad (8)$$
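A minimal PyTorch sketch of this cross-graph attention block (Equations 5-8) is given below, written in dense form over all ligand-protein pairs; the cutoff value, layer names, and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossGraphAttention(nn.Module):
    """GAT-style cross-graph attention (Equations 5-7) and node update (Equation 8)."""

    def __init__(self, dim: int = 128, cutoff: float = 5.0):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)
        self.a = nn.Linear(2 * dim, 1, bias=False)
        self.phi_h = nn.Linear(3 * dim, dim)  # update from [h_i || m_i || mu_i]
        self.cutoff = cutoff

    def forward(self, h_l, x_l, h_p, x_p, m_l):
        # Only inter-graph pairs closer than the cutoff attend to each other.
        mask = torch.cdist(x_l, x_p) < self.cutoff

        # Attention logits a^T [W h_i || W h_j] (Equation 6).
        wh_l, wh_p = self.W(h_l), self.W(h_p)
        pair = torch.cat(
            [wh_l.unsqueeze(1).expand(-1, wh_p.size(0), -1),
             wh_p.unsqueeze(0).expand(wh_l.size(0), -1, -1)], dim=-1)
        logits = F.leaky_relu(self.a(pair).squeeze(-1))
        alpha = torch.softmax(logits.masked_fill(~mask, float("-inf")), dim=-1)
        alpha = torch.nan_to_num(alpha)  # ligand atoms with no pocket neighbors

        # Aggregated cross-graph messages mu_i (Equations 5 and 7).
        mu_l = alpha @ wh_p

        # Final node update (Equation 8).
        return self.phi_h(torch.cat([h_l, m_l, mu_l], dim=-1))
```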
During pre-training, the model learns to predict the coordinates of the next time step based on those of the current time step, using mean squared error as the loss function. The fine-tuning task is to predict protein-ligand binding affinity. We sum the final node embeddings to arrive at a representation for the whole structure. These representations for the ligand and the protein are concatenated and transformed to the dimensions of the fine-tuning target using a multilayer perceptron.
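Both objectives can be sketched as follows; the pooling and head dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def pretraining_loss(pred_next_xyz, true_next_xyz):
    """Mean squared error between predicted and actual next-frame coordinates."""
    return F.mse_loss(pred_next_xyz, true_next_xyz)

class AffinityHead(nn.Module):
    """Fine-tuning readout: sum-pool each graph, concatenate, regress affinity."""

    def __init__(self, dim: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.SiLU(), nn.Linear(dim, 1))

    def forward(self, h_ligand, h_protein):
        # Sum final node embeddings to get whole-structure representations.
        z = torch.cat([h_ligand.sum(dim=0), h_protein.sum(dim=0)], dim=-1)
        return self.mlp(z)
```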
A.2 Data
Table 2: An explicit breakdown of pre-training datasets and sample counts.
Set | Targets | Total Samples
---|---|---
Train | 27 | 116,116
Test | 3 | 15,000
Validation | 3 | 15,000
A.3 Results
Figure 2: (a) Residuals of the predicted affinity values; (b) MAE across affinity regions, with the number of training data points per region shown in blue.
Figure 3: (a) Distribution of affinity values in the train and test sets; (b) predicted versus actual affinity values.