Semi-Supervised GCN for learning Molecular Structure-Activity Relationships

Alessio Ragno¹, Dylan Savoia¹, Roberto Capobianco¹,²

¹Dept. of Computer, Control and Management Engineering “Antonio Ruberti”, Sapienza University of Rome
²Sony AI
Correspondence to [email protected]
Abstract

Since the introduction of artificial intelligence in medicinal chemistry, the need has emerged to analyse how molecular property variation is modulated by single atoms or chemical groups. In this paper, we propose to train a graph-to-graph neural network using semi-supervised learning to attribute structure-property relationships. As initial case studies, we apply the method to solubility and molecular acidity, checking the consistency of the results against known experimental chemical data. Ultimately, our approach could represent a valuable tool for dealing with problems such as activity cliffs, lead optimization and de novo drug design.

1 Introduction

Drug discovery is the process of searching for, developing and optimizing molecules in order to identify new lead compounds that may become drugs for specific diseases.

One major issue in drug discovery is the experimental evaluation of molecules, a part of the drug development pipeline that is time-consuming and expensive.

To mitigate this problem, medicinal chemists generally use computational approaches to derive models that forecast the bio-activity of molecules and select those with higher predicted values. For example, quantitative structure-activity relationship (QSAR) modeling applies machine learning to predict the activity of molecules from their structure.

In this paper we propose a method to study these relationships by calculating atom attributions on the graph representation of molecules. It is, in fact, very important for chemists to identify the most relevant substructures in order to exploit them in the development of new molecules. Once a QSAR model is built, we propose to train a graph-to-graph network (GGN) that attributes the model’s predictions to the atoms of the molecules.

2 Related Work

In classic machine learning, molecules are represented using fingerprints and descriptors; with the introduction of graph convolutional networks (GCN) (Kipf and Welling, 2016), it became possible to represent their structure directly as graphs. This allows models to learn over molecules taking advantage of both chemical and structural information. When studying molecular properties, it is not only interesting to predict them with different estimators, but also important to understand how they are modulated by the different functional groups.
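To make the graph representation concrete, the following minimal sketch (our illustration, not code from the paper; RDKit and PyTorch Geometric are assumed as tooling) converts a SMILES string into a graph with one node per atom and two directed edges per bond. The single atomic-number node feature is a placeholder for the richer chemical features a real QSAR model would use.

```python
import torch
from rdkit import Chem
from torch_geometric.data import Data

def smiles_to_graph(smiles: str) -> Data:
    """Build a minimal molecular graph: one node per atom, one undirected edge per bond."""
    mol = Chem.MolFromSmiles(smiles)
    # Node features: only the atomic number here; real models add hybridization, charge, etc.
    x = torch.tensor([[atom.GetAtomicNum()] for atom in mol.GetAtoms()], dtype=torch.float)
    # Each bond becomes two directed edges (i -> j and j -> i).
    src, dst = [], []
    for bond in mol.GetBonds():
        i, j = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
        src += [i, j]
        dst += [j, i]
    edge_index = torch.tensor([src, dst], dtype=torch.long)
    return Data(x=x, edge_index=edge_index)

benzene = smiles_to_graph("c1ccccc1")  # 6 nodes, 12 directed edges
```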

In order to study these structure-activity relationships, researchers have exploited explainable AI (XAI) methods for interpreting the GCN predictions. This makes it possible to visualize the atoms that are considered “mostly responsible” for modulating the activity. Some approaches exploit integrated gradients (IG) (Sundararajan et al., 2017) to generate input attributions on the atoms (Jiménez-Luna et al., 2021). Although XAI methods do a good job of attributing specific atoms, they struggle to attribute functional groups as a whole. Moreover, many approaches (Ying et al., 2019; Luo et al., 2020; Schlichtkrull et al., 2021) involve the perturbation of graphs by means of atom removal, graph masking or edge cutting, but they do not take into account the fact that, in chemistry, removing an atom or a bond from a molecule might generate a structure that is impossible to realize. Other techniques (Huang et al., 2020) focus, instead, on the perturbation of the node features, but again, this could produce chemically meaningless values.

3 Proposed Approach

In this paper, we propose to train a GGN model with a semi-supervised approach (Kipf and Welling, 2016), so that it assigns scores to the atoms and functional groups of a molecule according to their influence on a QSAR model’s prediction. To train the model, we need ground-truth output values for the atoms: we modify the molecules by adding or substituting atoms, evaluate the activity of the generated molecules and, finally, compare it with that of the original ones. We then train the GGN model on the modified molecules, predicting the score only for the nodes corresponding to the added atoms. In this way, the resulting model is able to assign a score to every atom of the molecule, depending on its influence on the property.
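A minimal sketch of this semi-supervised training signal, under our own assumptions (the paper does not specify the loss; mean squared error is used here for illustration): the GGN outputs one score per atom, but the loss is computed only on the atoms of the added group.

```python
import torch
import torch.nn.functional as F

def masked_attribution_loss(pred_scores: torch.Tensor,
                            target_scores: torch.Tensor,
                            added_mask: torch.Tensor) -> torch.Tensor:
    """MSE between predicted and ground-truth atom scores, restricted to the added atoms.

    pred_scores, target_scores: shape [num_atoms];
    added_mask: boolean tensor, True for atoms introduced by the modification.
    """
    return F.mse_loss(pred_scores[added_mask], target_scores[added_mask])

# Example: a 9-atom modified molecule where only the last 3 atoms (the added group) carry labels.
pred = torch.randn(9)
target = torch.full((9,), -0.10)               # ground-truth score of the added group
mask = torch.tensor([False] * 6 + [True] * 3)
loss = masked_attribution_loss(pred, target, mask)
```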

With this idea, we build a dataset by generating new molecules via atom addition and set the ground-truth scores: given a molecule $m$, with predicted activity $a$, and a second molecule $m'$, with predicted activity $a'$, derived from $m$ through the addition of a group $g$, the ground-truth scores for the atoms of $g$ are defined as $attr(g_i) = \frac{a' - a}{|g|}$. We train the GGN to approximate $attr(g_i)$, with $g_i \in g$.

For instance, given a QSAR model and a molecule $m$ with predicted activity $a = 1.30$ (benzene in Figure 1), we can add a group $g$ of 3 atoms (the carboxyl group in Figure 1), generating a second molecule $m'$ (benzoic acid in Figure 1). Assuming the QSAR model evaluates $m'$ at $a' = 1.00$, we know that the addition led to an activity decrement of $1.00 - 1.30 = -0.30$; therefore, we can assume that the contribution of each added atom is $\frac{-0.30}{3} = -0.10$.

Figure 1: Ground-truth score generation example with benzene and benzoic acid.
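The following small helper (an illustrative sketch, not the authors’ code) reproduces the ground-truth score computation of the example above; in practice the activities $a$ and $a'$ would come from the trained QSAR model.

```python
def ground_truth_score(a_original: float, a_modified: float, group_size: int) -> float:
    """Per-atom ground-truth score attr(g_i) = (a' - a) / |g| for the added group."""
    return (a_modified - a_original) / group_size

# Benzene (a = 1.30) -> benzoic acid (a' = 1.00), carboxyl group of 3 atoms.
score = ground_truth_score(a_original=1.30, a_modified=1.00, group_size=3)
print(round(score, 2))  # -0.1, assigned to each atom of the added carboxyl group
```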

In order to make modifications to the molecules, we need a set of functional groups to add. Ertl (2017) proposes an algorithm to extract functional groups from a set of molecules and applies it to approximately 483,000 molecules with activity below 10 µM on any target, retrieved from the ChEMBL database (Gaulton et al., 2012). The resulting database consists of a set of 768 groups, provided in pseudo-SMILES notation, that are present in at least 10 ChEMBL molecules.
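As an illustration of how such additions can be performed programmatically (a sketch under our own assumptions, not the authors’ generation pipeline; the Ertl groups are distributed in pseudo-SMILES and require extra handling), RDKit can combine a molecule and a fragment and create a bond between them:

```python
from rdkit import Chem

def attach_group(mol_smiles: str, group_smiles: str,
                 mol_atom_idx: int, group_atom_idx: int) -> Chem.Mol:
    """Attach a functional group to a molecule with a single bond between the given atoms."""
    mol = Chem.MolFromSmiles(mol_smiles)
    group = Chem.MolFromSmiles(group_smiles)
    combo = Chem.CombineMols(mol, group)                  # disconnected union of the two graphs
    editable = Chem.EditableMol(combo)
    editable.AddBond(mol_atom_idx,
                     mol.GetNumAtoms() + group_atom_idx,  # group atoms are re-indexed after mol's
                     order=Chem.BondType.SINGLE)
    new_mol = editable.GetMol()
    Chem.SanitizeMol(new_mol)                             # fails if the result is chemically invalid
    return new_mol

# Benzene + carboxyl group -> benzoic acid (the example of Figure 1).
benzoic_acid = attach_group("c1ccccc1", "C(=O)O", mol_atom_idx=0, group_atom_idx=0)
print(Chem.MolToSmiles(benzoic_acid))  # benzoic acid, e.g. O=C(O)c1ccccc1
```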

Molecule                 IG        DeepLIFT   GGN       logS
hexane                   -0.1166   -0.1196    -1.8828   -9.1150
pentane                  -0.0932   -0.0956    -1.8627   -7.5483
1-hexanol                -0.0657   -0.0711    -0.5852   -2.8134
1-pentanol               -0.0404   -0.0460    -0.5696   -1.3863
3-pentanol               -0.0096   -0.0163    -0.1369   -0.5447
Correlation with logS     0.96      0.95       0.98      1.0
Table 1: Molecular scores obtained with IG, DeepLIFT and the proposed approach, compared with the logS of the molecules.

4 Experimental Results

We evaluate our method on two datasets:

  • ESOL Water Solubility Dataset (Delaney, 2004), which contains more than 1k molecules with the associated logarithm of the solubility expressed in molar concentration (logS);

  • A set of 160k molecules retrieved from ChEMBL (Gaulton et al., 2012) with the associated ChemAxon predicted pKa.

We train two GCN-QSAR models on these datasets and use the functional-group database to generate, for each of them, a dataset with ground-truth attribution scores on which a GGN model is trained. The following subsections analyse the attributions of the GCN-QSAR predictions obtained for a few molecules in both tasks, comparing them with two XAI methods: IG (Sundararajan et al., 2017) and DeepLIFT (Shrikumar et al., 2017). The attributions are visualized as a mask of colors on the atoms, on a scale from blue to red, where blue is negative and red is positive; the opacity of the color indicates the intensity of the score. To produce a quantitative evaluation, we sum all the atom scores of each molecule: we expect molecules with a higher value of the property under examination to have a higher molecular score. Finally, we analyse the correlation between the molecular scores and the corresponding properties.
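A sketch of this quantitative evaluation, under our own assumptions about the data layout (the attribution and property values below are made up for illustration): per-atom scores are summed into a molecular score, which is then correlated with the property of interest.

```python
import numpy as np

def molecular_score(atom_scores: np.ndarray) -> float:
    """Sum the per-atom attributions of one molecule into a single molecular score."""
    return float(np.sum(atom_scores))

# Hypothetical per-atom attributions for three molecules and their property values (e.g. logS).
atom_scores = [np.array([-0.12, -0.05, -0.08]),
               np.array([-0.04, 0.02, 0.01]),
               np.array([0.10, 0.15, 0.20])]
properties = np.array([-4.0, -1.5, 0.3])

mol_scores = np.array([molecular_score(s) for s in atom_scores])
pearson_r = np.corrcoef(mol_scores, properties)[0, 1]  # Pearson correlation coefficient
print(pearson_r)
```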

4.1 Water Solubility

Water solubility is very important in medicinal chemistry, since most drugs must be in solution at the site of absorption. Moreover, solubility works well as a benchmark, since we know from chemical theory that the presence of specific atoms or patterns significantly modulates this property.

As a first analysis, we compare the attributions for five molecules, external to the dataset and with increasing solubility: hexane, pentane, 1-hexanol, 1-pentanol and 3-pentanol.

Molecule                   IG        DeepLIFT   GGN       pKa
pentanoic acid             0.4884    0.9821     1.6650    5.01
4-chloropentanoic acid     0.1337    0.0765     0.5025    4.27
3-chloropentanoic acid     0.1610    0.2180     0.5006    4.24
2-chloropentanoic acid     0.0253    -0.0779    0.4777    3.96
Correlation with pKa       0.98      0.41       0.98      1.0
Table 2: Molecular scores obtained with IG, DeepLIFT and the proposed approach, compared with the pKa of the molecules in the first pKa case study.

Chemical theory and experimental solubility values for the molecules under examination suggest that the hydroxyl (–OH) group generally increases solubility when introduced into a molecule, and that the closer it is to the center of the carbon chain, the more the solubility increases. Hexane and pentane are composed only of a carbon chain, while 1-hexanol and 1-pentanol are generated by adding a hydroxyl group to the former two, respectively; this makes them more soluble than the original molecules, and in fact the hydroxyl group receives positive scores from all the methods (Figure 2(a), rows 3 and 4). Following the same reasoning, 3-pentanol is more soluble than the previous molecules, as its hydroxyl group is placed at the center of the chain. Also in this case, the attribution methods report values coherent with chemical theory (Figure 2(a), row 4). The molecular scores obtained with the different methods agree well with these expectations: indeed, they show a high positive correlation with logS (Table 1).

4.2 pKa

The acidity of a molecule is a property that influences the solubility of drugs and their pharmacodynamic and pharmacokinetic (ADMET) properties. The strength of an acid is generally measured through the acid dissociation constant, $K_a$, or its negative decimal logarithm, pKa $= -\log_{10} K_a$.

(a) Solubility attribution comparison varying hydroxyl group position
(b) pKa attribution comparison varying the chlorine position
Figure 2: Molecular attributions compared between IG, DeepLIFT and the proposed approach

We examine four molecules: pentanoic acid, 4-chloropentanoic acid, 3-chloropentanoic acid and 2-chloropentanoic acid. In this case study, the molecules show a decreasing trend of pKa due to the position of the chlorine along the chain: the closer it is to the carboxyl group, the more the pKa decreases (i.e., the more acidic the molecule). The carboxyl group in pentanoic acid is correctly attributed with the lowest score, since it is the group that determines the molecule’s acidity (Figure 2(b), row 1). Moreover, our method is the only one that correctly identifies the carboxyl group as a whole. For the other molecules, only our method and DeepLIFT are able to attribute the chlorine as an element that negatively influences the pKa (Figure 2(b), rows 2, 3 and 4). Although DeepLIFT seems to perform similarly to our approach, the correlation between the molecular scores and the pKa values shows that our method’s results are more coherent with the experimental values (Table 2).

5 Conclusions

Neural networks in drug design may help reduce the costs of lead discovery and, with GCNs, it is possible to exploit both the chemical and structural information of molecules thanks to the graph representation. We performed the experiments using only the vanilla GCN layer, as the main goal was to provide a general method. Other variants, such as graph attention layers, could be used as well, both for the QSAR model and for the GGN.

XAI has proved to be a key tool for developing future models in drug design. Although novel techniques are able to identify the most relevant atoms, it is still difficult for them to focus on functional groups. The proposed approach manages to overcome this issue, and we think that it could be extended to optimize the activities/properties of molecules by adding or substituting atoms in a rational way.

As a drawback, the proposed method requires building a GGN. This approach proved to be more “chemically oriented”, but it carries the overhead of training an additional model.

References

  • Delaney [2004] John S. Delaney. ESOL:  estimating aqueous solubility directly from molecular structure. Journal of Chemical Information and Computer Sciences, 44(3):1000–1005, May 2004. doi: 10.1021/ci034243x. URL https://doi.org/10.1021/ci034243x.
  • Ertl [2017] Peter Ertl. An algorithm to identify functional groups in organic molecules. Journal of Cheminformatics, 9(1):36, Jun 2017. ISSN 1758-2946. doi: 10.1186/s13321-017-0225-z. URL https://doi.org/10.1186/s13321-017-0225-z.
  • Gaulton et al. [2012] Anna Gaulton, Louisa J. Bellis, A. Patricia Bento, Jon Chambers, Mark Davies, Anne Hersey, Yvonne Light, Shaun McGlinchey, David Michalovich, Bissan Al-Lazikani, and John P. Overington. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Research, 40(Database issue):D1100–D1107, Jan 2012. ISSN 1362-4962. doi: 10.1093/nar/gkr777. URL https://pubmed.ncbi.nlm.nih.gov/21948594.
  • Huang et al. [2020] Qiang Huang, Makoto Yamada, Yuan Tian, Dinesh Singh, Dawei Yin, and Yi Chang. Graphlime: Local interpretable model explanations for graph neural networks. CoRR, abs/2001.06216, 2020. URL https://arxiv.org/abs/2001.06216.
  • Jiménez-Luna et al. [2021] José Jiménez-Luna, Miha Skalic, Nils Weskamp, and Gisbert Schneider. Coloring molecules with explainable artificial intelligence for preclinical relevance assessment. 61(3):1083–1094, February 2021. doi: 10.1021/acs.jcim.0c01344. URL https://doi.org/10.1021/acs.jcim.0c01344.
  • Kipf and Welling [2016] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. CoRR, abs/1609.02907, 2016. URL http://arxiv.org/abs/1609.02907.
  • Luo et al. [2020] Dongsheng Luo, Wei Cheng, Dongkuan Xu, Wenchao Yu, Bo Zong, Haifeng Chen, and Xiang Zhang. Parameterized explainer for graph neural network. CoRR, abs/2011.04573, 2020. URL https://arxiv.org/abs/2011.04573.
  • Schlichtkrull et al. [2021] Michael Sejr Schlichtkrull, Nicola De Cao, and Ivan Titov. Interpreting graph neural networks for NLP with differentiable edge masking. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=WznmQa42ZAx.
  • Shrikumar et al. [2017] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. CoRR, abs/1704.02685, 2017. URL http://arxiv.org/abs/1704.02685.
  • Sundararajan et al. [2017] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks, 2017.
  • Ying et al. [2019] Rex Ying, Dylan Bourgeois, Jiaxuan You, Marinka Zitnik, and Jure Leskovec. GNN explainer: A tool for post-hoc explanation of graph neural networks. CoRR, abs/1903.03894, 2019. URL http://arxiv.org/abs/1903.03894.