Semi-Supervised Junction Tree Variational Autoencoder for Molecular Graphs
Abstract
Molecular representation learning is essential to solving many drug discovery and computational chemistry problems. It is a challenging problem due to the complex structure of molecules and the vast chemical space. Graph representations of molecules are more expressive than traditional representations, such as molecular fingerprints, and can therefore improve the performance of machine learning models. We propose SeMole, a method that augments the Junction Tree Variational Autoencoder, a state-of-the-art generative model for molecular graphs, with semi-supervised learning. SeMole aims to improve the accuracy of molecular property prediction when labeled data is limited by exploiting unlabeled data. We enforce that the model generates molecular graphs conditioned on target properties by incorporating the property into the latent representation. We propose an additional pre-training phase to improve the training process for our semi-supervised generative model. We perform an experimental evaluation on the ZINC dataset using three different molecular properties and demonstrate the benefits of semi-supervision.
Introduction
One of the challenges in the drug development pipeline is to discover small molecules with desired properties. Testing candidate molecules experimentally in the wet lab is time-consuming and expensive. The main challenge for computational methods is the vastness of chemical space and the difficulty of navigating it, which is where molecular representation learning attracts attention. Molecular representation learning has been utilized, through either end-to-end training or pre-trained strategies, to solve drug discovery problems such as generating molecules (Jin, Barzilay, and Jaakkola 2018; Dai et al. 2018; Kusner, Paige, and Hernández-Lobato 2017) and predicting molecular properties (Gilmer et al. 2017; Yang et al. 2019; Hu et al. 2019). Recently, advances in learning molecular representations have made it possible to propose novel molecules with targeted desired properties (Gómez-Bombarelli et al. 2018; Kang and Cho 2018; Olivecrona et al. 2017; Popova, Isayev, and Tropsha 2018).
Several molecular representation learning models have been designed by representing molecules as graphs, where nodes are atoms and edges are bonds. Utilizing Graph Neural Networks (GNNs) (Wu et al. 2020) has resulted in promising performance for unconditionally generating molecules (Jin, Barzilay, and Jaakkola 2018; Ma, Chen, and Xiao 2018; You et al. 2018a; Shi et al. 2020; You et al. 2018b). The Junction Tree Variational Autoencoder (JTVAE) (Jin, Barzilay, and Jaakkola 2018) is a state-of-the-art graph-based method for generating molecules. It converts a molecular graph into an associated tree structure by breaking the molecule into components predefined in a vocabulary of chemical substructures. This ensures that the model generates molecules by assembling chemically valid building blocks, which results in generating 100% valid molecules.
Previous generative methods do not generate molecules conditioned on target properties; instead, they optimize the latent space based on a target property. (Jin, Barzilay, and Jaakkola 2018), (Kusner, Paige, and Hernández-Lobato 2017), and (Dai et al. 2018) learn the latent representation in an unsupervised manner by minimizing the reconstruction error and optimize molecules with respect to the desired property afterward, following the optimization approach proposed by (Gómez-Bombarelli et al. 2018). Despite these methods' promising performance, the optimization approaches do not allow setting the property to a specific value and depend on the definition of the objective function for the optimization task. They are also not scalable for generating molecules with multiple desired properties.
Moreover, these supervised training methods assume that there is enough labeled data to train a classifier or regressor. Such solutions largely depend on the availability of labeled datasets, which is not the case for many real-world datasets. Recently, researchers have proposed methods for semi-supervised graph classification (Sun et al. 2019; Hao et al. 2020) to address the data scarcity problem. Therefore, semi-supervised learning could be beneficial in enhancing the power of molecular generative models.
Kang and Cho (2018) proposed a semi-supervised variational autoencoder (SSVAE) that conditionally generates molecules without any post hoc optimization. However, SSVAE represents molecules as SMILES strings (Weininger 1988), which often leads the model to generate invalid molecules. The SMILES representation is also not sensitive to molecular similarity, making it hard to learn a smooth embedding of molecules (Jin, Barzilay, and Jaakkola 2018). Moreover, semi-supervised learning with generative models (Kingma et al. 2014) is challenging to train end-to-end (Maaløe et al. 2016).
In this paper, we extend a state-of-the-art generative model for molecular graphs, JTVAE, with semi-supervised learning to learn molecular properties directly as part of the latent representation via partial supervision. We propose an additional pre-training phase to improve the training process for the semi-supervised generative model. While JTVAE achieves state-of-the-art performance on generating molecules unconditionally, generating molecules conditioned on target properties with limited labeled data remains largely unexplored. In summary, this paper makes the following contributions:
- We combine a semi-supervised model with a state-of-the-art molecular graph generative model, JTVAE, for molecular property prediction and generation of valid molecules with desired properties.
- We improve the training process of the semi-supervised variational autoencoder by adding a pre-training phase.
- We perform an experimental evaluation on the ZINC dataset for three different molecular properties, on both property prediction and generation with respect to target properties.
Related Work
Semi-supervised Learning for Molecular Graphs
Semi-supervised learning is particularly beneficial for chemistry applications due to the vast unlabeled chemical space and the partially labeled data frequently available in the pharmaceutical and precision agriculture industries. ASGN (Hao et al. 2020) addresses the data scarcity problem with a semi-supervised method that selects the most informative samples through an active learning approach. InfoGraph (Sun et al. 2019) is a semi-supervised graph-level representation learning method that has been evaluated on molecular property prediction. It employs a student-teacher framework in which the teacher model is trained on unlabeled data and the student model is trained on the labeled data using a supervised objective. InfoGraph maximizes the mutual information between the representations learned by these two models so that the student model learns from the teacher model.
Semi-supervised learning with generative models optimizes the label prediction jointly with a variational autoencoder over the input data (Kingma et al. 2014; Maaløe et al. 2016; Siddharth et al. 2017). The latent representation is divided into a structured and an unstructured part, where the structured part is enforced to represent the labels of data points. Only part of the labels is provided during training; therefore, the model learns the latent representation in a semi-supervised setting. This approach has achieved state-of-the-art performance on image classification (Kingma et al. 2014; Maaløe et al. 2016) and speech synthesis (Habib et al. 2019) with partially labeled data (Kingma and Welling 2019). (Kang and Cho 2018) proposed a method for conditionally generating molecules inspired by semi-supervised learning with generative models (Kingma et al. 2014).
Methodology
In this section, we first introduce our problem definition. Then we propose our method, SeMole, a semi-supervised Junction Tree Variational Autoencoder. Afterward, we present a modified version that adds a pre-training phase, which we call SeMolePretrained.
Problem Definition
Given a set of labeled molecular graphs with corresponding properties, $\mathcal{D}_L = \{(G_i, y_i)\}_{i=1}^{N_L}$, and a set of unlabeled molecular graphs, $\mathcal{D}_U = \{G_j\}_{j=1}^{N_U}$, our goal is to learn a model that predicts the labels of the unlabeled molecular graphs as part of its latent representation. The model should also be able to reconstruct a molecular graph from the learned latent representation, so that changing the label leads to generating new molecular graphs conditioned on the target label.
SeMole
Our method extends the Junction Tree Variational Autoencoder to a semi-supervised generative model for molecular graph property prediction. We propose a semi-supervised generative model with three latent variables: $y$, $z_T$, and $z_G$. Figure 1 shows the graphical model. $y$ represents a high-level property of the molecular graph, such as a partially observed solubility. Following the Junction Tree Variational Autoencoder, the molecular graph is decomposed into a junction tree by replacing each cluster with a node; $z_T$ represents the tree structure and the clusters in the tree, and $z_G$ represents the clusters' connectivity, i.e., how they are connected in the original graph. $z_T$ and $z_G$ are unobserved. $y$, $z_T$, and $z_G$ are sampled from normal distributions, and the joint generative model factorizes as
$$p_\theta(G, T, y, z_T, z_G) = p(y)\, p(z_T)\, p(z_G)\, p_\theta(T \mid y, z_T)\, p_\theta(G \mid T, y, z_G) \quad (1)$$
Although the molecular graph $G$ depends on the junction tree $T$, which in turn already depends on $y$, we deliberately make both the molecular graph and the junction tree generation processes dependent on $y$. This is justified because some pairs of stereoisomers (and constitutional isomers) can have drastically different molecular properties: the same junction tree can be assembled into different graphs, and it is the responsibility of the graph decoder to select the cluster connectivity that gives rise to the isomer with the desired property. For that reason, the latent variable $y$ is also provided to the graph decoder.
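As an illustration of this conditioning, the following minimal PyTorch sketch concatenates the property latent $y$ onto both $z_T$ and $z_G$ before decoding. The class and parameter names (ConditionalLatent, latent_dim, prop_dim) are our own for exposition and are not taken from the JTVAE codebase.

```python
import torch
import torch.nn as nn

class ConditionalLatent(nn.Module):
    """Builds decoder inputs [z_T; y] and [z_G; y] so that both the tree
    and graph decoders are conditioned on the property latent y.
    (Hypothetical sketch; names do not come from the JTVAE codebase.)"""

    def __init__(self, latent_dim: int = 56, prop_dim: int = 1):
        super().__init__()
        # Project the concatenated latent back to the decoder's input size.
        self.tree_proj = nn.Linear(latent_dim + prop_dim, latent_dim)
        self.graph_proj = nn.Linear(latent_dim + prop_dim, latent_dim)

    def forward(self, z_tree, z_graph, y):
        # y: (batch, prop_dim) -- observed property or a sample from q(y|G,T)
        z_tree_cond = self.tree_proj(torch.cat([z_tree, y], dim=-1))
        z_graph_cond = self.graph_proj(torch.cat([z_graph, y], dim=-1))
        return z_tree_cond, z_graph_cond
```

To generate a molecule with a target property, one would sample $z_T, z_G \sim \mathcal{N}(0, I)$, set $y$ to the desired (normalized) value, and feed the conditioned latents to the tree and graph decoders.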
Objective
Our goal is to approximately maximize the log-likelihood of both the tree structure and the molecular graph by maximizing a variational lower bound (ELBO), for both observed and unobserved $y$. First, for the case where $y$ is observed, the objective is:
$$\log p_\theta(G, T, y) \geq \mathbb{E}_{q_\phi(z_T, z_G \mid G, T)}\big[\log p_\theta(T \mid y, z_T) + \log p_\theta(G \mid T, y, z_G)\big] + \log p(y) - D_{\mathrm{KL}}\big(q_\phi(z_T, z_G \mid G, T)\,\big\|\,p(z_T)\,p(z_G)\big) = -\mathcal{L}(G, T, y) \quad (2)$$
For the case where the label corresponding to a data point is unobserved, we treat $y$ as a latent variable and approximate its posterior with an inference network $q_\phi(y \mid G, T)$ modelled by neural networks. The detailed derivation can be found in the appendix. The objective for the unlabelled data is:
$$\log p_\theta(G, T) \geq \mathbb{E}_{q_\phi(y \mid G, T)}\big[-\mathcal{L}(G, T, y)\big] + \mathcal{H}\big(q_\phi(y \mid G, T)\big) = -\mathcal{U}(G, T) \quad (3)$$
Following (Kingma et al. 2014), it is desirable to add a loss term so that the distribution $q_\phi(y \mid G, T)$ is also learnt from the labelled dataset. This yields our final objective function:
$$\mathcal{J} = \sum_{(G, T, y) \in \mathcal{D}_L} \mathcal{L}(G, T, y) + \sum_{(G, T) \in \mathcal{D}_U} \mathcal{U}(G, T) + \alpha \sum_{(G, T, y) \in \mathcal{D}_L} \big(y - \hat{y}_\phi(G, T)\big)^2, \quad (4)$$

where $\hat{y}_\phi(G, T)$ is the mean of $q_\phi(y \mid G, T)$ and the hyperparameter $\alpha$ weights the supervised term.
The reconstruction loss follows the JTVAE (Jin, Barzilay, and Jaakkola 2018) loss function, consisting of topological and label prediction terms for the tree structure and the prediction of correct subgraphs for the graph structure. However, unlike JTVAE, the graph decoder and the tree decoder generate data conditioned on the target property. The last term is a mean squared error loss added for supervised property prediction.
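To make the objective concrete, here is a schematic sketch of how the per-batch loss in Eq. 4 could be assembled; labeled_elbo, unlabeled_elbo, and predict_property are hypothetical stand-ins for the corresponding model components, not names from a released implementation.

```python
import torch
import torch.nn.functional as F

def semi_supervised_loss(model, labeled_batch, unlabeled_batch, alpha):
    """Eq. 4: labeled -ELBO + unlabeled -ELBO + alpha * supervised MSE.
    `model` is assumed to expose hypothetical labeled_elbo / unlabeled_elbo /
    predict_property methods; this is a sketch, not the released code."""
    graphs_l, trees_l, y_l = labeled_batch
    graphs_u, trees_u = unlabeled_batch

    # -ELBO for labeled molecules: y is observed (Eq. 2).
    loss_labeled = model.labeled_elbo(graphs_l, trees_l, y_l)

    # -ELBO for unlabeled molecules: y is marginalized via q(y|G,T) (Eq. 3).
    loss_unlabeled = model.unlabeled_elbo(graphs_u, trees_u)

    # Supervised regression term so q(y|G,T) learns from the labels.
    y_pred = model.predict_property(graphs_l, trees_l)
    loss_supervised = F.mse_loss(y_pred, y_l)

    return loss_labeled + loss_unlabeled + alpha * loss_supervised
```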
Table 1: MAE of molecular property prediction with varying percentages of labeled training data (mean ± standard deviation over three repetitions).

| % Labeled | Target property | SSVAE | SeMolePretrained | SeMole | SeMoleSupervised |
|---|---|---|---|---|---|
| 5% | MolWt | | | | |
| | LogP | | | | |
| | QED | | | | |
| 10% | MolWt | | | | |
| | LogP | | | | |
| | QED | | | | |
| 20% | MolWt | | | | |
| | LogP | | | | |
| | QED | | | | |
| 50% | MolWt | | | | |
| | LogP | | | | |
| | QED | | | | |
Pretraining
Semi-supervised variational autoencoders (Kingma et al. 2014) are challenging to train end-to-end due to their multiple stochastic latent variables (Maaløe et al. 2016). However, Kingma et al. stacked a pretrained feature extractor onto their model, which improved performance significantly. Here we propose a pretrained version of SeMole, called SeMolePretrained, obtained by setting the coefficient of the supervised loss, $\alpha$, to zero for the first ten training epochs. Since the supervised loss decreases during training, instead of assigning $\alpha$ a constant value we gradually increase it during training until it reaches its maximum value, which is a hyperparameter.
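A minimal sketch of this schedule, assuming a linear ramp after the ten warm-up epochs (the ramp length and maximum value below are illustrative choices, not reported hyperparameters):

```python
def alpha_schedule(epoch: int, warmup_epochs: int = 10,
                   ramp_epochs: int = 20, alpha_max: float = 1.0) -> float:
    """Supervised-loss coefficient: 0 during pretraining, then a gradual
    linear increase up to alpha_max. Ramp length and alpha_max are
    illustrative hyperparameters, not values from the paper."""
    if epoch < warmup_epochs:
        return 0.0  # pretraining phase: train the VAE only
    progress = min(1.0, (epoch - warmup_epochs) / ramp_epochs)
    return alpha_max * progress
```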
Experiments
We evaluate the effectiveness of our proposed methods on two tasks: molecular property prediction and conditional generation of molecules with desired properties.
We vary the percentage of labeled data (5%, 10%, 20%, and 50%) in the molecular property prediction experiments to evaluate the semi-supervised component of our method. We hold out 5% of the training set for validation. The property prediction task is evaluated by Mean Absolute Error (MAE) on the test set, which consists of 10,000 molecules.
To match the prior distribution of target properties in the model, we normalize the properties to have a mean of 0 and a standard deviation of 1. We set the batch size to 16 and the learning rate to 0.001. The dimensions of $z_T$ and $z_G$ are set to 56. The tree encoder consists of two GRU networks, and the graph decoder is a message passing neural network.
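For instance, the label normalization can be implemented with training-set statistics only, so that the prior $p(y) = \mathcal{N}(0, 1)$ matches the observed labels (a sketch with illustrative names):

```python
import numpy as np

def normalize_properties(y_train: np.ndarray, y_other: np.ndarray):
    """Standardize properties using training-set statistics only,
    so the prior p(y) = N(0, 1) matches the observed labels."""
    mu, sigma = y_train.mean(axis=0), y_train.std(axis=0)
    return (y_train - mu) / sigma, (y_other - mu) / sigma, (mu, sigma)
```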
We use SSVAE (Kang and Cho 2018) as the baseline; it is also a semi-supervised variational autoencoder but represents molecules as SMILES strings instead of molecular graphs. Its results are copied from the original paper, since we use the same experimental design as the baseline. We further perform an ablation study with three different versions of our proposed model to assess the impact of semi-supervision and pretraining.
We developed a supervised version of SeMole by eliminating the unlabeled data from training to show the impact of semi-supervised learning as a solution to label scarcity. SeMoleSupervised therefore takes only the labeled portion of the training data as input. Our goal is to show the benefit of leveraging unlabeled data compared to purely supervised training.
We use the models trained on 50% labeled data to generate molecules conditioned on target properties, setting the properties to different specific values. We also generate molecules unconditionally. Following our baseline (Kang and Cho 2018), during generation we check the validity of the generated molecules using the RDKit package (Landrum 2016) and discard molecules that are invalid, already exist in the training data, or were already generated by the decoders. We continue this process until we have generated 3,000 molecules or we reach the limit of 10,000 generated molecules. We then label the generated molecules for the target properties to assess whether their properties are close to the targets.
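A sketch of this generation-and-filtering loop, using standard RDKit calls for validity checking and canonicalization; model.decode is a hypothetical interface for decoding latents conditioned on the target property:

```python
from rdkit import Chem

def generate_molecules(model, target_y, n_target=3000, max_attempts=10000,
                       training_smiles=frozenset()):
    """Sample until n_target valid, unique, novel molecules are collected
    or max_attempts decodes have been made. `model.decode` is a
    hypothetical interface for decoding (z_T, z_G, y) into SMILES."""
    generated, attempts = [], 0
    seen = set(training_smiles)
    while len(generated) < n_target and attempts < max_attempts:
        attempts += 1
        smiles = model.decode(target_y)  # decode latents conditioned on y
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            continue  # discard invalid molecules
        canonical = Chem.MolToSmiles(mol)
        if canonical in seen:
            continue  # discard duplicates and training-set molecules
        seen.add(canonical)
        generated.append(canonical)
    return generated
```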
Dataset
We use 310,000 drug-like molecules sampled from ZINC (Sterling and Irwin 2015), the same dataset used to evaluate SSVAE (Kang and Cho 2018). Following the literature, we use three chemical properties that can be computed with the RDKit package (Landrum 2016): molecular weight (MolWt), the Wildman-Crippen partition coefficient (LogP), and the quantitative estimate of drug-likeness (QED).
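All three labels can be computed directly with RDKit; for example:

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, QED

def label_molecule(smiles: str):
    """Compute the three target properties used in the experiments."""
    mol = Chem.MolFromSmiles(smiles)
    return {
        "MolWt": Descriptors.MolWt(mol),   # molecular weight
        "LogP": Descriptors.MolLogP(mol),  # Wildman-Crippen partition coefficient
        "QED": QED.qed(mol),               # quantitative estimate of drug-likeness
    }
```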
Results
Table 1 reports the MAE of the molecular property prediction task with varying percentages of labels in the training dataset. We repeated each experiment three times and report the mean and standard deviation of the MAE. The pretrained version of SeMole outperformed SSVAE in most cases. SeMolePretrained achieved better performance on LogP and QED than on MolWt. The results show the effectiveness of pretraining, since SeMolePretrained outperforms SeMole in all cases except two experiments where SeMole performs slightly better. SeMolePretrained substantially outperforms SeMoleSupervised, showing the benefit of semi-supervision when labeled data is limited. The performance of SeMolePretrained relative to SSVAE shows the benefit of representing molecules as graphs and of pretraining semi-supervised generative models for molecular property prediction with partial supervision.
We also compared the percentages of valid, unique, and novel molecules generated by these methods. The results show that SSVAE and SeMolePretrained perform better than SeMoleSupervised and SeMole. The main distinction between SeMolePretrained and SSVAE is that SeMolePretrained generates 100% valid molecules. SeMolePretrained also outperforms SSVAE on the percentage of generated molecules whose properties fall within 5% of the target values. The table of generation results is not included due to the page limit.
Conclusion
We have proposed SeMole, a semi-supervised generative model for molecular graphs. We augmented a state-of-the-art generative model for molecular graphs, JTVAE, with semi-supervised learning, and added a pretraining phase to improve the training of the semi-supervised generative model. We performed experiments on molecular property prediction and conditional generation using only limited labeled data, for three properties on the ZINC dataset. SeMolePretrained outperforms SSVAE on most molecular property prediction tasks and generates 100% valid molecules conditioned on target properties. The ablation study showed the effectiveness of semi-supervision over supervised methods when labeled data is limited. We also demonstrated the efficacy of our pretraining phase for training a semi-supervised VAE.
This work suggests several directions for future research. SeMolePretrained, like other methods in the literature, predicts only a single molecular property and generates molecules conditioned on a single property. Predicting multiple molecular properties could improve prediction accuracy and would also be beneficial for discovering molecules with multiple target properties. Moreover, extending the 2D representation of molecules to 3D could help learn richer latent representations, leading to more accurate predictions and improved molecule generation.
References
- Dai et al. (2018) Dai, H.; Tian, Y.; Dai, B.; Skiena, S.; and Song, L. 2018. Syntax-directed variational autoencoder for structured data. arXiv preprint arXiv:1802.08786.
- Gilmer et al. (2017) Gilmer, J.; Schoenholz, S. S.; Riley, P. F.; Vinyals, O.; and Dahl, G. E. 2017. Neural message passing for quantum chemistry. In International Conference on Machine Learning, 1263–1272. PMLR.
- Gómez-Bombarelli et al. (2018) Gómez-Bombarelli, R.; Wei, J. N.; Duvenaud, D.; Hernández-Lobato, J. M.; Sánchez-Lengeling, B.; Sheberla, D.; Aguilera-Iparraguirre, J.; Hirzel, T. D.; Adams, R. P.; and Aspuru-Guzik, A. 2018. Automatic chemical design using a data-driven continuous representation of molecules. ACS central science, 4(2): 268–276.
- Habib et al. (2019) Habib, R.; Mariooryad, S.; Shannon, M.; Battenberg, E.; Skerry-Ryan, R.; Stanton, D.; Kao, D.; and Bagby, T. 2019. Semi-supervised generative modeling for controllable speech synthesis. arXiv preprint arXiv:1910.01709.
- Hao et al. (2020) Hao, Z.; Lu, C.; Huang, Z.; Wang, H.; Hu, Z.; Liu, Q.; Chen, E.; and Lee, C. 2020. ASGN: An active semi-supervised graph neural network for molecular property prediction. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 731–752.
- Hu et al. (2019) Hu, W.; Liu, B.; Gomes, J.; Zitnik, M.; Liang, P.; Pande, V.; and Leskovec, J. 2019. Strategies for pre-training graph neural networks. arXiv preprint arXiv:1905.12265.
- Jin, Barzilay, and Jaakkola (2018) Jin, W.; Barzilay, R.; and Jaakkola, T. 2018. Junction tree variational autoencoder for molecular graph generation. In International Conference on Machine Learning, 2323–2332. PMLR.
- Kang and Cho (2018) Kang, S.; and Cho, K. 2018. Conditional molecular design with deep generative models. Journal of chemical information and modeling, 59(1): 43–52.
- Kingma et al. (2014) Kingma, D. P.; Rezende, D. J.; Mohamed, S.; and Welling, M. 2014. Semi-supervised learning with deep generative models. arXiv preprint arXiv:1406.5298.
- Kingma and Welling (2019) Kingma, D. P.; and Welling, M. 2019. An introduction to variational autoencoders. arXiv preprint arXiv:1906.02691.
- Kusner, Paige, and Hernández-Lobato (2017) Kusner, M. J.; Paige, B.; and Hernández-Lobato, J. M. 2017. Grammar variational autoencoder. In International Conference on Machine Learning, 1945–1954. PMLR.
- Landrum (2016) Landrum, G. 2016. RDKit: Open-Source Cheminformatics Software.
- Ma, Chen, and Xiao (2018) Ma, T.; Chen, J.; and Xiao, C. 2018. Constrained generation of semantically valid graphs via regularizing variational autoencoders. arXiv preprint arXiv:1809.02630.
- Maaløe et al. (2016) Maaløe, L.; Sønderby, C. K.; Sønderby, S. K.; and Winther, O. 2016. Auxiliary deep generative models. In International conference on machine learning, 1445–1453. PMLR.
- Olivecrona et al. (2017) Olivecrona, M.; Blaschke, T.; Engkvist, O.; and Chen, H. 2017. Molecular de-novo design through deep reinforcement learning. Journal of cheminformatics, 9(1): 1–14.
- Popova, Isayev, and Tropsha (2018) Popova, M.; Isayev, O.; and Tropsha, A. 2018. Deep reinforcement learning for de novo drug design. Science advances, 4(7): eaap7885.
- Shi et al. (2020) Shi, C.; Xu, M.; Zhu, Z.; Zhang, W.; Zhang, M.; and Tang, J. 2020. Graphaf: a flow-based autoregressive model for molecular graph generation. arXiv preprint arXiv:2001.09382.
- Siddharth et al. (2017) Siddharth, N.; Paige, B.; Van de Meent, J.-W.; Desmaison, A.; Goodman, N. D.; Kohli, P.; Wood, F.; and Torr, P. H. 2017. Learning disentangled representations with semi-supervised deep generative models. arXiv preprint arXiv:1706.00400.
- Sterling and Irwin (2015) Sterling, T.; and Irwin, J. J. 2015. ZINC 15–ligand discovery for everyone. Journal of chemical information and modeling, 55(11): 2324–2337.
- Sun et al. (2019) Sun, F.-Y.; Hoffmann, J.; Verma, V.; and Tang, J. 2019. Infograph: Unsupervised and semi-supervised graph-level representation learning via mutual information maximization. arXiv preprint arXiv:1908.01000.
- Weininger (1988) Weininger, D. 1988. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of chemical information and computer sciences, 28(1): 31–36.
- Wu et al. (2020) Wu, Z.; Pan, S.; Chen, F.; Long, G.; Zhang, C.; and Philip, S. Y. 2020. A comprehensive survey on graph neural networks. IEEE transactions on neural networks and learning systems, 32(1): 4–24.
- Yang et al. (2019) Yang, K.; Swanson, K.; Jin, W.; Coley, C.; Eiden, P.; Gao, H.; Guzman-Perez, A.; Hopper, T.; Kelley, B.; Mathea, M.; et al. 2019. Analyzing learned molecular representations for property prediction. Journal of chemical information and modeling, 59(8): 3370–3388.
- You et al. (2018a) You, J.; Liu, B.; Ying, R.; Pande, V.; and Leskovec, J. 2018a. Graph convolutional policy network for goal-directed molecular graph generation. arXiv preprint arXiv:1806.02473.
- You et al. (2018b) You, J.; Ying, R.; Ren, X.; Hamilton, W.; and Leskovec, J. 2018b. Graphrnn: Generating realistic graphs with deep auto-regressive models. In International conference on machine learning, 5708–5717. PMLR.