
FSscore: A Personalized Machine Learning-based Synthetic Feasibility Score

Rebecca M. Neeser, Laboratory of Artificial Chemical Intelligence (LIAC), EPFL, Switzerland; Bruno Correia, Laboratory of Protein Design and Immunoengineering (LPDI), EPFL, Switzerland; Philippe Schwaller, Laboratory of Artificial Chemical Intelligence (LIAC), EPFL, Switzerland

Additional affiliations: Laboratory of Protein Design and Immunoengineering (LPDI), EPFL, Switzerland; National Centre of Competence in Research (NCCR) Catalysis, EPFL, Switzerland

S1 Code and data availability

All code used for training, fine-tuning, and scoring the FSscore is available at https://github.com/schwallergroup/fsscore. This repository also includes an application that can be run locally and allows a more intuitive and accessible way to label data, fine-tune, and deploy a model.

S1.1 Application

The application allows a user to perform all steps necessary to obtain a personalized model as intended. These steps involve: pairing molecules, labeling pairs, fine-tuning the pre-trained model, and ultimately scoring molecules. The application is built using Streamlit 1 but is deployed locally. This means that its speed is limited by the computer on which the user runs the app. All models were trained on one GPU (NVIDIA GeForce RTX 3090), but the application will also run on a computer without a GPU at the cost of a much longer running time. Furthermore, running it locally requires the user to install the package, which we recommend doing with a conda environment and pip. All the necessary steps and usage are outlined below.
1. Installation
We recommend running the app with the packages stored in a conda environment (Conda installation guide: https://conda.io/projects/conda/en/latest/user-guide/install/index.html). First, clone the GitHub repository from the link above (alternatively, download the zip file):

git clone https://github.com/schwallergroup/fsscore.git

Next, change the directory to the repository and install all required packages, like so:

cd fsscore
conda create -n fsscore python=3.10
conda activate fsscore
pip install -r requirements.txt
pip install -e .

2. Start the application
Starting the application only requires one line of code (run from the root directory of the repository), which will cause a window to pop up or return an active link:

streamlit run streamlit_app/run.py

From the landing page, select Label molecules in the upper drop-down menu on the left. This will lead to the selection of the next steps. If a dataset with labels is available, go to step 4; otherwise, continue with the next step. To directly score molecules with an already fine-tuned model, jump to step 6.
3. Pair molecules
In this step, a list of molecules provided as SMILES is first paired and then ranked based on the uncertainty of the pre-trained model. This second step will take some time if not performed on a machine with GPU support. Select the option on the left and provide the path to the csv file (Figure S1). This file should have smiles as the column header for the SMILES structures to be paired.

Figure S1: Selection window to either pair molecules of a new dataset or continue with an already paired dataset.

4. Label molecule pairs
If a dataset with paired molecules is available, place it in streamlit_app/data/unlabeled. The dataset should appear in the drop-down menu on the right. Make sure to provide the paired molecules with column headers smiles_i and smiles_j.
The labeling window (see Figure S2) prompts the user to select the molecule that is harder to synthesize before going to the next pair. Once the user has finished labeling, the Send labels button can be pressed. The labeling process can also be continued later, and the labeled data can be downloaded. Note: the more labeled pairs, the better the performance of the fine-tuned model.

Figure S2: Screenshot of labeling interface. The user is prompted to select the molecule that is harder to synthesize using the buttons below before proceeding to the next pair. Once finished, the user can send the labels with the respective button.

5. Fine-tune
Once data is labeled, one can proceed to fine-tuning. By default, this is done using the best pre-trained model from this work. The user can also place their own checkpoint at the same location (folder models). How to pre-train a new model is described in our GitHub repository. Before initializing fine-tuning, one can tune several hyperparameters. If nothing is adjusted, the (recommended) default parameters used in this work are selected (Figure S3). Fine-tuning is generally very fast, but depending on the system it can take a few minutes. Proceeding to score molecules is achieved by selecting the respective action in the upper drop-down menu. This step can occasionally cause issues, in which case we recommend returning to the home page before trying again.

Figure S3: Screenshot of the window initializing fine-tuning. The dataset to fine-tune with is selected in the drop-down menu on the left, and the path to the pre-trained model can be changed under point 2. The sliders under point 3 allow the user to tune the hyperparameters.

6. Score molecules
Lastly, one can deploy the previously fine-tuned model and score molecules. For this, the user may place a csv file containing the column header smiles in streamlit_app/data/scoring. Then, the user can select the correct dataset in the drop-down menu on the left as well as the desired fine-tuned model (Figure S4).

Figure S4: Screenshot of settings to score molecules given a dataset and a fine-tuned model.

S2 Machine Learning Methods

The novel Focused Synthesizability score (FSscore) is trained in two stages. First, we train a baseline score assessing synthesizability using a graph representation and a suitable message passing scheme, improving expressivity over similar frameworks such as the SCScore. Second, we introduce our fine-tuning approach using human expert knowledge, allowing us to focus the score on a specific chemical space of interest.

S2.1 Architecture and representation

Our approach to learning a continuous score to assess synthesizability is inspired by Choung et al. 2, who framed a similar task as a ranking problem using binary preferences. Specifically, every data point consists of two molecules, for each of which we predict a scalar in separate forward passes. The minimization of the binary cross entropy between the true preference and the learned score difference $\delta_{ij} := \hat{f}(m_i) - \hat{f}(m_j)$ (scaled using a sigmoid function) constitutes our training objective.
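As an illustration, this pairwise ranking objective can be sketched in PyTorch as follows. This is a minimal sketch, not the actual FSscore implementation: score_net, the featurized molecule batches, and the preference labels pref_ij are hypothetical placeholders.

import torch
import torch.nn.functional as F

def pairwise_ranking_loss(score_net, mol_i, mol_j, pref_ij):
    # scalar scores f(m_i) and f(m_j), predicted in separate forward passes
    s_i = score_net(mol_i)
    s_j = score_net(mol_j)
    delta_ij = s_i - s_j  # learned score difference
    # binary cross entropy between the true preference and sigmoid(delta_ij)
    return F.binary_cross_entropy_with_logits(delta_ij, pref_ij.float())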

The function $f: \mathcal{M} \rightarrow \mathbb{R}$ learns to parametrize molecules as an expressive latent representation given a set of molecules $m_1, ..., m_n \in \mathcal{M}$. We represent molecules as graphs $G = (\mathcal{V}, \mathcal{E})$ with atoms as nodes $x_1, ..., x_n \in \mathcal{V}$ and bonds as edges $e_1, ..., e_m \in \mathcal{E}$, from which we can compute the line graph $L(G)$ offline, iteratively up to the desired depth. The transformation to the line graph is defined such that the edges $\{e_1, ..., e_m\}$ of graph $G$ become the nodes of the line graph, and two such nodes are connected if the corresponding edges of $G$ share a node. The graph neural network (GNN) embedding the molecular graph combines two message passing schemes: the graph attention network (GATv2) 3 operating on the graph $G$ and the Line Evolution (LineEvo) 4 layer operating on the line graph $L(G)$. Both GATv2 and LineEvo use an attention mechanism to update the node representations $\{h_1, ..., h_n\}$ as follows:

e(h_i, h_j) = a^T \sigma(W_i h_i \,||\, W_j h_j) (S1)

where $a \in \mathbb{R}^{2d'}$ and $W \in \mathbb{R}^{d' \times d}$ are learned, $\sigma$ denotes an activation function, and $||$ denotes concatenation. The GNN operates on a hidden size of 128; the input dimensions depend on the initial featurization of the edges and nodes (see Tab. S1). GATv2 layers use 8 heads and LeakyReLU as activation function after updating the hidden feature vector, while LineEvo layers use ELU. In GATv2, these local attention scores $e_{ij}$ are then normalized across all neighbors $\mathcal{N}_i$ using a softmax (see Eq. S2) to obtain a normalized attention coefficient $\alpha_{ij}$, which is used to compute the updated node representation $h_i'$ as a weighted average (see Eq. S3):

\alpha_{ij} = \text{softmax}_j(e(h_i, h_j)) = \frac{\exp(e(h_i, h_j))}{\sum_{j \in \mathcal{N}_i} \exp(e(h_i, h_j))} (S2)
h_i' = \sigma\left(\sum_{j \in \mathcal{N}_i} \alpha_{ij} W h_j\right) (S3)

with PReLU 5 as nonlinearity $\sigma$. In LineEvo layers, $e(h_i, h_j)$ is simply transformed using ELU to obtain the new node representation $h_i'$.
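To make Eqs. S1-S3 concrete, the following is a minimal, self-contained sketch of a single-head attention update for one node and its neighbors in PyTorch. Dimensions and variable names are illustrative only, and a single shared weight matrix W is used for brevity (Eq. S1 allows separate W_i and W_j).

import torch
import torch.nn.functional as F

d, d_prime = 16, 32                 # illustrative input and hidden sizes
W = torch.randn(d_prime, d)         # learned linear transformation (Eq. S1)
a = torch.randn(2 * d_prime)        # learned attention vector (Eq. S1)

h_i = torch.randn(d)                # representation of node i
h_js = torch.randn(5, d)            # representations h_j of 5 neighbors of i

# Eq. S1: e(h_i, h_j) = a^T sigma(W h_i || W h_j), with LeakyReLU as sigma
Wh_i = W @ h_i
Wh_js = h_js @ W.T
e_ij = F.leaky_relu(torch.cat([Wh_i.expand_as(Wh_js), Wh_js], dim=-1)) @ a

# Eq. S2: softmax over the neighborhood gives attention coefficients alpha_ij
alpha_ij = torch.softmax(e_ij, dim=0)

# Eq. S3: weighted average of transformed neighbors, with PReLU as nonlinearity
h_i_new = F.prelu(alpha_ij @ Wh_js, torch.tensor([0.25]))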

These transformation layers are stacked so that two GATv2 (G) layers are followed by one LineEvo (L) layer (GGLGGL). Each of these layers is followed by a readout function, which consists of a global max pooling layer and global weighted-add pooling, as suggested by Ren et al. 4, obtaining a molecular representation through concatenation of the two readouts. The intermediate molecular representations of all layers are summed, and the final score $s_i = \hat{f}(m_i)$ is inferred with a multilayer perceptron (MLP) (3 hidden layers of size 256, ReLU as activation function). The model described above was compared to five other implementations: GATv2 layers only (GGG) and four fingerprint implementations, namely Morgan (boolean), Morgan counts, Morgan chiral (boolean), and Morgan chiral counts, all with radius 4 and 2048 bits. 6 The fingerprints are embedded by one linear layer (hidden size of 256) with ReLU as activation function, followed by the aforementioned MLP. More detailed information can be found in Section S2.3.

S2.2 Pairing molecules for fine-tuning

To focus our scoring model on a desired chemical space, we apply a fine-tuning approach inspired by Choung et al. 2. This approach was optimized to require few data points in order to limit the time required from expert chemists labeling the training examples when labels are based on human feedback. Datasets used for fine-tuning can be of various origins, such as a chemical scope suggested by experimentalists or specific chemical spaces encompassing e.g. natural products. Training our model requires pairs of molecules with a binary preference as label. To construct such pairs, we first cluster a dataset using the k-means algorithm based on the Tanimoto distances of Morgan count fingerprints. The number of clusters k was determined for every dataset individually as the k for which the mean Silhouette coefficient over a range of k was the smallest. 7 The queried range was set to 5 to 29 for datasets larger than 500 pairs and to 3 to 9 for smaller datasets. Secondly, molecules are paired such that they come from different clusters, appear only once in the paired dataset, and have opposite labels if labels are available. We also investigated the influence of having overlapping pairs (molecules can appear in multiple pairs) for the SYBA (CP and MC) and the meanComplexity datasets. This approach results in more fine-tuning data but does not satisfy our desire to keep the dataset size small, since human labeling could not make use of this advantage. Subsequently, the uncertainty of the pre-trained model's prediction on those pairs is determined based on the variance of $\delta_{ij}$ obtained using the Monte Carlo dropout method 8 with a dropout rate of 0.2 on 100 predictions. The dataset is sorted by descending variance so that the pairs with high uncertainty are labeled first or used first for fine-tuning when selecting a subset. If no label is available, the top n pairs (n depending on the dataset size) are submitted to be evaluated by our expert chemist based on their preference with regard to synthesizability.
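The uncertainty-based ranking step could look roughly like the sketch below. The pre-trained model and the featurized pairs are placeholders; the dropout rate of 0.2 (set in the model) and the 100 stochastic forward passes follow the description above.

import torch

def rank_pairs_by_uncertainty(model, pairs, n_samples=100):
    # Monte Carlo dropout: keep dropout layers active at inference time
    model.train()
    variances = []
    with torch.no_grad():
        for mol_i, mol_j in pairs:
            deltas = torch.stack([model(mol_i) - model(mol_j) for _ in range(n_samples)])
            variances.append(deltas.var().item())  # variance of delta_ij
    # sort with descending variance: most uncertain pairs are labeled/used first
    order = sorted(range(len(pairs)), key=lambda k: variances[k], reverse=True)
    return [pairs[k] for k in order]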

S2.3 Training details

To center the obtained score around zero, a regularization factor of $\lambda$ = 1e-4 was applied to the predicted score to obtain the regularization loss $\mathcal{L}_{reg}(\hat{f}; \lambda) := \lambda ||\hat{f}||^2$, as suggested by Choung et al. 2. $\mathcal{L}_{reg}$ is added to the cross-entropy loss term described in Section S2.1 to obtain the final loss $\mathcal{L} = \mathcal{L}_{CE} + \mathcal{L}_{reg}$.

Pre-training: The training set was randomly split into 25 equally sized subsets, and the model was trained on each subset for 10 epochs, totaling 250 epochs, with a batch size of 128. This sequential learning approach was found to work well, increasing the speed of training. The model was trained with the Adam optimizer 9 and an initial learning rate of 3e-4.

Fine-tuning: We chose a transfer learning type of fine-tuning where all weights of the pre-trained model can be adapted. All hyperparameters except the learning rate and batch size are kept the same. In order to avoid forgetting the previously learned knowledge, we track the performance metrics on a random subset of 5,000 pairs from the hold-out test set of the pre-training data. Of the four tested initial learning rates (1e-5, 1e-4, 3e-4, 1e-3), we chose 1e-4 for the graph-based version and 3e-4 for the fingerprint-based models. These learning rates offer a good trade-off between degradation of the previously learned knowledge and learning on the new dataset, as can be seen in the learning curves in Appendix S8. We observed quick convergence within fewer than 20 epochs, and in order to balance the aforementioned trade-off a custom early stopping method was applied. Training is stopped once either the validation loss (or the training loss if the full dataset is used for production) has increased for 3 epochs (patience of 3) or the accuracy on the 5k subset of the hold-out test set has decreased for 3 epochs with a delta threshold of 0.02. We trained for a maximum of 20 epochs. To evaluate data efficiency, we further varied the number of data points used for training, of which we took 5 random pairs for validation. The results in Table S2 show that performance improves with increasing dataset size. The batch size was set to 4.
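A simplified sketch of this custom early-stopping rule is shown below; the bookkeeping class is illustrative only, and the exact interpretation of the 0.02 delta threshold is an assumption (here: a drop of more than 0.02 relative to the best hold-out accuracy).

class CustomEarlyStopping:
    def __init__(self, patience=3, delta=0.02):
        self.patience, self.delta = patience, delta
        self.best_loss, self.best_acc = float("inf"), 0.0
        self.loss_bad_epochs, self.acc_bad_epochs = 0, 0

    def step(self, val_loss, holdout_acc):
        # consecutive epochs in which the (validation or training) loss got worse
        self.loss_bad_epochs = self.loss_bad_epochs + 1 if val_loss > self.best_loss else 0
        self.best_loss = min(self.best_loss, val_loss)
        # consecutive epochs in which the hold-out accuracy dropped by more than delta
        self.acc_bad_epochs = self.acc_bad_epochs + 1 if holdout_acc < self.best_acc - self.delta else 0
        self.best_acc = max(self.best_acc, holdout_acc)
        # stop once either criterion has been violated for `patience` epochs
        return self.loss_bad_epochs >= self.patience or self.acc_bad_epochs >= self.patience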

S2.4 Initial featurization of graphs

Table S1: Features for initializing the nodes and edges of the graph. One-hot encodings include one additional bit for types not found in the listed choices (included in the stated size). These properties were determined using RDKit 10.
Localization Feature Description Size
atom atom type one-hot encoded atom type (considered atom types: H, B, C, N, O, F, Si, P, S, Cl, Br, I, Se) 14
charge one-hot encoded formal charge [-4,4] 10
implicit hydrogens one-hot encoded number of hydrogens [0,4] 6
degree one-hot encoded degree [0,4] 6
ring information 1 if in ring else 0 1
aromaticity 1 if aromatic else 0 1
hybridization one-hot encoded hybridization state (considered hybridization states: UNSPECIFIED, s, sp, sp2, sp3, sp3d, sp3d2, other) 9
chiral tag one-hot encoded chiral tag [S, R, unassigned] 4
bond bond type one-hot encoded bond type (considered bond types: single, double, triple, aromatic) 5
conjugated 1 if conjugated else 0 1
ring information 1 if in ring else 0 1
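As an illustration of the one-hot scheme in Table S1, the following RDKit-based sketch builds a few of the atom features for a toy molecule; it is not the exact featurizer used in the codebase.

from rdkit import Chem

ATOM_TYPES = ["H", "B", "C", "N", "O", "F", "Si", "P", "S", "Cl", "Br", "I", "Se"]

def one_hot(value, choices):
    # one additional trailing bit encodes "not found in the choices"
    encoding = [0] * (len(choices) + 1)
    index = choices.index(value) if value in choices else len(choices)
    encoding[index] = 1
    return encoding

mol = Chem.MolFromSmiles("c1ccccc1O")  # phenol as a toy example
for atom in mol.GetAtoms():
    features = (
        one_hot(atom.GetSymbol(), ATOM_TYPES)                 # atom type, size 14
        + [int(atom.IsInRing()), int(atom.GetIsAromatic())]   # ring and aromaticity flags
    )
    print(atom.GetSymbol(), features)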

S3 Data

S3.1 Pre-training data

To pre-train our model on a large collection of reactions, we combine the USPTO_full 11 patent dataset with a complementary dataset 12. Reaction data implicitly contains information on synthetic feasibility through the relation of reactant to product, with the product being synthetically more difficult. The USPTO_full dataset was downloaded according to https://github.com/coleygroup/Graph2SMILES/blob/main/scripts/download_raw_data.py 13 using all USPTO_full subsets and "src" as product and "tgt" as reactants. The dataset was split so that every data point is a pair of one reactant and its respective product, and the SMILES strings were canonicalized. The datasets do not contain reactions with multiple products. Data points with empty strings, identical reactants and products (isomeric SMILES), either reactant or product containing fewer than four heavy atoms, or element types not in {H, B, C, N, O, F, Si, P, S, Cl, Se, Br, I} were removed. The datasets were further deduplicated, retaining one instance of each replicate. Furthermore, data points were removed to ensure that there are no cycles in the reaction network. This was achieved by removing back edges with the Depth-First Search (DFS) algorithm as implemented by Sun et al. 14. The filtered dataset consisted of 5,340,704 data points. The train-test split was performed so that no molecules (not just reactant-product pairs) overlap between the splits, which prevents data leakage at the cost of a significant loss of data. The training set consists of 3,349,455 pairs and the hold-out test set of 711,550 pairs (17.5%).
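Parts of this filtering (canonicalization, the minimum heavy-atom count, and the allowed-element check) are sketched below with RDKit; the function is a rough illustration only, and the deduplication and cycle-removal steps are omitted.

from rdkit import Chem

ALLOWED_ELEMENTS = {"H", "B", "C", "N", "O", "F", "Si", "P", "S", "Cl", "Se", "Br", "I"}

def clean_pair(reactant_smiles, product_smiles):
    # returns a canonical (reactant, product) pair or None if it is filtered out
    pair = []
    for smi in (reactant_smiles, product_smiles):
        mol = Chem.MolFromSmiles(smi)
        if mol is None or mol.GetNumHeavyAtoms() < 4:
            return None
        if any(atom.GetSymbol() not in ALLOWED_ELEMENTS for atom in mol.GetAtoms()):
            return None
        pair.append(Chem.MolToSmiles(mol))  # canonical isomeric SMILES
    reactant, product = pair
    if reactant == product:  # identical reactant and product
        return None
    return reactant, product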

To qualitatively evaluate the pre-trained model, its performance on the MOSES 15 and COCONUT 16 datasets was assessed. We downloaded the MOSES test set from Polykovskiy et al. 15 as is, totaling 176,074 SMILES. COCONUT was extracted from Sorokina et al. 16; all SMILES returning None with RDKit 10 were removed, and a random subset of 176,000 molecules was selected to match the MOSES fraction.

S3.2 Fine-tuning data

The data for the fine-tuning case studies come from various sources. The specific processing steps for every dataset are outlined below.

Chirality: To obtain a test set for the chirality case study, we filtered the deduplicated molecules from the pre-training training set, keeping only those with @ or @@ tokens in the SMILES string, which define the chirality of a tetrahedral stereocenter. Of those, we randomly sampled 1,000 SMILES and obtained their partners as the corresponding non-isomeric SMILES generated with RDKit 10.
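The non-isomeric partner of a chiral SMILES can be generated with RDKit as in this brief sketch (the example molecule is arbitrary):

from rdkit import Chem

smiles = "C[C@H](N)C(=O)O"  # alanine with a chiral tag, as an example
mol = Chem.MolFromSmiles(smiles)
flat_smiles = Chem.MolToSmiles(mol, isomericSmiles=False)  # drops the @/@@ tokens
print(smiles, "->", flat_smiles)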

SYBA sets: The CP and MC test sets were extracted from Voršilák et al. 17. No further processing was required.

meanComplexity: The meanComplexity dataset was downloaded from Sheridan et al. 18 and cleaned by removing SMILES whose conversion to an RDKit molecule object failed. This processing step removed 44 SMILES, yielding a fine-tuning test set of 1,731 molecules.

PROTACs: For the PROTAC case study, we extracted all PROTAC SMILES from PROTAC-DB (Weng et al. 19) and deduplicated the dataset keeping the first instance, resulting in 3,270 SMILES. After pairing and ranking by uncertainty as described in Appendix S2.2, 100 pairs were labeled by a medicinal chemist with expertise in PROTACs. In order to compare the scores of the individual components to the full PROTAC, as done in the PROTAC bar plot of the main text, the individual fragments (the two ligands and the linker) have to be extracted. For this, we extracted the collections containing information on anchor and warhead (the two ligands) from PROTAC-DB and tried to match an anchor and warhead to every PROTAC by matching the protein target ID and finding the biggest substructure match. The linker was obtained by removing the ligands' atoms. Of the 3,270 PROTACs in the database, we could find matching ligands for 1,920. This extraction approach is necessary because PROTAC-DB does not cross-reference the PROTACs to the fragments; our extraction cannot guarantee concordance with the reported warhead and anchor in the respective publications, but it is still a good approximation.

S4 De novo design case study

To assess the applicability of the FSscore to a generative task, we took advantage of the RL framework of REINVENT 20 and used the recently proposed augmented memory 21 optimization protocol. First, an agent was trained with the composite (equal weights) of the docking score to the D2 dopamine receptor (DRD2, PDB ID: 6CM4) and the molecular weight (MW) as a reward. The docking score was determined using AutoDock Vina 22 on one conformer each. Conformer generation is performed with RDKit 10 and the Universal Force Field (UFF) 23 for energy minimization with a maximum of 600 iterations. The score $x$ was scaled to obtain a reward $x'$ in the range [0,1] by applying a sigmoid transformation as follows:

x' = \frac{1}{1 + 10^{10k \cdot \frac{x - \frac{a+b}{2}}{a-b}}} (S4)

with $a$ corresponding to a docking score $x$ of -1, $b$ to -13, and the steepness $k$ set to 0.25, resulting in a reverse sigmoid (a low docking score results in a reward $x'$ towards 1, while a high score returns a reward $x'$ close to 0). The score for MW was formulated so that a high reward is returned for a MW between 0 and 500 Da, using a double sigmoid:

x' = \frac{10^{c_{SE}\frac{x}{c_{div}}}}{10^{c_{SE}\frac{x}{c_{div}}} + 10^{c_{SE}\frac{b}{c_{div}}}} - \frac{10^{c_{SI}\frac{x}{c_{div}}}}{10^{c_{SI}\frac{x}{c_{div}}} + 10^{c_{SI}\frac{a}{c_{div}}}} (S5)
A = 10^{c_{SE}\frac{x}{c_{div}}} (S6)
B = 10^{c_{SE}\frac{x}{c_{div}}} + 10^{c_{SE}\frac{b}{c_{div}}} (S7)
C = \frac{10^{c_{SI}\frac{x}{c_{div}}}}{10^{c_{SI}\frac{x}{c_{div}}} + 10^{c_{SI}\frac{a}{c_{div}}}} (S8)
x' = \frac{A}{B} - C (S9)

with coefficients $c_{SE} = c_{SI} = 500$ and $c_{div} = 250$, the lower bound $a = 0$ and upper bound $b = 500$. RL was carried out for 150 epochs with a batch size of 64, a learning rate of 1e-4, a $\sigma$ of 128, and augmented memory as optimization algorithm with 2 augmentation rounds and selective memory purge with a minimum similarity of 0.4 and a bin size of 10 based on the identical Murcko scaffold. For details on augmented memory in REINVENT we refer to Guo and Schwaller 21. For our first approach (FSscore optimization), all valid SMILES (9,246) generated during this optimization process were subsequently used to fine-tune the FSscore as described above. The agent already optimized for docking was next used to optimize for synthetic feasibility, either with the fine-tuned FSscore or with the SA score as a comparison. Both approaches were trained in identical fashion with the following parameters: 50 epochs, batch size of 128, $\sigma$ of 128, a learning rate of 1e-4, and the augmented memory algorithm with 2 rounds of augmentation. The diversity filter (incl. selective memory purge) was based on identical Murcko scaffolds with a minimal similarity of 0.4, a minimal score of 0.4, and a bucket size of 25. The FSscore was transformed to range from 0 to 1 with a sigmoid according to Equation S4 with $a = -13.08$, $b = 10.07$, and a steepness $k = 0.25$. The SA score was transformed similarly, reverting the sigmoid by setting $a = 10$, $b = 1$, and $k = 0.25$. Thus, a high FSscore results in a high reward, while for the SA score a low value results in a high reward, and vice versa. As mentioned above, only molecules with a minimal score of 0.4 were collected and subsequently used for evaluation. The second optimization approach using the FSscore and the Chemspace API (FSscore + CS) was carried out similarly, but instead of fine-tuning the FSscore once, the model was updated at every step for 50 rounds of RL. For all valid and unique SMILES of each batch (max. 128 molecules), the most similar molecule in the Chemspace database of buyable building blocks and buyable/synthesizable screening compounds (databases: CSSB, CSSS, CSMB, CSMS) was extracted and used as the less difficult partner in a pair for FSscore fine-tuning. Any data point was removed if an exact match was found. All other hyperparameters were chosen as outlined above. This yielded 4,779 SMILES from the FSscore-trained agent, 3,519 from the FSscore+CS-trained agent, and 4,915 SMILES from the SA score-trained agent. The results are compared by the number of reaction steps predicted by AiZynthFinder and the fraction of exact matches found in the Chemspace database. 24, 25
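For clarity, the sigmoid transformation of Eq. S4 used for the docking score, FSscore, and SA score rewards can be written as the following small helper; the function name is ours.

def sigmoid_reward(x, a, b, k=0.25):
    # Eq. S4: maps a raw score x to a reward x' in [0, 1]
    return 1.0 / (1.0 + 10 ** (10 * k * (x - (a + b) / 2) / (a - b)))

# reverse sigmoid for docking (a = -1, b = -13): better (lower) scores -> reward near 1
print(sigmoid_reward(-12.0, a=-1, b=-13))
print(sigmoid_reward(-2.0, a=-1, b=-13))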

S5 Metrics

The Pearson correlation coefficient (PCC) between $x$ and $y$ is computed using the function scipy.stats.pearsonr 26 and is calculated as follows:

PCC = \frac{\sum (x - m_x)(y - m_y)}{\sqrt{\sum (x - m_x)^2 \sum (y - m_y)^2}} (S10)

with $m_x$ the mean of vector $x$ and $m_y$ the mean of vector $y$.

The following metrics for datasets with binary labels (e.g. HS vs. ES) are computed using the sklearn.metrics package 27. Accuracy (Acc), sensitivity (SN), and specificity (SP) are calculated based on the relative sizes of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN):

Acc = \frac{TP + TN}{TP + TN + FP + FN} (S11)
SN = \frac{TP}{TP + FN} (S12)
SP = \frac{TN}{TN + FP} (S13)

The area under the receiver operating characteristic (ROC) curve (AUC) describes the ability to discriminate data points based on binary labels at various thresholds. It is computed by plotting SN against $1 - SP$ at various cut-off values defining the two classes.
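These metrics can be reproduced with scipy and scikit-learn as in the following sketch; the example arrays are arbitrary.

import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

# PCC between two continuous score vectors (Eq. S10)
x = np.array([0.1, 0.4, 0.35, 0.8])
y = np.array([0.2, 0.5, 0.30, 0.9])
pcc, _ = pearsonr(x, y)

# binary-label metrics (Eqs. S11-S13) and ROC AUC
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])    # e.g. FSscore values
y_pred = (y_score >= 0.5).astype(int)        # arbitrary classification threshold

acc = accuracy_score(y_true, y_pred)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sn = tp / (tp + fn)   # sensitivity (Eq. S12)
sp = tn / (tn + fp)   # specificity (Eq. S13)
auc = roc_auc_score(y_true, y_score)
print(pcc, acc, sn, sp, auc)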

S6 Additional results

(a) MOSES
(b) COCONUT
Figure S5: Heat maps displaying the correlations (PCC) between all scores obtained on molecules from MOSES and COCONUT.

S6.1 SYBA test sets

Table S2: Performance metrics showcasing the improvement on specific datasets after fine-tuning at different fine-tuning dataset sizes ($p_{ft}$ pairs). No FT refers to the performance of the pre-trained model on those datasets. $Acc_{pt}$ and $AUC_{pt}$ are determined on the pre-training test set and are based on the score difference (as during training), not the score itself. The values for the fine-tuned versions always show metrics on the dataset excluding the training molecules and, in brackets, on the full dataset.
set mode $p_{ft}$ $Acc$ $AUC$ $Acc_{pt}$ $AUC_{pt}$
chiral graph no FT 0.5435 0.5391 0.905 0.971
20 0.5762 (0.575) 0.5972 (0.5947) 0.9044 0.9706
30 0.5918 (0.592) 0.6185 (0.6173) 0.9035 0.97
40 0.6156 (0.6145) 0.6539 (0.6521) 0.9022 0.9692
50 0.6268 (0.6235) 0.6749 (0.6728) 0.8983 0.9671
fp no FT 0.509 0.4989 0.875 0.957
20 0.5163 (0.5165) 0.5082 (0.5102) 0.879 0.959
30 0.5180 (0.5195) 0.5122 (0.5153) 0.8786 0.9583
40 0.5255 (0.5255) 0.5180 (0.5212) 0.8738 0.9544
50 0.5237 (0.5235) 0.5147 (0.5174) 0.8783 0.9582
CP 17 graph no FT 0.6319 0.6446 0.905 0.971
20 0.9275 (0.9273) 0.9804 (0.9802) 0.903 0.9700
30 0.943 (0.9429) 0.9868 (0.9865) 0.9021 0.9695
40 0.9592 (0.9587) 0.9933 (0.9930) 0.9012 0.9689
50 0.9567 (0.9560) 0.9921 (0.9918) 0.9008 0.9687
fp no FT 0.7008 0.7566 0.880 0.959
20 0.8683 (0.8685) 0.9396 (0.9397) 0.8771 0.9568
30 0.8833 (0.8838) 0.9517 (0.9521) 0.8741 0.9543
40 0.9043 (0.9045) 0.9645 (0.9649) 0.8732 0.9536
50 0.9167 (0.9178) 0.9725 (0.9731) 0.8708 0.952
MC 17 graph no FT 0.5375 0.4894 0.905 0.971
20 0.65 (0.7875) 0.645 (0.7744) 0.9034 0.9695
30 0.6 (0.7625) 0.55 (0.8150) 0.9034 0.9696
40 (0.85) (0.906) 0.8998 0.9679
fp no FT 0.5875 0.515 0.880 0.959
20 0.6 (0.7125) 0.5425 (0.7832) 0.8820 0.9596
30 0.7 (0.8875) 0.61 (0.8994) 0.8734 0.9529
40 (0.825) (0.8513) 0.8786 0.9579
PROTAC-DB 19 graph no FT 0.53 0.4341 0.905 0.971
20 0.5063 (0.515) 0.4341 (0.4602) 0.905 0.9715
30 0.5214 (0.555) 0.4587 (0.5474) 0.9023 0.9697
40 0.5083 (0.51) 0.4583 (0.4781) 0.9048 0.9712
50 0.57 (0.6) 0.522 (0.6142) 0.8993 0.968
60 0.55 (0.615) 0.5097 (0.6389) 0.8984 0.9672
70 0.5167 (0.675) 0.4433 (0.6857) 0.8954 0.9659
80 0.55 (0.685) 0.4725 (0.7281) 0.8975 0.9672
generated graph no FT 0.5594 0.5255 0.905 0.971
20 0.5741 (0.5842) 0.5437 (0.5629) 0.8971 0.9675
30 0.5775 (0.6139) 0.5444 (0.6117) 0.8969 0.9673
40 0.5656 (0.6535) 0.5612 (0.6686) 0.8942 0.9659
50 0.5882 (0.604) 0.5692 (0.589) 0.9012 0.9694
60 0.5823 (0.6485) 0.5756 (0.6566) 0.8887 0.9632
70 0.6034 (0.6287) 0.6177 (0.6527) 0.8907 0.9643
80 0.6316 (0.7376) 0.6357 (0.7752) 0.8847 0.9616
Table S3: Performance metrics showcasing the improvement on specific datasets after fine-tuning at different fine-tuning dataset sizes with overlapping pairs (molecules can appear in multiple pairs). No FT refers to the performance of the pre-trained model on those datasets. $Acc_{pt}$ and $AUC_{pt}$ are determined on the pre-training test set and are based on the score difference (as during training), not the score itself. The values for the fine-tuned versions always show metrics on the dataset excluding the training molecules (with evaluation size $n_{eval}$) and, in brackets, on the full dataset.
set mode $p_{ft}$ $n_{eval}$ $Acc$ $AUC$ $Acc_{pt}$ $AUC_{pt}$
CP 17 graph no FT 7162 0.6319 0.6446 0.905 0.971
20 7122 0.8869 (0.8869) 0.9543 (0.9544) 0.9037 0.9639
50 7062 0.9598 (0.9598) 0.9933 (0.9933) 0.9008 0.9622
100 6962 0.9779 (0.9779) 0.9978 (0.9978) 0.8984 0.9609
500 6162 0.9982 (0.9982) 1.0 (1.0) 0.8773 0.9469
1000 5163 0.9983 (0.9983) 1.0 (1.0) 0.8662 0.941
2000 3371 0.9990 (0.9990) 1.0 (1.0) 0.8497 0.9318
fp no FT 7162 0.7008 0.7566 0.880 0.959
20 7122 0.8489 (0.8489) 0.9201 (0.9207) 0.8761 0.9471
50 7062 0.8833 (0.8830) 0.9517 (0.9487) 0.8721 0.944
100 6962 0.9102 (0.9102) 0.9679 (0.9689) 0.8688 0.9403
500 6162 0.9789 (0.9789) 0.9970 (0.9974) 0.8564 0.9285
1000 5164 0.9901 (0.9901) 0.9989 (0.9993) 0.8415 0.9168
2000 3372 0.9965 (0.9965) 0.9997 (0.9999) 0.8179 0.8981
MC 17 graph no FT 80 0.5375 0.4894 0.905 0.971
20 59 0.6625 (0.6625) 0.5829 (0.6656) 0.8985 0.9607
50 41 0.775 (0.775) 0.7273 (0.8175) 0.8971 0.9593
100 32 0.8125 (0.8125) 0.8155 (0.8244) 0.8995 0.9609
500 20 0.9 (0.9) 0.6813 (0.9375) 0.8698 0.9414
fp no FT 80 0.5875 0.515 0.880 0.959
20 60 0.6375 (0.6375) 0.4733 (0.6313) 0.8797 0.9502
50 44 0.85 (0.85) 0.7789 (0.9106) 0.8669 0.9415
100 29 0.925 (0.9250) 0.8389 (0.9484) 0.8545 0.9312
500 20 0.95 (0.95) 0.8542 (0.9612) 0.8324 0.9118

S6.1.1 CP test set

(a) Graph (GGLGGL)
(b) Morgan counts
Figure S6: ROC curves showcasing the ability to distinguish HS from ES in the CP test set using the graph-based FSscore or the fp-based FSscore. The fine-tuning was done with 50 pairs and these 100 molecules were excluded from the plots.
(a) Graph (GGLGGL)
(b) Morgan counts
Figure S7: Summary of Acc, SN and SP for classifying the molecules from the CP test set as either HS or ES. The fine-tuning was done with 50 pairs and these 100 molecules were excluded from the plots.

S6.1.2 MC test set

(a) Graph (GGLGGL)
(b) Morgan counts
Figure S8: ROC curves showcasing the ability to distinguish HS from ES in the MC test set using the graph-based FSscore or the fp-based FSscore. The fine-tuning was done with 20 pairs and these 40 molecules were excluded from the plots.
(a) Graph (GGLGGL)
(b) Morgan counts
Figure S9: Summary of Acc, SN and SP for classifying the molecules from the MC test set as either HS or ES. The fine-tuning was done with 20 pairs and these 40 molecules were excluded from the plots.

S6.2 meanComplexity dataset

Table S4: Performance metrics showcasing the improvement on the meanComplexity dataset (Sheridan et al. 18) after fine-tuning at different fine-tuning dataset sizes, without (unique = yes) or with (unique = no) overlapping pairs (molecules can appear in multiple pairs). No FT refers to the performance of the pre-trained model on those datasets. $Acc_{pt}$ and $AUC_{pt}$ are determined on the pre-training test set and are based on the score difference (as during training), not the score itself. The values for the fine-tuned versions always show metrics on the dataset excluding the training molecules (with evaluation size $n_{eval}$) and, in brackets, on the full dataset.
dataset unique mode $p_{ft}$ $n_{eval}$ PCC $Acc_{pt}$ $AUC_{pt}$
meanComplexity 18 yes graph no FT 1681 0.51 0.905 0.971
20 1641 0.61 0.9034 0.9701
30 1621 0.63 0.9035 0.9697
40 1601 0.66 0.9033 0.9697
50 1581 0.66 0.9028 0.9689
fp no FT 1681 0.49 0.880 0.959
20 1641 0.58 0.8768 0.9569
30 1621 0.68 0.8668 0.9492
40 1601 0.68 0.8483 0.9353
50 1581 0.67 0.8674 0.9502
no graph no FT 1681 0.51 0.905 0.971
20 1649 0.55 0.8995 0.9615
50 1582 0.51 0.9017 0.963
100 1552 0.73 0.8921 0.9564
500 1291 0.84 0.8765 0.9447
fp no FT 1681 0.49 0.880 0.959
20 1648 0.51 0.8809 0.9513
50 1584 0.59 0.8763 0.9475
100 1538 0.7 0.8632 0.9377
500 1292 0.8 0.8412 0.9212
(a) Pre-trained model
(b) Fine-tuned graph-based model
(c) Fine-tuned fp-based model
Figure S10: Heat maps displaying the correlations (PCC) between all scores including the meanComplexity (mC) obtained on molecules from the meanComplexity dataset. The plots based on fine-tuned models excluded the fine-tuning data (50 pairs).
Figure S11: Correlations (PCC) between the meanComplexity and the FSscore fine-tuned with different dataset sizes and with unique (left) or overlapping (right) pairs for both graph and fingerprint (FP) representations. The dashed line indicates the PCC of the SA score to the meanComplexity, which is the best performing score we tested.

S7 Example structures

S7.1 Hold-out test set

Figure S12: Structures from the hold-out test set scored by the pre-trained graph (GGLGGL) model. Upper row: Low FSscore (low synthetic feasibility). Lower row: High FSscore (high synthetic feasibility).
Figure S13: Structures from the hold-out test set scored by the pre-trained fingerprint (Morgan counts) model. Upper row: Low FSscore (low synthetic feasibility). Lower row: High FSscore (high synthetic feasibility).
Figure S14: Structures from the hold-out test set scored by the pre-trained fingerprint (Morgan boolean) model. Upper row: Low FSscore (low synthetic feasibility). Lower row: High FSscore (high synthetic feasibility).

S7.2 CP and MC test sets

Figure S15: Structures from the CP test set scored by the pre-trained graph (GGLGGL) model. Upper row: Low FSscore (low synthetic feasibility). Lower row: High FSscore (high synthetic feasibility). The labels below indicate the ground truth.
Figure S16: Structures from the CP test set scored by the pre-trained fingerprint (Morgan counts) model. Upper row: Low FSscore (low synthetic feasibility). Lower row: High FSscore (high synthetic feasibility). The labels below indicate the ground truth.
Figure S17: Structures from the CP test set scored by the fine-tuned graph (GGLGGL) model using 50 pairs. Upper row: Low FSscore (low synthetic feasibility). Lower row: High FSscore (high synthetic feasibility). The labels below indicate the ground truth.
Figure S18: Structures from the CP test set scored by the fine-tuned fingerprint (Morgan counts) model using 50 pairs. Upper row: Low FSscore (low synthetic feasibility). Lower row: High FSscore (high synthetic feasibility). The labels below indicate the ground truth.
Figure S19: Structures from the MC test set scored by the pre-trained graph (GGLGGL) model. Upper row: Low FSscore (low synthetic feasibility). Lower row: High FSscore (high synthetic feasibility). The labels below indicate the ground truth.
Figure S20: Structures from the MC test set scored by the pre-trained fingerprint (Morgan counts) model. Upper row: Low FSscore (low synthetic feasibility). Lower row: High FSscore (high synthetic feasibility). The labels below indicate the ground truth.
Figure S21: Structures from the MC test set scored by the fine-tuned graph (GGLGGL) model using 30 pairs. Upper row: Low FSscore (low synthetic feasibility). Lower row: High FSscore (high synthetic feasibility). The labels below indicate the ground truth.
Figure S22: Structures from the MC test set scored by the fine-tuned fingerprint (Morgan counts) model using 30 pairs. Upper row: Low FSscore (low synthetic feasibility). Lower row: High FSscore (high synthetic feasibility). The labels below indicate the ground truth.

S7.3 PROTAC-DB

Figure S23: Structures from PROTAC-DB scored by the pre-trained graph (GGLGGL) model. Upper row: Low FSscore (low synthetic feasibility). Lower row: High FSscore (high synthetic feasibility).
Figure S24: Structures from PROTAC-DB scored by the fine-tuned graph (GGLGGL) model using 50 pairs. Upper row: Low FSscore (low synthetic feasibility). Lower row: High FSscore (high synthetic feasibility).

S7.4 REINVENT case study

Figure S25: Random samples from the three agents trained in the REINVENT case study. Upper row: Samples from the first agent optimized for docking to DRD2. Middle row: Samples from the agent optimized for the FSscore. Lower row: Samples from the agent optimized for the SA score.

S8 Learning curves

Figure S26: Learning curves of the training and validation set for all model variations during pre-training for 250 epochs. The training set was split into 25 subsets and the model was trained on each of them for 10 epochs, which explains the oscillating nature of the curves.
Figure S27: Learning curves of fine-tuning the graph-based (GGLGGL) model with varying learning rate and number of pairs used from the chirality test set. Upper row: Training metrics. Middle row: Validation metrics. Lower row: Metrics on 5,000 samples from the pre-training test set.
Figure S28: Learning curves of fine-tuning the fp-based (Morgan chiral count) model with varying learning rate and number of pairs used from the chirality test set. Upper row: Training metrics. Middle row: Validation metrics. Lower row: Metrics on 5,000 samples from the pre-training test set.
Figure S29: Learning curves of fine-tuning the graph-based (GGLGGL) model with varying learning rate and number of pairs used from the CP test set. Upper row: Training metrics. Middle row: Validation metrics. Lower row: Metrics on 5,000 samples from the pre-training test set.
Figure S30: Learning curves of fine-tuning the fp-based (Morgan count) model with varying learning rate and number of pairs used from the CP test set. Upper row: Training metrics. Middle row: Validation metrics. Lower row: Metrics on 5,000 samples from the pre-training test set.
Figure S31: Learning curves of fine-tuning the graph-based (GGLGGL) model with varying learning rate and number of pairs used from the MC test set. Upper row: Training metrics. Middle row: Validation metrics. Lower row: Metrics on 5,000 samples from the pre-training test set.
Figure S32: Learning curves of fine-tuning the fp-based (Morgan count) model with varying learning rate and number of pairs used from the MC test set. Upper row: Training metrics. Middle row: Validation metrics. Lower row: Metrics on 5,000 samples from the pre-training test set.
Figure S33: Learning curves of fine-tuning the graph-based (GGLGGL) model with varying learning rate and number of pairs used from the PROTAC-DB sample. Upper row: Training metrics. Middle row: Validation metrics. Lower row: Metrics on 5,000 samples from the pre-training test set.
Figure S34: Learning curves of fine-tuning the graph-based (GGLGGL) model with varying learning rate and number of pairs used from the REINVENT case study. Upper row: Training metrics. Middle row: Validation metrics. Lower row: Metrics on 5,000 samples from the pre-training test set.

References

  • 1 Streamlit. https://streamlit.io
  • Choung et al. 2023 Choung, O.-H.; Vianello, R.; Segler, M.; Stiefl, N.; Jiménez-Luna, J. Learning chemical intuition from humans in the loop. 2023; https://chemrxiv.org/engage/chemrxiv/article-details/63f89282897b18336f0c5a55
  • Brody et al. 2022 Brody, S.; Alon, U.; Yahav, E. How Attentive are Graph Attention Networks? International Conference on Learning Representations. 2022
  • Ren et al. 2023 Ren, G.-P.; Wu, K.-J.; He, Y. Enhancing Molecular Representations Via Graph Transformation Layers. Journal of Chemical Information and Modeling 2023, Publisher: American Chemical Society
  • He et al. 2015 He, K.; Zhang, X.; Ren, S.; Sun, J. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. Proceedings of the IEEE international conference on computer vision. 2015; pp 1026–1034
  • Rogers and Hahn 2010 Rogers, D.; Hahn, M. Extended-connectivity fingerprints. Journal of chemical information and modeling 2010, 50, 742–754
  • Rousseeuw 1987 Rousseeuw, P. J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 1987, 20, 53–65
  • Gal and Ghahramani 2016 Gal, Y.; Ghahramani, Z. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. international conference on machine learning. 2016; pp 1050–1059
  • Kingma and Ba 2014 Kingma, D. P.; Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 2014,
  • 10 RDKit: Open-source cheminformatics. http://www.rdkit.org
  • Lowe 2012 Lowe, D. M. Extraction of chemical structures and reactions from the literature. Ph.D. thesis, University of Cambridge, 2012
  • Jiang et al. 2021 Jiang, S.; Zhang, Z.; Zhao, H.; Li, J.; Yang, Y.; Lu, B.-L.; Xia, N. When SMILES Smiles, Practicality Judgment and Yield Prediction of Chemical Reaction via Deep Chemical Language Processing. IEEE Access 2021, 9, 85071–85083, Conference Name: IEEE Access
  • Tu and Coley 2022 Tu, Z.; Coley, C. W. Permutation Invariant Graph-to-Sequence Model for Template-Free Retrosynthesis and Reaction Prediction. Journal of Chemical Information and Modeling 2022, 62, 3503–3513, Publisher: American Chemical Society
  • Sun et al. 2017 Sun, J.; Ajwani, D.; Nicholson, P. K.; Sala, A.; Parthasarathy, S. Breaking cycles in noisy hierarchies. Proceedings of the 2017 ACM on Web Science Conference. 2017; pp 151–160
  • Polykovskiy et al. 2020 Polykovskiy, D.; Zhebrak, A.; Sanchez-Lengeling, B.; Golovanov, S.; Tatanov, O.; Belyaev, S.; Kurbanov, R.; Artamonov, A.; Aladinskiy, V.; Veselov, M.; Kadurin, A.; Johansson, S.; Chen, H.; Nikolenko, S.; Aspuru-Guzik, A.; Zhavoronkov, A. Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models. Frontiers in Pharmacology 2020, 11
  • Sorokina et al. 2021 Sorokina, M.; Merseburger, P.; Rajan, K.; Yirik, M. A.; Steinbeck, C. COCONUT online: Collection of Open Natural Products database. Journal of Cheminformatics 2021, 13, 2
  • Voršilák et al. 2020 Voršilák, M.; Kolář, M.; Čmelo, I.; Svozil, D. SYBA: Bayesian estimation of synthetic accessibility of organic compounds. Journal of Cheminformatics 2020, 12, 35
  • Sheridan et al. 2014 Sheridan, R. P.; Zorn, N.; Sherer, E. C.; Campeau, L.-C.; Chang, C. Z.; Cumming, J.; Maddess, M. L.; Nantermet, P. G.; Sinz, C. J.; O’Shea, P. D. Modeling a Crowdsourced Definition of Molecular Complexity. Journal of Chemical Information and Modeling 2014, 54, 1604–1616, Publisher: American Chemical Society
  • Weng et al. 2022 Weng, G.; Cai, X.; Cao, D.; Du, H.; Shen, C.; Deng, Y.; He, Q.; Yang, B.; Li, D.; Hou, T. PROTAC-DB 2.0: an updated database of PROTACs. Nucleic Acids Research 2022, gkac946
  • Blaschke et al. 2020 Blaschke, T.; Arús-Pous, J.; Chen, H.; Margreitter, C.; Tyrchan, C.; Engkvist, O.; Papadopoulos, K.; Patronov, A. REINVENT 2.0: An AI Tool for De Novo Drug Design. Journal of Chemical Information and Modeling 2020, 60, 5918–5922, Publisher: American Chemical Society
  • Guo and Schwaller 2023 Guo, J.; Schwaller, P. Augmented Memory: Capitalizing on Experience Replay to Accelerate De Novo Molecular Design. 2023; http://arxiv.org/abs/2305.16160, arXiv:2305.16160 [cs, q-bio]
  • Koes et al. 2013 Koes, D. R.; Baumgartner, M. P.; Camacho, C. J. Lessons Learned in Empirical Scoring with smina from the CSAR 2011 Benchmarking Exercise. Journal of Chemical Information and Modeling 2013, 53, 1893–1904, Publisher: American Chemical Society
  • Rappe et al. 1992 Rappe, A. K.; Casewit, C. J.; Colwell, K. S.; Goddard, W. A. I.; Skiff, W. M. UFF, a full periodic table force field for molecular mechanics and molecular dynamics simulations. Journal of the American Chemical Society 1992, 114, 10024–10035, Publisher: American Chemical Society
  • Genheden et al. 2020 Genheden, S.; Thakkar, A.; Chadimová, V.; Reymond, J.-L.; Engkvist, O.; Bjerrum, E. AiZynthFinder: a fast, robust and flexible open-source software for retrosynthetic planning. Journal of Cheminformatics 2020, 12, 70
  • 25 Chemspace search. https://chem-space.com/search
  • Virtanen et al. 2020 Virtanen, P.; Gommers, R.; Oliphant, T. E.; Haberland, M.; Reddy, T.; Cournapeau, D.; Burovski, E.; Peterson, P.; Weckesser, W.; Bright, J.; van der Walt, S. J.; Brett, M.; Wilson, J.; Millman, K. J.; Mayorov, N.; Nelson, A. R. J.; Jones, E.; Kern, R.; Larson, E.; Carey, C. J.; Polat, İ.; Feng, Y.; Moore, E. W.; VanderPlas, J.; Laxalde, D.; Perktold, J.; Cimrman, R.; Henriksen, I.; Quintero, E. A.; Harris, C. R.; Archibald, A. M.; Ribeiro, A. H.; Pedregosa, F.; van Mulbregt, P.; SciPy 1.0 Contributors SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods 2020, 17, 261–272
  • Buitinck et al. 2013 Buitinck, L.; Louppe, G.; Blondel, M.; Pedregosa, F.; Mueller, A.; Grisel, O.; Niculae, V.; Prettenhofer, P.; Gramfort, A.; Grobler, J.; Layton, R.; VanderPlas, J.; Joly, A.; Holt, B.; Varoquaux, G. API design for machine learning software: experiences from the scikit-learn project. ECML PKDD Workshop: Languages for Data Mining and Machine Learning. 2013; pp 108–122