
FSscore: A Personalized Machine Learning-based Synthetic Feasibility Score

Rebecca M. Neeser, Laboratory of Artificial Chemical Intelligence (LIAC), EPFL, Switzerland; Bruno Correia, Laboratory of Protein Design and Immunoengineering (LPDI), EPFL, Switzerland; Philippe Schwaller, Laboratory of Artificial Chemical Intelligence (LIAC), EPFL, Switzerland

Additional affiliations: Laboratory of Protein Design and Immunoengineering (LPDI), EPFL, Switzerland; National Centre of Competence in Research (NCCR) Catalysis, EPFL, Switzerland

S1 Code and data availability

All code used for training, fine-tuning, and scoring the FSscore is available at https://github.com/schwallergroup/fsscore. This repository also includes an application that can be run locally and allows a more intuitive and accessible way to label data, fine-tune, and deploy a model.

S1.1 Application

The application allows a user to perform all steps necessary to obtain a personalized model as intended. These steps involve: pairing molecules, labeling pairs, fine-tuning the pre-trained model, and ultimately scoring molecules. The application is built using Streamlit 1 but is deployed locally. This means that its speed is limited by the computer on which the user runs the app. All models were trained on one GPU (NVIDIA GeForce RTX 3090), but the application will also run on a computer without a GPU at the cost of a much longer running time. Furthermore, running it locally requires the user to install the package, which we recommend doing with a conda environment and pip. All the necessary steps and usage are outlined below.
1. Installation
We recommend running the app with the packages stored in a conda environment (Conda installation guide: https://conda.io/projects/conda/en/latest/user-guide/install/index.html). First, clone the GitHub repository from the link above (alternatively, download the zip file):

git clone https://github.com/schwallergroup/fsscore.git

Next, change the directory to the repository and install all required packages, like so:

cd fsscore
conda create -n fsscore python=3.10
conda activate fsscore
pip install -r requirements.txt
pip install -e .

2. Start the application
Starting the application only requires one line of code (run from the root directory of the repository), which will cause a window to pop up or return an active link:

streamlit run streamlit_app/run.py

From the landing page, select Label molecules in the upper drop-down menu on the left. This will lead to the selection of the next steps. If a dataset with labels is available, go to step 4; otherwise, continue with the next step. To directly score molecules with an already fine-tuned model, jump to step 6.
3. Pair molecules
In this step, a list of molecules provided as SMILES is first paired and then ranked based on the uncertainty of the pre-trained model. This second step will take some time if not performed on a machine with GPU support. Select the option on the left and provide the path to the csv file (Figure S1). This file should have smiles as the column header for the SMILES structures to be paired.

Figure S1: Selection window to either pair molecules of a new dataset or continue with an already paired dataset.

4. Label molecule pairs
If a dataset with paired molecules is available, place it in streamlit_app/data/unlabeled. The dataset should appear in the drop-down menu on the right. Make sure to provide the paired molecules with column headers smiles_i and smiles_j.
The labeling window (see Figure S2) prompts the user to select the molecule that is harder to synthesize before going to the next pair. Once the user has finished labeling, the Send labels button can be pressed. The labeling process can also be continued later, and the labeled data can be downloaded. Note: the more labeled pairs, the better the performance of the fine-tuned model.

Figure S2: Screenshot of labeling interface. The user is prompted to select the molecule that is harder to synthesize using the buttons below before proceeding to the next pair. Once finished, the user can send the labels with the respective button.

5. Fine-tune
Once data is labeled, one can proceed to fine-tuning. By default, this is done using the best pre-trained model from this work. The user can also place their own checkpoint at the same location (folder models). How to pre-train a new model is described in our GitHub repository. Before initializing fine-tuning, one can tune several hyperparameters. If nothing is adjusted, the (recommended) default parameters used in this work are selected (Figure S3). Fine-tuning is generally very fast, but depending on the system it can take a few minutes. Proceeding to score molecules is achieved by selecting the respective action in the upper drop-down menu. This step can occasionally cause issues, in which case we recommend returning to the home page before trying again.

Figure S3: Screenshot of the window initializing fine-tuning. The dataset to fine-tune with is selected in the drop-down menu on the left, and the path to the pre-trained model can be changed under point 2. The sliders under point 3 allow the user to tune the hyperparameters.

6. Score molecules
Lastly, one can deploy the previously fine-tuned model and score molecules. For this, the user may place a csv file containing the column header smiles in streamlit_app/data/scoring. Then, the user can select the correct dataset in the drop-down menu on the left as well as the desired fine-tuned model (Figure S4).

Figure S4: Screenshot of settings to score molecules given a dataset and a fine-tuned model.

S2 Machine Learning Methods

The novel Focused Synthesizability score (FSscore) is trained in two stages. First, we train a baseline score assessing synthesizability using a graph representation and a suitable message passing scheme, improving expressivity over similar frameworks such as the SCScore. Second, we introduce our fine-tuning approach using human expert knowledge, allowing us to focus the score on a specific chemical space of interest.

S2.1 Architecture and representation

Our approach to learning a continuous score to assess synthesizability is inspired by Choung et al. 2, who framed a similar task as a ranking problem using binary preferences. Specifically, every data point consists of two molecules, for each of which we predict a scalar in separate forward passes. The minimization of the binary cross entropy between the true preference and the learned score difference $\delta_{ij} := \hat{f}(m_i) - \hat{f}(m_j)$ (scaled using a sigmoid function) constitutes our training objective.
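As an illustration, this pairwise ranking objective can be sketched in PyTorch as follows. This is a minimal sketch, not the actual FSscore implementation: score_net, the featurized molecule batches, and the preference labels pref_ij are hypothetical placeholders.

import torch
import torch.nn.functional as F

def pairwise_ranking_loss(score_net, mol_i, mol_j, pref_ij):
    # scalar scores f(m_i) and f(m_j), predicted in separate forward passes
    s_i = score_net(mol_i)
    s_j = score_net(mol_j)
    delta_ij = s_i - s_j  # learned score difference
    # binary cross entropy between the true preference and sigmoid(delta_ij)
    return F.binary_cross_entropy_with_logits(delta_ij, pref_ij.float())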

The function $f: \mathcal{M} \rightarrow \mathbb{R}$ learns to parametrize molecules as an expressive latent representation given a set of molecules $m_1, ..., m_n \in \mathcal{M}$. We represent molecules as graphs $G = (\mathcal{V}, \mathcal{E})$ with atoms as nodes $x_1, ..., x_n \in \mathcal{V}$ and bonds as edges $e_1, ..., e_m \in \mathcal{E}$, from which we can compute the line graph $L(G)$ offline, iteratively up to the desired depth. The transformation to the line graph is defined such that the edges $\{e_1, ..., e_m\}$ of graph $G$ become the nodes of the line graph, and two such nodes are connected if the corresponding edges of $G$ share a node. The graph neural network (GNN) embedding the molecular graph combines two message passing schemes: the graph attention network (GATv2) 3 operating on the graph $G$ and the Line Evolution (LineEvo) 4 layer operating on the line graph $L(G)$. Both GATv2 and LineEvo use an attention mechanism to update the node representations $\{h_1, ..., h_n\}$ as follows:

e(h_i, h_j) = a^T \sigma(W_i h_i \,||\, W_j h_j) (S1)

where $a \in \mathbb{R}^{2d'}$ and $W \in \mathbb{R}^{d' \times d}$ are learned, $\sigma$ denotes an activation function, and $||$ denotes concatenation. The GNN operates on a hidden size of 128; the input dimensions depend on the initial featurization of the edges and nodes (see Tab. S1). GATv2 layers use 8 heads and LeakyReLU as activation function after updating the hidden feature vector, while LineEvo layers use ELU. In GATv2, these local attention scores $e_{ij}$ are then normalized across all neighbors $\mathcal{N}_i$ using a softmax (see Eq. S2) to obtain a normalized attention coefficient $\alpha_{ij}$, which is used to compute the updated node representation $h_i'$ as a weighted average (see Eq. S3):

\alpha_{ij} = \text{softmax}_j(e(h_i, h_j)) = \frac{\exp(e(h_i, h_j))}{\sum_{j \in \mathcal{N}_i} \exp(e(h_i, h_j))} (S2)
h_i' = \sigma\left(\sum_{j \in \mathcal{N}_i} \alpha_{ij} W h_j\right) (S3)

with PReLU 5 as nonlinearity $\sigma$. In LineEvo layers, $e(h_i, h_j)$ is simply transformed using ELU to obtain the new node representation $h_i'$.
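To make Eqs. S1-S3 concrete, the following is a minimal, self-contained sketch of a single-head attention update for one node and its neighbors in PyTorch. Dimensions and variable names are illustrative only, and a single shared weight matrix W is used for brevity (Eq. S1 allows separate W_i and W_j).

import torch
import torch.nn.functional as F

d, d_prime = 16, 32                 # illustrative input and hidden sizes
W = torch.randn(d_prime, d)         # learned linear transformation (Eq. S1)
a = torch.randn(2 * d_prime)        # learned attention vector (Eq. S1)

h_i = torch.randn(d)                # representation of node i
h_js = torch.randn(5, d)            # representations h_j of 5 neighbors of i

# Eq. S1: e(h_i, h_j) = a^T sigma(W h_i || W h_j), with LeakyReLU as sigma
Wh_i = W @ h_i
Wh_js = h_js @ W.T
e_ij = F.leaky_relu(torch.cat([Wh_i.expand_as(Wh_js), Wh_js], dim=-1)) @ a

# Eq. S2: softmax over the neighborhood gives attention coefficients alpha_ij
alpha_ij = torch.softmax(e_ij, dim=0)

# Eq. S3: weighted average of transformed neighbors, with PReLU as nonlinearity
h_i_new = F.prelu(alpha_ij @ Wh_js, torch.tensor([0.25]))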

These transformation layers are stacked so that two GATv2 (G) layers are followed by one LineEvo (L) layer (GGLGGL). Each of these layers is followed by a readout function, which consists of a global max pooling layer and global weighted-add pooling, as suggested by Ren et al. 4, obtaining a molecular representation through concatenation of the two readouts. The intermediate molecular representations of all layers are summed, and the final score $s_i = \hat{f}(m_i)$ is inferred with a multilayer perceptron (MLP) (3 hidden layers of size 256, ReLU as activation function). The model described above was compared to five other implementations: GATv2 layers only (GGG) and four fingerprint implementations, namely Morgan (boolean), Morgan counts, Morgan chiral (boolean), and Morgan chiral counts, all with radius 4 and 2048 bits. 6 The fingerprints are embedded by one linear layer (hidden size of 256) with ReLU as activation function, followed by the aforementioned MLP. More detailed information can be found in Section S2.3.

S2.2 Pairing molecules for fine-tuning

To focus our scoring model on a desired chemical space, we apply a fine-tuning approach inspired by Choung et al. 2. This approach was optimized to require few data points in order to limit the time required from expert chemists labeling the training examples when labels are based on human feedback. Datasets used for fine-tuning can be of various origins, such as a chemical scope suggested by experimentalists or specific chemical spaces encompassing e.g. natural products. Training our model requires pairs of molecules with a binary preference as label. To construct such pairs, we first cluster a dataset using the k-means algorithm based on the Tanimoto distances of Morgan count fingerprints. The number of clusters k was determined for every dataset individually as the k for which the mean Silhouette coefficient over a range of k was the smallest. 7 The queried range was set to 5 to 29 for datasets larger than 500 pairs and to 3 to 9 for smaller datasets. Secondly, molecules are paired such that they come from different clusters, appear only once in the paired dataset, and have opposite labels if labels are available. We also investigated the influence of having overlapping pairs (molecules can appear in multiple pairs) for the SYBA (CP and MC) and the meanComplexity datasets. This approach results in more fine-tuning data but does not satisfy our desire to keep the dataset size small, since human labeling could not make use of this advantage. Subsequently, the uncertainty of the pre-trained model's prediction on those pairs is determined based on the variance of $\delta_{ij}$ obtained using the Monte Carlo dropout method 8 with a dropout rate of 0.2 on 100 predictions. The dataset is sorted by descending variance so that the pairs with high uncertainty are labeled first or used first for fine-tuning when selecting a subset. If no label is available, the top n pairs (n depending on the dataset size) are submitted to be evaluated by our expert chemist based on their preference with regard to synthesizability.
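The uncertainty-based ranking step could look roughly like the sketch below. The pre-trained model and the featurized pairs are placeholders; the dropout rate of 0.2 (set in the model) and the 100 stochastic forward passes follow the description above.

import torch

def rank_pairs_by_uncertainty(model, pairs, n_samples=100):
    # Monte Carlo dropout: keep dropout layers active at inference time
    model.train()
    variances = []
    with torch.no_grad():
        for mol_i, mol_j in pairs:
            deltas = torch.stack([model(mol_i) - model(mol_j) for _ in range(n_samples)])
            variances.append(deltas.var().item())  # variance of delta_ij
    # sort with descending variance: most uncertain pairs are labeled/used first
    order = sorted(range(len(pairs)), key=lambda k: variances[k], reverse=True)
    return [pairs[k] for k in order]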

S2.3 Training details

To center the obtained score around zero, a regularization factor of $\lambda$ = 1e-4 was applied to the predicted score to obtain the regularization loss $\mathcal{L}_{reg}(\hat{f}; \lambda) := \lambda ||\hat{f}||^2$, as suggested by Choung et al. 2. $\mathcal{L}_{reg}$ is added to the cross-entropy loss term described in Section S2.1 to obtain the final loss $\mathcal{L} = \mathcal{L}_{CE} + \mathcal{L}_{reg}$.

Pre-training: The training set was randomly split into 25 equally sized subsets, and the model was trained on each subset for 10 epochs, totaling 250 epochs, with a batch size of 128. This sequential learning approach was found to work well, increasing the speed of training. The model was trained with the Adam optimizer 9 and an initial learning rate of 3e-4.

Fine-tuning: We chose a transfer learning type of fine-tuning where all weights of the pre-trained model can be adapted. All hyperparameters except the learning rate and batch size are kept the same. In order to avoid forgetting the previously learned knowledge, we track the performance metrics on a random subset of 5,000 pairs from the hold-out test set of the pre-training data. Of the four tested initial learning rates (1e-5, 1e-4, 3e-4, 1e-3), we chose 1e-4 for the graph-based version and 3e-4 for the fingerprint-based models. These learning rates offer a good trade-off between degradation of the previously learned knowledge and learning on the new dataset, as can be seen in the learning curves in Appendix S8. We observed quick convergence within fewer than 20 epochs, and in order to balance the aforementioned trade-off a custom early stopping method was applied. Training is stopped once either the validation loss (or the training loss if the full dataset is used for production) has increased for 3 epochs (patience of 3) or the accuracy on the 5k subset of the hold-out test set has decreased for 3 epochs with a delta threshold of 0.02. We trained for a maximum of 20 epochs. To evaluate data efficiency, we further varied the number of data points used for training, of which we took 5 random pairs for validation. The results in Table S2 show that performance improves with increasing dataset size. The batch size was set to 4.
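A simplified sketch of this custom early-stopping rule is shown below; the bookkeeping class is illustrative only, and the exact interpretation of the 0.02 delta threshold is an assumption (here: a drop of more than 0.02 relative to the best hold-out accuracy).

class CustomEarlyStopping:
    def __init__(self, patience=3, delta=0.02):
        self.patience, self.delta = patience, delta
        self.best_loss, self.best_acc = float("inf"), 0.0
        self.loss_bad_epochs, self.acc_bad_epochs = 0, 0

    def step(self, val_loss, holdout_acc):
        # consecutive epochs in which the (validation or training) loss got worse
        self.loss_bad_epochs = self.loss_bad_epochs + 1 if val_loss > self.best_loss else 0
        self.best_loss = min(self.best_loss, val_loss)
        # consecutive epochs in which the hold-out accuracy dropped by more than delta
        self.acc_bad_epochs = self.acc_bad_epochs + 1 if holdout_acc < self.best_acc - self.delta else 0
        self.best_acc = max(self.best_acc, holdout_acc)
        # stop once either criterion has been violated for `patience` epochs
        return self.loss_bad_epochs >= self.patience or self.acc_bad_epochs >= self.patience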

S2.4 Initial featurization of graphs

Table S1: Features for initializing the nodes and edges of the graph. One-hot encodings include one additional bit for types not found in the listed choices (included in the stated size). These properties were determined using RDKit 10.
Localization Feature Description Size
atom atom type one-hot encoded atom type (considered atom types: H, B, C, N, O, F, Si, P, S, Cl, Br, I, Se) 14
charge one-hot encoded formal charge [-4,4] 10
implicit hydrogens one-hot encoded number of hydrogens [0,4] 6
degree one-hot encoded degree [0,4] 6
ring information 1 if in ring else 0 1
aromaticity 1 if aromatic else 0 1
hybridization one-hot encoded hybridization state (considered hybridization states: UNSPECIFIED, s, sp, sp2, sp3, sp3d, sp3d2, other) 9
chiral tag one-hot encoded chiral tag [S, R, unassigned] 4
bond bond type one-hot encoded bond type (considered bond types: single, double, triple, aromatic) 5
conjugated 1 if conjugated else 0 1
ring information 1 if in ring else 0 1
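As an illustration of the one-hot scheme in Table S1, the following RDKit-based sketch builds a few of the atom features for a toy molecule; it is not the exact featurizer used in the codebase.

from rdkit import Chem

ATOM_TYPES = ["H", "B", "C", "N", "O", "F", "Si", "P", "S", "Cl", "Br", "I", "Se"]

def one_hot(value, choices):
    # one additional trailing bit encodes "not found in the choices"
    encoding = [0] * (len(choices) + 1)
    index = choices.index(value) if value in choices else len(choices)
    encoding[index] = 1
    return encoding

mol = Chem.MolFromSmiles("c1ccccc1O")  # phenol as a toy example
for atom in mol.GetAtoms():
    features = (
        one_hot(atom.GetSymbol(), ATOM_TYPES)                 # atom type, size 14
        + [int(atom.IsInRing()), int(atom.GetIsAromatic())]   # ring and aromaticity flags
    )
    print(atom.GetSymbol(), features)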

S3 Data

S3.1 Pre-training data

To pre-train our model on a large collection of reactions, we combine the USPTO_full 11 patent dataset with a complementary dataset 12. Reaction data implicitly contains information on synthetic feasibility through the relation of reactant to product, with the product being synthetically more difficult. The USPTO_full dataset was downloaded according to https://github.com/coleygroup/Graph2SMILES/blob/main/scripts/download_raw_data.py 13 using all USPTO_full subsets and "src" as product and "tgt" as reactants. The dataset was split so that every data point is a pair of one reactant and its respective product, and the SMILES strings were canonicalized. The datasets do not contain reactions with multiple products. Data points with empty strings, identical reactants and products (isomeric SMILES), either reactant or product containing fewer than four heavy atoms, or element types not in {H, B, C, N, O, F, Si, P, S, Cl, Se, Br, I} were removed. The datasets were further deduplicated, retaining one instance of each replicate. Furthermore, data points were removed to ensure that there are no cycles in the reaction network. This was achieved by removing back edges with the Depth-First Search (DFS) algorithm as implemented by Sun et al. 14. The filtered dataset consisted of 5,340,704 data points. The train-test split was performed so that no molecules (not just reactant-product pairs) overlap between the splits, which prevents data leakage at the cost of a significant loss of data. The training set consists of 3,349,455 pairs and the hold-out test set of 711,550 pairs (17.5%).
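Parts of this filtering (canonicalization, the minimum heavy-atom count, and the allowed-element check) are sketched below with RDKit; the function is a rough illustration only, and the deduplication and cycle-removal steps are omitted.

from rdkit import Chem

ALLOWED_ELEMENTS = {"H", "B", "C", "N", "O", "F", "Si", "P", "S", "Cl", "Se", "Br", "I"}

def clean_pair(reactant_smiles, product_smiles):
    # returns a canonical (reactant, product) pair or None if it is filtered out
    pair = []
    for smi in (reactant_smiles, product_smiles):
        mol = Chem.MolFromSmiles(smi)
        if mol is None or mol.GetNumHeavyAtoms() < 4:
            return None
        if any(atom.GetSymbol() not in ALLOWED_ELEMENTS for atom in mol.GetAtoms()):
            return None
        pair.append(Chem.MolToSmiles(mol))  # canonical isomeric SMILES
    reactant, product = pair
    if reactant == product:  # identical reactant and product
        return None
    return reactant, product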

To qualitatively evaluate the pre-trained model, its performance on the MOSES 15 and COCONUT 16 datasets was assessed. We downloaded the MOSES test set from Polykovskiy et al. 15 as is, totaling 176,074 SMILES. COCONUT was extracted from Sorokina et al. 16; all SMILES returning None with RDKit 10 were removed, and a random subset of 176,000 molecules was selected to match the MOSES fraction.

S3.2 Fine-tuning data

The data for the fine-tuning case studies come from various sources. The specific processing steps for every dataset are outlined below.

Chirality: To obtain a test set for the chirality case study, we filtered the deduplicated molecules from the pre-training training set, keeping only those with @ or @@ tokens in the SMILES string, which define the chirality of a tetrahedral stereocenter. Of those, we randomly sampled 1,000 SMILES and obtained their partners as the corresponding non-isomeric SMILES generated with RDKit 10.
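The non-isomeric partner of a chiral SMILES can be generated with RDKit as in this brief sketch (the example molecule is arbitrary):

from rdkit import Chem

smiles = "C[C@H](N)C(=O)O"  # alanine with a chiral tag, as an example
mol = Chem.MolFromSmiles(smiles)
flat_smiles = Chem.MolToSmiles(mol, isomericSmiles=False)  # drops the @/@@ tokens
print(smiles, "->", flat_smiles)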

SYBA sets: The CP and MC test sets were extracted from Voršilák et al. 17. No further processing was required.

meanComplexity: The meanComplexity dataset was downloaded from Sheridan et al. 18 and cleaned by removing SMILES whose conversion to an RDKit molecule object failed. This processing step removed 44 SMILES, yielding a fine-tuning test set of 1,731 molecules.

PROTACs: For the PROTAC case study, we extracted all PROTAC SMILES from PROTAC-DB (Weng et al. 19) and deduplicated the dataset keeping the first instance, resulting in 3,270 SMILES. After pairing and ranking by uncertainty as described in Appendix S2.2, 100 pairs were labeled by a medicinal chemist with expertise in PROTACs. In order to compare the scores of the individual components to the full PROTAC, as done in the PROTAC bar plot of the main text, the individual fragments (the two ligands and the linker) have to be extracted. For this, we extracted the collections containing information on anchor and warhead (the two ligands) from PROTAC-DB and tried to match an anchor and warhead to every PROTAC by matching the protein target ID and finding the biggest substructure match. The linker was obtained by removing the ligands' atoms. Of the 3,270 PROTACs in the database, we could find matching ligands for 1,920. This extraction approach is necessary because PROTAC-DB does not cross-reference the PROTACs to the fragments; our extraction cannot guarantee concordance with the reported warhead and anchor in the respective publications, but it is still a good approximation.

S4 De novo design case study

To assess the applicability of the FSscore to a generative task, we took advantage of the RL framework of REINVENT 20 and used the recently proposed augmented memory 21 optimization protocol. First, an agent was trained with the composite (equal weights) of the docking score to the D2 dopamine receptor (DRD2, PDB ID: 6CM4) and the molecular weight (MW) as a reward. The docking score was determined using AutoDock Vina 22 on one conformer each. Conformer generation is performed with RDKit 10 and the Universal Force Field (UFF) 23 for energy minimization with a maximum of 600 iterations. The score $x$ was scaled to obtain a reward $x'$ in the range [0,1] by applying a sigmoid transformation as follows:

x' = \frac{1}{1 + 10^{10k \cdot \frac{x - \frac{a+b}{2}}{a-b}}} (S4)

with $a$ corresponding to a docking score $x$ of -1, $b$ to -13, and the steepness $k$ set to 0.25, resulting in a reverse sigmoid (a low docking score results in a reward $x'$ towards 1, while a high score returns a reward $x'$ close to 0). The score for MW was formulated so that a high reward is returned for a MW between 0 and 500 Da, using a double sigmoid:

x' = \frac{10^{c_{SE}\frac{x}{c_{div}}}}{10^{c_{SE}\frac{x}{c_{div}}} + 10^{c_{SE}\frac{b}{c_{div}}}} - \frac{10^{c_{SI}\frac{x}{c_{div}}}}{10^{c_{SI}\frac{x}{c_{div}}} + 10^{c_{SI}\frac{a}{c_{div}}}} (S5)
A = 10^{c_{SE}\frac{x}{c_{div}}} (S6)
B = 10^{c_{SE}\frac{x}{c_{div}}} + 10^{c_{SE}\frac{b}{c_{div}}} (S7)
C = \frac{10^{c_{SI}\frac{x}{c_{div}}}}{10^{c_{SI}\frac{x}{c_{div}}} + 10^{c_{SI}\frac{a}{c_{div}}}} (S8)
x' = \frac{A}{B} - C (S9)

with coefficients $c_{SE} = c_{SI} = 500$ and $c_{div} = 250$, the lower bound $a = 0$ and upper bound $b = 500$. RL was carried out for 150 epochs with a batch size of 64, a learning rate of 1e-4, a $\sigma$ of 128, and augmented memory as optimization algorithm with 2 augmentation rounds and selective memory purge with a minimum similarity of 0.4 and a bin size of 10 based on the identical Murcko scaffold. For details on augmented memory in REINVENT we refer to Guo and Schwaller 21. For our first approach (FSscore optimization), all valid SMILES (9,246) generated during this optimization process were subsequently used to fine-tune the FSscore as described above. The agent already optimized for docking was next used to optimize for synthetic feasibility, either with the fine-tuned FSscore or with the SA score as a comparison. Both approaches were trained in identical fashion with the following parameters: 50 epochs, batch size of 128, $\sigma$ of 128, a learning rate of 1e-4, and the augmented memory algorithm with 2 rounds of augmentation. The diversity filter (incl. selective memory purge) was based on identical Murcko scaffolds with a minimal similarity of 0.4, a minimal score of 0.4, and a bucket size of 25. The FSscore was transformed to range from 0 to 1 with a sigmoid according to Equation S4 with $a = -13.08$, $b = 10.07$, and a steepness $k = 0.25$. The SA score was transformed similarly, reverting the sigmoid by setting $a = 10$, $b = 1$, and $k = 0.25$. Thus, a high FSscore results in a high reward, while for the SA score a low value results in a high reward, and vice versa. As mentioned above, only molecules with a minimal score of 0.4 were collected and subsequently used for evaluation. The second optimization approach using the FSscore and the Chemspace API (FSscore + CS) was carried out similarly, but instead of fine-tuning the FSscore once, the model was updated at every step for 50 rounds of RL. For all valid and unique SMILES of each batch (max. 128 molecules), the most similar molecule in the Chemspace database of buyable building blocks and buyable/synthesizable screening compounds (databases: CSSB, CSSS, CSMB, CSMS) was extracted and used as the less difficult partner in a pair for FSscore fine-tuning. Any data point was removed if an exact match was found. All other hyperparameters were chosen as outlined above. This yielded 4,779 SMILES from the FSscore-trained agent, 3,519 from the FSscore+CS-trained agent, and 4,915 SMILES from the SA score-trained agent. The results are compared by the number of reaction steps predicted by AiZynthFinder and the fraction of exact matches found in the Chemspace database. 24, 25
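For clarity, the sigmoid transformation of Eq. S4 used for the docking score, FSscore, and SA score rewards can be written as the following small helper; the function name is ours.

def sigmoid_reward(x, a, b, k=0.25):
    # Eq. S4: maps a raw score x to a reward x' in [0, 1]
    return 1.0 / (1.0 + 10 ** (10 * k * (x - (a + b) / 2) / (a - b)))

# reverse sigmoid for docking (a = -1, b = -13): better (lower) scores -> reward near 1
print(sigmoid_reward(-12.0, a=-1, b=-13))
print(sigmoid_reward(-2.0, a=-1, b=-13))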

S5 Metrics

The Pearson correlation coefficient (PCC) between $x$ and $y$ is computed using the function scipy.stats.pearsonr 26 and is calculated as follows:

PCC = \frac{\sum (x - m_x)(y - m_y)}{\sqrt{\sum (x - m_x)^2 \sum (y - m_y)^2}} (S10)

with $m_x$ the mean of vector $x$ and $m_y$ the mean of vector $y$.

The following metrics for datasets with binary labels (e.g. HS vs. ES) are computed using the sklearn.metrics package 27. Accuracy (Acc), sensitivity (SN), and specificity (SP) are calculated based on the relative sizes of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN):

Acc = \frac{TP + TN}{TP + TN + FP + FN} (S11)
SN = \frac{TP}{TP + FN} (S12)
SP = \frac{TN}{TN + FP} (S13)

The area under the receiver operating characteristic (ROC) curve (AUC) describes the ability to discriminate data points based on binary labels at various thresholds. It is computed by plotting SN against $1 - SP$ at various cut-off values defining the two classes.
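These metrics can be reproduced with scipy and scikit-learn as in the following sketch; the example arrays are arbitrary.

import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

# PCC between two continuous score vectors (Eq. S10)
x = np.array([0.1, 0.4, 0.35, 0.8])
y = np.array([0.2, 0.5, 0.30, 0.9])
pcc, _ = pearsonr(x, y)

# binary-label metrics (Eqs. S11-S13) and ROC AUC
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])    # e.g. FSscore values
y_pred = (y_score >= 0.5).astype(int)        # arbitrary classification threshold

acc = accuracy_score(y_true, y_pred)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sn = tp / (tp + fn)   # sensitivity (Eq. S12)
sp = tn / (tn + fp)   # specificity (Eq. S13)
auc = roc_auc_score(y_true, y_score)
print(pcc, acc, sn, sp, auc)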

S6 Additional results

(a) MOSES
(b) COCONUT
Figure S5: Heat maps displaying the correlations (PCC) between all scores obtained on molecules from MOSES and COCONUT.

S6.1 SYBA test sets

Table S2: Performance metrics showcasing the improvement on specific datasets after fine-tuning at different fine-tuning dataset sizes ($p_{ft}$ pairs). No FT refers to the performance of the pre-trained model on those datasets. $Acc_{pt}$ and $AUC_{pt}$ are determined on the pre-training test set and are based on the score difference (as during training), not the score itself. The values for the fine-tuned versions always show metrics on the dataset excluding the training molecules and, in brackets, on the full dataset.
set mode $p_{ft}$ $Acc$ $AUC$ $Acc_{pt}$ $AUC_{pt}$
chiral graph no FT 0.5435 0.5391 0.905 0.971
20 0.5762 (0.575) 0.5972 (0.5947) 0.9044 0.9706
30 0.5918 (0.592) 0.6185 (0.6173) 0.9035 0.97
40 0.6156 (0.6145) 0.6539 (0.6521) 0.9022 0.9692
50 0.6268 (0.6235) 0.6749 (0.6728) 0.8983 0.9671
fp no FT 0.509 0.4989 0.875 0.957
20 0.5163 (0.5165) 0.5082 (0.5102) 0.879 0.959
30 0.5180 (0.5195) 0.5122 (0.5153) 0.8786 0.9583
40 0.5255 (0.5255) 0.5180 (0.5212) 0.8738 0.9544
50 0.5237 (0.5235) 0.5147 (0.5174) 0.8783 0.9582
CP 17 graph no FT 0.6319 0.6446 0.905 0.971
20 0.9275 (0.9273) 0.9804 (0.9802) 0.903 0.9700
30 0.943 (0.9429) 0.9868 (0.9865) 0.9021 0.9695
40 0.9592 (0.9587) 0.9933 (0.9930) 0.9012 0.9689
50 0.9567 (0.9560) 0.9921 (0.9918) 0.9008 0.9687
fp no FT 0.7008 0.7566 0.880 0.959
20 0.8683 (0.8685) 0.9396 (0.9397) 0.8771 0.9568
30 0.8833 (0.8838) 0.9517 (0.9521) 0.8741 0.9543
40 0.9043 (0.9045) 0.9645 (0.9649) 0.8732 0.9536
50 0.9167 (0.9178) 0.9725 (0.9731) 0.8708 0.952
MC 17 graph no FT 0.5375 0.4894 0.905 0.971
20 0.65 (0.7875) 0.645 (0.7744) 0.9034 0.9695
30 0.6 (0.7625) 0.55 (0.8150) 0.9034 0.9696
40 (0.85) (0.906) 0.8998 0.9679
fp no FT 0.5875 0.515 0.880 0.959
20 0.6 (0.7125) 0.5425 (0.7832) 0.8820 0.9596
30 0.7 (0.8875) 0.61 (0.8994) 0.8734 0.9529
40 (0.825) (0.8513) 0.8786 0.9579
PROTAC-DB 19 graph no FT 0.53 0.4341 0.905 0.971
20 0.5063 (0.515) 0.4341 (0.4602) 0.905 0.9715
30 0.5214 (0.555) 0.4587 (0.5474) 0.9023 0.9697
40 0.5083 (0.51) 0.4583 (0.4781) 0.9048 0.9712
50 0.57 (0.6) 0.522 (0.6142) 0.8993 0.968
60 0.55 (0.615) 0.5097 (0.6389) 0.8984 0.9672
70 0.5167 (0.675) 0.4433 (0.6857) 0.8954 0.9659
80 0.55 (0.685) 0.4725 (0.7281) 0.8975 0.9672
generated graph no FT 0.5594 0.5255 0.905 0.971
20 0.5741 (0.5842) 0.5437 (0.5629) 0.8971 0.9675
30 0.5775 (0.6139) 0.5444 (0.6117) 0.8969 0.9673
40 0.5656 (0.6535) 0.5612 (0.6686) 0.8942 0.9659
50 0.5882 (0.604) 0.5692 (0.589) 0.9012 0.9694
60 0.5823 (0.6485) 0.5756 (0.6566) 0.8887 0.9632
70 0.6034 (0.6287) 0.6177 (0.6527) 0.8907 0.9643
80 0.6316 (0.7376) 0.6357 (0.7752) 0.8847 0.9616
Table S3: Performance metrics showcasing the improvement on specific datasets after fine-tuning at different fine-tuning dataset sizes with overlapping pairs (molecules can appear in multiple pairs). No FT refers to the performance of the pre-trained model on those datasets. $Acc_{pt}$ and $AUC_{pt}$ are determined on the pre-training test set and are based on the score difference (as during training), not the score itself. The values for the fine-tuned versions always show metrics on the dataset excluding the training molecules (with evaluation size $n_{eval}$) and, in brackets, on the full dataset.
set mode $p_{ft}$ $n_{eval}$ $Acc$ $AUC$ $Acc_{pt}$ $AUC_{pt}$
CP 17 graph no FT 7162 0.6319 0.6446 0.905 0.971
20 7122 0.8869 (0.8869) 0.9543 (0.9544) 0.9037 0.9639
50 7062 0.9598 (0.9598) 0.9933 (0.9933) 0.9008 0.9622
100 6962 0.9779 (0.9779) 0.9978 (0.9978) 0.8984 0.9609
500 6162 0.9982 (0.9982) 1.0 (1.0) 0.8773 0.9469
1000 5163 0.9983 (0.9983) 1.0 (1.0) 0.8662 0.941
2000 3371 0.9990 (0.9990) 1.0 (1.0) 0.8497 0.9318
fp no FT 7162 0.7008 0.7566 0.880 0.959
20 7122 0.8489 (0.8489) 0.9201 (0.9207) 0.8761 0.9471
50 7062 0.8833 (0.8830) 0.9517 (0.9487) 0.8721 0.944
100 6962 0.9102 (0.9102) 0.9679 (0.9689) 0.8688 0.9403
500 6162 0.9789 (0.9789) 0.9970 (0.9974) 0.8564 0.9285
1000 5164 0.9901 (0.9901) 0.9989 (0.9993) 0.8415 0.9168
2000 3372 0.9965 (0.9965) 0.9997 (0.9999) 0.8179 0.8981
MC 17 graph no FT 80 0.5375 0.4894 0.905 0.971
20 59 0.6625 (0.6625) 0.5829 (0.6656) 0.8985 0.9607
50 41 0.775 (0.775) 0.7273 (0.8175) 0.8971 0.9593
100 32 0.8125 (0.8125) 0.8155 (0.8244) 0.8995 0.9609
500 20 0.9 (0.9) 0.6813 (0.9375) 0.8698 0.9414
fp no FT 80 0.5875 0.515 0.880 0.959
20 60 0.6375 (0.6375) 0.4733 (0.6313) 0.8797 0.9502
50 44 0.85 (0.85) 0.7789 (0.9106) 0.8669 0.9415
100 29 0.925 (0.9250) 0.8389 (0.9484) 0.8545 0.9312
500 20 0.95 (0.95) 0.8542 (0.9612) 0.8324 0.9118

S6.1.1 CP test set

(a) Graph (GGLGGL)
(b) Morgan counts
Figure S6: ROC curves showcasing the ability to distinguish HS from ES in the CP test set using the graph-based FSscore or the fp-based FSscore. The fine-tuning was done with 50 pairs and these 100 molecules were excluded from the plots.
(a) Graph (GGLGGL)
(b) Morgan counts
Figure S7: Summary of Acc, SN and SP for classifying the molecules from the CP test set as either HS or ES. The fine-tuning was done with 50 pairs and these 100 molecules were excluded from the plots.

S6.1.2 MC test set

(a) Graph (GGLGGL)
(b) Morgan counts
Figure S8: ROC curves showcasing the ability to distinguish HS from ES in the MC test set using the graph-based FSscore or the fp-based FSscore. The fine-tuning was done with 20 pairs and these 40 molecules were excluded from the plots.
(a) Graph (GGLGGL)
(b) Morgan counts
Figure S9: Summary of Acc, SN and SP for classifying the molecules from the MC test set as either HS or ES. The fine-tuning was done with 20 pairs and these 40 molecules were excluded from the plots.

S6.2 meanComplexity dataset

Table S4: Performance metrics showcasing the improvement on the meanComplexity dataset (Sheridan et al. 18) after fine-tuning at different fine-tuning dataset sizes, without (unique = yes) or with (unique = no) overlapping pairs (molecules can appear in multiple pairs). No FT refers to the performance of the pre-trained model on those datasets. $Acc_{pt}$ and $AUC_{pt}$ are determined on the pre-training test set and are based on the score difference (as during training), not the score itself. The values for the fine-tuned versions always show metrics on the dataset excluding the training molecules (with evaluation size $n_{eval}$) and, in brackets, on the full dataset.
dataset unique mode $p_{ft}$ $n_{eval}$ PCC $Acc_{pt}$ $AUC_{pt}$
meanComplexity 18 yes graph no FT 1681 0.51 0.905 0.971
20 1641 0.61 0.9034 0.9701
30 1621 0.63 0.9035 0.9697
40 1601 0.66 0.9033 0.9697
50 1581 0.66 0.9028 0.9689
fp no FT 1681 0.49 0.880 0.959
20 1641 0.58 0.8768 0.9569
30 1621 0.68 0.8668 0.9492
40 1601 0.68 0.8483 0.9353
50 1581 0.67 0.8674 0.9502
no graph no FT 1681 0.51 0.905 0.971
20 1649 0.55 0.8995 0.9615
50 1582 0.51 0.9017 0.963
100 1552 0.73 0.8921 0.9564
500 1291 0.84 0.8765 0.9447
fp no FT 1681 0.49 0.880 0.959
20 1648 0.51 0.8809 0.9513
50 1584 0.59 0.8763 0.9475
100 1538 0.7 0.8632 0.9377
500 1292 0.8 0.8412 0.9212
(a) Pre-trained model
(b) Fine-tuned graph-based model
(c) Fine-tuned fp-based model
Figure S10: Heat maps displaying the correlations (PCC) between all scores including the meanComplexity (mC) obtained on molecules from the meanComplexity dataset. The plots based on fine-tuned models excluded the fine-tuning data (50 pairs).
Figure S11: Correlations (PCC) between the meanComplexity and the FSscore fine-tuned with different dataset sizes and with unique (left) or overlapping (right) pairs for both graph and fingerprint (FP) representations. The dashed line indicates the PCC of the SA score to the meanComplexity, which is the best performing score we tested.

S7 Example structures

S7.1 Hold-out test set

Figure S12: Structures from the hold-out test set scored by the pre-trained graph (GGLGGL) model. Upper row: Low FSscore (low synthetic feasibility). Lower row: High FSscore (high synthetic feasibility).
Figure S13: Structures from the hold-out test set scored by the pre-trained fingerprint (Morgan counts) model. Upper row: Low FSscore (low synthetic feasibility). Lower row: High FSscore (high synthetic feasibility).
Figure S14: Structures from the hold-out test set scored by the pre-trained fingerprint (Morgan boolean) model. Upper row: Low FSscore (low synthetic feasibility). Lower row: High FSscore (high synthetic feasibility).

S7.2 CP and MC test sets

Figure S15: Structures from the CP test set scored by the pre-trained graph (GGLGGL) model. Upper row: Low FSscore (low synthetic feasibility). Lower row: High FSscore (high synthetic feasibility). The labels below indicate the ground truth.
Figure S16: Structures from the CP test set scored by the pre-trained fingerprint (Morgan counts) model. Upper row: Low FSscore (low synthetic feasibility). Lower row: High FSscore (high synthetic feasibility). The labels below indicate the ground truth.
Figure S17: Structures from the CP test set scored by the fine-tuned graph (GGLGGL) model using 50 pairs. Upper row: Low FSscore (low synthetic feasibility). Lower row: High FSscore (high synthetic feasibility). The labels below indicate the ground truth.
Figure S18: Structures from the CP test set scored by the fine-tuned fingerprint (Morgan counts) model using 50 pairs. Upper row: Low FSscore (low synthetic feasibility). Lower row: High FSscore (high synthetic feasibility). The labels below indicate the ground truth.
Figure S19: Structures from the MC test set scored by the pre-trained graph (GGLGGL) model. Upper row: Low FSscore (low synthetic feasibility). Lower row: High FSscore (high synthetic feasibility). The labels below indicate the ground truth.
Figure S20: Structures from the MC test set scored by the pre-trained fingerprint (Morgan counts) model. Upper row: Low FSscore (low synthetic feasibility). Lower row: High FSscore (high synthetic feasibility). The labels below indicate the ground truth.
Figure S21: Structures from the MC test set scored by the fine-tuned graph (GGLGGL) model using 30 pairs. Upper row: Low FSscore (low synthetic feasibility). Lower row: High FSscore (high synthetic feasibility). The labels below indicate the ground truth.
Figure S22: Structures from the MC test set scored by the fine-tuned fingerprint (Morgan counts) model using 30 pairs. Upper row: Low FSscore (low synthetic feasibility). Lower row: High FSscore (high synthetic feasibility). The labels below indicate the ground truth.

S7.3 PROTAC-DB

Figure S23: Structures from PROTAC-DB scored by the pre-trained graph (GGLGGL) model. Upper row: Low FSscore (low synthetic feasibility). Lower row: High FSscore (high synthetic feasibility).
Figure S24: Structures from PROTAC-DB scored by the fine-tuned graph (GGLGGL) model using 50 pairs. Upper row: Low FSscore (low synthetic feasibility). Lower row: High FSscore (high synthetic feasibility).

S7.4 REINVENT case study

Figure S25: Random samples from the three agents trained in the REINVENT case study. Upper row: Samples from the first agent optimized for docking to DRD2. Middle row: Samples from the agent optimized for the FSscore. Lower row: Samples from the agent optimized for the SA score.

S8 Learning curves

Figure S26: Learning curves of the training and validation set for all model variations during pre-training for 250 epochs. The training set was split into 25 subsets and the model was trained on each of them for 10 epochs, which explains the oscillating nature of the curves.
Figure S27: Learning curves of fine-tuning the graph-based (GGLGGL) model with varying learning rate and number of pairs used from the chirality test set. Upper row: Training metrics. Middle row: Validation metrics. Lower row: Metrics on 5,000 samples from the pre-training test set.
Figure S28: Learning curves of fine-tuning the fp-based (Morgan chiral count) model with varying learning rate and number of pairs used from the chirality test set. Upper row: Training metrics. Middle row: Validation metrics. Lower row: Metrics on 5,000 samples from the pre-training test set.
Figure S29: Learning curves of fine-tuning the graph-based (GGLGGL) model with varying learning rate and number of pairs used from the CP test set. Upper row: Training metrics. Middle row: Validation metrics. Lower row: Metrics on 5,000 samples from the pre-training test set.
Figure S30: Learning curves of fine-tuning the fp-based (Morgan count) model with varying learning rate and number of pairs used from the CP test set. Upper row: Training metrics. Middle row: Validation metrics. Lower row: Metrics on 5,000 samples from the pre-training test set.
Figure S31: Learning curves of fine-tuning the graph-based (GGLGGL) model with varying learning rate and number of pairs used from the MC test set. Upper row: Training metrics. Middle row: Validation metrics. Lower row: Metrics on 5,000 samples from the pre-training test set.
Figure S32: Learning curves of fine-tuning the fp-based (Morgan count) model with varying learning rate and number of pairs used from the MC test set. Upper row: Training metrics. Middle row: Validation metrics. Lower row: Metrics on 5,000 samples from the pre-training test set.
Figure S33: Learning curves of fine-tuning the graph-based (GGLGGL) model with varying learning rate and number of pairs used from the PROTAC-DB sample. Upper row: Training metrics. Middle row: Validation metrics. Lower row: Metrics on 5,000 samples from the pre-training test set.
Figure S34: Learning curves of fine-tuning the graph-based (GGLGGL) model with varying learning rate and number of pairs used from the REINVENT case study. Upper row: Training metrics. Middle row: Validation metrics. Lower row: Metrics on 5,000 samples from the pre-training test set.

References

  • 1 Streamlit. https://streamlit.io
  • Choung et al. 2023 Choung, O.-H.; Vianello, R.; Segler, M.; Stiefl, N.; Jiménez-Luna, J. Learning chemical intuition from humans in the loop. 2023; https://chemrxiv.org/engage/chemrxiv/article-details/63f89282897b18336f0c5a55
  • Brody et al. 2022 Brody, S.; Alon, U.; Yahav, E. How Attentive are Graph Attention Networks? International Conference on Learning Representations. 2022
  • Ren et al. 2023 Ren, G.-P.; Wu, K.-J.; He, Y. Enhancing Molecular Representations Via Graph Transformation Layers. Journal of Chemical Information and Modeling 2023, Publisher: American Chemical Society
  • He et al. 2015 He, K.; Zhang, X.; Ren, S.; Sun, J. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. Proceedings of the IEEE international conference on computer vision. 2015; pp 1026–1034
  • Rogers and Hahn 2010 Rogers, D.; Hahn, M. Extended-connectivity fingerprints. Journal of chemical information and modeling 2010, 50, 742–754
  • Rousseeuw 1987 Rousseeuw, P. J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 1987, 20, 53–65
  • Gal and Ghahramani 2016 Gal, Y.; Ghahramani, Z. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. international conference on machine learning. 2016; pp 1050–1059
  • Kingma and Ba 2014 Kingma, D. P.; Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 2014,
  • 10 RDKit: Open-source cheminformatics. http://www.rdkit.org
  • Lowe 2012 Lowe, D. M. Extraction of chemical structures and reactions from the literature. Ph.D. thesis, University of Cambridge, 2012
  • Jiang et al. 2021 Jiang, S.; Zhang, Z.; Zhao, H.; Li, J.; Yang, Y.; Lu, B.-L.; Xia, N. When SMILES Smiles, Practicality Judgment and Yield Prediction of Chemical Reaction via Deep Chemical Language Processing. IEEE Access 2021, 9, 85071–85083, Conference Name: IEEE Access
  • Tu and Coley 2022 Tu, Z.; Coley, C. W. Permutation Invariant Graph-to-Sequence Model for Template-Free Retrosynthesis and Reaction Prediction. Journal of Chemical Information and Modeling 2022, 62, 3503–3513, Publisher: American Chemical Society
  • Sun et al. 2017 Sun, J.; Ajwani, D.; Nicholson, P. K.; Sala, A.; Parthasarathy, S. Breaking cycles in noisy hierarchies. Proceedings of the 2017 ACM on Web Science Conference. 2017; pp 151–160
  • Polykovskiy et al. 2020 Polykovskiy, D.; Zhebrak, A.; Sanchez-Lengeling, B.; Golovanov, S.; Tatanov, O.; Belyaev, S.; Kurbanov, R.; Artamonov, A.; Aladinskiy, V.; Veselov, M.; Kadurin, A.; Johansson, S.; Chen, H.; Nikolenko, S.; Aspuru-Guzik, A.; Zhavoronkov, A. Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models. Frontiers in Pharmacology 2020, 11
  • Sorokina et al. 2021 Sorokina, M.; Merseburger, P.; Rajan, K.; Yirik, M. A.; Steinbeck, C. COCONUT online: Collection of Open Natural Products database. Journal of Cheminformatics 2021, 13, 2
  • Voršilák et al. 2020 Voršilák, M.; Kolář, M.; Čmelo, I.; Svozil, D. SYBA: Bayesian estimation of synthetic accessibility of organic compounds. Journal of Cheminformatics 2020, 12, 35
  • Sheridan et al. 2014 Sheridan, R. P.; Zorn, N.; Sherer, E. C.; Campeau, L.-C.; Chang, C. Z.; Cumming, J.; Maddess, M. L.; Nantermet, P. G.; Sinz, C. J.; O’Shea, P. D. Modeling a Crowdsourced Definition of Molecular Complexity. Journal of Chemical Information and Modeling 2014, 54, 1604–1616, Publisher: American Chemical Society
  • Weng et al. 2022 Weng, G.; Cai, X.; Cao, D.; Du, H.; Shen, C.; Deng, Y.; He, Q.; Yang, B.; Li, D.; Hou, T. PROTAC-DB 2.0: an updated database of PROTACs. Nucleic Acids Research 2022, gkac946
  • Blaschke et al. 2020 Blaschke, T.; Arús-Pous, J.; Chen, H.; Margreitter, C.; Tyrchan, C.; Engkvist, O.; Papadopoulos, K.; Patronov, A. REINVENT 2.0: An AI Tool for De Novo Drug Design. Journal of Chemical Information and Modeling 2020, 60, 5918–5922, Publisher: American Chemical Society
  • Guo and Schwaller 2023 Guo, J.; Schwaller, P. Augmented Memory: Capitalizing on Experience Replay to Accelerate De Novo Molecular Design. 2023; http://arxiv.org/abs/2305.16160, arXiv:2305.16160 [cs, q-bio]
  • Koes et al. 2013 Koes, D. R.; Baumgartner, M. P.; Camacho, C. J. Lessons Learned in Empirical Scoring with smina from the CSAR 2011 Benchmarking Exercise. Journal of Chemical Information and Modeling 2013, 53, 1893–1904, Publisher: American Chemical Society
  • Rappe et al. 1992 Rappe, A. K.; Casewit, C. J.; Colwell, K. S.; Goddard, W. A. I.; Skiff, W. M. UFF, a full periodic table force field for molecular mechanics and molecular dynamics simulations. Journal of the American Chemical Society 1992, 114, 10024–10035, Publisher: American Chemical Society
  • Genheden et al. 2020 Genheden, S.; Thakkar, A.; Chadimová, V.; Reymond, J.-L.; Engkvist, O.; Bjerrum, E. AiZynthFinder: a fast, robust and flexible open-source software for retrosynthetic planning. Journal of Cheminformatics 2020, 12, 70
  • 25 Chemspace search. https://chem-space.com/search
  • Virtanen et al. 2020 Virtanen, P.; Gommers, R.; Oliphant, T. E.; Haberland, M.; Reddy, T.; Cournapeau, D.; Burovski, E.; Peterson, P.; Weckesser, W.; Bright, J.; van der Walt, S. J.; Brett, M.; Wilson, J.; Millman, K. J.; Mayorov, N.; Nelson, A. R. J.; Jones, E.; Kern, R.; Larson, E.; Carey, C. J.; Polat, İ.; Feng, Y.; Moore, E. W.; VanderPlas, J.; Laxalde, D.; Perktold, J.; Cimrman, R.; Henriksen, I.; Quintero, E. A.; Harris, C. R.; Archibald, A. M.; Ribeiro, A. H.; Pedregosa, F.; van Mulbregt, P.; SciPy 1.0 Contributors SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods 2020, 17, 261–272
  • Buitinck et al. 2013 Buitinck, L.; Louppe, G.; Blondel, M.; Pedregosa, F.; Mueller, A.; Grisel, O.; Niculae, V.; Prettenhofer, P.; Gramfort, A.; Grobler, J.; Layton, R.; VanderPlas, J.; Joly, A.; Holt, B.; Varoquaux, G. API design for machine learning software: experiences from the scikit-learn project. ECML PKDD Workshop: Languages for Data Mining and Machine Learning. 2013; pp 108–122