
Conditional Latent Space Molecular Scaffold Optimization for Accelerated Molecular Design

Onur Boyar [email protected]
Department of Mechanical Systems Engineering
Nagoya University
Hiroyuki Hanada [email protected]
RIKEN
Ichiro Takeuchi [email protected]
Department of Mechanical Systems Engineering
Nagoya University
RIKEN
Abstract

The rapid discovery of new chemical compounds is essential for advancing global health and developing treatments. While generative models show promise in creating novel molecules, challenges remain in ensuring the real-world applicability of these molecules and in finding them efficiently. To address this, we introduce Conditional Latent Space Molecular Scaffold Optimization (CLaSMO), which combines a Conditional Variational Autoencoder (CVAE) with Latent Space Bayesian Optimization (LSBO) to modify molecules strategically while maintaining similarity to the original input. Our LSBO setting improves the sample efficiency of our optimization, and our modification approach yields molecules with a higher chance of real-world applicability. CLaSMO explores substructures of molecules in a sample-efficient manner by performing BO in the latent space of a CVAE conditioned on the atomic environment of the molecule to be optimized. Our experiments demonstrate that CLaSMO efficiently enhances target properties with minimal substructure modifications, achieving state-of-the-art results with a smaller model and dataset than existing methods. We also provide an open-source web application that enables chemical experts to apply CLaSMO in a Human-in-the-Loop setting.

1 Introduction

The accelerated discovery of chemical compounds represents a crucial challenge with the potential to revolutionize global health, offering new ways to combat diseases and viruses. The ability to efficiently discover and develop new chemical compounds could lead to groundbreaking treatments and therapies, addressing some of the most pressing health issues of our time. As the importance of this field grows, so too does the research focused on finding effective solutions. Over the past few years, artificial intelligence (AI) has emerged as a powerful tool in this endeavor. The combination of increased computational power and advancements in generative modeling has brought us closer than ever to achieving significant breakthroughs in accelerated discovery.

Generative models offer various approaches to exploring and creating new chemical compounds. A common strategy involves training a generative model on a comprehensive database of chemical compounds. Once trained, the model can generate entirely new compounds (Gómez-Bombarelli et al., 2018, Tripp et al., 2020, Griffiths & Hernández-Lobato, 2020, Boyar & Takeuchi, 2024, Grosnit et al., 2021, Boyar et al., 2024). This approach opens the door to discovering novel molecules that are unlike any currently known, potentially revealing a vast, untapped universe of chemical possibilities. However, while the generation of these novel molecules is exciting, their practical applicability is often constrained. The challenges lie in synthesizing these novel molecules and in the limited understanding of their properties, which makes it difficult for domain experts to assess their viability Lim et al. (2020).

To address these challenges, another strategy for designing new molecules focuses on modifying/editing existing compounds using generative models Bradshaw et al. (2019), Lim et al. (2020), reinforcement learning Gottipati et al. (2020), genetic algorithms Jensen (2019), or from the domain experts themselves by trial and error Bemis & Murcko (1996), Schreiber (2000), Welsch et al. (2010). These approaches are more likely to produce synthesizable molecules because the base molecule is known to exist and, therefore, can be more tractable for real-world applications. However, even with this strategy, a significant obstacle remains: sample efficiency, i.e., how efficiently a method finds a promising molecular modification with a limited number of molecular property evaluations. Evaluating the properties of a chemical compound is a time-consuming and costly process, and many existing methods in the literature require numerous trials, making them less practical for accelerated discovery. Consequently, there is a critical need for a methodology that can generate target chemical compounds in a more sample-efficient manner.

In this study, we introduce the Conditional Latent Space Molecular Scaffold Optimization (CLaSMO) framework, designed to address two critical challenges in chemical compound discovery: real-world applicability and sample efficiency. CLaSMO combines a Conditional Variational Autoencoder (CVAE) Higgins et al. (2016) with Latent Space Bayesian Optimization (LSBO) Gómez-Bombarelli et al. (2018) to strategically modify input molecules and optimize their chemical properties.

Figure 1: An example of the CLaSMO framework updating a scaffold for QED optimization. CLaSMO identifies optimal regions in the CVAE latent space and selects bonding points on the scaffold. Chemical features from these points are used to guide the generation of substructures, which are then integrated into the scaffold (A, B, C, D) through small, targeted modifications to improve molecular properties.

In drug discovery, a common strategy for generating synthesizable molecules is to work with molecular scaffolds—key substructures that serve as the foundation for chemical modifications and drug design Bemis & Murcko (1996). Building on this approach, our framework integrates small substructures into these scaffolds to improve key molecular properties. For modifications to be both effective and synthesizable, it is crucial that the generative model understands how new substructures bond with the scaffold and enhance its properties. To achieve this, we condition substructure generation on the scaffold's chemical features using a CVAE. Our novel data preparation and training strategy enables the CVAE to generate substructures that align with specific atomic environments. We further optimize this process using LSBO, which efficiently explores both the latent space of the CVAE and the scaffold's chemical features. This allows CLaSMO to effectively select regions for modification and generate substructures that are chemically meaningful. By tracking molecular similarity between the initial scaffold and the optimized molecule, CLaSMO ensures that modifications are realistic and applicable in real-world settings. Ultimately, this approach accelerates the discovery of novel, synthesizable compounds with improved properties, as illustrated in Figure 1.

We evaluate our approach in two scenarios: optimizing the Quantitative Estimate of Drug-likeness (QED) in Section 5.2 and improving docking simulation scores in Section 5.3. QED assesses how likely a compound is to become a viable drug, while docking scores measure how well a molecule binds to a protein target using a simulator Schrödinger (2023). Our experiments demonstrate that CLaSMO improves both QED and docking scores in a sample-efficient manner, highlighting its effectiveness. The contributions of this study are as follows:

1. We propose CLaSMO, a pioneering CVAE- and LSBO-based molecule modification algorithm for molecular design. We use a novel data preparation strategy that enables the CVAE to learn how substructures bond with target molecules, providing tailored generations.

2. We show that CLaSMO improves target properties sample-efficiently while keeping the optimized molecules structurally similar to the input scaffolds, increasing the likelihood of identifying synthesizable compounds with desirable properties.

3. We demonstrate that CLaSMO achieves state-of-the-art results using a significantly smaller model and training dataset than its competitors, highlighting its efficiency.

4. We open-source a web application, https://clasmo.streamlit.app/, that enables interactive optimization of input molecules via CLaSMO and allows chemical experts to choose which region of the input molecule to modify, enabling Human-in-the-Loop optimization settings.

2 Related Works

Molecular design strategies can broadly be divided into two categories: from-scratch-generation of molecules and modification-based approaches. Both categories have made significant strides in recent years, yet they also face unique challenges, particularly regarding real-world applicability and sample efficiency.

From-scratch-generation approaches create entirely new molecules by conducting a search to optimize the target property. A seminal work by Gómez-Bombarelli et al. (2018) introduced a latent space optimization-based methodology, using a VAE to generate novel compounds by navigating the latent space of molecular representations. LSBO builds on this by efficiently reducing the number of expensive black-box evaluations required for molecular optimization, enabling the discovery of compounds with desirable properties in a continuous latent space. Since then, numerous studies have further refined and expanded the LSBO framework, focusing on method development and practical applications Grosnit et al. (2021), Tripp et al. (2020), Maus et al. (2022), Griffiths & Hernández-Lobato (2020), Boyar & Takeuchi (2024), Boyar et al. (2024). However, like many other generation-from-scratch methods, such methodologies struggle with real-world applicability—i.e., the difficulty of synthesizing the generated molecules in real-world settings Lim et al. (2020). Other generation methods, such as genetic algorithms Jensen (2019), Grammar VAE Kusner et al. (2017), and Junction Tree (JT) VAE Jin et al. (2018), aim to improve the chemical validity of generated structures. Recent advancements like GP-MOLFORMER Ross et al. (2024) utilize large language model-like approaches for molecular design, but the real-world applicability challenge remains a major limitation across these methodologies.

Modification-based approaches, on the other hand, focus on adding substructures to existing molecules or scaffolds, often leading to more synthesizable and interpretable designs. Methods like Scaffold-GGM Lim et al. (2020) employ a graph generative model to modify molecular scaffolds, improving properties with a higher chance of obtaining synthesizable molecules. Weller & Rohs (2024) introduce DrugHIVE, a deep hierarchical variational autoencoder that leverages scaffold modification to generate novel molecular compounds. There are many other scaffold-based optimization methodologies Li et al. (2019), Langevin et al. (2020), not limited to generative modeling Schreiber (2000), Welsch et al. (2010), Miao et al. (2011). Techniques like Bradshaw et al. (2019)'s model ensure chemical validity in molecular modifications, and Gottipati et al. (2020)'s PGFS model uses reinforcement learning to guide additions to the base molecule. These approaches avoid the low real-world viability problem faced by generation-from-scratch methods, as they build upon known molecular scaffolds; however, they lack advanced optimization methodologies that take sample efficiency into account.

CLaSMO combines the strengths of both categories, leveraging LSBO in the latent space of a CVAE for improved sample efficiency while focusing on scaffold-based modifications. Unlike current LSBO-based molecular design methodologies in the literature, CLaSMO does not generate molecules from scratch but instead optimizes molecular properties by adding substructures to existing scaffolds. This approach mitigates the real-world applicability problem and increases the chance of obtaining molecules that are both effective and practical for real-world synthesis. By incorporating a targeted scaffold-modification strategy, CLaSMO offers a balanced and efficient solution to molecular design.

3 Preliminaries and Problem Setup

In this section, we provide preliminary knowledge on CVAEs, LSBO, and scaffolds. We then discuss the challenges of property optimization and scaffold modifications.

3.1 Conditional Variational Autoencoders (CVAEs)

A VAE Kingma & Welling (2014) consists of an encoder $f_{\phi}^{\text{enc}}:\mathcal{X}\to\mathcal{Z}$ and a decoder $f_{\theta}^{\text{dec}}:\mathcal{Z}\to\mathcal{X}$, where $\mathcal{X}$ represents the input space and $\mathcal{Z}$ the latent space. CVAEs extend the VAE framework by incorporating an additional condition vector $\bm{c}$ into the latent variable model, facilitating the controlled generation of new instances. In the CVAE architecture, the encoder $q_{\phi}(\bm{z}|\bm{x},\bm{c})$ maps an input $\bm{x}$ and a condition $\bm{c}$ to a latent representation $\bm{z}$. Simultaneously, the decoder $p_{\theta}(\bm{x}|\bm{z},\bm{c})$ reconstructs $\bm{x}$ using both $\bm{z}$ and $\bm{c}$. The training of CVAEs is formulated as the minimization of the negative conditional variational lower bound:

\mathcal{L}(\theta,\phi;\bm{x},\bm{c})=-\mathbb{E}_{q_{\phi}(\bm{z}|\bm{x},\bm{c})}\left[\log p_{\theta}(\bm{x}|\bm{z},\bm{c})\right]+\mathrm{KL}\left(q_{\phi}(\bm{z}|\bm{x},\bm{c})\parallel p(\bm{z})\right), (1)

where $\mathrm{KL}$ denotes the Kullback-Leibler divergence. In this model, the prior distribution $p(\bm{z})$ over the latent variables is typically assumed to be a standard normal distribution, $\mathcal{N}(0,I)$. This assumption simplifies the learning process by standardizing the latent space, ensuring that the encoder learns a distribution that closely aligns with the prior, thus enhancing the generative capability of the decoder conditioned on specific contexts.

3.2 Latent Space Bayesian Optimization (LSBO)

In BO, we start with numerous unlabeled instances $\{\bm{x}_{i}\}_{i\in[\mathcal{U}]}$ and a smaller set of labeled instances $\{(\bm{x}_{i},y_{i})\}_{i\in[\mathcal{L}]}$, where an input $\bm{x}_{i}\in\mathcal{X}$ represents a chemical compound and a label $y_{i}\in\mathcal{Y}\subseteq\mathbb{R}$ indicates one of its properties, such as a docking score. BO seeks to optimize a costly black-box function $f^{\text{BB}}:\mathcal{X}\to\mathcal{Y}$, which, in molecular design problems, corresponds to obtaining the physical properties of chemical compounds through experiments or time-consuming simulations. The goal is to maximize $f^{\text{BB}}$ with minimal evaluations, typically using a Gaussian Process (GP) surrogate, trained on the labeled instances $\mathcal{L}$, to predict the function over $\mathcal{X}$. BO uses the surrogate to select an input $\bm{x}$ that may yield values surpassing the current maximum $\max_{i\in\mathcal{L}}y_{i}$. However, building a GP surrogate in high-dimensional spaces like chemical compounds is challenging. LSBO tackles this by employing a VAE/CVAE trained on the unlabeled instances $\mathcal{U}$ to reduce dimensionality, encoding instances in $\mathcal{X}$ into a lower-dimensional latent space $\mathcal{Z}$. This simplifies surrogate modeling and optimization because $\mathcal{Z}$ has lower dimension than $\mathcal{X}$. During LSBO iterations, the acquisition function applied to the GP's predictions selects new points in $\mathcal{Z}$ to evaluate. The chosen latent variable $\bm{z}_{i'}$ is decoded into a new input $\bm{x}_{i'}=f_{\theta}^{\text{dec}}(\bm{z}_{i'})$. This new instance is evaluated by $f^{\text{BB}}$, and the results update $\mathcal{L}$ and refine the GP model. This cycle repeats until optimal results are achieved or resources are exhausted.
In contexts like molecular design, LSBO aims to discover chemical compounds with optimal properties by efficiently navigating the reduced latent space.

Figure 2: Examples of scaffold extraction from whole molecules. In both the upper and lower rows, side chains are removed, leaving the core structure of the molecule. These resulting scaffolds act as starting points for novel chemical design.

3.3 Scaffolds

Scaffolds Bemis & Murcko (1996) are the stable core structures within molecules that serve as the framework for chemical modifications in drug design. Scaffolds retain the essential biological activity of the molecule. They play a crucial role in molecular design by providing a foundation for chemical modifications aimed at optimizing properties like QED. Researchers often use scaffolds to systematically explore chemical variations Schreiber (2000), Welsch et al. (2010), which can lead to the discovery of new compounds with improved properties.

In this study, we follow the scaffold extraction method from Bemis & Murcko (1996), where non-essential components like side chains are removed, leaving the core structure. This extracted scaffold serves as a starting point for further modifications, allowing efficient exploration of chemical space. By focusing on scaffolds, molecular design becomes more streamlined, increasing the likelihood of identifying novel, synthesizable compounds with the desired biological activity. Examples of whole molecule and scaffold pairs are provided in Fig. 2.

3.4 Problem Definition

We denote the scaffold that serves as the base of the modification as $\bm{S}$, and the modified molecule as $\bm{S}'$. Our goal is to efficiently find the modification that maximizes the molecular property $\mathcal{P}$, evaluated by $f^{\text{BB}}(\bm{S}')$, while keeping the difference between $\bm{S}$ and $\bm{S}'$ small. Directly optimizing over all possible $\bm{S}'$ to find the best modification that maximizes $f^{\text{BB}}(\bm{S}')$ is impractical, as it involves high complexity and costly evaluations of $f^{\text{BB}}$.

The primary challenge in optimizing molecular scaffolds lies in (i) determining the optimal bonding point on the base scaffold $\bm{S}$, and (ii) selecting the appropriate substructure to add at that bonding point to ensure meaningful improvement in the desired property $\mathcal{P}$. A molecular scaffold $\bm{S}$, composed of several atoms $p_{1},p_{2},\ldots,p_{k}$, may have atoms with remaining capacity to form additional chemical bonds. These atoms serve as potential candidates for bonding with newly generated substructures. Therefore, the task involves not only selecting the right substructure but also identifying the most suitable bonding point $p_{i}$ to optimize scaffold properties. This adds complexity, as the need for precise modifications must be balanced against high-dimensional search spaces and evaluation costs. Consequently, a more efficient approach is required to explore scaffold modifications effectively while minimizing the number of evaluations.

To address this, the problem can be reframed as an optimization task in a reduced latent space $\mathcal{Z}$, obtained through a CVAE. In this space, each point $\bm{z}\in\mathcal{Z}$ corresponds to a potential substructure that can be integrated into the scaffold. By encoding the molecular substructures into this lower-dimensional space $\mathcal{Z}\subseteq\mathbb{R}^{d}$, the search becomes more tractable. The objective is to find the optimal latent representation $\bm{z}^{*}$, conditioned on the optimal bonding point $p_{i}^{*}$, that together maximize the desired property $\mathcal{P}$ when the generated substructure $\bm{s}'\leftarrow f^{\text{dec}}(\bm{z})$ is added to the scaffold. Let us denote this modification as $\bm{S}'\leftarrow g(\bm{S}\oplus\bm{s}',p_{i})$, where the function $g(\cdot)$ adds the substructure $\bm{s}'$ to the scaffold $\bm{S}$ at $p_{i}$. The optimization problem is then formulated as:

\bm{z}^{*},p_{i}^{*}=\arg\max_{\bm{z}\in\mathcal{Z},\,p_{i}\in B(\bm{S})}f^{\text{BB}}(\bm{S}')=\arg\max_{\bm{z}\in\mathcal{Z},\,p_{i}\in B(\bm{S})}f^{\text{BB}}\left(g(\bm{S}\oplus f^{\text{dec}}(\bm{z}),p_{i})\right), (2)

where $B(\bm{S})$ is the set of possible bonding points on the scaffold $\bm{S}$. In Section 4.2, we demonstrate that atomic features at $p_{i}$ are used as condition vectors to generate new substructures, enabling targeted substructure generation for atom $p_{i}$.
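As a toy illustration of the search space in Eq. (2), the sketch below models a scaffold as a list of atom symbols, $B(\bm{S})$ as a set of attachment indices, and the black-box property as a simple scoring stub; all names (`g`, `f_bb`) and the string-based molecule model are illustrative assumptions, not the actual chemistry used in the paper, and the latent search is replaced by brute force over a small candidate set.

```python
import itertools

# Toy model (illustrative only): a "scaffold" is a list of atom symbols,
# bonding_points stands in for B(S), and g attaches a substructure at p_i.
scaffold = ["C", "N", "C", "O"]          # atoms p_1..p_4
bonding_points = [0, 2]                   # indices with free valence (assumed)
substructures = ["CC", "F", "OC"]         # candidate decodings f_dec(z)

def g(scaffold, sub, p_i):
    """Attach substructure `sub` to the scaffold right after atom index p_i."""
    return scaffold[:p_i + 1] + [sub] + scaffold[p_i + 1:]

def f_bb(molecule):
    """Toy stand-in for the expensive black-box: reward oxygens, penalize size."""
    flat = "".join(molecule)
    return 2.0 * flat.count("O") - 0.1 * len(flat)

# Brute-force search over the (substructure, bonding point) product,
# a stand-in for the latent-space search in Eq. (2).
best = max(itertools.product(substructures, bonding_points),
           key=lambda sp: f_bb(g(scaffold, sp[0], sp[1])))
print(best)
```

In the real method, the candidate set is not enumerable: substructures are decoded from a continuous latent space, which is why a sample-efficient surrogate-based search is needed.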

3.4.1 Controlling Molecular Similarity

Current modification-based methods often fail to account for how changes impact the molecular similarity between the original scaffold $\bm{S}$ and the updated scaffold $\bm{S}'$, or any structure in general. Adding substructures typically increases molecular weight, which can hinder real-world applicability, especially beyond 500 Daltons, as indicated by Lipinski's Rule of Five Lipinski et al. (2001). Higher-molecular-weight compounds are more difficult to synthesize, making them less suitable for molecular design. Additionally, it is sometimes necessary to ensure that modifications result in only minor adjustments to avoid drastic changes.

Thus, a key challenge is to guide the optimization process by considering molecular similarity, ensuring that the modified molecules remain structurally close to the original scaffold. Such a framework can increase the likelihood of obtaining synthesizable compounds by limiting divergence from known molecules. In Section 5, we show that our method effectively solves the optimization problem in Eq. (2) while ensuring molecular similarity between $\bm{S}$ and $\bm{S}'$.

4 Proposed Method

Our proposed CLaSMO framework comprises two key components: the CVAE and the LSBO algorithm. However, an essential first step in our approach is the data preparation required to train the CVAE. In this section, we will begin by outlining the data preparation process, followed by an explanation of the CVAE and the CLaSMO methodology.

4.1 Data Preparation

Figure 3: Illustration of the BRICS algorithm and its integration with the CVAE. The molecule in A is decomposed into substructures using the BRICS algorithm (B), with specific breaking points highlighted. Atomic environment features are extracted from these points to provide critical information about the bonding environment. Embeddings of these features are used as condition vectors, which are concatenated both at the encoder input along with the substructures and at the latent space of the CVAE (C), guiding the generation of substructures that are compatible with the scaffold.

Our proposed method requires a uniquely tailored dataset because no existing dataset in the literature fully meets the specific needs of our approach. To create this dataset, we developed an approach based on BRICS (Breaking Retrosynthetically Interesting Chemical Substructures) Degen et al. (2008). BRICS is an algorithm designed to decompose organic molecules into smaller, synthetically feasible substructures by identifying breaking points within a molecule's structure based on retrosynthetic chemical rules. These breaking points, also known as division points, are the connections between subgraphs within the molecule that BRICS identifies. The algorithm systematically breaks down a molecule $\bm{M}$ into $k$ substructures; Fig. 3 demonstrates this procedure.

The division points found by BRICS are of particular importance because they serve both as the points where the molecule is split into substructures and as the potential bonding sites where these substructures can be reattached. They therefore provide valuable information about what kinds of bonds each substructure can form. This dual role makes the division points a crucial piece of information for our generative model. By preserving the properties and features of atoms at these division points, we capture the chemical context that dictates how substructures can bond with other parts of a molecule. The atomic features we consider are atom type, hybridization, valence, formal charge, degree, and ring membership. Detailed explanations of these features are provided in Appendix A.1. When used as condition vectors in our CVAE, these features are essential for guiding the generation of substructures that can successfully bond with the scaffold, as they provide a detailed understanding of the chemical environment at each division point, such as the atom type and whether the selected atom has capacity for forming additional bonds. Our dataset for CVAE training therefore includes the substructures obtained via the BRICS algorithm and the six atomic features of the atom at each breaking point. The BRICS algorithm, the atomic feature extraction points, and their integration with the CVAE are illustrated in Fig. 3.
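A minimal sketch of encoding the six listed atomic features at a division point into a numeric condition vector; the feature vocabularies, values, and the `condition_vector` helper are illustrative assumptions (a real pipeline would read these features from a cheminformatics toolkit such as RDKit).

```python
# Illustrative encoding of the six atomic features into a condition vector.
# Vocabularies are toy assumptions, not the paper's actual encoding.
ATOM_TYPES = {"C": 0, "N": 1, "O": 2, "S": 3}
HYBRIDIZATIONS = {"SP": 0, "SP2": 1, "SP3": 2}

def condition_vector(atom):
    """Map a division-point atom description to a 6-dimensional vector:
    atom type, hybridization, valence, formal charge, degree, ring membership."""
    return [
        float(ATOM_TYPES[atom["type"]]),
        float(HYBRIDIZATIONS[atom["hybridization"]]),
        float(atom["valence"]),
        float(atom["formal_charge"]),
        float(atom["degree"]),
        1.0 if atom["in_ring"] else 0.0,
    ]

# Toy division-point atom: an SP3 nitrogen inside a ring.
atom = {"type": "N", "hybridization": "SP3", "valence": 3,
        "formal_charge": 0, "degree": 2, "in_ring": True}
vec = condition_vector(atom)
print(vec)  # 6-dimensional raw feature vector, later embedded by the Autoencoder
```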

4.2 Substructure CVAE

Our CVAE consists of an encoder, parameterized by $f^{\text{enc}}_{\phi}$, and a decoder, parameterized by $f^{\text{dec}}_{\theta}$. The encoder takes the substructure $\bm{s}$ and the associated condition vector $\bm{c}$ and maps this input into a latent space, producing a latent representation $\bm{z}$. The decoder then takes a point $\bm{z}$ from this latent space, along with the condition vector $\bm{c}$, and generates a substructure $\bm{s}'$ that is conditioned on the given atomic properties. The loss function of the proposed CVAE is defined as:

\mathcal{L}(\bm{\phi},\bm{\theta};\bm{s},\bm{c})=\|\bm{s}-\mathbb{E}_{q_{\bm{\phi}}(\bm{z}|\bm{s},\bm{c})}[p_{\bm{\theta}}(\bm{s}|\bm{z},\bm{c})]\|^{2}+\beta\cdot D_{\text{KL}}(q_{\bm{\phi}}(\bm{z}|\bm{s},\bm{c})\parallel p(\bm{z})), (3)

where $q_{\bm{\phi}}(\bm{z}|\bm{s},\bm{c})$ is the encoder, $p_{\bm{\theta}}(\bm{s}|\bm{z},\bm{c})$ is the decoder, and $\beta$ balances the reconstruction and KL divergence terms Higgins et al. (2016). The condition vectors allow the CVAE to learn how different substructures interact with specific atomic environments, building a deeper understanding of their bonding behavior. Figure 3B shows the input substructures used in CVAE training, and Fig. 3C visualizes the CVAE architecture along with the condition vectors.
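To make Eq. (3) concrete, the sketch below computes the β-weighted loss for a diagonal-Gaussian encoder, using the standard closed-form KL divergence to the $\mathcal{N}(0,I)$ prior; the toy vectors and the β value are assumptions for illustration only.

```python
import math

def cvae_loss(s, s_recon, mu, log_var, beta=0.5):
    """Reconstruction MSE plus beta-weighted KL(q(z|s,c) || N(0, I)).

    `mu` and `log_var` parameterize the diagonal-Gaussian encoder output;
    the closed-form KL term is -0.5 * sum(1 + log_var - mu^2 - exp(log_var)).
    """
    recon = sum((a - b) ** 2 for a, b in zip(s, s_recon))
    kl = -0.5 * sum(1.0 + lv - m ** 2 - math.exp(lv)
                    for m, lv in zip(mu, log_var))
    return recon + beta * kl

# Toy example: 3-dim substructure encoding, 2-dim latent posterior.
# With mu = 0 and log_var = 0 the posterior equals the prior, so KL = 0.
loss = cvae_loss(s=[1.0, 0.0, 1.0], s_recon=[0.9, 0.1, 1.0],
                 mu=[0.0, 0.0], log_var=[0.0, 0.0])
print(round(loss, 4))  # 0.02: pure reconstruction error
```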

After training, providing the decoder with a condition vector that specifies the atomic environment of the target scaffold, together with a latent vector, yields substructures tailored to bond effectively with that scaffold. This conditioning mechanism ensures that the generated substructures are not random but specifically designed to fit the scaffold's atomic environment, improving the efficiency of scaffold modifications and the likelihood of successful bonding. In the proposed CLaSMO method, we jointly optimize the bonding position $p_{i}$ and the latent vector $\bm{z}$, conditioned on the atomic properties $\bm{c}$ of the bonding site, in order to optimize both the bonding position and the substructure added at that position.

4.2.1 Condition Vector Embeddings

In Section 4.1, we detailed our data preparation process and the extraction of six atomic environmental features, resulting in a 6-dimensional condition vector. As we will explain in Section 5.1, the relatively simple nature of the substructures used in CVAE training allows for a latent dimension much lower than the raw condition vector dimension of 6. To align with this simplicity and ensure efficient representation, we employ an Autoencoder model to generate $d$-dimensional embeddings of the atomic features. These embeddings preserve the key characteristics of the atomic environments and capture the relationships among distinct atomic features while reducing dimensionality, which lowers computational complexity and improves the model's ability to generalize. After training, the encoder of the Autoencoder, $f^{\text{enc}}_{\text{CondEmd}}$, provides the final condition vector $\bm{c}$ inputted to the CVAE by encoding the six-dimensional atomic environmental feature vector. This process is visualized in Fig. 3C.

4.3 CLaSMO

In CLaSMO, we perform a targeted search in the latent space $\mathcal{Z}\subseteq\mathbb{R}^{d}$ of the CVAE, where the decoding of each point $\bm{z}$ represents a potential substructure to be added to the scaffold. The challenge is to modify the scaffold $\bm{S}$ by selecting a substructure and integrating it at an appropriate bonding point $p_{i}$, optimizing the desired molecular property $\mathcal{P}$. Therefore, the search space $\Omega$ we consider in our optimization tasks is defined as the Cartesian product of the latent space and the set of possible bonding points on the scaffold:

\Omega=\mathbb{R}^{d}\times\{p_{1},p_{2},\ldots,p_{n}\}.

Within this search space, the goal is to modify the scaffold $\bm{S}$ to maximize the BB function $f^{\text{BB}}(\bm{S}')$.

However, although our main goal is optimizing the target property, we also aim to keep the similarity between $\bm{S}$ and $\bm{S}'$ at a certain level. Therefore, in CLaSMO, we apply a similarity constraint using the Dice similarity Dice (1945) metric to compare the input scaffold $\bm{S}$ with the modified molecule $\bm{S}'$ before evaluating the BB function. To measure the similarity between $\bm{S}$ and $\bm{S}'$, we use Morgan fingerprints (Morgan, 1965, Rogers & Hahn, 2010). The Morgan fingerprint of a molecule is a binary vector in which each bit indicates the presence or absence of a specific substructure, making such fingerprints effective and popular for comparing molecular similarities. Bajusz et al. (2015) discuss that Dice similarity is among the best metrics for assessing similarity between molecules using Morgan fingerprints. Using these fingerprints, the similarity between $\bm{S}$ and $\bm{S}'$ is measured by computing the Dice similarity of their Morgan fingerprint vectors, defined as:

\text{DICE}(\bm{S},\bm{S}')=\frac{2|\bm{M}_{\bm{S}}\cap\bm{M}_{\bm{S}'}|}{|\bm{M}_{\bm{S}}|+|\bm{M}_{\bm{S}'}|},

where $\bm{M}_{\bm{S}}$ and $\bm{M}_{\bm{S}'}$ are the Morgan fingerprint vectors of $\bm{S}$ and $\bm{S}'$, respectively. Dice similarity, ranging from 0 (no overlap) to 1 (identical), helps us track structural changes during optimization.
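A minimal sketch of this similarity check, treating each fingerprint as the set of its on-bit indices; the bit patterns below are toy values, not real Morgan fingerprints.

```python
def dice_similarity(fp_a, fp_b):
    """Dice similarity of two binary fingerprints given as sets of on-bit indices."""
    if not fp_a and not fp_b:
        return 1.0  # two empty fingerprints are treated as identical
    return 2.0 * len(fp_a & fp_b) / (len(fp_a) + len(fp_b))

# Toy on-bit sets standing in for the Morgan fingerprints of S and S'.
fp_scaffold = {1, 4, 7, 9}
fp_modified = {1, 4, 7, 9, 12}  # a small addition keeps similarity high
print(dice_similarity(fp_scaffold, fp_modified))
```

In the method itself this score is compared against the threshold τ before the black-box function is ever evaluated, so dissimilar candidates are rejected cheaply.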

The final optimization objective, incorporating both the search for optimal substructures and the similarity constraint, is given by:

\bm{z}^{*},p_{i}^{*}=\arg\max_{(\bm{z},p_{i})\in\Omega}f^{\text{BB}}(\bm{S}') (4)
\text{subject to:}\quad\text{DICE}(\bm{S},\bm{S}')\geq\tau,

where $\bm{S}'\leftarrow g(\bm{S}\oplus f^{\text{dec}}(\bm{z}^{*},\bm{c}),\,p_{i}^{*})$. With this approach, CLaSMO efficiently navigates the latent space, optimizing molecular properties while accepting modifications only if $\text{DICE}(\bm{S},\bm{S}')\geq\tau$, effectively controlling the degree of divergence from the input scaffold.

For the joint optimization of substructures and bonding positions, we consider a GP model $(\bm{z},p)\mapsto y'_{\Delta}$, where $\bm{z}$ is the latent vector representing a substructure, $p$ is the bonding position of the modification, and $y'_{\Delta}:=y'-y$ represents the improvement in the property, with $y$ and $y'$ indicating the properties of molecules $\bm{S}$ and $\bm{S}'$, respectively. To prepare the training dataset $\mathcal{D}$ for the GP, we randomly sample substructures from random latent vectors $\bm{z}$, each paired with a randomly selected bonding region $p$ on the scaffold $\bm{S}$, and evaluate the $y'_{\Delta}$ obtained from these additions (e.g., in QED optimization, the QED score difference between $\bm{S}$ and $\bm{S}'$). The resulting triplets $(\bm{z},p,y'_{\Delta})$ are used in GP training. This setup allows the GP model to learn the relationship between latent space representations, atoms in the input scaffold, and the resulting property changes, guiding the optimization toward regions of the latent space that are more likely to yield beneficial modifications to the scaffold.
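The construction of the initial GP dataset described above can be sketched as follows; the latent dimension, the bonding-point indices, the `property_delta` scoring stub, and the sample count are all illustrative assumptions standing in for real decodings and black-box evaluations.

```python
import random

random.seed(0)  # reproducible toy sampling

LATENT_DIM = 2
BONDING_POINTS = [0, 1, 3]  # candidate atoms p on the scaffold (toy indices)

def property_delta(z, p):
    """Stub for y'_delta; real use evaluates f_BB on the modified scaffold
    and subtracts the property of the original scaffold."""
    return -sum(v * v for v in z) + 0.1 * p  # toy smooth landscape

# Random (z, p, y'_delta) triplets used to warm-start the GP surrogate.
dataset = []
for _ in range(10):
    z = [random.uniform(-1.0, 1.0) for _ in range(LATENT_DIM)]
    p = random.choice(BONDING_POINTS)
    dataset.append((z, p, property_delta(z, p)))

print(len(dataset))  # 10 training triplets
```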

Among many possible choices, we employ the Upper Confidence Bound (UCB) acquisition function to guide the optimization. To steer LSBO toward regions that satisfy the similarity constraint, we introduce a penalization mechanism: we assign a negative improvement value y'_\Delta whenever the condition \text{DICE}(\bm{S}, \bm{S}') \geq \tau is not satisfied after sampling from f^{\text{dec}}([\bm{z}^*, \bm{c}]), and we likewise apply a penalty when a substructure cannot bond with the target molecule. We outline our approach in Algorithm 1 and provide details on the selection of hyperparameters within our framework in Appendix A.3.
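The penalized feedback rule can be written as a small function. The penalty magnitudes here are illustrative placeholders, not the paper's tuned λ values, and the function name is ours.

```python
def evaluate_candidate(y_old, y_new, dice, tau, bonded,
                       lambda_similarity=-1.0, lambda_no_bond=-1.0):
    """Improvement label fed back to the GP, with constraint penalties.

    lambda_similarity / lambda_no_bond are illustrative penalty values
    (negative "improvements"), not the paper's tuned settings.
    """
    if not bonded:          # substructure failed to attach to the scaffold
        return lambda_no_bond
    if dice < tau:          # similarity constraint violated
        return lambda_similarity
    return y_new - y_old    # valid modification: true improvement

print(round(evaluate_candidate(0.60, 0.72, dice=0.80, tau=0.5, bonded=True), 2))  # 0.12
```

Feeding penalties back as labels means the GP itself learns to avoid infeasible regions, rather than requiring an explicit constrained acquisition function.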

4.3.1 Kernel Design of CLaSMO

Our problem requires simultaneous optimization over continuous latent vectors and discrete bonding positions. To handle this mixed input space of continuous and discrete variables, we use a GP model with a covariance function k built from separate kernels for each input type:

k_{\text{cont}}(\bm{z},\bm{z}')=\exp\left(-\frac{1}{2\ell^{2}}\|\bm{z}-\bm{z}'\|^{2}\right),\quad k_{\text{cat}}(p_{i},p_{j})=\exp\left(-\frac{\delta_{p_{i},p_{j}}}{\ell}\right),

where k_{\text{cont}}(\bm{z}, \bm{z}') is an RBF kernel, k_{\text{cat}}(p_i, p_j) measures the similarity between candidate bonding atoms via the Kronecker delta \delta_{p_i, p_j}, which tests equality of atom positions, and \ell is the lengthscale parameter. The combined kernel used in CLaSMO is then expressed as:

k_{\text{CLaSMO}} = k((\bm{z}, p_i), (\bm{z}', p_j)) = k_{\text{cont}}(\bm{z}, \bm{z}') \cdot k_{\text{cat}}(p_i, p_j).
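A minimal sketch of this product kernel follows. We interpret the Kronecker-delta term as a mismatch indicator (0 when the bonding positions agree), so that matching positions receive the maximal categorical similarity, in line with the equality-based similarity the text describes; the function name and inputs are ours.

```python
import math

def k_clasmo(x1, x2, lengthscale=1.0):
    """Product kernel over mixed inputs (z, p): an RBF kernel on the
    continuous latent vector times a categorical kernel on the
    discrete bonding position."""
    (z1, p1), (z2, p2) = x1, x2
    sq_dist = sum((a - b) ** 2 for a, b in zip(z1, z2))
    k_cont = math.exp(-sq_dist / (2.0 * lengthscale ** 2))
    mismatch = 0.0 if p1 == p2 else 1.0   # 0 when bonding positions agree
    k_cat = math.exp(-mismatch / lengthscale)
    return k_cont * k_cat

# Identical inputs give covariance 1; a differing position shrinks it.
same = k_clasmo(([0.3, -1.2], 2), ([0.3, -1.2], 2))
diff = k_clasmo(([0.3, -1.2], 2), ([0.3, -1.2], 5))
print(same, round(diff, 4))  # 1.0 0.3679
```

Multiplying the two kernels means two candidates are considered similar only when both their latent vectors and their bonding positions are close, which is what lets one GP reason jointly over the mixed search space.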
Algorithm 1 CLaSMO
1: Input: GP training data \mathcal{D}, trained CVAE, trained autoencoder encoder f^{\rm enc}_{\text{CondEmd}} for condition embeddings, input molecules \bm{M}, optimization budget per molecule K, similarity threshold \tau, penalization terms \lambda_1, \lambda_2
2: Fit GP using \mathcal{D}
3: for i = 1 to |\bm{M}| do
4:     Pick molecule \bm{m}_i and obtain its scaffold \bm{S}_i \leftarrow \text{Scaffold}(\bm{m}_i)
5:     Evaluate target property value y \leftarrow f^{\rm BB}(\bm{m}_i)
6:     for j = 1 to K do
7:         Identify available atoms in the scaffold for bonding, p_j \in B(\bm{S}_i)
8:         Find \bm{z}^*, p_i^* = \arg\max_{(\bm{z}, p_i) \in \Omega} f^{\rm BB}\left(\bm{S}'\right)
9:         Create condition vector \bm{c}^* for atom p_i^* in scaffold \bm{S}_i
10:         Obtain condition embedding \bm{c} \leftarrow f^{\rm enc}_{\text{CondEmd}}(\bm{c}^*)
11:         Generate substructure \bm{s}^* \leftarrow f^{\text{dec}}([\bm{z}^*, \bm{c}])
12:         Add substructure \bm{s}^* to scaffold \bm{S}_i at region p_i^* to obtain \bm{S}'_i
13:         if \bm{S}_i \neq \bm{S}'_i then
14:             if \text{DICE}(\bm{S}_i, \bm{S}'_i) \geq \tau then
15:                 Evaluate new property: y' = f^{\text{BB}}(\bm{S}'_i)
16:                 Compute improvement y'_\Delta = y' - y
17:                 if y'_\Delta > 0 then
18:                     Update \bm{S}_i \leftarrow \bm{S}'_i
19:                 end if
20:             else
21:                 Set y'_\Delta to penalization term \lambda_1
22:             end if
23:         else
24:             Set y'_\Delta to penalization term \lambda_2
25:         end if
26:         Update \mathcal{D} \leftarrow \mathcal{D} \cup \{([\bm{z}^*, p^*], y'_\Delta)\}
27:         Update GP with \mathcal{D}
28:     end for
29: end for
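Algorithm 1 can be condensed into a toy end-to-end loop. Every component below (decode, attach, property, similarity) is an illustrative stand-in rather than the paper's CVAE/GP machinery, and the acquisition step is random sampling rather than UCB on a GP; only the accept/reject skeleton — reject low-similarity candidates, keep only improving modifications — follows the algorithm.

```python
import random

def run_clasmo_toy(scaffold, budget=50, tau=0.5, seed=0):
    """Toy CLaSMO-style loop on a 1-D surrogate problem (illustrative only)."""
    rng = random.Random(seed)
    prop = lambda s: 1.0 - abs(s - 3.0) / 10.0      # toy property, peaked at s = 3.0
    y = prop(scaffold)
    best = scaffold
    for _ in range(budget):
        z = rng.uniform(-6, 6)                       # candidate latent point
        candidate = best + 0.1 * z                   # toy "attach substructure"
        dice = 1.0 / (1.0 + abs(candidate - best))   # toy similarity in (0, 1]
        if dice < tau:
            continue                                 # similarity constraint violated
        y_new = prop(candidate)
        if y_new - y > 0:                            # accept only improvements
            best, y = candidate, y_new
    return best, y

mol, score = run_clasmo_toy(scaffold=0.0)
print(score > 1.0 - 3.0 / 10.0)  # True: improved over the initial property 0.7
```

Because modifications are only ever accepted when they both satisfy the similarity constraint and improve the property, the loop is monotone in the objective, mirroring the update rule in steps 14–18 of Algorithm 1.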

5 Experiments

In this section, we first provide details about the training of our CVAE model, and then present results from two experimental settings: QED optimization and docking score optimization.

5.1 CVAE Training

For data preparation, we used the QM9 dataset Ruddigkeit et al. (2012); Ramakrishnan et al. (2014), which contains about 130,000 small molecules. QM9 was chosen for its simplicity and suitability for our task, as it consists of smaller molecules than other well-known datasets such as ZINC Irwin & Shoichet (2005); Irwin et al. (2020). Using our data preparation strategy, we extracted 18,706 unique substructure pairs along with their atomic environment features to train the CVAE. Most substructures obtained from QM9 through our method are under 100 Daltons, making them ideal for training a model focused on generating the small substructures needed to modify an input scaffold while maintaining a high degree of similarity. Of the 18,706 instances, 80% were used for model training, with the remainder allocated to testing and validation. We represented the molecules using SELFIES Krenn et al. (2020), a string-based molecular representation, in the form of one-hot encoding matrices. Our CVAE architecture consists of three fully connected layers in the encoder and three GRU layers in the decoder.
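The one-hot matrix representation can be sketched as follows. The actual pipeline would tokenize strings with the `selfies` package; here the alphabet and token sequence are toy examples, and the padding scheme is an assumption for illustration.

```python
# Toy one-hot encoding of a SELFIES-style token sequence.
def one_hot_encode(tokens, alphabet, max_len):
    """Encode a token sequence as a (max_len x (|alphabet| + 1)) 0/1 matrix;
    the extra column is a [PAD] slot for positions past the sequence end."""
    index = {tok: i for i, tok in enumerate(alphabet)}
    pad = len(alphabet)
    matrix = []
    for i in range(max_len):
        row = [0] * (len(alphabet) + 1)
        row[index[tokens[i]] if i < len(tokens) else pad] = 1
        matrix.append(row)
    return matrix

alphabet = ["[C]", "[O]", "[N]", "[Branch1]", "[Ring1]"]
tokens = ["[C]", "[O]", "[C]"]               # e.g. a small substructure
m = one_hot_encode(tokens, alphabet, max_len=5)
print(len(m), len(m[0]))  # 5 6  (5 rows, alphabet size + pad columns)
```

Matrices of this shape are what the CVAE encoder consumes and its decoder reconstructs token by token.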

Given the simplicity of the dataset, we found a 2-dimensional latent space sufficient to achieve over 99% reconstruction accuracy on the test set. This low-dimensional space also benefits BO, which typically excels in smaller-dimensional spaces. To maintain this simplicity, as explained in Section 4.2.1, we generated 2-dimensional embeddings for the condition vectors using an autoencoder with fully connected layers in both the encoder and decoder, which achieved 93% reconstruction accuracy on the condition vectors. Prior to CVAE training, we obtained the condition vector embeddings via the encoder of this autoencoder and used them during CVAE training. As a result, the CVAE was trained with both a 2-dimensional latent space for the substructures and 2-dimensional condition embeddings, effectively balancing simplicity and optimization performance.

5.2 QED Optimization

In this section, we present the results of our QED optimization experiments, using RDKit Landrum (2010) to calculate QED values. The QED metric, defined on [0, 1], inherently limits the range of potential improvements, making optimization within this closed range particularly challenging. To further complicate the task, we imposed similarity constraints to keep the optimized molecules close to their original scaffolds, running CLaSMO with thresholds \tau \in \{0, 0.25, 0.5, 0.6\}. For each threshold, CLaSMO was run for 100 iterations on each of 100 input scaffolds, whose parent molecules were sampled randomly from the ZINC250K dataset Gómez-Bombarelli et al. (2018). The search domain for each latent dimension in \mathcal{Z} is set to [-6, 6], and the GP model is trained on 100 instances. This experimental setting is designed to demonstrate CLaSMO's optimization capabilities at varying similarity thresholds within the bounded range of the QED task.

Refer to caption
Refer to caption
Refer to caption
Refer to caption
Figure 4: Distribution of QED values of input scaffolds and CLaSMO optimization results at various similarity thresholds, illustrating the shift towards higher QED values at each similarity level. (A) τ=0\tau=0, showing the greatest improvement due to no similarity constraints. (B) τ=0.25\tau=0.25, demonstrating high QED gains. (C) τ=0.5\tau=0.5 and (D) τ=0.6\tau=0.6 are still able to demonstrate significant improvements while enforcing a balance between optimization and similarity.

Initially, the average QED score of the input scaffold molecules was 0.5876, with a maximum QED of 0.8902. After optimization, CLaSMO with no similarity constraint (\tau = 0) achieved the highest QED score of 0.9480 and an average QED of 0.6778, representing a mean improvement rate of 21.43% (mean improvement rates are calculated as the average of the relative improvements obtained by CLaSMO for each input scaffold) and a maximum improvement rate of 81.17%. As the similarity constraint was tightened, performance remained competitive. With \tau = 0.25, the maximum QED score was 0.9464, with an average QED of 0.6715 and a mean improvement rate of 19.14%. Increasing the threshold to \tau = 0.50, the maximum QED score was 0.9437, the average QED 0.6602, and the mean improvement rate 16.08%. For \tau = 0.60, the maximum QED score was 0.9402, the average QED 0.6434, and the mean improvement rate 11.70%. These results demonstrate that CLaSMO consistently enhances QED across a range of similarity thresholds, effectively balancing the trade-off between maximizing QED and preserving structural similarity; notably, even under more stringent constraints, it achieves significant improvements over the initial QED values. Table 1 summarizes these findings, which were achieved with only 100 iterations per input scaffold, demonstrating the sample efficiency of our approach. Additionally, Fig. 5 shows two example scaffold optimization trajectories from the CLaSMO experiments, in which our algorithm finds a molecule with a QED score of 0.948 through very limited additions, keeping the similarity between the input and optimized molecule at a high level.

Table 1: Summary of QED optimization results: Max and mean QED values from input scaffolds and CLaSMO experiments with various τ\tau levels, as well as mean and max improvement rates over QED values of input scaffold.
Metric Input 𝝉=𝟎\bm{\tau=0} 𝝉=0.25\bm{\tau=0.25} 𝝉=0.50\bm{\tau=0.50} 𝝉=0.60\bm{\tau=0.60}
Max QED 0.8902 0.9480 0.9464 0.9437 0.9402
Mean QED 0.5876 0.6778 0.6715 0.6602 0.6434
Mean Imp. - 21.43% 19.14% 16.08% 11.70%
Max Imp. - 81.17% 114.57% 74.74% 72.41%
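The improvement rates in Table 1 are per-scaffold relative improvements, averaged (mean) or maximized (max) over the scaffold set. The sketch below uses toy QED values, not the paper's experimental data.

```python
def improvement_rates(qed_before, qed_after):
    """Relative improvement per scaffold; mean and max over the set."""
    rates = [(after_v - before_v) / before_v
             for before_v, after_v in zip(qed_before, qed_after)]
    return sum(rates) / len(rates), max(rates)

# Toy values, not the paper's experimental data
before = [0.50, 0.60, 0.44]
after = [0.62, 0.66, 0.80]
mean_imp, max_imp = improvement_rates(before, after)
print(f"{mean_imp:.2%} {max_imp:.2%}")  # 38.61% 81.82%
```

Note that the max improvement is dominated by whichever scaffold started lowest, which is why max improvement rates in Table 1 need not decrease monotonically with the threshold.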
Refer to caption
Figure 5: Examples of molecules obtained from different CLaSMO experiments in the QED setting. In both rows, the molecules on the left represent the input scaffolds. Each molecule to the right shows a step of substructure addition along with its resulting QED score. The results demonstrate QED improvements from 0.615 to 0.938 and from 0.827 to 0.948 in these examples.

5.2.1 Model Benchmark

Table 2 shows the top QED values achieved by various models, highlighting CLaSMO's performance. Our method reaches a top QED score of 0.948, matching state-of-the-art results from PGFS Gottipati et al. (2020) and GP-MOLFORMER Ross et al. (2024). Among these models, Scaffold-GGM, DrugHIVE, and PGFS are substructure-addition-based methodologies, while the others generate molecules from scratch. A key distinction is that CLaSMO's CVAE is trained only on substructures, not complete molecules, using a dataset of only 18,706 instances. The maximum QED value among these substructures is 0.53, with an overall average of 0.427 (the distribution of QED values used in CVAE training is provided in Appendix A.2). In contrast, competitor models are trained on significantly larger datasets of full molecules, often encountering molecules with high QED values during training. These results underscore CLaSMO's efficiency and lower computational cost, demonstrating the advantages of carefully curated datasets and methodologies tailored to the target task.

Table 2: Comparison of top QED values achieved by various models with their dataset sizes.
Model Top QED Dataset Size
JT-VAE Jin et al. (2018) 0.925 250,000
Scaffold-GGM Lim et al. (2020) (results obtained by running their released code on the QED task) 0.928 349,809
DrugHIVE Weller & Rohs (2024) 0.940 27 Million
LIMO Eckmann et al. (2022) 0.947 250,000
PGFS Gottipati et al. (2020) 0.948 -
GP-MOLFORMER Ross et al. (2024) 0.948 1.1 Billion
CLaSMO (ours) 0.948 18,706

5.3 Docking Simulation Score Optimization

In this section, we present our docking simulation score optimization results using the KAT1 protein as the target. KAT1 is an ion channel protein, and compounds with favorable docking scores are more likely to bind effectively and influence its function, which is crucial for developing therapeutic agents targeting ion channel-related conditions (details of the KAT1 protein can be found at https://pdbj.org/mine/summary/6v1x). By optimizing docking scores, we aim to identify compounds with the potential to modulate ion channel function and support novel therapeutic strategies. Unlike the relatively simple calculation of QED values, optimizing docking scores for KAT1 requires computationally intensive simulations. For these calculations, we used Schrödinger Software Schrödinger (2023), known for reliable and accurate results, with each simulation taking anywhere from a few minutes to several hours. Given this high computational cost, CLaSMO's sample-efficient approach is particularly valuable, enabling effective optimization while minimizing the number of expensive evaluations.

Similar to the QED setting, the search domain for each latent dimension in \mathcal{Z} was set to [-6, 6]. To address the high cost of obtaining labeled docking data, we reused the training set from the QED experiments, treating its labels as low-fidelity instances for training the GP model. Although these labels originate from QED rather than docking scores, they still allow the surrogate model to learn the relationships between atoms and latent vectors, enhancing the GP's capacity to generalize and guide the optimization while reducing dependency on extensive docking score evaluations. Given the significant time required for each docking score calculation, we limited the experiments to 10 compounds from the ZINC250K dataset and ran CLaSMO for 100 iterations per compound using two Dice similarity thresholds, \tau \in \{0.25, 0.50\}. Despite the computational expense, this setup still yields comprehensive and meaningful results for docking score optimization and tests whether CLaSMO can optimize docking scores under similarity constraints.

We present the distribution of docking scores from both CLaSMO experiments in Fig. 6. The results clearly demonstrate that CLaSMO, at both similarity thresholds, achieves significant improvements in docking scores compared to the initial scaffolds. Figure 7 provides detailed metrics, including the maximum and average improvements across the models as well as the initial best values. Notably, CLaSMO with a similarity threshold of \tau = 0.25 achieved improvements of up to 96.3% over the initial scaffold docking score, while CLaSMO with \tau = 0.50 achieved improvements of up to 75.1%, both indicating strong optimization performance despite the limited number of LSBO iterations and showcasing CLaSMO's sample efficiency. In Fig. 8, we provide example molecules from our experiment; even in the early stages of optimization, while \text{DICE}(\bm{S}, \bm{S}') is still above 0.70, we observe substantial improvements.

Refer to caption
Figure 6: Distribution of docking scores for CLaSMO optimization experiments. Lower docking scores indicate better binding affinity, highlighting the efficacy of CLaSMO in enhancing docking performance across different similarity constraints.

Metric Input \bm{\tau = 0.25} \bm{\tau = 0.50}
Min DS -5.051 -9.30 -8.57
Mean DS -4.934 -8.32 -6.62
Mean Imp. - 70.1% 35.4%
Max Imp. - 96.3% 75.1%
Figure 7: Summary of docking score (DS) optimization results: min and mean DS values from input scaffolds and CLaSMO experiments at various \tau levels, and mean and max improvement rates over the DS values of the input scaffolds.

Refer to caption
Figure 8: Examples of molecules obtained from CLaSMO τ=0.25\tau=0.25 experiments from docking score optimization setting. In both rows, molecules in the left refer to the input scaffolds. Each molecule to the right shows one step of substructure addition with their resulting docking scores. Docking scores are significantly improved by adding small substructures to different regions of the input molecule.

5.4 Ablation Study

We conducted an ablation study to evaluate the impact of incorporating atomic environment features as condition vectors in the generative process. In the training of the baseline VAE model, only substructures of the molecules were used, without conditioning on atomic properties. This setup allowed us to isolate the effect of conditioning on scaffold optimization.

We repeated the QED optimization experiments using the VAE model. The QED optimization results, shown in Table 3, demonstrate that incorporating atomic environment features in the CVAE significantly improves the generation of substructures. The conditioning mechanism enhances the model’s ability to generate substructures that bond more effectively with the scaffold, leading to higher success rates in optimizing molecular properties. While the VAE without conditioning still gains from the LSBO framework’s efficient optimization, incorporating atomic properties as conditions significantly enhances the quality of the generated molecules.

Table 3: QED optimization results: Comparison of VAE and CVAE at different τ\tau thresholds, reporting mean and max improvement rates.
Metric 𝝉=0.25\bm{\tau=0.25} 𝝉=0.50\bm{\tau=0.50}
Mean Imp. (VAE) 12.31% 8.41%
Mean Imp. (CVAE) 19.14% 16.08%
Max Imp. (VAE) 68.16% 32.28%
Max Imp. (CVAE) 114.57% 74.74%

5.5 Interactive Optimization

In CLaSMO, the modification region of the input scaffold is typically selected during the automated optimization process. However, the framework also supports an interactive mode, where a chemical expert manually selects the region of the molecule to modify. In this mode, the expert identifies the specific atom or region for modification, rather than relying on CLaSMO’s automated selection. Once the region is chosen, the rest of the process remains unchanged—CLaSMO continues to optimize the molecule by generating substructures using the CVAE and refining them through LSBO to improve target properties.

This interactive approach offers several key advantages. It enables the integration of expert knowledge into the optimization process, allowing CLaSMO to operate in a Human-in-the-loop setting. Experts can leverage their domain-specific insights to target specific regions they find promising, ensuring that modifications are not only data-driven but also aligned with scientific understanding. Meanwhile, CLaSMO maintains its efficient optimization process, using LSBO to enhance molecular properties and preserve sample efficiency. Detailed user guidelines for this open-source web application are provided in the Appendix A.

6 Conclusion

In this paper, we introduced CLaSMO, a novel framework that combines CVAE and LSBO for scaffold-based molecular optimization. Our approach efficiently explores latent spaces to optimize molecular properties, such as QED and docking scores, while maintaining structural similarity with the input scaffold to improve the chances of real-world viability of optimized molecules. By conditioning substructure generation on the atomic environment of the target region in the input molecule, CLaSMO generates chemically meaningful modifications. The experimental results demonstrate that CLaSMO achieves state-of-the-art performance, even with limited training data, a significantly smaller model compared to existing methods, and a limited number of optimization iterations, showcasing its sample-efficiency and lower computational cost. Furthermore, CLaSMO’s ability to control structural divergence through similarity constraints ensures robust performance across different optimization tasks. Although this paper focuses on scaffold-based modifications, CLaSMO is fully compatible with whole molecules, requiring no changes to its methodology. Additionally, we have open-sourced a web application to allow chemical experts to use CLaSMO in a Human-in-the-Loop setting, further extending its practical applicability. Overall, CLaSMO exemplifies the power of combining scaffold-based strategies with LSBO, offering a highly effective tool for targeted drug discovery and broader molecular design challenges.

Broader Impact Statement

The work presented in this paper has the potential to accelerate the discovery of new chemical compounds, which can positively impact various industries, particularly pharmaceuticals and materials science. By improving the efficiency of molecular optimization, CLaSMO could contribute to the development of more effective drugs, especially in regions facing significant public health challenges, such as the need for rapid vaccine development. Moreover, CLaSMO's focus on real-world applicability increases the chances that the compounds discovered are not just theoretical but can be realistically produced, which is critical for translating scientific innovation into real-world solutions. Finally, CLaSMO's ability to work in a Human-in-the-Loop setting enables domain experts to directly contribute to the optimization process, enhancing collaboration between artificial intelligence and human expertise.

References

  • Bajusz et al. (2015) D. Bajusz, A. Rácz, and K. Héberger. Why is tanimoto index an appropriate choice for fingerprint-based similarity calculations? Journal of Cheminformatics, 7:20, 2015. doi: 10.1186/s13321-015-0069-3. URL https://doi.org/10.1186/s13321-015-0069-3.
  • Bemis & Murcko (1996) G. W. Bemis and M. A. Murcko. The properties of known drugs. 1. molecular frameworks. Journal of Medicinal Chemistry, 39(15):2887–2893, Jul 1996. doi: 10.1021/jm9602928.
  • Boyar & Takeuchi (2024) Onur Boyar and Ichiro Takeuchi. Latent Space Bayesian Optimization With Latent Data Augmentation for Enhanced Exploration. Neural Computation, 36(11):2446–2478, 10 2024. ISSN 0899-7667. doi: 10.1162/neco_a_01708. URL https://doi.org/10.1162/neco_a_01708.
  • Boyar et al. (2024) Onur Boyar, Yanheng Gu, Yuji Tanaka, Shunsuke Tonogai, Tomoya Itakura, and Ichiro Takeuchi. Crystal-lsbo: Automated design of de novo crystals with latent space bayesian optimization, 2024. URL https://arxiv.org/abs/2405.17881.
  • Bradshaw et al. (2019) John Bradshaw, Brooks Paige, Matt J Kusner, Marwin Segler, and José Miguel Hernández-Lobato. A model to search for synthesizable molecules. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
  • De Vries et al. (2017) Harm De Vries, Florian Strub, Jérémie Mary, Hugo Larochelle, Olivier Pietquin, and Aaron C Courville. Modulating early visual processing by language. Advances in neural information processing systems, 30, 2017.
  • Degen et al. (2008) Jörg Degen, Christof Wegscheid-Gerlach, Andrea Zaliani, and Matthias Rarey. On the art of compiling and using ’drug-like’ chemical fragment spaces. ChemMedChem, 3(10):1503–1507, 2008. doi: https://doi.org/10.1002/cmdc.200800178.
  • Dice (1945) Lee R. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945. ISSN 00129658, 19399170. URL http://www.jstor.org/stable/1932409.
  • Eckmann et al. (2022) P Eckmann, K Sun, B Zhao, M Feng, MK Gilson, and R Yu. Limo: Latent inceptionism for targeted molecule generation. arXiv preprint arXiv:2206.09010, 2022.
  • Gottipati et al. (2020) Sai Krishna Gottipati, Boris Sattarov, Sufeng Niu, Yashaswi Pathak, Haoran Wei, Shengchao Liu, Shengchao Liu, Simon Blackburn, Karam Thomas, Connor Coley, Jian Tang, Sarath Chandar, and Yoshua Bengio. Learning to navigate the synthetically accessible chemical space using reinforcement learning. In Hal Daumé III and Aarti Singh (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp.  3668–3679. PMLR, 13–18 Jul 2020. URL https://proceedings.mlr.press/v119/gottipati20a.html.
  • Griffiths & Hernández-Lobato (2020) Ryan-Rhys Griffiths and José Miguel Hernández-Lobato. Constrained Bayesian optimization for automatic chemical design using variational autoencoders. Chemical science, 11(2):577–586, 2020.
  • Grosnit et al. (2021) Antoine Grosnit, Rasul Tutunov, Alexandre Max Maraval, Ryan-Rhys Griffiths, Alexander Imani Cowen-Rivers, Lin Yang, Lin Zhu, Wenlong Lyu, Zhitang Chen, Jun Wang, Jan Peters, and Haitham Bou-Ammar. High-dimensional bayesian optimisation with variational autoencoders and deep metric learning. ArXiv, abs/2106.03609, 2021.
  • Gómez-Bombarelli et al. (2018) Rafael Gómez-Bombarelli, Jennifer N. Wei, David Duvenaud, José Miguel Hernández-Lobato, Benjamín Sánchez-Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy D. Hirzel, Ryan P. Adams, and Alán Aspuru-Guzik. Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Science, 4(2):268–276, 2018. doi: 10.1021/acscentsci.7b00572.
  • Higgins et al. (2016) Irina Higgins, Loïc Matthey, Arka Pal, Christopher P. Burgess, Xavier Glorot, Matthew M. Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2016.
  • Irwin & Shoichet (2005) John J Irwin and Brian K Shoichet. Zinc- a free database of commercially available compounds for virtual screening. Journal of chemical information and modeling, 45(1):177–182, 2005.
  • Irwin et al. (2020) John J Irwin, Khanh G Tang, Jennifer Young, Chinzorig Dandarchuluun, Benjamin R Wong, Munkhzul Khurelbaatar, Yurii S Moroz, John Mayfield, and Roger A Sayle. Zinc20—a free ultralarge-scale chemical database for ligand discovery. Journal of chemical information and modeling, 60(12):6065–6073, 2020.
  • Jensen (2019) J. H. Jensen. A graph-based genetic algorithm and generative model/monte carlo tree search for the exploration of chemical space. Chemical Science, 10(12):3567–3572, 2019. doi: 10.1039/C8SC05372C.
  • Jin et al. (2018) Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Junction tree variational autoencoder for molecular graph generation. In International Conference on Machine Learning, pp.  2323–2332, 2018.
  • Kingma & Welling (2014) Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In Yoshua Bengio and Yann LeCun (eds.), 2nd International Conference on Learning Representations, ICLR 2014, 2014.
  • Krenn et al. (2020) Mario Krenn, Florian Häse, AkshatKumar Nigam, Pascal Friederich, and Alan Aspuru-Guzik. Self-referencing embedded strings (selfies): A 100% robust molecular string representation. Machine Learning: Science and Technology, 1(4):045024, oct 2020. doi: 10.1088/2632-2153/aba947.
  • Kusner et al. (2017) Matt J Kusner, Brooks Paige, and José Miguel Hernández-Lobato. Grammar variational autoencoder. In International Conference on Machine Learning, pp.  1945–1954, 2017.
  • Landrum (2010) Greg Landrum. RDKit. 2010.
  • Langevin et al. (2020) Maxime Langevin, Hervé Minoux, Maximilien Levesque, and Marc Bianciotto. Scaffold-constrained molecular generation. Journal of Chemical Information and Modeling, 60(12):5637–5646, 2020.
  • Li et al. (2019) Yibo Li, Jianxing Hu, Yanxing Wang, Jielong Zhou, Liangren Zhang, and Zhenming Liu. Deepscaffold: a comprehensive tool for scaffold-based de novo drug discovery using deep learning. Journal of chemical information and modeling, 60(1):77–91, 2019.
  • Lim et al. (2020) Jaechang Lim, Sang-Yeon Hwang, Seokhyun Moon, Seungsu Kim, and Woo Youn Kim. Scaffold-based molecular design with a graph generative model. Chem. Sci., 11:1153–1164, 2020. doi: 10.1039/C9SC04503A. URL http://dx.doi.org/10.1039/C9SC04503A.
  • Lipinski et al. (2001) C. A. Lipinski, F. Lombardo, B. W. Dominy, and P. J. Feeney. Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Advanced Drug Delivery Reviews, 46(1-3):3–26, March 2001. doi: 10.1016/s0169-409x(00)00129-0.
  • Maus et al. (2022) Natalie Maus, Haydn Jones, Juston Moore, Matt J Kusner, John Bradshaw, and Jacob Gardner. Local latent space bayesian optimization over structured inputs. Advances in Neural Information Processing Systems, 35:34505–34518, 2022.
  • Miao et al. (2011) Zhenhuan Miao, Julian Levi, and Zhen Cheng. Protein scaffold-based molecular probes for cancer molecular imaging. Amino Acids, 41(5):1037–1047, 2011. doi: 10.1007/s00726-010-0503-9. URL https://doi.org/10.1007/s00726-010-0503-9.
  • Morgan (1965) H. L. Morgan. The generation of a unique machine description for chemical structures - a technique developed at chemical abstracts service. Journal of Chemical Documentation, 5(2):107–113, 1965. ISSN 0021-9576. doi: 10.1021/c160017a018. URL https://doi.org/10.1021/c160017a018.
  • Ramakrishnan et al. (2014) Raghunathan Ramakrishnan, Pavlo Dral, Matthias Rupp, and Anatole von Lilienfeld. Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data, 1, 08 2014. doi: 10.1038/sdata.2014.22.
  • Rogers & Hahn (2010) David Rogers and Mathew Hahn. Extended-connectivity fingerprints. Journal of Chemical Information and Modeling, 50(5):742–754, 2010. ISSN 1549-9596. doi: 10.1021/ci100050t. URL https://doi.org/10.1021/ci100050t.
  • Ross et al. (2024) Jerret Ross, Brian Belgodere, Samuel C Hoffman, Vijil Chenthamarakshan, Youssef Mroueh, and Payel Das. Gp-molformer: A foundation model for molecular generation. arXiv preprint arXiv:2405.04912, 2024.
  • Ruddigkeit et al. (2012) Lars Ruddigkeit, Ruud van Deursen, Lorenz C. Blum, and Jean-Louis Reymond. Enumeration of 166 billion organic small molecules in the chemical universe database gdb-17. Journal of Chemical Information and Modeling, 52(11):2864–2875, 2012. doi: 10.1021/ci300415d. PMID: 23088335.
  • Schreiber (2000) Stuart L. Schreiber. Target-oriented and diversity-oriented organic synthesis in drug discovery. Science, 287(5460):1964–1969, 2000. doi: 10.1126/science.287.5460.1964. URL https://www.science.org/doi/abs/10.1126/science.287.5460.1964.
  • Schrödinger (2023) Schrödinger. Schrödinger Suite. Computer Software, 2023. Release 2023-2.
  • Tripp et al. (2020) Austin Tripp, Erik Daxberger, and José Hernández-Lobato. Sample-efficient optimization in the latent space of deep generative models via weighted retraining. In Advances in Neural Information Processing Systems, 06 2020.
  • Weininger (1988) David Weininger. Smiles, a chemical language and information system. 1. introduction to methodology and encoding rules. Journal of Chemical Information and Computer Sciences, 28(1):31–36, 1988. doi: 10.1021/ci00057a005.
  • Weller & Rohs (2024) Jesse A Weller and Remo Rohs. Structure-based drug design with a deep hierarchical generative model. Journal of Chemical Information and Modeling, 64(16):6450–6463, 2024.
  • Welsch et al. (2010) Matthew E Welsch, Scott A Snyder, and Brent R Stockwell. Privileged scaffolds for library design and drug discovery. Current Opinion in Chemical Biology, 14(3):347–361, 2010. ISSN 1367-5931. doi: https://doi.org/10.1016/j.cbpa.2010.02.018. Molecular Diversity.
  • Yan et al. (2020) Chaochao Yan, Sheng Wang, Jinyu Yang, Tingyang Xu, and Junzhou Huang. Re-balancing variational autoencoder loss for molecule sequence generation. In Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, pp.  1–7, 2020.

Appendix A Appendix

A.1 Definitions of Atomic Features

In Table 4, we provide the definitions of the six atomic features we utilized in our data preparation setting.

Atom Type: Specifies the element (e.g., carbon, oxygen), which determines bonding capabilities and chemical behavior.
Hybridization: Describes the mixing of atomic orbitals, influencing molecular shape, bond angles, and bonding interactions.
Valence: Refers to the number of bonds an atom can form, indicating potential for additional bonding.
Formal Charge: Represents the charge an atom would carry if all bonding electrons were shared equally, crucial for identifying reactive and bonding sites.
Degree: Denotes the number of directly attached atoms (neighbors), providing insight into the local atomic environment.
Ring Membership: Indicates whether the atom is part of a ring structure, which affects substructure rigidity, stability, and bonding behavior.
Table 4: Key atomic properties used in CVAE training in the CLaSMO framework.
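The six properties above map directly onto standard RDKit atom accessors. The sketch below is an illustrative featurization, not the exact code used in CLaSMO:

```python
from rdkit import Chem

def atom_features(atom):
    """Extract the six atomic properties of Table 4 for one RDKit atom."""
    return {
        "atom_type": atom.GetSymbol(),                  # element symbol, e.g. "C"
        "hybridization": str(atom.GetHybridization()),  # e.g. "SP2", "SP3"
        "valence": atom.GetTotalValence(),              # total bonds the atom forms
        "formal_charge": atom.GetFormalCharge(),
        "degree": atom.GetDegree(),                     # number of heavy-atom neighbors
        "in_ring": atom.IsInRing(),                     # ring membership flag
    }

mol = Chem.MolFromSmiles("c1ccccc1O")  # phenol, as a small example
features = [atom_features(a) for a in mol.GetAtoms()]
```

For phenol, the first six entries describe the aromatic ring carbons and the last entry the hydroxyl oxygen.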

A.2 QED Value Distribution of Input Substructures

Figure 9 shows the distribution of QED values for the substructures used to train the CVAE model, generated through our data preparation process.

Figure 9: QED value distribution of substructures derived from our data preparation methodology.

A.3 Model and Hyperparameter Selection

In chemical VAE models, it has been established that setting the KL divergence weight β < 1 can improve generative performance Yan et al. (2020), and our findings are consistent with this. We experimented with a range of β values from 1 down to 10⁻⁷, selecting models based on their reconstruction accuracy on the training set. Interestingly, we found that even at very low β values, the CVAE retained its generative capacity. However, as β increased, we observed a decline in both reconstruction accuracy and the diversity of generated molecules.
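For reference, the β-weighted objective is the standard reconstruction term plus a scaled KL regularizer. A minimal NumPy sketch of the KL term for a diagonal Gaussian posterior against a standard normal prior (illustrative only, not our training code):

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    return -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar))

def beta_vae_loss(recon_loss, mu, logvar, beta=1e-6):
    """Reconstruction term plus beta-weighted KL term (beta < 1 down-weights the KL)."""
    return recon_loss + beta * kl_to_standard_normal(mu, logvar)
```

When the posterior equals the prior (mu = 0, logvar = 0) the KL term vanishes and the loss reduces to the reconstruction term; a small β keeps the KL from dominating reconstruction quality.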

To ensure robust training, we applied early stopping in combination with a learning rate scheduler (using PyTorch’s ReduceLROnPlateau function). Models were evaluated based on their reconstruction performance and generative diversity, leading us to select the model with β = 10⁻⁶ as the optimal candidate. Additionally, our CVAE leverages conditional batch normalization De Vries et al. (2017), which strengthens the effect of conditioning in the generative process.
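The patience-based early-stopping logic can be sketched in plain Python as follows; in practice this wraps the PyTorch training loop alongside ReduceLROnPlateau (the class below is an illustration, not our actual trainer):

```python
class EarlyStopping:
    """Stop training when the monitored loss has not improved for `patience` epochs."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience      # allowed number of non-improving epochs
        self.min_delta = min_delta    # minimum decrease counted as improvement
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

The same validation loss can be passed both to this monitor and to the scheduler, so the learning rate is reduced first and training halts only if the plateau persists.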

For CLaSMO’s penalization terms, we opted for a straightforward approach rather than an exhaustive hyperparameter optimization process. We set λ₁ = −5 to penalize cases where the similarity constraint DICE(S, S′) > τ was violated. Similarly, we assigned λ₂ = −7.5 for situations where the generated substructure could not be added to the input scaffold. These values were chosen to assign poor scores in cases where the sampled region did not meet the desired criteria, allowing LSBO to learn that the region is suboptimal. This approach helps guide the optimization process away from unproductive regions and toward more promising areas.
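This penalized objective can be sketched as below. The similarity, attachment, and property routines are passed in as callables and are hypothetical stand-ins for the real cheminformatics code (e.g., RDKit DICE similarity over fingerprints); the threshold value is likewise illustrative:

```python
LAMBDA_1 = -5.0   # penalty when the similarity constraint is violated
LAMBDA_2 = -7.5   # penalty when the substructure cannot be attached

def penalized_score(scaffold, candidate, target_property,
                    dice_similarity, can_attach, tau=0.4):
    """Objective handed to LSBO: the real property value for valid candidates,
    a fixed penalty otherwise.

    `target_property`, `dice_similarity`, and `can_attach` are injectable
    callables standing in for the actual routines; tau=0.4 is an
    illustrative threshold, not the paper's setting."""
    if not can_attach(scaffold, candidate):
        return LAMBDA_2                       # attachment failure: harshest penalty
    if dice_similarity(scaffold, candidate) <= tau:
        return LAMBDA_1                       # constraint DICE > tau violated
    return target_property(candidate)         # valid candidate: true objective
```

Returning a fixed poor score instead of discarding invalid samples keeps the surrogate model informed, so LSBO learns to avoid those latent regions.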

A.4 Human-in-the-Loop via Web Application

In Fig. 10, we showcase a sequence of screenshots from our web application, demonstrating the molecule optimization process. First, the user enters a SMILES Weininger (1988) string of the chemical compound into the designated text field. The application then automatically computes the molecule’s QED value and generates a visual representation. Next, the user selects a region of interest by drawing a rectangle around the area they wish to modify. Confirming the selection starts the CLaSMO optimization process, which targets improvements in the selected molecular region. When optimization completes, the optimized molecule is displayed, and the process can be repeated using the resulting molecule as input for further iterations. By incorporating user input in the region selection, we create a Human-in-the-Loop optimization workflow.

Figure 10: Screenshots from our web application showing the step-by-step process of molecule optimization using CLaSMO in an interactive setting. The process includes inputting a SMILES string, visualizing the molecule’s QED value, selecting a region for modification, and initiating the optimization procedure. In this example, the QED value of the input molecule is improved from 0.56 to 0.69; the resulting molecule is shown in the bottom-right panel.