Pre-trained Molecular Language Models with Random Functional Group Masking
Corresponding to Haoyi Xiong via [email protected]
Abstract
Recent advancements in computational chemistry have leveraged the power of transformer-based language models, such as MoLFormer, pre-trained on vast amounts of simplified molecular-input line-entry system (SMILES) sequences, to understand and predict molecular properties and activities, a critical step in fields like drug discovery and materials science. To further improve performance, researchers have introduced graph neural networks with graph-based molecular representations, such as GEM, incorporating the topology, geometry, 2D, or even 3D structures of molecules into pre-training. Since most molecular graphs in existing studies were automatically converted from SMILES sequences, it is reasonable to assume that transformer-based language models may be able to implicitly learn structure-aware representations directly from SMILES sequences. In this paper, we propose MLM-FG, a SMILES-based Molecular Language Model that randomly masks SMILES subsequences corresponding to specific molecular Functional Groups to incorporate the structural information of atoms during the pre-training phase. This technique aims to compel the model to better infer molecular structures and properties, thus enhancing its predictive capabilities. Extensive experimental evaluations across 12 benchmark classification and regression tasks in the chemical domain demonstrate the robustness and superiority of MLM-FG. Our findings reveal that MLM-FG outperforms existing pre-training models, either based on SMILES or graphs, in 9 out of the 12 downstream tasks, ranking as a close second in the remaining ones. Remarkably, MLM-FG also surpasses 3D-graph-based models, which explicitly incorporate molecular structures into their inputs, highlighting its exceptional capacity for representation learning even without explicit 3D structural information. These results indicate that MLM-FG effectively captures the nuanced language of molecules, offering promising implications for computational chemistry and related disciplines.
1 Main
While deep learning has been widely explored in cheminformatics with significant progress [1, 2], its potential is severely limited by the scale of labeled data. To reduce the cost of data annotation while enabling generalizable, transferable, and robust representation learning from unlabeled data, researchers have extended pre-training strategies [3] from images and text to molecular data [4, 5, 6, 7, 8].
To work with machine learning algorithms, molecules can be represented in a chemical notation language named SMILES [9], which explicitly represents meaningful substructures such as branches and cyclic structures. To enable pre-training, researchers use variants of the Transformer architecture pre-trained on large-scale unlabeled SMILES strings [10, 8, 11]. To pre-train a molecular language model, given the SMILES string of every molecule in the training dataset, existing methods usually adopt a masked autoencoding strategy [8, 10, 11, 12]: the strategy first randomly selects a subsequence of the SMILES string, then masks the selected part and trains the model to predict it. However, such a random masking strategy may ignore key chemical substructures of molecules, such as rings and functional groups [7, 4]. For instance, consider aspirin, denoted by “O=C(C)Oc1ccccc1C(=O)O”. In this molecular structure, critical functional groups such as the carboxylic acid (“-COOH”) and the ester (“-COO-”) are at risk of being overlooked due to random masking, neglecting their pivotal contributions to molecular activity and properties. As a result, these methods may fail to learn critical molecular properties, which are primarily determined by the chemical substructures of a molecule, from SMILES strings [13, 14].
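To make this concrete, the following is a minimal sketch of random subsequence masking applied to the aspirin SMILES string. The tokenization regex, mask token, and span length are illustrative assumptions rather than the exact choices of the cited models; the point is only that a randomly chosen span can cut across the ester or carboxylic acid tokens.

```python
import random
import re

# A commonly used SMILES tokenization pattern (bracket atoms, two-letter
# elements, bonds, branches, ring closures); the exact tokenizer used by the
# cited models may differ.
SMILES_TOKENS = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%\d{2}|\d)"
)

def mask_random_subsequence(smiles: str, span: int = 4, mask: str = "<mask>") -> str:
    """Mask a random contiguous span of SMILES tokens, ignoring chemistry."""
    tokens = SMILES_TOKENS.findall(smiles)
    start = random.randrange(0, max(1, len(tokens) - span))
    masked = tokens[:start] + [mask] * span + tokens[start + span:]
    return " ".join(masked)

aspirin = "O=C(C)Oc1ccccc1C(=O)O"
print(mask_random_subsequence(aspirin))
# The randomly chosen span can land inside "C(=O)O", splitting the carboxylic
# acid group across masked and unmasked tokens.
```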
Previous investigations have highlighted the limitations of SMILES in terms of topology awareness, underscoring its inability to explicitly encode the structural information of molecules [15]. To address this issue, structure-aware pre-training methods utilizing Graph Neural Networks (GNNs) have marked a significant advancement. These approaches leverage graph-format representations of molecules (such as the topology of every atom in a molecule projected onto a 2D space), enriching the learning models with a deeper understanding of molecular structures [7, 6, 4]. More recently, 3D graph-based molecular data representations have been introduced into GNN pre-training, where the structures of massive numbers of molecules in 3D space are used to boost the performance of pre-trained models by incorporating 3D structural/topological information [16, 17]. However, such structural/topological information is not always precise. For example, instead of acquiring the 3D positions of every atom and the angles between bonds in a molecule through experiments, some studies directly convert SMILES strings or the 2D topology graph of a molecule into a 3D graph using the Merck molecular force field (MMFF94) [18] implemented in RDKit [16]. It is reasonable to doubt the added value of such automatic data format conversion and the precision of the converted 3D graphs. Thus, it remains challenging to pre-train a structure-aware molecular model when precise structural information is not available.
To tackle the aforementioned challenges, we propose a novel molecular representation framework, MLM-FG, a SMILES-based Molecular Language Model that randomly masks SMILES subsequences corresponding to specific molecular Functional Groups to incorporate the structural information of atoms during the pre-training phase. Specifically, MLM-FG employs transformer-based models trained on a large corpus of SMILES strings covering 100 million molecules. As shown in Figure 1, given the SMILES string of every molecule in the training dataset, MLM-FG first parses the string and identifies the subsequences corresponding to functional groups and key clusters of atoms in the molecule. MLM-FG then randomly masks a certain proportion of these subsequences and trains the model to predict the masked parts as the pre-training task.
Extensive experimental evaluations across 12 benchmark classification and regression tasks in the chemical domain demonstrate the robustness and superiority of MLM-FG. Our findings reveal that MLM-FG outperforms existing pre-training models, either based on SMILES or graphs, in 9 out of the 12 downstream tasks, ranking as a close second in the remaining ones. Remarkably, MLM-FG also surpasses 3D graph-based models, which explicitly incorporate molecular structures into their inputs, highlighting its exceptional capacity for representation learning even without explicit 3D structural information. These results show that pre-trained transformer encoders specialized in molecular SMILES deliver robust performance, matching or even exceeding existing supervised or unsupervised language models and GNN benchmarks in accurately forecasting a broad spectrum of molecular properties.

2 Results
In this section, we present a series of comprehensive experiments designed to illustrate the efficacy of MLM-FG. These experiments include performance comparisons across various downstream tasks and visual analysis of pre-trained representations. To assess the impact of model architecture and data size, we utilized two transformer-based models for pre-training on a corpus consisting of millions of molecules. These models are based on the MoLFormer [8] and RoBERTa architectures.
Moreover, we conducted a comparative analysis of MLM-FG with models pre-trained using different strategies and methods derived from existing literature. These include models based on molecular graphs, such as MolCLR [4], GROVER [19], and GEM [16], and models based on SMILES, e.g., MoLFormer [8]. Notably, two recent works, GEM [16], which incorporates the explicit 3D structures of 20 million molecules in pre-training, and MoLFormer [8], which is pre-trained using SMILES strings of 1.1 billion molecules, serve as strong baselines for the graph-based and SMILES-based lines of research, respectively.
2.1 Performance of MLM-FG on Downstream Tasks
Before fine-tuning MLM-FG on downstream tasks, we use 10 million, 20 million, and 100 million unlabelled molecules sampled from PubChem [20], a public-access database of chemical compounds, to pre-train MLM-FG with two transformer-based models. Subsequently, we conduct experiments on multiple molecular benchmarks from MoleculeNet [21], including seven classification tasks and five regression tasks. Following previous work [16, 22, 8], we adopt the scaffold split [23], which divides molecules based on their molecular substructure. By separating structurally distinct molecules into different subsets, scaffold splitting poses a more significant challenge and offers a more robust test of model generalizability than random splitting.
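For readers unfamiliar with this protocol, the following is a minimal sketch of scaffold splitting based on RDKit's Bemis-Murcko scaffolds; the group ordering and fraction handling are simplified assumptions, and benchmark implementations (e.g., in DeepChem) may differ in detail.

```python
from collections import defaultdict

from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_split(smiles_list, frac_train=0.8, frac_valid=0.1):
    """Group molecules by Bemis-Murcko scaffold and assign whole groups to
    train/validation/test, largest groups first, so that structurally similar
    molecules never straddle two subsets."""
    groups = defaultdict(list)
    for idx, smi in enumerate(smiles_list):
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue  # skip unparsable SMILES
        scaffold = MurckoScaffold.MurckoScaffoldSmiles(mol=mol, includeChirality=False)
        groups[scaffold].append(idx)

    train, valid, test = [], [], []
    n = len(smiles_list)
    for group in sorted(groups.values(), key=len, reverse=True):
        if len(train) + len(group) <= frac_train * n:
            train.extend(group)
        elif len(valid) + len(group) <= frac_valid * n:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test
```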
Table 1: AUC of MLM-FG and baseline models on seven MoleculeNet classification benchmarks under scaffold splitting (higher is better).

| | BBBP | BACE | ClinTox | Tox21 | SIDER | HIV | MUV |
|---|---|---|---|---|---|---|---|
| No. molecules | 2,039 | 1,513 | 1,478 | 7,831 | 1,427 | 41,127 | 93,087 |
| No. prediction tasks | 1 | 1 | 2 | 12 | 27 | 1 | 17 |
| *Pre-trained models from existing literature* | | | | | | | |
| MolCLR-gin | 0.9307 | 0.7873 | 0.8005 | 0.7644 | 0.5826 | 0.7768 | 0.7386 |
| MolCLR-gcn | 0.8432 | 0.7194 | 0.7997 | 0.7179 | 0.5353 | 0.7616 | 0.6701 |
| GROVER-base | 0.9022 | 0.7700 | 0.6847 | 0.7187 | 0.5579 | 0.6950 | 0.6265 |
| GROVER-large | 0.8861 | 0.7795 | 0.6082 | 0.7155 | 0.5283 | 0.6956 | 0.5132 |
| GEM | 0.9103 | 0.8603 | 0.8506 | 0.7791 | 0.6279 | 0.7500 | 0.7253 |
| MoLFormer | 0.9037 | 0.8275 | 0.9451 | 0.7734 | 0.5826 | 0.7630 | 0.7599 |
| *MoLFormer and RoBERTa models without pre-training* | | | | | | | |
| MoLFormer (from scratch) | 0.8636 | 0.7728 | 0.7317 | 0.7461 | 0.5667 | 0.6991 | 0.6863 |
| RoBERTa (from scratch) | 0.8711 | 0.7445 | 0.8858 | 0.7369 | 0.5285 | 0.5575 | 0.6674 |
| *RoBERTa models pre-trained by random subsequence masking* | | | | | | | |
| RoBERTa (10M, rand. subseq) | 0.8572 | 0.8253 | 0.9284 | 0.7533 | 0.6111 | 0.7006 | 0.6234 |
| RoBERTa (20M, rand. subseq) | 0.9068 | 0.8135 | 0.9011 | 0.7635 | 0.5799 | 0.7477 | 0.6481 |
| RoBERTa (100M, rand. subseq) | 0.9048 | 0.8248 | 0.9167 | 0.7852 | 0.5860 | 0.7683 | 0.6909 |
| *MoLFormer and RoBERTa models pre-trained by MLM-FG* | | | | | | | |
| MLM-FG (MoLFormer, 10M) | 0.8980 | 0.8044 | 0.9669 | 0.7765 | 0.5811 | 0.7633 | 0.6829 |
| MLM-FG (MoLFormer, 20M) | 0.8976 | 0.8088 | 0.9436 | 0.7793 | 0.5992 | 0.7801 | 0.7185 |
| MLM-FG (MoLFormer, 100M) | 0.9055 | 0.8040 | 0.9270 | 0.7893 | 0.5786 | 0.7690 | 0.6017 |
| MLM-FG (RoBERTa, 10M) | 0.8870 | 0.8265 | 0.9258 | 0.7545 | 0.6054 | 0.7106 | 0.6103 |
| MLM-FG (RoBERTa, 20M) | 0.9378 | 0.8458 | 0.8919 | 0.7603 | 0.5908 | 0.7594 | 0.6428 |
| MLM-FG (RoBERTa, 100M) | 0.9237 | 0.7981 | 0.9606 | 0.7896 | 0.6042 | 0.7807 | 0.7990 |
2.1.1 Classification Tasks
We choose seven classification tasks from the MoleculeNet benchmark and six baseline models to evaluate and compare the performance of MLM-FG. Based on the experimental results shown in Table 1, we conclude that MLM-FG, employing either the MoLFormer or the RoBERTa architecture, surpasses all of the baselines in five (BBBP, ClinTox, Tox21, HIV, and MUV) out of seven benchmarks and comes a close second in the other two (BACE and SIDER). This result demonstrates the superiority of MLM-FG in predicting molecular properties; in particular, among SMILES-based solutions, MLM-FG delivers the highest classification accuracy. GEM outperforms MLM-FG on the BACE and SIDER datasets, which could be attributed to its utilization of explicit 3D structural information of molecules.
Table 2: Regression errors of MLM-FG and baseline models on five MoleculeNet benchmarks under scaffold splitting (RMSE for ESOL, FreeSolv, and Lipo; MAE for QM7 and QM8; lower is better).

| | ESOL (RMSE) | FreeSolv (RMSE) | Lipo (RMSE) | QM7 (MAE) | QM8 (MAE) |
|---|---|---|---|---|---|
| No. molecules | 1,128 | 642 | 4,200 | 6,830 | 21,786 |
| No. prediction tasks | 1 | 1 | 1 | 1 | 12 |
| *Pre-trained models from existing literature* | | | | | |
| MolCLR-gin | 1.4717 | 2.7116 | 0.7411 | 96.5469 | 0.0205 |
| MolCLR-gcn | 1.5074 | 2.7273 | 0.9033 | 93.5973 | 0.0235 |
| GROVER-base | 0.8813 | 1.7772 | 0.6664 | 101.2853 | 0.0228 |
| GROVER-large | 0.8831 | 2.7143 | 0.7063 | 114.3004 | 0.0234 |
| GEM | 0.7614 | 2.4581 | 0.6861 | 65.0067 | 0.0179 |
| MoLFormer | 0.6613 | 4.4485 | 0.4457 | 69.0700 | 0.0177 |
| *MoLFormer and RoBERTa models without pre-training* | | | | | |
| MoLFormer (from scratch) | 0.9721 | 3.3689 | 0.9500 | 68.6214 | 0.0279 |
| RoBERTa (from scratch) | 0.9513 | 3.6014 | 0.9910 | 66.8488 | 0.0244 |
| *RoBERTa models pre-trained by random subsequence masking* | | | | | |
| RoBERTa (10M, rand. subseq) | 0.4909 | 4.4444 | 0.4515 | 68.4687 | 0.0219 |
| RoBERTa (20M, rand. subseq) | 0.4596 | 3.1672 | 0.4560 | 70.1688 | 0.0207 |
| RoBERTa (100M, rand. subseq) | 0.4301 | 2.4527 | 0.4430 | 76.6563 | 0.0232 |
| *MoLFormer and RoBERTa models pre-trained by MLM-FG* | | | | | |
| MLM-FG (MoLFormer, 10M) | 0.3432 | 5.5461 | 0.4919 | 67.7549 | 0.0221 |
| MLM-FG (MoLFormer, 20M) | 0.4407 | 3.7525 | 0.4325 | 66.2175 | 0.0226 |
| MLM-FG (MoLFormer, 100M) | 0.5135 | 3.2596 | 0.4272 | 69.3677 | 0.0212 |
| MLM-FG (RoBERTa, 10M) | 0.5707 | 1.7430 | 0.4892 | 66.9334 | 0.0212 |
| MLM-FG (RoBERTa, 20M) | 0.6668 | 1.9143 | 0.6711 | 64.1665 | 0.0220 |
| MLM-FG (RoBERTa, 100M) | 0.3901 | 3.0487 | 0.3984 | 74.8177 | 0.0202 |
2.1.2 Regression Tasks
We choose five regression tasks from the MoleculeNet benchmark and six baseline models to evaluate the performance of MLM-FG. Based on the experimental results shown in Table 2, we conclude that MLM-FG exceeds the performance of all baselines in four out of five benchmarks (ESOL, FreeSolv, Lipo, and QM7) and attains comparable results on the QM8 dataset. In particular, MLM-FG shows a notable 41.01% improvement over the second-best model on the ESOL dataset. The QM8 dataset involves the prediction of several quantum-chemical measures, which is considered challenging without 3D information [8]. GEM and MoLFormer lead MLM-FG by a slight margin on QM8, possibly because GEM incorporates additional 3D information, while MoLFormer is pre-trained with 1.1 billion molecules (more than ten times larger than our largest pre-training corpus).
2.2 Ablation Study
To comprehensively evaluate the effectiveness of MLM-FG, we conducted ablation studies to dissect the contribution of key treatments, including pre-training strategies, model architectures, and the size of pre-training datasets, to the overall performance.
The comparison between functional group-based random masking, random subsequence masking, and training from scratch underscores the effectiveness of the masking approach proposed by MLM-FG. Notably, MLM-FG demonstrated a significant performance improvement, for instance, reducing the error on the ESOL dataset to 0.3432 using functional group-based masking, compared to 0.4909 with random subsequence masking and 0.9721 when trained from scratch. Moreover, we observe a clear performance improvement from the vanilla MoLFormer to the MoLFormer models pre-trained by MLM-FG on most datasets for both classification and regression tasks. This observation confirms the advantage of the proposed functional group-aware random masking strategy, even with fewer molecules for pre-training. In addition, the comparisons between MoLFormer and RoBERTa models highlight a distinct advantage for the larger RoBERTa model in most cases. For instance, in regression tasks such as FreeSolv, RoBERTa models pre-trained by MLM-FG posted superior results (e.g., an error of 1.7430 for RoBERTa 10M) compared to the MLM-FG pre-trained MoLFormer models under similar conditions (an error of 5.5461 for MoLFormer 10M).
Expanding the dataset size typically leads to improved performance for MLM-FG, as demonstrated across various benchmarks. For example, RoBERTa models pre-trained with MLM-FG achieve an accuracy of 0.6103 on the MUV dataset when utilizing 10 million molecules for pre-training; this accuracy increases to 0.6428 with a dataset of 20 million molecules and further rises to 0.7990 when leveraging the entire 100 million molecules. Moreover, training with just 10 million molecules yields an error of 0.3432 on the ESOL dataset, a significant improvement over MoLFormer models trained from scratch (0.9721). However, when expanding the training set to 20 million and 100 million molecules, we see a performance dip (0.4407 and 0.5135, respectively), indicating possible overfitting or inefficiencies in handling larger datasets without adjusting the methodology accordingly [24]. Thus, while increased dataset sizes generally improve the accuracy of MLM-FG, the specific nature of the data and the pre-training techniques also play a critical role in extracting the maximal benefit from larger datasets.

2.3 Pre-trained Representations Visualization
The visualization of pre-trained representations provides comprehensive insights into the molecular representations learned by MLM-FG.
Our first visualization analysis connects the weights of molecules with the distribution of their learned representations. These representations, extracted from the downstream datasets without fine-tuning, cover 312,879 unique molecules. As shown in Figure 2, we map these representations onto a 2D space using UMAP [25], where each point in the visualization is color-coded by its corresponding molecular weight (g/mol). Figure 2 shows that, even without task-specific fine-tuning, MLM-FG is capable of distinguishing between low- and high-molecular-weight molecules, indicating that the pre-trained representations of MLM-FG have successfully captured molecular property information.
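The following is a minimal sketch of this visualization, assuming the frozen encoder's pooled embeddings are already available as a NumPy array; the UMAP hyper-parameters and plotting details are illustrative, not the exact settings used for Figure 2.

```python
import matplotlib.pyplot as plt
import numpy as np
import umap
from rdkit import Chem
from rdkit.Chem import Descriptors

def plot_embeddings_by_molwt(embeddings: np.ndarray, smiles_list: list,
                             out_path: str = "umap_molweight.png") -> None:
    """Project pooled SMILES embeddings (shape [N, d]) onto 2D with UMAP and
    color each point by its molecular weight computed with RDKit."""
    weights = np.array([Descriptors.MolWt(Chem.MolFromSmiles(s)) for s in smiles_list])
    coords = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=0).fit_transform(embeddings)
    plt.figure(figsize=(6, 5))
    sc = plt.scatter(coords[:, 0], coords[:, 1], c=weights, s=2, cmap="viridis")
    plt.colorbar(sc, label="Molecular weight (g/mol)")
    plt.savefig(out_path, dpi=300)
```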
A second visualization study analyzes how much 2D graph and 3D geometric information is contained in the pre-trained representations of MLM-FG. This study focuses on comparisons among SMILES-based transformer models, including the standard MoLFormer as well as MoLFormer and RoBERTa variants pre-trained using MLM-FG. For each model, given a SMILES string as input, we extracted attention vectors for every atomic token, analyzed attentions across different atomic token pairs, and constructed attention matrices over these atoms. We then compare these attention matrices with the corresponding matrices representing covalent bond connectivity and 3D distances between atom pairs. Figure 3 showcases the adjacency matrix, 3D distance matrix, and attention matrices for the molecule “CP(Br)C=O”. It shows that the attention matrices obtained by MLM-FG are close to the 3D distance matrix of the molecule, especially compared to MoLFormer and the MoLFormer/RoBERTa variants pre-trained with other strategies. Table 3 presents the cosine similarity between the 3D distance matrices and attention matrices, averaged over 50,000 molecules, for various models. Compared to MoLFormer, MoLFormer trained from scratch, and RoBERTa trained from scratch, MLM-FG with both the MoLFormer and RoBERTa architectures pre-trained on 100M molecules exhibits superior performance in capturing the relationship between attention matrices and 3D distance matrices.
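As a reference for the structural side of this comparison, the sketch below derives the adjacency and 3D distance matrices for a molecule with RDKit (the MMFF94 embedding step mirrors the conversion discussed earlier) and measures cosine similarity between flattened matrices; the extraction of the attention matrix from the model is omitted and assumed to be available.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

def structure_matrices(smiles: str, seed: int = 0):
    """Return the covalent-bond adjacency matrix and an MMFF94-optimized
    3D inter-atomic distance matrix (heavy atoms only) for a molecule."""
    mol = Chem.MolFromSmiles(smiles)
    adjacency = Chem.GetAdjacencyMatrix(mol).astype(float)
    mol3d = Chem.AddHs(mol)
    AllChem.EmbedMolecule(mol3d, randomSeed=seed)   # generate a 3D conformer
    AllChem.MMFFOptimizeMolecule(mol3d)             # relax it with MMFF94
    mol3d = Chem.RemoveHs(mol3d)
    distances = Chem.Get3DDistanceMatrix(mol3d)
    return adjacency, distances

def matrix_cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two square matrices treated as flat vectors."""
    a, b = a.ravel(), b.ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

adjacency, distances = structure_matrices("CP(Br)C=O")
# `attention` would be the atom-to-atom attention matrix extracted from a model
# for the same molecule; matrix_cosine_similarity(attention, distances) then
# mirrors the per-molecule quantity averaged in Table 3.
```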

Table 3: Cosine similarity between attention matrices and 3D distance matrices, averaged over 50,000 molecules.

| MoLFormer | MoLFormer (from scratch) | RoBERTa (from scratch) | MLM-FG (MoLFormer, 100M) | MLM-FG (RoBERTa, 100M) |
|---|---|---|---|---|
| 0.5039 | 0.2174 | 0.8130 | 0.3306 | 0.8424 |
3 Discussion
This work proposes the MLM-FG framework, which incorporates functional groups as prior information on molecular structures and leverages a functional group-aware random masking strategy to pre-train molecular language models on large-scale SMILES databases, yielding enhanced performance and generalization capability on downstream tasks. Our model has been evaluated across 12 datasets, including 7 classification and 5 regression tasks, outperforming the existing state-of-the-art models on 9 of these datasets. Notably, 5 of these datasets involve predictions on multiple sub-tasks of molecular properties, with up to 27 sub-tasks. We also validate the impact of data scale and model type by pre-training MoLFormer and RoBERTa models on datasets of 10 million, 20 million, and 100 million molecules, followed by an analysis of downstream task performance. We employ AUC, RMSE, and MAE metrics to ensure a fair comparison among molecule analysis methods. Currently, there is limited research on leveraging Transformer architectures with added external knowledge for molecular data analysis. As a publicly available tool, MLM-FG offers a powerful resource for molecule analysis and further advanced applications.
3.1 Key Findings
Several key findings of this work could be summarized as follows.
- Compared to 2D/3D molecular graphs, SMILES strings lack explicit structural information and have limited topological awareness. Meanwhile, the proposed MLM-FG framework overcomes these limitations by incorporating functional group-aware random masking during pre-training, which enables implicit learning of structural features and functional group interactions from SMILES data, ultimately leading to more accurate predictions of molecular properties.
- The functional group-aware random masking strategy proposed by MLM-FG demonstrates a significant performance improvement compared to the masking strategies used in MoLFormer, random subsequence masking, and training from scratch. Furthermore, our analysis indicates that the distances between attention vectors of two atomic tokens extracted from MLM-FG closely approximate the actual 3D distances between atoms, providing a more accurate representation of molecular structures.
- Leveraging a larger model like RoBERTa or pre-training with a larger volume of data typically results in enhanced performance on downstream tasks, especially in the classification experiments. However, it is important to note that in many scenarios, employing larger models with more data may actually hurt performance [24].
3.2 Implications
The MLM-FG model represents a significant advancement in molecular modeling by capturing essential structural information through functional group-aware masking within SMILES strings. This capability enhances the prediction of molecular properties and may aid in the understanding of structure-activity relationships, making it a valuable tool across drug discovery and metabolomics studies.
In drug discovery, MLM-FG can be applied to virtual screening of large compound libraries to identify potential drug candidates, help prioritize compounds that are more likely to interact with specific biological targets, and aid in optimizing design of lead compounds. Additionally, MLM-FG may facilitate drug repurposing by screening existing drugs for new therapeutic targets, broadening the utility of known compounds. In metabolomics studies, MLM-FG may help identify unknown metabolites, providing valuable insights into metabolic processes and potential therapeutic targets.
Overall, MLM-FG emerges as a transformative tool in computational chemistry with broad implications and applications across the life sciences. Integrating MLM-FG into various research workflows can accelerate innovation and yield more efficient and targeted outcomes in their respective fields.
3.3 Limitations
The MLM-FG model, despite its innovations in molecule analysis, confronts several challenges. Firstly, it cannot model very long SMILES sequences: those exceeding 512 tokens are truncated, potentially leading to information loss. Secondly, our focus is on molecular modeling, which limits the framework's extension to tasks such as predicting chemical reactions between molecules and molecular generation. Additionally, the performance of MLM-FG could be further improved by incorporating 3D information of molecules into pre-training, fine-tuning, and testing. Moreover, re-sampling of the pre-training datasets and advanced fine-tuning strategies could enhance MLM-FG on downstream tasks. Our future work will address these issues for potential performance enhancements.
4 Methods
This section provides a comprehensive overview of the design features associated with each component of MLM-FG. We will introduce the model architecture of MLM-FG and the pre-training strategy.
4.1 Model Architectures
In this work, we present MLM-FG, an approach for large-scale pre-training of molecules based on multi-layer, multi-head Transformer blocks. Specifically, MLM-FG adopts the architectural configuration shared by MoLFormer and RoBERTa, employing a 12-layer transformer with hidden state dimension $d$. Consider an input SMILES sequence denoted as $S = (s_1, s_2, \dots, s_n)$, where $n$ represents the length of the sequence. MLM-FG first tokenizes the sequence and subsequently feeds the tokens into the transformer. This process enables us to extract token embeddings $H \in \mathbb{R}^{n \times d}$, where $d$ represents the dimension of the hidden representations of the tokens. The model then takes the series of token embeddings as input and transforms them into a lower-dimensional vector to output the embedding of the SMILES sequence. The total number of trainable parameters is approximately 48.1M in MLM-FG (MoLFormer) and approximately 93.8M in MLM-FG (RoBERTa).
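A minimal sketch of the RoBERTa-style variant is given below using the Hugging Face transformers library; the vocabulary size, hidden dimension, and mean-pooling readout are illustrative assumptions rather than the exact released configuration.

```python
import torch
from transformers import RobertaConfig, RobertaModel

# A sketch of the RoBERTa-style encoder variant. The vocabulary size, hidden
# dimension, and mean-pooling readout are illustrative assumptions, not the
# exact released configuration of MLM-FG (RoBERTa).
config = RobertaConfig(
    vocab_size=2048,                 # size of the SMILES token vocabulary (assumed)
    num_hidden_layers=12,
    num_attention_heads=12,
    hidden_size=768,                 # hidden state dimension d (assumed)
    max_position_embeddings=514,     # supports SMILES of up to 512 tokens
)
encoder = RobertaModel(config)

token_ids = torch.randint(low=5, high=config.vocab_size, size=(1, 32))  # dummy tokenized SMILES
outputs = encoder(input_ids=token_ids)
token_embeddings = outputs.last_hidden_state        # H with shape [1, n, d]
sequence_embedding = token_embeddings.mean(dim=1)   # pooled embedding of the SMILES sequence
```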
4.2 Pre-training Datasets
Following many other pre-training-based approaches, MLM-FG is structured into two main phases: pre-training and fine-tuning. During the pre-training phase, which is not tailored to any particular task, MLM-FG is trained on a vast corpus of SMILES sequences, ranging from 10 million to 100 million molecules sampled from PubChem [20]. This self-supervised training phase enables MLM-FG to discern sequential distributions and substructures in molecule sequences, thereby gaining a holistic grasp of their structural and functional characteristics.
4.3 Functional group-aware Pre-training Strategy
Pre-training strategies in molecular representation learning highly correlate with molecule formats. For pre-training with unlabeled data, the prevalent approach involves reconstructing randomly masked tokens in SMILES strings. Given that molecules with similar structures may have vastly different properties, this method might overlook the complex interrelations among molecular features and potentially distort molecular semantics. Our objective is to weave chemical domain knowledge, specifically regarding substructures, into the pre-training process.
Rather than randomly masking subsequences or tokens in SMILES, we mask the clusters of tokens in SMILES that represent these substructures. During the pre-training phase, we start by identifying the substructures that correspond to specific molecular functional groups using RDKit [16]. We then randomly select a subset of these identified substructures and mask the associated tokens within them. Based on the count of functional groups within a molecule, our masking strategy adjusts as follows:
- If a molecule does not contain any functional groups, atom masking is employed as the default strategy. This ensures that the model still learns general structural aspects of molecules lacking specific functional groups.
- For molecules with fewer than 10 functional groups, we mask only one functional group. This approach is designed to preserve the overall structural integrity of the molecule while still introducing the model to the complexity of functional groups.
- In cases where a molecule contains more than 10 functional groups, we randomly mask 10% of these groups. This strategy introduces a higher level of complexity and variability, challenging the model to better generalize its learning across a more diverse set of molecular substructures.
The model leverages self-supervised learning to predict masked atoms, thereby acquiring structural information about molecules. This methodical selection and masking process is instrumental in guiding the model to understand and predict the underlying structural characteristics of molecules, enhancing its ability to infer molecular properties and functionalities based on structural cues. By integrating domain knowledge about molecular substructures into our pre-training strategy, we enable the model to develop a more nuanced and accurate representation of molecular structures, paving the way for more effective learning and prediction in downstream tasks.
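To illustrate how such a rule-based selection can be realized, the sketch below identifies functional-group atoms with RDKit substructure matching and applies the count-based rules above. The SMARTS patterns, the atom-level fallback rate, and the mapping from matched atoms to masked tokens are simplified assumptions, not the exact implementation of MLM-FG.

```python
import random
from rdkit import Chem

# Illustrative SMARTS patterns for a few common functional groups; the full
# pattern set used by MLM-FG is an assumption here, not the released list.
FUNCTIONAL_GROUP_SMARTS = [
    "C(=O)[OX2H1]",              # carboxylic acid
    "C(=O)O[#6]",                # ester
    "C(=O)N",                    # amide
    "[NX3;H2,H1;!$(NC=O)]",      # primary/secondary amine
    "[OX2H]",                    # hydroxyl
    "[F,Cl,Br,I]",               # halide
]

def select_atoms_to_mask(smiles: str, atom_mask_rate: float = 0.15) -> set:
    """Return indices of atoms whose tokens should be masked, following the
    count-based rules described above."""
    mol = Chem.MolFromSmiles(smiles)
    matches = []
    for smarts in FUNCTIONAL_GROUP_SMARTS:
        matches.extend(mol.GetSubstructMatches(Chem.MolFromSmarts(smarts)))
    if not matches:
        # No functional group: fall back to masking random atoms.
        n = max(1, int(atom_mask_rate * mol.GetNumAtoms()))
        return set(random.sample(range(mol.GetNumAtoms()), n))
    if len(matches) < 10:
        chosen = [random.choice(matches)]                                 # mask one group
    else:
        chosen = random.sample(matches, max(1, int(0.1 * len(matches))))  # mask 10% of groups
    return {atom_idx for match in chosen for atom_idx in match}

print(select_atoms_to_mask("O=C(C)Oc1ccccc1C(=O)O"))   # aspirin
```

In the actual pre-training objective, the selected atom indices would be mapped to their SMILES token positions and replaced by the mask token before the masked-token prediction loss is computed.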
4.4 Setups of MLM-FG in Experiments
The MLM-FG approach gives rise to several model variants, distinguished primarily by their underlying architecture and the size of the pre-training dataset. This section introduces the key variants leveraged in our experiments, namely MLM-FG (MoLFormer) and MLM-FG (RoBERTa).
- MLM-FG (MoLFormer): This variant utilizes the MoLFormer architecture, specifically designed to capture complex molecular representations using rotary positional embeddings and an efficient linear attention mechanism. The MoLFormer models were pre-trained using three different dataset sizes: 10 million, 20 million, and 100 million molecules. These variations allow for an understanding of how the scale of the pre-training data impacts the effectiveness of model embeddings on downstream tasks, such as molecular property prediction.
- MLM-FG (RoBERTa): This variant builds upon the RoBERTa architecture, renowned for its robustness in handling masked language modeling tasks due to its bidirectional encoder representations. Similar to MoLFormer, the RoBERTa models were pre-trained on datasets of 10 million, 20 million, and 100 million molecules. These multiple pre-training data scales enable evaluations of the RoBERTa model's performance, adaptability, and efficiency when applied to different molecular prediction applications.
Both variants of MLM-FG, based on the MoLFormer and RoBERTa transformers, are instrumental in drawing comparative insights between different transformer-based architectures and dataset sizes. The insights gained from these variants help delineate the potential benefits and limitations inherent in each architecture, fostering an advanced understanding of their applicability within molecular informatics.
4.5 Setups of Baseline Methods for Comparisons
The models we are comparing against are based on cutting-edge methodologies derived from contemporary literature. These methods are as follows.
- MolCLR-gin and MolCLR-gcn: These are 2D molecular graph-based models designed to leverage molecular graphs. They are equipped with distinct features focusing on graph-based learning paradigms and trained on 10 million unlabelled molecules. The total number of trainable parameters is approximately 2.2M for MolCLR-gin and 0.8M for MolCLR-gcn.
- GEM: A 3D molecular graph-based model built upon a geometry-based approach. It incorporates innovative strategies based on molecular geometry in 3D space for pre-training on 20 million unlabelled molecules. The total number of trainable parameters in GEM is approximately 0.107M.
- GROVER-base and GROVER-large: These methods integrate Message Passing Networks into a Transformer-style architecture. They predict contextual properties based on atomic embeddings, encoding contextual information into node embeddings. The pre-training dataset includes 10 million molecules. The total number of trainable parameters is approximately 48M for GROVER-base and 100M for GROVER-large.
- MoLFormer: Another model based on SMILES representations. This model employs pre-trained representations to capture molecular information encoded as SMILES strings. The pre-training dataset contains 1.1 billion molecules, and the total number of trainable parameters in MoLFormer is approximately 48.1M.
These methods are included for comparison due to their representation of state-of-the-art molecular modeling techniques, each offering distinct advantages. MolCLR-gin and MolCLR-gcn focus on 2D graph representations, GEM provides a 3D approach, GROVER integrates Message Passing Networks with Transformer architectures for contextual analysis, and MoLFormer utilizes SMILES representations with extensive pre-training. Comparing against these varied advanced models allows a comprehensive evaluation of our proposed models’ effectiveness and improvements in predictive accuracy.
4.6 Hyper-parameters and Training Details
In MLM-FG, both MoLFormer and RoBERTa comprise 12 layers, each equipped with 12 attention heads. For pre-training, we initialized with a learning rate of 3, gradually reducing it using a LambdaLR scheduler, and utilized the AdamW optimizer with a batch size of 1,024 across 16 NVIDIA V100 GPUs. We conducted 50 epochs for the 10M and 20M datasets and reduced the epoch count to 20 for the 100M dataset to balance computational demands and training depth. In the fine-tuning phase, we kept the same learning rate but switched to the FusedLAMB optimizer for better efficiency, with a smaller batch size of 64 to ensure precise model adjustments tailored to specific tasks.
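A minimal sketch of the optimizer and scheduler wiring is shown below; the base learning rate, warmup length, and decay shape are illustrative placeholders, not the exact published hyper-parameters.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def build_optimizer(model: torch.nn.Module,
                    base_lr: float = 3e-4,        # placeholder value (assumption)
                    warmup_steps: int = 1000,
                    total_steps: int = 100_000):
    """AdamW paired with a warmup-then-linear-decay LambdaLR schedule."""
    optimizer = AdamW(model.parameters(), lr=base_lr)

    def lr_lambda(step: int) -> float:
        if step < warmup_steps:
            return step / max(1, warmup_steps)     # linear warmup to base_lr
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return max(0.0, 1.0 - progress)            # linear decay toward zero

    return optimizer, LambdaLR(optimizer, lr_lambda=lr_lambda)
```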
Data Availability
The datasets used for pre-training and fine-tuning are derived from previous studies. These datasets are publicly available via download links as follows.
- PubChem: https://pubchem.ncbi.nlm.nih.gov/
- FreeSolv: https://moleculenet.org/datasets-1
- ClinTox: https://moleculenet.org/datasets-1
Code Availability
We built MLM-FG using Python and PyTorch. The code repository of MLM-FG, readme files and tutorials are all available at https://anonymous.4open.science/r/MLM-FG/README.md. The checkpoints of pre-trained models are available for download at https://drive.google.com/drive/folders/16vOW0rzMJJAC0iNFbzb6E_40yQ3lbLaF.
Author Contributions
All authors made contributions to this paper. T. Peng designed studies, conducted experiments, and wrote part of the manuscript. Y. Li, X. Li, J. Bian, Z. Xie, N. Sui, S. Mumtaz, and Y. Xu were involved in the discussion and wrote part of the manuscript. L. Kong oversaw the research progress, was involved in the discussion, and wrote part of the manuscript. H. Xiong oversaw the research progress, designed the study and experiments, was involved in the discussion, and wrote the manuscript. T. Peng and H. Xiong made equal technical contributions to this work. H. Xiong is the senior author, and L. Kong shares the co-senior contribution.
Conflict of Interest Statement
The authors declare that they have no competing interests.
References
- [1] Bing Huang and O Anatole Von Lilienfeld. Communication: Understanding molecular representations in machine learning: The role of uniqueness and target similarity. The Journal of Chemical Physics, 145(16), 2016.
- [2] Laurianne David, Amol Thakkar, Rocío Mercado, and Ola Engkvist. Molecular representations in AI-driven drug discovery: a review and practical guide. J. Cheminformatics, 12(1):56, 2020.
- [3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics, 2019.
- [4] Yuyang Wang, Jianren Wang, Zhonglin Cao, and Amir Barati Farimani. Molecular contrastive learning of representations via graph neural networks. Nat. Mach. Intell., 4(3):279–287, 2022.
- [5] Jinhua Zhu, Yingce Xia, Tao Qin, Wengang Zhou, Houqiang Li, and Tie-Yan Liu. Dual-view molecule pre-training. CoRR, abs/2106.10234, 2021.
- [6] Pengyong Li, Jun Wang, Yixuan Qiao, Hao Chen, Yihuan Yu, Xiaojun Yao, Peng Gao, Guotong Xie, and Sen Song. Learn molecular representations from large-scale unlabeled molecules for drug discovery. CoRR, abs/2012.11175, 2020.
- [7] Yu Rong, Yatao Bian, Tingyang Xu, Weiyang Xie, Ying Wei, Wenbing Huang, and Junzhou Huang. Self-supervised graph transformer on large-scale molecular data. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
- [8] Jerret Ross, Brian Belgodere, Vijil Chenthamarakshan, Inkit Padhi, Youssef Mroueh, and Payel Das. Large-scale chemical language representations capture molecular structure and properties. Nature Machine Intelligence, 4(12):1256–1264, 2022.
- [9] David Weininger. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of Chemical Information and Computer Sciences, 28(1):31–36, 1988.
- [10] Sheng Wang, Yuzhi Guo, Yuhong Wang, Hongmao Sun, and Junzhou Huang. SMILES-BERT: Large scale unsupervised pre-training for molecular property prediction. In Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, pages 429–436, 2019.
- [11] Johan Broberg, Maria Margareta Bånkestad, and Erik Ylipää Hellqvist. Pre-training transformers for molecular property prediction using reaction prediction. In ICML 2022 2nd AI for Science Workshop, 2022.
- [12] Ross Irwin, Spyridon Dimitriadis, Jiazhen He, and Esben Jannik Bjerrum. Chemformer: a pre-trained transformer for computational chemistry. Machine Learning: Science and Technology, 3(1):015022, 2022.
- [13] Zaixi Zhang, Qi Liu, Hao Wang, Chengqiang Lu, and Chee-Kong Lee. Motif-based graph self-supervised learning for molecular property prediction. In Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 15870–15882, 2021.
- [14] Mengying Sun, Jing Xing, Huijun Wang, Bin Chen, and Jiayu Zhou. MoCL: Contrastive learning on molecular graphs with multi-level domain knowledge. CoRR, abs/2106.04509, 2021.
- [15] Shuang Zhang, Rui Fan, Yuti Liu, Shuang Chen, Qiao Liu, and Wanwen Zeng. Applications of transformer-based language models in bioinformatics: a survey. Bioinformatics Advances, 3(1):vbad001, 2023.
- [16] Xiaomin Fang, Lihang Liu, Jieqiong Lei, Donglong He, Shanzhuo Zhang, Jingbo Zhou, Fan Wang, Hua Wu, and Haifeng Wang. Geometry-enhanced molecular representation learning for property prediction. Nature Machine Intelligence, 4(2):127–134, 2022.
- [17] Kenneth Atz, Francesca Grisoni, and Gisbert Schneider. Geometric deep learning on molecular representations. Nature Machine Intelligence, 3(12):1023–1032, 2021.
- [18] Thomas A Halgren. Merck molecular force field. I. Basis, form, scope, parameterization, and performance of MMFF94. Journal of Computational Chemistry, 17(5-6):490–519, 1996.
- [19] Yu Rong, Yatao Bian, Tingyang Xu, Weiyang Xie, Ying Wei, Wenbing Huang, and Junzhou Huang. GROVER: Self-supervised message passing transformer on large-scale molecular data. arXiv preprint arXiv:2007.02835, 2(3):17, 2020.
- [20] Sunghwan Kim, Jie Chen, Tiejun Cheng, Asta Gindulyte, Jia He, Siqian He, Qingliang Li, Benjamin A Shoemaker, Paul A Thiessen, Bo Yu, et al. PubChem 2019 update: improved access to chemical data. Nucleic Acids Research, 47(D1):D1102–D1109, 2019.
- [21] Zhenqin Wu, Bharath Ramsundar, Evan N Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S Pappu, Karl Leswing, and Vijay Pande. MoleculeNet: a benchmark for molecular machine learning. Chemical Science, 9(2):513–530, 2018.
- [22] W Hu, B Liu, J Gomes, M Zitnik, P Liang, V Pande, and J Leskovec. Strategies for pre-training graph neural networks. In International Conference on Learning Representations (ICLR), 2020.
- [23] Bharath Ramsundar, Peter Eastman, Pat Walters, and Vijay Pande. Deep learning for the life sciences: applying deep learning to genomics, microscopy, drug discovery, and more. O'Reilly Media, Inc., 2019.
- [24] Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep double descent: Where bigger models and more data hurt. Journal of Statistical Mechanics: Theory and Experiment, 2021(12):124003, 2021.
- [25] Leland McInnes, John Healy, Nathaniel Saul, and Lukas Großberger. UMAP: Uniform manifold approximation and projection. Journal of Open Source Software, 3(29), 2018.