MolFusion: Multimodal Fusion Learning for Molecular Representations via Multi-granularity Views
Abstract
Artificial Intelligence predicts drug properties by encoding drug molecules, aiding in the rapid screening of candidates. Different molecular representations, such as SMILES and molecule graphs, contain complementary information for molecular encoding. Thus, exploiting complementary information from different molecular representations is one of the research priorities in molecular encoding. Most existing methods for combining molecular multi-modalities use only molecular-level information, making it hard to encode intra-molecular alignment information between different modalities. To address this issue, we propose MolFusion, a multi-granularity fusion method. The proposed MolFusion consists of two key components: (1) MolSim, a molecular-level encoding component that achieves molecular-level alignment between different molecular representations; and (2) AtomAlign, an atomic-level encoding component that achieves atomic-level alignment between different molecular representations. Experimental results show that MolFusion effectively utilizes complementary multimodal information, leading to significant improvements in performance across various classification and regression tasks.
Index Terms:
multimodality, fusion method, multi-granularity views
I Introduction
Artificial Intelligence (AI) enhances drug property prediction by encoding molecular structures, thus facilitating the rapid identification of unqualified drug candidates from extensive pools [1, 2, 3, 4, 5, 6, 7]. Studies [8, 9, 10] indicate that different molecular representations capture distinct and complementary aspects of molecular information, highlighting the potential benefits of integrating multiple modalities to fully understand and utilize the diverse characteristics of molecules.

It is reasonable to assume that different molecular representations of the same molecule possess complementary information because they encode distinct underlying knowledge (Complementary Information Assumption). Therefore, integrating complementary information from different molecular representations becomes a key research direction in molecular encoding. However, most existing studies focus only on combining molecular-level information and fail to incorporate intra-molecular alignment information between different modalities. For example, DMP [8], PanGu [11], GraphMVP [9], and MEMO [10] rely on contrastive learning or self-reconstruction techniques to align different representations. Consequently, much of the complementary information between different representations is not effectively utilized during molecular-level alignment.
To address the above issue, we consider both molecular-level and atomic-level alignments of different molecular representations to fully utilize complementary information (see Figure 1). Specifically, we propose MolFusion, a multi-granularity fusion learning method that 1) integrates different molecular representations (e.g., SMILES and molecule graphs); and 2) exploits both molecular-level and atomic-level alignments. The proposed molecular-level component, MolSim, brings the encoded representations of similar molecules closer together in continuous vector space by enhancing contrastive learning with similarity scores derived from molecular knowledge, replacing traditional binary labels. Our atomic-level component, AtomAlign, enriches atomic-level encoding by using one representation to complete the masked atoms in the other representation of the same molecule. Specifically, it predicts the masked information in SMILES by leveraging the encoded representations of the masked SMILES and the unmasked molecule graph. By synchronously training these two components, we achieve partial alignment of the encoded representations of the two molecular modalities, thereby significantly preserving complementary information.
To validate the effectiveness of our proposed multi-granularity fusion learning method, we select 6 classification tasks and 3 regression tasks from MoleculeNet [12] for evaluation. The experimental results demonstrate that our method achieves significant performance improvements compared to other fusion methods. Furthermore, the ablation study and visualization confirm the validity of our multi-granularity fusion approach, highlighting its robustness in preserving complementary information between different molecular modalities. Our proposed fusion learning method therefore surpasses existing fusion techniques and lays a solid foundation for future research in multimodal molecular representation learning. Additionally, our approach is model-agnostic, allowing the integration of any existing powerful pre-trained model.
In our research, we propose a multimodal and multi-granularity fusion learning method. Our contributions are as follows:
• Multi-granularity views: We introduce an innovative atomic-level component that enriches atomic-level encoding by using one representation to complete the masked atoms in the other representation of the same molecule. Additionally, we propose a molecular-level component that brings the encoded representations of similar molecules closer together in continuous vector space.
• Multimodal fusion: Our method effectively leverages complementary information from different molecular modalities. By integrating SMILES and molecular graphs, we harness the unique strengths of each representation to enhance the overall molecular encoding, ensuring robust preservation of complementary information.
• Effective strategy: The results show that our integrated model, despite its simplicity, outperforms existing fusion methods, demonstrating the effectiveness of our multimodal fusion strategy in enhancing molecular representation learning.
II Related Work
In this section, we analyze the development and shortcomings of current molecular representation methods, focusing on the perspective of molecular data modality.
II-A Unimodal Molecular Representation Methods
Numerous molecular representation learning methods are based on a single molecular representation, such as SMILES [13], molecular graphs [14], and 3D molecular representations [14]. Unimodal molecular representation learning already offers relatively mature training methods at both the molecular and atomic levels. Among molecular-level methods, contrastive learning [15, 16, 17] is a common training technique. MolCLR [18], MoCL [19], and GraphCL [20] perform contrastive learning on molecular graphs after data augmentation operations. Knowledge-based BERT [21] generates various SMILES from canonical SMILES for data augmentation. In MolGNet [22], molecular graphs are split into two halves: halves from the same molecule are used as positive samples, and halves from different molecules as negative samples. KCL [23] enhances molecular graphs using knowledge graphs, adding corresponding attribute nodes to atoms and connecting them with edges. The molecular graphs before and after knowledge enhancement are encoded separately, with the encodings of the same molecule as positive samples and the encodings of different molecules as negative samples.
Beyond contrastive learning, SMILES Transformer [24] uses an autoencoding method to learn molecular representations. Similarly, X-Mol [25] designs Transformer layers, using bidirectional self-attention layers in the first half to learn SMILES representations and unidirectional self-attention layers in the second half to regenerate SMILES from the learned embeddings. Grover [26] and knowledge-based BERT propose a molecular feature prediction task, predicting molecular features, such as the presence of specific functional groups, from the embedding of the entire molecule.
For atomic-level methods, the most classic approach is masked language models [27, 28, 29]. CHEM-BERT [30], for example, masks atoms in SMILES and uses molecular contextual information to predict the types of masked atoms. For molecular graphs, atom features are masked, and other molecular features, such as bond features, are used to predict the types of atoms [31, 22, 32, 33].
Beyond atom masking tasks, Grover proposes using the learned atomic features to predict the contextual properties of atoms. In the context prediction task [31], atom neighborhood encodings are used as node embeddings, and surrounding graph structures are represented as context embeddings. The task is to predict whether the node embeddings and context embeddings come from the same molecule. InfoGraph [34] proposes to predict which atoms belong to which molecules based on representations of all molecules and the atoms they contain.
Notably, Hu et al. [31] find that pre-training strategies applied at only the molecular level or only the atomic level offer limited improvements and may even degrade performance on many downstream tasks. Therefore, most works [35, 26, 22] attempt to retain both molecular-level and atomic-level training strategies, employing multi-granularity methods to ensure optimal performance.
II-B Multimodal Molecular Representation Methods
Researchers [8, 9, 10] find that different molecular representations can provide complementary information, making the effective utilization of multiple molecular representations a key focus in drug property prediction. For instance, DMP [8] and PanGu [11] utilize molecular linear representations and molecular graph information, GraphMVP [9] leverages molecular graphs and 3D molecular representations, while MEMO [10] processes multiple representations, including SMILES, molecular graphs, 3D representations, and fingerprints. In multimodal methods, contrastive learning and its variants are prevalent. For example, GraphMVP and DMP treat different representations of the same molecule as positive samples and representations from different molecules as negative samples for contrastive learning or its variants. MEMO treats single representations and aggregated representations from the same molecule as positive samples and those from different molecules as negative samples.
Beyond contrastive learning, self-reconstruction is another approach. GraphMVP reconstructs one representation from the other between molecular graphs and 3D molecular representations. Similarly, PanGu uses molecular graphs to reconstruct the SELFIES linear representation [36, 37], which is similar to SMILES.
Unfortunately, these works lack exploration of multimodal atomic-level approaches and also ignore existing well-trained, powerful unimodal models. For instance, DMP is a typical work that integrates SMILES and molecule graphs, applying atomic-level masking and dual-view molecular-level contrastive learning to the two single modalities. However, DMP trains two encoders from scratch, neglecting current powerful single-modal encoders; as a result, it requires 110M training molecules, about 4.4 times our training data, and consumes significant computational resources.
Therefore, our approach aims to propose a multi-granularity molecular representation fusion learning, utilizing atomic-level alignment to enhance the complementary information between different molecular representations.
III Methodology
In this section, we introduce the molecular-level component MolSim and the atomic-level component AtomAlign to achieve fusion learning of pre-trained encoders for different molecular representations. Molecular-level and atomic-level methods (see Figure 1) treat molecules and atoms, respectively, as the minimum processing units.
III-A Molecular-level Fusion Component: MolSim
Contrastive learning [15] is a typical molecular-level method for multimodal fusion, labeling different representations of the same molecule as 1 and those of different molecules as 0. This approach implies that any two different molecules are entirely dissimilar [38, 39]. However, as shown in Figure 2, although Aspirin and Paracetamol are different molecules, they share similar properties, such as solubility in water and ethanol and analgesic effects. To better capture molecular similarities, we introduce MolSim, which uses continuous similarity measures instead of traditional binary labels. MolSim thus captures subtle molecular relationships more accurately.

For accurate similarity measurement, we utilize Morgan fingerprints [40] to compute the Tanimoto coefficient [41], which captures detailed molecular information, including atom types, bond types, and functional groups. The coefficient effectively reflects structural similarity, providing a reasonable measure of molecular relationships.

In the MolSim component, we replace binary labels with continuous similarity measures by calculating the Tanimoto coefficient between Morgan fingerprints, as shown in Figure 3 (a). This continuous similarity measure allows MolSim to learn more effectively and leverage complementary information.
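For concreteness, the following minimal sketch shows how such a similarity matrix can be computed with RDKit; the fingerprint radius (2) and bit size (2048) are illustrative choices, not settings reported in this paper.

```python
# Minimal sketch: building the Tanimoto similarity matrix S with RDKit.
# Radius 2 and 2048 bits are illustrative assumptions.
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def tanimoto_matrix(smiles_list):
    mols = [Chem.MolFromSmiles(s) for s in smiles_list]
    fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048) for m in mols]
    n = len(fps)
    S = np.eye(n)
    for i in range(n - 1):
        # Pairwise Tanimoto similarities between fingerprint i and all later ones.
        sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[i + 1:])
        S[i, i + 1:] = sims
        S[i + 1:, i] = sims
    return S

# Aspirin vs. Paracetamol: different molecules, but a non-zero similarity score.
print(tanimoto_matrix(["CC(=O)Oc1ccccc1C(=O)O", "CC(=O)Nc1ccc(O)cc1"]))
```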
The similarity matrix $S$, calculated using the Tanimoto coefficient between Morgan fingerprints as described above, serves as the target label for learning. Given the features from SMILES and molecule graphs, the predicted similarity matrix is computed as:

$$\hat{S} = \tau \, F_s F_g^{\top} \quad (1)$$

where $\tau$ is the inverse of the temperature parameter, and $F_s$ and $F_g$ represent the feature matrices for SMILES and molecule graphs, respectively.
In MolSim, we use the mean squared error to calculate the loss between the model prediction $\hat{S}$ and the molecular similarity matrix $S$:

$$\mathcal{L}_{\mathrm{MolSim}} = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \left( \hat{S}_{ij} - S_{ij} \right)^2 \quad (2)$$

where $N$ is the number of samples in the matrix.
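A minimal PyTorch sketch of the MolSim objective in Eqs. (1)-(2) follows; the feature normalization and the default value of $\tau$ are illustrative assumptions rather than reported settings.

```python
# Minimal sketch of the MolSim loss (Eqs. 1-2).
import torch
import torch.nn.functional as F

def molsim_loss(f_smiles, f_graph, S_target, tau=1.0):
    """f_smiles, f_graph: (N, d) feature matrices; S_target: (N, N) Tanimoto matrix.
    tau is the inverse temperature of Eq. (1); its value here is an assumption."""
    f_smiles = F.normalize(f_smiles, dim=-1)  # assumed L2 normalization
    f_graph = F.normalize(f_graph, dim=-1)
    S_pred = tau * f_smiles @ f_graph.T       # Eq. (1): cross-modal similarity
    return F.mse_loss(S_pred, S_target)       # Eq. (2): MSE against Tanimoto labels
```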
III-B Atomic-level Fusion Component: AtomAlign
In addition to learning molecular-level representations, we introduce AtomAlign for atomic-level fusion. Through atomic alignment, the model can align identical information contained in different molecular representations and obtain unique molecular information from each representation. This is consistent with the Complementary Information Assumption. In AtomAlign, we employ atomic-level masking to achieve alignment of atomic features between different molecular representations, as illustrated in Figure 3 (b).
In AtomAlign, we randomly mask atoms in SMILES and encode the masked SMILES. We then subtract the masked SMILES encodings from the molecular graph encodings of the same molecule. The resulting encodings are passed through a linear layer to predict the types of the masked atoms in SMILES and to identify which atoms are unmasked (without predicting their types).
To align atomic features between different molecular representations, we first calculate the difference between the molecular graph encodings $E_g$ and the masked SMILES encodings $E_s^{\mathrm{mask}}$. The resulting difference vector $D$ is given by:

$$D = E_g - E_s^{\mathrm{mask}} \quad (3)$$
Our loss function integrates the prediction of masked atoms and the identification of unmasked atoms. The total loss is computed as follows:

$$\mathcal{L}_{\mathrm{mask}} = -\sum_{i} m_i \log p\left(y_i \mid d_i\right) \quad (4)$$

$$\mathcal{L}_{\mathrm{unmask}} = -\sum_{i} \left[ m_i \log \hat{m}_i + \left(1 - m_i\right) \log\left(1 - \hat{m}_i\right) \right] \quad (5)$$

$$\mathcal{L}_{\mathrm{AtomAlign}} = \alpha \, \mathcal{L}_{\mathrm{mask}} + \left(1 - \alpha\right) \mathcal{L}_{\mathrm{unmask}} \quad (6)$$

where $d_i$ represents the difference vector at the $i$-th position, $y_i$ is the index label of the masked atom at position $i$ in the SMILES representation, $M$ represents the masking indicator matrix whose entry $m_i$ is the 0/1 label indicating whether the atom at the $i$-th position is masked, $\hat{m}_i$ is the predicted masking probability at position $i$, and $\alpha$ is a weighting parameter that balances the contributions of the masked and unmasked loss components, with $0 \le \alpha \le 1$.
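A minimal PyTorch sketch of the AtomAlign head and loss (Eqs. (3)-(6)) follows; the single linear layer matches the description above, while the output layout and the binary cross-entropy form of Eq. (5) are our illustrative assumptions.

```python
# Minimal sketch of AtomAlign (Eqs. 3-6). Per-position encodings of the two
# modalities are assumed to be aligned and of equal length.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AtomAlignHead(nn.Module):
    def __init__(self, hidden_dim, vocab_size, alpha=0.5):
        super().__init__()
        # One linear layer predicts the masked-atom type plus one
        # "is this position masked?" logit per position.
        self.proj = nn.Linear(hidden_dim, vocab_size + 1)
        self.alpha = alpha  # weight between the two loss terms, Eq. (6)

    def forward(self, e_graph, e_smiles_masked, atom_labels, mask):
        d = self.proj(e_graph - e_smiles_masked)        # Eq. (3), then linear head
        type_logits, mask_logits = d[..., :-1], d[..., -1]
        # Eq. (4): atom-type cross-entropy on masked positions only.
        l_mask = F.cross_entropy(type_logits[mask.bool()], atom_labels[mask.bool()])
        # Eq. (5): binary identification of masked vs. unmasked positions.
        l_unmask = F.binary_cross_entropy_with_logits(mask_logits, mask.float())
        return self.alpha * l_mask + (1 - self.alpha) * l_unmask  # Eq. (6)
```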
These two components are trained synchronously; therefore:

$$\mathcal{L} = \mathcal{L}_{\mathrm{MolSim}} + \beta \, \mathcal{L}_{\mathrm{AtomAlign}} \quad (7)$$

where $\beta$ is a hyperparameter.
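Continuing the two sketches above, one synchronous training step for Eq. (7) could look as follows; the encoder interfaces and the value of $\beta$ are hypothetical, not the CHEM-BERT or Grover APIs.

```python
# Hypothetical training step combining both components (Eq. 7).
def molfusion_step(batch, smiles_encoder, graph_encoder, head, optimizer, beta=1.0):
    # smiles_encoder / graph_encoder are stand-ins for the pre-trained backbones;
    # their call signatures here are assumptions for illustration.
    f_s, e_s_masked = smiles_encoder(batch["smiles"], mask=batch["mask"])
    f_g, e_g = graph_encoder(batch["graph"])
    loss = molsim_loss(f_s, f_g, batch["S_target"]) \
           + beta * head(e_g, e_s_masked, batch["atom_labels"], batch["mask"])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```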
Our method design offers several advantages: (1) Integrating molecular-level and atomic-level fusion learning effectively leverages the complementary information from different molecular representations. (2) Our method significantly reduces data and computational resource consumption by efficiently reusing existing powerful unimodal encoders. (3) Because SMILES and molecular graph data are mutually derivable, even if only a SMILES string or a molecular graph is provided as input in a downstream task, we can use the RDKit tool [42] to obtain the other modal representation of the molecule, thereby jointly improving model performance, as shown in Figure 3 (c). (4) Our design is notably lightweight, introducing only a single linear layer in the AtomAlign task, which avoids excessive model parameters and reduces training costs.
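As an illustration of point (3), RDKit can recover the missing modality from the one provided:

```python
# Recovering the other modality with RDKit (point 3 above).
from rdkit import Chem

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # SMILES in hand -> graph object

# Graph view: atoms as nodes, bonds as edges, ready for a graph encoder.
atoms = [a.GetSymbol() for a in mol.GetAtoms()]
edges = [(b.GetBeginAtomIdx(), b.GetEndAtomIdx()) for b in mol.GetBonds()]

# Reverse direction: graph (Mol object) back to canonical SMILES.
canonical_smiles = Chem.MolToSmiles(mol)
```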
IV Experiments
In this section, we introduce experimental design, including the backbones, datasets, and baselines. Besides, we validate the effectiveness of our approach using 6 classification tasks and 3 regression tasks from MoleculeNet [12]. Additionally, we verify the Complementary Information Assumption and demonstrate the rationality of the components through comprehensive experimental analysis.
IV-A Experimental Design
IV-A1 Backbones
SMILES Encoder: We select CHEM-BERT [30] as the SMILES encoder in our study because CHEM-BERT demonstrates robust performance in drug property prediction. Its ability to encode SMILES into continuous embeddings effectively captures unique molecular characteristics.
Molecular Graph Encoder: For molecular graph encoders, we use the Grover-large model [26]. Grover-large is a powerful graph neural network-based encoder that can effectively process and learn representations of molecular graphs.
TABLE I: Performance of each fusion method and aggregation operation (mean ± std over three runs). The first six datasets are classification tasks evaluated by ROC-AUC (%, higher is better); the last three are regression tasks evaluated by RMSE (lower is better). Dataset sizes are given in parentheses. MG = molecular graph encoder only; SMILES = SMILES encoder only; CCO = concatenation; EWA = element-wise addition.

| Fusion Method | Aggregation | SIDER (1427) | BBBP (2050) | BACE (1514) | Clintox (1484) | Tox21 (7831) | Toxcast (8597) | ESOL (1128) | Lipo (4201) | Freesolv (642) |
|---|---|---|---|---|---|---|---|---|---|---|
| No-Train | MG | 55.64 ± 3.04 | 55.77 ± 1.19 | 38.35 ± 2.44 | 47.13 ± 3.81 | 52.05 ± 0.34 | 48.83 ± 0.31 | 2.071 ± 0.077 | 1.039 ± 0.018 | 4.197 ± 0.051 |
| No-Train | SMILES | 50.21 ± 0.11 | 60.01 ± 0.35 | 50.47 ± 9.25 | 39.47 ± 0.65 | 56.38 ± 0.06 | 51.31 ± 0.06 | 2.256 ± 0.006 | 1.100 ± 0.008 | 4.163 ± 0.037 |
| No-Train | CCO | 54.56 ± 2.15 | 58.41 ± 0.28 | 37.59 ± 7.85 | 45.61 ± 3.83 | 53.77 ± 0.35 | 49.25 ± 0.37 | 2.065 ± 0.140 | 0.989 ± 0.020 | 4.203 ± 0.042 |
| No-Train | EWA | 55.93 ± 3.21 | 56.41 ± 0.95 | 39.49 ± 3.69 | 49.86 ± 3.79 | 54.32 ± 0.98 | 49.12 ± 0.31 | 2.141 ± 0.119 | 1.030 ± 0.019 | 4.153 ± 0.030 |
| CL | MG | 53.54 ± 2.28 | 54.37 ± 2.00 | 43.89 ± 1.52 | 63.33 ± 2.47 | 50.27 ± 0.13 | 48.73 ± 0.17 | 2.194 ± 0.035 | 1.054 ± 0.007 | 4.204 ± 0.009 |
| CL | SMILES | 54.23 ± 2.34 | 55.28 ± 2.38 | 46.70 ± 7.46 | 39.71 ± 1.15 | 51.35 ± 0.06 | 50.23 ± 0.03 | 2.195 ± 0.040 | 1.093 ± 0.011 | 4.180 ± 0.024 |
| CL | CCO | 53.56 ± 2.20 | 58.21 ± 2.63 | 50.77 ± 3.29 | 46.54 ± 0.37 | 51.59 ± 0.41 | 49.47 ± 0.08 | 2.129 ± 0.081 | 1.020 ± 0.015 | 4.223 ± 0.022 |
| CL | EWA | 53.93 ± 2.21 | 55.23 ± 3.25 | 44.88 ± 5.92 | 49.15 ± 1.77 | 50.51 ± 0.05 | 49.75 ± 0.05 | 2.054 ± 0.087 | 1.052 ± 0.002 | 4.205 ± 0.013 |
| DMP | MG | 49.18 ± 0.72 | 52.89 ± 6.55 | 57.99 ± 10.47 | 50.99 ± 4.31 | 48.50 ± 0.42 | 50.22 ± 0.24 | 2.251 ± 0.150 | 1.113 ± 0.004 | 4.195 ± 0.041 |
| DMP | SMILES | 51.29 ± 1.21 | 51.38 ± 1.53 | 40.74 ± 4.70 | 51.09 ± 3.61 | 49.31 ± 1.44 | 49.87 ± 0.41 | 2.250 ± 0.013 | 1.113 ± 0.004 | 4.192 ± 0.033 |
| DMP | CCO | 49.40 ± 1.81 | 48.14 ± 4.53 | 39.75 ± 1.80 | 47.79 ± 5.45 | 48.38 ± 0.60 | 50.32 ± 0.15 | 2.245 ± 0.006 | 1.116 ± 0.003 | 4.178 ± 0.017 |
| DMP | EWA | 49.40 ± 1.74 | 53.01 ± 6.62 | 57.79 ± 10.4 | 52.07 ± 3.35 | 47.66 ± 0.82 | 49.83 ± 0.25 | 2.256 ± 0.019 | 1.112 ± 0.005 | 4.207 ± 0.050 |
| Ours | MG | 56.09 ± 3.45 | 57.05 ± 1.06 | 41.91 ± 4.27 | 73.72 ± 0.98 | 61.62 ± 0.06 | 49.30 ± 0.01 | 2.138 ± 0.100 | 1.040 ± 0.024 | 4.214 ± 0.009 |
| Ours | SMILES | 54.98 ± 2.55 | 56.78 ± 0.11 | 45.56 ± 12.63 | 49.26 ± 0.59 | 55.88 ± 0.02 | 51.89 ± 0.08 | 2.066 ± 0.071 | 1.023 ± 0.006 | 4.160 ± 0.004 |
| Ours | CCO | 55.00 ± 4.01 | 61.19 ± 0.43 | 54.11 ± 3.28 | 72.94 ± 3.57 | 63.37 ± 0.03 | 50.23 ± 0.08 | 1.907 ± 0.064 | 0.963 ± 0.004 | 4.074 ± 0.006 |
| Ours | EWA | 55.80 ± 3.14 | 60.63 ± 0.55 | 53.17 ± 10.58 | 57.44 ± 0.96 | 62.22 ± 0.09 | 50.41 ± 0.01 | 1.823 ± 0.060 | 0.979 ± 0.019 | 4.108 ± 0.003 |
IV-A2 Datasets
Training Dataset
We choose ZINC [43] as the training dataset, which contains 249,455 compounds. These compounds cover a wide range of chemical space, including drug samples, natural products, fragment compounds, prodrug molecules, and various ligands. We randomly split the training and validation sets in a 9:1 ratio. It is worth noting that the pre-training corpora of the selected backbones (CHEM-BERT and Grover-large) include ZINC, ZINC15 [44], and ChEMBL [45]. Our training data therefore introduces no new data beyond the backbones' pre-training corpora, indicating that the improvement of the fusion method stems from the complementarity of the two modalities rather than from additional training data.
Downstream Task Datasets
We select 6 classification tasks and 3 regression tasks from MoleculeNet. Classification tasks include BBBP, BACE, SIDER, Clintox, Toxcast, and Tox21. Regression tasks include ESOL, Lipo, and Freesolv. For both classification and regression tasks, we apply scaffold splitting [46]. The datasets are divided into training, validation, and test sets with a ratio of 80%, 10%, and 10%, respectively.
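A minimal sketch of scaffold splitting based on Bemis-Murcko scaffolds follows; MoleculeNet's exact implementation may differ in ordering and tie-breaking details.

```python
# Sketch of scaffold splitting: molecules sharing a Bemis-Murcko scaffold stay
# in the same split, so test scaffolds are unseen during training.
from collections import defaultdict
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_split(smiles_list, frac_train=0.8, frac_valid=0.1):
    groups = defaultdict(list)
    for idx, smi in enumerate(smiles_list):
        scaffold = MurckoScaffold.MurckoScaffoldSmiles(smiles=smi)
        groups[scaffold].append(idx)
    # Fill train with the largest scaffold groups first (a common heuristic).
    train, valid, test = [], [], []
    n = len(smiles_list)
    for group in sorted(groups.values(), key=len, reverse=True):
        if len(train) + len(group) <= frac_train * n:
            train.extend(group)
        elif len(valid) + len(group) <= frac_valid * n:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test
```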
IV-A3 Baselines
In this section, we select baselines from the perspective of model fusion methods. We choose three representative fusion methods to compare with our proposed method: 1. Simple Fusion: Aggregation of unimodal encodings without training. 2. Contrastive Learning Fusion: A multimodal molecular-level fusion method. 3. DMP Fusion [8]: A fusion method that combines multimodal molecular-level learning with unimodal atomic-level learning. Since aggregation operations are independent of the encoder training method, they can be combined with contrastive learning fusion, DMP fusion, and our method for comparison.
Simple Fusion
Simple fusion integrates multimodal information by applying aggregation operations directly, without training. We consider two aggregation operations, sketched below: (a) Element-Wise Addition (EWA), which adds each corresponding element of two embeddings; and (b) Concatenation Operation (CCO), which concatenates two embeddings along a specified dimension to form a single, longer embedding.
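Both operations are one-liners over the frozen encoders' outputs, shown here for (batch, dim) tensors.

```python
# The two aggregation operations over unimodal embeddings.
import torch

def ewa(h_smiles, h_graph):
    # Element-Wise Addition: dimensions of the two embeddings must match.
    return h_smiles + h_graph

def cco(h_smiles, h_graph):
    # Concatenation: yields a single (batch, 2 * dim) embedding.
    return torch.cat([h_smiles, h_graph], dim=-1)
```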
Contrastive Learning
Contrastive learning [15] is a common method for molecular modality fusion. As a baseline, we use contrastive learning to align different molecular modalities. Modalities of the same molecule are treated as positive samples, while those of different molecules are treated as negative samples.
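A minimal sketch of this baseline as a symmetric InfoNCE loss in the style of CLIP [15]; the temperature value is an illustrative assumption.

```python
# Contrastive-learning baseline: same-molecule pairs are positives (diagonal),
# different-molecule pairs are negatives (off-diagonal).
import torch
import torch.nn.functional as F

def contrastive_loss(h_smiles, h_graph, temperature=0.07):
    h_smiles = F.normalize(h_smiles, dim=-1)
    h_graph = F.normalize(h_graph, dim=-1)
    logits = h_smiles @ h_graph.T / temperature
    labels = torch.arange(h_smiles.size(0), device=h_smiles.device)
    # Symmetric cross-entropy over both matching directions.
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
```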
DMP
DMP [8] is a typical work that integrates SMILES and molecule graphs, applying atomic-level masking and dual-view molecular-level contrastive learning to the two single modalities, respectively. Unfortunately, this approach trains two encoders from scratch, ignoring existing powerful single-modal encoders. As a result, it requires a large dataset of 110M training samples, about 4.4 times the size of ours, leading to unnecessary computational cost. Both DMP and our approach use a dual-tower structure; we replace the backbones in DMP with those selected for our method while retaining the DMP training approach.
IV-A4 Metrics
We use ROC-AUC to evaluate classification tasks and RMSE to evaluate regression tasks. For datasets containing multiple tasks, we calculate the average ROC-AUC score. Additionally, we employ an early-stopping mechanism based on the validation-set loss and report the average performance over three runs.
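The metrics can be computed with scikit-learn as follows; the per-task averaging mirrors the description above.

```python
# Evaluation metrics: task-averaged ROC-AUC for classification, RMSE for regression.
import numpy as np
from sklearn.metrics import roc_auc_score, mean_squared_error

def multitask_roc_auc(y_true, y_score):
    """y_true, y_score: (n_samples, n_tasks). Assumes each task has both classes."""
    return np.mean([roc_auc_score(y_true[:, t], y_score[:, t])
                    for t in range(y_true.shape[1])])

def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))
```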
IV-B Results
We validate our proposed fusion method on drug property prediction tasks, and the experimental results are shown in Table I. Our fusion method achieves significant performance improvements.
Unlike previous work [8, 9, 10], our method fully utilizes two encoders of different modalities. In prior work, only a single encoder was used in downstream tasks, which is insufficient for exploiting multimodal complementary information: with only one encoder, one modality can supplement the information of the other, but not the reverse. Therefore, our work investigates the complementary effect of information between the two molecular modalities. We evaluate the SMILES encoder only, the molecule graph encoder only, EWA, and CCO under the no-train, contrastive learning, DMP, and our proposed method conditions, as shown in Table I.

The results in Table I show that, among all the compared fusion methods, our approach achieves the most significant improvement when aggregating multimodal representations rather than relying on a single-modal representation. For example, on the BBBP dataset, compared to the best encoder-only performance, our method improves by 4.64% using aggregation operations, whereas the no-train condition drops by 1.60%, contrastive learning improves by 2.93%, and DMP shows a minor increase of 0.12%. This demonstrates the effectiveness of our method in leveraging complementary multimodal information; all other fusion methods are less effective to varying degrees. Comparing DMP with our method suggests that multimodal atomic-level methods integrate multimodal molecular information better than unimodal atomic-level methods.
IV-C Ablation
To better understand the roles of the two components in MolFusion, we conduct ablation experiments. We run the MolSim and AtomAlign components separately, and we also pair them with an atomic-level masked learning model and with molecular-level contrastive learning, respectively, to form multi-granularity variants. We select SIDER, BBBP, Tox21, and Toxcast as the ablation datasets and record the best result among the four aggregation operations as the result of each fusion method. The results demonstrate the effectiveness of the two components in our method.
As shown in Figure 5, our method is optimal among the many combinations of molecular-level and atomic-level methods. Neither the MolSim nor the AtomAlign component alone can effectively enable the model to learn the complementary information between molecular modalities. For instance, on the SIDER dataset, MolSim alone achieves a ROC-AUC of 54.23% and AtomAlign 50.21%, but combined in our method, the ROC-AUC improves to 56.09%. In addition, replacing the molecular-level component in our method with contrastive learning, or the atomic-level component with a single-modal masked learning model, degrades performance to varying degrees. For example, on the Tox21 dataset, replacing the molecular-level and atomic-level components with single-modal methods results in performance drops of 12.09% and 6.98%, respectively. These results confirm the rationality and effectiveness of the two components in MolFusion, showing that a multi-granularity fusion method specifically designed for multimodal data can more effectively learn complementary information between different modalities.

IV-D Visualization of Fusion Effects
To visually demonstrate the effectiveness of our proposed fusion method in integrating two molecular representations, we use t-SNE for dimensionality reduction visualization. We select 500 molecular vectors from the ZINC dataset and visualize them under three conditions: no-train, contrastive learning, and our method. The results validate the Complementary Information Assumption.
As shown in Figure 4, our method results in partially overlapping vector spaces, consistent with the Complementary Information Assumption. This leads to the most significant improvement when fusing representations of the two modalities.
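A minimal sketch of the visualization procedure follows; the perplexity and plot styling are illustrative choices rather than reported settings.

```python
# t-SNE of the two modalities' encodings for the same molecules, embedded jointly
# so that overlap between the modalities' vector spaces is visible.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_modalities(h_smiles, h_graph, title):
    """h_smiles, h_graph: (n, d) arrays of encodings for the same n molecules."""
    joint = TSNE(n_components=2, perplexity=30).fit_transform(
        np.concatenate([h_smiles, h_graph], axis=0))
    n = len(h_smiles)
    plt.scatter(joint[:n, 0], joint[:n, 1], s=5, label="SMILES encoder")
    plt.scatter(joint[n:, 0], joint[n:, 1], s=5, label="graph encoder")
    plt.legend(); plt.title(title); plt.show()
```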
V Conclusion
In conclusion, we propose MolFusion, a novel multimodal and multi-granularity fusion method. This method can better learn complementary information between different modalities through the molecular-level method MolSim and the atomic-level method AtomAlign. We achieve significant performance improvements on multiple classification and regression tasks in MoleculeNet. The effectiveness of our method in integrating different molecular modalities is verified through the ablation experiment, and the rationality of our method is further demonstrated through dimensionality reduction visualization.
VI References
- [1] J. Xia, Y. Zhu, Y. Du, and S. Z. Li, “A systematic survey of chemical pre-trained models,” arXiv preprint arXiv:2210.16484, 2022.
- [2] E. Gawehn, J. A. Hiss, and G. Schneider, “Deep learning in drug discovery,” Molecular informatics, vol. 35, no. 1, pp. 3–14, 2016.
- [3] G. B. Goh, N. O. Hodas, and A. Vishnu, “Deep learning for computational chemistry,” Journal of computational chemistry, vol. 38, no. 16, pp. 1291–1307, 2017.
- [4] Z. Zeng, Y. Yao, Z. Liu, and M. Sun, “A deep-learning system bridging molecule structure and biomedical text with comprehension comparable to human professionals,” Nature communications, vol. 13, no. 1, p. 862, 2022.
- [5] B. Su, D. Du, Z. Yang, Y. Zhou, J. Li, A. Rao, H. Sun, Z. Lu, and J.-R. Wen, “A molecular multimodal foundation model associating molecule graphs with natural language,” arXiv preprint arXiv:2209.05481, 2022.
- [6] J. Payne, M. Srouji, D. A. Yap, and V. Kosaraju, “Bert learns (and teaches) chemistry,” arXiv preprint arXiv:2007.16012, 2020.
- [7] Z. Zhang, Q. Liu, H. Wang, C. Lu, and C.-K. Lee, “Motif-based graph self-supervised learning for molecular property prediction,” Advances in Neural Information Processing Systems, vol. 34, pp. 15 870–15 882, 2021.
- [8] J. Zhu, Y. Xia, T. Qin, W. Zhou, H. Li, and T.-Y. Liu, “Dual-view molecule pre-training,” arXiv preprint arXiv:2106.10234, 2021.
- [9] Z. Hou, X. Liu, Y. Cen, Y. Dong, H. Yang, C. Wang, and J. Tang, “Graphmae: Self-supervised masked graph autoencoders,” in Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2022, pp. 594–604.
- [10] Y. Zhu, D. Chen, Y. Du, Y. Wang, Q. Liu, and S. Wu, “Featurizations matter: A multiview contrastive learning approach to molecular pretraining,” in ICML 2022 2nd AI for Science Workshop, 2022.
- [11] X. Lin, C. Xu, Z. Xiong, X. Zhang, N. Ni, B. Ni, J. Chang, R. Pan, Z. Wang, F. Yu et al., “Pangu drug model: learn a molecule like a human,” bioRxiv, pp. 2022–03, 2022.
- [12] Z. Wu, B. Ramsundar, E. N. Feinberg, J. Gomes, C. Geniesse, A. S. Pappu, K. Leswing, and V. Pande, “Moleculenet: a benchmark for molecular machine learning,” Chemical science, vol. 9, no. 2, pp. 513–530, 2018.
- [13] D. Weininger, “Smiles, a chemical language and information system. 1. introduction to methodology and encoding rules,” Journal of chemical information and computer sciences, vol. 28, no. 1, pp. 31–36, 1988.
- [14] Z. Li, M. Jiang, S. Wang, and S. Zhang, “Deep learning methods for molecular representation and property prediction,” Drug Discovery Today, vol. 27, no. 12, p. 103373, 2022.
- [15] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning. PMLR, 2021, pp. 8748–8763.
- [16] Y. Fang, Q. Zhang, N. Zhang, Z. Chen, X. Zhuang, X. Shao, X. Fan, and H. Chen, “Knowledge graph-enhanced molecular contrastive learning with functional prompt,” Nature Machine Intelligence, vol. 5, no. 5, pp. 542–553, 2023.
- [17] A. Fürst, E. Rumetshofer, J. Lehner, V. T. Tran, F. Tang, H. Ramsauer, D. Kreil, M. Kopp, G. Klambauer, A. Bitto et al., “Cloob: Modern hopfield networks with infoloob outperform clip,” Advances in neural information processing systems, vol. 35, pp. 20 450–20 468, 2022.
- [18] Y. Wang, J. Wang, Z. Cao, and A. Farimani, "Molclr: Molecular contrastive learning of representations via graph neural networks," arXiv preprint arXiv:2102.10056, 2021.
- [19] M. Sun, J. Xing, H. Wang, B. Chen, and J. Zhou, “Mocl: data-driven molecular fingerprint via knowledge-aware contrastive learning from molecular graph,” in Proceedings of the 27th ACM SIGKDD conference on knowledge discovery & data mining, 2021, pp. 3585–3594.
- [20] Y. You, T. Chen, Y. Sui, T. Chen, Z. Wang, and Y. Shen, “Graph contrastive learning with augmentations,” Advances in neural information processing systems, vol. 33, pp. 5812–5823, 2020.
- [21] Z. Wu, D. Jiang, J. Wang, X. Zhang, H. Du, L. Pan, C.-Y. Hsieh, D. Cao, and T. Hou, “Knowledge-based bert: a method to extract molecular features like computational chemists,” Briefings in Bioinformatics, vol. 23, no. 3, p. bbac131, 2022.
- [22] P. Li, J. Wang, Y. Qiao, H. Chen, Y. Yu, X. Yao, P. Gao, G. Xie, and S. Song, “An effective self-supervised framework for learning expressive molecular global representations to drug discovery,” Briefings in Bioinformatics, vol. 22, no. 6, p. bbab109, 2021.
- [23] Y. Fang, Q. Zhang, H. Yang, X. Zhuang, S. Deng, W. Zhang, M. Qin, Z. Chen, X. Fan, and H. Chen, “Molecular contrastive learning with chemical element knowledge graph,” in Proceedings of the AAAI conference on artificial intelligence, vol. 36, no. 4, 2022, pp. 3968–3976.
- [24] S. Wang, Y. Guo, Y. Wang, H. Sun, and J. Huang, “Smiles-bert: large scale unsupervised pre-training for molecular property prediction,” in Proceedings of the 10th ACM international conference on bioinformatics, computational biology and health informatics, 2019, pp. 429–436.
- [25] D. Xue, H. Zhang, D. Xiao, Y. Gong, G. Chuai, Y. Sun, H. Tian, H. Wu, Y. Li, and Q. Liu, “X-mol: large-scale pre-training for molecular understanding and diverse molecular analysis,” bioRxiv, pp. 2020–12, 2020.
- [26] Y. Rong, Y. Bian, T. Xu, W. Xie, Y. Wei, W. Huang, and J. Huang, “Self-supervised graph transformer on large-scale molecular data,” Advances in Neural Information Processing Systems, vol. 33, pp. 12 559–12 571, 2020.
- [27] S. Honda, S. Shi, and H. R. Ueda, “Smiles transformer: Pre-trained molecular fingerprint for low data drug discovery,” arXiv preprint arXiv:1911.04738, 2019.
- [28] S. Chithrananda, G. Grand, and B. Ramsundar, “Chemberta: large-scale self-supervised pretraining for molecular property prediction,” arXiv preprint arXiv:2010.09885, 2020.
- [29] X.-C. Zhang, C.-K. Wu, Z.-J. Yang, Z.-X. Wu, J.-C. Yi, C.-Y. Hsieh, T.-J. Hou, and D.-S. Cao, “Mg-bert: leveraging unsupervised atomic representation learning for molecular property prediction,” Briefings in bioinformatics, vol. 22, no. 6, p. bbab152, 2021.
- [30] H. Kim, J. Lee, S. Ahn, and J. R. Lee, “A merged molecular representation learning for molecular properties prediction with a web-based service,” Scientific Reports, vol. 11, no. 1, p. 11028, 2021.
- [31] W. Hu, B. Liu, J. Gomes, M. Zitnik, P. Liang, V. Pande, and J. Leskovec, “Strategies for pre-training graph neural networks,” arXiv preprint arXiv:1905.12265, 2019.
- [32] H. Li, R. Zhang, Y. Min, D. Ma, D. Zhao, and J. Zeng, “A knowledge-guided pre-training framework for improving molecular representation learning,” Nature Communications, vol. 14, no. 1, p. 7568, 2023.
- [33] J. Xia, C. Zhao, B. Hu, Z. Gao, C. Tan, Y. Liu, S. Li, and S. Z. Li, “Mole-bert: Rethinking pre-training graph neural networks for molecules,” in The Eleventh International Conference on Learning Representations, 2022.
- [34] F.-Y. Sun, J. Hoffmann, V. Verma, and J. Tang, “Infograph: Unsupervised and semi-supervised graph-level representation learning via mutual information maximization,” arXiv preprint arXiv:1908.01000, 2019.
- [35] J. Jiang, R. Zhang, Z. Zhao, J. Ma, Y. Liu, Y. Yuan, and B. Niu, “Multigran-smiles: multi-granularity smiles learning for molecular property prediction,” Bioinformatics, vol. 38, no. 19, pp. 4573–4580, 2022.
- [36] M. Krenn, F. Häse, A. Nigam, P. Friederich, and A. Aspuru-Guzik, “Self-referencing embedded strings (selfies): A 100% robust molecular string representation,” Machine Learning: Science and Technology, vol. 1, no. 4, p. 045024, 2020.
- [37] N. Janakarajan, T. Erdmann, S. Swaminathan, T. Laino, and J. Born, “Language models in molecular discovery,” arXiv preprint arXiv:2309.16235, 2023.
- [38] J. Xia, L. Wu, G. Wang, J. Chen, and S. Z. Li, “Progcl: Rethinking hard negative mining in graph contrastive learning,” arXiv preprint arXiv:2110.02027, 2021.
- [39] Z. Wang, Z. Wu, D. Agarwal, and J. Sun, “Medclip: Contrastive learning from unpaired medical images and text,” arXiv preprint arXiv:2210.10163, 2022.
- [40] D. Rogers and M. Hahn, “Extended-connectivity fingerprints,” Journal of chemical information and modeling, vol. 50, no. 5, pp. 742–754, 2010.
- [41] M. Vogt and J. Bajorath, “Modeling tanimoto similarity value distributions and predicting search results,” Molecular Informatics, vol. 36, no. 7, p. 1600131, 2017.
- [42] G. Landrum, “Rdkit documentation,” Release, vol. 1, no. 1-79, p. 4, 2013.
- [43] J. J. Irwin, T. Sterling, M. M. Mysinger, E. S. Bolstad, and R. G. Coleman, “Zinc: a free tool to discover chemistry for biology,” Journal of chemical information and modeling, vol. 52, no. 7, pp. 1757–1768, 2012.
- [44] T. Sterling and J. J. Irwin, “Zinc 15–ligand discovery for everyone,” Journal of chemical information and modeling, vol. 55, no. 11, pp. 2324–2337, 2015.
- [45] A. Gaulton, L. J. Bellis, A. P. Bento, J. Chambers, M. Davies, A. Hersey, Y. Light, S. McGlinchey, D. Michalovich, B. Al-Lazikani et al., “Chembl: a large-scale bioactivity database for drug discovery,” Nucleic acids research, vol. 40, no. D1, pp. D1100–D1107, 2012.
- [46] Y. Hu, D. Stumpfe, and J. Bajorath, “Computational exploration of molecular scaffolds in medicinal chemistry: Miniperspective,” Journal of medicinal chemistry, vol. 59, no. 9, pp. 4062–4076, 2016.