
UniMAP: Universal SMILES-Graph Representation Learning

Shikun Feng1, Lixin Yang2, Yanwen Huang3, Yuyan Ni4, Wei-Ying Ma1, Yanyan Lan1
1Institute for AI Industry Research, Tsinghua University
2Renmin University of China
3Department of Pharmaceutical Science, Peking University
4Academy of Mathematics and Systems Science, Chinese Academy of Sciences
Correspondence to [email protected].
Abstract

Molecular representation learning is fundamental for many drug-related applications. Most existing molecular pre-training models are limited to a single molecular modality, either SMILES or graph representation. To effectively leverage both modalities, we argue that it is critical to capture the fine-grained ‘semantics’ between SMILES and graph, because subtle sequence/graph differences may lead to contrary molecular properties. In this paper, we propose a universal SMILES-graph representation learning model, namely UniMAP. Firstly, an embedding layer is employed to obtain the token and node/edge representations in SMILES and graph, respectively. A multi-layer Transformer is then utilized to conduct deep cross-modality fusion. Specifically, four kinds of pre-training tasks are designed for UniMAP, including Multi-Level Cross-Modality Masking (CMM), SMILES-Graph Matching (SGM), Fragment-Level Alignment (FLA), and Domain Knowledge Learning (DKL). In this way, both global (i.e. SGM and DKL) and local (i.e. CMM and FLA) alignments are integrated to achieve comprehensive cross-modality fusion. We evaluate UniMAP on various downstream tasks, i.e. molecular property prediction, drug-target affinity prediction, and drug-drug interaction. Experimental results show that UniMAP outperforms current state-of-the-art pre-training methods. We also visualize the learned representations to demonstrate the effectiveness of multi-modality integration.

1 Introduction

Machine learning has been applied to a wide range of tasks in cheminformatics and bioinformatics, including molecular property prediction [1], molecular generation [2], and virtual screening [3]. Due to the scarcity of labeled data and the limitations of hand-engineered features, representation learning has become an increasingly important replacement for conventional methods, such as manually designed fingerprints [4].

Inspired by the success of the pre-training and fine-tuning paradigm, pre-trained models on molecular data have demonstrated promising results. According to data format, these models can be mainly classified into two categories: SMILES based and graph based methods. SMILES based methods, taking the SMILES [5] string as input, have been facilitated by various sequence modeling methods in NLP such as masked language modeling (MLM) [6] and auto-encoders [7]. Despite their ease of use, such methods treat molecules as 1D sequences and may fail to capture the geometric structure of a molecule [8]. Graph based methods, in contrast, take the molecular graph as input and mainly leverage graph-level self-supervised training, such as atom/edge/subgraph masking [9] and contrastive learning [10, 11]. These methods directly utilize the graph to represent the 2D structure, but may be limited by the over-smoothing [12] and over-squashing [13] problems of graph neural networks (GNNs).

Given these limitations, a natural idea is to combine the SMILES string and the molecular graph in one model to enjoy the merits of both approaches. Recently, MOCO [14] designed a two-stream model to fuse SMILES and graph. Specifically, SMILES and graph are treated as two molecular views, encoded by a Transformer [15] branch and a GNN branch, respectively. A consistency loss is then applied to the two molecular-level representations for pre-training.

Figure 1: Two examples to demonstrate that substitution of a single fragment will lead to opposite properties in HIV dataset from MoleculeNet [1]. Specifically, the green parts in graph and the underlined parts in SMILES indicate the differences between the two molecules. A molecule is labeled as HIV active if it can inhibit HIV replication and HIV inactive otherwise.

We argue that the above global-level alignment cannot well model the correspondence between SMILES and graph. Firstly, SMILES, as a language representation, provides a detailed description of a molecular graph, including atoms, bonds, rings, branching, stereochemistry, and isotopes. As a result, global-level alignment as in MOCO may fail to recognize subtle differences between molecules and cause severe prediction errors. For example, the substitution of a single fragment can lead to drug deactivation in virtual screening. We give two example pairs in Figure 1 for demonstration. In Pair 1, we replace 4-methylbenzyl (SMILES: c1ccc(C)cc1) with 2-hydroxymethylfuran (SMILES: c1ccc(CO)o1), and find that the modified compound no longer inhibits HIV replication. Similarly in Pair 2, if we replace just a single atom, i.e. thiol (SMILES: S) with chloride (SMILES: Cl), the same result is observed. We have tested previous models on this specific case, including the representative SMILES based method ChemBERTa [6] and the graph based method GROVER [8]. The similarity scores of both ChemBERTa and GROVER are larger than 0.8 for each pair, while UniMAP obtains much smaller similarity scores, i.e. 0.55 for Pair 1 and 0.74 for Pair 2. Since the code of MOCO is not released and we cannot reproduce the results in their paper, we do not report MOCO results on this case. However, given the similarity scores from ChemBERTa and GROVER, a single consistency loss is unlikely to introduce much difference.

From the above analysis, it is critical to capture the fine-grained ‘semantics’ between SMILES and graph. To this end, we adopt fragments as the basic unit, and propose a multi-modality molecular pre-training model, UniMAP, to obtain a UNIversal sMiles-grAPh representation for molecules. Fragments play a very important role in computational pharmaceutical research [16, 17]. Specifically, molecular fragments are chemically feasible substructures, which largely determine molecular biological activities and even influence the affinity between a molecule and its target proteins. Therefore, fragments are meaningful substructures to reflect the fine-grained ‘semantics’ between SMILES and graph.

To extract meaningful fragments, we first employ the BRICS algorithm [18] to split a molecular graph into different fragments. Then we design a SMILES-graph fragment decomposition algorithm to obtain the corresponding SMILES fragment for each graph fragment. In this way, we can use an embedding layer to obtain both token and fragment representations for a molecule, which act as heterogeneous inputs to facilitate different pre-training tasks. Afterward, a shared Transformer backbone is utilized to conduct deep cross-modality fusion, and outputs the unified representation for fine-tuning. As for pre-training, besides two molecular-level tasks, SMILES-Graph Matching (SGM) and Domain Knowledge Learning (DKL), we introduce two novel fragment-level tasks to achieve fine-grained cross-modality interactions, i.e. Multi-Level Cross-Modality Masking (CMM) and Fragment-Level Alignment (FLA). Using both fragments and tokens as masking units, CMM forces the model to acquire both single- and cross-modality information, while FLA uses the fragment embeddings from different modalities as the contrastive learning unit to capture fine-grained cross-modality semantics. In this way, UniMAP learns a better molecular representation by leveraging the fine-grained semantic correlation between SMILES and graph.

We pre-train UniMAP on 10 million molecules from Pubchem [19], and then fine-tune the pre-trained model on various downstream tasks. We achieve state-of-the-art results on molecular property prediction tasks from MoleculeNet and outperform existing methods on drug-target affinity prediction and drug-drug interaction tasks. Our further visualization analysis shows typical fragment alignment patterns and clear embedding clusters, demonstrating the ability of UniMAP to capture both fragment-level and molecular-level ‘semantics’.

Our main contributions are threefold:

  • We propose UniMAP, the first single-stream multi-modality molecular representation learning method to leverage both SMILES and graph representation;

  • We design both molecular-level (SGM and DKL) and fragment-level (CMM and FLA) pre-training tasks to achieve deep cross-modality fusion in UniMAP;

  • We conduct experiments on downstream tasks including molecular property prediction, drug-target affinity prediction, and drug-drug interaction prediction, and demonstrate promising results of UniMAP.

2 Method

In this section, we introduce our proposed UniMAP, as shown in Figure 2, including an input embedding layer, a Transformer encoder, and four pre-training tasks.

Figure 2: Overview of UniMAP. SMILES and Graph are processed by a shared Transformer to get a unified representation, which is supervised by both fragment-level and molecular-level pre-training tasks.

2.1 Embedding Layer

The embedding layer is designed to map the given SMILES and graph of a molecule to embeddings for further computation. For an input SMILES, we adopt a regex-based tokenizer from DeepChem [20] to parse it into a series of tokens $S=[t_1,t_2,\ldots,t_n]$, and then apply an embedding layer to obtain the SMILES embeddings $\boldsymbol{s}=[s_1,s_2,\ldots,s_n]$, where $s_i$ is the embedding of token $i$ with size $D$. An input graph is denoted as $G=\{V,E\}$, where $V$ is the vertex set, with $v_i\in V$ denoting the $i$-th atom, and $E$ is the edge set, with $e_{ij}\in E$ denoting an edge between the $i$-th and the $j$-th atom. We adopt a GCN [21] to obtain the graph embeddings $\boldsymbol{g}=[g_1,g_2,\ldots,g_m]$, where $m=|V|$ is the number of atoms and $g_i$ is the $D$-dimensional vector for the $i$-th atom.
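For concreteness, below is a minimal sketch of the tokenization and embedding step. The regex is a simplified variant of the DeepChem/ChemBERTa-style SMILES pattern (the production pattern is more exhaustive), and the vocabulary dict is a hypothetical stand-in:

```python
import re
import torch
import torch.nn as nn

# Simplified SMILES token pattern in the spirit of DeepChem's regex
# tokenizer (the exact production pattern is more exhaustive).
SMILES_REGEX = re.compile(
    r"(\[[^\]]+\]|Br|Cl|@@|@|%\d{2}|[BCNOPSFIbcnops]|\d|[=#\-\+\(\)\./\\:~\*\$])"
)

def tokenize_smiles(smiles: str):
    """Split a SMILES string into tokens, e.g. 'c1ccc(CO)o1' ->
    ['c', '1', 'c', 'c', 'c', '(', 'C', 'O', ')', 'o', '1']."""
    return SMILES_REGEX.findall(smiles)

class SmilesEmbedding(nn.Module):
    """Token-ID lookup followed by a D-dimensional embedding table."""
    def __init__(self, vocab: dict, dim: int = 768):
        super().__init__()
        self.vocab = vocab  # token -> integer id (hypothetical mapping)
        self.emb = nn.Embedding(len(vocab), dim)

    def forward(self, smiles: str) -> torch.Tensor:
        ids = torch.tensor([self.vocab[t] for t in tokenize_smiles(smiles)])
        return self.emb(ids)  # shape (n_tokens, D), i.e. s = [s_1, ..., s_n]
```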

2.2 Transformer Encoder

We adopt a Transformer based encoder, denoted as $\boldsymbol{\theta}$, to handle both SMILES and graph embeddings. A typical Transformer takes a sequence of embeddings as input. We add learnable position embeddings, denoted as $p_{\boldsymbol{s}}$, to the SMILES embeddings $\boldsymbol{s}$ to maintain positional information. As for the molecular graph, in which atoms are not organized in a sequence, we simply concatenate the graph embeddings $\boldsymbol{g}$ after the SMILES token embeddings to obtain the input of the Transformer, denoted as $\boldsymbol{z}$:

$\boldsymbol{z}=[\boldsymbol{s}+p_{\boldsymbol{s}},\boldsymbol{g}]$. (1)

The Transformer encoder is composed of a stack of identical blocks, where each block consists of two parts: a multi-head self-attention (MSA) module and a feed-forward neural network (FFN). Specifically, the self-attention mechanism in MSA allows cross-modality attention and conducts multi-modality interaction, including both inter- and intra-modality fusion.

The encoder output $\boldsymbol{x}=\boldsymbol{\theta}(\boldsymbol{z})$ is then obtained for each molecule, where $\boldsymbol{x}=(x_1,x_2,\ldots,x_n,x_{n+1},\ldots,x_{n+m})$, $x_i\in\mathbb{R}^{D}$, in which the first $n$ elements and the last $m$ elements of $\boldsymbol{x}$ stand for the SMILES and graph representations, respectively.

Afterward, we apply a typical average pooling operation [22] on $\boldsymbol{x}$ to get the final molecular representation, denoted as $x_{cls}$:

$x_{cls}=\frac{1}{m+n}\sum_{i=1}^{m+n}x_i$. (2)

Based on $x_{cls}$, different losses can be designed to facilitate the pre-training process. Considering the characteristics of the SMILES-graph relation, we design both local (i.e. fragment-level) and global (i.e. molecular-level) losses.
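The following sketch shows how Eqs. (1)-(2) compose in code for a single (unbatched) molecule. The paper uses a 12-layer RoBERTa backbone; a vanilla PyTorch TransformerEncoder stands in here, so treat the layer settings as illustrative:

```python
import torch
import torch.nn as nn

class UnifiedEncoder(nn.Module):
    """Single-molecule version of Eqs. (1)-(2): position embeddings are
    added to the SMILES token embeddings only, graph atom embeddings are
    appended, and the fused output is average-pooled into x_cls."""
    def __init__(self, dim: int = 768, n_layers: int = 12, max_len: int = 512):
        super().__init__()
        self.pos = nn.Embedding(max_len, dim)  # learnable p_s
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=12,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, s: torch.Tensor, g: torch.Tensor):
        # s: (n, D) SMILES token embeddings; g: (m, D) graph atom embeddings
        n = s.size(0)
        z = torch.cat([s + self.pos(torch.arange(n)), g], dim=0)  # Eq. (1)
        x = self.encoder(z.unsqueeze(0)).squeeze(0)  # (n + m, D)
        x_cls = x.mean(dim=0)                        # Eq. (2)
        return x, x_cls
```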

2.3 SMILES-Graph Fragment Decomposition

Our SMILES-graph fragment decomposition algorithm first applies a mature method to decompose the graph into fragments, and then assigns each SMILES character to the corresponding fragment based on the SMILES definition.

Specifically, we use the BRICS [18] algorithm to split the molecular graph into fragments, obtaining $K_j$ fragments for the $j$-th molecule (denoted as $K$ for convenience). The fragment assignment of graph nodes can be represented as a vector $\boldsymbol{l}_g=[l_1^{(g)},l_2^{(g)},\ldots,l_m^{(g)}]$, where $l_i^{(g)}$ stands for the fragment ID of the $i$-th atom, with values from 0 to $K-1$.

Since SMILES contains not only atoms but also special symbols carrying chemical meaning, we define 4 rules as follows to label the special symbols, based on their meaning in the SMILES grammar [5]. Each character in $\{., -, =, \#, \$, :, /, \backslash\}$ is assigned the same label as its nearest atom to the left, because it denotes the type of the bond starting from that atom. The number after the character ‘C’ is assigned the same label as the character ‘C’, because the number usually indicates the beginning or the end of a ring structure. The special sub-strings ‘@@’ and ‘@’ are assigned the same fragment label as the carbon they decorate. Parentheses and brackets are assigned the same label as the nearest surrounding atom, because the contents inside them are usually in the same fragment. After label assignment, the labels of SMILES tokens are denoted as $\boldsymbol{l}_s=[l_1^{(s)},l_2^{(s)},\ldots,l_n^{(s)}]$, where $l_i^{(s)}$ stands for the fragment ID of the $i$-th token, with values from 0 to $K-1$.
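As an illustration of the decomposition, the sketch below derives per-atom fragment IDs with RDKit's BRICS utilities and then propagates them to SMILES tokens. It implements only the "inherit from the nearest preceding atom" behaviour, so the parentheses/brackets rule is simplified relative to the paper's full algorithm:

```python
from rdkit import Chem
from rdkit.Chem import BRICS

def brics_atom_labels(smiles: str):
    """Fragment ID (0..K-1) for every atom, obtained by cutting the
    molecule at its BRICS bonds."""
    mol = Chem.MolFromSmiles(smiles)
    bond_ids = [mol.GetBondBetweenAtoms(a1, a2).GetIdx()
                for (a1, a2), _ in BRICS.FindBRICSBonds(mol)]
    if bond_ids:
        mol = Chem.FragmentOnBonds(mol, bond_ids, addDummies=False)
    labels = [0] * mol.GetNumAtoms()
    for frag_id, atom_ids in enumerate(Chem.GetMolFrags(mol)):
        for a in atom_ids:
            labels[a] = frag_id
    return labels  # the vector l_g in the text

def propagate_to_tokens(tokens, atom_labels):
    """Simplified SMILES-side labeling: atom tokens take their own label;
    every other symbol inherits from the nearest preceding atom."""
    labels, atom_ptr, last = [], 0, 0
    for t in tokens:
        if t[0].isalpha() or t.startswith('['):  # an atom token
            last = atom_labels[atom_ptr]
            atom_ptr += 1
        labels.append(last)
    return labels  # the vector l_s in the text
```

This works because RDKit assigns atom indices in the order atoms appear in the input SMILES, so the token stream and the atom labels stay in step.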

2.4 Multi-Level Cross-Modality Masking

We design two kinds of cross-modality masked language modeling methods in this paper, i.e. token-level masking and fragment-level cross-modality masking. For fragment-level masking, the goal is similar to the previous conditional masking strategy [23], i.e. using the surrounding contexts and the full information from the other modality to recover the masked SMILES or graph fragments. For tokens, however, we define the goal as using only the surrounding context of the single modality itself to predict the masked SMILES tokens or graph atoms, to emphasize intra-modality patterns. In this way, both inter- and intra-modality patterns are considered. This is reasonable because tokens are smaller than fragments, and their context information within the same modality can be sufficient for MLM. A comparison of these masking strategies is conducted in the ablation study.

Firstly, we introduce token-level masking. The input tokens of the SMILES $S$ and the atoms of the graph $G$ are masked independently with the same ratio $r_t$. Specifically, for the SMILES, we randomly select $|S|\cdot r_t$ tokens from $S$ to mask and predict the type of the masked tokens. As for the graph, we randomly select $|V|\cdot r_t$ atoms, set the initial features of both the selected atoms and their adjacent edges to pre-defined masked ones, and then apply the GNN to the masked graph to obtain node embeddings. We follow the practice in GROVER to predict the masked atoms' contextual information, and the total loss is defined as follows.

$\mathcal{L}_{t}=-\left(\sum_{\forall S}\log P_{\boldsymbol{\theta}}(\boldsymbol{s_m}|S_{\backslash\boldsymbol{m}})+\sum_{\forall G}\log P_{\boldsymbol{\theta}}(\boldsymbol{g_m}|G_{\backslash\boldsymbol{m}})\right)$, (3)

where $\boldsymbol{s_m}$ and $\boldsymbol{g_m}$ stand for the masked SMILES tokens and the masked atoms and edges, respectively, and $S_{\backslash\boldsymbol{m}}$ and $G_{\backslash\boldsymbol{m}}$ stand for the surrounding contexts in the SMILES $S$ and graph $G$.

For a given SMILES-graph pair $(S,G)$, we introduce fragment-level cross-modality masking (f-CMM). In this approach, we randomly select $r_f\cdot K$ fragments and then randomly choose one modality, with a probability of 0.5, in which to mask the selected fragments. The data processing and prediction targets are the same as in token-level masking, except that fragments are taken as the masking unit. Thus the loss of f-CMM can be written as:

$\mathcal{L}_{f}=-\sum_{\forall(S,G)}\left(\log P_{\boldsymbol{\theta}}(\boldsymbol{s_m}|S_{\backslash\boldsymbol{m}},G)+\log P_{\boldsymbol{\theta}}(\boldsymbol{g_m}|S,G_{\backslash\boldsymbol{m}})\right)$, (4)

where $\boldsymbol{s_m}$ and $\boldsymbol{g_m}$ stand for the masked SMILES and graph fragments, respectively, and $S_{\backslash\boldsymbol{m}}$ and $G_{\backslash\boldsymbol{m}}$ stand for the surrounding contexts in the SMILES $S$ and graph $G$.

The overall multi-level cross-modality masking loss, denoted as $\mathcal{L}_{CMM}$, is the sum of $\mathcal{L}_{t}$ and $\mathcal{L}_{f}$:

$\mathcal{L}_{CMM}=\mathcal{L}_{t}+\mathcal{L}_{f}$. (5)
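A sketch of how f-CMM mask positions can be sampled from the fragment label vectors of Section 2.3; function and field names are hypothetical:

```python
import random

def sample_fcmm_masks(l_s, l_g, num_fragments, r_f=0.6):
    """Pick r_f * K fragments, choose one modality with probability 0.5,
    and return the token/atom positions to mask in that modality only,
    leaving the other modality fully visible for conditioning."""
    k = max(1, int(r_f * num_fragments))
    chosen = set(random.sample(range(num_fragments), k))
    if random.random() < 0.5:  # mask the SMILES side
        return {"smiles": [i for i, l in enumerate(l_s) if l in chosen],
                "graph": []}
    # otherwise mask the graph side
    return {"smiles": [],
            "graph": [i for i, l in enumerate(l_g) if l in chosen]}
```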

2.5 Fragment-Level Alignment

Fragment-level alignment (FLA) is a fine-grained cross-modality alignment strategy. The basic idea is to obtain corresponding fragment representations in a SMILES and graph pair, and then align them by a contrastive loss.

For each fragment $i=0,\ldots,K-1$, the SMILES representation $f_i^{(s)}$ is obtained by a multi-head attention layer with average pooling to aggregate the associated token embeddings, and the graph representation $f_i^{(g)}$ is directly obtained by aggregating the associated atom embeddings with average pooling. Clearly, $f_i^{(s)}$ and $f_i^{(g)}$ form a positive pair because they stand for the same fragment in different modalities. The negative pairs are constructed as follows: for each fragment $f_i^{(s)}$, the negative graph fragment set $N_s$ includes both the other fragments from the same graph and all fragments from other graphs; similarly, we obtain the negative SMILES fragment set $N_g$ for each fragment $f_i^{(g)}$.

Afterward, we define the fragment-level contrastive learning loss as the sum of a SMILES-specific and a graph-specific contrastive loss, i.e. $\mathcal{L}_{FLA}=\mathcal{L}_{s}+\mathcal{L}_{g}$, with

$\mathcal{L}_{s}=-\sum_{i}\log\frac{e^{\cos(f_{i}^{(s)},f_{i}^{(g)})/\tau}}{e^{\cos(f_{i}^{(s)},f_{i}^{(g)})/\tau}+\sum_{j\in N_{s}}e^{\cos(f_{i}^{(s)},f_{j}^{(g)})/\tau}}$, (6)
$\mathcal{L}_{g}=-\sum_{i}\log\frac{e^{\cos(f_{i}^{(g)},f_{i}^{(s)})/\tau}}{e^{\cos(f_{i}^{(g)},f_{i}^{(s)})/\tau}+\sum_{j\in N_{g}}e^{\cos(f_{i}^{(g)},f_{j}^{(s)})/\tau}}$, (7)

where $\cos(\cdot,\cdot)$ denotes the cosine similarity function and $\tau$ is the temperature hyper-parameter.
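A PyTorch sketch of Eqs. (6)-(7). For brevity it uses only in-batch negatives (every non-matching fragment in the batch), which covers the cross-molecule part of $N_s$ and $N_g$:

```python
import torch
import torch.nn.functional as F

def fla_loss(f_s: torch.Tensor, f_g: torch.Tensor, tau: float = 0.05):
    """Symmetric fragment-level InfoNCE. f_s, f_g: (N, D) SMILES/graph
    fragment embeddings; row i of each side is the same fragment, and
    every other row serves as a negative."""
    # (N, N) matrix of temperature-scaled cosine similarities
    sim = F.cosine_similarity(f_s.unsqueeze(1), f_g.unsqueeze(0), dim=-1) / tau
    targets = torch.arange(f_s.size(0))
    loss_s = F.cross_entropy(sim, targets)      # SMILES -> graph, Eq. (6)
    loss_g = F.cross_entropy(sim.t(), targets)  # graph -> SMILES, Eq. (7)
    return loss_s + loss_g
```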

2.6 SMILES-Graph Matching

SMILES-Graph Matching is designed to reflect molecular-level cross-modality alignment. Specifically, for each positive SMILES-graph pair $(S,G)$, we construct a negative pair $(S,G^{\prime})$ by randomly replacing the graph $G$ with another molecular graph $G^{\prime}$ from the training set. The corresponding molecular representations $x_{cls}$ and $x_{cls}^{\prime}$ are then fed to an MLP head $\boldsymbol{\theta}_{sg}$ to predict the corresponding binary labels 1 and 0, with the following loss.

$\mathcal{L}_{SGM}=-\left(\sum_{\forall(S,G)}\log P_{\boldsymbol{\theta}_{sg}}(x_{cls})+\sum_{\forall(S,G^{\prime})}\log(1-P_{\boldsymbol{\theta}_{sg}}(x_{cls}^{\prime}))\right)$. (8)
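A sketch of the matching loss for a batch, assuming the pooled representations for positive and corrupted pairs have already been computed; `head` is the MLP $\boldsymbol{\theta}_{sg}$ producing one logit per pair:

```python
import torch
import torch.nn.functional as F

def sgm_loss(head, x_cls_pos: torch.Tensor, x_cls_neg: torch.Tensor):
    """Eq. (8) as a binary matching loss. x_cls_pos: (B, D) pooled
    representations of true (S, G) pairs; x_cls_neg: (B, D) pooled
    representations of corrupted (S, G') pairs, e.g. obtained by rolling
    the graphs one position within the batch."""
    logits = torch.cat([head(x_cls_pos), head(x_cls_neg)]).squeeze(-1)
    labels = torch.cat([torch.ones(len(x_cls_pos)),
                        torch.zeros(len(x_cls_neg))])
    return F.binary_cross_entropy_with_logits(logits, labels)
```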

2.7 Domain Knowledge Learning

We also design two pre-training losses to learn important chemical domain knowledge. Specifically, an MLP head is added to predict the extended-connectivity fingerprint [24], which encodes information about molecular topological structure and activity; the Mean Squared Error (MSE) loss is used for optimization. Furthermore, we add a classification layer to predict the functional groups in a molecule, as in GROVER. The domain knowledge learning loss is defined as follows:

$\mathcal{L}_{DKL}=\sum_{\forall(S,G)}\|\boldsymbol{\theta}_{fp}(x_{cls})-y_{fp}\|_{2}^{2}-\sum_{\forall(S,G)}\log P_{\boldsymbol{\theta}_{fg}}(y_{fg}|x_{cls})$, (9)

where $\boldsymbol{\theta}_{fp}$ and $\boldsymbol{\theta}_{fg}$ represent the corresponding MLP heads, while $y_{fp}$ and $y_{fg}$ denote the fingerprint and functional group labels, respectively.
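The regression target $y_{fp}$ can be built with RDKit; the radius and bit size below are common ECFP defaults, not necessarily the paper's exact settings, and the functional-group labels $y_{fg}$ (as in GROVER) are omitted:

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def ecfp_target(smiles: str, radius: int = 2, n_bits: int = 2048):
    """Extended-connectivity fingerprint used as the regression target
    y_fp in Eq. (9)."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    arr = np.zeros(n_bits)
    DataStructs.ConvertToNumpyArray(fp, arr)  # bit vector -> numpy array
    return arr.astype(np.float32)
```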

Finally, we set the weight of each loss to 1, since we observed stable optimization with this combination. The overall pre-training loss $\mathcal{L}$ is thus written as:

$\mathcal{L}=\mathcal{L}_{CMM}+\mathcal{L}_{FLA}+\mathcal{L}_{SGM}+\mathcal{L}_{DKL}$, (10)

where $\mathcal{L}_{DKL}$ denotes the domain knowledge learning loss defined in Eq. (9).

3 Experiments

This section introduces our experimental evaluation of UniMAP on three different kinds of downstream tasks, including molecular property prediction, drug-target affinity prediction, and drug-drug interaction (DDI) prediction. The results for DDI are detailed in Appendix E.

For pre-training, we select about 10 million molecules from PubChem, a widely used and publicly accessible chemical database that contains about 114 million compounds, most of which are small molecules. In the data preprocessing phase, SMILES are canonicalized and converted to molecular graphs by RDKit [25]. In the training phase, we adopt the RoBERTa model as the multi-modality encoder $\boldsymbol{\theta}$ and utilize Adam for optimization. Specifically, we use the RoBERTa base model as the shared Transformer, which contains 12 layers with a hidden embedding size of 768. For optimization, the initial learning rate is set to 0.00005 and is then decreased by a linear decay scheduler. The mask ratios of token-level masking and fragment-level cross-modality masking are set to 0.2 and 0.6, respectively, and the temperature $\tau$ of contrastive learning in fragment-level alignment is set to 0.05. Since we have multiple pre-training losses, the model is trained in a multi-task learning manner, with the weight of each pre-training task equal to 1. We train our model for 40 epochs on a machine with 8 V100 GPUs, with a batch size of 64 per GPU. In total, it takes about 10 days to finish the training process, and half-precision training is utilized to save GPU memory.

Table 1: Results (mean and std) for the 8 classification tasks from the MoleculeNet benchmark; we report ROC-AUC (%). The best and second best results are marked in bold and underlined, respectively. ‘-’ denotes that the original results are not provided.
Method BBBP Sider ClinTox Bace Tox21 Toxcast HIV MUV Average
ChemBERTa 64.3(-) - 73.3(-) - 72.8(-) - 62.2(-) - 68.1
SMILES Transformer 70.4(-) - - 70.1(-) - - 72.9(-) - 71.1
KCL 69.9(0.6) 63.6(0.8) 58.8(1.9) 80.2(1.8) 70.6(0.5) 63.8(0.1) 75.7(0.6) 70.4(0.9) 69.1
GROVER(base) 71.9(0.1) 65.6(0.9) 81.7(2.5) 83.3(0.1) 73.6(0.3) 66.1(0.1) 75.0(0.2) 76.7(0.2) 74.2
GROVER(large) 71.8(0.5) 66.8(0.1) 73.9(5.6) 83.5(0.1) 73.4(0.1) 65.9(0.2) 73.1(1.7) 72.1(5.5) 72.6
AttrMasking 64.3(2.8) 61.0(0.7) 71.8(4.1) 79.3(1.6) 76.7(0.4) 64.2(0.5) 77.2(1.1) 74.7(1.4) 71.2
ContextPred 68.0(2.0) 60.9(0.6) 65.9(3.8) 79.6(1.2) 75.7(0.7) 63.9(0.6) 77.3(1.0) 75.8(1.7) 70.9
GraphLoG 72.5(0.8) 61.2(1.1) 76.7(3.3) 83.5(1.2) 75.7(0.5) 63.5(0.7) 77.8(0.8) 76.0(1.1) 73.4
GraphMAE 72.0(0.6) 60.3(1.1) 82.3(1.2) 83.1(0.9) 75.5(0.6) 64.1(0.3) 77.2(1.0) 76.3(2.4) 73.8
MGSSL 70.5(1.1) 60.5(0.7) 79.7(2.2) 79.7(0.8) 76.4(0.4) 63.8(0.3) 79.5(1.1) 78.1(1.8) 73.5
Mole-BERT 72.3(0.7) 62.2(0.8) 80.1(3.6) 81.4(1.0) 77.1(0.4) 64.5(0.4) 78.6(0.7) 78.3(1.2) 74.3
GraphMVP 72.4(1.6) 63.9(1.2) 79.1(2.8) 81.2(0.9) 75.9(0.5) 63.1(0.4) 77.0(1.2) 77.7(0.6) 73.8
3D InfoMax 69.1(1.1) 53.4(3.3) 59.4(3.2) 79.4(1.9) 74.5(0.7) 64.4(0.9) - 76.1(1.3) 68.0
MOCO 71.6(1.0) 61.2(0.6) 81.6(3.7) 82.6(0.3) 76.7(0.4) 64.9(0.8) 78.3(0.4) 78.5(1.4) 74.4
UniMAP 75.6(1.8) 66.6(0.8) 98.3(1.7) 83.6(1.0) 77.2(0.9) 67.0(0.4) 80.1(0.7) 78.9(1.4) 78.4

3.1 Molecular Property Prediction

Data. We experiment on MoleculeNet [1], a widely used benchmark, to test molecular property prediction ability, covering three biophysics datasets (i.e. Bace, HIV, and MUV) and five physiology datasets (i.e. BBBP, Sider, ClinTox, Tox21, and Toxcast). Following previous practice on MoleculeNet, each dataset is split into train, validation, and test sets with a ratio of 8:1:1, based on the scaffold splitter of DeepChem. Since the targets are classification tasks, we use ROC-AUC (%) as the evaluation metric. Specifically, for UniMAP we run 3 times with different random seeds for each dataset and report the mean and standard deviation.
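For reference, scaffold splitting groups molecules by their Bemis-Murcko scaffold so that test scaffolds are unseen during training. The sketch below re-implements the behaviour of DeepChem's ScaffoldSplitter in a simplified greedy form:

```python
from collections import defaultdict
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_split(smiles_list, frac_train=0.8, frac_valid=0.1):
    """Assign whole scaffold groups (largest first) to train/valid/test
    buckets, so no scaffold is shared across splits."""
    groups = defaultdict(list)
    for i, smi in enumerate(smiles_list):
        groups[MurckoScaffold.MurckoScaffoldSmiles(smiles=smi)].append(i)
    train, valid, test = [], [], []
    n = len(smiles_list)
    for group in sorted(groups.values(), key=len, reverse=True):
        if len(train) + len(group) <= frac_train * n:
            train += group
        elif len(valid) + len(group) <= frac_valid * n:
            valid += group
        else:
            test += group
    return train, valid, test
```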

Baseline Methods. We compare UniMAP with various pre-training baselines, including the SMILES based methods ChemBERTa [6] and SMILES Transformer [7]; the graph based methods KCL [26], GROVER [8], AttrMasking [9], ContextPred [9], GraphLoG [27], GraphMAE [28], MGSSL [29], and Mole-BERT [30]; the 3D based methods 3D InfoMax [31] and GraphMVP [32]; and the hybrid method MOCO [14]. Most results are taken from the original papers, except for KCL and GROVER, which use different data split methods. To guarantee a fair comparison, we take their pre-trained models and fine-tune them on the same data as the other baselines.

Fine-Tuning UniMAP. Based on the pre-trained representation $x_{cls}$, a two-layer MLP is adopted to obtain the final classification label, supervised by a corresponding classification loss, i.e. cross entropy. For each task, we search 20 hyper-parameter combinations of learning rate, batch size, and weight decay via grid search on the validation set and report the best results.

Results. As shown in Table 1, UniMAP achieves the best results on 7 out of 8 datasets, while on Sider our result of 66.6% is the second best, which is slightly worse than GROVER’s 66.8%. Specifically from the averaged results, we can see that UniMAP (i.e. 78.4%) significantly outperforms the second-best method MOCO (i.e. 74.4%), which is also a hybrid method that combines four different molecular data forms. These results confirm the strength of UniMAP to perform consistently better on various molecular property prediction datasets from MoleculeNet, by using both SMILES and graph representation with fine-grained alignment.

3.2 Drug-Target Affinity Prediction

Table 2: Experimental results on DTA benchmarks. We report MSE and Concordance Index (CI) scores; the best and second best results are marked in bold and underlined, respectively.
Task Davis KIBA
Method MSE↓ CI↑ MSE↓ CI↑
KronRLS 0.329 0.847 0.852 0.688
GraphDTA 0.263 0.864 0.183 0.862
DeepDTA 0.262 0.870 0.196 0.864
SMT-DTA 0.219 0.890 0.154 0.894
GraphMVP 0.274 - 0.175 -
Mole-BERT 0.266 - 0.157 -
3D PGT 0.275 - 0.173 -
SimSGT 0.251 - 0.153 -
UniMAP 0.246 0.888 0.144 0.891

Data. Drug-Target Affinity (DTA) prediction is an important task in drug discovery, aiming to predict the binding affinity between a drug and a target protein. We evaluate our method on two widely used DTA benchmarks, DAVIS [33] and KIBA [34]. DAVIS records the dissociation constant ($K_d$) of kinase proteins and their inhibitors, while KIBA contains $K_i$, $K_d$, and $IC_{50}$ values describing the bioactivity of kinase inhibitors. We follow GraphMVP to split the datasets and report MSE and Concordance Index (CI) scores as evaluation metrics.

Baseline Models. We compare our method with three different types of baselines. KronRLS [35], GraphDTA [36] and DeepDTA [37] are supervised methods for DTA; SMT-DTA [38] is a semi-supervised method, which trains the model with an additional unsupervised masked language modeling task with large-scale unlabeled molecules and protein data; GraphMVP, 3D PGT [39], SimSGT [40] and Mole-BERT are molecular pre-training methods.

Fine-Tuning UniMAP. Following GraphMVP, we model the target protein with a convolutional neural network to get the protein embedding, concatenate it with the molecular embedding $x_{cls}$, and then use an MLP to obtain the predicted binding affinity.
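A sketch of this fine-tuning head, in the style of DeepDTA/GraphMVP-type protein encoders; layer sizes and the residue alphabet size are illustrative, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class DTAHead(nn.Module):
    """1D-CNN over the protein sequence, concatenated with the molecular
    embedding x_cls, followed by an MLP regressor."""
    def __init__(self, n_amino: int = 26, mol_dim: int = 768, hid: int = 128):
        super().__init__()
        self.embed = nn.Embedding(n_amino, hid)
        self.cnn = nn.Sequential(
            nn.Conv1d(hid, hid, kernel_size=8), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1))
        self.mlp = nn.Sequential(
            nn.Linear(hid + mol_dim, 512), nn.ReLU(), nn.Linear(512, 1))

    def forward(self, protein_ids: torch.Tensor, x_cls: torch.Tensor):
        # protein_ids: (B, L) residue indices; x_cls: (B, mol_dim)
        p = self.cnn(self.embed(protein_ids).transpose(1, 2)).squeeze(-1)
        return self.mlp(torch.cat([p, x_cls], dim=-1)).squeeze(-1)
```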

Results. The results are shown in Table 2. We report MSE and Concordance Index (CI) scores as evaluation metrics; GraphMVP and Mole-BERT did not report CI scores in their original results. From the results, we can see that UniMAP outperforms all supervised DTA methods and other molecular pre-training methods, and performs only slightly worse than SMT-DTA. This may be because SMT-DTA utilizes large-scale unlabeled protein data and a more complicated Transformer to learn the protein embedding. Still, UniMAP performs better in terms of MSE on KIBA, indicating the ability of UniMAP to learn a meaningful representation for DTA.

4 Discussions

In this section, we provide an ablation study of losses and visualization results, to facilitate further understanding of UniMAP.

4.1 Ablation Study on Different Losses

Table 3: Comparison with different loss combinations.
Method BBBP Sider ClinTox Bace Tox21 Toxcast HIV MUV Average
SGM + CMM + FLA + DKL 76.1(0.4) 65.8(0.1) 98.6(0.2) 80.9(0.3) 76.3(0.6) 60.1(0.4) 77.1(0.4) 72.6(0.3) 75.9
SGM + CMM + FLA 75.0(0.4) 63.6(0.4) 98.7(0.2) 81.2(0.8) 74.9(0.3) 60.8(0.8) 77.6(1.5) 71.8(1.4) 75.5
SGM + CMM 71.1(0.1) 63.8(0.3) 98.1(0.4) 80.1(0.8) 75.4(0.2) 59.4(0.6) 76.5(0.9) 72.8(1.1) 74.6
SGM + Conditional Masking 71.1(0.5) 60.7(0.6) 96.4(0.8) 79.3(0.3) 74.6(0.1) 58.1(1.8) 72.2(0.3) 67.7(0.7) 72.5
SGM + Single-Modality Masking 71.0(0.3) 61.7(0.4) 95.8(0.7) 79.2(1.4) 75.6(0.2) 57.4(0.4) 75.7(0.1) 69.3(0.7) 73.2

Our ablation study contains two parts: comparing different losses and comparing different masking strategies. We test on 8 typical classification tasks, and the results are shown in Table 3. Each setting pre-trains the model on only 1 million molecules to save time and computational resources.

Firstly, we remove DKL and FLA one by one. From the results, we can see that DKL proves particularly beneficial for tasks such as Sider and Tox21. This advantage can be attributed to the functional group prediction task within DKL, which directly relates to molecular side effects and toxicity: certain functional groups, when present, can induce adverse reactions or toxicity in the human body. For example, the presence of aldehyde groups has been linked to hepatocyte side effects [41]. Fragment-Level Alignment (FLA) is particularly advantageous for tasks like BBBP and HIV. Aligning fragments across modalities enables the model to explicitly capture fragment semantics, thus enhancing sensitivity to fragment replacements that may influence molecular properties. As depicted in Figure 1, substituting a single fragment can flip a molecule's HIV-inhibition activity. Similarly, for BBBP, pharmacologists commonly examine the fragments within a drug that promote or inhibit blood-brain barrier permeability [42].

Along with the SGM task, which is commonly adopted in existing multi-modal methods, we compare CMM in UniMAP with two other token-level masking strategies, i.e. token-level conditional masking and single-modality masking. Specifically, token-level conditional masking means that we mask tokens in only one modality, i.e. SMILES tokens or graph atoms, and conduct the prediction based on both modalities; single-modality masking means that the model predicts the masked tokens from only the surrounding tokens of the same modality. The results show that CMM performs best on most tasks, except on Tox21 (where it is slightly worse than conditional masking). The overall superiority of CMM over the other two masking strategies can be attributed to our more challenging fragment-level conditional masking and the consideration of both intra-modal and inter-modal patterns. We can therefore see that an in-depth multi-modality dependency strategy produces a better molecular representation.

4.2 Embedding Clustering Results

We conduct visualization analyses on the output embeddings from both molecular and fragment levels.

Firstly, we randomly select about two thousand molecules belonging to 16 pre-defined scaffold types (denoted by different colors), and visualize their UniMAP representations through t-SNE [43]. The results are shown in Figure 3. We can see that the learned representations demonstrate a clear scaffold clustering property, i.e. molecules with the same scaffold are mapped to close positions in the embedding space.
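The projection itself is standard; a sketch with scikit-learn, assuming the embeddings and integer scaffold IDs are available as arrays:

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_scaffold_clusters(embeddings: np.ndarray, scaffold_ids: np.ndarray):
    """Project molecular embeddings (N, D) to 2D with t-SNE and color
    the points by scaffold, in the style of Figure 3."""
    coords = TSNE(n_components=2, random_state=0).fit_transform(embeddings)
    plt.scatter(coords[:, 0], coords[:, 1], c=scaffold_ids, cmap="tab20", s=5)
    plt.axis("off")
    plt.show()
```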

In addition, we visualize the embeddings of four thousand molecular fragments, as shown in Figure 4. We find that some clustered fragments correspond to certain chemical categories [44]. For example, four adjacent clusters colored light yellow, light grey, bright green, and magenta stand for Fluoride, Chloride, Bromide, and Iodide, respectively, which contain the different halogen atoms F, Cl, Br, and I. Three clusters colored red, blue, and magenta correspond to Alkane, Thioalkane, and Bromoalkane, respectively. We also find three nitrogenous fragment clusters, including Nitrile, Amide, and Nitro compound, and two heterocycle clusters, including Oxygen heterocycle and Oxidized nitrogen heterocycle.

From the above visualization results, we can conclude that UniMAP provides meaningful molecular-level and fragment-level representations.

Figure 3: Visualization of molecular embeddings with colors indicating different scaffolds.
Figure 4: Visualization of fragment embeddings. Several clusters are labeled with colors indicating different categories, of which chemical names following several representative fragments are listed on the right.

5 Conclusion

In this paper, we propose UniMAP, a single-stream multi-modality model that fuses SMILES and graph in a fine-grained manner for molecular representation learning. Firstly, a SMILES-graph decomposition algorithm is designed to obtain corresponding SMILES and graph fragments, which together with tokens serve as heterogeneous inputs to the pre-training tasks. Secondly, a shared Transformer is utilized to conduct deep cross-modality fusion. Besides molecular-level tasks such as SMILES-Graph Matching and Domain Knowledge Learning, we introduce two novel fragment-level pre-training tasks, i.e. Multi-Level Cross-Modality Masking and Fragment-Level Alignment. Finally, we conduct extensive experiments on three different kinds of downstream tasks, including molecular property prediction, drug-target affinity prediction, and drug-drug interaction. Our experimental results show that UniMAP significantly outperforms existing supervised and pre-training methods. We also conduct an ablation study on the effect of different losses, and observe an interesting complementary relation between the global molecular-level pre-training losses and the fine-grained local semantic representation losses. Furthermore, we conduct a visualization analysis of the output embeddings and find that the clusters of both molecular and fragment representations demonstrate meaningful chemical properties. In addition, the learned attention weights of UniMAP show a clear fragment alignment pattern, validating the soundness of UniMAP.

Though this paper focuses on SMILES and graph data fusion, our proposed UniMAP is a modality-agnostic framework and therefore has the potential to be extended to a universal model, including various molecular data formats and downstream tasks, which could inspire diverse directions for future exploration.

References

  • [1] Zhenqin Wu, Bharath Ramsundar, Evan N Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S Pappu, Karl Leswing, and Vijay Pande. Moleculenet: a benchmark for molecular machine learning. Chemical science, 9(2):513–530, 2018.
  • [2] Daniil Polykovskiy, Alexander Zhebrak, Benjamin Sanchez-Lengeling, Sergey Golovanov, Oktai Tatanov, Stanislav Belyaev, Rauf Kurbanov, Aleksey Artamonov, Vladimir Aladinskiy, Mark Veselov, et al. Molecular sets (moses): a benchmarking platform for molecular generation models. Frontiers in pharmacology, 11:565644, 2020.
  • [3] Talia B Kimber, Yonghui Chen, and Andrea Volkamer. Deep learning in virtual screening: recent applications and developments. International Journal of Molecular Sciences, 22(9):4435, 2021.
  • [4] Harry L Morgan. The generation of a unique machine description for chemical structures-a technique developed at chemical abstracts service. Journal of chemical documentation, 5(2):107–113, 1965.
  • [5] David Weininger. Smiles, a chemical language and information system. 1. introduction to methodology and encoding rules. Journal of chemical information and computer sciences, 28(1):31–36, 1988.
  • [6] Seyone Chithrananda, Gabriel Grand, and Bharath Ramsundar. Chemberta: large-scale self-supervised pretraining for molecular property prediction. arXiv preprint arXiv:2010.09885, 2020.
  • [7] Shion Honda, Shoi Shi, and Hiroki R Ueda. Smiles transformer: Pre-trained molecular fingerprint for low data drug discovery. arXiv preprint arXiv:1911.04738, 2019.
  • [8] Yu Rong, Yatao Bian, Tingyang Xu, Weiyang Xie, Ying Wei, Wenbing Huang, and Junzhou Huang. Self-supervised graph transformer on large-scale molecular data. Advances in Neural Information Processing Systems, 33:12559–12571, 2020.
  • [9] Weihua Hu, Bowen Liu, Joseph Gomes, Marinka Zitnik, Percy Liang, Vijay Pande, and Jure Leskovec. Strategies for pre-training graph neural networks. In International Conference on Learning Representations, 2019.
  • [10] Yuyang Wang, Jianren Wang, Zhonglin Cao, and Amir Barati Farimani. Molecular contrastive learning of representations via graph neural networks. Nature Machine Intelligence, 4(3):279–287, 2022.
  • [11] Yuyang Wang, Rishikesh Magar, Chen Liang, and Amir Barati Farimani. Improving molecular contrastive learning via faulty negative mitigation and decomposed fragment contrast. Journal of Chemical Information and Modeling, 2022.
  • [12] Deli Chen, Yankai Lin, Wei Li, Peng Li, Jie Zhou, and Xu Sun. Measuring and relieving the over-smoothing problem for graph neural networks from the topological view. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 3438–3445, 2020.
  • [13] Jake Topping, Francesco Di Giovanni, Benjamin Paul Chamberlain, Xiaowen Dong, and Michael M. Bronstein. Understanding over-squashing and bottlenecks on graphs via curvature. In International Conference on Learning Representations, 2022.
  • [14] Yanqiao Zhu, Dingshuo Chen, Yuanqi Du, Yingze Wang, Qiang Liu, and Shu Wu. Improving molecular pretraining with complementary featurizations. arXiv preprint arXiv:2209.15101, 2022.
  • [15] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • [16] Tairan Liu, Misagh Naderi, Chris Alvin, Supratik Mukhopadhyay, and Michal Brylinski. Break down in order to build up: decomposing small molecules for fragment-based drug design with e molfrag. Journal of chemical information and modeling, 57(4):627–631, 2017.
  • [17] Xiao Qing Lewell, Duncan B Judd, Stephen P Watson, and Michael M Hann. Recap retrosynthetic combinatorial analysis procedure: a powerful new technique for identifying privileged molecular fragments with useful applications in combinatorial chemistry. Journal of chemical information and computer sciences, 38(3):511–522, 1998.
  • [18] Jörg Degen, Christof Wegscheid-Gerlach, Andrea Zaliani, and Matthias Rarey. On the art of compiling and using ‘drug-like’ chemical fragment spaces. ChemMedChem: Chemistry Enabling Drug Discovery, 3(10):1503–1507, 2008.
  • [19] Sunghwan Kim, Jie Chen, Tiejun Cheng, Asta Gindulyte, Jia He, Siqian He, Qingliang Li, Benjamin A Shoemaker, Paul A Thiessen, Bo Yu, et al. Pubchem 2019 update: improved access to chemical data. Nucleic acids research, 47(D1):D1102–D1109, 2019.
  • [20] Bharath Ramsundar, Peter Eastman, Patrick Walters, Vijay Pande, Karl Leswing, and Zhenqin Wu. Deep Learning for the Life Sciences. O’Reilly Media, 2019. https://www.amazon.com/Deep-Learning-Life-Sciences-Microscopy/dp/1492039837.
  • [21] Guohao Li, Chenxin Xiong, Ali Thabet, and Bernard Ghanem. Deepergcn: All you need to train deeper gcns. arXiv preprint arXiv:2006.07739, 2020.
  • [22] Junjie Huang, Duyu Tang, Wanjun Zhong, Shuai Lu, Linjun Shou, Ming Gong, Daxin Jiang, and Nan Duan. Whiteningbert: An easy unsupervised sentence embedding approach. arXiv preprint arXiv:2104.01767, 2021.
  • [23] Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. Uniter: Universal image-text representation learning. In European conference on computer vision, pages 104–120. Springer, 2020.
  • [24] David Rogers and Mathew Hahn. Extended-connectivity fingerprints. Journal of chemical information and modeling, 50(5):742–754, 2010.
  • [25] Greg Landrum et al. Rdkit: A software suite for cheminformatics, computational chemistry, and predictive modeling. Greg Landrum, 2013.
  • [26] Yin Fang, Qiang Zhang, Haihong Yang, Xiang Zhuang, Shumin Deng, Wen Zhang, Ming Qin, Zhuo Chen, Xiaohui Fan, and Huajun Chen. Molecular contrastive learning with chemical element knowledge graph. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 3968–3976, 2022.
  • [27] Minghao Xu, Hang Wang, Bingbing Ni, Hongyu Guo, and Jian Tang. Self-supervised graph-level representation learning with local and global structure. In International Conference on Machine Learning, pages 11548–11558. PMLR, 2021.
  • [28] Zhenyu Hou, Xiao Liu, Yuxiao Dong, Chunjie Wang, Jie Tang, et al. Graphmae: Self-supervised masked graph autoencoders. arXiv preprint arXiv:2205.10803, 2022.
  • [29] Zaixi Zhang, Qi Liu, Hao Wang, Chengqiang Lu, and Chee-Kong Lee. Motif-based graph self-supervised learning for molecular property prediction. Advances in Neural Information Processing Systems, 34:15870–15882, 2021.
  • [30] Anonymous. Mole-BERT: Rethinking pre-training graph neural networks for molecules. In Submitted to The Eleventh International Conference on Learning Representations, 2023. under review.
  • [31] Hannes Stärk, Dominique Beaini, Gabriele Corso, Prudencio Tossou, Christian Dallago, Stephan Günnemann, and Pietro Liò. 3d infomax improves gnns for molecular property prediction. In International Conference on Machine Learning, pages 20479–20502. PMLR, 2022.
  • [32] Shengchao Liu, Hanchen Wang, Weiyang Liu, Joan Lasenby, Hongyu Guo, and Jian Tang. Pre-training molecular graph representation with 3d geometry. In International Conference on Learning Representations, 2022.
  • [33] Mindy I Davis, Jeremy P Hunt, Sanna Herrgard, Pietro Ciceri, Lisa M Wodicka, Gabriel Pallares, Michael Hocker, Daniel K Treiber, and Patrick P Zarrinkar. Comprehensive analysis of kinase inhibitor selectivity. Nature biotechnology, 29(11):1046–1051, 2011.
  • [34] Jing Tang, Agnieszka Szwajda, Sushil Shakyawar, Tao Xu, Petteri Hintsanen, Krister Wennerberg, and Tero Aittokallio. Making sense of large-scale kinase inhibitor bioactivity data sets: a comparative and integrative analysis. Journal of Chemical Information and Modeling, 54(3):735–743, 2014.
  • [35] R Panico, WH Powell, and Jean-Claude Richer. A guide to IUPAC Nomenclature of Organic Compounds, volume 2. Blackwell Scientific Publications, Oxford, 1993.
  • [36] Thin Nguyen, Hang Le, Thomas P Quinn, Tri Nguyen, Thuc Duy Le, and Svetha Venkatesh. Graphdta: Predicting drug–target binding affinity with graph neural networks. Bioinformatics, 37(8):1140–1147, 2021.
  • [37] Hakime Öztürk, Arzucan Özgür, and Elif Ozkirimli. Deepdta: deep drug–target binding affinity prediction. Bioinformatics, 34(17):i821–i829, 2018.
  • [38] Qizhi Pei, Lijun Wu, Jinhua Zhu, Yingce Xia, Shufang Xia, Tao Qin, Haiguang Liu, and Tie-Yan Liu. Smt-dta: Improving drug-target affinity prediction with semi-supervised multi-task training. arXiv preprint arXiv:2206.09818, 2022.
  • [39] Xu Wang, Huan Zhao, Wei-wei Tu, and Quanming Yao. Automated 3d pre-training for molecular property prediction. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 2419–2430, 2023.
  • [40] Zhiyuan Liu, Yaorui Shi, An Zhang, Enzhi Zhang, Kenji Kawaguchi, Xiang Wang, and Tat-Seng Chua. Rethinking tokenizer and decoder in masked graph modeling for molecules. Advances in Neural Information Processing Systems, 36, 2024.
  • [41] Richard M LoPachin and Terrence Gavin. Molecular mechanisms of aldehyde toxicity: a chemical perspective. Chemical research in toxicology, 27(7):1081–1091, 2014.
  • [42] Baichen Xiong, Yuanyuan Wang, Ying Chen, Shuaishuai Xing, Qinghong Liao, Yao Chen, Qi Li, Wei Li, and Haopeng Sun. Strategies for structural modification of small molecules to improve blood–brain barrier penetration: a recent perspective. Journal of Medicinal Chemistry, 64(18):13152–13173, 2021.
  • [43] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(11), 2008.
  • [44] Jonathan Clayden, Nick Greeves, and Stuart Warren. Organic chemistry. Oxford university press, 2012.
  • [45] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.
  • [46] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach, 2019.
  • [47] Jinhua Zhu, Yingce Xia, Lijun Wu, Di He, Tao Qin, Wengang Zhou, Houqiang Li, and Tieyan Liu. Incorporating bert into neural machine translation. In International Conference on Learning Representations, 2020.
  • [48] Sheng Wang, Yuzhi Guo, Yuhong Wang, Hongmao Sun, and Junzhou Huang. Smiles-bert: large scale unsupervised pre-training for molecular property prediction. In Proceedings of the 10th ACM international conference on bioinformatics, computational biology and health informatics, pages 429–436, 2019.
  • [49] Dongyu Xue, Han Zhang, Dongling Xiao, Yukang Gong, Guohui Chuai, Yu Sun, Hao Tian, Hua Wu, Yukun Li, and Qi Liu. X-mol: large-scale pre-training for molecular understanding and diverse molecular analysis. bioRxiv, pages 2020–12, 2021.
  • [50] Xinyuan Lin, Chi Xu, Zhaoping Xiong, Xinfeng Zhang, Ningxi Ni, Bolin Ni, Jianlong Chang, Ruiqing Pan, Zidong Wang, Fan Yu, et al. Pangu drug model: Learn a molecule like a human. bioRxiv, 2022.
  • [51] Mario Krenn, Florian Häse, AkshatKumar Nigam, Pascal Friederich, and Alan Aspuru-Guzik. Self-referencing embedded strings (selfies): A 100% robust molecular string representation. Machine Learning: Science and Technology, 1(4):045024, 2020.
  • [52] Pengyong Li, Jun Wang, Yixuan Qiao, Hao Chen, Yihuan Yu, Xiaojun Yao, Peng Gao, Guotong Xie, and Sen Song. An effective self-supervised framework for learning expressive molecular global representations to drug discovery. Briefings in Bioinformatics, 22(6):bbab109, 2021.
  • [53] Shuangli Li, Jingbo Zhou, Tong Xu, Dejing Dou, and Hui Xiong. Geomgcl: Geometric graph contrastive learning for molecular property prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 4541–4549, 2022.
  • [54] Xiaomin Fang, Lihang Liu, Jieqiong Lei, Donglong He, Shanzhuo Zhang, Jingbo Zhou, Fan Wang, Hua Wu, and Haifeng Wang. Geometry-enhanced molecular representation learning for property prediction. Nature Machine Intelligence, 4(2):127–134, 2022.
  • [55] Gengmo Zhou, Zhifeng Gao, Qiankun Ding, Hang Zheng, Hongteng Xu, Zhewei Wei, Linfeng Zhang, and Guolin Ke. Uni-mol: A universal 3d molecular representation learning framework. In The Eleventh International Conference on Learning Representations, 2023.
  • [56] Shengjie Luo, Tianlang Chen, Yixian Xu, Shuxin Zheng, Tie-Yan Liu, Liwei Wang, and Di He. One transformer can understand both 2d & 3d molecular data. arXiv preprint arXiv:2210.01765, 2022.
  • [57] Rui Jiao, Jiaqi Han, Wenbing Huang, Yu Rong, and Yang Liu. Energy-motivated equivariant pretraining for 3d molecular graphs. arXiv preprint arXiv:2207.08824, 2022.
  • [58] Sheheryar Zaidi, Michael Schaarschmidt, James Martens, Hyunjik Kim, Yee Whye Teh, Alvaro Sanchez-Gonzalez, Peter Battaglia, Razvan Pascanu, and Jonathan Godwin. Pre-training via denoising for molecular property prediction. arXiv preprint arXiv:2206.00133, 2022.
  • [59] Rui Feng, Qi Zhu, Huan Tran, Binghong Chen, Aubrey Toland, Rampi Ramprasad, and Chao Zhang. May the force be with you: Unified force-centric pre-training for 3d molecular conformations. Advances in Neural Information Processing Systems, 36, 2024.
  • [60] Jun Xia, Yanqiao Zhu, Yuanqi Du, and Stan Z Li. A systematic survey of chemical pre-trained models. arXiv preprint arXiv:2210.16484, 2022.
  • [61] Jinhua Zhu, Yingce Xia, Tao Qin, Wengang Zhou, Houqiang Li, and Tie-Yan Liu. Dual-view molecule pre-training. arXiv preprint arXiv:2106.10234, 2021.
  • [62] Zhihui Guo, Pramod Sharma, Andy Martinez, Liang Du, and Robin Abraham. Multilingual molecular representation learning via contrastive pre-training. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3441–3453, 2022.
  • [63] Ana Sanchez-Fernandez, Elisabeth Rumetshofer, Sepp Hochreiter, and Günter Klambauer. Contrastive learning of image- and structure-based representations in drug discovery. In ICLR2022 Machine Learning for Drug Discovery, 2022.
  • [64] Yuqi Huo, Manli Zhang, Guangzhen Liu, Haoyu Lu, Yizhao Gao, Guoxing Yang, Jingyuan Wen, Heng Zhang, Baogui Xu, Weihao Zheng, et al. Wenlan: Bridging vision and language by large-scale multi-modal pre-training. arXiv preprint arXiv:2103.06561, 2021.
  • [65] Chenliang Li, Ming Yan, Haiyang Xu, Fuli Luo, Wei Wang, Bin Bi, and Songfang Huang. Semvlp: Vision-language pre-training by aligning semantics at multiple levels. arXiv preprint arXiv:2103.07829, 2021.
  • [66] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
  • [67] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pages 4904–4916. PMLR, 2021.
  • [68] Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, et al. Oscar: Object-semantics aligned pre-training for vision-language tasks. In European Conference on Computer Vision, pages 121–137. Springer, 2020.
  • [69] Jae Yong Ryu, Hyun Uk Kim, and Sang Yup Lee. Deep learning improves prediction of drug–drug and drug–food interactions. Proceedings of the National Academy of Sciences, 115(18):E4304–E4311, 2018.
  • [70] Yujie Chen, Tengfei Ma, Xixi Yang, Jianmin Wang, Bosheng Song, and Xiangxiang Zeng. Muffin: multi-scale feature fusion for drug–drug interaction prediction. Bioinformatics, 37(17):2651–2658, 2021.
  • [71] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 701–710, 2014.
  • [72] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. Line: Large-scale information network embedding. In Proceedings of the 24th international conference on world wide web, pages 1067–1077, 2015.
  • [73] Han Tang, Shikun Feng, Bicheng Lin, Yuyan Ni, Jingjing Liu, Wei-Ying Ma, and Yanyan Lan. Contextual molecule representation learning from chemical reaction knowledge. arXiv preprint arXiv:2402.13779, 2024.
  • [74] John S Delaney. Esol: estimating aqueous solubility directly from molecular structure. Journal of chemical information and computer sciences, 44(3):1000–1005, 2004.
  • [75] Anna Gaulton, Louisa J Bellis, A Patricia Bento, Jon Chambers, Mark Davies, Anne Hersey, Yvonne Light, Shaun McGlinchey, David Michalovich, Bissan Al-Lazikani, et al. Chembl: a large-scale bioactivity database for drug discovery. Nucleic acids research, 40(D1):D1100–D1107, 2012.
  • [76] Johannes Hachmann, Roberto Olivares-Amaya, Sule Atahan-Evrenk, Carlos Amador-Bedolla, Roel S Sánchez-Carrera, Aryeh Gold-Parker, Leslie Vogt, Anna M Brockway, and Alán Aspuru-Guzik. The harvard clean energy project: large-scale computational screening and design of organic photovoltaics on the world community grid. The Journal of Physical Chemistry Letters, 2(17):2241–2251, 2011.
  • [77] Francisco-Javier Gamo, Laura M Sanz, Jaume Vidal, Cristina De Cozar, Emilio Alvarez, Jose-Luis Lavandera, Dana E Vanderwall, Darren VS Green, Vinod Kumar, Samiul Hasan, et al. Thousands of chemical starting points for antimalarial lead identification. Nature, 465(7296):305–310, 2010.
  • [78] Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D Manning. What does bert look at? an analysis of bert’s attention. arXiv preprint arXiv:1906.04341, 2019.
  • [79] Daniel A Erlanson, Robert S McDowell, and Tom O’Brien. Fragment-based drug discovery. Journal of medicinal chemistry, 47(14):3463–3482, 2004.
  • [80] David L Davies and Donald W Bouldin. A cluster separation measure. IEEE transactions on pattern analysis and machine intelligence, (2):224–227, 1979.
  • [81] W Haefely, M Facklam, P Schoch, JR Martin, EP Bonetti, JL Moreau, F Jenck, and JG Richards. Partial agonists of benzodiazepine receptors for the treatment of epilepsy, sleep, and anxiety disorders. Advances in biochemical psychopharmacology, 47:379–394, 1992.
  • [82] Shikun Feng, Minghao Li, Yinjun Jia, Weiying Ma, and Yanyan Lan. Protein-ligand binding representation learning from fine-grained interactions. arXiv preprint arXiv:2311.16160, 2023.
  • [83] Bowen Gao, Bo Qiang, Haichuan Tan, Yinjun Jia, Minsi Ren, Minsi Lu, Jingjing Liu, Wei-Ying Ma, and Yanyan Lan. Drugclip: Contrasive protein-molecule representation learning for virtual screening. Advances in Neural Information Processing Systems, 36, 2024.

Appendix A Open Access to Data and Code for Reproducibility

The code is released publicly at https://anonymous.4open.science/r/UniMAP-3CD7 for reproducibility.

Appendix B Experiments Details

The supplementary pre-training hyperparameters for UniMAP are outlined in Table 4. For fine-tuning, Table 5 details the grid-search space for MoleculeNet, and Table 6 lists the fine-tuning hyperparameters for the DDI prediction task and the DTA prediction tasks (DAVIS and KIBA).

Table 4: Hyperparameters for the UniMAP pre-training.
Parameter Value or description
Batch size 256
Optimizer AdamW
Adam betas (0.9, 0.999)
Max Learning rate 5e-5
Warm up steps 10000
Learning rate decay policy linear decay
Training epochs 40
GNN encoder layers number 3
GNN embedding size 384
Transformer layers number 12
Transformer embedding size 768
token-level masking ratio 0.2
fragment-level masking ratio 0.6
temperature τ for contrastive loss 0.05
Table 5: Search space for the tasks in MoleculeNet dataset
Parameter Search Space
Learning rate [1e-6,1e-4]
Batch size {16,32,64}
Weight decay [1e-7,1e-3]
Table 6: Hyperparameters for fine-tuning on DDI prediction task and DTA prediction tasks.
Parameter DDI Davis KIBA
Batch size 32 64 64
Learning rate 3e-5 2e-5 1e-4
Weight decay 1e-6 1e-6 1e-5

Appendix C Related Work

Here we briefly introduce some typical molecular representation learning methods, including SMILES based methods, graph based methods, 3D based methods, and hybrid methods using multiple kinds of molecular input.

C.1 SMILES based Methods

Due to the sequential format of a SMILES string, NLP pre-training methods [45, 46, 47] have recently been applied to SMILES and achieved competitive results. For example, SMILES-BERT [48] and ChemBERTa [6] use BERT and RoBERTa as the backbone to conduct MLM on SMILES, respectively, while SMILES Transformer [7] and X-MOL [49] leverage an encoder-decoder architecture to learn representations by SMILES generation. Owing to the success of large-scale language models, these models are able to utilize large-scale molecular data for pre-training, e.g. 1.1 billion molecules in X-MOL. However, previous work [8] has shown that these methods are limited in capturing molecular structure.

C.2 Graph based Methods

In general, graph based pre-training methods can be classified into three categories: generative, predictive, and contrastive methods. Generative methods aim to reconstruct the original molecular graph or to generate a 1D sequence from the 2D graph. For example, MGSSL [29] generates the motif tree of a molecule using a pre-defined motif vocabulary, and PanGu [50] takes the 2D graph as input to generate 1D Self-Referencing Embedded Strings (SELFIES) [51] via a conditional variational autoencoder. Predictive methods construct context from the graph structure and adopt context-aware masking for self-supervised learning; for example, [9, 8] define neighborhoods in a graph as context and use them to predict masked atoms/edges/motifs. Contrastive methods mainly apply contrastive learning on molecular graphs: MolCLR [10] and iMolCLR [11] first transform graphs by atom/edge/sub-graph masking to construct positive and negative pairs, then apply a contrastive loss. Other works use chemical knowledge for data augmentation in contrastive learning, e.g., KCL [26] and MPG [52].

C.3 3D based Methods

Recently, several methods have been proposed to pre-train on 3D molecular conformations. For example, GraphMVP [32], 3D-Infomax [31], and GeomGCL [53] adopt contrastive or generative learning between the 2D graph and the 3D input, leveraging 3D information to enhance 2D graph representations. GEM [54] develops a geometry learning task that predicts atom distances and angles from the 3D conformation. Besides, works such as Uni-Mol [55], Transformer-M [56], and 3D-EMGP [57] use 3D denoising to directly obtain 3D molecular representations, and have achieved remarkable performance owing to the proven equivalence between the 3D denoising objective and learning a force field [58]. In our view, representation learning from 3D data is a promising direction. Still, 3D datasets remain much smaller than 1D or 2D ones, and it is not yet clear which objective best reflects the characteristics of 3D conformations [59, 60]. Therefore, learning a reliable 3D conformation representation remains a challenging problem.

C.4 Hybrid Methods

Since a molecule can be represented in multiple forms, e.g., SMILES, 2D graph, 3D conformation, IUPAC name [35], and cell-based microscopy images, hybrid methods have been proposed to combine different forms into a unified representation. DMP [61], MM-Deacon [62], and CLOOME [63] combine 2D graph with SMILES, IUPAC with SMILES, and cell-based microscopy images with 2D graph, respectively. Moreover, MOCO [14] exploits four data forms, namely SMILES, 2D graph, 3D conformations, and Morgan fingerprints, for pre-training. However, these methods all use a two-stream or multi-stream architecture, in which a separate encoder first produces the representation of each data form, and these representations are then aligned with an additional loss, such as view consistency (in MOCO), multi-lingual contrastive alignment (in MM-Deacon), and the InfoLOOB contrastive loss (in CLOOME). Such two-stream models are good at learning the representation of each form, but miss the rich alignment information between different forms [64].

Appendix D Relations of UniMAP with Previous Work

Our work presents the first single-stream hybrid model, designed to exploit the correspondence between SMILES and graph. The basic idea of fine-grained alignment, though designed for SMILES and graph in our work, is also valid for combining other data forms, such as IUPAC names and cell-based microscopy images. In the future, we will investigate how to design a general framework to integrate different data forms.

Our work is also highly inspired by recent vision-language multi-modality research, which can be categorized into two-stream and single-stream methods. According to the analyses in WenLan [64] and SemVLP [65], two-stream methods such as CLIP [66], ALIGN [67], and WenLan are good at capturing weak correlations between vision and language, while single-stream methods such as Oscar [68] and UNITER [23] are more suitable for modeling strong correlations. In our study, SMILES and graph exhibit a typically strong correlation, i.e., the SMILES provides a detailed description of the graph, much like the strong correlation between image and language in the image captioning task. That is why we argue a single-stream architecture is better suited to fusing the different data forms of a molecule.

Appendix E Drug-Drug Interaction Prediction

Data. In addition to molecular property prediction and drug-target affinity prediction, we also conduct experiments on the drug-drug interaction (DDI) prediction task, which plays a crucial role in drug repositioning and virtual screening. Specifically, we experiment on the DrugBank Multi-Typed DDI dataset [69], which contains 192,284 DDI pairs covering 86 interaction types; the objective is to predict the interaction type of each drug pair. The dataset is split into train, validation, and test sets with a 3:1:1 ratio. We randomly split the data 3 times with different seeds and report the mean of the evaluation metrics, including accuracy, precision, recall, and F1-score, as in [70].

Baseline Models. We compare UniMAP with several kinds of baselines, including the deep learning method DeepDDI [69], the graph embedding methods DeepWalk [71] and LINE [72], and recent pre-training methods such as MUFFIN [70], MUFFIN_KG [70], X-MOL [49], PanGu [50], and REMO [73]. Note that MUFFIN_KG denotes the MUFFIN variant with an extra knowledge graph branch.

Table 7: Results for DDI task. We report the accuracy, precision, recall and F1-score, the best and second best results are marked bold and underlined, respectively.
Method Accuracy Precision Recall F1-score
DeepDDI 0.877 0.799 0.759 0.766
DeepWalk 0.800 0.822 0.710 0.747
LINE 0.751 0.687 0.545 0.580
MUFFIN 0.939 0.926 0.908 0.911
MUFFIN_KG 0.965 0.957 0.948 0.950
X-MOL 0.952 - - -
PanGu 0.957 - - -
REMO 0.953 0.932 0.932 0.928
UniMAP 0.972 0.958 0.943 0.952

Fine-Tuning UniMAP. For downstream fine-tuning, we concatenate the x_{cls} representations of the two paired drugs and feed the result to a two-layer MLP for multi-class prediction, as sketched below.
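
The following is a minimal sketch of this fine-tuning head, assuming the backbone exposes a 768-dimensional x_{cls} embedding per drug (Table 4); the class and variable names are ours, not from the released code.

```python
import torch
import torch.nn as nn

class DDIHead(nn.Module):
    # Two-layer MLP over the concatenated [CLS] embeddings of a drug pair,
    # predicting one of the 86 interaction types in the DrugBank dataset.
    def __init__(self, hidden_dim=768, num_classes=86):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, x_cls_a, x_cls_b):
        pair = torch.cat([x_cls_a, x_cls_b], dim=-1)  # (batch, 2 * hidden_dim)
        return self.mlp(pair)                          # logits over DDI types

# Usage: logits = DDIHead()(emb_a, emb_b)
#        loss = nn.CrossEntropyLoss()(logits, labels)
```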

Results. The experimental results are shown in Table 7. We report four metrics, accuracy, precision, recall, and F1-score; X-MOL and PanGu only report accuracy in their original papers. All pre-training methods (MUFFIN, MUFFIN_KG, X-MOL, PanGu, and UniMAP) outperform the deep learning method (DeepDDI) and the graph embedding methods (DeepWalk and LINE), demonstrating the benefit of pre-training. More importantly, UniMAP consistently outperforms the other pre-training methods on all four metrics, except for recall compared with MUFFIN_KG. This may be because MUFFIN_KG utilizes extra knowledge graph information, which is used by neither UniMAP nor the other pre-training baselines. Overall, by leveraging multi-modality pre-training, UniMAP achieves better DDI prediction results than single-modality pre-training methods such as MUFFIN, X-MOL, and PanGu.

Appendix F Regression Tasks in MoleculeNet

Table 8: Results in terms of mean and std for the 4 regression tasks. We report RMSE scores, the best and second best results are marked as bold and underlined, respectively.
Method ESOL Lipo Malaria CEP
GROVER(base) 0.904(0.061) 0.815(0.012) 1.092(0.007) 1.339(0.005)
GROVER(large) 0.951(0.027) 0.803(0.038) 1.082(0.009) 1.353(0.004)
GraphMVP 1.064(0.045) 0.691(0.013) 1.106(0.013) 1.228(0.001)
Mole-BERT 1.015(0.030) 0.676(0.017) 1.074(0.009) 1.232(0.009)
MOCO 0.984(0.034) 0.707(0.001) 1.093(0.009) 1.101(0.007)
UniMAP 0.861(0.010) 0.664(0.023) 1.043(0.007) 1.128(0.007)

Beyond the classification tasks, we also verify our method's effectiveness on 4 regression tasks. ESOL [74] contains regression labels measuring molecular water solubility. Lipophilicity (Lipo), a subset of ChEMBL [75], consists of experimental octanol/water distribution coefficients. CEP records the organic photovoltaic efficiency of molecules selected from the Harvard Clean Energy Project (CEP) [76]. Malaria [77] uses drug efficacy against malaria, a parasitic disease, as the regression target. We choose the top 4 methods based on the averaged classification results in Table 1 as baselines, i.e., MOCO, Mole-BERT, GROVER, and GraphMVP. As in the classification setting, a two-layer MLP is adopted to produce the final regression value of UniMAP, supervised by a regression loss, i.e., MAE; RMSE is used as the evaluation metric. From the results in Table 8, UniMAP performs best on three tasks and second best on the remaining one.
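
The fine-tuning protocol for these regression tasks can be sketched as follows, assuming the backbone returns a 768-dimensional molecule embedding; the helper names are ours.

```python
import torch
import torch.nn as nn

# Two-layer MLP regression head on top of the molecule embedding.
head = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 1))
mae_loss = nn.L1Loss()  # MAE supervises fine-tuning

def training_step(backbone, batch, targets):
    preds = head(backbone(batch)).squeeze(-1)
    return mae_loss(preds, targets)

def rmse(preds, targets):
    # RMSE is the evaluation metric reported in Table 8.
    return torch.sqrt(torch.mean((preds - targets) ** 2))
```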

Appendix G Fragment Alignment Pattern

Figure 5: Visualization of attention weights from SMILES tokens to graph atoms.

Inspired by the analysis in NLP [78] that higher-layer attention weights usually exhibit learned patterns, we visualize the learned attention weights at the 8th layer. Interestingly, the learned model demonstrates a clear fragment alignment pattern. We give three example molecules as a case study, as shown in Figure 5. Given the SMILES and graph representations of a molecule, multiple fragments are obtained and labeled with different colors, and the attention weights between SMILES tokens and graph atoms are drawn as lines with different color depths, where a deeper color indicates a higher attention weight. Clearly, the largest attention weights occur between SMILES tokens and graph atoms within the same fragment, validating the soundness of UniMAP.
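
As a hedged sketch of how such attention maps can be extracted, suppose the model returns a per-layer attention tensor of shape (heads, seq, seq) over the concatenated SMILES-plus-graph sequence, and that the index ranges of SMILES tokens and graph atoms within that sequence are known; both assumptions are ours.

```python
import torch

def smiles_to_graph_attention(attentions, smiles_idx, graph_idx, layer=7):
    # `attentions` is assumed to be a list of (heads, seq, seq) tensors,
    # one per Transformer layer; layer=7 selects the 8th layer (zero-indexed).
    attn = attentions[layer].mean(dim=0)   # average over heads: (seq, seq)
    # Rows: SMILES token queries; columns: graph atom keys.
    return attn[smiles_idx][:, graph_idx]  # (n_tokens, n_atoms)
```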

The alignment of fragments not only facilitates the seamless fusion of SMILES and graphs in a finer-grained manner but also enables the model to highlight the differences among fragments, thereby enhancing the comprehension of molecular properties. Although SMILES better captures global contextual information [61], it lacks the ability to model meaningful substructures such as fragments. For instance, in Figure 1, the adjacency of the yellow fragment in the SMILES is interrupted by another, green-colored fragment. Through cross-modality fragment alignment, the model successfully captures such fine-grained information via learned attention weights. Fragment alignment also helps the model comprehend a molecule by decomposition into fragments: when examining and designing molecules, trained chemists intuitively decompose them into fragments based on key factors such as synthetic accessibility, chemical properties, and biological functionality [79]. Fragment decomposition is therefore essential for molecular representation learning, as it captures local information such as functional groups and reduces molecular complexity for machine learning models.

In particular, in sub-figure (a) of Figure 5, UniMAP outlines the yellow fragment, carboxymethyl, and the blue fragment, pyridyl. Carboxymethyl is a well-known acidic fragment, and its presence often contributes to a molecule's acidic character. Conversely, pyridyl, which can readily accept protons, is a recognized basic fragment. This distinction also translates into different binding mechanisms. The carboxymethyl group, negatively charged at physiological pH, can form strong ionic bonds with positively charged residues in a binding site, and it can act as both a hydrogen bond donor (via the hydroxyl group) and acceptor (via the carbonyl oxygen), further stabilizing the complex through hydrogen bonding. In contrast, pyridyl can provide π-stacking interactions due to its aromaticity.

Finally, we re-run the code of MOCO on the three molecules in Figure 5. Since MOCO aligns the global representation of each modality by contrastive learning, we cannot visualize cross-modality attention. Instead, we calculate the cosine similarity between corresponding fragments in SMILES and graph, where each fragment embedding is obtained by average pooling the corresponding token/atom embeddings from the 1D/2D encoder. The results are presented in Table 9.

Table 9: The cosine similarity of corresponding fragments between graph and SMILES for the three molecules depicted in Figure 5.
Molecule Fragment1 Fragment2 Fragment3
a 0.0737 0.0482 -
b -0.0340 0.0258 -0.0177
c 0.0254 0.0227 -0.0203

As evident from Table 9, the low similarity scores indicate that it is difficult for two-stream methods like MOCO to capture fragment-level semantic information, due to the lack of fine-grained interaction and fragment-level alignment supervision.
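
For reference, the following is a minimal sketch of this fragment similarity computation, assuming per-token SMILES embeddings and per-atom graph embeddings from the two MOCO branches, plus index lists mapping a fragment to its tokens and atoms; the names are illustrative.

```python
import torch
import torch.nn.functional as F

def fragment_similarity(token_emb, atom_emb, token_idx, atom_idx):
    # Average-pool the embeddings belonging to the fragment in each modality,
    # then compare the two pooled vectors with cosine similarity.
    smiles_frag = token_emb[token_idx].mean(dim=0)  # (dim,)
    graph_frag = atom_emb[atom_idx].mean(dim=0)     # (dim,)
    return F.cosine_similarity(smiles_frag, graph_frag, dim=0).item()
```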

Appendix H Feature Visualization on ClinTox Task

Figure 6: Feature visualization of un-fine-tuned embeddings on the unbalanced ClinTox dataset, where distinct colors indicate the binary label of each sample. UniMAP exhibits the best clustering, as evidenced by the smallest DB Index, in line with its superior performance after fine-tuning presented in Table 1.

From the results presented in Table 1, UniMAP significantly outperforms the second-best baseline (GROVER) by a substantial margin (98.3 vs. 81.7) on the ClinTox task. To delve into the reasons behind UniMAP's superiority, we extract the un-fine-tuned embeddings of three pre-training methods, UniMAP, GROVER, and Mole-BERT, and employ t-SNE to visualize the clustering results, directly illustrating the effectiveness of pre-training. The clustering results are shown in Figure 6, where distinct colors represent the binary labels 0 and 1. To quantitatively evaluate clustering performance, we use the Davies-Bouldin Index (DB Index) [80]; a lower DB Index signifies better clustering. As Figure 6 shows, the labels in ClinTox are highly unbalanced, as evidenced by the scarcity of red points, which poses a challenge for clustering. While the red points appear scattered for GROVER and Mole-BERT, UniMAP effectively aggregates the limited red points into a meaningful cluster in the bottom right of the figure. This is advantageous for the subsequent fine-tuning process and helps explain the superior performance of UniMAP.
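
This evaluation pipeline can be sketched with scikit-learn's TSNE and davies_bouldin_score; whether the DB Index is computed on the raw embeddings or the 2D projection is an implementation choice, and this sketch computes it on the raw embeddings.

```python
from sklearn.manifold import TSNE
from sklearn.metrics import davies_bouldin_score

def visualize_and_score(embeddings, labels):
    # `embeddings`: (n_samples, dim) array of un-fine-tuned features;
    # `labels`: the binary ClinTox labels used to color the scatter plot.
    coords = TSNE(n_components=2).fit_transform(embeddings)  # 2D projection
    db_index = davies_bouldin_score(embeddings, labels)      # lower is better
    return coords, db_index
```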

Appendix I The Influence of Different Pooling Methods for Obtaining Fragment Embeddings

We employ different pooling techniques for obtaining fragment-level representations of the two modalities: multi-head attention followed by average pooling for SMILES, and average pooling for graphs. Our motivation is to tailor the pooling method to each modality's preferences. We conduct additional ablation experiments to verify the effect of different pooling methods. In particular, we add a setting that uses average pooling for both SMILES and graph, denoted 'Both Avg', and report the performance on MoleculeNet in Table 10. All settings use only 1M pre-training molecules to save time and computational cost.

Table 10: Performance of different pooling methods for obtaining fragment embeddings on MoleculeNet.
Method BBBP Sider ClinTox Bace Tox21 Toxcast MUV HIV
UniMAP 76.1(0.4) 65.8(0.1) 98.6(0.2) 80.9(0.3) 76.3(0.6) 60.1(0.4) 72.6(0.3) 77.1(0.4)
Both Avg 73.7(1.8) 63.5(1.3) 97.2(1.2) 82.9(0.7) 75.3(1.1) 59.0(0.7) 71.9(0.2) 76.6(0.1)

As shown in Table 10, the 'Both Avg' setting is overall inferior to the default setting. Since the fragment pooling operation only affects fragment-level alignment, we examine the loss curves and observe that the FLA (Fragment-Level Alignment) loss does not converge to the level of the default setting; we conjecture that this discrepancy contributes to the performance degradation.
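
For concreteness, the two pooling variants can be sketched as follows for a single fragment's token embeddings; the attention-pooling module and its configuration are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

def avg_pool(token_emb):
    # Plain average pooling: (n_tokens, dim) -> (dim,).
    # Used for graphs, and for both modalities in the 'Both Avg' ablation.
    return token_emb.mean(dim=0)

class AttnPool(nn.Module):
    # Multi-head attention followed by average pooling, the default for SMILES.
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, token_emb):
        x = token_emb.unsqueeze(0)      # add a batch dimension
        out, _ = self.attn(x, x, x)     # self-attention over fragment tokens
        return out.mean(dim=1).squeeze(0)  # then average pooling: (dim,)
```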

Appendix J More Discussion on the Complementarity between SMILES and Graph

To better understand the complementarity between SMILES and graph in the single-stream approach, we modify our design to pre-train separate encoders, each taking a single modality as input, while keeping the UniMAP training strategies. In particular, we replace the shared Transformer with separate encoders, one for SMILES and one for the graph, and introduce a 2-layer multi-head attention module for late fusion of the embeddings from these separate encoders. The losses of UniMAP remain unchanged throughout this adaptation. Each setting is pre-trained on 1M molecules to save time. We then evaluate the two single-modality models, using either SMILES or graph as input, on the MoleculeNet dataset. The results are presented in Table 11.

Table 11: Comparison of the performance between a single-stream model and separate single-modality models on MoleculeNet.
Method BBBP Sider ClinTox Bace Tox21 Toxcast MUV HIV
UniMAP 76.1(0.4) 65.8(0.1) 98.6(0.2) 80.9(0.3) 76.3(0.6) 60.1(0.4) 72.6(0.3) 77.1(0.4)
SMILES Only 74.1(0.8) 64.2(0.6) 96.1(0.2) 81.2(0.2) 73.6(1.3) 59.0(2.3) 71.4(1.2) 76.4(0.5)
Graph Only 71.3(1.4) 62.9(0.7) 85.6(0.2) 72.1(1.2) 73.9(0.3) 60.9(0.7) 69.8(0.4) 76.8(0.3)

Table 11 shows that the original single-stream model outperforms the separate-encoder settings, indicating that the single-stream model better captures the complementarity of the two input modalities.
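
A minimal sketch of the separate-encoder ablation is given below; the encoder interfaces and dimensions are illustrative assumptions, and only the overall wiring, two single-modality encoders followed by a 2-layer multi-head attention fusion module, follows the description above.

```python
import torch
import torch.nn as nn

class LateFusion(nn.Module):
    # Separate per-modality encoders with a 2-layer multi-head attention
    # module fusing their output embeddings afterwards.
    def __init__(self, smiles_encoder, graph_encoder, dim=768, heads=8):
        super().__init__()
        self.smiles_encoder = smiles_encoder
        self.graph_encoder = graph_encoder
        self.fusion = nn.ModuleList(
            [nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(2)]
        )

    def forward(self, smiles_inputs, graph_inputs):
        s = self.smiles_encoder(smiles_inputs)  # (batch, n_tokens, dim)
        g = self.graph_encoder(graph_inputs)    # (batch, n_atoms, dim)
        x = torch.cat([s, g], dim=1)            # late concatenation
        for layer in self.fusion:
            x, _ = layer(x, x, x)               # two rounds of joint attention
        return x.mean(dim=1)                    # pooled molecule embedding
```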

To further elucidate this, we delve into the BBBP task, where the single-stream model exhibits a significant performance boost. Guided by inductive bias, we analyze typical samples along with the corresponding predictions of the different models. Specifically, we select four molecules and depict the labels and predictions of each model in Figure 7.

Figure 7: Four typical molecules from the BBBP task illustrating the complementarity between SMILES and graph. The first row, (a) and (b), shows cases where the SMILES encoder outperforms the graph encoder, while in the second row, (c) and (d), the opposite is observed. UniMAP, which feeds both modalities into a single-stream model, achieves accurate predictions for all four molecules.

In the first two cases, (a) and (b), the SMILES encoder outperforms the graph encoder. The first molecule features a large ring and multiple chiral centers. SMILES, being a sequential representation, excels at modeling long-distance interactions [61] and is more sensitive to chirality due to its explicit representation with the '@' character [14]; consequently, it predicts such molecules more accurately. The second molecule (b) is an ionic compound lacking covalent connections between its ionic moieties. In 2D graphs, different ionic moieties are depicted as independent graphs, each describing the connectivity of atoms within a single moiety. SMILES, however, connects the representations of the ionic moieties with the symbol ".", creating a contiguous sequence that models fine-grained interactions and helps the model comprehend the overall context of such molecules.

In the last two cases, (c) and (d), the graph encoder outperforms the SMILES encoder. The first molecule contains many rings, which increase rigidity and make it hard to pass through the blood-brain barrier. The second molecule contains a benzodiazepine core, a meaningful substructure recognized for its anti-anxiety and anti-seizure properties, indicating that it can permeate the blood-brain barrier [81]. 2D graphs excel at capturing locally significant structures and topological information, including ring counts and substructures. In SMILES, by contrast, such fragments are often non-adjacent and interrupted by other fragments (e.g., the blue fragment in Figure 5b and the yellow fragment in Figure 5c), posing challenges in outlining local topological structure.

Overall, UniMAP consistently outperforms the single-modality models on all four molecules, indicating the effectiveness of our single-stream approach. Properties like BBBP are influenced by multiple factors, such as chirality, rigidity, and specific fragments; feeding SMILES and the 2D graph into a single-stream model simultaneously enhances the depiction of all these aspects, thereby improving overall performance.

Appendix K Comparison between Single-stream Methods and Two-stream Methods in Molecular Representation Learning Domain

Similar to the vision-language domain, single-stream approaches such as Transformer-M, MoleBLEND, and BindNet [82] excel at capturing fine-grained correlations and achieve higher performance. However, they require pairwise data and incur additional computational cost due to the cross-modality fine-grained interaction.

On the other hand, two-stream methods such as GraphMVP, 3D-Infomax, and DrugCLIP [83] are designed to model weaker correlations, offering flexibility in network design. These methods are particularly suitable for retrieval scenarios, as feature calculation can be conducted offline. For instance, the two-stream method DrugCLIP significantly accelerates the speed of virtual screening.

Transformer-M and MoleBLEND, two other fine-grained pre-training methods, fuse 3D distance information and 2D topological paths into the attention bias of the Transformer. We provide a comparison with MoleBLEND on MoleculeNet in Table 12. Our superior performance aligns with our inductive bias that molecular fragments are more relevant to biological properties and thus more valuable to consider in fine-grained approaches.

Table 12: Comparison of the performance between UniMAP and MoleBLEND on MoleculeNet.
Method BBBP Tox21 MUV BACE ToxCast SIDER ClinTox HIV Avg.
MoleBLEND 73.0(0.8) 77.8(0.8) 77.2(2.3) 83.7(1.4) 66.1(0.0) 64.9(0.3) 87.6(0.7) 79.0(0.8) 76.2
UniMAP 75.6(1.8) 77.2(0.9) 78.9(1.4) 83.6(1.0) 67.0(0.4) 66.6(0.8) 98.3(1.7) 80.1(0.7) 78.4

Appendix L Limitation

UniMAP focuses on the fusion of 1D SMILES and 2D molecular graphs, without involving the 3D modality. Learning 3D representations requires input data with 3D conformations; however, many downstream datasets we focus on in UniMAP, such as MoleculeNet, Davis, and KIBA, do not provide 3D structures. Consequently, training a 3D-based encoder incurs additional computational cost to generate 3D conformations and might restrict the applicability of such methods. Nevertheless, 3D conformations offer structural information beyond 1D or 2D representations, and our motivation of capturing fine-grained semantics between modalities is compatible with 3D input; integrating 3D data into UniMAP is therefore worth exploring in future work.

NeurIPS Paper Checklist

  1. Claims

    Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?

    Answer: [Yes]

    Justification: The main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope.

    Guidelines:

    • The answer NA means that the abstract and introduction do not include the claims made in the paper.

    • The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers.

    • The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings.

    • It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

  2. Limitations

    Question: Does the paper discuss the limitations of the work performed by the authors?

    Answer: [Yes]

    Justification: We discuss the limitations of UniMAP in Appendix L.

    Guidelines:

    • The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper.

    • The authors are encouraged to create a separate "Limitations" section in their paper.

    • The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be.

    • The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated.

    • The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon.

    • The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size.

    • If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness.

    • While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren’t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.

  3. Theory Assumptions and Proofs

    Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?

    Answer: [N/A]

    Justification: Our paper does not include theoretical results.

    Guidelines:

    • The answer NA means that the paper does not include theoretical results.

    • All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced.

    • All assumptions should be clearly stated or referenced in the statement of any theorems.

    • The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition.

    • Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material.

    • Theorems and Lemmas that the proof relies upon should be properly referenced.

  4. Experimental Result Reproducibility

    Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?

    Answer: [Yes]

    Justification: We have provided detailed descriptions of the experiments along with open-source code to reproduce our results.

    Guidelines:

    • The answer NA means that the paper does not include experiments.

    • If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not.

    • If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable.

    • Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed.

    • While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example

      1. (a)

        If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm.

      2. (b)

        If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully.

      3. (c)

        If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset).

      4. (d)

        We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.

  5. Open access to data and code

    Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?

    Answer: [Yes]

    Justification: We have provided the open-source code and data in Appendix A.

    Guidelines:

    • The answer NA means that the paper does not include experiments requiring code.

    • Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.

    • While we encourage the release of code and data, we understand that this might not be possible, so “No” is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark).

    • The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.

    • The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc.

    • The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why.

    • At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable).

    • Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

  6. Experimental Setting/Details

    Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?

    Answer: [Yes]

    Justification: We have specified the training and test details in Section 3 and Appendix B.

    Guidelines:

    • The answer NA means that the paper does not include experiments.

    • The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them.

    • The full details can be provided either with the code, in appendix, or as supplemental material.

  7. Experiment Statistical Significance

    Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?

    Answer: [Yes]

    Justification: We have reported the mean and standard deviation values for three different random seeds in the MoleculeNet experiments.

    Guidelines:

    • The answer NA means that the paper does not include experiments.

    • The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.

    • The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).

    • The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.)

    • The assumptions made should be given (e.g., Normally distributed errors).

    • It should be clear whether the error bar is the standard deviation or the standard error of the mean.

    • It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.

    • For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates).

    • If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

  8. Experiments Compute Resources

    Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?

    Answer: [Yes]

    Justification: The details about the compute resources are provided in Section 3.

    Guidelines:

    • The answer NA means that the paper does not include experiments.

    • The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage.

    • The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.

    • The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn’t make it into the paper).

  9. Code Of Ethics

    Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?

    Answer: [Yes]

    Justification: The research conducted in the paper conforms with the NeurIPS Code of Ethics.

    Guidelines:

    • The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.

    • If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.

    • The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

  10. Broader Impacts

    Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?

    Answer: [N/A]

    Justification: There is no societal impact of the work performed.

    Guidelines:

    • The answer NA means that there is no societal impact of the work performed.

    • If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.

    • Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.

    • The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster.

    • The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.

    • If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

  11. Safeguards

    Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?

    Answer: [N/A]

    Justification: The paper poses no such risks.

    Guidelines:

    • The answer NA means that the paper poses no such risks.

    • Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.

    • Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.

    • We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

  12. Licenses for existing assets

    Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?

    Answer: [Yes]

    Justification: We have cited the original papers that produced the datasets.

    Guidelines:

    • The answer NA means that the paper does not use existing assets.

    • The authors should cite the original paper that produced the code package or dataset.

    • The authors should state which version of the asset is used and, if possible, include a URL.

    • The name of the license (e.g., CC-BY 4.0) should be included for each asset.

    • For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.

    • If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.

    • For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.

    • If this information is not available online, the authors are encouraged to reach out to the asset’s creators.

  13. New Assets

    Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?

    Answer: [N/A]

    Justification: The paper does not release new assets.

    Guidelines:

    • The answer NA means that the paper does not release new assets.

    • Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.

    • The paper should discuss whether and how consent was obtained from people whose asset is used.

    • At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

  14. Crowdsourcing and Research with Human Subjects

    Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?

    Answer: [N/A]

    Justification: The paper does not involve crowdsourcing or research with human subjects.

    Guidelines:

    • The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.

    • Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.

    • According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

  15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects

    Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?

    Answer: [N/A]

    Justification: The paper does not involve crowdsourcing nor research with human subjects.

    Guidelines:

    • The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.

    • Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.

    • We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.

    • For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.