PSC-CPI: Multi-Scale Protein Sequence-Structure Contrasting for Efficient and Generalizable Compound-Protein Interaction Prediction
Abstract
Compound-Protein Interaction (CPI) prediction aims to predict the pattern and strength of compound-protein interactions for rational drug discovery. Existing deep learning-based methods utilize only the single modality of protein sequences or structures and lack the co-modeling of the joint distribution of the two modalities, which may lead to significant performance drops in complex real-world scenarios due to various factors, e.g., modality missing and domain shifting. More importantly, these methods only model protein sequences and structures at a single fixed scale, neglecting more fine-grained multi-scale information, such as those embedded in key protein fragments. In this paper, we propose a novel multi-scale Protein Sequence-structure Contrasting framework for CPI prediction (PSC-CPI), which captures the dependencies between protein sequences and structures through both intra-modality and cross-modality contrasting. We further apply length-variable protein augmentation to allow contrasting to be performed at different scales, from the amino acid level to the sequence level. Finally, in order to more fairly evaluate the model generalizability, we split the test data into four settings based on whether compounds and proteins have been observed during the training stage. Extensive experiments have shown that PSC-CPI generalizes well in all four settings, particularly in the more challenging “Unseen-Both” setting, where neither compounds nor proteins have been observed during training. Furthermore, even when encountering a situation of modality missing, i.e., inference with only single-modality data, PSC-CPI still exhibits comparable or even better performance than previous approaches.
Introduction
While various experimental assays (Bleicher et al. 2003; Inglese and Auld 2007; Mayr and Bojanic 2009) have been applied to screen drug candidates, identifying valid drugs with desirable properties from the enormous chemical space (estimated to contain potential “drug-like” molecule compounds (Bohacek, McMartin, and Guida 1996; Karimi et al. 2020)) is still expensive and time-consuming. To overcome this bottleneck, a number of computational methods for Compound-Protein Interaction (CPI) prediction (You et al. 2020; Karimi et al. 2020; Gao et al. 2018; Lim et al. 2021) have been proposed to screen drugs virtually in a high-throughput way. The primary purpose of CPI prediction is to facilitate drug discovery by predicting the interaction pattern (e.g., contact map) and strength (e.g., binding affinity) of the CPI. An example of the compound-protein interaction between the protein target of dipeptidyl peptidase-4 (DPP-4) and the molecular drug of alogliptin is shown in Fig. 1.

Computational methods for CPI prediction can be mainly divided into two categories: simulation-based methods and deep learning-based methods. Molecular docking (Trott and Olson 2010; Verdonk et al. 2003; Fan, Fu, and Zhang 2019; Pinzi and Rastelli 2019; Lin et al. 2023; Sethi et al. 2019) and molecular dynamics simulations (Salo-Ahen et al. 2020; Hollingsworth and Dror 2018) are two typical simulation-based methods. They utilize unimodal 3D protein structures to predict both the interaction sites as well as the binding postures. Despite the remarkable success, these methods (1) rely heavily on the availability of protein 3D structures and (2) require tremendous computational resources and are very time-consuming. With the development of deep learning techniques, there have been many deep learning-based methods (Lim et al. 2021) proposed for high-efficiency CPI prediction, making it possible to achieve large-scale drug screening in a relatively short time. Moreover, a large number of protein structure-free methods (Gao et al. 2018; Li et al. 2020; Karimi et al. 2020; Gao et al. 2023) are proposed to reduce the reliance on protein 3D structures. They can accurately predict the CPIs using only the protein sequences. However, it is the structure of a protein, rather than its sequence, that is the key to determining its functions and interactions with compounds. To combine the strengths of the two modalities, a recent work (You and Shen 2022) proposes to integrate the representations of protein sequences and structures through a complex cross-attention architecture, but it fails to model sequence-structure dependencies.
A desirable framework for CPI prediction should generally be efficient, effective, and generalizable, while two major bottlenecks deriving from real-world data may hinder the development of CPI methods. Modality Missing: While joint modeling of sequence-structure is of great benefit for CPI prediction during training, a problem often encountered in practical inference is modality missing, i.e., there is only ONE protein modality, either sequence or structure, that is available for inference. More importantly, we cannot presuppose which modality of protein data (or both) we can obtain. Domain Shifting: Most existing methods work well on trainset-homologous test data but are hard to generalize to more practical (trainset-heterologous) test data, where compounds, proteins, or both have never been observed during training. Thus, how to deal with the train-test gaps in real-world scenarios is still an important issue for CPI prediction.
In this paper, we propose a novel multi-scale Protein Sequence-structure Contrasting framework for CPI prediction (PSC-CPI) to address the above challenges. Firstly, PSC-CPI jointly pre-trains protein sequence and structure encoders to capture their dependencies by intra-modality and cross-modality contrasting. As a result, pre-trained sequence and structure encoders can enjoy the benefits of multimodal information during training, but do not require two protein modalities to be provided for inference. Secondly, a variable-length protein augmentation module is introduced, allowing both two contrasting to be performed at different scales to capture fine-grained multi-scale information embedded in key protein fragments. Finally, in order to more fairly evaluate the model generalizability, we split the test data into four settings based on whether compounds and proteins have been observed during training. Extensive experiments have shown that PSC-CPI generalizes well in all four settings, particularly in the more challenging “Unseen-Both” setting, where neither compounds nor proteins have been observed during training. Furthermore, PSC-CPI performs well for both unimodal and multimodal inference settings; more importantly, even when inferring with protein data of one single modality, PSC-CPI still demonstrates comparable or even better performance than previous leading methods. The source codes and related appendixes are available at: https://github.com/LirongWu/PSC-CPI.
Related Work
Conventional Methods for CPI.
Identifying compound-protein interactions plays a very important role in drug discovery. Since it is expensive and time-consuming to screen drug candidates from a large chemical space through various experimental assays (Bleicher et al. 2003; Inglese and Auld 2007; Mayr and Bojanic 2009), virtual screening by molecular docking (Trott and Olson 2010; Fan, Fu, and Zhang 2019; Sethi et al. 2019) or molecular dynamics simulations (Salo-Ahen et al. 2020; Hollingsworth and Dror 2018) has been studied for decades with great success in drug discovery. However, these simulation-based methods may not work well when the 3D structure of the protein is unknown or the number of known ligands is too small (Chen et al. 2020a). Recent advances in deep learning have provided new insights to reduce the reliance on 3D protein structures and to develop deep learning-based methods for CPI prediction.
Deep learning-based Methods.
Most deep learning-based methods treat compounds as 1D sequences or molecular graphs and treat proteins as 1D sequences and then jointly perform representation learning and interaction prediction in an end-to-end unified framework. For example, DeepDTA (Öztürk, Özgür, and Ozkirimli 2018) and DeepConvDTI (Lee, Keum, and Nam 2019) apply Convolutional Neural Networks (CNNs) (LeCun, Bengio et al. 1995) to extract low-dimensional representations of compounds and proteins, concatenated them, and pass it into fully connected layers to predict interactions. Similarly, GraphDTA (Nguyen et al. 2021) treats compounds as molecular graphs and uses Graph Neural Networks (GNNs) (Kipf and Welling 2016; Wu et al. 2021a, 2022b, 2023) instead of CNNs to learn compound representations. Besides, Recurrent Neural Networks (RNN) (Armenteros et al. 2020) are used by DeepAffinity+ (Karimi et al. 2020) to extract representations from sequential compounds. To better integrate compound and protein representations, TransformerCPI (Chen et al. 2020a) and HyperattentionDTI (Zhao et al. 2022) propose to learn joint compound-protein representations using a self-attentive mechanism. Recently, PerceiverCPI (Nguyen et al. 2023) proposes a cross-attention mechanism to improve the learning ability of the representation of compounds and protein interactions. Despite the great success, the above works have mostly modeled only the sequence information of proteins through CNN, RNN, LSTM (Hochreiter and Schmidhuber 1997), etc. However, it is the structure of a protein, not the sequence, that determines its functions and interactions with compounds. For this reason, an elaborate Cross-Interaction architecture is proposed in (You and Shen 2022), which improves CPI predictions by integrating the representations of protein sequences and structures. However, it fails to capture the sequence-structure dependencies and works only when both modalities are provided for inference.
Contrastive Learning on Proteins.
Recent years have witnessed the great success of Contrastive Learning (CL) in protein representation learning (Wu et al. 2022a; Huang et al. 2023a, b; Tan et al. 2023). However, most previous studies have focused on contrasting within a single protein modality, either sequence (Lu et al. 2020) or structure (Hermosilla and Ropinski 2022; Zhang et al. 2022). For example, Contrastive Predictive Coding (CPC) (Lu et al. 2020) applies different augmentation transformations on the input sequence to generate different views, and then maximizes the agreement of two jointly sampled pairs against that of two independently sampled pairs. In addition, Multiview Contrast (Hermosilla and Ropinski 2022) proposes to randomly sample two sub-structures from each protein, encoder them into two representations, and finally maximize the similarity between representations from the same protein while minimizing that of different proteins. Despite the great progress in single-modality contrasting, relatively little work is devoted to cross-modality contrasting learning on proteins.

Methodology
A chemical compound can be represented by a molecular graph ), where each node denote an atom in the compound, and each edge denotes a chemical bond between atom and atom . A protein with amino acid residues can be denoted by a string of its sequence, , where each residue is one of the 20 amino acid types. The amino acid sequence of a protein can be folded into a stable structure , forming a special kind of multimodal data . The protein structure can be modeled as a protein graph ), where is the node set of residues, and is the set of edges that connects the residues. Given proteins and compounds , the compound-protein interaction prediction aims to learn two mappings, and , that predict the interaction pattern and interaction strength between compounds and proteins, respectively.
A General Framework for CPI Prediction
A general CPI prediction framework consists of four main components: (1) compound encoder for extracting compound representations from given compound graphs, (2) multimodal protein representation learning for extracting protein representations from given protein sequences, structures, or both two modalities, (3) pairwise interaction pattern prediction for predicting the contact maps between residues of a protein and atoms of a compound and learning compound-protein joint representations, and (4) interaction strength prediction for predicting the compound-protein binding affinity. A high-level overview of the proposed PSC-CPI framework is illustrated in Fig. 2. Next, we introduce key components (1)(3)(4) and defer the discussions of multimodal protein representation learning until next section.
Compound Encoder.
The compound encoder takes molecular graph as input and learns a -dimensional node representation for each atom. In this paper, we adopt Graph Convolutional Networks (GCNs) as the compound encoder, which is a powerful variant of GNNs that have been widely used as a feature extractor for various graph data. Given a graph , GCNs take its adjacency matrix and node features as input and output representation for each node. In this paper, we consider a -layer GCN, which can be formulated as follows,
(1) |
where , represents a normalized adjacency matrix, is an identity matrix, and is a diagonal degree matrix for . In addition, , , and are three parameter matrices with the hidden dimension of .
Pairwise Interaction Pattern Prediction.
This module takes as inputs compound representations and protein representations , first transforms them into a low-dimensional latent space by two independent linear transformations and , then computes the interaction intensity by inner product for each residue-atom pair, and finally normalize it to obtain the interaction intensity between -th atom and -th residue, as follows
(2) | ||||
To obtain the compound-protein joint embeddings, we calculate the Manhattan product of representations of each residue-atom pair and add them with as weights,
(3) |
where and are representations of the -th atom in the compound and -th residue in the protein.

Interaction Strength Prediction.
We take their joint embeddings as input and map it from a high-dimensional space to a non-negative value , i.e., the binding affinity. Specifically, this module consists of a layer of 1D convolution , a layer of maximum pooling , and a 3-layer multilayer perceptrons , formulated as
(4) |
Multi-Scale Sequence-Structure Contrasting
This paper aims to design an architecture-agnostic framework that is applicable to a variety of sequence and structure encoders. More importantly, we expect this framework to well handle the modality missing problem during inference, i.e., to work well regardless of whether the protein sequence, the structure, or both modalities are provided for the inference. To achieve this, a multi-scale sequence-structure contrasting framework is proposed, as shown in Fig. 3, which fully captures the sequence-structure dependencies and multi-scale information through length-variable protein augmentation and intra-/cross-modality contrasting.
Length-Variable Protein Augmentation.
Data augmentation plays a very important role in the common contrastive learning frameworks (Wu et al. 2021b; He et al. 2020; Gao et al. 2022; Devlin et al. 2018; Radford et al. 2019). The main purpose of data augmentation is to generate different augmented views that share the same or similar semantics as the original one. The two main challenges for data augmentation on proteins are: (1) length diversity, different proteins may have different sequence lengths, and (2) key segment variability, key fragments on different proteins may have very different lengths and be located at different positions on the sequence. To tackle these challenges, we augment protein data by sampling length-variable consecutive segments (subsequences) from the entire protein sequence and extracting the corresponding subgraphs. Traditional augmentation methods generally sample protein subsequences with the same length or length ratio and then fix them before training. In this paper, we generate augmented subsequences of different lengths and corresponding subgraphs for each protein in each training epoch. As training proceeds, the length of the augmented protein subsequences keeps changing, enabling the model to “see” the same protein at more different scales, thus capturing more multi-scale information in the protein.




Intra- and Cross-modality Contrasting.
Two different contrastive learning objectives, i.e., intra-modality contrasting and cross-modality contrasting, are introduced in our framework to capture multi-scale information within protein sequences or structures and cross-modality dependencies. Firstly, we feed the -th protein sequence and the augmented subsequences into a sequence encoder to output the sequence representations, as follows
(5) |
where , , and is the number of subsequences. Similarly, we feed the -th protein graph and the augmented subgraphs into a structure encoder to output the structure representations,
(6) |
Following SimCLR (Chen et al. 2020b) and JOAO (You et al. 2021), two different two-layer MLP projection heads, denoted as and , are further applied to map sequence and structure representations to a lower-dimensional space, respectively. Next, a contrastive objective function consisting of intra-modality and cross-modality contrasting is defined to pre-train the sequence and structure encoders. For intra-modality contrasting, we treat a protein sequence (protein graph) and its subsequence (subgraph) as a positive pair and subsequences from other proteins (protein graphs) in the same batch as negative pairs, which can be defined as
(7) | |||
where is the batch size, denotes the cosine similarity, and is the temperature coefficient. The intra-modality contrasting of Eq. (7) transfers knowledge from protein fragments of different lengths to the final representations by maximizing the mutual information of subsequences (subgraphs) and full sequence (protein graph). Conversely, cross-modality contrasting defined in Eq. (8) aims to capture the sequence-structure dependencies by making the subsequence and subgraph of the same protein fragment share similar semantics at different scales. Specifically, cross-modality contrasting treats subsequence and subgraph from the same protein as a positive pair, defined as
(8) | |||
Finally, the total loss function used for the pre-trained sequence and structure encoders can be defined as
(9) |
where is a hyperparameter to trade-off between two losses.
Training and Inference
Training.
The CPI prediction mainly involves two downstream tasks, i.e., strength prediction and pattern prediction. When we know the ground-truth interaction strength between -th protein and -th compound, the objective function for CPI strength prediction is defined as follows,
(10) |
Similarly, if we know the ground-truth interaction pattern between -th protein and -th compound, the objective function for CPI pattern prediction is defined as follows,
(11) | ||||
where (Scardapane et al. 2017), (Tibshirani et al. 2005), and are three structure-aware sparsity regularization adopted by (Karimi et al. 2020) to control the sparsity of the interaction contact map .
Inference.
While it is feasible to train the model using a small number of known sequence-structure pairs, it is overly demanding to acquire both the sequence of a protein and its structure for inference. The number of known protein structures is orders of magnitude lower than the size of the sequence dataset due to the challenges of experimental protein structure determination (Zhang et al. 2022). The extreme data imbalance in the two modalities may lead to a modality missing problem, i.e., existing works, while they may work well in one modality, are hard to extend to the other modality. For a more practical application scenario, we cannot presuppose which modality of protein data (or both) will be available, so developing a general framework suitable for both unimodal and multimodal inference is one of the contributions of this paper. In this paper, we have not directly integrated protein sequences and structures through architectural designs. As an alternative, we pre-trained sequence and structure encoders by performing cross-modality contrasting using a small number of known sequence-structure pairs, aimed at aligning the representation space of sequences and structures. Consequently, despite “seeing” only the protein sequence (structure), the representations output by the pre-trained sequence (structure) encoder also contain part of the structural (sequential) information. As a result, the pre-trained sequence and structure encoders enjoy the benefits of multimodal information during training, but do not require both two protein modalities to be provided for inference.
Illustrations and Pseudo-Code.
We provide in Fig. 4 illustrations of the training and three inference settings. Taking multimodal inference with protein sequences and structures as an example, the pseudo-code for pre-training, fine-tuning, and inference is summarized in Appendix A.
Time Complexity Analysis.
As PSC-CPI is architecture-agnostic, we do not discuss here the time complexity of compound encoder, protein sequence and structure encoder. The time complexity of remaining key modules in PSC-CPI is as follows: (1) Multi-scale Contrasting ; (2) Pattern Prediction ; and (3) Strength Prediction , where is the dimensions of hidden space and is the number of subsequences. The total time complexity is square and linear w.r.t the number of proteins and the number of compounds , respectively.
Experiments
Experimental Setups
Datasets. The experiments are mainly conducted on a public compound-protein dataset (You and Shen 2022; Karimi et al. 2020), namely Karimi, which contains 4,446 pairs between 1,287 proteins and 3,672 compounds that are collected from PDBbind (Liu et al. 2015) and BindingDB (Liu et al. 2007). To better evaluate the generalizability, we split the test data into four subsets based on whether compounds and proteins have been seen in the training data: (1) Seen-Both (591 pairs): both have been seen; (2) Unseen-Comp (521 pairs): only proteins have been seen; (3) Unseen-Prot (795 pairs): only compounds have been seen; and (4) Unseen-Both (202 pairs): both have never been seen. A statistical histogram of the length of the protein and the number of atoms in the compound is shown in Fig. 5. In addition, three common datasets, Davis (Davis et al. 2011), KIBA (Tang et al. 2014), and Mert (Metz et al. 2011), are further used to evaluate the Unseen-Both setting, and we apply RaptorX-Contact (Xu 2019) to obtain their corresponding protein graphs from protein sequences. Note that unlike previous protein pre-training methods for learning transferable knowledge from large amounts of unlabeled data, this paper aims to facilitate CPI prediction by capturing sequence-structure dependencies with a small number of known sequence-structure pairs, and thus we only pre-train on the same data provided by the downstream task without using additional unlabeled data.


Sequ. | Stru. | PSC | Seen-Both | Unseen-Comp | Unseen-Prot | Unseen-Both | Avarage | |||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
AUPRC | RMSE | AUPRC | RMSE | AUPRC | RMSE | AUPRC | RMSE | AUPRC | RMSE | |||
✔ | ✗ | ✗ | 22.05 | 1.56 | 19.32 | 1.48 | 6.48 | 1.66 | 5.62 | 1.75 | 13.37 | 1.61 |
✔ | 22.29 | 1.48 | 21.43 | 1.37 | 7.01 | 1.54 | 6.64 | 1.59 | 14.34 | 1.49 | ||
1.09% | 5.13% | 10.92% | 7.43% | 8.18% | 7.23% | 18.15% | 9.14% | 7.16% | 7.45% | |||
✗ | ✔ | ✗ | 22.11 | 1.58 | 21.56 | 1.52 | 10.70 | 1.73 | 9.40 | 1.80 | 15.94 | 1.66 |
✔ | 24.26 | 1.53 | 23.78 | 1.43 | 11.14 | 1.52 | 10.62 | 1.66 | 17.45 | 1.54 | ||
9.72% | 3.16% | 10.30% | 5.92% | 4.11% | 12.14% | 12.98% | 7.78% | 9.47% | 7.23% | |||
✔ | ✔ | ✗ | 23.86 | 1.55 | 23.12 | 1.44 | 9.06 | 1.61 | 8.52 | 1.65 | 16.14 | 1.56 |
✔ | 25.42 | 1.42 | 24.67 | 1.31 | 11.03 | 1.47 | 11.65 | 1.52 | 18.19 | 1.43 | ||
6.54% | 8.39% | 6.70% | 9.03% | 21.74% | 8.70% | 36.73% | 7.88% | 12.70% | 8.33% |
Methods | Davis | KIBA | Metz | |||
---|---|---|---|---|---|---|
MSE | CIndex | MSE | CIndex | MSE | CIndex | |
DeepConvDTI (Lee, Keum, and Nam 2019) | ||||||
GraphDTA (Nguyen et al. 2021) | ||||||
DeepAffinity+ (Karimi et al. 2020) | ||||||
HyperattentionDTI (Zhao et al. 2022) | ||||||
TransformerCPI (Chen et al. 2020a) | ||||||
PerceiverCPI (Nguyen et al. 2023) | ||||||
Cross-Interaction (You and Shen 2022) | ||||||
PSC-CPI (ours) |
Hyperparameter. The hyperparameters are set the same for all four datasets: Adam optimizer with learning rate = 5e-5, weight decay = 5e-4, = 0.001, and Epoch = 200. The other dataset-specific hyperparameters are determined by an AutoML toolkit NNI with the search spaces as hidden dimension ; batch size , temperature , loss weight , and .
Comparative Results
To evaluate the effectiveness of PSC-CPI under modality missing during inference, we report the performance of CPI pattern prediction (measured by AUPRC) and strength prediction (measured by RMSE) under four different test data splits on the Karimi dataset in Table. 1. We first train the model using a small number of known sequence-structure pairs and then test its generalization under three different inference settings, where protein sequence, structure, or both modalities are provided. For unimodal inference with only protein sequences or structures, we adopt HRNN and GAT (Veličković et al. 2017) as sequence encoder and structure encoder, respectively. For multimodal inference with both sequences and structures, we concatenate the outputs of HRNN and GAT as final representations. We can observe from Table. 1 that: (1) Strengths of multimodality. The protein sequences and structures have their own strengths for different tasks; for example, sequences are more beneficial for strength prediction, while structures help more in predicting interaction patterns. However, inference with both modalities, either pre-trained w/ and w/o PSC, combines their strengths and outperforms both individual modalities. (2) Applicability to different inference settings. The performance of pre-training with PSC consistently improves the vanilla CPI model w/o PSC regardless of the unimodal or multimodal inference. More importantly, by pre-training with PSC, the results of inference with only unimodal data can even outperform multimodal methods. (3) Generalizability. Pre-training with PSC shows noticeable advantages under all four test data splits, especially in the “Unseen-Both” setting. Due to space limitations, experiments on more metrics and architectures are placed in Appendix B.
To further compare PSC-CPI with other state-of-the-art (SOTA) competitors, we evaluated their performance of CPI binding affinity prediction on three public datasets (Davis, KIBA, and Mert, all in the “Unseen-Both” setting), using Mean Squared Error (MSE) and Concordance Index (CIndex) as metrics. The benchmarks for comparison include DeepConvDTI, TransformerCPI, HyperattentionDTI, PerceiverCPI, DeepAffinity+, and Cross-Interaction. Following the experimental setup in (Nguyen et al. 2023), we transform a few binary classification models, such as TransformerCPI, DeepconvDTI, and HyperattentionDTI into regression models by modifying their final layers for a fair comparison. From the reported results in Table. 2, it can be observed that two multimodal methods, Cross-Interaction and PSC, both show fairly good performance. However, Cross-Interaction still slightly lags behind the SOTA method - PerceiverCPI, while our PSC-CPI exceeds PerceiverCPI by a little bit, achieving the best in 5 out of 6 metrics for the three datasets.





Evaluation on Protein Lengths and Atom Numbers
To compare the performance of PSC-CPI with other baselines at different protein lengths and number of atoms, we select three representative methods (TransformerCPI, PerceiverCPI, and Cross-Interaction) and report their performance averaged over four different test data splits on the Karimi dataset, where strength prediction and pattern prediction are measured by AUPRC (higher is better) and RMSE (lower is better), respectively. As can be seen from the results in Fig. 6, the performance gains of PSC-CPI over other baselines keep expanding as the protein length and number of atoms increase. This indicates that PSC-CPI has a greater advantage in dealing with complex proteins or compounds due to the multi-scale information it captures.
Methods | Seen-Both | Unseen-Comp | Unseen-Prot | Unseen-Both | ||||
---|---|---|---|---|---|---|---|---|
AUPRC | RMSE | AUPRC | RMSE | AUPRC | RMSE | AUPRC | RMSE | |
Vanilla CPI (w/o PSC) | 23.86 | 1.55 | 23.12 | 1.44 | 9.06 | 1.61 | 8.52 | 1.65 |
PSC-CPI (full model) | 25.42 | 1.42 | 24.67 | 1.31 | 11.03 | 1.47 | 11.65 | 1.52 |
w/o Intra-modality CL | 25.36 | 1.44 | 24.47 | 1.32 | 10.46 | 1.50 | 11.30 | 1.54 |
w/o Cross-modality CL | 25.06 | 1.47 | 23.95 | 1.36 | 10.88 | 1.48 | 10.34 | 1.58 |
w/o Length-variable DA | 24.69 | 1.49 | 24.00 | 1.37 | 10.69 | 1.53 | 10.41 | 1.60 |
Ablation Study & Visualizations
To explore how different pre-training contrastive losses and augmentation strategies influence performance, we compare the vanilla CPI model (without pre-training with PSC) with four other schemes: (A) full model: pre-training with both two contrasting and length-variable augmentation; (B) w/o Intra-modality CL: pre-training without intra-modality contrasting; (C) w/o Cross-modality CL: pre-training without cross-modality contrasting; and (D) w/o length-variable augmentation: pre-training with both two contrasting but with length-fixed augmentation. We can observe from Table. 3 that (1) Both intra-modality and cross-modality contrasting help improve performance, especially the latter, suggesting that the sequence-structure dependence helps more than the multi-scale information. (2) The full model combines the strengths of two contrasting methods and outperforms both. (3) Data augmentation plays a very important role in contrastive learning. While length-fixed augmentation works as well, it performs poorer than length-variable augmentation as it ignores the multi-scale information.
To visualize the interaction patterns of our PSC-CPI and other baselines, we select two representative compound-protein pairs and plot the ground-truth labels and predicted results of the contact maps between potential residue sites and atomic sites for four representative methods. In addition, we threshold the predicted contact maps to make their interaction numbers equal to the ground-truth interaction numbers between the protein and the compound, and finally normalize them by the maximum value. For a fair comparison, all four methods default to adopt the pairwise interaction pattern prediction module proposed in this paper. As can be seen from the visualizations and AUC scores in Fig. 7, PerceiverCPI and Cross-Interaction perform much better than TransformerCPI, but still lag far behind PSC-CPI in terms of both qualitative visualizations and quantitative scores.
Conclusion
In this paper, we propose a novel multi-scale Protein Sequence-structure Contrasting (PSC) framework for CPI prediction that is applicable to inference on both unimodal and multimodal protein data. Owing to the length-variable augmentation and intra- and cross-modality contrasting, PSC-CPI has the capability to capture sequence-structure dependencies and multi-scale information, performing well for proteins of various sequence lengths and compounds of various atomic numbers. Extensive experiments show that PSC-CPI generalizes well across various data, especially in a more challenging “Unseen-Both” setting, where neither compounds nor proteins have observed seen during training. Despite much progress, limitations remain. For example, multi-scale modeling only involves the residue-protein scale, and it may be a promising direction to extend it to the atomic scale of proteins. Moreover, we have not explored in depth the efficiency issue, which will be left for future work.
Acknowledgments
This work was supported by National Key R&D Program of China (No. 2022ZD0115100), National Natural Science Foundation of China Project (No. U21A20427), and Project (No. WU2022A009) from the Center of Synthetic Biology and Integrated Bioengineering of Westlake University.
References
- Armenteros et al. (2020) Armenteros, J. J. A.; Johansen, A. R.; Winther, O.; and Nielsen, H. 2020. Language modelling for biological sequences–curated datasets and baselines. BioRxiv.
- Bleicher et al. (2003) Bleicher, K. H.; Böhm, H.-J.; Müller, K.; and Alanine, A. I. 2003. Hit and lead generation: beyond high-throughput screening. Nature reviews Drug discovery, 2(5): 369–378.
- Bohacek, McMartin, and Guida (1996) Bohacek, R. S.; McMartin, C.; and Guida, W. C. 1996. The art and practice of structure-based drug design: a molecular modeling perspective. Medicinal research reviews, 16(1): 3–50.
- Chen et al. (2020a) Chen, L.; Tan, X.; Wang, D.; Zhong, F.; Liu, X.; Yang, T.; Luo, X.; Chen, K.; Jiang, H.; and Zheng, M. 2020a. TransformerCPI: improving compound–protein interaction prediction by sequence-based deep learning with self-attention mechanism and label reversal experiments. Bioinformatics, 36(16): 4406–4414.
- Chen et al. (2020b) Chen, T.; Kornblith, S.; Norouzi, M.; and Hinton, G. 2020b. A simple framework for contrastive learning of visual representations. In International conference on machine learning, 1597–1607. PMLR.
- Davis et al. (2011) Davis, M. I.; Hunt, J. P.; Herrgard, S.; Ciceri, P.; Wodicka, L. M.; Pallares, G.; Hocker, M.; Treiber, D. K.; and Zarrinkar, P. P. 2011. Comprehensive analysis of kinase inhibitor selectivity. Nature biotechnology, 29(11): 1046–1051.
- Devlin et al. (2018) Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Fan, Fu, and Zhang (2019) Fan, J.; Fu, A.; and Zhang, L. 2019. Progress in molecular docking. Quantitative Biology, 7: 83–89.
- Gao et al. (2018) Gao, K. Y.; Fokoue, A.; Luo, H.; Iyengar, A.; Dey, S.; Zhang, P.; et al. 2018. Interpretable drug target prediction using deep neural representation. In IJCAI, volume 2018, 3371–3377.
- Gao et al. (2022) Gao, Z.; Tan, C.; Chacón, P.; and Li, S. Z. 2022. PiFold: Toward effective and efficient protein inverse folding. arXiv preprint arXiv:2209.12643.
- Gao et al. (2023) Gao, Z.; Tan, C.; Zhang, Y.; Chen, X.; Wu, L.; and Li, S. Z. 2023. ProteinInvBench: Benchmarking Protein Inverse Folding on Diverse Tasks, Models, and Metrics. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
- He et al. (2020) He, K.; Fan, H.; Wu, Y.; Xie, S.; and Girshick, R. 2020. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9729–9738.
- Hermosilla and Ropinski (2022) Hermosilla, P.; and Ropinski, T. 2022. Contrastive representation learning for 3d protein structures. arXiv preprint arXiv:2205.15675.
- Hochreiter and Schmidhuber (1997) Hochreiter, S.; and Schmidhuber, J. 1997. Long short-term memory. Neural computation, 9(8): 1735–1780.
- Hollingsworth and Dror (2018) Hollingsworth, S. A.; and Dror, R. O. 2018. Molecular dynamics simulation for all. Neuron, 99(6): 1129–1143.
- Huang et al. (2023a) Huang, Y.; Li, S.; Su, J.; Wu, L.; Zhang, O.; Lin, H.; Qi, J.; Liu, Z.; Gao, Z.; Liu, Y.; et al. 2023a. Protein 3D Graph Structure Learning for Robust Structure-based Protein Property Prediction. arXiv preprint arXiv:2310.11466.
- Huang et al. (2023b) Huang, Y.; Wu, L.; Lin, H.; Zheng, J.; Wang, G.; and Li, S. Z. 2023b. Data-Efficient Protein 3D Geometric Pretraining via Refinement of Diffused Protein Structure Decoy. arXiv preprint arXiv:2302.10888.
- Inglese and Auld (2007) Inglese, J.; and Auld, D. S. 2007. High throughput screening (HTS) techniques: applications in chemical biology. Wiley Encyclopedia of Chemical Biology, 1–15.
- Karimi et al. (2020) Karimi, M.; Wu, D.; Wang, Z.; and Shen, Y. 2020. Explainable deep relational networks for predicting compound–protein affinities and contacts. Journal of chemical information and modeling, 61(1): 46–66.
- Kipf and Welling (2016) Kipf, T. N.; and Welling, M. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
- LeCun, Bengio et al. (1995) LeCun, Y.; Bengio, Y.; et al. 1995. Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks, 3361(10): 1995.
- Lee, Keum, and Nam (2019) Lee, I.; Keum, J.; and Nam, H. 2019. DeepConv-DTI: Prediction of drug-target interactions via deep learning with convolution on protein sequences. PLoS computational biology, 15(6): e1007129.
- Li et al. (2020) Li, S.; Wan, F.; Shu, H.; Jiang, T.; Zhao, D.; and Zeng, J. 2020. MONN: a Multi-Objective Neural Network for Predicting Pairwise Non-Covalent Interactions and Binding Affinities between Compounds and Proteins. In Research in Computational Molecular Biology: 24th Annual International Conference, RECOMB 2020, Padua, Italy, May 10–13, 2020, Proceedings, 259–260. Springer.
- Lim et al. (2021) Lim, S.; Lu, Y.; Cho, C. Y.; Sung, I.; Kim, J.; Kim, Y.; Park, S.; and Kim, S. 2021. A review on compound-protein interaction prediction methods: data, format, representation and model. Computational and Structural Biotechnology Journal, 19: 1541–1556.
- Lin et al. (2023) Lin, H.; Huang, Y.; Zhang, H.; Wu, L.; Li, S.; Chen, Z.; and Li, S. Z. 2023. Functional-Group-Based Diffusion for Pocket-Specific Molecule Generation and Elaboration. arXiv preprint arXiv:2306.13769.
- Liu et al. (2007) Liu, T.; Lin, Y.; Wen, X.; Jorissen, R. N.; and Gilson, M. K. 2007. BindingDB: a web-accessible database of experimentally determined protein–ligand binding affinities. Nucleic acids research, 35(suppl_1): D198–D201.
- Liu et al. (2015) Liu, Z.; Li, Y.; Han, L.; Li, J.; Liu, J.; Zhao, Z.; Nie, W.; Liu, Y.; and Wang, R. 2015. PDB-wide collection of binding data: current status of the PDBbind database. Bioinformatics, 31(3): 405–412.
- Lu et al. (2020) Lu, A. X.; Zhang, H.; Ghassemi, M.; and Moses, A. 2020. Self-supervised contrastive learning of protein representations by mutual information maximization. BioRxiv.
- Mayr and Bojanic (2009) Mayr, L. M.; and Bojanic, D. 2009. Novel trends in high-throughput screening. Current opinion in pharmacology, 9(5): 580–588.
- Metz et al. (2011) Metz, J. T.; Johnson, E. F.; Soni, N. B.; Merta, P. J.; Kifle, L.; and Hajduk, P. J. 2011. Navigating the kinome. Nature chemical biology, 7(4): 200–202.
- Nguyen et al. (2023) Nguyen, N.-Q.; Jang, G.; Kim, H.; and Kang, J. 2023. Perceiver CPI: a nested cross-attention network for compound–protein interaction prediction. Bioinformatics, 39(1): btac731.
- Nguyen et al. (2021) Nguyen, T.; Le, H.; Quinn, T. P.; Nguyen, T.; Le, T. D.; and Venkatesh, S. 2021. GraphDTA: predicting drug–target binding affinity with graph neural networks. Bioinformatics, 37(8): 1140–1147.
- Öztürk, Özgür, and Ozkirimli (2018) Öztürk, H.; Özgür, A.; and Ozkirimli, E. 2018. DeepDTA: deep drug–target binding affinity prediction. Bioinformatics, 34(17): i821–i829.
- Pinzi and Rastelli (2019) Pinzi, L.; and Rastelli, G. 2019. Molecular docking: shifting paradigms in drug discovery. International journal of molecular sciences, 20(18): 4331.
- Radford et al. (2019) Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; and Sutskever, I. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8): 9.
- Salo-Ahen et al. (2020) Salo-Ahen, O. M.; Alanko, I.; Bhadane, R.; Bonvin, A. M.; Honorato, R. V.; Hossain, S.; Juffer, A. H.; Kabedev, A.; Lahtela-Kakkonen, M.; Larsen, A. S.; et al. 2020. Molecular dynamics simulations in drug discovery and pharmaceutical development. Processes, 9(1): 71.
- Scardapane et al. (2017) Scardapane, S.; Comminiello, D.; Hussain, A.; and Uncini, A. 2017. Group sparse regularization for deep neural networks. Neurocomputing, 241: 81–89.
- Sethi et al. (2019) Sethi, A.; Joshi, K.; Sasikala, K.; and Alvala, M. 2019. Molecular docking in modern drug discovery: Principles and recent applications. Drug discovery and development-new advances, 2: 1–21.
- Tan et al. (2023) Tan, C.; Gao, Z.; Xia, J.; Hu, B.; and Li, S. Z. 2023. Global-Context Aware Generative Protein Design. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1–5. IEEE.
- Tang et al. (2014) Tang, J.; Szwajda, A.; Shakyawar, S.; Xu, T.; Hintsanen, P.; Wennerberg, K.; and Aittokallio, T. 2014. Making sense of large-scale kinase inhibitor bioactivity data sets: a comparative and integrative analysis. Journal of Chemical Information and Modeling, 54(3): 735–743.
- Tibshirani et al. (2005) Tibshirani, R.; Saunders, M.; Rosset, S.; Zhu, J.; and Knight, K. 2005. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(1): 91–108.
- Trott and Olson (2010) Trott, O.; and Olson, A. J. 2010. AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. Journal of computational chemistry, 31(2): 455–461.
- Veličković et al. (2017) Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; and Bengio, Y. 2017. Graph attention networks. arXiv preprint arXiv:1710.10903.
- Verdonk et al. (2003) Verdonk, M. L.; Cole, J. C.; Hartshorn, M. J.; Murray, C. W.; and Taylor, R. D. 2003. Improved protein–ligand docking using GOLD. Proteins: Structure, Function, and Bioinformatics, 52(4): 609–623.
- Wu et al. (2022a) Wu, L.; Huang, Y.; Lin, H.; and Li, S. Z. 2022a. A survey on protein representation learning: Retrospect and prospect. arXiv preprint arXiv:2301.00813.
- Wu et al. (2021a) Wu, L.; Lin, H.; Gao, Z.; Tan, C.; Li, S.; et al. 2021a. Graphmixup: Improving class-imbalanced node classification on graphs by self-supervised context prediction. arXiv preprint arXiv:2106.11133.
- Wu et al. (2022b) Wu, L.; Lin, H.; Huang, Y.; and Li, S. Z. 2022b. Knowledge distillation improves graph structure augmentation for graph neural networks. Advances in Neural Information Processing Systems, 35: 11815–11827.
- Wu et al. (2023) Wu, L.; Lin, H.; Huang, Y.; and Li, S. Z. 2023. Quantifying the Knowledge in GNNs for Reliable Distillation into MLPs. arXiv preprint arXiv:2306.05628.
- Wu et al. (2021b) Wu, L.; Lin, H.; Tan, C.; Gao, Z.; and Li, S. Z. 2021b. Self-supervised learning on graphs: Contrastive, generative, or predictive. IEEE Transactions on Knowledge and Data Engineering.
- Xu (2019) Xu, J. 2019. Distance-based protein folding powered by deep learning. Proceedings of the National Academy of Sciences, 116(34): 16856–16865.
- You et al. (2021) You, Y.; Chen, T.; Shen, Y.; and Wang, Z. 2021. Graph contrastive learning automated. In International Conference on Machine Learning, 12121–12132. PMLR.
- You et al. (2020) You, Y.; Chen, T.; Wang, Z.; and Shen, Y. 2020. When does self-supervision help graph convolutional networks? In International Conference on Machine Learning, 10871–10880. PMLR.
- You and Shen (2022) You, Y.; and Shen, Y. 2022. Cross-modality and self-supervised protein embedding for compound–protein affinity and contact prediction. Bioinformatics, 38(Supplement_2): ii68–ii74.
- Zhang et al. (2022) Zhang, Z.; Xu, M.; Jamasb, A.; Chenthamarakshan, V.; Lozano, A.; Das, P.; and Tang, J. 2022. Protein representation learning by geometric structure pretraining. arXiv preprint arXiv:2203.06125.
- Zhao et al. (2022) Zhao, Q.; Zhao, H.; Zheng, K.; and Wang, J. 2022. HyperAttentionDTI: improving drug–protein interaction prediction by sequence-based deep learning with attention mechanism. Bioinformatics, 38(3): 655–662.