
Relation-aware graph structure embedding with co-contrastive learning for drug-drug interaction prediction

Mengying Jiang [email protected] Guizhong Liu [email protected] Biao Zhao [email protected] Yuanchao Su [email protected] Weiqiang Jin [email protected] School of Electronic and Information Engineering, Xi’an Jiaotong University, Xi’an 710049, China College of Geomatics, Xi’an University of Science and Technology, Xi’an 710054, China
Abstract

Relation-aware graph structure embedding is promising for predicting multi-relational drug-drug interactions (DDIs). Typically, existing methods begin by constructing a multi-relational DDI graph and then learning relation-aware graph structure embeddings (RaGSEs) of drugs from the DDI graph. Nevertheless, most existing approaches struggle to learn RaGSEs of new drugs, leading to serious over-fitting when the test DDIs involve such drugs. To alleviate this issue, we propose a novel DDI prediction method based on relation-aware graph structure embedding with co-contrastive learning, RaGSECo. The proposed RaGSECo constructs two heterogeneous drug graphs: a multi-relational DDI graph and a multi-attribute drug-drug similarity (DDS) graph. The two graphs are used respectively for learning and propagating the RaGSEs of drugs, aiming to ensure that all drugs, including new ones, possess effective RaGSEs. Additionally, we present a novel co-contrastive learning module to learn drug-pair (DP) representations. This mechanism learns DP representations from two distinct views (the interaction and similarity views) and encourages the views to supervise each other collaboratively to obtain more discriminative DP representations. We evaluate the effectiveness of our RaGSECo on three different tasks using two real datasets. The experimental results demonstrate that RaGSECo outperforms existing state-of-the-art prediction methods.

keywords:
Adverse drug reactions , Graph neural networks , Graph structure embedding , Self-supervised learning
Highlights

We propose propagating relation-aware graph structure embeddings from known drugs to new drugs with similar features.

This is the first attempt to learn drug-pair feature representations via cross-view contrastive learning, with the interaction and similarity views as the selected views.

We design a novel positive sample selection strategy that considers the underlying correlations between drug pairs, further enhancing the effectiveness of cross-view contrastive learning.

1 Introduction

Human diseases are often the result of complex biological processes, and single-drug treatments are often insufficient to cure them entirely [18, 51]. Consequently, combination drug therapy has become attractive as it can reduce drug resistance and improve efficacy [20, 6]. Nevertheless, the simultaneous use of multiple drugs can alter their properties and lead to adverse drug interactions, which can cause harm to patients [41]. Therefore, identifying potential DDIs is crucial for safe coadministration [50]. Traditional methods for detecting DDIs, such as biological or pharmacological assays, are labor-intensive, time-consuming, and expensive [12, 23]. In contrast, computation-based methods are generally less expensive and can achieve higher accuracy than traditional methods [53]. In recent years, many computational methods have been proposed for predicting DDI events [58].

Many existing works regard drugs with known interactions as known drugs and refer to those without known interactions as new drugs [5]. Typically, the lack of interaction information on new drugs makes it difficult for models to predict the corresponding interactions accurately [5]. Nevertheless, predicting interactions involving new drugs is significant for the efficient development of new drugs [37]. Therefore, many DDI prediction approaches address this task [30]. For example, Deng et al. [5] proposed a multimodal deep learning framework for predicting DDI events (DDIMDL). DDIMDL employs multiple biochemical attributes (such as enzymes, targets, pathways, and chemical substructures) to compute multiple similarities and builds a deep neural network (DNN) model for multi-relational DDI prediction. Zhang et al. [52] proposed the sparse feature learning ensemble method with linear neighborhood regularization (SFLLN), which, like DDIMDL, combines multiple drug features with known DDIs to predict novel DDIs. Lin et al. [25] proposed a DDI prediction method that jointly utilizes multi-source drug fusion, multi-source feature fusion, and the transformer (MDF-SA-DDI). MDF-SA-DDI first utilizes four different drug fusion networks to encode latent features of DPs and then adopts transformers to perform latent feature fusion for DP representation learning. The aforementioned methods employ multiple attributes to learn representative drug embeddings and have achieved leading performance when predicting interactions involving new drugs. Nevertheless, to ensure that the training and test sets have the same data distribution, these methods do not use the multi-relational interaction information between drugs, limiting DDI prediction performance.

Recently, self-supervised learning, which aims to derive supervision signals spontaneously from the data itself, has emerged as a promising strategy for effective representation learning [46]. Among the various techniques under the umbrella of self-supervised learning, contrastive learning has attracted substantial attention [29]. Contrastive learning first extracts positive and negative samples from the data and then maximizes the similarities between positive samples while minimizing the similarities between negative samples [55]. In this way, contrastive learning can learn discriminative representations even without abundant labels. Despite the broad application of contrastive learning in computer vision [27] and natural language processing [54], its potential in the context of DDI prediction has been scarcely explored. Implementing contrastive learning to facilitate DDI prediction is by no means trivial and requires careful consideration of the task-specific characteristics and nuances of contrastive learning.

Based on the above discussion, the primary motivation for our work lies in enabling all drugs to distill effective RaGSEs, thereby facilitating DP representation learning and improving DDI prediction. We propose a relation-aware graph structure embedding with co-contrastive learning framework for DDI prediction, abbreviated as RaGSECo, an end-to-end learning model. Implementing our RaGSECo approach includes two main steps: drug embedding and DP representation learning. The relation-aware graph structure embedding learning and propagation (RaGSELP) method is proposed for drug embedding. RaGSELP constructs a multi-relational DDI graph using known DDIs (the training set). Then, RaGSELP learns RaGSEs of known drugs by aggregating their neighbors' features under different relations. Inspired by the assumption that similar drugs may interact with the same drugs [35, 25, 5], RaGSELP constructs a multi-attribute DDS graph. Within this DDS graph, RaGSELP learns embeddings of new drugs by aggregating their neighbors' RaGSEs, aiming to enable all drugs to possess effective RaGSEs. Furthermore, we combine multi-view representation learning and cross-view contrastive learning into a novel co-contrastive learning mechanism for DP representations. Unlike previous contrastive learning, which contrasts the original and corrupted networks, we design two distinct views for DP representation learning: the interaction and similarity views. Specifically, we leverage the RaGSEs of drugs to generate DP representations under the interaction view, and we employ similarities between drugs to generate DP representations under the similarity view. The interaction view primarily utilizes known interaction relationships between drugs. The similarity view can discover inherent connections between drugs and facilitate inferring the potential therapeutic effects, toxicity reactions, or drug interactions of new drugs. Therefore, the two views are complementary. Moreover, we consider the underlying correlations between DPs to design a novel positive selection strategy that enhances cross-view contrastive learning. During training, these two views supervise each other collaboratively and learn more discriminative DP representations.

The main contributions of our work can be summarized as follows:

  • Propagating the RaGSEs of known drugs to new drugs with similar features is innovatively introduced into relation-aware methods. This significantly improves the DDI prediction performance of relation-aware methods, especially in predicting DDIs involving new drugs.

  • To the best of our knowledge, this is the first attempt to perform DP-level cross-view contrastive learning. More discriminative DP representations can be learned through co-contrastive learning in a cross-view manner.

  • The proposed RaGSECo ingeniously contrasts and incorporates two views of the drug information network (the interaction and similarity views), enabling DP representations to capture both known and potential interactions.

2 Related Work

We review previous studies relevant to this work in three areas: relation-aware graph structure embedding, contrastive learning, and drug feature extraction.

2.1 Relation-Aware Graph Structure Embedding

Methods based on relation-aware graph structure embedding learning pay attention to the topology of the graph, typically learning entity embeddings by aggregating information from neighboring entities under different relations [47]. The practicability of these methods for multi-relational DDI prediction has attracted considerable attention. Most existing methods use graph neural networks (GNNs), including relational graph convolutional networks (R-GCNs) [34] and graph attention networks (GATs) [40]. For example, Hong et al. [15] used the GAT to propose the link-aware graph attention network (LaGAT), which learns drug embeddings by aggregating features of neighbors from different attention pathways via different DDI event types, where the attention weights depend on the embedding representations of drugs and their neighbors. Wang et al. [42] proposed the graph of graphs neural network (GoGNN), which captures the information on drug structure and multi-relational drug interactions in a hierarchical way to learn drug embeddings for DDI prediction. Yu et al. [47] proposed the relation-aware network embedding for DDI prediction (RANEDDI), which considers both the multi-relational information between drugs and the relation-aware network structure information of drugs to obtain drug embeddings for DDI prediction. Yu et al. [48] proposed the substructure-aware tensor neural network model for DDI prediction (STNN-DDI), which learns a 3-D tensor of (substructure, substructure, interaction type) triplets that characterizes a substructure-substructure interaction (SSI) space. The aforementioned methods take advantage of interaction information and perform well in the multi-relational DDI prediction task. Nevertheless, the absence of interaction information on new drugs limits their performance when predicting DDIs involving new drugs.

2.2 Contrastive Learning

Approaches based on contrastive learning, which learn representations by contrasting positive pairs against negative pairs, have achieved considerable success across various domains [3]. In this section, we mainly focus on reviewing the contrastive learning methods related to DDI prediction tasks. Zhao et al. [55] constructed original and corrupted networks, then minimized the mutual information between the outputs of the original and corrupted networks, and maximized the mutual information between outputs from only the original or corrupted network. Wang et al. [44] proposed a self-supervised meta-path detection mechanism to train a deep transformer encoder model that can capture the global structure and semantic features in heterogeneous biomedical networks. Gao et al. [10] designed a drug-disease associations view and a drug or disease similarity view, then maximized the mutual information between the two views. Zhuang et al. [59] learned high-level drug representations containing graph-level structural information by maximizing the local-global mutual information. Although many works have been proposed to learn high-level drug embeddings by contrastive learning, little effort has been made toward DP-level contrastive learning. Nevertheless, learning high-level DP representations is more practical for the DDI prediction task.

2.3 Drug Feature Extraction

Drug feature extraction is crucial for model training [25]. Zhu et al. [58] took into account eight drug attributes (molecular substructure, target, enzyme, pathway, side effect, phenotype, gene, and disease) to extract features. Deng et al. [5] jointly considered targets, enzymes, and chemical substructures, achieving outstanding performance [25]. Typically, each attribute is associated with a set of descriptors. A drug can be denoted as a binary feature vector whose elements (1 or 0) indicate the presence or absence of the corresponding descriptors [25]. These sparse binary feature vectors typically have high dimensionalities. In general, high-dimensional inputs can be resource-intensive and induce the curse of dimensionality, leading to extremely poor performance in some cases [25]. Therefore, given that similar drugs may interact with the same drugs, Deng et al. [5] opted to use the Jaccard similarities calculated from the binary feature vectors to define drugs rather than the binary feature vectors themselves. The Jaccard similarity is calculated as follows:

$\operatorname{Jaccard}(U,V)=\frac{|U\cap V|}{|U\cup V|}$ (1)

where $U$ and $V$ are the descriptor sets of two drugs under a specific attribute. Herein, $|U\cap V|$ is the cardinality of the intersection of $U$ and $V$, and $|U\cup V|$ is the cardinality of their union.
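As a minimal illustration, the following Python sketch computes Eq. (1) for two binary descriptor vectors; the toy vectors and the zero-union convention are our own assumptions, not part of the original formulation.

```python
import numpy as np

def jaccard_similarity(u, v):
    """Jaccard similarity (Eq. 1) between two binary descriptor vectors."""
    u = np.asarray(u, dtype=bool)
    v = np.asarray(v, dtype=bool)
    union = np.logical_or(u, v).sum()
    if union == 0:  # neither drug has any descriptor of this attribute
        return 0.0
    return np.logical_and(u, v).sum() / union

# Example: two drugs described by six binary substructure descriptors.
drug_u = [1, 0, 1, 1, 0, 0]
drug_v = [1, 1, 1, 0, 0, 0]
print(jaccard_similarity(drug_u, drug_v))  # 2 shared / 4 in union = 0.5
```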

3 Methodology

3.1 RaGSECo framework

Figure 1: Examples of construction strategies for three different test sets. The nodes represent drugs. Solid edges indicate known DDI interactions (training set), while dotted edges represent the prediction task (test set). The edges with different colors signify various interaction types.

The RaGSECo is proposed for multi-relational DDI prediction, which can be regarded as a multi-class classification task. Based on three different experimental settings, the multi-relational DDI prediction task can be further partitioned into three different tasks, defined as follows:

  • Task 1: predicting unobserved interaction events between known drugs.

  • Task 2: predicting interaction events between known drugs and new drugs.

  • Task 3: predicting interaction events between new drugs.

These three tasks are illustrated in Fig. 1. As the figure shows, the training and test sets contain the same drugs in Task 1. In Task 2, half of the drugs involved in the test set appear in the training set. In Task 3, the training and test sets contain different drugs. Therefore, from Task 1 to Task 3, the prediction difficulty increases in turn.

Taking Task 3 as an example, the framework of our RaGSECo is illustrated in Fig. 2. RaGSECo constructs two heterogeneous drug graphs: a multi-relational DDI graph and a multi-attribute DDS graph. In the DDS graph, nodes denote drugs, and edges represent similarities between drugs. The different edge colors indicate similarities under varying attributes (targets, enzymes, and chemical substructures). Consequently, in the DDS graph, each pair of drugs can have up to three types of edges, whereas in the DDI graph, each pair of drugs has at most one edge. The framework of our RaGSECo contains two primary parts: drug embedding and DP representation learning. The RaGSELP method is proposed to learn drug embeddings, and the cross-view module is developed to learn DP representations for DDI prediction.

Figure 2: Flowchart of the proposed RaGSECo, involving two main parts: RaGSELP (Upper) and co-contrastive learning (Bottom). In the RaGSELP method, we first learn the RaGSEs of known drugs by propagating their neighbors' features under different relations within the DDI graph. Subsequently, we propagate the learned RaGSEs to the connected new drugs within the DDS graph to ensure that all drugs possess effective RaGSEs. Afterward, the co-contrastive learning module takes in multiple features (initial features, RaGSEs, and SMILES strings) of two drugs and then generates the feature representations of the DP. Finally, the Decoder calculates the classification probability distributions of the DP.

3.2 RaGSELP

The RaGSELP method is proposed to learn drug embedding. RaGSELP contains two parts: RaGSE learning (RaGSEL) and RaGSE propagation (RaGSEP). RaGSEL learns the RaGSEs of known drugs, and RaGSEP propagates the RaGSEs of known drugs to new drugs with similar features.

3.2.1 RaGSEL

We construct a multi-relational DDI graph $\mathcal{G}=(\mathcal{D},\mathcal{E},\mathcal{R})$, where nodes represent drugs and edges denote labeled interactions. Herein, $\mathcal{D}$ is the set of all drugs (including known and new drugs), and $|\mathcal{D}|=N$ is the number of drugs. Let $\mathbf{X}=\{\mathbf{x}_{i}\}_{i=1}^{N}\in\mathbb{R}^{N\times d}$ be the initial feature matrix of drugs, derived as described in Section 2.3. $\mathcal{R}$ is the set of interaction event types, and $|\mathcal{R}|=R$ is the number of interaction types. $\mathcal{E}=\{\mathcal{E}_{r}\}_{r=1}^{R}$ is the set of known interactions, where $\mathcal{E}_{r}$ is the set of interactions under interaction type $r$. Let $\mathcal{A}=\{\mathbf{A}_{r}\}_{r=1}^{R}\in\mathbb{R}^{N\times N\times R}$ be the multi-relational adjacency tensor, where $\mathbf{A}_{r}\in\mathbb{R}^{N\times N}$ is the adjacency matrix under interaction relation $r$. Let $A_{r(i,j)}$ $(i,j=1,\dots,N)$ denote the elements of $\mathbf{A}_{r}$, with $A_{r(i,j)}=1$ if $(i,j)\in\mathcal{E}_{r}$ and $A_{r(i,j)}=0$ otherwise.

Subsequently, we use the R-GCN layer [34] to learn the RaGSEs of drugs from the multi-relational DDI graph. The forward propagation function is defined as follows:

$\mathbf{h}_{i}=\sigma\left(\sum_{r\in\mathcal{R}}\sum_{j\in\mathcal{N}_{i}^{r}}\frac{\hat{A}_{r(i,j)}}{R_{i}}\mathbf{x}_{j}\mathbf{W}_{r}+\mathbf{x}_{i}\mathbf{W}_{o}\right)$ (2)

where $\mathbf{h}_{i}\in\mathbb{R}^{1\times d'}$ represents the RaGSEs of drug $i$, and $\mathbf{x}_{i}\in\mathbb{R}^{1\times d}$ is the initial feature vector of drug $i$, $i\in\mathcal{D}$. Herein, $\mathbf{W}_{r}\in\mathbb{R}^{d\times d'}$ is the relation-specific weight matrix, and adopting a set of $\mathbf{W}_{r}$ $(r=1,\dots,R)$ supports multiple edge types. In contrast, $\mathbf{W}_{o}\in\mathbb{R}^{d\times d'}$ is a single weight matrix shared across all relations. Here, $\mathcal{N}_{i}^{r}$ is the set of drugs connected to drug $i$ under relation $r$, and $\sigma(\cdot)$ is an element-wise activation function, $\operatorname{ReLU}(\cdot)=\max(0,\cdot)$. To normalize the incoming messages of each drug, we define a set of normalization constants $\{R_{i}\}_{i=1}^{N}$, where $R_{i}$ equals the number of interaction types in which drug $i$ is involved. $\hat{A}_{r(i,j)}$ is the aggregation weight assigned to the neighboring drug $j$, i.e., the corresponding element of $\hat{\mathbf{A}}_{r}$, which is calculated with a classic graph-based normalization method [19]:

$\hat{\mathbf{A}}_{r}=\mathbf{D}_{r}^{-1/2}\mathbf{A}_{r}\mathbf{D}_{r}^{-1/2}$ (3)

where $\mathbf{D}_{r}$ is the diagonal degree matrix of $\mathbf{A}_{r}$.

Herein, $\mathbf{h}_{i}$ can represent the relation-aware graph structure information of drug $i$ if drug $i$ is known. Nevertheless, if drug $i$ is new, $\mathbf{h}_{i}$ is influenced only by drug $i$ itself. Consequently, directly using the output of Eq. (2) as the drug embeddings for DDI prediction could result in severe over-fitting in Tasks 2 and 3.
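For concreteness, the following is a minimal dense PyTorch sketch of the propagation in Eq. (2); it is illustrative rather than the authors' implementation, and it assumes the normalized adjacencies of Eq. (3) are stacked into one dense tensor (a sparse implementation would be preferable for large $N$).

```python
import torch
import torch.nn as nn

class RGCNLayer(nn.Module):
    """Minimal dense sketch of the R-GCN propagation in Eq. (2).

    A_hat: (R, N, N) stack of the normalized adjacencies from Eq. (3).
    R_inv: (N, 1) tensor holding 1/R_i, with 0 for drugs involved in no
    interaction type (i.e., new drugs).
    """

    def __init__(self, d_in, d_out, num_relations):
        super().__init__()
        self.W_r = nn.Parameter(torch.empty(num_relations, d_in, d_out))
        self.W_o = nn.Parameter(torch.empty(d_in, d_out))
        nn.init.xavier_uniform_(self.W_r)
        nn.init.xavier_uniform_(self.W_o)

    def forward(self, X, A_hat, R_inv):
        # Sum over relations r and neighbors j of A_hat[r,i,j] * x_j W_r.
        msgs = torch.einsum('rij,jd,rde->ie', A_hat, X, self.W_r)
        # Normalize incoming messages, add the self term, apply ReLU.
        return torch.relu(R_inv * msgs + X @ self.W_o)
```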

3.2.2 RaGSEP

Based on the assumption that similar drugs may interact with the same drugs [25, 5], we propose a strategy that propagates the RaGSEs of known drugs to new drugs with similar features. This strategy can effectively overcome the over-fitting issues that arise in Tasks 2 and 3. To facilitate the propagation of embeddings, we construct a multi-attribute DDS graph $\mathcal{G}_{s}=(\mathcal{D},\mathcal{E},\mathcal{S})$, where nodes represent drugs and edges denote Jaccard similarities between drugs under various attributes. Herein, we still employ the three biological attributes (i.e., chemical substructure, enzyme, and target). In the DDS graph, $\mathcal{D}$ is the set of nodes, with $\mathbf{H}=\{\mathbf{h}_{i}\}_{i=1}^{N}$, the output of Eq. (2), as the node feature matrix. $\mathcal{E}$ represents the set of edges. $\mathcal{S}$ is the set of similarity types, and $|\mathcal{S}|=3$ is the number of similarity types. Let $\mathcal{A}=\{\mathbf{A}_{s},\mathbf{A}_{e},\mathbf{A}_{t}\}\in\mathbb{R}^{N\times N\times 3}$ be the adjacency tensor, where $\mathbf{A}_{s}$, $\mathbf{A}_{e}$, and $\mathbf{A}_{t}$ represent the substructure-based, enzyme-based, and target-based adjacency matrices, respectively.

To normalize the incoming messages, these three adjacency matrices also need to be normalized, following Eq. (3). Let $\hat{\mathbf{A}}_{s}$, $\hat{\mathbf{A}}_{e}$, and $\hat{\mathbf{A}}_{t}$ be the normalized adjacency matrices. We use them to propagate the learned RaGSEs to the strongly associated drugs within the DDS graph. The propagation procedure can be expressed as follows:

$\mathbf{H}_{s}=\sigma\left(\hat{\mathbf{A}}_{s}^{n}\mathbf{H}\mathbf{W}_{s}\right)$ (4)
$\mathbf{H}_{e}=\sigma\left(\hat{\mathbf{A}}_{e}^{n}\mathbf{H}\mathbf{W}_{e}\right)$ (5)
$\mathbf{H}_{t}=\sigma\left(\hat{\mathbf{A}}_{t}^{n}\mathbf{H}\mathbf{W}_{t}\right)$ (6)
$\mathbf{H}_{embed}=\mathbf{H}_{s}+\mathbf{H}_{e}+\mathbf{H}_{t}$ (7)

where $\hat{\mathbf{A}}_{s}^{n}\mathbf{H}$, $\hat{\mathbf{A}}_{e}^{n}\mathbf{H}$, and $\hat{\mathbf{A}}_{t}^{n}\mathbf{H}$ represent graph convolutions on the DDS graph. $n$ is a hyperparameter, and $\hat{\mathbf{A}}_{s}^{n}$ is $\hat{\mathbf{A}}_{s}$ raised to the power of $n$; the same definition extends to $\hat{\mathbf{A}}_{e}^{n}$ and $\hat{\mathbf{A}}_{t}^{n}$. This strategy allows RaGSECo to flexibly capture higher-order neighbor information. $\mathbf{W}_{s}$, $\mathbf{W}_{e}$, $\mathbf{W}_{t}\in\mathbb{R}^{d'\times d'}$ are the attribute-specific trainable weight matrices. $\mathbf{H}_{s}$, $\mathbf{H}_{e}$, and $\mathbf{H}_{t}$ are the propagation results along the different pathways. $\mathbf{H}_{embed}=\{\mathbf{h}_{embed,i}\}_{i=1}^{N}\in\mathbb{R}^{N\times d'}$ is the sum of the three propagation results and represents the final drug embeddings. In particular, $\mathbf{h}_{embed,i}$ can represent effective RaGSEs of drug $i$, whether drug $i$ is known or new.
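A minimal sketch of the propagation in Eqs. (4)-(7) might look as follows; the function signature is hypothetical, and we assume the powers $\hat{\mathbf{A}}^{n}$ are computed densely for clarity.

```python
import torch

def ragse_propagation(H, A_hats, Ws, n):
    """Sketch of Eqs. (4)-(7): propagate the RaGSEs H over the DDS graph.

    H: (N, d') output of Eq. (2); A_hats: the three normalized similarity
    adjacencies [A_s, A_e, A_t]; Ws: matching (d', d') weight matrices;
    n: propagation distance (n = 0 gives A^0 = I, i.e., no aggregation).
    """
    H_embed = 0
    for A_hat, W in zip(A_hats, Ws):
        A_n = (torch.linalg.matrix_power(A_hat, n) if n > 0
               else torch.eye(A_hat.size(0)))
        H_embed = H_embed + torch.relu(A_n @ H @ W)  # Eqs. (4)-(6)
    return H_embed  # Eq. (7): sum over the three propagation pathways
```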

Figure 3: Illustration of the co-contrastive learning module. This module accepts the initial feature vectors, RaGSEs, and SMILES strings of two drugs as the input and generates the DP representations. The initial feature vectors of the two drugs are concatenated and fed into an FNN, resulting in the DP representations under the similarity view. In a similar fashion, another FNN is utilized to generate the DP representations from the interaction view. Meanwhile, a convolutional neural network (CNN) is employed to obtain the SMILES-based DP representations. Additionally, a cross-view contrastive mechanism facilitates collaborative supervision between the similarity and interaction views, thereby promoting the learning of more discriminative DP representations.

3.3 Co-Contrastive Learning

The co-contrastive learning module, as illustrated in Fig. 3, is designed to learn DP representations. This module consists of two primary components: multi-view representation learning and collaborative contrastive optimization. The first component learns DP representations from multiple views. The second component facilitates a cooperative process in which these views optimize and supervise each other.

3.3.1 Multi-View Representation Learning

To increase data diversity, facilitate gradient propagation, and generate discriminative DP representations, we employ multiple drug features comprising RaGSEs, initial features, and simplified molecular-input line-entry system (SMILES) strings.

A SMILES string is a line notation that uses a predefined set of rules to describe the structure of compounds, and each drug has its own SMILES string [4]. The characters of a SMILES string represent chemical atoms or bonds, and the lengths of SMILES strings are unconstrained [31]. In this work, we convert each SMILES string to a $p\times q$-dimensional feature matrix for simplicity, where $p$ denotes the number of characters and $q$ is the unified length of the SMILES string [16]. Accordingly, the columns of the SMILES-based feature matrix are one-hot vectors. Since the initial lengths of SMILES strings are flexible, we cut off the extra part if the actual length of a SMILES string is greater than $q$, and we pad zeros if the actual length is less than $q$. Following Huang et al. [16], $p$ and $q$ are set to 64 and 100, respectively.
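This encoding can be sketched as follows; the character vocabulary shown here is purely illustrative (the actual $p=64$ character set follows [16]).

```python
import numpy as np

# Hypothetical vocabulary; the actual p = 64 character set follows [16].
CHARSET = sorted(set("#%()+-.0123456789=@BCFHIKLNOPSZ[]abcdegilnorstu"))

def smiles_to_onehot(smiles, charset=CHARSET, q=100):
    """Encode a SMILES string as a len(charset) x q one-hot matrix,
    truncating strings longer than q and zero-padding shorter ones."""
    index = {ch: i for i, ch in enumerate(charset)}
    mat = np.zeros((len(charset), q), dtype=np.float32)
    for col, ch in enumerate(smiles[:q]):  # cut off the extra part
        if ch in index:  # characters outside the vocabulary stay all-zero
            mat[index[ch], col] = 1.0
    return mat

print(smiles_to_onehot("CC(=O)OC1=CC=CC=C1C(=O)O").shape)  # aspirin
```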

Let $Q$ be the number of known DDIs. Considering computational burdens, we adopt a batch-wise scheme to train RaGSECo. Let $K$ be the number of DPs in a batch. Consider a DP $k$ $(k=1,\dots,K)$ that involves two drugs $i$ and $j$ $(i,j=1,\dots,N;\ i\neq j)$. Let $\mathbf{S}_{i},\mathbf{S}_{j}\in\mathbb{R}^{p\times q}$ be the SMILES-based feature matrices of drugs $i$ and $j$, $\mathbf{h}_{embed,i}$ and $\mathbf{h}_{embed,j}$ their RaGSEs, and $\mathbf{x}_{i}$ and $\mathbf{x}_{j}$ their initial feature vectors. Subsequently, we employ two feedforward neural networks (FNNs) to process the RaGSEs and initial features of DPs, respectively. In addition, a CNN encodes the SMILES strings of DPs. The employed CNN is a multi-layer 1-D CNN followed by a global max pooling layer, with reference to [16]. Accordingly, the three features of DP $k$ are encoded as follows:

$\mathbf{p}_{k}^{initi}=\operatorname{FNN1}(\mathbf{x}_{i}||\mathbf{x}_{j},\Theta_{\mathrm{FNN1}})$ (8)
$\mathbf{p}_{k}^{embed}=\operatorname{FNN2}(\mathbf{h}_{embed,i}||\mathbf{h}_{embed,j},\Theta_{\mathrm{FNN2}})$ (9)
$\mathbf{p}_{k}^{smile}=\operatorname{CNN}(\mathbf{S}_{i}||\mathbf{S}_{j},\Theta_{\mathrm{CNN}}).$ (10)

where $\Theta_{\mathrm{FNN1}}$, $\Theta_{\mathrm{FNN2}}$, and $\Theta_{\mathrm{CNN}}$ represent the trainable weights of FNN1, FNN2, and the CNN, respectively.

The initial feature vectors of drugs consist of Jaccard similarity scores between drugs. Thus, $\mathbf{p}_{k}^{initi}\in\mathbb{R}^{1\times d^{FNN}}$ can be considered the DP feature representation under the similarity view. Meanwhile, the drug RaGSEs primarily concentrate on interaction information between drugs. As such, $\mathbf{p}_{k}^{embed}\in\mathbb{R}^{1\times d^{FNN}}$ can be perceived as the DP representation under the interaction view.

3.3.2 Collaborative Contrastive Optimization

The interaction view mainly focuses on known interaction relationships between nodes, while the similarity view can infer potential therapeutic effects, adverse reactions, or drug interactions of new drugs by discovering underlying drug correlations. Therefore, these two views are complementary and mutually reinforcing. Herein, we define positive and negative samples for contrastive learning. In this paper, given the feature representations of a DP under the interaction view, we can simply define its feature representations under the similarity view as the positive sample. Nevertheless, our analysis of the employed datasets suggests that there may be underlying correlations between DPs. Consequently, we propose a novel positive selection strategy.

This paper utilizes two real-world, multi-relational DDI datasets: Dataset 1 and Dataset 2. In Dataset 1, the known DDIs are categorized into 65 types of DDI events, while the number of interaction event types in which each drug is involved ranges from 1 to 20. The median of the event type counts is 10, significantly lower than 65. Similar findings are observed in Dataset 2. These findings suggest that a particular drug is likely to be strongly associated with specific interaction event types. Thus, the interaction event types a drug participates in can be considered significant characteristics of the drug; we refer to them as the interaction characteristics of drugs. Let $\mathbf{t}_{i}\in\mathbb{R}^{1\times R}$ represent the interaction characteristics of drug $i$, where $R$ is the number of interaction event types. The vector $\mathbf{t}_{i}$ is binary, and its nonzero elements indicate the interaction event types drug $i$ participates in. Given a DP $k$ that involves two drugs $i$ and $j$, the interaction characteristics of DP $k$ are defined as $\mathbf{c}_{k}=\mathbf{t}_{i}+\mathbf{t}_{j}$. Based on this, we propose a new positive selection strategy: if two DPs exhibit similar interaction characteristics, they are designated as positive samples. One advantage of this strategy is that the selected positive samples may reflect the underlying interaction tendency of the target DP. For DPs $k$ and $q$, we calculate the cosine similarity of their interaction characteristics as follows:

$S_{k,q}=\frac{\mathbf{c}_{k}\cdot\mathbf{c}_{q}}{\|\mathbf{c}_{k}\|\|\mathbf{c}_{q}\|}.$ (11)

Given threshold values $t_{pos}$ and $t_{neg}$ with $t_{pos}>t_{neg}$, DP $q$ is deemed a positive sample of DP $k$ if $S_{k,q}\geq t_{pos}$ and, conversely, a negative sample if $S_{k,q}\leq t_{neg}$. Let $\mathcal{P}$ represent the set of positive pairs in one batch, and $\mathcal{N}$ the set of negative pairs. We then model a self-supervised learning task using the standard binary cross-entropy loss. With this, we present the co-contrastive learning loss function as follows:

$\mathcal{L}_{ss1}=-\frac{1}{|\mathcal{P}|+|\mathcal{N}|}\left(\sum_{(k,q)\in\mathcal{P}}\log\Psi\left(\mathbf{p}_{k}^{embed},\mathbf{p}_{q}^{initi}\right)+\sum_{(k,q)\in\mathcal{N}}\log\left(1-\Psi\left(\mathbf{p}_{k}^{embed},\mathbf{p}_{q}^{initi}\right)\right)\right)$ (12)

where the contrastive discriminator $\Psi(\cdot,\cdot)$ is instantiated as $\sigma(\mathbf{p}_{k}^{embed}\mathbf{W}{\mathbf{p}_{q}^{initi}}^{T})$, in which $\mathbf{W}$ is a learnable weight matrix and the activation function $\sigma$ is the sigmoid, producing a score that represents the probability of being a positive sample. The co-contrastive learning mechanism maximizes the mutual information of positive pairs under the interaction and similarity views. To stabilize the co-contrastive learning process and facilitate the two views supervising each other, we present another co-contrastive learning loss function:

$\mathcal{L}_{ss2}=-\frac{1}{|\mathcal{P}|+|\mathcal{N}|}\left(\sum_{(k,q)\in\mathcal{P}}\log\Psi\left(\mathbf{p}_{k}^{initi},\mathbf{p}_{q}^{embed}\right)+\sum_{(k,q)\in\mathcal{N}}\log\left(1-\Psi\left(\mathbf{p}_{k}^{initi},\mathbf{p}_{q}^{embed}\right)\right)\right).$ (13)
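The following sketch illustrates Eqs. (11) and (12) for one batch; it is a simplified rendering under our own assumptions (dense all-pairs computation, numerical clamping), and swapping the two views in the discriminator yields Eq. (13).

```python
import torch

def co_contrastive_loss(P_embed, P_initi, C, W, t_pos=0.95, t_neg=0.1):
    """Sketch of Eqs. (11)-(12) for one batch.

    P_embed, P_initi: (K, d) DP representations under the interaction
    and similarity views; C: (K, R) interaction characteristics c_k;
    W: learnable (d, d) discriminator weights.
    """
    # Eq. (11): cosine similarity between interaction characteristics.
    C_norm = C / C.norm(dim=1, keepdim=True).clamp(min=1e-8)
    S = C_norm @ C_norm.T
    pos = S >= t_pos  # boolean mask of positive DP pairs
    neg = S <= t_neg  # boolean mask of negative DP pairs

    # Discriminator Psi(p_k, p_q) = sigmoid(p_k W p_q^T), all pairs at once.
    probs = torch.sigmoid(P_embed @ W @ P_initi.T).clamp(1e-7, 1 - 1e-7)

    # Eq. (12): binary cross-entropy over positive and negative pairs.
    loss = -(torch.log(probs[pos]).sum() + torch.log(1 - probs[neg]).sum())
    return loss / (pos.sum() + neg.sum()).clamp(min=1)
```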

In conclusion, to fully utilize all significant information and learn more discriminative DP representations, we define the final DP representation $\mathbf{p}_{k}$ as follows:

$\mathbf{p}_{k}^{add}=\mathbf{p}_{k}^{initi}+\mathbf{p}_{k}^{embed}+\mathbf{p}_{k}^{smile}$ (14)
$\mathbf{p}_{k}=||\left(\mathbf{p}_{k}^{initi},\mathbf{p}_{k}^{embed},\mathbf{p}_{k}^{smile},\mathbf{p}_{k}^{add}\right).$ (15)

where the symbol $||$ denotes concatenation.

3.4 Loss Function

We construct a Decoder with two fully connected layers that maps $\mathbf{p}_{k}$ into the probability distribution space $\mathbf{y}_{k}\in\mathbb{R}^{1\times R}$:

$\mathbf{y}_{k}=\operatorname{Decoder}\left(\mathbf{p}_{k};\Theta_{\mathrm{Decoder}}\right)$ (16)

where $\Theta_{\mathrm{Decoder}}$ denotes the trainable parameters. The first fully connected layer is followed by a GELU activation function [14], a batch normalization layer [17], and a dropout layer. The second fully connected layer is followed by a softmax function.
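A minimal PyTorch sketch of such a Decoder is given below; the hidden width and the example sizes are assumptions, while the input width follows from the four concatenated parts in Eq. (15), assuming the CNN output is also $d^{FNN}$-dimensional so that Eq. (14) is well-defined.

```python
import torch.nn as nn

d_fnn, hidden, R, dr = 1000, 512, 65, 0.3  # hidden width is hypothetical

# p_k concatenates four d_FNN-dimensional parts (Eq. 15), hence 4 * d_fnn.
decoder = nn.Sequential(
    nn.Linear(4 * d_fnn, hidden),
    nn.GELU(),              # activation after the first FC layer
    nn.BatchNorm1d(hidden),
    nn.Dropout(dr),
    nn.Linear(hidden, R),
    nn.Softmax(dim=1),      # probability distribution over event types
)
```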

DDI event prediction is a multi-class classification task. Thus, we choose the cross-entropy loss as our classification loss function:

$\mathcal{L}_{ce}=-\sum_{k=1}^{K}\sum_{r=1}^{R}\hat{y}_{k,r}\log\left(y_{k,r}\right)$ (17)

where $\hat{y}_{k,r}$ represents the $r$-th element of $\hat{\mathbf{y}}_{k}$, the ground-truth vector (one-hot encoding) of DP $k$; $K$ denotes the number of DDI samples in a training batch; and $y_{k,r}$ denotes the predicted probability score of DP $k$ under class $r$, i.e., the $r$-th element of $\mathbf{y}_{k}$.

To emphasize the importance of the classification loss, we introduce a hyperparameter $\lambda$ to scale the classification cross-entropy loss $\mathcal{L}_{ce}$. Accordingly, the total loss function of the model is formulated as follows:

$\mathcal{L}=\lambda\mathcal{L}_{ce}+\mathcal{L}_{ss1}+\mathcal{L}_{ss2}$ (18)

where $\mathcal{L}$ is the loss of one batch. Furthermore, we employ Mixup [49], a data augmentation algorithm, to increase the quantity of the original data and improve the generalization ability and robustness of the model. Mixup is described in detail in the previous study [49].
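For reference, a standard Mixup step might be sketched as follows; where exactly the augmentation is applied in RaGSECo follows [49], so this batch-level rendering on DP features and one-hot labels is only an assumption.

```python
import torch

def mixup(x, y, alpha=0.2):
    """Standard Mixup [49] on a batch of features x and one-hot labels y;
    alpha is an assumed Beta-distribution parameter (see [49])."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    perm = torch.randperm(x.size(0))
    return lam * x + (1 - lam) * x[perm], lam * y + (1 - lam) * y[perm]
```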

Input: $N_{epochs}$: maximum epochs; $N_{batches}$: total batches; $\lambda$: hyperparameter; $\mathcal{T}_{r}$: the set of training DDIs; $\hat{\mathbf{Y}}$: ground truth of training DDIs; $\mathcal{T}_{e}$: the set of test DDIs.
Output: $\mathbf{Y}_{te}$: predicted types of test DDIs.
1: $\mathbf{X}\leftarrow$ obtain drugs' initial features via Eq. (1).
2: $\mathcal{G}\leftarrow$ construct a DDI graph using known DDIs and $\mathbf{X}$.
3: /* Training procedure */
4: for $epoch\in[1,N_{epochs}]$ do
5:   for $i\in[1,N_{batches}]$ do
6:     /* Acquiring drug embeddings */
7:     Use $\mathcal{G}$ to learn the RaGSEs of known drugs, $\mathbf{H}$, via Eq. (2).
8:     Use Jaccard similarities and $\mathbf{H}$ to build a DDS graph $\mathcal{G}_{s}$.
9:     Use $\mathcal{G}_{s}$ to obtain the RaGSEs of all drugs, $\mathbf{H}_{embed}$, via Eqs. (4)-(7).
10:    /* Acquiring drug-pair representations */
11:    Obtain the $i$-th batch of training DDIs from $\mathcal{T}_{r}$, and let DP $k$ belong to this batch.
12:    Acquire three representations of DP $k$: $\mathbf{p}_{k}^{initi}$, $\mathbf{p}_{k}^{embed}$, $\mathbf{p}_{k}^{smile}$, via Eqs. (8)-(10).
13:    Obtain the positive and negative sample sets via Eq. (11).
14:    $\mathcal{L}_{ss1}$, $\mathcal{L}_{ss2}$ $\leftarrow$ obtain the co-contrastive learning losses via Eqs. (12) and (13).
15:    Calculate the final representation of DP $k$, $\mathbf{p}_{k}$, via Eqs. (14) and (15).
16:    /* DDI prediction */
17:    Use $\mathbf{p}_{k}$ to calculate the probability distribution $\mathbf{y}_{k}$ via Eq. (16).
18:    $\mathcal{L}_{ce}\leftarrow$ use the ground truth $\hat{\mathbf{y}}_{k}$ and the prediction $\mathbf{y}_{k}$ to calculate the prediction loss via Eq. (17).
19:    Obtain the final loss $\mathcal{L}\leftarrow\lambda\mathcal{L}_{ce}+\mathcal{L}_{ss1}+\mathcal{L}_{ss2}$.
20:    Minimize $\mathcal{L}$ and update $\mathbf{H}$, $\mathbf{H}_{embed}$, and $\mathbf{p}_{k}$.
21:   end for
22: end for
23: /* Testing procedure */
24: Predict the types of the DDIs in $\mathcal{T}_{e}$ using the trained RaGSECo model.
Algorithm 1: Pseudocode of the proposed RaGSECo.

RaGSECo is a semi-supervised DDI prediction approach: both known and new drugs are incorporated into the training and testing phases of each task. We present pseudocode in Algorithm 1 for a clearer illustration of optimizing the stated objective. We begin by extracting drug features via Jaccard similarities (Line 1). Following this, we construct a multi-relational DDI graph using the extracted drug features and the known DDIs derived from the training data $\mathcal{T}_{r}$ (Line 2). During each batch of each training iteration, our initial step is to learn embeddings $\mathbf{H}_{embed}$ for all drugs (Lines 7-9). Then, a batch of training data is obtained from $\mathcal{T}_{r}$, and feature representations $\mathbf{p}_{k}$ for the corresponding DPs are learned (Lines 11, 12, and 15). Subsequently, we construct the sets of positive and negative samples (Line 13) and calculate the co-contrastive learning losses $\mathcal{L}_{ss1}$ and $\mathcal{L}_{ss2}$ (Line 14). The prediction loss $\mathcal{L}_{ce}$ is subsequently computed (Lines 17 and 18). In the following step, we merge $\mathcal{L}_{ss1}$, $\mathcal{L}_{ss2}$, and $\mathcal{L}_{ce}$ into the final loss $\mathcal{L}$ (Line 19), then optimize $\mathcal{L}$ to update $\mathbf{H}_{embed}$ and $\mathbf{p}_{k}$ (Line 20). Once all training iterations are complete, we use the trained RaGSECo to predict the interaction event types of the DPs in $\mathcal{T}_{e}$.

4 Results and Discussion

4.1 Datasets

This paper uses two real datasets to explore the effectiveness and competitiveness of our RaGSECo. The first dataset (Dataset 1) was collected by Deng et al. from DrugBank (https://go.drugbank.com/releases/latest) and published in [5]. Dataset 1 contains 572 drugs with 37,264 pairwise DDIs associated with 65 interaction types. Each drug is represented by four biological attributes: enzymes, targets, pathways, and chemical substructures.

We also construct a dataset (Dataset 2) from DrugBank. We use 1,000 drugs, and each drug is described by three features: enzymes, targets, and chemical substructures. Accordingly, we obtained a total of 206,029 pairwise DDIs associated with 99 types of events. Detailed information on the two datasets is listed in Table 1.

Table 1: Summary of the datasets utilized in our experiments.
Dataset Drug number DDI number DDI event types
Dataset 1 572 37264 65
Dataset 2 1000 206029 99

4.2 Baselines

We compare the proposed RaGSECo with six baselines:

• MCFF-MTDDI [11] extracts multiple drug-related features and proposes a gated recurrent unit-based multi-channel feature fusion module to yield comprehensive representations of DPs.

• MDF-SA-DDI [25] uses four encoders to generate four different DP features and adopts transformers to perform feature fusion.

• RANEDDI [47] is a relation-aware graph structure embedding method that considers the multi-relational interaction information to obtain drug embeddings for DDI prediction.

• DDIMDL [5] is a DNN method that pays attention to multiple drug-drug similarities for multi-relational DDI prediction.

• DeepDDI [33] is a DNN method that uses the structural information of DPs to train the model and make DDI predictions.

• DNN [21] uses multiple similarity information of DPs to predict the pharmacological effects of DDIs via autoencoders and a deep feed-forward network.

4.3 Experimental Settings

To demonstrate its effectiveness, we evaluate our RaGSECo on Tasks 1, 2, and 3 using the two real datasets. In Task 1, we randomly split the DDIs into five subsets via 5-fold cross-validation (5-CV), with four subsets as the training set and the remaining one as the test set. For Tasks 2 and 3, we randomly split the drugs instead of the DDIs into five subsets via 5-CV, with four subsets used as known drugs and the remaining one as new drugs. In Task 2, the DDIs involving two known drugs form the training set, and the DDIs involving one known drug and one new drug form the test set. In Task 3, the training set is the same as in Task 2, and the DDIs between two new drugs form the test set.
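The drug-level splitting protocol for Tasks 2 and 3 can be sketched as follows; the function and variable names are illustrative, not the actual implementation.

```python
import numpy as np

def split_for_tasks(ddis, drug_ids, fold, n_folds=5, seed=0):
    """Sketch of the Task 2/3 protocol: 5-fold CV over drugs, not DDIs.
    `ddis` is a list of (drug_i, drug_j) pairs; `fold` selects which
    drug subset is held out as the "new" drugs."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(drug_ids)
    new = set(np.array_split(shuffled, n_folds)[fold].tolist())
    is_known = lambda d: d not in new
    train = [p for p in ddis if is_known(p[0]) and is_known(p[1])]
    task2 = [p for p in ddis if is_known(p[0]) != is_known(p[1])]
    task3 = [p for p in ddis if not is_known(p[0]) and not is_known(p[1])]
    return train, task2, task3  # Task 1 instead splits the DDIs directly
```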

The prediction task is a multi-class classification problem. To evaluate the classification performance of our RaGSECo, we use six metrics: accuracy (ACC), the area under the precision-recall curve (AUPR), the area under the ROC curve (AUC), Precision, Recall, and F1 score. The activation function, dropout layer, and batch normalization layer [17] are used between the fully connected layers. By default, we use the Gaussian error linear unit [14] activation function and the RAdam optimizer [28]. The proposed RaGSECo is implemented with the deep learning library PyTorch. The Python and PyTorch versions are 3.8.10 and 1.9.0, respectively. All experiments are conducted on a Windows server with a GPU (NVIDIA GeForce RTX 4090 Ti).

4.4 Hyperparameter Searching and Setting

The proposed RaGSECo involves several hyperparameters that influence the prediction performance, including the batch size ($bs$), learning rate ($lr$), dropout rate ($dr$), training epochs ($te$), the dimension of drug RaGSEs ($d'$), the power of the adjacency matrices in the DDS graph ($n$), the output dimension of FNN1 ($d^{FNN}$), the threshold values ($t_{pos}$ and $t_{neg}$), and the weight of the CE loss ($\lambda$). To thoroughly investigate the impact of each parameter on the prediction results, a grid search strategy is employed in which one parameter is varied while the other parameters are kept fixed. Table 5 summarizes the values of all hyperparameters on the different datasets and tasks. Among these hyperparameters, $n$, $d^{FNN}$, $t_{pos}$, and $t_{neg}$ are specific to the proposed RaGSECo. Therefore, in the subsequent experiments, we provide the impact of these hyperparameters on the experimental results and analyze the underlying reasons.

4.4.1 Impact of RaGSE Propagation Distance $n$

Figure 4: Six metric scores versus the maximum RaGSE propagation distance $n$ on Task 1 (a), Task 2 (b), and Task 3 (c) of Dataset 1.

The hyperparameter $n$ represents the power of the adjacency matrices in the DDS graph and determines the distance of RaGSE propagation, playing a crucial role in RaGSE propagation. To understand the influence of $n$ on the prediction performance, we conduct experiments on the three tasks of Dataset 1 and evaluate six metric scores across different values of $n$. Referring to Fig. 4 (a), in Task 1, the metric scores of RaGSECo do not change drastically as $n$ increases, and RaGSECo consistently achieves good prediction performance. On the other hand, in Tasks 2 and 3, we observe that the prediction performance of RaGSECo is sensitive to the value of $n$. Specifically, when $n$ is set to 0, the prediction performance of our RaGSECo is at its lowest, and it improves significantly as the value of $n$ increases. This is because the test DDIs in Tasks 2 and 3 involve new drugs that possess no relation-aware interaction information when $n$ is 0; as a result, the model tends to over-fit. As $n$ increases, new drugs can aggregate effective relation-aware interaction information from similar known drugs, which significantly improves the generalization ability of RaGSECo. Based on the experimental results, we set $n$ to 0 in Task 1 and to 3 in Tasks 2 and 3 to achieve desirable prediction performance.

4.4.2 Impact of Output Dimension $d^{FNN}$

Figure 5: Six metric scores versus the output dimension $d^{FNN}$ of FNN1 and FNN2 on Task 1 (a), Task 2 (b), and Task 3 (c) of Dataset 1.

The output dimension $d^{FNN}$ of FNN1 and FNN2 is an important hyperparameter for our RaGSECo model. Increasing $d^{FNN}$ can enhance the generalization ability of RaGSECo to some extent. To investigate the impact of $d^{FNN}$ on the prediction performance of RaGSECo, we conducted experiments on the three tasks of Dataset 1 and examined the changes in metric scores as $d^{FNN}$ varies. From Fig. 5, it can be observed that the prediction performance of RaGSECo gradually improves as $d^{FNN}$ increases across all tasks. This indicates the robustness of our model. To strike a balance between accuracy and efficiency, we set $d^{FNN}$ to 1000 for Task 1 and 1500 for Tasks 2 and 3.

4.4.3 Impact of Threshold Values $t_{pos}$ and $t_{neg}$

Figure 6: ACC (a), AUPR (b), and F1 (c) versus the threshold values $t_{pos}$ and $t_{neg}$ on Task 1 of Dataset 1.

The threshold values $t_{pos}$ and $t_{neg}$ are important parameters for selecting positive and negative samples of DPs. In our experiments, we vary $t_{pos}$ over $[0.8, 0.85, \dots, 1]$ and $t_{neg}$ over $[0, 0.1, \dots, 0.4]$. Fig. 6 presents comparisons of three representative metric scores (ACC, AUPR, and F1) for different values of $t_{pos}$ and $t_{neg}$. It can be observed that the results are not significantly affected by variations in $t_{pos}$ and $t_{neg}$, indicating the stability of our co-contrastive learning strategy. Herein, the feature representations from the two views of each DP serve as positive samples for each other when $t_{pos}$ equals 1. As observed, the model produces superior prediction results when $t_{pos}$ is set to 0.95 rather than 1. This observation indicates the effectiveness of our positive sample selection strategy. Finally, we set $t_{pos}=0.95$ and $t_{neg}=0.1$ for subsequent experiments.

Figure 7: Experimental results of RaGSECo and its six variants in terms of ACC (a), AUPR (b), and F1 (c) on Task 1 of Dataset 1.
Figure 8: Experimental results of RaGSECo and its six variants in terms of ACC (a), AUPR (b), and F1 (c) on Task 2 of Dataset 1.
Figure 9: Experimental results of RaGSECo and its six variants in terms of ACC (a), AUPR (b), and F1 (c) on Task 3 of Dataset 1.

4.5 Analysis of RaGSECo with Its Variants

In this subsection, we compare our proposed RaGSECo with six variants to assess the necessity and effectiveness of each component for DDI prediction. RaGSECo incorporates multiple components, including the construction of the multi-relational DDI and multi-attribute DDS graphs, the integration of SMILES strings, initial features, and RaGSEs of drugs, and the adoption of a co-contrastive learning mechanism to enhance DP representation learning. By comparing RaGSECo with its variants, we aim to understand the contribution of each component to the overall performance. The six variants are defined as follows:

• RaGSECo-R: A variant of RaGSECo that does not construct a multi-relational DDI graph to learn the RaGSEs of known drugs. From a practical perspective, the initial drug features $\mathbf{X}$ are directly used as the initial node features of the multi-attribute DDS graph to learn drug embeddings.

• RaGSECo-M: A variant of RaGSECo that does not construct the multi-attribute DDS graph for RaGSE propagation. In other words, the output of RaGSEL, $\mathbf{H}$, is taken as the final drug embeddings.

• RaGSECo-I: A variant of RaGSECo that neglects the initial features of drugs during DP representation learning. Hence, the representation of DP $k$ in Eq. (15) becomes $\mathbf{p}_{k}=||\left(\mathbf{p}_{k}^{smile},\mathbf{p}_{k}^{embed},\mathbf{p}_{k}^{smile}+\mathbf{p}_{k}^{embed}\right)$.

• RaGSECo-S: A variant of RaGSECo that does not use the SMILES strings of drugs during DP representation learning. Thus, the representation of DP $k$ in Eq. (15) becomes $\mathbf{p}_{k}=||\left(\mathbf{p}_{k}^{initi},\mathbf{p}_{k}^{embed},\mathbf{p}_{k}^{initi}+\mathbf{p}_{k}^{embed}\right)$.

• RaGSECo-E: A RaGSECo variant that ignores drug RaGSEs during DP representation learning. Hence, the representation of DP $k$ in Eq. (15) becomes $\mathbf{p}_{k}=||\left(\mathbf{p}_{k}^{initi},\mathbf{p}_{k}^{smile},\mathbf{p}_{k}^{initi}+\mathbf{p}_{k}^{smile}\right)$.

• RaGSECo-C: A RaGSECo variant that neglects co-contrastive learning. Hence, the representation of DP $k$ in Eq. (15) becomes $\mathbf{p}_{k}=\mathbf{p}_{k}^{embed}$, and the total loss is $\mathcal{L}=\mathcal{L}_{ce}$.

Herein, we select three representative metric scores (ACC, AUPR, and F1) to evaluate the prediction performance of these models. Figs. 7, 8, and 9 illustrate the performance of RaGSECo and its six variants on Tasks 1, 2, and 3 of Dataset 1, respectively. These figures show that RaGSECo achieves higher metric scores than its variants, indicating the effectiveness of RaGSE learning and propagation, multimodal DP representation learning, and the co-contrastive learning mechanism. Comparing RaGSECo-R and RaGSECo-M, we can see that RaGSECo-M performs better on Task 1, while RaGSECo-R significantly outperforms RaGSECo-M on Tasks 2 and 3. This observation confirms that the test DDIs in Task 1 consist of known drugs with distinguishable relation-aware information, whereas the test DDIs in Tasks 2 and 3 include new drugs, which degrades model performance; RaGSE propagation can effectively mitigate this issue. Analyzing the results of RaGSECo-I, RaGSECo-S, and RaGSECo-E reveals that incorporating the drugs' initial features, SMILES strings, and RaGSEs enhances data diversity and improves the model's performance. Finally, the performance of RaGSECo-C further validates the effectiveness of the co-contrastive learning mechanism.

Table 2: Prediction performances of different methods on Dataset 1.
Method ACC AUPR AUC Precision Recall F1
Task 1 RaGSECo 0.9461 0.9838 0.9991 0.9121 0.9043 0.9050
MCFF-MTDDI 0.9350 0.9757 0.9985 0.9100 0.8820 0.8918
MDF-SA-DDI 0.9301 0.9737 0.9989 0.9085 0.8760 0.8878
RANEDDI 0.9228 0.9657 0.9980 0.8747 0.8701 0.8717
DDIMDL 0.8852 0.9208 0.9976 0.8471 0.7182 0.7585
DeepDDI 0.8371 0.8899 0.9961 0.7275 0.6611 0.6848
DNN 0.8797 0.9134 0.9963 0.8047 0.7027 0.7223
Task 2 RaGSECo 0.6826 0.7002 0.9535 0.6514 0.5631 0.5860
MCFF-MTDDI 0.6650 0.6800 0.9500 0.6561 0.5139 0.5574
MDF-SA-DDI 0.6633 0.6776 0.9497 0.6547 0.5078 0.5584
DDIMDL 0.6415 0.6558 0.9799 0.5607 0.4319 0.4460
DeepDDI 0.5774 0.5594 0.9575 0.3630 0.3890 0.3416
DNN 0.6239 0.6361 0.9796 0.4237 0.2840 0.2997
Task 3 RaGSECo 0.4464 0.4014 0.8848 0.3001 0.2513 0.2600
MCFF-MTDDI 0.4400 0.3870 0.8701 0.2823 0.2351 0.2437
MDF-SA-DDI 0.4338 0.3873 0.8630 0.2715 0.2226 0.2329
DDIMDL 0.4075 0.3635 0.9512 0.2408 0.1452 0.1590
DeepDDI 0.3602 0.2781 0.9059 0.1586 0.1450 0.1373
DNN 0.4087 0.3776 0.9550 0.1836 0.1093 0.1152
Figure 10: Performance comparison for each DDI event type of Dataset 1.
Table 3: Prediction performances of different methods on Dataset 2.
Method ACC AUPR AUC Precision Recall F1
Task 1 RaGSECo 0.9344 0.9805 0.9995 0.9021 0.9234 0.9113
MCFF-MTDDI 0.9010 0.9532 0.9984 0.8300 0.9122 0.8631
MDF-SA-DDI 0.8725 0.9385 0.9979 0.7518 0.9198 0.8220
RANEDDI 0.8611 0.9225 0.9872 0.8155 0.9084 0.8110
DDIMDL 0.8401 0.8824 0.9892 0.7678 0.8580 0.7800
DeepDDI 0.7813 0.8542 0.9810 0.6459 0.8065 0.6732
DNN 0.8281 0.8650 0.9722 0.7337 0.8239 0.7560
Task 2 RaGSECo 0.6570 0.6800 0.9862 0.5576 0.6087 0.5586
MCFF-MTDDI 0.6422 0.6564 0.9601 0.5407 0.5087 0.5140
MDF-SA-DDI 0.6333 0.6486 0.9667 0.5317 0.4900 0.4785
DDIMDL 0.6039 0.6159 0.9737 0.5080 0.4465 0.4470
DeepDDI 0.5266 0.5122 0.9522 0.3030 0.3592 0.3219
DNN 0.5985 0.6035 0.9666 0.3572 0.2475 0.2661
Task 3 RaGSECo 0.4775 0.4362 0.9578 0.2957 0.2898 0.2723
MCFF-MTDDI 0.4545 0.4136 0.9340 0.2857 0.2348 0.2427
MDF-SA-DDI 0.4530 0.4092 0.9358 0.2768 0.2336 0.2399
DDIMDL 0.4111 0.3739 0.9514 0.2715 0.1691 0.1823
DeepDDI 0.3466 0.2685 0.9165 0.2284 0.1535 0.1640
DNN 0.4062 0.3680 0.9454 0.1576 0.1245 0.1373

4.6 Comparison with Other Methods

4.6.1 Dataset 1

To assess the competitiveness of our RaGSECo, we compare it with several state-of-the-art DDI prediction methods, namely MCFF-MTDDI, MDF-SA-DDI, RANEDDI, DDIMDL, DeepDDI, and DNN. Table 2 presents the metric scores achieved by these methods across the three tasks of Dataset 1. As observed, in most cases, our RaGSECo outperforms the competitors across the performance metrics of the three tasks. In Task 1, although RANEDDI also considers relation-aware graph structure information to learn drug embeddings and achieves outstanding prediction performance, our RaGSECo achieves better results. Specifically, the improvements of RaGSECo over RANEDDI are 2.33%, 1.81%, 3.74%, 2.83%, and 3.33% in terms of ACC, AUPR, Precision, Recall, and F1, respectively. The reasons for this are manifold. On the one hand, RANEDDI ignores multiple drug-related attributes, such as targets, enzymes, and chemical substructures, so it lacks the ability to capture relationships between drugs beyond interactions. On the other hand, simply concatenating the embeddings of two drugs to obtain drug-pair features also limits the generalization ability of RANEDDI. MDF-SA-DDI considers multiple attributes to represent drugs and employs multiple drug fusion methods to learn drug-pair features. Nevertheless, our RaGSECo still achieves improvements of 1.63%, 1.01%, 2.83%, and 1.72% over MDF-SA-DDI with respect to ACC, AUPR, Recall, and F1, respectively. The main reason is that MDF-SA-DDI does not take advantage of the specific interaction information between drugs.

For further insight, we investigate the performance of our RaGSECo and four competitive baselines for each event. Fig. 10 displays the AUPR and AUC scores of the five prediction models for each event on Task 1. As observed, RaGSECo produces greater AUPR and AUC scores than the other methods for most event types. In most cases, RaGSECo achieves a satisfactory result. All unsatisfactory prediction results are observed in low-frequency event types, such as #39, #52, and #64, with frequencies of only 98, 24, and 10 samples, respectively. The limited availability of training samples for these low-frequency event types may contribute to the relatively poorer performance.

In Tasks 2 and 3, we compare our RaGSECo with five competitive prediction methods, i.e., MCFF-MTDDI, MDF-SA-DDI, DDIMDL, DeepDDI, and DNN. Since RANEDDI only focuses on embedding known drugs, it cannot be applied to Tasks 2 and 3. In these two tasks, the test DDIs include new drugs, and the lack of interaction information on the new drugs weakens the generalization ability of the models. Therefore, the prediction accuracy on Tasks 2 and 3 is lower than on Task 1. Nevertheless, RaGSECo outperforms the other competitors in most cases. The reasons are threefold: 1) RaGSECo enables all drugs, including new drugs, to capture effective relation-aware interaction information. 2) RaGSECo inventively combines initial features, SMILES information, and drug RaGSEs, enhancing information diversity. 3) RaGSECo captures the underlying correlations between DPs and enables the two views to supervise each other through co-contrastive learning.

4.6.2 Dataset 2

Table 3 presents the performance metrics of the proposed RaGSECo and other outstanding approaches on the three tasks of Dataset 2. Dataset 2 exhibits greater diversity than Dataset 1: it has more drugs, more DDIs, and a wider range of DDI event types. As observed, our RaGSECo achieves better prediction performance than the other competitors in most cases. In Task 1, compared with MCFF-MTDDI, the proposed RaGSECo shows performance improvements of 0.73%, 2.25%, 1.47%, and 2.11% in terms of ACC, Precision, Recall, and F1, respectively. As observed in Tasks 2 and 3, our RaGSECo still acquires the best results compared with the five competitive prediction models. The experimental results demonstrate the effectiveness and robustness of RaGSECo.

4.7 Case Study

In this section, we conduct case studies to validate the usefulness of RaGSECo. We use all DDIs of Dataset 2 and their event types to train the RaGSECo model, then apply the trained model to the remaining DPs and report the top-ranked predictions. We focus on the five events with the highest frequencies and examine the top 10 predictions for each event. We then verify the selected DPs using the DDI Checker tool (https://go.drugbank.com/interax/multi_search). Among the 50 selected DPs, 7 DDIs are confirmed and recorded in Table 4. For example, Ribavirin may decrease the excretion rate of Sofosbuvir, which could result in a higher serum level, and the metabolism of Sevoflurane can be increased when combined with Prednisolone phosphate.
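The ranking step of this case study can be sketched as follows; probs, dp_names, and event_ids are hypothetical inputs standing in for the trained model's outputs and the five most frequent events.

```python
import numpy as np

def top_predictions(probs: np.ndarray, dp_names: list,
                    event_ids: list, k: int = 10) -> dict:
    """Rank candidate drug pairs per event by predicted probability.

    probs: (n_candidates, n_events) probabilities for untested DPs;
    dp_names: candidate DP names; event_ids: the events to inspect.
    """
    ranked = {}
    for e in event_ids:
        order = np.argsort(-probs[:, e])[:k]  # indices of the k largest scores
        ranked[e] = [(dp_names[i], float(probs[i, e])) for i in order]
    return ranked
```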

Table 4: The drug names and event types of the confirmed DDIs.
Rank  Drug names                                  DDI event type
1     Prednisolone phosphate and Sevoflurane      The metabolism increase
2     Dexamethasone and Amlodipine                The metabolism increase
3     Prednisolone phosphate and Flunitrazepam    The metabolism increase
4     Gestoden and Clofazimine                    The metabolism decrease
5     Mometasone furoate and Cytarabine           The metabolism decrease
6     Prednisolone phosphate and Ranolazine       The serum concentration increase
7     Velpatasvir and Ribavirin                   The excretion rate which could result in a higher serum level decrease
Table 5: The hyper-parameters achieving the best accuracy for the proposed RaGSECo on all tasks.
Dataset    Task    Hyper-parameters
Dataset 1  Task 1  bs: 512,  lr: 2e-5, dr: 0.3, te: 120, d': 500, n: 0, d^FNN: 1000, t_pos: 0.95, t_neg: 0.1, λ: 5
Dataset 1  Task 2  bs: 512,  lr: 5e-6, dr: 0.2, te: 120, d': 500, n: 3, d^FNN: 1500, t_pos: 0.95, t_neg: 0.1, λ: 5
Dataset 1  Task 3  bs: 512,  lr: 5e-6, dr: 0.2, te: 120, d': 500, n: 3, d^FNN: 1500, t_pos: 0.95, t_neg: 0.1, λ: 5
Dataset 2  Task 1  bs: 1024, lr: 2e-5, dr: 0.5, te: 120, d': 500, n: 0, d^FNN: 1500, t_pos: 0.95, t_neg: 0.1, λ: 5
Dataset 2  Task 2  bs: 1024, lr: 5e-6, dr: 0.5, te: 120, d': 500, n: 3, d^FNN: 1500, t_pos: 0.95, t_neg: 0.1, λ: 5
Dataset 2  Task 3  bs: 1024, lr: 5e-6, dr: 0.5, te: 120, d': 500, n: 3, d^FNN: 1500, t_pos: 0.95, t_neg: 0.1, λ: 5
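For readability, one row of Table 5 can be collected into a single configuration mapping; the key names and glosses are our assumptions (e.g., reading te as training epochs and n as a propagation depth), not the paper's code.

```python
# Table 5 settings for Dataset 1, Task 1, gathered into one mapping.
config = {
    "batch_size": 512,   # bs
    "lr": 2e-5,          # learning rate
    "dropout": 0.3,      # dr
    "epochs": 120,       # te (assumed: training epochs)
    "embed_dim": 500,    # d'
    "n_layers": 0,       # n (assumed: propagation depth)
    "fnn_dim": 1000,     # d^FNN
    "t_pos": 0.95,       # positive-sample threshold
    "t_neg": 0.1,        # negative-sample threshold
    "lambda": 5,         # λ, loss-balancing weight
}
```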

5 Conclusion

In multi-relational DDI prediction, relation-aware graph embedding-based methods hold considerable promise. Nevertheless, the interaction information of new drugs is unknown, which may cause these methods to suffer from severe over-fitting when predicting DDIs involving new drugs. To address this issue, we introduce the RaGSECo approach. The primary contribution of RaGSECo is to enable all drugs to capture effective relation-aware interaction features and to encourage the training and test sets to follow similar distributions. Additionally, RaGSECo employs a cross-view contrastive mechanism that boosts DP representation learning by exploiting the underlying correlations between DPs and letting the two views supervise each other collaboratively. Our approach thus offers a promising solution for drug and DP representation learning, enhancing DDI prediction performance.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

  • [1] Assaf Gottlieb and et al. INDI: a computational framework for inferring drug interactions and their associated recommendations. Molecular systems biology, 8(1):592, 2012.
  • [2] Antoine Bordes, Nicolas Usunier, Alberto García-Durán, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. Advances in Neural Information Processing Systems, 26, 2013.
  • [3] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey E. Hinton. A simple framework for contrastive learning of visual representations. In Proceedings of the 37th International Conference on Machine Learning, ICML, volume 119, pages 1597–1607, 2020.
  • [4] Zhongjian Cheng, Qichang Zhao, Yaohang Li, and et al. IIFDTI: predicting drug-target interactions through interactive and independent features based on attention mechanism. Bioinformatics, 38(17):4153–4161, 2022.
  • [5] Yifan Deng, Xinran Xu, Yang Qiu, and et al. A multimodal deep learning framework for predicting drug-drug interaction events. Bioinformatics, 36(15):4316–4322, 2020.
  • [6] Yijie Ding, Jijun Tang, and Fei Guo. Identification of drug-target interactions via multi-view graph regularized link propagation model. Neurocomputing, 461:618–631, 2021.
  • [7] Yijie Ding, Jijun Tang, Fei Guo, and et al. Identification of drug-side effect association via multiple information integration with centered kernel alignment. Neurocomputing, 325:211–224, 2019.
  • [8] Reza Ferdousi, Reza Safdari, and Yadollah Omidi. Computational prediction of drug-drug interactions based on drugs functional similarities. Journal of Biomedical Informatics, 70:54–64, 2017.
  • [9] Achille Fokoue, Mohammad Sadoghi, Oktie Hassanzadeh, and Ping Zhang. Predicting drug-drug interactions through large-scale similarity-based link prediction. In The Semantic Web - 13th International Conference, ESWC, volume 9678, pages 774–789, 2016.
  • [10] Zihao Gao, Huifang Ma, Xiaohui Zhang, Zheyu Wu, and Zhixin Li. Co-contrastive self-supervised learning for drug-disease association prediction. In PRICAI 2022: Trends in Artificial Intelligence - 19th Pacific Rim International Conference on Artificial Intelligence, PRICAI 2022, Shanghai, China, November 10-13, 2022, Proceedings, Part I, volume 13629 of Lecture Notes in Computer Science, pages 327–338, 2022.
  • [11] Chen-Di Han, Chun-Chun Wang, Li Huang, and Xing Chen. MCFF-MTDDI: multi-channel feature fusion for multi-typed drug-drug interaction prediction. Briefings in Bioinformatics, 2023.
  • [12] Chengxin He, Lei Duan, Huiru Zheng, Linlin Song, and Menglin Huang. An explainable framework for drug repositioning from disease information network. Neurocomputing, 511:247–258, 2022.
  • [13] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross B. Girshick. Momentum contrast for unsupervised visual representation learning. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pages 9726–9735, 2020.
  • [14] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2020.
  • [15] Yue Hong, Pengyu Luo, Shuting Jin, and et al. Lagat: link-aware graph attention network for drug-drug interaction prediction. Bioinformatics, 38(24):5406–5412, 2022.
  • [16] Kexin Huang, Tianfan Fu, Lucas M. Glass, and et al. DeepPurpose: a deep learning library for drug-target interaction prediction. Bioinformatics, 36(22-23):5545–5547, 2021.
  • [17] Sergey Ioffe and Christian Szegedy. Batch normalization: accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
  • [18] Humayun Kayesh, Md. Saiful Islam, Junhu Wang, Ryoma J. Ohira, and Zhe Wang. SCAN: A shared causal attention network for adverse drug reactions detection in tweets. Neurocomputing, 479:60–74, 2022.
  • [19] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, ICLR, 2017.
  • [20] Geonhee Lee, Chihyun Park, and Jaegyoon Ahn. Novel deep learning model for more accurate prediction of drug-drug interaction effects. BMC Bioinformatics, 20(1):1–8, 2019.
  • [21] Geonhee Lee, Chihyun Park, and Jaegyoon Ahn. Novel deep learning model for more accurate prediction of drug-drug interaction effects. BMC Bioinformatics, 20(1):415:1–415:8, 2019.
  • [22] Zhifei Li, Hai Liu, Zhaoli Zhang, Tingting Liu, and Jiangbo Shu. Recalibration convolutional networks for learning interaction knowledge graph embedding. Neurocomputing, 427:118–130, 2021.
  • [23] Majun Lian, Xinjie Wang, and Wenli Du. Integrated multi-similarity fusion and heterogeneous graph inference for drug-target interaction prediction. Neurocomputing, 500:1–12, 2022.
  • [24] Jiacheng Lin, Lijun Wu, Jinhua Zhu, and et al. R2-DDI: relation-aware feature refinement for drug-drug interaction prediction. Briefings in Bioinformatics, 24(1), 2023.
  • [25] Shenggeng Lin, Yanjing Wang, Lingfeng Zhang, and et al. MDF-SA-DDI: predicting drug-drug interaction events based on multi-source drug fusion, multi-source feature fusion and transformer self-attention mechanism. Briefings in Bioinformatics, 23(1), 2022.
  • [26] Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. Learning entity and relation embeddings for knowledge graph completion. In Proceedings of the AAAI conference on artificial intelligence, 2015.
  • [27] Kun Liu, Rui Meng, Longteng Li, Jingkun Mao, and Haiyong Chen. SiSL-Net: Saliency-guided self-supervised learning network for image classification. Neurocomputing, 510:193–202, 2022.
  • [28] Liyuan Liu, Haoming Jiang, Pengcheng He, and et al. On the variance of the adaptive learning rate and beyond. In ICLR, 2020.
  • [29] Qianhui Men, Edmond S.L. Ho, Hubert P.H. Shum, and Howard Leung. Focalized contrastive view-invariant learning for self-supervised skeleton-based action recognition. Neurocomputing, 537:198–209, 2023.
  • [30] Arnold K Nyamabo, Hui Yu, Zun Liu, and Jian Yu Shi. Drug-drug interaction prediction with learnable size-adaptive molecular substructures. Briefings in Bioinformatics, 23(1), 2022.
  • [31] Arnold K. Nyamabo, Hui Yu, and Jian-Yu Shi. SSI-DDI: substructure-substructure interactions for drug-drug interaction prediction. Briefings in Bioinformatics, 22(6), 2021.
  • [32] Shanchen Pang, Ying Zhang, Tao Song, Xudong Zhang, and Xun Wang. AMDE: a novel attention-mechanism-based multidimensional feature encoder for drug-drug interaction prediction. Briefings in Bioinformatics, 23(1), 2022.
  • [33] Jae Yong Ryu, Hyun Uk Kim, and et al. Deep learning improves prediction of drug-drug and drug-food interactions. Proceedings of the National Academy of Sciences, 115(18):E4304–E4311, 2018.
  • [34] Michael Sejr Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. Modeling relational data with graph convolutional networks. In The Semantic Web - 15th International Conference, ESWC, volume 10843, pages 593–607, 2018.
  • [35] Yifan Shang, Lin Gao, Quan Zou, and Liang Yu. Prediction of drug-target interactions based on multi-layer network representation learning. Neurocomputing, 434:80–89, 2021.
  • [36] Ying Shen, Kaiqi Yuan, Min Yang, Buzhou Tang, Yaliang Li, Nan Du, and Kai Lei. KMR: knowledge-oriented medicine representation learning for drug-drug interaction and similarity computation. Journal of Cheminformatics, 11(1):22:1–22:16, 2019.
  • [37] Jian-Yu Shi, Hua Huang, Jia-Xin Li, and et al. TMFUF: a triple matrix factorization-based unified framework for predicting comprehensive drug-drug interactions of new drugs. BMC Bioinformatics, 19(14):27–37, 2018.
  • [38] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. In Computer Vision - ECCV, volume 12356, pages 776–794, 2020.
  • [39] Ashish Vaswani, Noam Shazeer, Niki Parmar, and et al. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
  • [40] Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In 6th International Conference on Learning Representations, ICLR, 2018.
  • [41] Santiago Vilar, Carol Friedman, and George Hripcsak. Detection of drug-drug interactions through data mining studies using clinical sources, scientific literature and social media. Briefings in Bioinformatics, 19(5):863–877, 2018.
  • [42] Hanchen Wang, Defu Lian, Ying Zhang, and et al. GoGNN: graph of graphs neural network for predicting structured entity interactions. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, 2020.
  • [43] Xiao Wang, Nian Liu, Hui Han, and Chuan Shi. Self-supervised heterogeneous graph neural network with co-contrastive learning. In Feida Zhu, Beng Chin Ooi, and Chunyan Miao, editors, KDD ’21: The 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, Singapore, August 14-18, 2021, pages 1726–1736. ACM, 2021.
  • [44] Xiaoqi Wang, Yaning Yang, Kenli Li, Wentao Li, Fei Li, and Shaoliang Peng. BioERP: biomedical heterogeneous network-based self-supervised representation learning approach for entity relationship predictions. Bioinformatics, 37(24):4793–4800, 2021.
  • [45] Yanda Wang, Weitong Chen, Dechang Pi, Lin Yue, Sen Wang, and Miao Xu. Self-supervised adversarial distribution regularization for medication recommendation. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI, pages 3134–3140, 2021.
  • [46] Yueyue Wang, Danjun Song, Wentao Wang, Shengxiang Rao, Xiaoying Wang, and Manning Wang. Self-supervised learning and semi-supervised learning for multi-sequence medical image classification. Neurocomputing, 513:383–394, 2022.
  • [47] Hui Yu, Wenmin Dong, Jianyu Shi, and et al. RANEDDI: relation-aware network embedding for drug-drug interaction prediction. Information Sciences, 582:167–180, 2022.
  • [48] Hui Yu, Shiyu Zhao, and Jianyu Shi. STNN-DDI: a substructure-aware tensor neural network to predict drug-drug interactions. Briefings in Bioinformatics, 23(4), 2022.
  • [49] Hongyi Zhang, Moustapha Cissé, Yann N. Dauphin, and David Lopez-Paz. mixup: beyond empirical risk minimization. In ICLR, 2018.
  • [50] Tianlin Zhang, Jiaxu Leng, and Ying Liu. Deep learning for drug-drug interaction extraction from the literature: a review. Briefings in Bioinformatics, 21(5):1609–1627, 2020.
  • [51] Tongxuan Zhang, Hongfei Lin, Yuqi Ren, Zhihao Yang, Jian Wang, Xiaodong Duan, and Bo Xu. Identifying adverse drug reaction entities from social media with adversarial transfer learning model. Neurocomputing, 453:254–262, 2021.
  • [52] Wen Zhang, Kanghong Jing, Feng Huang, and et al. SFLLN: A sparse feature learning ensemble method with linear neighborhood regularization for predicting drug-drug interactions. Information Sciences, 497:189–201, 2019.
  • [53] Yu P. Zhang and Quan Zou. PPTPP: a novel therapeutic peptide prediction method using physicochemical property encoding and adaptive feature representation learning. Bioinformatics, 36(13):3982–3987, 2020.
  • [54] Biao Zhao, Weiqiang Jin, Javier Del Ser, and Guang Yang. ChatAgri: Exploring potentials of chatgpt on cross-linguistic agricultural text classification, 2023.
  • [55] Chengshuai Zhao, Shuai Liu, Feng Huang, Shichao Liu, and Wen Zhang. CSGNN: Contrastive self-supervised graph neural network for molecular interaction prediction. In IJCAI, pages 3756–3763, 2021.
  • [56] Tianyi Zhao, Yang Hu, Linda R. Valsdottir, and et al. Identifying drug-target interactions based on graph convolutional network and deep neural network. Briefings in Bioinformatics, 22(2):2141–2150, 2021.
  • [57] Yi Zheng, Hui Peng, and et al. DDI-PULearn: a positive-unlabeled learning method for large-scale prediction of drug-drug interactions. BMC Bioinformatics, 20(19):661, 2019.
  • [58] Jiajing Zhu, Yongguo Liu, Yun Zhang, and et al. Multi-attribute discriminative representation learning for prediction of adverse drug-drug interaction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(12):10129–10144, 2022.
  • [59] Luhe Zhuang, Hong Wang, Meifang Hua, Wei Li, and Hui Zhang. Predicting drug-drug adverse reactions via multi-view graph contrastive representation model. Applied Intelligence, pages 1–18, 2023.
  • [60] Marinka Zitnik, Monica Agrawal, and Jure Leskovec. Modeling polypharmacy side effects with graph convolutional networks. Bioinformatics, 34(13):i457–i466, 2018.