
Hashing for Protein Structure Similarity Search

Jin Han, Wu-Jun Li
National Key Laboratory for Novel Software Technology
Department of Computer Science and Technology
Nanjing University, Nanjing 210023, China
[email protected],[email protected]
Corresponding author
Abstract

Protein structure similarity search (PSSS), which tries to search proteins with similar structures, plays a crucial role across diverse domains from drug design to protein function prediction and molecular evolution. Traditional alignment-based PSSS methods, which directly calculate alignment on the protein structures, are highly time-consuming with high memory cost. Recently, alignment-free methods, which represent protein structures as fixed-length real-valued vectors, have been proposed for PSSS. Although these methods have lower time and memory cost than alignment-based methods, their time and memory cost is still too high for large-scale PSSS, and their accuracy is unsatisfactory. In this paper, we propose a novel method, called protein structure hashing (POSH), for PSSS. POSH learns a binary vector representation for each protein structure, which can dramatically reduce the time and memory cost for PSSS compared with real-valued vector representation based methods. Furthermore, in POSH we also propose expressive hand-crafted features and a structure encoder to well model both node and edge interactions in proteins. Experimental results on real datasets show that POSH outperforms other methods and achieves state-of-the-art accuracy. Furthermore, POSH achieves a memory saving of more than six times and a speed improvement of more than four times, compared with other methods.

1 Introduction

Proteins play essential roles in biological systems by binding with various ligands. Proteins with similar structures often share similar functions. Hence, protein structure similarity search (PSSS), which tries to search proteins with similar structures, plays a crucial role across diverse domains from drug design to protein function prediction and molecular evolution.

Traditional PSSS methods, which are often called alignment-based methods, directly calculate alignment on the protein structures. Representative alignment-based methods include TM-align [1], MATT [2] and several others [3, 4, 5]. Since identifying an optimal alignment between a pair of structures is an NP-hard problem [6], alignment-based methods are typically highly time-consuming with high memory cost even if heuristic techniques are adopted. Taking TM-align as an example, aligning a protein structure with 200 amino acids against the SCOPe database [7] with 14323 protein structures takes approximately 30 minutes, and this is the time cost for only a single query. The memory cost for storing the SCOPe database is approximately 4GB. With the development of Cryo-EM and protein structure prediction techniques, such as Alphafold 2 [8], RoseTTAFold [9] and ESMFold [10], the number of proteins with known structures is growing rapidly. For example, Alphafold 2 has predicted structures with high confidence for over 200 million proteins, which have been used to construct the Alphafold database [11]. On large-scale datasets like the Alphafold database, alignment-based methods incur huge time and memory cost, and may even become infeasible.

To reduce the cost of PSSS, alignment-free methods, which represent protein structures as fixed-length real-valued vectors, have been proposed. Existing alignment-free methods can be categorized into non-learning methods and learning-based methods. Non-learning methods [12, 13] generate vector representations based on hand-crafted features. Learning-based methods [14, 15] generate vector representations by learning from data. PSSS is performed by calculating and ranking the similarity between the vector of the query and those of the database. With vector representations, alignment-free methods have lower time and memory cost than alignment-based methods. However, it is still time-consuming to calculate the similarity between real-valued vectors, and the memory cost for storing real-valued vectors is still too high for large-scale datasets. Furthermore, the accuracy of existing alignment-free methods is unsatisfactory because these methods have limitations in representing complex protein structures. More specifically, the hand-crafted features in non-learning methods fail to model the complex and irregular three-dimensional protein shape, and existing learning-based methods suffer from a lack of expressive features and cannot model edge interactions in proteins.

In this paper, we propose a novel method, called protein structure hashing (POSH), for PSSS. The main contributions of POSH are outlined as follows:

  • To the best of our knowledge, POSH is the first hashing method for PSSS.

  • POSH learns a binary vector (or called hash code) representation for each protein structure, which can dramatically reduce the time and memory cost for PSSS compared with real-valued vector representation based methods.

  • Furthermore, in POSH we also propose expressive hand-crafted features and a structure encoder to well model both node and edge interactions in proteins.

  • Experimental results on real datasets show that POSH can outperform other methods to achieve state-of-the-art accuracy. Furthermore, POSH achieves a memory saving of more than six times and speed improvement of more than four times, compared with other methods.

2 Related Works

PSSS

Given a query protein, PSSS tries to search (return) similar proteins from a protein database, where the similarity between two proteins is computed based on the three-dimensional structures of these two proteins. Existing PSSS methods can be categorized into alignment-based methods and alignment-free methods.

Template modeling score (TM-score) [16] is a well-known metric for assessing the topological similarity of protein structures, which has been adopted in the evaluation of the Critical Assessment of protein Structure Prediction (CASP) [17]. TM-align [1] is an alignment-based method using TM-score. TM-align employs heuristic dynamic programming (DP) to generate residue-to-residue alignments, but performing DP on the similarity score matrix is time-consuming. Other alignment-based methods have also been proposed [3, 4, 2, 5]. However, none of them can avoid the time-consuming procedure of aligning the coordinates of individual atoms between structures. Furthermore, since these alignment-based methods directly perform alignment on the raw protein structures, the memory cost for storing the raw protein structures is also high.

Several alignment-free methods have been proposed to overcome the inefficiency of alignment-based methods by representing protein structures as fixed-length real-valued vectors. These methods can be further categorized into non-learning methods and learning-based methods. Non-learning methods use hand-crafted features to represent the protein structures. For example, SGM [12] represents protein structures using 30 structural measures, and SSEF [13] utilizes secondary structure information to map structures into vectors. The shortcoming of non-learning methods is that the hand-crafted features fail to model the complex and irregular three-dimensional protein shape. Hence, learning-based methods are proposed to learn more informative representations from the hand-crafted features. DeepFold [14] and GraSR [15] are two representative learning-based methods. DeepFold extracts a distance map from the protein structure and employs a convolutional neural network (CNN) to learn a vector representation, but it neglects the graph property of protein structures. GraSR [15] models the protein structure as a graph with node features and utilizes a long short-term memory (LSTM) network [19] and a graph convolutional network (GCN) [20] to update the node features. However, GraSR only considers the relations of local neighboring nodes and fails to model the edge interactions in proteins. Furthermore, all existing alignment-free methods adopt real-valued vectors for feature representation, which still incur high time and memory cost for large-scale datasets.

Hashing

Hashing has been widely used for efficient search in many areas [21, 22, 23, 24, 25]. Hashing uses a hash function to map each data sample to a binary vector (or called hash code) while preserving the similarity in the original space. The hash function can be manually designed or learned from training data. In this paper, we focus on learned hash functions because learned functions can typically achieve better performance. Binary hash codes have advantages in search speed and memory cost. To the best of our knowledge, no research has explored the application of hashing for PSSS.

3 Method

Figure 1: The architecture of POSH

Figure 1 illustrates the architecture of our proposed POSH for PSSS. The architecture comprises several key components: graph construction, a structure encoder, a hashing layer and a contrastive learning component for the objective function, and a data sampling strategy that samples negative samples and substructures of positive samples.

3.1 Graph Construction

We construct a graph to represent each protein, and introduce several techniques to extract features for both nodes and edges from the raw protein structure.

Graph Structure

A protein is represented as a graph $\mathcal{G}(\mathcal{V},\mathcal{E})$, with $\mathcal{V}$ denoting the set of nodes and $\mathcal{E}$ denoting the set of edges. Each amino acid corresponds to a node in the graph, and the edges are constructed by connecting each node with its $k$ nearest neighbors ($k$NN). The distance between nodes is measured by the Euclidean distance between the $C_{\alpha}$ atoms. The raw node features are introduced in the following content.
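To make the graph construction concrete, the following is a minimal sketch (in NumPy) of building the $k$NN edge set from $C_{\alpha}$ coordinates. The function name and the default $k$ are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def build_knn_edges(ca_coords: np.ndarray, k: int = 30) -> np.ndarray:
    """ca_coords: (n, 3) array of C-alpha coordinates for one protein.
    Returns an (m, 2) array of directed edges (i, j), where j is a neighbor of i."""
    n = ca_coords.shape[0]
    # Pairwise Euclidean distances between residues.
    diff = ca_coords[:, None, :] - ca_coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(dist, np.inf)                  # exclude self-loops
    k = min(k, n - 1)
    neighbors = np.argsort(dist, axis=1)[:, :k]     # k nearest neighbors per node
    src = np.repeat(np.arange(n), k)
    dst = neighbors.reshape(-1)
    return np.stack([src, dst], axis=1)
```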

Raw Node Features

Although the TM-score of proteins is typically computed based on the coordinates of $C_{\alpha}$ atoms, we also incorporate other backbone atoms to construct node features. Because the $C_{\alpha}$ atoms and the other backbone atoms are constrained by the peptide plane and form the 3D protein shape together, it is reasonable to include them in the node features. In particular, we calculate the bond and dihedral angles for each residue as our node features. As shown in Figure 2, a bond angle is the angle formed between two covalent bonds that share a common atom, and the dihedral angles, also known as torsion angles, refer to the angles between planes defined by four consecutive atoms in the protein backbone. For residue $i$, we denote the bond angles of $N_i\text{-}C_{\alpha_i}\text{-}C_i$, $C_{i-1}\text{-}N_i\text{-}C_{\alpha_i}$, $C_{\alpha_i}\text{-}C_i\text{-}N_{i+1}$ as $\alpha_i,\beta_i,\gamma_i$, and the dihedral angles about $N_{i-1}\text{-}C_{\alpha_{i-1}}$, $C_{\alpha_{i-1}}\text{-}C_{i-1}$, $C_i\text{-}N_i$ as $\phi_i,\psi_i,\omega_i$. The final node features are computed as $\{\sin,\cos\}\circ\{\alpha_i,\beta_i,\gamma_i,\phi_i,\psi_i,\omega_i\}$. We use $\bm{V}\in\mathbb{R}^{n\times d_v}$ to denote these hand-crafted node features (also called raw node features), where $n$ is the number of nodes and $d_v$ is the dimension of the raw node features.
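As an illustration of how such angle-based node features can be computed, the sketch below derives a bond angle, a dihedral angle, and the $\{\sin,\cos\}$ features from backbone coordinates. The helper names are hypothetical, and the bookkeeping that selects the correct atom triples/quadruples for each residue is omitted.

```python
import numpy as np

def bond_angle(a, b, c):
    """Angle at atom b formed by the bonds b-a and b-c (radians)."""
    u, v = a - b, c - b
    cosang = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(cosang, -1.0, 1.0))

def dihedral_angle(p0, p1, p2, p3):
    """Torsion angle defined by four consecutive backbone atoms (radians)."""
    b0, b1, b2 = p1 - p0, p2 - p1, p3 - p2
    n1, n2 = np.cross(b0, b1), np.cross(b1, b2)     # normals of the two planes
    m1 = np.cross(n1, b1 / np.linalg.norm(b1))
    x, y = np.dot(n1, n2), np.dot(m1, n2)
    return np.arctan2(y, x)

def node_features(angles):
    """angles: (n, 6) array of three bond and three dihedral angles per residue.
    Returns the (n, 12) sin/cos node features."""
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
```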

Raw Edge Features

Unlike existing learning-based methods, which do not use edge features, we additionally extract edge features for a better description of the structure. We consider the following five common types of atoms in the protein chain: $C$, $C_{\alpha}$, $N$, $O$, and $C_{\beta}$. If a $C_{\beta}$ atom is missing, we employ a technique from [26] to add a virtual $C_{\beta}$ atom based on the constraints of the other backbone atoms. To calculate the distance between the $i$th node and the $j$th node, we compute the Euclidean distance $\|X_i-Y_j\|$, where $X_i$ is chosen from the set $\{C_i,C_{\alpha_i},N_i,O_i,C_{\beta_i}\}$ and $Y_j$ is chosen from the set $\{C_j,C_{\alpha_j},N_j,O_j,C_{\beta_j}\}$. Here, $X_i$ and $Y_j$ represent the respective coordinates of the atoms. Subsequently, we encode the distances using Gaussian radial basis functions (RBFs). The final representation of the edge features is computed by $\text{RBF}_k(\|X_i-Y_j\|)$, where $\text{RBF}_k$ denotes the $k$th RBF employed in the encoding process. We use $\bm{E}\in\mathbb{R}^{m\times d_e}$ to denote these hand-crafted edge features (also called raw edge features), where $m$ is the number of edges and $d_e$ is the dimension of the raw edge features.
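A minimal sketch of the Gaussian RBF distance encoding is shown below; the number of RBFs, the distance range, and the kernel width are illustrative defaults, not values reported in the paper. For an edge $(i,j)$, the raw edge features would then be the concatenation of such codes over the atom-pair distances between the two residues.

```python
import numpy as np

def rbf_encode(distances: np.ndarray, d_min: float = 0.0, d_max: float = 20.0,
               n_rbf: int = 16) -> np.ndarray:
    """Encode atom-pair distances (in Angstrom) with Gaussian RBFs.
    distances: any-shaped array; returns an array with a trailing n_rbf axis.
    The centers and width below are illustrative defaults."""
    centers = np.linspace(d_min, d_max, n_rbf)
    width = (d_max - d_min) / n_rbf
    return np.exp(-((distances[..., None] - centers) / width) ** 2)
```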

Figure 2: Illustration of bond angles and dihedral angles. The letter R denotes the side chain of the amino acid.

3.2 Structure Encoder

We propose a novel structure encoder, a deep graph neural network with $L$ layers, to learn more informative node features from the hand-crafted raw features constructed in Section 3.1. The structure encoder consists of a node update layer and an edge update layer. We first map the hand-crafted raw features of node $i$ (the $i$th row of $\bm{V}$) and the hand-crafted raw features of edge $k$ (the $k$th row of $\bm{E}$) to the same hidden dimension with a linear mapping, to obtain the initial node representation $\bm{h}_i^0$ and edge representation $\bm{e}_{ij}^0$, respectively. Here, edge $k$ is from node $i$ to node $j$, and its initial representation is $\bm{e}_{ij}^0$.

Figure 3: POSH for protein structure similarity search. (a) In the training phase, the distance between binary hash codes will be minimized or maximized depending on the original similarity of the samples. (b) In the testing phase, the binary hash code for the new-coming query protein structure can be obtained from POSH, and then be used to search the database to get similar protein structures.

Node Update Layer

For each node $i$ in layer $l$, we update its representation $\bm{h}_i^l$ by aggregating features from its neighboring nodes and adjacent edges. This message passing mechanism is defined as follows:

$$\bm{u}_i^l = \bm{h}_i^l + \frac{1}{|\mathcal{N}_i|}\sum_{j\in\mathcal{N}_i}\mathrm{NodeMLP}\big(\bm{h}_i^l \,\|\, \bm{h}_j^l \,\|\, \bm{e}_{ij}^l\big), \qquad \tilde{\bm{h}}_i^l = \Phi\big(\bm{h}_i^l + \mathrm{MLP}(\bm{u}_i^l)\big), \qquad (1)$$

where $\mathcal{N}_i$ denotes the index set of neighboring nodes of node $i$ and $|\mathcal{N}_i|$ denotes the number of nodes in $\mathcal{N}_i$. $\|$ denotes the concatenation operation, and $\mathrm{NodeMLP}$ is a two-layer feed-forward neural network. The aggregated representation $\bm{u}_i^l$ is further updated by $\mathrm{MLP}$, which also denotes a two-layer feed-forward neural network, and a residual connection from $\bm{h}_i^l$ is then added. $\Phi$ is a follow-up batch normalization function. This straightforward message passing mechanism is memory efficient, so we can sample more negative samples in a single mini-batch; the details will be introduced in Section 3.4.
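The node update in Eq. (1) can be sketched in PyTorch as follows. The class name, hidden sizes, and the use of index_add_ for mean aggregation are illustrative choices under our reading of Eq. (1), not the authors' implementation.

```python
import torch
import torch.nn as nn

class NodeUpdateLayer(nn.Module):
    """A minimal sketch of the node update in Eq. (1)."""
    def __init__(self, dim: int):
        super().__init__()
        self.node_mlp = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.bn = nn.BatchNorm1d(dim)      # Phi in Eq. (1)

    def forward(self, h, e, edge_index):
        # h: (n, dim) node features; e: (m, dim) edge features;
        # edge_index: (2, m) tensor whose rows are (source i, neighbor j).
        src, dst = edge_index
        msg = self.node_mlp(torch.cat([h[src], h[dst], e], dim=-1))   # per-edge messages
        agg = torch.zeros_like(h).index_add_(0, src, msg)             # sum messages at node i
        deg = torch.zeros(h.size(0), device=h.device)
        deg = deg.index_add_(0, src, torch.ones(src.size(0), device=h.device))
        u = h + agg / deg.clamp(min=1).unsqueeze(-1)                  # mean aggregation + residual
        return self.bn(h + self.mlp(u))                               # second residual, then batch norm
```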

Edge Update Layer

As shown in [27, 28], neglecting the update of edge features can lead to suboptimal performance. Hence, we also perform edge message passing in our method. For each edge $(i,j)$, we aggregate features from the connected nodes $i$ and $j$, and the result is added to the original edge representation. Our edge message passing mechanism is defined as follows:

$$\bm{e}_{ij}^{l+1} = \Phi\big(\bm{e}_{ij}^l + \mathrm{EdgeMLP}(\tilde{\bm{h}}_i^l \,\|\, \tilde{\bm{h}}_j^l \,\|\, \bm{e}_{ij}^l)\big). \qquad (2)$$

Here, the updated node representations $\tilde{\bm{h}}_i^l$ and $\tilde{\bm{h}}_j^l$ are concatenated with the edge representation $\bm{e}_{ij}^l$. The concatenated result is then updated with $\mathrm{EdgeMLP}$, which is also a two-layer feed-forward neural network. For the next layer, we set $\bm{h}_i^{l+1}=\tilde{\bm{h}}_i^l$.
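Analogously, a sketch of the edge update in Eq. (2) could look like the following; names and sizes are again illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

class EdgeUpdateLayer(nn.Module):
    """A minimal sketch of the edge update in Eq. (2)."""
    def __init__(self, dim: int):
        super().__init__()
        self.edge_mlp = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.bn = nn.BatchNorm1d(dim)      # Phi in Eq. (2)

    def forward(self, h_tilde, e, edge_index):
        # h_tilde: (n, dim) updated node features; e: (m, dim) edge features.
        src, dst = edge_index
        upd = self.edge_mlp(torch.cat([h_tilde[src], h_tilde[dst], e], dim=-1))
        return self.bn(e + upd)            # residual connection followed by batch norm
```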

3.3 Objective Function

The output of the structure encoder is denoted as $\bm{h}_i^L$, where $L$ is the number of stacked layers. For each protein structure $t$, the representations of all the nodes in protein structure $t$ are $\{\bm{h}_i^L\}_{i\in\mathcal{V}_t}$. The final representation $\bm{y}_t$ of protein structure $t$ is computed as $\bm{y}_t=\mathrm{Linear}(f_p(\{\bm{h}_i^L\}_{i\in\mathcal{V}_t}))$, where $f_p$ is a max pooling function and $\mathrm{Linear}$ is a linear mapping. Note that the resulting vector $\bm{y}_t$ is real-valued at this stage.

Let $\bm{b}_t\in\{-1,1\}^d$ denote the binary hash code for representing protein structure $t$ (during training, we use $\bm{b}_t\in\{-1,1\}^d$ to denote the hash code for the convenience of modeling; after training, we change $-1$ to $0$ to get the final hash code representation). We can get $\bm{b}_t=\mathrm{sign}(\bm{y}_t)$. To enable end-to-end learning of the binary hash code, we add a hashing layer to the model. More specifically, we introduce a loss to encourage $\bm{y}_t$ to approach the binary hash code $\bm{b}_t$, as in [21]:

$$\mathcal{L}_{\text{hash}} = \sum_t \|\bm{y}_t-\bm{b}_t\|_2^2 + \gamma\sum_t(\bm{y}_t\bm{1})^2, \qquad (3)$$

where $\gamma$ is a hyperparameter and the term $(\bm{y}_t\bm{1})^2$ is designed to avoid biased or skewed representations of protein structures, i.e., it encourages each code to have a balanced number of positive and negative bits.

Each time, we sample a group of protein structures containing one protein structure as the query, one positive sample, and $K$ negative samples. Here, positive samples are those labeled to be similar to the query, and negative samples are those labeled to be dissimilar to the query. We denote the query sample as $Q$, the positive sample as $P$, and the negative samples as $F_1,F_2,\cdots,F_K$. As depicted in Figure 3 (a), our objective is to minimize the distance between similar samples and maximize the distance between dissimilar ones. To achieve this, we employ the InfoNCE loss [29] from contrastive learning, which minimizes the negative log-likelihood of the similar pairs. To ensure stability during training and avoid gradient explosion caused by the exponential function, we apply $L_2$ normalization to each $\bm{y}_t$ and obtain $\hat{\bm{y}}_t=\mathrm{norm}(\bm{y}_t)$. Then we have:

$$\mathcal{L}_{\text{sim}} = -\log\frac{\exp(\hat{\bm{y}}_Q\cdot\hat{\bm{y}}_P/\tau)}{\exp(\hat{\bm{y}}_Q\cdot\hat{\bm{y}}_P/\tau)+\sum_{i=1}^{K}\exp(\hat{\bm{y}}_Q\cdot\hat{\bm{y}}_{F_i}/\tau)}, \qquad (4)$$

where $\tau$ is a temperature hyperparameter that controls the relative distance between samples, and the denominator sums over the one positive sample and the $K$ negative samples. The similarity of two vectors is computed using the dot product ($\cdot$).

The entire model can be trained in an end-to-end manner, utilizing the following combined objective function:

$$\mathcal{L} = \mathcal{L}_{\text{sim}} + \lambda\mathcal{L}_{\text{hash}}, \qquad (5)$$

where $\lambda$ is a hyperparameter for balancing the two loss functions.
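Putting Eqs. (3)-(5) together, the following is a hedged sketch of the combined objective for one sampled group (query, positive, $K$ negatives). The default values of $\tau$, $\gamma$ and $\lambda$ follow the implementation details in Appendix A; the function itself is only an illustrative reconstruction, not the authors' code.

```python
import torch
import torch.nn.functional as F

def posh_loss(y_q, y_p, y_negs, tau=0.07, gamma=0.2, lam=0.5):
    """y_q: (d,) query output; y_p: (d,) positive; y_negs: (K, d) negatives (all real-valued)."""
    ys = torch.cat([y_q.unsqueeze(0), y_p.unsqueeze(0), y_negs], dim=0)   # (K+2, d)
    # Eq. (3): pull outputs toward their binary codes and keep each code balanced.
    b = torch.sign(ys)
    l_hash = ((ys - b) ** 2).sum(dim=1).sum() + gamma * (ys.sum(dim=1) ** 2).sum()
    # Eq. (4): InfoNCE over L2-normalized vectors; index 0 of the logits is the positive.
    y_hat = F.normalize(ys, dim=1)
    logits = (y_hat[0:1] @ y_hat[1:].t()).squeeze(0) / tau                # (1+K,)
    target = torch.zeros(1, dtype=torch.long, device=ys.device)
    l_sim = F.cross_entropy(logits.unsqueeze(0), target)
    # Eq. (5): combined objective.
    return l_sim + lam * l_hash
```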

In the testing phase, a protein structure is fed into POSH to first obtain a real-valued representation $\bm{y}_t$. This real-valued representation then undergoes a $\mathrm{sign}(\cdot)$ function to get a binary hash code representation. As depicted in Figure 3 (b), the target database to be searched is pre-encoded. Each new-coming query is first encoded as a binary vector (hash code), and the hash code is subsequently used to search the target database based on Hamming distance. Due to the discrete nature of the Hamming distance, multiple samples in the database may have the same Hamming distance to the query. We apply an additional scaling factor to the distance to distinguish samples with the same Hamming distance, which is defined as follows:

$$dist' = dist\,\Big/\,\Big(1+\frac{|l_q-l_t|}{l_{max}}\Big), \qquad (6)$$

where $dist$ is the original Hamming distance between the query protein structure and a protein structure in the database, $l_q$ is the length of the query protein structure, $l_t$ is the length of the protein structure in the database, and $l_{max}$ is the maximum length of the protein structures in the database. This scaling factor imposes a penalty on pairs with significantly different lengths.
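A minimal sketch of the length-scaled Hamming ranking in Eq. (6) is given below. It assumes the hash codes are stored as $\{0,1\}$ NumPy arrays; the function and variable names are illustrative.

```python
import numpy as np

def search(query_code, query_len, db_codes, db_lens, top_k=10):
    """query_code: (d,) array in {0,1}; db_codes: (N, d) array in {0,1};
    query_len / db_lens: protein lengths; returns indices of the top_k matches."""
    hamming = np.count_nonzero(db_codes != query_code, axis=1)        # Hamming distances
    l_max = db_lens.max()
    scaled = hamming / (1.0 + np.abs(query_len - db_lens) / l_max)    # Eq. (6)
    return np.argsort(scaled)[:top_k]
```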

3.4 Data Sampling Strategy

In this subsection, we introduce how to sample positive and negative samples for a query sample, and we further propose a substructure sampling strategy to enhance the diversity of positive samples.

Training Data Sampling

We first calculate the pairwise TM-scores [16] for protein structures using TM-align [1], which serve as the ground-truth measurement of structural similarity. In the training phase, we define two protein structures $(P_a,P_b)$ to be similar based on the following criterion: the TM-score between $P_a$ and $P_b$ is larger than a threshold $\rho$ multiplied by the maximum TM-score between $P_a$ and any other structure $P_i$ in the training set $\mathcal{D}$ excluding $P_a$, as shown below:

$$TM(P_a,P_b) \geq \rho\cdot\max\big(\{TM(P_a,P_i)\,|\,P_i\in\mathcal{D}\setminus P_a\}\big), \qquad (7)$$

where $TM(\cdot)$ denotes the TM-score and $\rho$ is a hyperparameter in $(0,1)$. Each time, we sample a mini-batch of data comprising one protein structure as the query, one positive sample, and $K$ negative samples. The positive sample is structurally similar to the query, while the negative samples are dissimilar.
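A sketch of how positives and negatives can be split according to Eq. (7) from a precomputed TM-score matrix is shown below ($\rho=0.9$ follows Appendix A); the function is an illustrative assumption, not the authors' code.

```python
import numpy as np

def split_pos_neg(tm_matrix, query_idx, rho=0.9):
    """tm_matrix: (N, N) precomputed pairwise TM-scores over the training set.
    Returns index arrays of positives and negatives for the given query."""
    scores = tm_matrix[query_idx].copy()
    scores[query_idx] = -np.inf                       # exclude the query itself
    threshold = rho * scores.max()                    # Eq. (7) threshold
    others = np.array([i for i in range(len(scores)) if i != query_idx])
    pos = others[scores[others] >= threshold]
    neg = others[scores[others] < threshold]
    return pos, neg
```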

Substructure Sampling

The above sampling strategy might cause an issue. We observe that some protein structures have few (fewer than 5) similar protein structures (positive samples), and hence the same structure will be repeatedly sampled during training, which carries a potential risk of overfitting. This issue is caused by structure pairs with high TM-scores, which raise the threshold for similar pairs in Eq. (7). To address this issue, we further propose a substructure sampling strategy to enhance the diversity of positive samples. Substructure sampling has been used in the pretraining of protein structures [30, 28]. However, directly applying the empirical sampling length or ratio of these existing methods leads to a significant decline in performance for our method.

Since positive samples are sensitive to protein structure variations, casually sampling a substructure could make a positive sample no longer similar to the query. In this paper, we propose a novel substructure sampling method. More specifically, our substructure sampling method starts from a randomly selected amino acid within the protein, and then we traverse along the protein chain towards both ends until the desired length is attained. To make sure the sampled substructure does not deviate too much from the original structure, we use the TM-score as guidance. Assume the original positive sample is denoted as $P$ and the sampled substructure is denoted as $P_s$. We aim to satisfy the following constraint:

$$TM(P,P_s) \geq \alpha. \qquad (8)$$

Here $\alpha$ is set to 0.9, which means the two structures are extremely similar according to the meaning of the TM-score [31]. Performing on-the-fly TM-score calculation during the training process would incur significant computation cost. Hence, we calculate the minimum sampling length for each structure that satisfies the above constraint before training. Through this finely controlled substructure sampling strategy, we increase the diversity of positive samples while avoiding the failure of training. As shown in Figure 1, positive and negative samples are first sampled based on the query in the training phase. Then, positive substructures are sampled based on the positive sample and the pre-computed substructure length.
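The substructure sampling described above can be sketched as follows, assuming the minimum admissible length for each structure has been pre-computed so that Eq. (8) holds. The index arithmetic is an illustrative reconstruction, not the authors' exact procedure.

```python
import numpy as np

def sample_substructure(n_residues: int, min_len: int, rng=np.random):
    """Pick a random residue and extend along the chain toward both ends.
    min_len: pre-computed minimum length satisfying TM(P, P_s) >= 0.9
    (assumed to be <= n_residues). Returns the residue range [start, end)."""
    length = rng.randint(min_len, n_residues + 1)     # any length >= min_len is admissible
    center = rng.randint(0, n_residues)               # randomly selected starting residue
    start = max(0, min(center - length // 2, n_residues - length))
    end = start + length
    return start, end
```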

4 Experiment

4.1 Evaluation Setting

Datasets

We employ three datasets in our experiment: SCOPe v2.07 [7], ind_PDB [15], and the Alphafold database [11]. SCOPe is a database for studying protein structural relationships, which has been utilized in existing works [14, 15]. To ensure a fair comparison, we employ the same filtering criteria as in [15], and the resulting dataset contains 14,215 protein structures. ind_PDB has also been utilized in [15], and consists of 1,900 protein structures collected from the Protein Data Bank [32]. The Alphafold database contains over 200 million protein structures predicted by Alphafold 2. Following existing work [14, 15], we first conduct 5-fold cross-validation on the SCOPe dataset and then evaluate the performance on the ind_PDB dataset. The Alphafold database is mainly used for comparing memory cost.

Metrics

We employ three metrics to evaluate the accuracy: the area under the receiver operating characteristic curve (AUROC), the area under the precision-recall curve (AUPRC), and the Top-k hit ratio. For each query, we calculate the AUROC and AUPRC, and report the average values across all queries. The Top-k hit ratio is computed as $\frac{1}{N_{P_Q}}\sum_{i=1}^{N_{P_Q}}\frac{N_{\text{hit}}^i}{\min(k,N_{\text{nbr}}^i)}$, where $N_{P_Q}$ represents the number of query structures, $N_{\text{hit}}^i$ denotes the number of correctly identified similar structures in the top-k rankings, and $N_{\text{nbr}}^i$ indicates the total number of similar structures for the given query $P_Q$. Our evaluation considers Top-1, Top-5, and Top-10. We follow the settings of existing works [14, 15], considering structure pairs with a TM-score larger than or equal to $0.9\cdot\max\{TM(P_Q,P_i)\,|\,P_i\in\mathcal{D}\}$ as similar pairs.
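For clarity, a sketch of the Top-k hit ratio computation is given below. The data layout (per-query ranked index lists and ground-truth similar sets) is an assumption made for illustration.

```python
import numpy as np

def topk_hit_ratio(rankings, similar_sets, k):
    """rankings: list where rankings[i] is the ranked list of database indices for query i;
    similar_sets: list where similar_sets[i] is the set of ground-truth similar indices."""
    ratios = []
    for ranked, truth in zip(rankings, similar_sets):
        if len(truth) == 0:
            continue                                   # skip queries with no similar structures
        hits = len(set(ranked[:k]) & truth)            # correctly identified structures in top k
        ratios.append(hits / min(k, len(truth)))
    return float(np.mean(ratios))
```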

To prove the efficiency of POSH, we also compare POSH with baselines in terms of time and memory cost.

Baselines

We adopt four alignment-free methods as baselines in our experiment, which include SGM [12], SSEF [13], DeepFold [14] and GraSR [15]. SGM and SSEF are non-learning methods. DeepFold and GraSR are learning-based methods. The results of SGM and SSEF are obtained by running their provided scripts. For DeepFold, we train it on our dataset using the same settings described in the original paper. The results of GraSR are directly copied from its original paper for comparison since we utilize the same dataset and filtering criteria.

4.2 Results

During the testing phase, we aim to search protein structures similar to a given query structure from a database. In our experiment, the training set serves as the database, and the structures in the validation or test set serve as queries. The ranking of structures is based on the Hamming distance between the binary vectors of protein structure pairs.

Accuracy on SCOPe

We compare the accuracy of POSH with the baselines on the SCOPe dataset. The results are shown in Table 1. We can find that POSH outperforms all other methods on all metrics except the Top-1 hit ratio. For the Top-1 hit ratio, POSH is comparable to the best baseline and significantly outperforms the other baselines.

Accuracy on ind_PDB

We also evaluate the accuracy of POSH and baselines on ind_PDB. The results are shown in Table 2. Compared with the SCOPe dataset, the accuracy of all the methods on the ind_PDB dataset drops. The gap can be attributed to the fact that each protein structure in the ind_PDB dataset may consist of multiple domains, but in SCOPe each protein structure represents a single domain. From another perspective, ind_PDB is a dataset more related to real-world scenarios. The results show that POSH consistently outperforms all the other methods across all evaluation metrics on the ind_PDB dataset. We can find that POSH achieves about 3%-7% improvement on Top-k hit ratio compared to the state-of-the-art baselines.

Table 1: Accuracy on SCOPe
Model AUROC AUPRC Top-1 Top-5 Top-10
SGM 0.9043 0.4834 0.5633 0.5264 0.5521
SSEF 0.8418 0.0391 0.0830 0.0608 0.0638
DeepFold 0.9628 0.5175 0.5935 0.5680 0.5953
GraSR 0.9823 0.6595 0.7282 0.7101 0.7400
POSH 0.9906 0.6853 0.7242 0.7225 0.7592
Table 2: Accuracy on ind_PDB
Model AUROC AUPRC Top-1 Top-5 Top-10
SGM 0.8750 0.2563 0.3010 0.2881 0.3026
SSEF 0.8329 0.0323 0.0516 0.0402 0.0438
DeepFold 0.9389 0.3051 0.3358 0.3393 0.3678
GraSR 0.9528 0.4058 0.4558 0.4488 0.4764
POSH 0.9699 0.4719 0.4921 0.5130 0.5550
Figure 4: Time cost of searching in databases of different sizes
Table 3: Comparison of memory cost
Model SCOPe ind_PDB Alphafold Database Compression Ratio
TM-align 3952.63MB 4380.84MB 21TB 1×
SGM 4.74MB 5.47MB 72GB 299×
SSEF 286.32MB 324.32MB 1196GB 18×
DeepFold 20.24MB 23.14MB 319GB 67×
GraSR 20.29MB 23.19MB 320GB 67×
POSH 0.71MB 0.83MB 11GB 1955×
Table 4: Ablation study
Edge Features Substructure Sampling Distance Scaling AUROC AUPRC Top-1 Top-5 Top-10
0.9685 0.4589 0.4811 0.5069 0.5489
0.9695 0.4498 0.4758 0.4960 0.5424
0.9651 0.4002 0.4237 0.4372 0.4826
0.9699 0.4719 0.4921 0.5130 0.5550

Time Cost

We show the time cost of searching in databases of different sizes for SGM, SSEF, GraSR and POSH in Figure 4. We use ind_PDB as the query dataset, and the number of protein structures in the target database increases from 100,000 to the size of the Alphafold database. Because the search time of TM-align ranges from hours to days, we exclude it from the figure. We also exclude the results of DeepFold because DeepFold embeds the protein structure into the same dimension as GraSR, which results in the same time cost. We can find that the time cost of POSH is significantly lower than that of the other methods. POSH is nearly four times and ten times faster than SGM and SSEF, respectively, and more than four times faster than the best remaining method, GraSR. The advantage of POSH becomes more pronounced as the database size increases, which makes POSH well suited for fast search over large-scale data.

Memory Cost

The memory cost and memory compression ratio of TM-align, POSH, and the other alignment-free methods are shown in Table 3. The process of encoding a protein structure into a vector can be seen as a form of compression, so we also calculate the compression ratio for the Alphafold database. For TM-align, all protein structures must be stored on the computing device to calculate pairwise similarity. In contrast, alignment-free methods allow for memory saving by representing each protein structure as a vector. Hence, we can perform computation on the computing device with the vector representation, and then index the desired proteins on the storage device.

From the results in Table 3, we can find that the alignment-free methods can dramatically reduce the memory cost compared with the alignment-based method TM-align. Compared with existing alignment-free methods, POSH achieves a memory saving of more than six times, and by up to more than 100 times. Take the Alphafold database as an example. When calculating the TM-score by running TM-align directly on the original protein structure data, the resulting memory cost is 21TB, which is unmanageable. Even if the protein structures are represented as real-valued vectors by alignment-free methods, the memory cost is still high. Specifically, the memory costs of SGM and SSEF are 72GB and 1196GB, which are 6.5 and 108.7 times larger than that of POSH, respectively. The other two learning-based methods, DeepFold and GraSR, require over 300GB of memory, which is about 32 times larger than that of POSH. The memory cost of POSH is only 11GB, which is acceptable for many real-world applications.
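As a rough sanity check of these figures, the back-of-the-envelope estimate below assumes 400-bit codes for POSH and 400-dimensional float32 vectors for the learning-based baselines (the code and vector lengths follow Appendix A); storage overheads such as indices are ignored.

```python
# Approximate storage for ~200 million structures in the Alphafold database.
n_proteins = 200_000_000
hash_bytes = 400 / 8                     # 400-bit binary code -> 50 bytes per protein
real_bytes = 400 * 4                     # 400-dim float32 vector -> 1600 bytes per protein
print(n_proteins * hash_bytes / 1e9)     # ~10 GB, consistent with the ~11GB reported for POSH
print(n_proteins * real_bytes / 1e9)     # ~320 GB, consistent with DeepFold/GraSR in Table 3
```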

4.3 Ablation Study

We conduct an ablation study on the edge updating scheme, substructure sampling strategy, and distance scaling strategy of POSH. The results are presented in Table 4. To verify the effectiveness of our edge features in capturing structural similarity, we replace the raw edge features with those of GraSR and correspondingly remove the edge update layer. We can find that the hand-crafted raw edge features we construct and the structure encoder we design contribute significantly to the improvement of accuracy. We can also find that our substructure sampling strategy contributes 1% to 2% to the Top-k hit ratio. The results also verify the importance of adopting distance scaling to differentiate samples with the same Hamming distance.

5 Conclusion

In this paper, we propose a novel method called POSH for protein structure similarity search. POSH learns a binary vector (hash code) representation for each protein structure, which can dramatically reduce the time and memory cost compared with real-valued vector representation based methods. Experimental results show that POSH can outperform other methods to achieve the best performance, in terms of accuracy, time cost and memory cost.

References

  • [1] Yang Zhang and Jeffrey Skolnick. Tm-align: a protein structure alignment algorithm based on the tm-score. Nucleic Acids Research, 33(7):2302–2309, 2005.
  • [2] Matthew Menke, Bonnie Berger, and Lenore Cowen. Matt: local flexibility aids protein multiple structure alignment. PLoS Computational Biology, 4(1):e10, 2008.
  • [3] Ilya N Shindyalov and Philip E Bourne. Protein structure alignment by incremental combinatorial extension (ce) of the optimal path. Protein Engineering, 11(9):739–747, 1998.
  • [4] Yuzhen Ye and Adam Godzik. Flexible structure alignment by chaining aligned fragment pairs allowing twists. Bioinformatics, 19(suppl_2):ii246–ii255, 2003.
  • [5] Sheng Wang, Jianzhu Ma, Jian Peng, and Jinbo Xu. Protein structure alignment beyond spatial proximity. Scientific Reports, 3(1):1–7, 2013.
  • [6] Richard H Lathrop. The protein threading problem with sequence amino acid interaction preferences is np-complete. Protein Engineering, Design and Selection, 7(9):1059–1068, 1994.
  • [7] Naomi K. Fox, Steven E. Brenner, and John-Marc Chandonia. Scope: Structural classification of proteins - extended, integrating SCOP and ASTRAL data and classification of new structures. Nucleic Acids Research, 42(Database-Issue):304–309, 2014.
  • [8] John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. Highly accurate protein structure prediction with alphafold. Nature, 596(7873):583–589, 2021.
  • [9] Minkyung Baek, Frank DiMaio, Ivan Anishchenko, Justas Dauparas, Sergey Ovchinnikov, Gyu Rie Lee, Jue Wang, Qian Cong, Lisa N Kinch, R Dustin Schaeffer, et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science, 373(6557):871–876, 2021.
  • [10] Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Allan dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Sal Candido, et al. Language models of protein sequences at the scale of evolution enable accurate structure prediction. BioRxiv, 2022.
  • [11] Mihaly Varadi, Stephen Anyango, Mandar Deshpande, Sreenath Nair, Cindy Natassia, Galabina Yordanova, David Yuan, Oana Stroe, Gemma Wood, Agata Laydon, et al. Alphafold protein structure database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Research, 50(D1):D439–D444, 2022.
  • [12] Peter Røgen and Boris Fain. Automatic classification of protein structure by using gauss integrals. Proceedings of the National Academy of Sciences, 100(1):119–124, 2003.
  • [13] Elena Zotenko, Dianne P O’Leary, and Teresa M Przytycka. Secondary structure spatial conformation footprint: a novel method for fast protein structure comparison and classification. BMC Structural Biology, 6:1–12, 2006.
  • [14] Yang Liu, Qing Ye, Liwei Wang, and Jian Peng. Learning structural motif representations for efficient protein structure search. Bioinformatics, 34(17):i773–i780, 2018.
  • [15] Chunqiu Xia, Shi-Hao Feng, Ying Xia, Xiaoyong Pan, and Hong-Bin Shen. Fast protein structure comparison through effective representation learning with contrastive graph neural networks. PLoS Computational Biology, 18(3):e1009986, 2022.
  • [16] Yang Zhang and Jeffrey Skolnick. Scoring function for automated assessment of protein structure template quality. Proteins: Structure, Function, and Bioinformatics, 57(4):702–710, 2004.
  • [17] DT Jones. Critically assessing the state-of-the-art in protein structure prediction. The Pharmacogenomics Journal, 1(2):126–134, 2001.
  • [18] Michel van Kempen, Stephanie S Kim, Charlotte Tumescheit, Milot Mirdita, Jeongjae Lee, Cameron LM Gilchrist, Johannes Söding, and Martin Steinegger. Fast and accurate protein structure search with foldseek. Nature Biotechnology, 2023.
  • [19] Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. Convolutional lstm network: A machine learning approach for precipitation nowcasting. Advances in Neural Information Processing Systems, 2015.
  • [20] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
  • [21] Wu-Jun Li, Sheng Wang, and Wang-Cheng Kang. Feature learning based deep supervised hashing with pairwise labels. In International Joint Conference on Artificial Intelligence, 2016.
  • [22] Zhangjie Cao, Mingsheng Long, Jianmin Wang, and Philip S Yu. Hashnet: Deep learning to hash by continuation. In International Conference on Computer Vision, 2017.
  • [23] Dong Xu and Wu-Jun Li. Hashing based answer selection. In Association for the Advancement of Artificial Intelligence, 2020.
  • [24] Ikuya Yamada, Akari Asai, and Hannaneh Hajishirzi. Efficient passage retrieval with hashing for open-domain question answering. In Association for Computational Linguistics, 2021.
  • [25] Jinbao Wang, Shuo Xu, Feng Zheng, Ke Lu, Jingkuan Song, and Ling Shao. Learning efficient hash codes for fast graph-based data similarity retrieval. IEEE Transactions on Image Processing, 30:6321–6334, 2021.
  • [26] Justas Dauparas, Ivan Anishchenko, Nathaniel Bennett, Hua Bai, Robert J Ragotte, Lukas F Milles, Basile IM Wicky, Alexis Courbet, Rob J de Haas, Neville Bethel, et al. Robust deep learning–based protein sequence design using proteinmpnn. Science, 378(6615):49–56, 2022.
  • [27] Zhangyang Gao, Cheng Tan, and Stan Z Li. Pifold: Toward effective and efficient protein inverse folding. arXiv preprint arXiv:2209.12643, 2022.
  • [28] Zuobai Zhang, Minghao Xu, Arian Rokkum Jamasb, Vijil Chenthamarakshan, Aurélie C. Lozano, Payel Das, and Jian Tang. Protein representation learning by geometric structure pretraining. In International Conference on Learning Representations, 2023.
  • [29] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
  • [30] Pedro Hermosilla and Timo Ropinski. Contrastive representation learning for 3d protein structures. arXiv preprint arXiv:2205.15675, 2022.
  • [31] Jinrui Xu and Yang Zhang. How significant is a protein structure similarity with tm-score= 0.5? Bioinformatics, 26(7):889–895, 2010.
  • [32] Helen M Berman, John Westbrook, Zukang Feng, Gary Gilliland, Talapady N Bhat, Helge Weissig, Ilya N Shindyalov, and Philip E Bourne. The protein data bank. Nucleic Acids Research, 28(1):235–242, 2000.

Appendix A Implementation Details

In our implementation, we set $\gamma$ to 0.2 and $\lambda$ to 0.5. The temperature coefficient $\tau$ is set to 0.07. Each time, we sample 64 proteins: one query, one positive sample, and 62 negative samples. To ensure stable training, we utilize gradient accumulation with 40 accumulation steps. The number of layers in the structure encoder is 6. The code length of the binary hash code is set to 400, which is the same as the length of the real-valued vectors used in existing learning-based methods [14, 15]. $\rho$ is set to 0.9. We employ the Adam optimizer with a learning rate of 0.0003. Our model is trained on NVIDIA RTX A6000 GPUs, and each model is trained for up to 100 epochs.

Appendix B Additional Experiments

Refer to caption
Figure 5: Comprehensive performance comparison

B.1 Comprehensive Analysis

In Figure 5, we provide a comprehensive comparison of accuracy, time cost, and memory cost for SGM, SSEF, GraSR and POSH. We can find that POSH simultaneously achieves faster speed, lower memory consumption, and more accurate search. Compared with SGM, the most efficient existing method, POSH achieves an accuracy improvement of 20.29%. Compared with the most accurate existing baseline (GraSR), POSH achieves a memory saving of 32 times and a speed improvement of four times. These results show the promising potential of POSH for PSSS.

B.2 Code Length Experiment

To investigate the effect of code length, we train five models with code lengths of 64, 128, 256, 400, and 512. The results are shown in Table 5. We can find that the accuracy of the models generally improves as the code length grows. However, a larger code length also leads to larger time and memory cost. Hence, in real applications, we need to choose a suitable code length to achieve a good trade-off between accuracy and cost. Our POSH method provides a good choice to achieve such a trade-off.

Table 5: Accuracy of POSH with different code length
Model AUROC AUPRC Top-1 Top-5 Top-10
POSH-64 0.9582 0.3327 0.3468 0.3752 0.4334
POSH-128 0.9641 0.3792 0.3905 0.4176 0.4696
POSH-256 0.9720 0.4356 0.4616 0.4769 0.5284
POSH-400★ 0.9708 0.4622 0.4858 0.5049 0.5455
POSH-512 0.9663 0.4810 0.5021 0.5176 0.5665

Appendix C Limitations

In this section, we discuss the limitations of POSH. Firstly, in the hashing layer of POSH, various hashing techniques could be explored to address the binarization problem. Future research can investigate whether there are more suitable hashing techniques specifically designed for protein structure hashing. Secondly, as the number of known protein structures increases, utilizing more protein structure data in our training set becomes possible. However, the upper limit of our model's performance when scaling up the amount of data still needs to be explored. It is also important to consider the challenges of using the time-consuming alignment-based method for ground-truth similarity calculation as the dataset grows, along with the cost of model training.