Modeling Protein Using Large-scale Pretrain Language Model
Abstract.
Protein is linked to almost every life process. Therefore, analyzing the biological structure and properties of protein sequences is critical to the exploration of life, as well as to disease detection and drug discovery. Traditional protein analysis methods tend to be labor-intensive and time-consuming. The emergence of deep learning models makes it possible to model data patterns in large quantities of data, and interdisciplinary researchers have begun to leverage deep learning methods for large biological datasets, e.g., using long short-term memory and convolutional neural networks for protein sequence classification. After millions of years of evolution, evolutionary information is encoded in protein sequences. Inspired by the similarity between natural language and protein sequences, we use large-scale language models to model evolutionary-scale protein sequences, encoding protein biological information in their representations. Significant improvements are observed on both token-level and sequence-level tasks, demonstrating that our large-scale model can accurately capture evolutionary information from pretraining on evolutionary-scale individual sequences. Our code and model are available at https://github.com/THUDM/ProteinLM.
∗Corresponding author: Jie Tang.
1. Introduction
As an indispensable part of life activities, protein is responsible for catalysis (e.g., enzymes), transportation (e.g., hemoglobin), and many other functions. Therefore, understanding the structure and function of proteins is critical to the study of life science, as well as to disease detection and drug discovery. Traditional protein analysis paradigms can be divided into experimental and analytical approaches. Experimental methods usually require purification, crystallization, and X-ray crystallography. Analytical methods, such as sequence alignment [16] and molecular dynamics simulation [10], tend to be incapable of handling large-scale protein data. Sequence alignment and similarity analysis build on the idea that "structure determines properties": sequences with similar residue order tend to share common ancestors and to be relatively similar in structure and function. Similarity analysis therefore often requires a large annotated database, from which the properties of a query sequence can be inferred from the labels of its aligned matches; however, labeling such large databases requires substantial manpower and material resources. Molecular dynamics (MD) and Monte Carlo (MC) simulations can also be applied to protein analysis [11, 14] and can be quite accurate (simulating at atomic scale), but they consume large amounts of computing resources and are time-consuming.
Generally speaking, most proteins that exist stably in nature have undergone millions of years of natural selection and evolution and sit in low-energy stable states. The polarity of some amino acids makes certain arrangements energetically favorable, and motifs in proteins are likewise formed by specific folded stretches of amino acids. Such patterns can be captured by deep learning models, and researchers have explored various strategies. Inspired by Word2Vec [17], BioVec [3] proposed ProtVec for protein sequences and GeneVec for gene sequences. However, in an n-gram representation the vocabulary size grows exponentially with the dependency range, making the cost of modeling long-range dependencies unbearable. With the rise of representation learning, sequence representation learning [1] and transfer learning [12] have also been introduced to protein analysis. In recent years, the attention mechanism [26], which computes hidden representations in parallel, has allowed researchers to better model long sequential data. ProtTrans [8] also shows that large-scale autoregressive language models can model protein sequences quite well. Moreover, since the information encoded in an individual sequence is limited, MSA Transformer [20] and ESM [21] leverage sequence alignment information to model proteins even better, and other work such as the Neural Potts Model [24] draws inspiration from the Potts model.
Thanks to advances in high-throughput sequencing technology, we have larger amino acid sequence databases than ever before. However, most of these data are unlabeled primary structures of proteins; labeled sequences (e.g., with structure or stability annotations) are relatively scarce. The achievements of BERT [6] reveal that data patterns can be extracted from massive unlabeled data through unsupervised learning, which inspired us to train language models on massive protein sequences. We have trained multiple large-scale models on the PFAM [7] dataset, the largest with 3 billion parameters, outperforming TAPE [19].
2. Related Works
2.1. Standardized Datasets and Tasks
There are plenty of data in the computational proteomics field; however, the current literature is fragmented in terms of unified datasets and evaluation metrics. Methods and models introduced by different researchers are often evaluated on different datasets with different metrics. To resolve this dilemma, TAPE [19] put forward a set of five biologically relevant tasks: secondary structure prediction [15, 4, 18], contact prediction [9, 4, 18], remote homology detection [9], fluorescence [23], and stability [22]. In addition, commonly used models such as LSTM [13], Transformer [26], and ResNet [28] are implemented for these tasks, serving as benchmarks for semi-supervised representation learning. One of TAPE's conclusions is that self-supervised pretraining is beneficial for almost all models on all tasks, doubling performance on some downstream tasks. Our work is based on the standardized datasets and evaluation metrics provided by TAPE [19].
2.2. Large-scale Pretraining
The success of pretraining makes researchers wonder whether an increase in language model scale can always bring improved performance. ProtTrans [8] is one representative: its authors trained a series of language models with up to tens of billions of parameters, the largest being ProtT5-XXL with 11B parameters, and achieved excellent performance on downstream tasks such as secondary structure prediction and solubility prediction.
2.3. Efficient Pretraining of Language Models
Unlike ordinary pretraining, large-scale pretraining requires distributed training techniques, including model parallelism, data parallelism, memory optimization, and data synchronization. Fortunately, Megatron-LM [25] provides an efficient training framework for language models. We implemented and trained our protein language model within this framework, along with the downstream classification and regression tasks.
3. Methodology
3.1. Pretrain Tasks
Description
The goal of protein pretraining is to model data patterns in massive unlabeled sequences. One closely related model is BERT [6] from natural language processing; we made some modifications to its loss function and model structure.
Dataset
Our work uses the dataset put forward by TAPE [19], so some data descriptions are inherited from it. PFAM [7] is a widely used database consisting of more than 32 million protein sequences. Sequences in PFAM are clustered into evolutionarily related groups (protein families). Leveraging this family information, TAPE constructed a test set (about 1% of the data) of fully held-out families. The remaining data are split randomly (95%/5%) into training and validation sets. We use the preprocessed PFAM data from TAPE as the pretraining corpus.
Training Objective
The original BERT [6] loss consists of a masked language model loss and a next sentence prediction loss.
For protein pretraining, we inherit the masking strategy of the masked language model (MLM) in BERT [6]: 15% of the amino acid tokens are randomly selected, and the model is trained to predict the masked tokens from the remaining ones. As for next sentence prediction (NSP), the input sequences are randomly shuffled, so we assume there is no evident semantic or biological correlation between consecutive sequences. We therefore discard the NSP loss and keep only the masked language model loss.
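For concreteness, the following is a minimal sketch of BERT-style masking applied to amino acid tokens. The 80/10/10 corruption split follows the original BERT recipe; the function and token names are illustrative and not taken from our Megatron-based pipeline.

```python
import random

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")  # 20 standard residues
MASK_TOKEN = "<mask>"

def mask_sequence(tokens, mask_prob=0.15, seed=None):
    """Select ~15% of positions as prediction targets.  Of those, 80% become
    <mask>, 10% are replaced by a random residue, 10% are left unchanged."""
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)      # None = not a prediction target
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok            # the model must recover this residue
            r = rng.random()
            if r < 0.8:
                masked[i] = MASK_TOKEN
            elif r < 0.9:
                masked[i] = rng.choice(AMINO_ACIDS)
            # else: keep the original token
    return masked, labels

masked, labels = mask_sequence(list("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"), seed=0)
```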
In terms of model structure, Megatron-LM [25] observes that as the model scale grows, the placement of layer normalization becomes critical. The sublayers of each transformer layer are therefore reordered: layer normalization, originally applied to the sublayer output, is moved ahead of the sublayer input to prevent the activations from drifting.
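The reordering amounts to a pre-LN transformer block. The sketch below illustrates the idea in PyTorch; hidden sizes, the GELU activation, and dropout values are illustrative defaults rather than our exact configuration.

```python
import torch.nn as nn

class PreLNTransformerLayer(nn.Module):
    """Pre-LN block: LayerNorm is applied to the *input* of each sublayer.
    A post-LN block would instead normalize after the residual addition."""
    def __init__(self, hidden_size=1024, num_heads=16, ffn_size=4096, dropout=0.1):
        super().__init__()
        self.ln_attn = nn.LayerNorm(hidden_size)
        self.attn = nn.MultiheadAttention(hidden_size, num_heads,
                                          dropout=dropout, batch_first=True)
        self.ln_ffn = nn.LayerNorm(hidden_size)
        self.ffn = nn.Sequential(nn.Linear(hidden_size, ffn_size), nn.GELU(),
                                 nn.Linear(ffn_size, hidden_size))

    def forward(self, x, key_padding_mask=None):
        h = self.ln_attn(x)                      # normalize before attention
        attn_out, _ = self.attn(h, h, h, key_padding_mask=key_padding_mask)
        x = x + attn_out                         # residual connection
        h = self.ln_ffn(x)                       # normalize before the MLP
        x = x + self.ffn(h)
        return x
```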
3.2. Downstream Classification Tasks
There are three classification tasks, operating at the token, sequence, and token-pair levels, respectively.
3.2.1. Secondary Structure

Description
Secondary structure classification is a token-level task. The input is a protein sequence, and the output is a sequence of labels of the same length, indicating the secondary structure class of each amino acid. For Q3, the labels are Helix, Strand, and Other. For Q8, the labels are 3₁₀-helix (G), α-helix (H), π-helix (I), β-strand (E), β-bridge (B), turn (T), bend (S), and other/coil (C).
Dataset
The dataset for the secondary structure task is the CB513 [5] dataset.
Training Objective
A one-dimensional convolution layer can be applied to secondary structure prediction [27]. However, thanks to the modeling capability of our model, the encoder output of the protein language model already contains sufficient information for this task, so we simply use ProteinLM followed by a multilayer perceptron as the secondary structure classifier.
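A minimal sketch of such a per-residue classification head is shown below; layer sizes, activation, and dropout are illustrative assumptions rather than the exact head in our code.

```python
import torch.nn as nn

class SecondaryStructureHead(nn.Module):
    """Per-residue classifier on top of the encoder output.
    num_labels = 3 for Q3 or 8 for Q8."""
    def __init__(self, hidden_size=1024, num_labels=3, dropout=0.1):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_size, num_labels),
        )

    def forward(self, token_embeddings):      # (batch, seq_len, hidden_size)
        return self.mlp(token_embeddings)     # (batch, seq_len, num_labels)
```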
3.2.2. Remote Homology

Description
Remote homology detection is a sequence-level classification task, introduced to measure a model's ability to detect structural similarity across distantly related inputs. The input is a protein sequence, and the target is to predict which fold family the sequence belongs to; there are 1195 classes in all. Similar to the token-level prediction tasks, we adopt a multilayer perceptron for classification.
Training Objective
This is a classical classification task, so we use the standard cross-entropy loss.
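The sketch below shows one way to build such a sequence-level head with cross-entropy; the masked mean pooling and layer sizes are assumptions for illustration, not necessarily the exact aggregation used in our implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

class RemoteHomologyHead(nn.Module):
    """Pool per-residue embeddings into one sequence vector, then classify
    into one of the 1195 fold families."""
    def __init__(self, hidden_size=1024, num_folds=1195):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.Tanh(),
                                 nn.Linear(hidden_size, num_folds))

    def forward(self, token_embeddings, padding_mask):
        # padding_mask: (batch, seq_len), 1 for real residues, 0 for padding
        mask = padding_mask.unsqueeze(-1).float()
        pooled = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        return self.mlp(pooled)                # (batch, num_folds)

def loss_fn(logits, fold_labels):
    """Standard cross-entropy over the 1195 fold classes."""
    return F.cross_entropy(logits, fold_labels)
```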
3.2.3. Contact Prediction

Description
Contact prediction asks whether a pair of amino acids is in "contact" in the folded structure (i.e., within 8 angstroms of each other), which facilitates three-dimensional free modeling of proteins. It is a classification task that assigns a binary label to each amino acid pair, indicating whether the two residues are in contact. This task evaluates the model's ability to capture global information about the protein sequence. Unlike the commonly used residually connected 2D convolutional networks, we adopt a simple predictor that concatenates the embeddings of each residue pair and feeds them to a multilayer perceptron for binary classification. The large number of hidden units and representation layers, together with the huge pretraining corpus, allows our model to capture even more long-range dependency information than common models.
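A minimal sketch of this pairwise predictor follows; the MLP width and two-class output layout are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ContactHead(nn.Module):
    """Binary classifier over residue pairs: concatenate the embeddings of
    residues i and j and score whether they are within 8 angstroms."""
    def __init__(self, hidden_size=1024, mlp_size=512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * hidden_size, mlp_size), nn.ReLU(),
                                 nn.Linear(mlp_size, 2))   # contact / no contact

    def forward(self, token_embeddings):        # (batch, L, hidden)
        b, L, h = token_embeddings.shape
        left = token_embeddings.unsqueeze(2).expand(b, L, L, h)
        right = token_embeddings.unsqueeze(1).expand(b, L, L, h)
        pairs = torch.cat([left, right], dim=-1)  # (batch, L, L, 2*hidden)
        return self.mlp(pairs)                    # (batch, L, L, 2)
```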
Dataset
The dataset is from ProteinNet [2]. The evaluation metrics are P@L/5, P@L/2, and P@L: the precision of the L/5, L/2, and L most likely predicted contacts, where L is the length of the protein sequence.
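The following sketch computes P@L/k for a single protein; restricting pairs to a minimum sequence separation of 12 (medium- and long-range contacts) is an assumption here, and the exact filtering in the benchmark may differ.

```python
import numpy as np

def precision_at_l_over_k(contact_probs, true_contacts, k=5, min_sep=12):
    """P@L/k: among the L/k residue pairs with the highest predicted contact
    probability (at sequence separation >= min_sep), the fraction that are
    true contacts.  L is the sequence length."""
    L = contact_probs.shape[0]
    i, j = np.triu_indices(L, k=min_sep)            # pairs with j - i >= min_sep
    order = np.argsort(contact_probs[i, j])[::-1]   # highest probability first
    top = order[: max(1, L // k)]
    return float(true_contacts[i[top], j[top]].mean())
```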
3.3. Downstream Regression Tasks
3.3.1. Fluorescence

Description
Distinguishing protein sequences that carry different mutations can be difficult, since the number of possible variants grows combinatorially with the number of mutations: a length-$L$ parent sequence has $\binom{L}{k}\cdot 19^{k}$ variants exactly $k$ substitutions away. The fluorescence prediction task evaluates the model's capacity to distinguish between very similar protein sequences, as well as its ability to generalize to unseen combinations of mutations [19]. Accurate predictions can facilitate exploration of the protein landscape.
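To make the growth concrete, the snippet below counts variants under the assumption of 19 alternative residues per mutated position and takes GFP's length of roughly 238 residues as an example.

```python
from math import comb

def num_variants(L, k, alphabet=20):
    """Number of sequences exactly k substitutions away from a length-L parent:
    choose the k positions, then one of (alphabet - 1) alternative residues each."""
    return comb(L, k) * (alphabet - 1) ** k

# For a GFP-sized protein (L ~ 238): single mutants number ~4.5e3,
# while quadruple mutants already number ~1.7e13.
print(num_variants(238, 1), num_variants(238, 4))
```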
Dataset
The training set [23] consists of sequences in a small Hamming-distance neighborhood of the parent green fluorescent protein, while the test set contains sequences with four or more mutations.
3.3.2. Stability

Description
Stability is very important in protein drug design, because drugs with low stability are often degraded before they take effect. The stability of a protein sequence is measured experimentally and indirectly as the upper concentration limit at which the protein can maintain its original folded structure [22]. For this task, the input is the amino acid sequence, and the output is a continuous value predicting the extent to which the protein can maintain its folded structure.
Dataset
The training set consists of proteins from four rounds of experimental design, while the test set contains Hamming distance-1 neighbors of the top candidate proteins [19].
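Both regression tasks (fluorescence and stability) can be handled with a scalar head on a pooled sequence embedding and scored with Spearman's rank correlation; the sketch below is a minimal illustration, with layer sizes and the pooling input assumed rather than taken from our code.

```python
import torch.nn as nn
from scipy.stats import spearmanr

class RegressionHead(nn.Module):
    """One scalar prediction per sequence, shared design for fluorescence
    and stability."""
    def __init__(self, hidden_size=1024):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.ReLU(),
                                 nn.Linear(hidden_size, 1))

    def forward(self, pooled_embedding):       # (batch, hidden_size)
        return self.mlp(pooled_embedding).squeeze(-1)

def evaluate(predictions, targets):
    """Both regression tasks are reported with Spearman's rho."""
    rho, _ = spearmanr(predictions, targets)
    return rho
```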
4. Results
Our model obtains strong results on the downstream tasks, with clear improvements on four of them: secondary structure prediction, remote homology detection, stability, and contact prediction. It is worth noting that the performance of the 3B model on contact prediction almost doubles that of the baseline model.
In addition, we used 10 sets of model hyper-parameters in total and conducted extensive experiments across the series of tasks. The results can be found in Table 7.
4.1. Training
We pretrained two large models on a cluster of 480 GPUs (Tesla V100 32GB) for about three weeks. The MLM loss and perplexity (PPL) of the pretrained models can be found in Table 1.
The 3B-parameter model reached a masked language model loss of 1.318 and a perplexity of 3.736.
The 1.2B-parameter model reached a masked language model loss of 1.335 and a perplexity of 3.802.
In pretraining, although the 3-billion-parameter model was trained for only half as many iterations as the 1.2-billion-parameter model, it reached an even lower MLM loss and PPL. This demonstrates that, when handled properly, increasing the model scale can contribute to more accurate capture of patterns in the data.
Table 1. MLM loss and perplexity of the pretrained models.

| Model | Protein LM (1.2B) | Protein LM (3B) |
|---|---|---|
| MLM Loss | 1.335 | 1.318 |
| PPL | 3.802 | 3.736 |
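As a quick sanity check, perplexity here is the exponential of the per-token MLM cross-entropy (in nats), which reproduces the values in Table 1 up to rounding.

```python
import math

# exp(1.335) ~ 3.80 and exp(1.318) ~ 3.74, matching Table 1 up to rounding.
for name, mlm_loss in [("Protein LM (1.2B)", 1.335), ("Protein LM (3B)", 1.318)]:
    print(name, round(math.exp(mlm_loss), 3))
```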
4.2. Evaluation
Of the five downstream tasks, results improve on four.
| Task | Metric | TAPE | ProteinLM (200M) | ProteinLM (3B) |
|---|---|---|---|---|
| Contact prediction | P@L/5 | 0.36 | 0.52 | 0.75 |
| Remote homology | Top 1 Accuracy | 0.21 | 0.26 | 0.30 |
| Secondary structure | Accuracy (Q-3) | 0.73 | 0.75 | 0.79 |
| Fluorescence | Spearman's rho | 0.68 | 0.68 | 0.68 |
| Stability | Spearman's rho | 0.73 | 0.77 | 0.79 |
Table 7. Results for the 10 model hyper-parameter configurations on all downstream tasks.

| Model configuration | P@L/5 | P@L/2 | P@L | Fluorescence | RH | SS Q@3 | SS Q@8 | Stability |
|---|---|---|---|---|---|---|---|---|
| hidden-512-layer-32-head-8 | 0.503 | 0.477 | 0.409 | 0.679 | 0.205 | 0.716 | 0.578 | 0.758 |
| hidden-768-layer-12-head-6 | 0.487 | 0.428 | 0.369 | 0.677 | 0.198 | 0.721 | 0.570 | 0.770 |
| hidden-768-layer-16-head-16 | 0.534 | 0.469 | 0.396 | 0.676 | 0.205 | 0.722 | 0.575 | 0.762 |
| hidden-768-layer-16-head-24 | 0.519 | 0.427 | 0.376 | 0.678 | 0.192 | 0.719 | 0.572 | 0.687 |
| hidden-1024-layer-12-head-16 | 0.572 | 0.490 | 0.419 | 0.676 | 0.209 | 0.729 | 0.584 | 0.744 |
| hidden-1024-layer-12-head-32 | 0.500 | 0.446 | 0.377 | 0.680 | 0.201 | 0.721 | 0.575 | 0.762 |
| hidden-2048-layer-12-head-16 | 0.676 | 0.576 | 0.495 | 0.677 | 0.266 | 0.752 | 0.614 | 0.732 |
| hidden-2048-layer-24-head-16 | 0.710 | 0.658 | 0.563 | 0.678 | 0.271 | 0.791 | 0.652 | 0.679 |
| hidden-2048-layer-24-head-8 | 0.673 | 0.600 | 0.531 | 0.674 | 0.262 | 0.762 | 0.624 | 0.785 |
| hidden-3072-layer-24-head-16 | 0.753 | 0.662 | 0.566 | 0.681 | 0.298 | 0.791 | 0.654 | 0.794 |
P@X means contact prediction precision at the top X most likely contacts.
SS means secondary structure classification.
RH means remote homology.
5. Contact Map Visualization
Generally, the accuracy of predictions on the anti-diagonal belt of a contact map reflects a model's ability to capture long-range dependencies. We therefore visualize the ground-truth contact map alongside the contact maps predicted by our model and by TAPE. The maps below demonstrate that our model is good at capturing long-range dependencies.
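For readers who want to reproduce such visualizations, the sketch below plots a binary contact matrix with matplotlib; the helper name and thresholding of predictions are illustrative.

```python
import matplotlib.pyplot as plt

def plot_contact_map(contacts, title):
    """contacts: (L, L) binary matrix, 1 where residues i and j are within 8 A.
    Long-range contacts appear far from the main diagonal."""
    fig, ax = plt.subplots(figsize=(4, 4))
    ax.imshow(contacts, cmap="Greys", origin="lower")
    ax.set_title(title)
    ax.set_xlabel("residue index")
    ax.set_ylabel("residue index")
    return fig

# e.g. plot_contact_map(true_map, "Ground truth")
#      plot_contact_map(pred_probs > 0.5, "ProteinLM-3B")
```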
5.1. Factual Contact Map
We visualize the contact map of protein #TBM-T0889 in Figure 6; we can intuitively see that there are many long-range contacts (contacts separated by at least 24 residues). This protein sequence can therefore be used to compare different models' ability to capture long-distance dependencies.

5.2. TAPE Contact Map
The visualized predictions from TAPE (Figure 7) show that the TAPE model (a small-scale transformer) can capture medium-range contacts (contacts separated by 12-23 residues). For long-range contacts, however, there are many misses along the anti-diagonal belt.

5.3. ProteinLM Contact Map
The ProteinLM-3B model shows very good performance in contact prediction, and the visualized prediction map confirms this: ProteinLM-3B captures both medium-range and long-range dependencies, with many hits along the anti-diagonal belt.

5.4. Analysis and Discussion
Given limited computing resources and time, the selection of hyper-parameters is critical to model performance, and a trade-off is necessary. The depth of the model (number of transformer layers) has a greater impact on performance than its width (hidden size).
On the one hand, a model that is too flat (too few transformer layers relative to its hidden size) performs poorly, even with a large hidden size. We trained one very wide but shallow model; although its training speed (average time per iteration) was the fastest among all models, it failed to converge after 5 days of training.
On the other hand, a model that is too deep is not a feasible choice under limited training time: the 32-layer model shown in Figure 9 takes about 3.5 times as long to train as the model in Figure 10, roughly 25 days in total.
Our empirical conclusion is that moderately deep and wide configurations (such as the 24-layer models in Table 7) are feasible and strike a good balance between training efficiency and resource consumption.
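A back-of-the-envelope parameter count helps connect the configurations in Table 7 to the model sizes quoted earlier; the standard approximation of roughly 12 times n_layers times d_model squared parameters per transformer stack (plus embeddings) suggests that the hidden-2048-layer-24 and hidden-3072-layer-24 configurations correspond roughly to the 1.2B and 3B models, though this mapping is an inference on our part and the vocabulary/position sizes below are illustrative.

```python
def approx_transformer_params(hidden_size, num_layers, vocab_size=30, max_positions=2048):
    """Rough count: ~12 * n_layers * d_model^2 for the transformer blocks
    (attention + 4x-wide FFN), plus embedding tables."""
    blocks = 12 * num_layers * hidden_size ** 2
    embeddings = (vocab_size + max_positions) * hidden_size
    return blocks + embeddings

# hidden-2048-layer-24 -> ~1.2e9 parameters; hidden-3072-layer-24 -> ~2.7e9 parameters
print(approx_transformer_params(2048, 24) / 1e9, approx_transformer_params(3072, 24) / 1e9)
```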
6. Summary
We propose ProteinLM, a large-scale pretrained model for protein sequences. The optimizations we introduced into protein pretraining make billion-parameter model training possible and efficient. The significantly improved performance on downstream tasks shows that as the model scale increases, the biological information and long-range dependencies in sequences can be captured more accurately. In addition, through a large number of controlled experiments, we identified and summarized some empirical rules for hyperparameter selection.
References
- [1] Ethan C. Alley, Grigory Khimulya, Surojit Biswas, Mohammed AlQuraishi, and George M. Church. Unified rational protein engineering with sequence-only deep representation learning. bioRxiv, 2019.
- [2] Mohammed AlQuraishi. Proteinnet: a standardized data set for machine learning of protein structure. arXiv preprint arXiv:1902.00249, 2019.
- [3] Ehsaneddin Asgari and Mohammad RK Mofrad. Continuous distributed representation of biological sequences for deep proteomics and genomics. PloS one, 10(11):e0141287, 2015.
- [4] Helen M Berman, John Westbrook, Zukang Feng, Gary Gilliland, Talapady N Bhat, Helge Weissig, Ilya N Shindyalov, and Philip E Bourne. The protein data bank. Nucleic acids research, 28(1):235–242, 2000.
- [5] James A. Cuff and Geoffrey J. Barton. Evaluation and improvement of multiple sequence methods for protein secondary structure prediction. Proteins: Structure, Function, and Bioinformatics, 34(4):508–519, 1999.
- [6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018.
- [7] Sara El-Gebali, Jaina Mistry, Alex Bateman, Sean R Eddy, Aurélien Luciani, Simon C Potter, Matloob Qureshi, Lorna J Richardson, Gustavo A Salazar, Alfredo Smart, Erik L L Sonnhammer, Layla Hirsh, Lisanna Paladin, Damiano Piovesan, Silvio C E Tosatto, and Robert D Finn. The Pfam protein families database in 2019. Nucleic Acids Research, 47(D1):D427–D432, 10 2018.
- [8] Ahmed Elnaggar, Michael Heinzinger, Christian Dallago, Ghalia Rihawi, Yu Wang, Llion Jones, Tom Gibbs, Tamas Feher, Christoph Angerer, Martin Steinegger, Debsindhu Bhowmik, and Burkhard Rost. Prottrans: Towards cracking the language of life’s code through self-supervised deep learning and high performance computing. CoRR, abs/2007.06225, 2020.
- [9] Naomi K Fox, Steven E Brenner, and John-Marc Chandonia. Scope: Structural classification of proteins—extended, integrating scop and astral data and classification of new structures. Nucleic acids research, 42(D1):D304–D309, 2013.
- [10] Hao Geng, Fangfang Chen, Jing Ye, and Fan Jiang. Applications of molecular dynamics simulation in structure prediction of peptides and proteins. Computational and Structural Biotechnology Journal, 17:1162–1170, 2019.
- [11] Jörg Gsponer and Amedeo Caflisch. Molecular dynamics simulations of protein folding from the transition state. Proceedings of the National Academy of Sciences, 99(10):6719–6724, 2002.
- [12] Michael Heinzinger, Ahmed Elnaggar, Yu Wang, Christian Dallago, Dmitrii Nechaev, Florian Matthes, and Burkhard Rost. Modeling the language of life – deep learning protein sequences. bioRxiv, 2019.
- [13] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
- [14] M. Karplus and J. Kuriyan. Molecular dynamics and protein function. Proceedings of the National Academy of Sciences, 102(19):6679–6685, 2005.
- [15] Michael Schantz Klausen, Martin Closter Jespersen, Henrik Nielsen, Kamilla Kjaergaard Jensen, Vanessa Isabell Jurtz, Casper Kaae Soenderby, Morten Otto Alexander Sommer, Ole Winther, Morten Nielsen, Bent Petersen, et al. Netsurfp-2.0: Improved prediction of protein structural features by integrated deep learning. Proteins: Structure, Function, and Bioinformatics, 2019.
- [16] Jianzhu Ma. Protein structure prediction by protein alignments. CoRR, abs/1510.05682, 2015.
- [17] Tomás Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. In Yoshua Bengio and Yann LeCun, editors, 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track Proceedings, 2013.
- [18] John Moult, Krzysztof Fidelis, Andriy Kryshtafovych, Torsten Schwede, and Anna Tramontano. Critical assessment of methods of protein structure prediction (CASP)-Round XII. Proteins: Structure, Function, and Bioinformatics, 86:7–15, 2018.
- [19] Roshan Rao, Nicholas Bhattacharya, Neil Thomas, Yan Duan, Xi Chen, John Canny, Pieter Abbeel, and Yun S. Song. Evaluating protein transfer learning with tape, 2019.
- [20] Roshan Rao, Jason Liu, Robert Verkuil, Joshua Meier, John F. Canny, Pieter Abbeel, Tom Sercu, and Alexander Rives. Msa transformer. bioRxiv, 2021.
- [21] Alexander Rives, Siddharth Goyal, Joshua Meier, Demi Guo, Myle Ott, C. Lawrence Zitnick, Jerry Ma, and Rob Fergus. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. bioRxiv, 2019.
- [22] Gabriel J Rocklin, Tamuka M Chidyausiku, Inna Goreshnik, Alex Ford, Scott Houliston, Alexander Lemak, Lauren Carter, Rashmi Ravichandran, Vikram K Mulligan, Aaron Chevalier, et al. Global analysis of protein folding using massively parallel design, synthesis, and testing. Science, 357(6347):168–175, 2017.
- [23] Karen S Sarkisyan, Dmitry A Bolotin, Margarita V Meer, Dinara R Usmanova, Alexander S Mishin, George V Sharonov, Dmitry N Ivankov, Nina G Bozhanova, Mikhail S Baranov, Onuralp Soylemez, et al. Local fitness landscape of the green fluorescent protein. Nature, 533(7603):397, 2016.
- [24] Tom Sercu, Robert Verkuil, Joshua Meier, Brandon Amos, Zeming Lin, Caroline Chen, Jason Liu, Yann LeCun, and Alexander Rives. Neural potts model. bioRxiv, 2021.
- [25] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism, 2020.
- [26] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2017.
- [27] Sheng Wang, Jian Peng, Jianzhu Ma, and Jinbo Xu. Protein secondary structure prediction using deep convolutional neural fields. Scientific reports, 6, 01 2016.
- [28] Fisher Yu, Vladlen Koltun, and Thomas Funkhouser. Dilated residual networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 636–644, 2017.

