
Predicting Anti-microbial Resistance using Large Language Models

Hyunwoo Yoo, Bahrad Sokhansanj, James R. Brown, Gail Rosen
Drexel University
{hty23, bas44, jb4633, glr26}@drexel.edu
Abstract

As antibiotic resistance increases and infectious diseases such as COVID-19 spread, it has become important to classify genes related to antibiotic resistance. As natural language processing has advanced with transformer-based language models, many language models that learn the characteristics of nucleotide sequences have also emerged, and they show good performance in classifying various features of nucleotide sequences. Classifying nucleotide sequences can draw not only on the sequence itself but also on background biological knowledge. In this study, we use not only a nucleotide sequence-based language model but also a text language model based on PubMed articles, so that more biological background knowledge is reflected in the model. We propose a method to fine-tune the nucleotide sequence language model and the text language model on several databases of antibiotic resistance genes, an LLM-based augmentation technique to supplement scarce data, and an ensemble method to effectively combine the two models. We also propose a benchmark for evaluating the model. Our method achieves better performance than the nucleotide sequence language model alone in drug resistance class prediction.


Figure 1: Overview of our approach
Entity Representation | Input Example | BioBERT
Base | Gene Family: Beta-lactamases, Resistance Mechanism: Antibiotic inactivation | 78.20
Entity marker (punct) | [Gene Family]: Beta-lactamases, [Resistance Mechanism]: Antibiotic inactivation | 77.41
Typed entity marker | *Beta-lactamases*, #Resistance Mechanism# | 77.70
Typed entity marker (punct) | *[Gene Family]: Beta-lactamases*, #[Resistance Mechanism]: Antibiotic inactivation# | 78.46
Table 1: Test macro F1 score of different entity representation techniques in antibiotic resistance drug class prediction with BioBERT.

1 Introduction

Antibiotic resistance genes have increased rapidly over the past 10 years and have become a threat to human health Zhang et al. (2022), while dangerous infectious diseases like COVID-19 continue to spread. In this context, it is important to classify the DNA sequences of antibiotic resistance genes. In bioinformatics, the main method for classifying DNA sequences has been to find similar sequences through pairwise sequence alignment Bonin et al. (2023). Recently, methods have emerged that take language models pre-trained on the nucleotide or protein sequences of various species and fine-tune them into classifiers Brandes et al. (2022); Ji et al. (2021); Zhou et al. (2023). These methods have the advantage of being able to identify which parts of the nucleotide sequence are important. Fine-tuning requires databases containing information on antibiotic resistance genes; the main ones are CARD Jia et al. (2017) and MEGARes Doster et al. (2020). Existing methods predict a single label from a single gene sequence, such as the class of antibiotic to which a gene confers resistance Kang et al. (2022). However, the CARD and MEGARes databases describe each gene with several attributes, including Gene Family and Resistance Mechanism. Using this information when predicting the antibiotic to which a gene confers resistance could aid prediction. Building on this idea, we propose a model that uses human-readable information to predict antibiotic resistance genes. We also provide a method to merge the different classification systems of CARD and MEGARes, and we describe an LLM-based data augmentation technique for rare classes with few samples.

2 Approaches

Our approach fine-tunes a language model pre-trained on gene nucleotide sequences from various species to predict antibiotic resistance genes and their classes. We also fine-tune a language model pre-trained on a corpus of papers from the fields of biology and medicine to predict antibiotic resistance classes from textual descriptions of gene properties. We combine the two models in an effective ensemble Kumari et al. (2021) using weighted soft voting. To integrate the class systems, we combine the DNA sequences and the concepts that describe them from CARD and MEGARes into one, using the EBI ARO ontology Cook et al. (2016) to merge CARD tagging and MEGARes tagging into a single class system. For rare classes with few samples, we perform data augmentation by prompting BioGPT Luo et al. (2022).

2.1 Nucleotide Sequence Based Antibiotic Resistance Drug Class Classification

Following the structure of Dalla-Torre et al. (2023), we use a large language model pre-trained on nucleotide sequences and fine-tune a classifier on Drug Class data. The nucleotide sequence input is limited to a length of 1000, the input size of the pre-trained model. The tokenizer is a 6-mer tokenizer, a type of k-mer tokenizer; a k-mer tokenizer is a technique used in genome analysis and bioinformatics that splits a biological sequence into substrings of length k Mejía-Guerra and Buckler (2019). The pre-trained model is the Nucleotide Transformer (NT), which is pre-trained on gene sequences from multiple species, including bacteria, fungi, invertebrates, protozoa, and vertebrates. Unlike other nucleotide sequence-based pre-trained models that mostly use human genes, this model is trained on multi-species genes, providing a better representation. Fine-tuning is done with LoRA, a method that freezes the weights of a pre-trained large-scale language model and inserts a low-rank decomposition matrix into each transformer layer, dramatically reducing the number of trainable parameters for the downstream task Hu et al. (2021). This allows for more efficient fine-tuning.
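As an illustration, the following is a minimal sketch of this setup using the HuggingFace transformers and peft libraries. The checkpoint name, label count, and LoRA hyperparameters are assumptions for illustration, not the exact configuration used here; the k-mer helper only demonstrates non-overlapping 6-mer splitting, which NT's own tokenizer performs internally.

```python
# Minimal sketch: non-overlapping 6-mer splitting and LoRA fine-tuning
# with HuggingFace transformers + peft. Checkpoint name, label count,
# and LoRA hyperparameters are illustrative assumptions.
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import LoraConfig, TaskType, get_peft_model

def kmer_tokens(sequence: str, k: int = 6) -> list:
    """Split a nucleotide sequence into non-overlapping k-mers."""
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, k)]

print(kmer_tokens("ATGCGTACGTTA"))  # ['ATGCGT', 'ACGTTA']

model_name = "InstaDeepAI/nucleotide-transformer-2.5b-multi-species"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)  # NT's own 6-mer vocabulary
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=9)  # e.g., 9 integrated drug classes

# LoRA: freeze the base model; train low-rank adapters in attention layers.
lora_config = LoraConfig(task_type=TaskType.SEQ_CLS, r=8, lora_alpha=16,
                         lora_dropout=0.1, target_modules=["query", "value"])
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction is trainable
```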

2.2 Text Information Based Antibiotic Resistance Drug Class Classification

Text information based antibiotic resistance drug class classification uses BioBERT, a language model pre-trained on a large medical and biological text corpus, as the pre-trained model. BioBERT is a pre-trained biomedical language representation model trained on a large-scale biomedical text corpus including PubMed abstracts, PMC full-text articles, and the GENIA corpus Lee et al. (2020). We fine-tune this model to predict antibiotic resistance drug classes from text that describes antibiotic resistance gene attributes such as Gene Family or Resistance Mechanism. We aim to improve the performance of the classifier by utilizing a pre-trained biomedical text-based model. Instead of using multiple classification layers, we create a single classification layer and fine-tune it. The training data is structured as [Resistance Mechanism] followed by a description of the attribute, such as Antibiotic inactivation. To further improve performance, we enclose the entities in special characters Zhou and Chen (2021), e.g., *[Gene Family]: Beta-lactamases*, #[Resistance Mechanism]: Antibiotic inactivation#.
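A minimal sketch of the typed entity marker formatting and a single-head BioBERT classifier might look as follows, assuming the public dmis-lab/biobert-v1.1 checkpoint; the label count is illustrative.

```python
# Minimal sketch: typed entity marker (punct) formatting and a
# single-head BioBERT classifier. Label count is illustrative.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

def format_with_typed_markers(gene_family: str, mechanism: str) -> str:
    """Wrap each attribute in typed entity markers with punctuation."""
    return (f"*[Gene Family]: {gene_family}*, "
            f"#[Resistance Mechanism]: {mechanism}#")

text = format_with_typed_markers("Beta-lactamases", "Antibiotic inactivation")

tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-v1.1")
model = AutoModelForSequenceClassification.from_pretrained(
    "dmis-lab/biobert-v1.1", num_labels=9)  # one classification layer

inputs = tokenizer(text, truncation=True, return_tensors="pt")
logits = model(**inputs).logits  # one logit per drug class
```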

Method Accuracy Macro F1 Precision Recall
NT 84.15 64.04 72.78 59.28
NT with data augmentation 83.42 64.85 80.15 58.65
NT with reads 82.85 61.02 68.32 57.06
NT with reads and data augmentation 83.11 62.82 74.81 57.32
Table 2: Results of data augmentation for classes with few samples. Data augmentation increases the macro F1 score.

2.3 Weighted Soft-voting Ensemble

To combine the pre-trained nucleotide sequence-based language model and the pre-trained text-based language model described above, we use a soft-voting ensemble model. We find the optimal weights on validation data and apply them to create a weighted soft-voting ensemble model; a more detailed explanation of the validation data, a third dataset separate from the training and test data, is given in the Experiments section. This allows us to use both the nucleotide sequence information and the text information that describes it. The model requires both types of input: it receives the nucleotide sequence together with Gene Family and Resistance Mechanism information in the format [Resistance Mechanism]: Antibiotic efflux, #[Gene Family]: Beta-lactamases#.
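A minimal sketch of weighted soft voting, with the weight chosen by a simple grid search on validation accuracy (the grid and the selection criterion are illustrative assumptions):

```python
# Minimal sketch: weighted soft voting over the two models' class
# probabilities, with the weight chosen on the validation split.
import numpy as np

def weighted_soft_vote(p_nt: np.ndarray, p_bb: np.ndarray, w: float) -> np.ndarray:
    """Blend nucleotide-model (p_nt) and text-model (p_bb) probabilities."""
    return w * p_nt + (1.0 - w) * p_bb

def best_weight(p_nt, p_bb, y_val, grid=np.linspace(0.0, 1.0, 101)):
    """Grid-search the blend weight that maximizes validation accuracy."""
    accuracy = lambda w: (weighted_soft_vote(p_nt, p_bb, w).argmax(1) == y_val).mean()
    return max(grid, key=accuracy)
```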

2.4 Integrating Classes Based on Antibiotic Resistance Ontology

The databases provided in the literature (CARD, MEGARes) have different classification systems and hierarchical relationships. EBI ARO, from the European Bioinformatics Institute, provides hierarchical information on antibiotic resistance genes. These diverse antibiotic resistance classification systems, gene groupings, and resistance mechanisms can be combined through the EBI ontology, and the model can store integrated concept representations. Each database's header is read and searched against the EBI API; the mapped items are used as the new Gene Family labels. Rather than using very small and specific hierarchical classes, more general hierarchical classes are employed: the third level from the top of the EBI ARO hierarchy is used as the basis.

Figure 2: EBI ARO Gene Family mapping: the database header is searched against the ontology via the API to find mapping information.
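As a sketch, a header could be mapped to an ARO term through the EBI OLS search API roughly as follows. The endpoint and response fields follow the public OLS documentation as we understand it and should be treated as assumptions.

```python
# Minimal sketch: mapping a database header to an ARO term via the
# EBI OLS search API. Endpoint and response fields are assumptions
# based on the public OLS documentation.
import requests

def search_aro(term: str):
    """Return the top ARO hit for a header term, or None."""
    resp = requests.get("https://www.ebi.ac.uk/ols/api/search",
                        params={"q": term, "ontology": "aro"}, timeout=30)
    resp.raise_for_status()
    docs = resp.json().get("response", {}).get("docs", [])
    return docs[0] if docs else None

hit = search_aro("beta-lactamase")
if hit:
    print(hit.get("obo_id"), hit.get("label"))  # e.g., an ARO id and name
```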

2.5 Data Augmentation Using a Large Language Model

The categories were integrated based on the EBI ARO ontology's gene groups and the CARD Resistance Mechanisms. However, some classes still have few samples, so we perform data augmentation for these cases using BioGPT: similar data are generated through prompting. As Table 2 shows, this improves performance; in particular, accuracy increases for classes with a small number of samples.
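A minimal sketch of such prompting with the public microsoft/biogpt checkpoint; the prompt and decoding settings are illustrative, not the exact ones used here.

```python
# Minimal sketch: prompting BioGPT to paraphrase a rare class's
# description. Prompt and decoding settings are illustrative.
from transformers import pipeline

generator = pipeline("text-generation", model="microsoft/biogpt")

prompt = ("Beta-lactamases confer resistance through antibiotic "
          "inactivation. In other words,")
samples = generator(prompt, max_new_tokens=40, do_sample=True,
                    temperature=0.9, num_return_sequences=3)
for s in samples:
    print(s["generated_text"])  # candidate augmented descriptions
```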

3 Experiments

Dataset Method Accuracy Macro F1 Precision Recall
CARD NT 87.92 63.08 66.46 61.51
CARD BB 97.22 89.68 92.09 90.54
CARD Ensemble 97.55 93.44 95.72 92.86
MEGARes NT 89.61 46.42 54.92 43.94
MEGARes BB 99.64 99.47 99.96 99.03
MEGARes Ensemble 99.99 99.99 99.99 99.99
Integrated NT 82.89 65.79 81.84 58.67
Integrated BB 90.26 79.34 84.05 77.14
Integrated Ensemble 92.11 80.95 83.52 78.94
Integrated with reads NT 83.11 62.82 74.81 57.32
Integrated with reads BB 90.24 79.34 84.05 77.14
Integrated with reads Ensemble 93.40 81.85 84.34 80.25
Table 3: Results on the CARD, MEGARes, and Integrated databases for antibiotic resistance drug class prediction using the Nucleotide Transformer (NT), BioBERT (BB), and a weighted ensemble of both. The weighted ensemble of NT and BB shows better performance on every dataset.

3.1 Datasets

The CARD and MEGARes v3 datasets are used for training and evaluation. Classes with fewer than 15 samples are removed, because obtaining meaningful results from the data split is otherwise difficult. The remaining data is split into 75% training, 20% test, and 5% validation data. The EBI ARO ontology search is used to integrate the data, which is then split in the same way; classes too small to yield meaningful results are likewise removed. The MEGARes dataset consists of 9733 reference sequences, 1088 SNPs, 4 antibiotic types, 59 resistance classes, and 233 mechanisms. The CARD dataset consists of 5194 reference sequences, 2005 SNPs, 142 Drug Classes, 331 Gene Families, and 10 Resistance Mechanisms. The EBI ARO ontology provides hierarchical group information for genes, so Gene Family class information can be integrated into a higher-level hierarchy. There are 589 Gene Family text classes in MEGARes and 331 in CARD; 300 classes in MEGARes and 166 in CARD contain only one sample. Resistance Mechanism is integrated based on the 6 categories of CARD; the original 8 categories were reduced to 6 by excluding combinations of classes and categories with very few samples. Drug Class is integrated using the 9 common Drug Classes found in competing models. Integration is done based on names and domain knowledge and has been verified. Macro F1 score, accuracy, precision, and recall are used as performance metrics, and the results are listed in Table 3.
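A minimal sketch of the class filtering and the 75/20/5 stratified split, assuming scikit-learn; the helper name and random seed are illustrative.

```python
# Minimal sketch: drop classes with fewer than 15 samples, then make
# a stratified 75/20/5 train/test/validation split with scikit-learn.
from collections import Counter
from sklearn.model_selection import train_test_split

def filter_and_split(seqs, labels, min_count=15, seed=42):
    counts = Counter(labels)
    keep = [i for i, y in enumerate(labels) if counts[y] >= min_count]
    X = [seqs[i] for i in keep]
    y = [labels[i] for i in keep]
    # 75% train / 25% held out, then split held-out 80/20 so the
    # overall proportions are 75% / 20% / 5%.
    X_tr, X_rest, y_tr, y_rest = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed)
    X_te, X_val, y_te, y_val = train_test_split(
        X_rest, y_rest, test_size=0.2, stratify=y_rest, random_state=seed)
    return (X_tr, y_tr), (X_te, y_te), (X_val, y_val)
```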

3.2 Implementation Details

The basic structure of the model and the fine-tuning procedure follow the methods proposed by BioBERT and the Nucleotide Transformer. The model layers and further details are given in the Appendix.

3.3 Main Results

Table 3 compares our method against the latest (SOTA) text-based information models on the CARD and MEGARes experiments, showing that our method surpasses the previous SOTA. The method using integrated data likewise outperforms the previous SOTA, and our method demonstrates competitive results against other competing models.

4 Discussion

Does text information help?

Across datasets, using a text information-based language model yields improvements of 9.53 points in accuracy and 30.34 points in macro F1 on CARD, and 10.38 points in accuracy and 50.57 points in macro F1 on MEGARes. The weighted ensemble models show better performance than the alternatives in our experiments. Existing NT and other nucleotide sequence-based models find it difficult to process natural language. Our fine-tuned text-based language model was trained with modest resources (a single 40GB A100 GPU). By constructing an ensemble model, it achieves better performance than competing models such as AMR-meta Marini et al. (2022), Meta-MARC Lakin et al. (2019), and DeepARG Arango-Argoty et al. (2018).

Does text information class integration help?

To enable comparison with competing models, we integrated the class systems. This also allows us to build models for predicting Gene Family and Resistance Mechanism, where the number of samples per class is often very small; the integration makes these prediction models feasible. The integrated class system shows better performance than the non-integrated one, and the number of genes available for training increases.

Sequencing Read Generation

Some competing models recommend using reads instead of full genes; AMR-meta, for instance, predicts from paired-end reads. To compare with these models, it is necessary to generate reads. Read generation uses ART, a simulator that accurately models sequencing data Huang et al. (2012) and offers customizable indel error rates Milhaven and Pfeifer (2023). Training and experiments using these reads are presented in Table 3 (the "with reads" rows). In this setting, the proposed model also remains strongly competitive.
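A hedged sketch of invoking ART's art_illumina simulator from Python; the flag values here are illustrative assumptions and should be checked against the ART manual.

```python
# Minimal sketch: simulating paired-end Illumina reads from reference
# genes with ART's art_illumina. Flag values are illustrative and
# should be verified against the ART manual.
import subprocess

subprocess.run([
    "art_illumina",
    "-ss", "HS25",        # HiSeq 2500 error profile
    "-i", "arg_refs.fa",  # reference resistance gene sequences
    "-p",                 # paired-end reads
    "-l", "150",          # read length
    "-f", "20",           # fold coverage
    "-m", "200",          # mean fragment length
    "-s", "10",           # fragment length std. dev.
    "-o", "arg_reads_",   # output prefix
], check=True)
```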

5 Related Work

AMR-meta is a method for classifying antibiotic resistance in high-throughput metagenomic data. It uses an alignment-free approach based on k-mers and meta-features, and it utilizes both resistant and non-resistant genes as training data. As a result, AMR-meta can identify antibiotic resistance genes more accurately and reduce false-positive rates for non-resistant genes. However, it uses a complex matrix decomposition method to generate meta-features, which can be computationally intensive, and its prediction performance may vary with the type of antibiotic or the diversity of the resistance genes. These characteristics make AMR-meta useful for analyzing high-throughput metagenomic data, but they suggest it may be limited in certain situations.
AMR++ is a customized bioinformatics pipeline that uses high-throughput sequencing data to predict the diversity and abundance of antibiotic resistance genes (ARGs). This pipeline is integrated with the MEGARes database, allowing for efficient analysis of ARGs in large-scale metagenomic sequencing data. The main advantage of AMR++ is its high throughput and efficiency, enabling users to quickly and accurately analyze complex datasets. In addition, this software can distinguish between types of ARGs, including cases where resistance genes require specific mutations. However, this pipeline requires high-quality assembled and/or translated data, which may cause difficulties or limitations in generating metagenomic datasets. Furthermore, AMR++ may require advanced bioinformatics skills and resources, potentially limiting accessibility for some researchers.
Meta-MARC is a machine learning classifier developed to enhance the detection and classification of antibiotic resistance genes. This system is based on the MEGARes database and uses DNA-based hierarchical Hidden Markov Models (HMMs) to classify antibiotic resistance genes in high-throughput sequencing data. Meta-MARC is robust against various gene mutations, which is particularly useful for non-standard databases and sequences. This tool provides high sensitivity and specificity, playing a crucial role in accurate antibiotic resistance detection. However, Meta-MARC is computationally demanding, particularly when dealing with large datasets, which can result in increased processing time and memory usage. Additionally, high sensitivity settings may potentially increase false positives, so users must carefully interpret the results.
DeepARG is a deep learning-based system for predicting antibiotic resistance genes (ARGs) in metagenomic data. It uses two models, DeepARG-SS and DeepARG-LS, for classifying short reads and full-length gene sequences, respectively. Compared to the traditional 'best hit' approach, it has the advantage of identifying a wider range of ARG diversity with lower false-negative rates. However, the performance of the system depends heavily on the quality of the training database, and it has limitations in predicting new categories of ARGs. Despite these limitations, DeepARG is a useful tool for evaluating the presence and diversity of ARGs in environmental samples.

6 Conclusion

To our knowledge, our work is the first to combine natural language models and biological sequence models to predict antibiotic resistance genes. We proposed an ensemble that combines two language models over different attribute types. By using both nucleotide sequence information and its description, including Gene Family and Resistance Mechanism information, it enables more accurate drug class predictions. We also integrated multiple databases using the EBI ontology and used a large language model (LLM) for data augmentation in classes with insufficient data. As a result, we achieved performance close to the state of the art, and we believe this fusion is significant. Moreover, we tested the trained architecture using only nucleotide sequences and obtained acceptable results, which seems promising for future research.

Acknowledgements

This work is supported by the National Science Foundation.

References