Enhancing Clinical Efficiency through LLM:
Discharge Note Generation for Cardiac Patients
Abstract
Medical documentation, including discharge notes, is crucial for ensuring patient care quality, continuity, and effective medical communication. However, the manual creation of these documents is not only time-consuming but also prone to inconsistencies and potential errors. The automation of this documentation process using artificial intelligence (AI) represents a promising area of innovation in healthcare.
This study directly addresses the inefficiencies and inaccuracies in creating discharge notes manually, particularly for cardiac patients, by employing AI techniques, specifically large language models (LLMs). Utilizing a substantial dataset from a cardiology center, encompassing wide-ranging medical records and physician assessments, our research evaluates the capability of LLMs to enhance the documentation process.
Among the various models assessed, Mistral-7B distinguished itself by accurately generating discharge notes that significantly improve both documentation efficiency and the continuity of care for patients. These notes underwent rigorous qualitative evaluation by a medical expert, receiving high marks for their clinical relevance, completeness, readability, and contribution to informed decision-making and care planning. Coupled with quantitative analyses, these results confirm Mistral-7B’s efficacy in distilling complex medical information into concise, coherent summaries.
Overall, our findings illuminate the considerable promise of specialized LLMs, such as Mistral-7B, in refining healthcare documentation workflows and advancing patient care. This study lays the groundwork for further integrating advanced AI technologies in healthcare, demonstrating their potential to revolutionize patient documentation and support better care outcomes.
1 Introduction
Discharge notes are essential in the healthcare sector, serving as a comprehensive summary of a patient’s hospital stay, including diagnosis, treatments, and follow-up care recommendations. These documents are crucial for ensuring smooth transitions between care settings, enhancing communication among healthcare providers, and supporting effective patient management after hospitalization. Furthermore, they play a significant role in reducing readmission rates by facilitating the proper management of ongoing care plans, thus maintaining the quality and safety of patient care.
With the advancement of AI, especially natural language processing (NLP) algorithms, there has been a growing interest in their application across various domains, including healthcare. NLP technologies have shown capability in automating tasks that require understanding and generating human language, transforming the way we interact with data and technology. In healthcare, these applications range from automating clinical documentation to enhancing patient interaction with care providers through conversational agents.
LLMs have found significant applications in the medical field, including creating clinical notes, interpreting lab results, and anonymizing patient data (Yang et al., 2023a). Their ability to generate human-like text suggests that they can increase both efficiency and accuracy in medical documentation, potentially reducing the administrative workload for healthcare professionals and allowing them more time for patient care. However, implementing LLMs in healthcare faces challenges, notably the linguistic diversity and medical jargon prevalent in multi-cultural hospital settings. This complexity requires LLMs to comprehend and interpret medical terminology across different languages and to understand the nuances of clinical communication.
Our study underscores the pivotal role of deploying specialized LLMs, particularly Mistral-7B, in automating the creation of discharge notes within cardiology, utilizing actual hospital patient data. By navigating the complexities of genuine patient records, Mistral-7B significantly elevates documentation efficiency and continuity. Endorsed by a cardiology professional for its clinical relevance and utility, Mistral-7B’s ability to streamline healthcare documentation with real-world data showcases a promising avenue towards integrating AI-driven tools in enhancing patient care and optimizing medical record accuracy. This advancement marks a crucial step forward in leveraging cutting-edge AI to refine healthcare documentation practices directly from the front lines of patient care.
2 Related works
The application of NLP in the medical field has been experiencing a steady increase. This trend is reflected in numerous studies and projects that leverage NLP techniques to extract valuable insights from medical texts, enhance patient care, and facilitate medical research. As such, our work contributes to this growing body of research, further demonstrating the potential of NLP in healthcare.
Clinical Notes Generation The generation of clinical notes, particularly radiology reports (Yang et al., 2023b), represents a critical intersection between NLP and clinical practices. Traditional methods of generating radiology reports have been both time-consuming and tedious for radiologists, prompting the exploration of automated systems. Recent advances have seen the development of multi-modal approaches that combine images, disease labels, and textual data to generate comprehensive radiology reports.
Patient Summary Generation Generating patient summaries (Shing et al., 2021) is an area where NLP is making strides, addressing the need for efficient tools to concisely capture crucial patient information. Recent work proposes an extractive-abstractive summarization pipeline that extracts key sentences from clinical notes and abstracts them into coherent summaries. This approach aids in handling large text volumes while ensuring faithfulness and traceability to original documents. NLP applications in clinical note and patient summary generation can enhance accuracy, efficiency, and comprehensiveness of medical documentation, supporting better patient outcomes and alleviating healthcare professionals’ workload.
Unlike previous studies, we directly generate comprehensive discharge records, including chief complaints, medical history, hospital course, discharge status, and follow-up instructions, using large language models. Additionally, we have developed a model specialized in cardiac-related medical terminology and document writing, utilizing a large dataset from the cardiology department of an actual hospital.
Large Language Model (LLM) Recent years have witnessed the rise of powerful LLMs that have revolutionized various natural language processing tasks, including text generation, summarization, and question answering. These models, trained on vast amounts of textual data, have demonstrated remarkable capabilities in understanding and generating human-like text.
One of the pioneering open-source models in this domain is Llama (Touvron et al., 2023), developed by Meta AI. Llama is a family of models ranging from 7 billion to 65 billion parameters, trained on trillions of tokens of publicly available data. These models have exhibited impressive performance on a wide range of tasks, including open-ended generation, question answering, and code generation. Llama’s scalability and adaptability have made it a popular choice among researchers and developers.
Another notable open-source LLM is Mistral (Jiang et al., 2023), developed by Mistral AI. This model has demonstrated remarkable performance across numerous benchmarks, outperforming other models of the same size on many tasks. Notably, it has also outperformed larger Llama models across various benchmarks. With its innovative architectural design and specialization for conversational tasks, Mistral has emerged as a powerful and versatile language model, poised to make significant contributions in natural language processing and its applications.
3 Dataset
3.1 Ethical approval
This study’s protocols received approval from the Asan Medical Center Institutional Review Board (IRB No.2023-1001), aligning with the principles outlined in the 2008 Declaration of Helsinki. Moreover, the need for informed consent was waived due to the utilization of an anonymous, de-identified database for research purposes.
3.2 Data
This study utilizes a comprehensive dataset derived from the Cardiology Department of Asan Medical Center, focusing on patients admitted for care. The data was collected through the Asan Biomedical Research Environment (ABLE) system, which ensures a high standard of data integrity and relevance for clinical research (Shin et al., 2013).
The dataset encompasses patient records spanning from September 2018 to December 2021, providing a broad temporal snapshot of patient care within the institution. To limit computational cost and training time, we kept only records that yield fewer than 2,048 tokens upon tokenization, resulting in a final dataset of 4,588 unique patient records. These records were then divided into training, validation, and testing sets for model development and evaluation. Specifically, the dataset is divided into 4,077 records for training, 122 records for validation, and 459 records for testing purposes. This division allows for comprehensive training of the models while ensuring robust validation and testing to evaluate the performance accurately.
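The length filter and the split described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the whitespace token counter stands in for the model tokenizer, the random shuffle is an assumed splitting strategy (the text does not say how records were assigned), and the toy records are placeholders.

```python
import random


def filter_and_split(records, count_tokens, max_tokens=2048,
                     n_val=1, n_test=1, seed=0):
    """Keep records shorter than max_tokens, then split them.

    The shuffle-based split is an assumption made for illustration.
    """
    kept = [r for r in records if count_tokens(r) < max_tokens]
    rng = random.Random(seed)
    rng.shuffle(kept)
    test = kept[:n_test]
    val = kept[n_test:n_test + n_val]
    train = kept[n_test + n_val:]
    return train, val, test


def whitespace_token_count(text):
    """Hypothetical stand-in for the real tokenizer's token count."""
    return len(text.split())


# Toy placeholder records: 2, 10, 20 and 4000 whitespace tokens.
toy_records = ["short note " * k for k in (1, 5, 10, 2000)]
train, val, test = filter_and_split(
    toy_records, whitespace_token_count, max_tokens=2048, n_val=1, n_test=1
)
```

In practice `count_tokens` would wrap the tokenizer of the model being fine-tuned, since token counts differ between model families.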
In this research, the Progress Notes documenting the detailed course of the patient’s treatment were utilized as the input data for our model, while the Discharge Notes reflecting the patient’s status at the point of discharge served as the target or label data. The Progress Notes include fields such as RECORD DATE, PROBLEM LIST, SUBJECTIVE, OBJECTIVE, ASSESSMENT, GOAL, PLAN, and COMMENT, offering a detailed account of the patient’s clinical status and treatment plan during their stay. Meanwhile, the Discharge Notes are composed of sections detailing the CHIEF COMPLAINT, OPERATION AND PROCEDURE, HOSPITAL COURSE, CONDITION AT DISCHARGE, and TYPE OF DISCHARGE. This structure ensures a comprehensive overview of the patient’s hospital journey, from admission to discharge, providing a valuable foundation for automating the generation of discharge notes using LLMs.
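One way to picture the input/target pairing is the sketch below, which renders progress notes into a prompt and a discharge note into the completion. The field names come from the text above; the prompt template, the `build_example` helper, and the toy note contents are hypothetical and are not taken from the study.

```python
PROGRESS_FIELDS = [
    "RECORD DATE", "PROBLEM LIST", "SUBJECTIVE", "OBJECTIVE",
    "ASSESSMENT", "GOAL", "PLAN", "COMMENT",
]
DISCHARGE_FIELDS = [
    "CHIEF COMPLAINT", "OPERATION AND PROCEDURE", "HOSPITAL COURSE",
    "CONDITION AT DISCHARGE", "TYPE OF DISCHARGE",
]


def render(note, fields):
    """Render the non-empty fields of a note as 'FIELD: value' lines."""
    return "\n".join(f"{name}: {note[name]}" for name in fields if note.get(name))


def build_example(progress_notes, discharge_note):
    """Pair all progress notes of one stay (input) with its discharge note (target).

    The '### ...' section markers are an assumed template.
    """
    source = "\n\n".join(render(n, PROGRESS_FIELDS) for n in progress_notes)
    prompt = "### Progress Notes\n" + source + "\n\n### Discharge Note\n"
    return {"prompt": prompt, "completion": render(discharge_note, DISCHARGE_FIELDS)}


# Toy placeholder content, invented purely for illustration.
example = build_example(
    [{"RECORD DATE": "2021/03/04", "SUBJECTIVE": "chest pain", "PLAN": "observe"}],
    {"CHIEF COMPLAINT": "chest pain", "TYPE OF DISCHARGE": "routine"},
)
```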
4 Methods

4.1 Models
We employed a diverse array of models to automate the generation of discharge notes from detailed patient records. Using supervised fine-tuning (SFT), we adapted several pre-trained LLMs to our specialized medical documentation task. Specifically, we utilized TinyLlama-1.1B (Zhang et al., 2024), Llama2-7B (Touvron et al., 2023), Mistral-7B (Jiang et al., 2023), BioMistral-7B (Labrak et al., 2024), Meditron-7B (Chen et al., 2023), and SOLAR-10.7B (Kim et al., 2023), each selected for its strengths and potential in handling complex language tasks pertinent to the medical domain. We also used the Unsloth library (Han, 2023) to facilitate the fine-tuning of the language models, optimize VRAM usage, and significantly accelerate training.
4.2 Parameter Efficient Fine Tuning (PEFT)
A significant technique incorporated into our model fine-tuning approach is Quantized Low-Rank Adaptation (QLoRA) (Dettmers et al., 2024), a method that falls under Parameter Efficient Fine Tuning (PEFT) (Mangrulkar et al., 2022). This methodology is designed to overcome limitations arising from limited computational resources while maintaining performance levels.
PEFT is a fine-tuning strategy that counterbalances the substantial computational demand and memory usage associated with Transformer-based LLMs, particularly those encompassing billions of parameters. The essence of PEFT lies in its ability to fine-tune pre-trained language models (PLMs) by updating only a minimal subset of parameters while maintaining their performance on specific tasks. This approach significantly mitigates the challenges inherent in deploying these models, especially in settings where computational resources are at a premium.
The adoption of PEFT methodologies enables the efficient utilization of LLMs for generating medically accurate and contextually precise discharge notes, aligning with our objective of streamlining healthcare documentation processes.
4.2.1 LoRA Parameters
The LoRA fine-tuning process employed the following hyperparameters:
- r : 8
- lora alpha : 16
- lora dropout : 0
- target module : q, k, v, o, gate, up, down
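To give a sense of what these settings imply, the sketch below counts the trainable LoRA parameters for a Mistral-7B-sized model. A rank-r adapter on a linear layer of shape (d_in, d_out) adds r × (d_in + d_out) weights. The layer dimensions are assumptions taken from the publicly released Mistral-7B configuration, not from this paper, and the listed target names are assumed to refer to the attention and MLP projection layers.

```python
LORA_R = 8        # "r" from the list above
LORA_ALPHA = 16   # "lora alpha" from the list above

# Assumed Mistral-7B dimensions (public model configuration, not stated in the paper).
HIDDEN = 4096          # model hidden size
KV_DIM = 1024          # 8 key/value heads x 128 head dim (grouped-query attention)
INTERMEDIATE = 14336   # MLP intermediate size
NUM_LAYERS = 32

# (d_in, d_out) of each targeted projection in one transformer layer.
TARGET_SHAPES = {
    "q": (HIDDEN, HIDDEN),
    "k": (HIDDEN, KV_DIM),
    "v": (HIDDEN, KV_DIM),
    "o": (HIDDEN, HIDDEN),
    "gate": (HIDDEN, INTERMEDIATE),
    "up": (HIDDEN, INTERMEDIATE),
    "down": (INTERMEDIATE, HIDDEN),
}


def lora_params(d_in, d_out, r):
    """Weights added by one adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


per_layer = sum(lora_params(d_in, d_out, LORA_R) for d_in, d_out in TARGET_SHAPES.values())
total_trainable = per_layer * NUM_LAYERS
scaling = LORA_ALPHA / LORA_R  # factor applied to the low-rank update

print(per_layer, total_trainable, scaling)
```

Under these assumptions the adapters amount to roughly 21 million trainable weights, well under one percent of the roughly 7 billion frozen base parameters, which illustrates why the approach fits on a single consumer GPU.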
4.3 Supervised Fine Tuning (SFT)
Supervised Fine Tuning (SFT) is a crucial component of our methodology, aiming to adapt pre-trained language models (PLMs) to the specialized task of generating discharge notes from patient medical records. Unlike the broader, unsupervised training approaches traditionally associated with LLMs, SFT leverages labeled datasets curated to mirror the task at hand. The fundamental distinction of SFT lies in its use of previously validated responses, here the discharge notes written by physicians, so that fine-tuning is guided by data that reliably reflects the desired outputs.
Through supervised fine-tuning, our LLMs learn to recognize and replicate the patterns and nuances of the domain-specific data. Adapting the model’s parameters to the distribution of the labeled data and to the explicit requirements of medically coherent and accurate discharge notes makes the model proficient in this task. Consequently, SFT moves the pre-trained LLMs from general language understanding towards the specialized expertise required for generating discharge notes. This process not only improves the models’ performance on the target task but also enriches their contextual understanding of the medical documentation domain, supporting the creation of reliable discharge notes informed by validated data.
4.3.1 SFT Parameters
The supervised fine-tuning process employed the following hyperparameters:
- batch size : 8
- gradient accumulation step : 4
- warmup step : 10
- optimizer : adamw_8bit
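The arithmetic implied by these settings can be sketched as follows. The sketch assumes that the batch size is per device and that training ran on a single GPU, and it uses a linear warmup shape for illustration; none of these details is stated in the list above. The training-set size of 4,077 is taken from Section 3.2.

```python
import math

BATCH_SIZE = 8          # "batch size"
GRAD_ACCUM_STEPS = 4    # "gradient accumulation step"
WARMUP_STEPS = 10       # "warmup step"
N_TRAIN = 4077          # training records reported in Section 3.2

# Examples contributing to each optimizer update (single-GPU assumption).
effective_batch = BATCH_SIZE * GRAD_ACCUM_STEPS

# Optimizer updates needed to see every training record once.
steps_per_epoch = math.ceil(N_TRAIN / effective_batch)


def warmup_factor(step, warmup_steps=WARMUP_STEPS):
    """Learning-rate multiplier at a 0-based step, assuming linear warmup."""
    return min(1.0, (step + 1) / warmup_steps)


print(effective_batch, steps_per_epoch, warmup_factor(0), warmup_factor(9))
```

Gradient accumulation trades wall-clock time for memory: each update still averages over 32 examples, but only 8 need to be resident on the GPU at once.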
4.4 Evaluation
In our study, we employed both quantitative and qualitative evaluations to assess the model performance. For the quantitative evaluation, we utilized several common metrics.
- Recall-Oriented Understudy for Gisting Evaluation (ROUGE) : ROUGE is a set of metrics used for evaluating automatic summarization and machine translation. It compares the output with reference summaries to measure quality, focusing on the overlap of n-grams (ROUGE-1, ROUGE-2) and on the longest common subsequence (ROUGE-L).
- Bilingual Evaluation Understudy (BLEU) : BLEU is a metric for evaluating a generated sentence against a reference sentence. It calculates the precision of n-grams in the generated sentence that also appear in the reference sentence, offering a quantitative measure originally developed for translation quality.
- BERT Score: BERT Score is a metric for evaluating text generation tasks by calculating the similarity of token embeddings between the generated and reference texts. It leverages the BERT model’s ability to capture complex semantic representations, providing a more nuanced evaluation.
- Perplexity: Perplexity is a measurement of how well a probability model predicts a sample. In language modeling, it quantifies the uncertainty of predicting the next token in a sequence, with lower perplexity indicating better prediction performance.
These measures allowed us to objectively evaluate the performance of our model in terms of various aspects such as precision, recall, semantic coherence, and language fluency.
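For intuition, two of the metrics listed above can be computed in a few lines. The sketch below is a simplified illustration, not the evaluation code used in the study: it tokenizes on whitespace, reports ROUGE-N as an F1 score, and computes perplexity from per-token log-probabilities. The example sentences are invented.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """ROUGE-N F1 over whitespace tokens (simplified, no stemming)."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Invented example: 3 of the 5 reference unigrams are recovered.
score = rouge_n("chest pain resolved", "chest pain resolved after PCI", n=1)
# A model assigning probability 0.5 to every token has perplexity 2.
ppl = perplexity([math.log(0.5)] * 4)
```

Because ROUGE and BLEU reward surface overlap only, a clinically equivalent paraphrase can score low, which is one reason embedding-based BERT Score is reported alongside them.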
On the other hand, for the qualitative evaluation, we obtained expert judgement from a professional in Cardiology. This allowed us to incorporate a professional’s assessment into our evaluation process, providing a more comprehensive and practical perspective on the usability and accuracy of our generated content. The evaluation was conducted based on the following five criteria.
- Accuracy : Evaluates how accurately the generated discharge note reflects the patient’s actual medical condition, treatment process, and recommended follow-up actions.
- Completeness : Assesses whether all important medical information (diagnosis, treatment methods, observations of improvement or deterioration, discharge criteria, and follow-up actions) is included in the discharge note.
- Readability and Comprehensibility : Evaluates whether the generated document is easy to understand and clear for the target audience (patients, guardians, other medical professionals, etc.).
- Consistency : Assesses whether the discharge notes generated across various patient records maintain a consistent format and quality.
- Utility : Evaluates whether the generated documentation is specifically helpful in making clinical decisions and contributes to the development of follow-up care plans.
4.5 Environment
The experiments were conducted on an Ubuntu 22.04 LTS system with an NVIDIA RTX 3090 GPU. The following software versions were used:
- Python 3.10.12
- Transformers 4.38.2
- torch 2.1.1+cu118
- TRL 0.7.7
- CUDA 11.8
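When reproducing this setup, it can help to check the installed package versions against the list above. The sketch below uses the standard-library `importlib.metadata` module; the distribution names (`transformers`, `torch`, `trl`) are assumed to match the packages named in the list, and the `find_mismatches` helper is hypothetical.

```python
from importlib import metadata

# Versions from the list above; the distribution names are assumptions.
EXPECTED = {
    "transformers": "4.38.2",
    "torch": "2.1.1+cu118",
    "trl": "0.7.7",
}


def find_mismatches(expected, get_version=metadata.version):
    """Return {package: (wanted, installed)} for every package that differs.

    A package that is not installed is reported with installed=None.
    """
    mismatches = {}
    for package, wanted in expected.items():
        try:
            installed = get_version(package)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[package] = (wanted, installed)
    return mismatches
```

Calling `find_mismatches(EXPECTED)` in the target environment returns an empty dictionary when every listed package matches.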
5 Results
5.1 Quantitative result
Table 1 presents the evaluation results of various fine-tuned language models on the task of generating discharge notes for cardiac patients, using multiple quantitative metrics.
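No single model performed best on every metric in Table 1. Mistral-7B achieved the highest Rouge1 (0.471), BLEU (0.17), and BERTScore (0.875), while Llama2-7B achieved the highest Rouge2 (0.363) and RougeL (0.434) and the lowest perplexity (1.573). Taken together, the quantitative results highlight the potential of fine-tuned language models to generate accurate, coherent, and clinically relevant discharge notes from patient medical records.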
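The per-metric comparison can be reproduced directly from the values reported in Table 1. The short sketch below selects the best model for each metric, treating perplexity as lower-is-better and every other metric as higher-is-better; the dictionary layout and variable names are illustrative.

```python
METRICS = ["Rouge1", "Rouge2", "RougeL", "BLEU", "BERTScore", "Perplexity"]

# Values copied from Table 1, in the order given by METRICS.
TABLE1 = {
    "TinyLlama-1.1B": [0.267, 0.191, 0.239, 0.09, 0.838, 1.655],
    "Llama2-7B":      [0.469, 0.363, 0.434, 0.11, 0.866, 1.573],
    "Mistral-7B":     [0.471, 0.350, 0.422, 0.17, 0.875, 1.709],
    "BioMistral-7B":  [0.397, 0.280, 0.346, 0.12, 0.865, 1.970],
    "Meditron-7B":    [0.377, 0.273, 0.336, 0.10, 0.853, 1.592],
    "SOLAR-10.7B":    [0.442, 0.313, 0.386, 0.12, 0.872, 2.187],
}

best = {}
for idx, metric in enumerate(METRICS):
    # Lower perplexity is better; all other metrics are higher-is-better.
    pick = min if metric == "Perplexity" else max
    best[metric] = pick(TABLE1, key=lambda model: TABLE1[model][idx])

for metric, model in best.items():
    print(f"{metric}: {model}")
```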
Table 1: Quantitative evaluation results of the fine-tuned models.
Model | Rouge1 | Rouge2 | RougeL | BLEU | BERTScore | Perplexity |
---|---|---|---|---|---|---|
TinyLlama-1.1B | 0.267 | 0.191 | 0.239 | 0.09 | 0.838 | 1.655 |
Llama2-7B | 0.469 | 0.363 | 0.434 | 0.11 | 0.866 | 1.573 |
Mistral-7B | 0.471 | 0.350 | 0.422 | 0.17 | 0.875 | 1.709 |
BioMistral-7B | 0.397 | 0.280 | 0.346 | 0.12 | 0.865 | 1.970 |
Meditron-7B | 0.377 | 0.273 | 0.336 | 0.10 | 0.853 | 1.592 |
SOLAR-10.7B | 0.442 | 0.313 | 0.386 | 0.12 | 0.872 | 2.187 |
5.2 Qualitative result
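Table 2: Qualitative evaluation of Mistral-7B by a cardiology expert (R & C: Readability and Comprehensibility).
Model | Accuracy | Completeness | R & C | Consistency | Utility | Total |
---|---|---|---|---|---|---|
Mistral-7B | 4.4 | 4.4 | 4.2 | 4.0 | 4.2 | 21.2 |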
The qualitative assessment of the discharge notes generated by the Mistral-7B model is presented in Table 2. Five sample notes from the test set were evaluated by a cardiology expert across five criteria: Accuracy, Completeness, Readability and Comprehensibility, Consistency, and Utility, with each aspect rated on a 5-point scale.
Table 2 shows the model’s performance in generating clinically relevant and usable discharge documentation for patients. The scores reflect the model’s capabilities in accurately capturing patient information, maintaining completeness of medical details, ensuring readability for the target audience, exhibiting consistent quality across samples, and providing documentation valuable for clinical decision-making and follow-up care planning.
While the specifics of the scores can be examined in Table 2, the overall assessment suggests the Mistral-7B model’s proficiency in automated generation of discharge notes, meeting the standards expected in real-world healthcare settings.
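As a quick consistency check, the Total column of Table 2 can be recomputed from the five criterion scores. The sketch assumes that each reported value is an average over the five evaluated notes and that the total is the plain sum of the five criteria on a 25-point scale; neither is stated explicitly in the text.

```python
# Criterion scores copied from Table 2 (assumed to be per-criterion averages).
TABLE2_SCORES = {
    "Accuracy": 4.4,
    "Completeness": 4.4,
    "Readability and Comprehensibility": 4.2,
    "Consistency": 4.0,
    "Utility": 4.2,
}
MAX_PER_CRITERION = 5  # 5-point scale

total = round(sum(TABLE2_SCORES.values()), 1)
share_of_max = round(total / (MAX_PER_CRITERION * len(TABLE2_SCORES)), 3)

print(total, share_of_max)
```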
[Table: Actual Discharge Note | Generated Discharge Note | Evaluation]
6 Discussion
The findings of this study emphasize the significant potential of fine-tuned LLMs, particularly Mistral-7B, in automating the generation of discharge notes for cardiac patients. The quantitative results, evaluated across multiple metrics such as ROUGE, BLEU, BERT Score, and Perplexity, collectively demonstrate the models’ ability to generate accurate, coherent, and clinically relevant discharge notes from patient medical records. While no single model excelled across all metrics, the overall performance highlights the promise of this approach in enhancing healthcare documentation efficiency and continuity of care.
Notably, the qualitative assessment by an expert in cardiology further reinforces the practical utility of the generated discharge notes. The Mistral-7B model exhibited a high degree of accuracy in capturing patients’ medical conditions, treatment processes, and follow-up recommendations. The generated notes were deemed complete, including all essential medical information, while maintaining excellent readability and comprehensibility for the intended audience, including patients, guardians, and healthcare professionals. Additionally, the consistent quality across various patient records and the overall utility in supporting clinical decision-making and care planning underscore the model’s potential for real-world implementation in healthcare settings.
Limitations While the quantitative metrics employed, such as ROUGE, BLEU, BERT Score, and Perplexity, provide valuable insights into the models’ performance, they are not specifically designed to evaluate medical documentation. These metrics may not fully capture the nuances and critical aspects of healthcare documentation, such as adherence to clinical guidelines, appropriate use of medical terminology, and patient safety considerations. The lack of domain-specific, clinically-oriented evaluation metrics poses a challenge in comprehensively assessing the generated discharge notes’ quality and suitability for healthcare applications.
Additionally, the lack of standardization in the formatting and style of progress notes poses a challenge for model training and performance. Progress notes are often written in a free-form manner by different physicians, leading to inconsistencies in aspects such as date formatting, abbreviations, and terminology usage. For instance, some physicians may write dates as “210304,” while others use formats like “2021/03/04.” Such variations can introduce confusion and inconsistencies during the model’s learning process, potentially impacting its ability to accurately interpret progress notes and generate discharge notes. Establishing guidelines or implementing preprocessing steps to standardize the input data could mitigate these issues and improve the model’s performance.
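A preprocessing step of the kind suggested above could look like the following sketch, which rewrites the two date formats mentioned in the example into a single ISO form. It is an illustration only: it assumes six-digit tokens are YYMMDD dates, leaves any token that does not parse as a valid date untouched, and the choice of `YYYY-MM-DD` as the target format is arbitrary. Real notes would need additional safeguards, since other six-digit numbers can happen to look like valid dates.

```python
import re
from datetime import datetime

# (pattern, strptime format) pairs for the two formats named in the text.
_DATE_PATTERNS = [
    (re.compile(r"\b\d{4}/\d{2}/\d{2}\b"), "%Y/%m/%d"),
    (re.compile(r"\b\d{6}\b"), "%y%m%d"),  # assumes YYMMDD
]


def normalize_dates(text):
    """Rewrite recognised dates as YYYY-MM-DD; leave everything else as-is."""
    for pattern, fmt in _DATE_PATTERNS:
        def replace(match, fmt=fmt):
            try:
                parsed = datetime.strptime(match.group(0), fmt)
            except ValueError:
                return match.group(0)  # not a valid date, keep original token
            return parsed.strftime("%Y-%m-%d")
        text = pattern.sub(replace, text)
    return text


print(normalize_dates("210304 PCI done, f/u 2021/03/04"))
```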
Future works Firstly, the development and integration of domain-specific, clinically-oriented evaluation metrics tailored for assessing medical documentation is crucial. These metrics should capture essential aspects such as adherence to clinical guidelines, appropriate use of medical terminology, and patient safety considerations, enabling a more comprehensive and reliable evaluation of the generated discharge notes.
Another crucial direction for future research is the extension of this approach to other medical specialties beyond cardiology. While this study focused on generating discharge notes for cardiac patients, the methodologies and insights gained could be adapted and applied to other domains, such as oncology, neurology, or pediatrics. By expanding the scope of the dataset to encompass diverse medical conditions and specialties, the models could be fine-tuned to generate discharge notes tailored to the specific requirements and terminologies of each field. This would not only broaden the applicability of the AI-driven documentation approach but also contribute to a more comprehensive and unified system for streamlining medical documentation across various healthcare domains.
Furthermore, expanding the dataset to include data from multiple healthcare facilities and diverse patient populations could significantly enhance the models’ generalizability and robustness, ensuring their applicability across different clinical settings and patient demographics. This could involve establishing collaborations with other healthcare institutions or leveraging existing large-scale medical datasets.
The exploration of multi-modal approaches, incorporating not only textual data but also medical images, lab results, and other relevant patient information, presents a promising direction. By leveraging these diverse data sources, the models could gain a more holistic understanding of patients’ conditions, enabling the generation of more accurate and well-rounded discharge notes. Additionally, the integration of advanced techniques such as attention mechanisms and multimodal fusion could further improve the models’ ability to effectively utilize and combine information from multiple modalities.
Conclusion This study represents a significant step towards leveraging the power of large language models in revolutionizing healthcare documentation practices. The promising results demonstrate the potential for automating discharge note generation, reducing administrative burdens on healthcare professionals, and facilitating seamless continuity of care for patients. However, continued research and development, including the incorporation of domain-specific evaluation metrics and the exploration of multi-modal and knowledge-augmented approaches, are crucial for further refining and advancing these AI-driven solutions in the healthcare domain.
References
- Chen et al. (2023) Zeming Chen, Alejandro Hernández Cano, Angelika Romanou, Antoine Bonnet, Kyle Matoba, Francesco Salvi, Matteo Pagliardini, Simin Fan, Andreas Köpf, Amirkeivan Mohtashami, et al. Meditron-70b: Scaling medical pretraining for large language models. arXiv preprint arXiv:2311.16079, 2023.
- Dettmers et al. (2024) Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. Advances in Neural Information Processing Systems, 36, 2024.
- Han (2023) Daniel Han. unsloth. https://github.com/unslothai/unsloth, 2023. Accessed: 2024-03-05.
- Jiang et al. (2023) Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
- Kim et al. (2023) Dahyun Kim, Chanjun Park, Sanghoon Kim, Wonsung Lee, Wonho Song, Yunsu Kim, Hyeonwoo Kim, Yungi Kim, Hyeonju Lee, Jihoo Kim, et al. Solar 10.7 b: Scaling large language models with simple yet effective depth up-scaling. arXiv preprint arXiv:2312.15166, 2023.
- Labrak et al. (2024) Yanis Labrak, Adrien Bazoge, Emmanuel Morin, Pierre-Antoine Gourraud, Mickael Rouvier, and Richard Dufour. Biomistral: A collection of open-source pretrained large language models for medical domains. arXiv preprint arXiv:2402.10373, 2024.
- Mangrulkar et al. (2022) Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, Sayak Paul, and Benjamin Bossan. Peft: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft, 2022.
- Shin et al. (2013) Soo-Yong Shin, Yongman Lyu, Yongdon Shin, Hyo Joung Choi, Jihyun Park, Woo-Sung Kim, and Jae Ho Lee. Lessons learned from development of de-identification system for biomedical research in a korean tertiary hospital. Healthcare Informatics Research, 19(2):102–109, 2013.
- Shing et al. (2021) Han-Chin Shing, Chaitanya Shivade, Nima Pourdamghani, Feng Nan, Philip Resnik, Douglas Oard, and Parminder Bhatia. Towards clinical encounter summarization: Learning to compose discharge summaries from prior notes. arXiv preprint arXiv:2104.13498, 2021.
- Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- Yang et al. (2023a) Rui Yang, Ting Fang Tan, Wei Lu, Arun James Thirunavukarasu, Daniel Shu Wei Ting, and Nan Liu. Large language models in health care: Development, applications, and challenges. Health Care Science, 2(4):255–263, 2023a.
- Yang et al. (2023b) Shuxin Yang, Xian Wu, Shen Ge, Zhuozhao Zheng, S Kevin Zhou, and Li Xiao. Radiology report generation with a learned knowledge base and multi-modal alignment. Medical Image Analysis, 86:102798, 2023b.
- Zhang et al. (2024) Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, and Wei Lu. Tinyllama: An open-source small language model. arXiv preprint arXiv:2401.02385, 2024.