Resource-Efficient Medical Report Generation using Large Language Models
Abstract
Medical report generation is the task of automatically writing radiology reports for chest X-ray images. Composing these reports manually is time-consuming and prone to human error, so automating the process can reduce the burden on radiologists and promote greater clinical automation in the medical domain. In this work, we propose a new framework that leverages vision-enabled Large Language Models (LLMs) for medical report generation. We introduce a lightweight solution that achieves better or comparable performance to previous solutions on this task. We conduct extensive experiments exploring different model sizes and enhancement approaches, such as prefix tuning, to improve the text generation abilities of the LLMs. We evaluate our approach on a prominent large-scale radiology report dataset, MIMIC-CXR. Our results demonstrate the capability of our resource-efficient framework to generate patient-specific reports with strong medical contextual understanding and high precision.
Index Terms:
Medical Report Generation, Large Language Model, Prefix Tuning, Chest X-ray
I Introduction
Medical report generation aims to automatically generate detailed paragraphs that describe the observations and findings from a given chest X-ray image. Writing these radiology reports manually is a time-consuming task that is also prone to errors. Automating the medical report generation process can relieve radiologists of this workload and promote clinical automation.
In recent years, researchers have proposed various solutions to the task of medical report generation. These solutions can be broadly categorized into two main approaches. The first approach focuses on improving the model structure to enhance the performance of medical report generation. For instance, some works have utilized hierarchically structured LSTM [1] architectures to handle the long-form nature of medical reports. Others have explored different network structures, such as using a generative sentence model and a generative paragraph model that leverages the generated sentences to produce the next sentence [2]. Additionally, image-report matching networks have been proposed [3] to bridge the gap between the image and the text. More recently, transformer [4] architectures have been used as decoders, along with memory mechanisms, as an alternative to LSTM-based models [5]. Some approaches have also leveraged LLMs [6] to harness their generative capabilities for medical report generation. However, these works incorporated very large LLMs and also trained them, which is resource-intensive and hinders their adoption in clinical automation. In comparison, we provide a lightweight, resource-efficient solution that achieves better or similar performance on natural language generation metrics.
The second category of solutions focuses on leveraging available medical domain knowledge to improve the quality of the generated reports. This includes integrating knowledge graphs [7], utilizing disease tags [8], and combining general knowledge from pre-constructed knowledge graphs with specific knowledge derived from retrieving similar reports [9]. Some works have also proposed using a medical concepts generation network [10] to produce semantic information and integrate it into the report generation process.
In our work, as shown in Fig. 1, we aim to promote clinical automation by introducing a resource-efficient framework that incorporates an LLM and enhances its capabilities through prefix tuning, without fine-tuning the LLM itself, for the task of medical report generation.
II Methods
The proposed method consists of a vision encoder, a large language model, and a mapping network. For efficient training, the mapping network is the only trainable component of our model. The details are introduced in the following subsections.
II-A Vision Encoder
Contrastive language image pretraining (CLIP) [11] introduces a method for training a text and image-based encoder model on multimodal data using contrastive learning. This approach aims to bring similar data pairs closer together in the model’s projection space while pushing dissimilar pairs farther apart, thereby bridging the gap between different modalities such as text and image. As a result, closely related text and image samples exhibit high cosine similarity scores, while dissimilar pairs show lower scores.
We utilize a medical CLIP (i.e., MedCLIP [12]) to extract visual embeddings, which we term prefix embeddings, for use in medical report generation. MedCLIP is trained to capture the detailed information present in chest X-ray images. When these visual embeddings are translated through a lightweight mapping network into the language model's space, they provide valuable visual context, enabling the model to generate patient-specific radiology reports.
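As an illustration of how these prefix embeddings are obtained, the minimal sketch below uses a stand-in `vision_encoder` for MedCLIP's image tower; the actual MedCLIP interface and preprocessing may differ.

```python
import torch
import torch.nn.functional as F

# `vision_encoder` stands in for MedCLIP's image tower: any module that
# maps a preprocessed chest X-ray tensor to a 512-d vector fits here.
@torch.no_grad()  # the vision encoder stays frozen throughout training
def extract_prefix_embedding(vision_encoder, image: torch.Tensor) -> torch.Tensor:
    emb = vision_encoder(image)      # (B, 512) embedding in CLIP space
    return F.normalize(emb, dim=-1)  # unit norm, as in CLIP-style training

# With unit-normalized embeddings, the CLIP similarity score reduces to a
# dot product, so closely related image/text pairs score near 1:
# cos_sim = (image_emb * text_emb).sum(-1)
```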

II-B Large Language Models
Large language models are a class of powerful neural networks that are trained on vast amounts of text data, enabling them to generate human-like text and perform a wide range of natural language processing tasks. To evaluate the performance of our method, we utilized a variety of LLMs. Specifically, we employed the pioneering GPT-2 [13] models as well as the more recent Qwen1.5 [14] model. Both GPT-2 and Qwen1.5 are LLMs, but they differ in their training datasets. The GPT-2 models were trained on the WebText dataset, which comprises 8 million web pages, with next-word prediction as the training objective, which is common for autoregressive models. In contrast, Qwen1.5 is a more recent LLM that has demonstrated better performance on natural language generation tasks, even at a relatively smaller model size. By using both GPT-2 and Qwen1.5 in our experiments, we aimed to explore the performance of our method across different LLMs and understand how the choice of pre-trained LLM impacts the overall results for medical report generation. The LLMs in our framework are frozen, which reduces resource requirements and makes our approach resource-efficient.
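Loading and freezing either backbone is straightforward with the Hugging Face transformers library; the sketch below illustrates our setup, in which no LLM parameter receives gradients.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Either backbone drops into the same framework; only the embedding
# width differs (768 for GPT-2, 1024 for Qwen1.5-0.5B).
llm_name = "gpt2"  # or "Qwen/Qwen1.5-0.5B"
tokenizer = AutoTokenizer.from_pretrained(llm_name)
llm = AutoModelForCausalLM.from_pretrained(llm_name)

# Freeze every LLM parameter: only the mapping network is trained.
for p in llm.parameters():
    p.requires_grad = False
llm.eval()

trainable = sum(p.numel() for p in llm.parameters() if p.requires_grad)
print(f"trainable LLM parameters: {trainable}")  # prints 0
```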
II-C Mapping Network
Prefix Tuning, as proposed by [15], involves adding task-specific vectors at the beginning of the input sequence of a language model while keeping the model parameters fixed; only the prefix is optimized while the LLM remains frozen. The language model is conditioned on this prefix, along with other information, to generate the medical report. Since we integrate the language model with vision, we also employ a small mapping network whose role is to translate visual features from the medical domain-adapted vision model into the embedding space of the LLM.
The mapping network is the only trainable part of our framework. Drawing inspiration from the ClipCap [16] approach, we employ a compact transformer-based architecture to convert visual information from CLIP's embedding space to the LLM's embedding space. This mapping network has a small number of parameters and is trained to translate from CLIP's 512 dimensions to the LLM's 768 dimensions for the GPT-2 based models and 1024 dimensions for the Qwen1.5 model. By integrating this mapping network, the LLM gains visual capabilities, enhancing its performance in generating reports from chest X-ray images. The visual embeddings from MedCLIP are converted into a sequence of tokens whose length is set by the clip-length hyperparameter. A trainable prefix is prepended to these tokens, and the combined sequence is processed by the transformer. The prefix retrieves meaningful information from the CLIP embedding through multi-head attention and learns to adapt the frozen LLM to the new data.
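A minimal sketch of this ClipCap-style mapper follows; the clip length, prefix length, depth, and head count shown are illustrative values rather than a fixed specification, since only the embedding widths (512 to 768 or 1024) are dictated by the models above.

```python
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """Transformer mapper: one CLIP embedding -> a sequence of LLM prefix tokens."""

    def __init__(self, clip_dim=512, llm_dim=768, clip_len=10,
                 prefix_len=10, depth=8, heads=8):
        super().__init__()
        # Expand the single CLIP vector into `clip_len` LLM-width tokens.
        self.to_tokens = nn.Linear(clip_dim, clip_len * llm_dim)
        # Learnable prefix that attends to the CLIP tokens and adapts
        # the frozen LLM to the radiology data.
        self.prefix = nn.Parameter(torch.randn(prefix_len, llm_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=llm_dim, nhead=heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)
        self.clip_len, self.llm_dim = clip_len, llm_dim

    def forward(self, clip_emb):  # clip_emb: (B, clip_dim)
        b = clip_emb.size(0)
        tokens = self.to_tokens(clip_emb).view(b, self.clip_len, self.llm_dim)
        prefix = self.prefix.unsqueeze(0).expand(b, -1, -1)
        out = self.transformer(torch.cat([tokens, prefix], dim=1))
        # Only the transformed prefix positions are fed to the frozen LLM.
        return out[:, self.clip_len:]
```

The returned prefix tokens are concatenated with the report token embeddings and passed to the frozen LLM as its input sequence.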
III Experimental Results
III-A Dataset
For our work, we utilized the publicly available MIMIC-CXR dataset [17], which contains 377,110 chest X-ray images and their corresponding free-text reports. We adopted the standard dataset split of MIMIC-CXR for our experiments. As a preprocessing step, we removed any special characters from the reports and converted all tokens to lowercase. The dataset includes a variety of image views, such as frontal and lateral views. However, given the dominance of the anteroposterior (AP) and posteroanterior (PA) views, we focused our experiments solely on the unique AP and PA view image-report pairs, which amounted to 243,334. The number of samples in the train, validation, and test sets is 237,972, 1,959, and 3,403, respectively.
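The preprocessing described above amounts to a few lines; the regular expression and the `ViewPosition` metadata column used for view filtering are assumed implementation details, following the MIMIC-CXR-JPG metadata format.

```python
import re

def preprocess_report(text: str) -> str:
    """Lowercase the report and strip special characters."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9.,; ]", " ", text)  # assumed character whitelist
    return re.sub(r"\s+", " ", text).strip()

# View filtering: keep only frontal (AP/PA) studies, e.g. with pandas:
# keep = metadata["ViewPosition"].isin(["AP", "PA"])
```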
III-B Results
Our method showed improved performance on the task of medical report generation. The GPT-2 based LLMs performed relatively worse, while the Qwen1.5 LLM performed better on NLG metrics [18] such as Bleu [19] scores, as shown in TABLE I.
TABLE I: Bleu scores of our framework with different frozen LLMs on MIMIC-CXR.
LLM | Params. | Bleu-1 | Bleu-2 | Bleu-3 | Bleu-4 |
---|---|---|---|---|---|
GPT-2 small | 225M | 23.1 | 12.4 | 7.3 | 5.0 |
GPT-2 medium | 456M | 24.6 | 13.3 | 7.8 | 4.9 |
Qwen1.5-0.5B | 601M | 34.2 | 20.0 | 13.0 | 8.5 |
Despite the effectiveness of GPT-2 LLMs in various language tasks, including text generation, their performance on specific NLG metrics such as Bleu scores was not as strong as expected. This could be attributed to their architecture, which might not be optimized for certain types of NLG tasks, such as report generation from medical images.
In contrast, the Qwen1.5 LLM exhibited significantly improved performance on the NLG metrics compared to the GPT-2 models. The Qwen1.5 model appears to be more effective for tasks requiring generation from multimodal inputs. This improved performance suggests that the Qwen1.5 model can generate more accurate and fluent text, especially in the context of medical report generation from chest X-ray images.
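For reference, Bleu-1 through Bleu-4 can be computed as follows; this sketch uses nltk with toy stand-ins for the test-set references and model outputs, and is not tied to any particular evaluation toolkit.

```python
from nltk.translate.bleu_score import SmoothingFunction, corpus_bleu

# Toy stand-ins; in practice these are test-set references and model outputs.
reference_reports = ["no acute cardiopulmonary process ."]
generated_reports = ["no acute cardiopulmonary abnormality ."]

refs = [[r.split()] for r in reference_reports]  # one reference per sample
hyps = [h.split() for h in generated_reports]
smooth = SmoothingFunction().method1             # guards zero n-gram counts

for n in range(1, 5):
    w = tuple([1.0 / n] * n + [0.0] * (4 - n))   # uniform n-gram weights
    score = corpus_bleu(refs, hyps, weights=w, smoothing_function=smooth)
    print(f"Bleu-{n}: {100 * score:.1f}")
```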
TABLE II: Prefix tuning versus fine tuning of the LLM in our framework.
Training | Bleu-1 | Bleu-2 | Bleu-3 | Bleu-4 |
---|---|---|---|---|
Fine Tuning | 33.6 | 19.9 | 12.7 | 8.6 |
Prefix Tuning | 34.2 | 20.0 | 13.0 | 8.5 |
TABLE II highlights the effectiveness of prefix tuning compared to fine tuning of the LLM. Prefix tuning showed improved performance and is more resource-efficient than fine-tuning the framework, in which the LLM is trained along with the mapping network. This shows that by leveraging pretrained LLMs we can achieve better results on the task of medical report generation, and it will enable further clinical automation as LLMs become smaller and stronger in the future.
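The difference between the two training regimes amounts to which parameters the optimizer sees. The sketch below reuses the mapping network and frozen LLM from Section II; the learning rate, data loader, and loss masking are illustrative assumptions.

```python
import torch

mapper = MappingNetwork(clip_dim=512, llm_dim=1024)  # Qwen1.5-0.5B width
# Prefix tuning: only the mapper is optimized. Under fine tuning,
# llm.parameters() would be added to the optimizer as well.
optimizer = torch.optim.AdamW(mapper.parameters(), lr=2e-5)  # assumed lr

for images, input_ids in loader:                     # assumed dataloader
    prefix = mapper(extract_prefix_embedding(vision_encoder, images))
    tok_emb = llm.get_input_embeddings()(input_ids)  # frozen LLM embeddings
    inputs = torch.cat([prefix, tok_emb], dim=1)
    # Compute the language-modeling loss only on report tokens (-100 is
    # the ignore index for Hugging Face causal-LM losses).
    ignore = torch.full(prefix.shape[:2], -100, dtype=torch.long)
    labels = torch.cat([ignore, input_ids], dim=1)
    loss = llm(inputs_embeds=inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```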
The Qwen1.5 LLM even outperformed past full transformer-based frameworks and larger LLM-based solutions on the Bleu metric. This is a noteworthy result, as those frameworks were resource-intensive and specifically designed for handling complex NLG tasks. The Qwen1.5 model's ability to surpass them indicates its effectiveness in integrating visual information and generating high-quality text outputs while remaining resource-efficient, underscoring the strength of our framework for the medical report generation task.
The results presented in TABLE III demonstrate the potential of our proposed method compared to full transformer-based and LLM-based frameworks. R2gen(base) [5] denotes the vanilla Transformer with three layers, 8 heads, and 512 hidden units, without other extensions or modifications, while RaDialog-INS [6] denotes the Vicuna-7b [20] based framework trained on an instruction dataset for medical report generation. Our approach uses Qwen1.5-0.5B as the LLM. Notably, with 601 million total parameters, of which 101M are the trainable parameters of the mapping network, compared to 72 million and 7 billion parameters in the previous studies, our approach achieved competitive performance against these more resource-intensive and complex frameworks.
At inference time, approximately 1GB of VRAM is required to load the transformer-based models, 2GB to load our Qwen1.5-0.5B based framework, and 25GB to load the Vicuna-7b based solutions to generate medical reports [21]. Such resource requirements hinder the adoption of these solutions in the medical domain. Our framework provides a resource-efficient alternative while achieving better performance.
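These figures are consistent with a weights-only back-of-the-envelope estimate (parameter count times bytes per parameter), similar in spirit to the estimator in [21]; activations and the KV cache add to the real footprint.

```python
def weights_vram_gb(n_params: float, bytes_per_param: int = 4) -> float:
    """Weights-only lower bound on VRAM for fp32 parameters."""
    return n_params * bytes_per_param / 1024**3

print(f"{weights_vram_gb(601e6):.1f} GB")  # our 601M framework: ~2.2 GB
print(f"{weights_vram_gb(7e9):.1f} GB")    # a 7B LLM such as Vicuna-7b: ~26 GB
```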
A key strength of our method is its performance on the Bleu-1, Bleu-2, and Bleu-3 metrics, which measure the n-gram overlap between the generated text and reference reports. Our model achieved Bleu-1, Bleu-2, and Bleu-3 scores of 34.2, 20.0, and 13.0 respectively, outperforming the more sophisticated transformer-based and LLM-based models included in the comparison.
Our proposed framework represents an efficient approach to medical report generation: by leveraging the generative abilities of LLMs through prefix tuning, it demonstrates superior performance on Natural Language Generation (NLG) metrics, specifically the Bleu score, compared to full transformer-based and larger LLM-based frameworks.
TABLE III: Comparison with full transformer-based and larger LLM-based frameworks. B1-B4 denote Bleu-1 through Bleu-4.
Method | Params. | VRAM | B1 | B2 | B3 | B4 |
---|---|---|---|---|---|---|
R2gen(base) [5] | 72M | 1GB | 31.4 | 19.2 | 12.7 | 9.0 |
RaDialog-INS [6] | 7B | 25GB | 34.0 | - | - | 9.7 |
Ours | 601M | 2GB | 34.2 | 20.0 | 13.0 | 8.5 |
IV Conclusion and Future Directions
Our findings suggest that a well-designed framework built on a smaller LLM can, without huge resources, capture the important linguistic patterns and coherence required for medical report generation as well as full transformer-based and larger LLM-based architectures. This could be particularly advantageous in clinical automation where computational resources are constrained, as our method provides a more resource-efficient solution while maintaining strong performance. The ability to leverage smaller LLMs, without relying on full transformer-based structures and larger LLMs, opens up interesting possibilities for further research and optimization. Exploring ways to further enhance our approach, such as through improved pretraining, fine-tuning techniques, or the incorporation of domain-specific knowledge, could lead to even stronger performance on medical report generation. Additionally, investigating the generalizability of our method to other text generation tasks in the medical domain or beyond would be a valuable direction for future work. Assessing its performance on a wider range of benchmarks could provide further insights into the capabilities and limitations of this more lightweight approach.
Furthermore, we plan to enhance our framework by incorporating more efficient LLMs and training strategies. In the future, we also aim to investigate the use of knowledge graphs as additional knowledge for medical report generation. By continually refining our framework, we aim to provide healthcare professionals with a powerful tool for generating accurate and comprehensive medical reports efficiently.
Acknowledgement
This work was supported in part by the Institute of Information and Communications Technology Planning and Evaluation (IITP) Grant funded by the Korea Government (MSIT) under Grant 2022-0-00078 (Explainable Logical Reasoning for Medical Knowledge Generation), Grant RS-2022-00155911 (Artificial Intelligence Convergence Innovation Human Resources Development (Kyung Hee University)), and by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. RS-2024-00334321).
References
- [1] B. Jing, P. Xie, and E. P. Xing, "On the automatic generation of medical imaging reports," in ACL, 2018.
- [2] Y. Xue and X. Huang, "Improved disease classification in chest X-rays with transferred features from report generation," in IPMI, 2019.
- [3] Z. Wang, L. Zhou, L. Wang, and X. Li, "A self-boosting framework for automated radiographic report generation," in CVPR, 2021.
- [4] A. Vaswani et al., "Attention is all you need," in Advances in Neural Information Processing Systems 30, 2017.
- [5] Z. Chen, Y. Song, T.-H. Chang, and X. Wan, "Generating radiology reports via memory-driven transformer," in EMNLP, 2020.
- [6] C. Pellegrini et al., "RaDialog: A large vision-language model for radiology report generation and conversational assistance," arXiv preprint arXiv:2311.18681, 2023.
- [7] M. Li, R. Liu, F. Wang, X. Chang, and X. Liang, "Auxiliary signal-guided knowledge encoder-decoder for medical report generation," World Wide Web, 2023.
- [8] D. You, F. Liu, S. Ge, X. Xie, J. Zhang, and X. Wu, "AlignTransformer: Hierarchical alignment of visual regions and disease tags for medical report generation," in MICCAI, 2021.
- [9] S. Yang, X. Wu, S. Ge, S. K. Zhou, and L. Xiao, "Knowledge matters: Radiology report generation with general and specific knowledge," Medical Image Analysis, 2021.
- [10] Z. Wang, M. Tang, L. Wang, X. Li, and L. Zhou, "A medical semantic-assisted transformer for radiographic report generation," in MICCAI, 2022.
- [11] A. Radford, J. W. Kim, C. Hallacy, et al., "Learning transferable visual models from natural language supervision," in ICML, 2021.
- [12] Z. Wang, Z. Wu, D. Agarwal, and J. Sun, "MedCLIP: Contrastive learning from unpaired medical images and text," in EMNLP, 2022.
- [13] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., "Language models are unsupervised multitask learners," OpenAI blog, 1(8):9, 2019.
- [14] J. Bai et al., "Qwen technical report," arXiv preprint arXiv:2309.16609, 2023.
- [15] X. L. Li and P. Liang, "Prefix-tuning: Optimizing continuous prompts for generation," in ACL, 2021.
- [16] R. Mokady, A. Hertz, and A. H. Bermano, "ClipCap: CLIP prefix for image captioning," arXiv preprint arXiv:2111.09734, 2021.
- [17] A. E. Johnson, T. J. Pollard, N. R. Greenbaum, et al., "MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs," 2019.
- [18] A. B. Sai, A. K. Mohankumar, and M. M. Khapra, "A survey of evaluation metrics used for NLG systems," ACM Computing Surveys (CSUR), 55(2), 2022.
- [19] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "Bleu: a method for automatic evaluation of machine translation," in ACL, 2002.
- [20] L. Zheng et al., "Judging LLM-as-a-judge with MT-Bench and Chatbot Arena," in Advances in Neural Information Processing Systems 36, 2024.
- [21] Hugging Face, "Model memory usage estimator," https://huggingface.co/spaces/hf-accelerate/model-memory-usage