
The Hong Kong University of Science and Technology
[email protected]

Dia-LLaMA: Towards Large Language Model-driven CT Report Generation

Zhixuan Chen    Luyang Luo    Yequan Bie    Hao Chen🖂
Abstract

Medical report generation has achieved remarkable advancements yet still faces several challenges. First, the inherent imbalance in the distribution of normal and abnormal cases may lead models to exhibit a biased focus on normal samples, resulting in unreliable diagnoses. Second, the frequent occurrence of common template sentences in the reports may overwhelm the critical abnormal information. Moreover, existing works focus on 2D chest X-rays, leaving CT report generation underexplored due to the high-dimensional nature of CT images and the limited availability of CT-report pairs. Recently, large language models (LLMs) have shown a great ability to generate reliable answers with appropriate prompts, which sheds light on addressing the aforementioned challenges. In this paper, we propose Dia-LLaMA, a framework that adapts LLaMA2-7B [21] for CT report generation by incorporating diagnostic information as guidance prompts. Considering the high-dimensional nature of CT, we leverage a pre-trained ViT3D with a perceiver [7] to extract the visual information. To tailor the LLM for report generation and emphasize abnormality, we extract additional diagnostic information by referring to a disease prototype memory bank, which is updated during training to capture common disease representations. Furthermore, we introduce disease-aware attention to enable the model to adjust its attention for different diseases. Experiments on the chest CT dataset demonstrated that our proposed method outperforms previous methods and achieves state-of-the-art results on both clinical efficacy and natural language generation metrics. The code will be made publicly available.

Keywords:
CT Report Generation · LLM · Prototype Representation.
🖂  Corresponding author.

1 Introduction

CT report writing is an indispensable component of clinical practice, as it provides clinicians with a comprehensive summary of findings and highlights significant abnormal information. However, this job is tedious, as it requires examining a series of scans and acquiring a comprehensive understanding of the CT volumes. Therefore, automated CT report generation (CTRG) holds significant value in reducing the workload. Further, the recent development of large language models (LLMs) provides a powerful tool for report generation. For example, MAIRA [6] and XrayGPT [20] have employed LLMs for chest X-ray (CXR) report generation. However, several challenges in CTRG have not been fully explored in these works: 1) CT reports typically adhere to a rigid template structure, with minor modifications to describe specific abnormalities [19, 10, 24]. This standardized format hinders the model from capturing critical abnormal information. 2) The prevalence of abnormalities in reports varies greatly because disease occurrence is inherently imbalanced, with some abnormalities being frequently observed and others being rare [12, 8]. This data imbalance may cause the model to overlook infrequent abnormalities. Furthermore, the high-dimensional nature of CT images and the limited availability of CT-report datasets hinder the development of CTRG.

In this paper, we propose a novel framework that seamlessly embeds an LLM for CT report generation, alleviating the challenges inherent in this task. To emphasize critical abnormal information, we leverage diagnostic text prompts to guide the LLM for CTRG. To mitigate the data imbalance problem, we diagnose diseases by referring to learnable prototypes in a disease prototype memory bank, which records common representations of normal and abnormal samples separately. Supervised by a contrastive loss, these disease prototypes are updated to be distinctive, providing an effective reference during diagnosis. Furthermore, to enhance targeted attention to different disease regions, we introduce a disease-aware attention module to extract disease-level features from CT volumes. Experiments conducted on a recent public chest CT report dataset demonstrated that our proposed framework achieves state-of-the-art (SOTA) performance in both clinical efficacy (CE) and natural language generation (NLG) metrics.

2 Method

Figure 1: The overall architecture. The visual embeddings and diagnostic information are combined into prompts for the LLM to generate reports. Disease-aware attention is adopted to extract disease features, which are used to update the disease prototypes. At inference, the diagnostic results are obtained from feature similarities.

2.1 Framework

The overall architecture is shown in Figure 1. To introduce the LLM to report generation, we utilize combined prompts that aggregate the visual embeddings and critical diagnostic information. Our designed prompts consist of two segments, $\mathcal{P}=\{\mathcal{S},\mathcal{D}\}$, where the first segment $\mathcal{S}=\{s_{1},s_{2},\ldots,s_{N}\}$ represents a fixed number of special tokens $s_{n}$ for visual embeddings, and the second segment $\mathcal{D}=\{d_{1},d_{2},\ldots,d_{L}\}$ represents diagnostic prompt tokens. Let $\mathcal{R}=\{r_{1},r_{2},\ldots,r_{T}\}$ denote a generated report, where $r_{t}$ is the token at timestep $t$ and $T$ is the length of the report. The decoding process of the LLM $f_{l}$ is described as follows:

r_{t}=f_{l}(\mathcal{P},\mathcal{R}^{-})=f_{l}(s_{1},\ldots,s_{N},d_{1},\ldots,d_{L},r_{1},\ldots,r_{t-1}), \qquad (1)

where $\mathcal{R}^{-}$ represents the report generated up to timestep $t-1$. The report generation process is optimized by minimizing the language modeling loss $\mathcal{L}_{LM}$:

\mathcal{L}_{LM}=-\sum_{t=1}^{T}\log p(r_{t}\,|\,s_{1},\ldots,s_{N},d_{1},\ldots,d_{L},r_{1},\ldots,r_{t-1}). \qquad (2)
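To make the optimization concrete, a minimal PyTorch sketch of Eq. (2) is given below. It assumes a Hugging Face-style causal LM whose logits cover the full sequence of $N$ visual tokens, $L$ diagnostic tokens, and $T$ report tokens; masking the prompt positions so that only report tokens contribute to the loss is our assumption rather than a detail stated in Eq. (2).

    import torch.nn.functional as F

    def language_modeling_loss(logits, input_ids, prompt_len):
        """Eq. (2): next-token cross-entropy over report tokens only.

        logits:     (B, N+L+T, vocab) causal-LM outputs over the sequence
                    [s_1..s_N, d_1..d_L, r_1..r_T].
        input_ids:  (B, N+L+T) token ids of the same sequence.
        prompt_len: N + L, number of prompt positions excluded from the loss.
        """
        shift_logits = logits[:, :-1, :]           # position i predicts token i+1
        shift_labels = input_ids[:, 1:].clone()
        shift_labels[:, : prompt_len - 1] = -100    # ignore prompt positions
        return F.cross_entropy(
            shift_logits.reshape(-1, shift_logits.size(-1)),
            shift_labels.reshape(-1),
            ignore_index=-100,
        )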

For extracting visual embeddings, the $i$-th CT volume $V_{i}$ is encoded into patch features by a vision encoder $f_{v}$ and subsequently projected into the embedding space of the LLM by a perceiver $f_{p}$:

f_{v}(V_{i})=A_{i}=\{A_{i}^{1},A_{i}^{2},\ldots,A_{i}^{M}\}, \qquad (3)
f_{p}(A_{i})=X_{i}=\{X_{i}^{1},X_{i}^{2},\ldots,X_{i}^{N}\}, \qquad (4)

where $A_{i}^{m}\in\mathbb{R}^{c}$ represents a patch feature, $X_{i}^{n}\in\mathbb{R}^{d}$ represents a visual embedding, $c$ and $d$ denote the feature and embedding dimensions, and $M$ and $N$ represent the numbers of patch features and visual embeddings, respectively. The visual embeddings $X_{i}$ are integrated into the embedding layer of the LLM.
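As an illustration of Eqs. (3)-(4), the following sketch shows a perceiver-style projector that compresses the $M$ patch features of dimension $c$ from the ViT3D encoder into $N$ visual embeddings of the LLM dimension $d$; the single cross-attention layer and the latent count are simplifications of the actual perceiver [7], not its exact architecture.

    import torch
    import torch.nn as nn

    class PerceiverResampler(nn.Module):
        """Simplified f_p: M patch features (dim c) -> N visual embeddings (dim d)."""

        def __init__(self, c, d, num_latents, num_heads=8):
            super().__init__()
            self.latents = nn.Parameter(torch.randn(num_latents, d))  # N learnable queries
            self.proj_in = nn.Linear(c, d)                             # lift patch dim c to d
            self.attn = nn.MultiheadAttention(d, num_heads, batch_first=True)

        def forward(self, patch_feats):                      # A_i: (B, M, c)
            kv = self.proj_in(patch_feats)                    # (B, M, d)
            q = self.latents.unsqueeze(0).expand(patch_feats.size(0), -1, -1)
            x, _ = self.attn(q, kv, kv)                       # latents attend to patches
            return x                                          # X_i: (B, N, d)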

For deriving diagnostic information, we first utilize disease-aware attention (Section 2.2) to gather disease-level features $D_{i}$ from the patch features $A_{i}$. To provide a typical reference for diagnosis, we construct a disease prototype memory bank (Section 2.3) to capture the common representations of various diseases. The diagnostic results are obtained by comparing disease features with prototypes and are then interpreted into diagnostic text prompts (Section 2.4) for the LLM.
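A sketch of how the two prompt segments are assembled as LLM inputs is shown below; the use of a reserved placeholder token for the visual slots and the masked_scatter-based replacement are illustrative choices of ours rather than details specified in the paper.

    def build_inputs_embeds(llm, input_ids, visual_embeds, placeholder_id):
        """Assemble the Eq. (1) inputs: embed the tokenized prompt
        [s_1..s_N, d_1..d_L, r_1..r_{t-1}] and overwrite the N placeholder
        positions with the N visual embeddings X_i from the perceiver."""
        embeds = llm.get_input_embeddings()(input_ids)        # (B, S, d) text embeddings
        mask = (input_ids == placeholder_id).unsqueeze(-1)     # (B, S, 1) visual slots
        # Overwrite placeholder positions with the visual embeddings, then feed
        # the result to the LLM through its inputs_embeds argument.
        return embeds.masked_scatter(mask, visual_embeds.to(embeds.dtype))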

2.2 Disease-Aware Attention

Employing average-pooled patch features to diagnose various diseases may lead to unreliable diagnoses due to mixed disease information. To alleviate this issue, we propose a disease-aware attention (DAA) module to extract disease-level features from patch features. Specifically, we assign a learnable attention weight to each disease. The patch features from the vision encoder $f_{v}$ are element-wise multiplied with the attention weights and then aggregated to form the disease-level features. The process can be expressed as:

D_{i}=\sum_{m=1}^{M}\bigl(\text{softmax}(\mathbf{W}_{D})\otimes A_{i}\bigr)_{m}, \qquad (5)

where $D_{i}\in\mathbb{R}^{L\times c}$ represents the aggregated disease features, $\mathbf{W}_{D}\in\mathbb{R}^{L\times M\times 1}$ denotes the disease-aware attention weights, and $A_{i}\in\mathbb{R}^{1\times M\times c}$ encapsulates the patch features. The disease features $D_{i}$ are then utilized for disease classification, which requires distinguishing abnormal from normal samples.
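A minimal implementation of Eq. (5) might look as follows; tensor shapes follow the notation above ($L$ diseases, $M$ patches, feature dimension $c$), and the zero initialization of the attention weights is an illustrative choice.

    import torch
    import torch.nn as nn

    class DiseaseAwareAttention(nn.Module):
        """Eq. (5): one learnable spatial attention map per disease, used to
        pool the M patch features into L disease-level features."""

        def __init__(self, num_diseases, num_patches):
            super().__init__()
            # W_D in R^{L x M x 1}: a learnable weight over patches for each disease.
            self.weights = nn.Parameter(torch.zeros(num_diseases, num_patches, 1))

        def forward(self, patch_feats):                      # A_i: (B, M, c)
            attn = self.weights.softmax(dim=1)                # normalize over the M patches
            # (1, L, M, 1) * (B, 1, M, c) -> (B, L, M, c), then sum over patches.
            weighted = attn.unsqueeze(0) * patch_feats.unsqueeze(1)
            return weighted.sum(dim=2)                        # D_i: (B, L, c)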

2.3 Disease Prototype Memory Bank

Due to the imbalanced distribution of diseases, some abnormalities are relatively rare. To improve the diagnostic accuracy for infrequent abnormalities, we introduce a disease prototype memory bank (DPM) as a reference during diagnosis. The diagnostic results are obtained by comparing the similarity between disease-level features and a set of learnable prototypes. Specifically, the DPM includes both abnormal prototypes $\mathbf{P}_{1}^{l}$ and normal prototypes $\mathbf{P}_{0}^{l}$ to capture the presence and absence of each disease, respectively. These prototypes are updated through the InfoNCE loss [15], which pulls positive pairs closer and pushes negative pairs apart. In our case, the positive prototype $\mathbf{P}_{y_{i}^{l}}^{l}$ and negative prototype $\mathbf{P}_{1-y_{i}^{l}}^{l}$ are determined by the disease label $y_{i}^{l}$. The contrastive disease-prototype loss $\mathcal{L}_{DP}$ is defined as

\mathcal{L}_{DP}=-\frac{1}{BL}\sum_{i=1}^{B}\sum_{l=1}^{L}\log\frac{\exp(D_{i}^{l}\cdot\mathbf{P}_{y_{i}^{l}}^{l}/\tau)}{\exp(D_{i}^{l}\cdot\mathbf{P}_{y_{i}^{l}}^{l}/\tau)+\exp(D_{i}^{l}\cdot\mathbf{P}_{1-y_{i}^{l}}^{l}/\tau)}, \qquad (6)

where $B$ is the batch size, $y_{i}^{l}$ represents the label of the $l$-th disease for the $i$-th sample, and $\tau$ is a learnable temperature parameter.
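The sketch below illustrates one way to realize the memory bank and Eq. (6); the dot-product similarity, the two-class cross-entropy form (which is algebraically equivalent to the two-term InfoNCE above), and the nearest-prototype inference rule are written out under our own assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DiseasePrototypeBank(nn.Module):
        """Learnable normal/abnormal prototypes per disease (Sec. 2.3)."""

        def __init__(self, num_diseases, feat_dim):
            super().__init__()
            # prototypes[s, l]: state s (0 = absent, 1 = present) of disease l.
            self.prototypes = nn.Parameter(torch.randn(2, num_diseases, feat_dim))
            self.log_tau = nn.Parameter(torch.zeros(()))      # learnable temperature

        def similarities(self, disease_feats):                # D_i: (B, L, c)
            return torch.einsum("blc,slc->bls", disease_feats, self.prototypes)

        def loss(self, disease_feats, labels):
            """Eq. (6): labels (B, L) in {0, 1} select the positive prototype."""
            sims = self.similarities(disease_feats) / self.log_tau.exp()
            return F.cross_entropy(sims.reshape(-1, 2), labels.reshape(-1).long())

        @torch.no_grad()
        def diagnose(self, disease_feats):
            """Inference: the state whose prototype is most similar wins."""
            return self.similarities(disease_feats).argmax(dim=-1)  # (B, L)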

2.4 Diagnostic Text Prompts

It is essential for medical reports to precisely capture abnormal information [19]. Despite the strong capabilities of LLMs, directly recognizing abnormalities from visual embeddings without additional guidance remains challenging, as validated in Section 3.3.1. Therefore, we introduce diagnostic text prompts (DTP), which leverage the diagnostic results as guidance prompts. Specifically, the diagnostic results are converted into text prompts $\mathcal{D}$ that follow the template description “The {disease name} is {disease state}”. For instance, the diagnostic result $c_{1}$: Present in Figure 1 is interpreted as “The enlarged cardiomediastinum is present in this image”, where $c_{1}$ represents the enlarged cardiomediastinum finding.
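As a concrete example, the diagnostic results can be turned into template sentences as follows; the disease names listed are an illustrative subset of the CheXbert findings, not the full set used in our prompts.

    # Illustrative subset of the 14 CheXbert findings.
    DISEASES = ["enlarged cardiomediastinum", "cardiomegaly", "lung opacity",
                "pleural effusion", "pneumothorax"]

    def diagnostic_text_prompt(pred_states):
        """Map per-disease predictions (1 = present, 0 = absent) to the template
        'The {disease name} is {present|absent} in this image.'"""
        return " ".join(
            f"The {name} is {'present' if s == 1 else 'absent'} in this image."
            for name, s in zip(DISEASES, pred_states)
        )

    # e.g. the first disease predicted present, the rest absent:
    print(diagnostic_text_prompt([1, 0, 0, 0, 0]))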

The overall training loss of our model is expressed as the weighted sum of the disease-prototype loss DP\mathcal{L}_{DP} and the language modeling loss LM\mathcal{L}_{LM}:

\mathcal{L}=\mathcal{L}_{DP}+\lambda\mathcal{L}_{LM}, \qquad (7)

where $\lambda$ represents the weight adjustment factor.

3 Experiments and Results

3.1 Datasets and Metrics

We adopted a large-scale CT report dataset (CTRG-Chest-548K [19]) to evaluate our method and the compared methods. This dataset comprises 1,804 CT-report pairs. Adhering to the original split ratio [19], we randomly selected 80% of the data for training and 20% for testing. Following previous works [8, 23], we utilized a pre-trained report labeler, CheXbert [18], to extract labels. Although it was pre-trained on a CXR dataset (MIMIC-CXR [9]), CheXbert remains effective in our experiments, owing to the similar content of chest CT and CXR reports. CheXbert recognizes 14 findings, each taking one of four states: present, absent, uncertain, and blank. Since the clinical focus is on present findings, we categorized all other states as absent.
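For clarity, the state-collapsing rule can be written as a short mapping; representing the CheXbert outputs as state strings here is purely illustrative.

    # CheXbert assigns one of four states to each of its 14 findings; only
    # "present" is kept as positive, the rest are folded into "absent".
    STATE_TO_BINARY = {"present": 1, "absent": 0, "uncertain": 0, "blank": 0}

    def binarize(states):
        """states: list of 14 state strings, one per CheXbert finding."""
        return [STATE_TO_BINARY[s] for s in states]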

For evaluation, both NLG and CE metrics are adopted. The NLG metrics include BLEU [16], METEOR [4], and ROUGE-L [11]. Following the CE metric settings in [14, 8], we assess Precision, Recall, and F1 score with CheXbert [18].

3.2 Implementation details

For the compared CXR methods, all settings are consistent with the original papers. We selected 30 CT scans at specific intervals for each sample. For RadFM [22] and our method, the pre-trained ViT3D [22] is adopted as the vision encoder, and each volume is resized to $256\times 256\times 64$ as the input. We selected LLaMA2-7B [21] as the LLM in all our experiments and utilized LoRA [5] for parameter-efficient fine-tuning, with only 0.06% of the total parameters being trainable. During training, we utilized AdamW [13] as the optimizer with an initial learning rate of 5e-5, following a constant learning rate schedule with a warmup phase. The model was implemented with PyTorch 2.0 and trained on two RTX 3090 GPUs for about 16 hours. Training involved 2000 steps with an effective batch size of 16. The factor $\lambda$ was set to 4 to balance the two losses. To optimize memory usage, we employed the ZeRO [17] stage 2 training strategy in conjunction with gradient checkpointing [1].
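A minimal sketch of the parameter-efficient setup with Hugging Face PEFT is shown below; the LoRA rank, alpha, dropout, and target modules are illustrative values rather than the settings used in this work.

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    # Load LLaMA2-7B and wrap it with LoRA adapters (illustrative hyperparameters).
    llm = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
    lora_cfg = LoraConfig(
        r=16, lora_alpha=32, lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],   # attention projections only
        task_type="CAUSAL_LM",
    )
    llm = get_peft_model(llm, lora_cfg)
    llm.print_trainable_parameters()           # trainable share is well under 1%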

3.3 Comparison and Analysis

Due to the limited prior work on CTRG, we compared our method with SOTA methods for chest X-ray (CXR) report generation, including R2Gen [3], R2GenCMN [2], M2KT [23], and PromptMRG [8]. In addition, we compared with a CT report generation work, SL-DG [19], and a generalist model, RadFM [22]. To ensure a fair comparison, we matched the LLM in RadFM with that used in our experiments.

Table 1 shows the comparison results on the CTRG-Chest-548K [19] dataset. We observed that the proposed method achieves SOTA performance across the three CE metrics and the majority of NLG metrics. For CE metrics, we achieved an F1 score of 0.372, a 7.8% improvement over RadFM. For Precision and Recall, we obtained 4.5% and 7.2% improvements over the second-best results. In terms of NLG metrics, our method also achieved SOTA performance. On BLEU-1, BLEU-4, and METEOR, our approach obtained improvements of 7.2%, 20%, and 4.3%, respectively, over the second-best methods. However, our method did not attain the highest ROUGE-L score. This may be due to the nature of this metric, which evaluates reports based on the Longest Common Subsequence with reference reports; methods based on memory mechanisms [3, 2] can more easily generate common template sentences, resulting in higher ROUGE-L scores.

Table 1: The performance of our model compared with other SOTA methods on the CTRG-Chest-548K [19] dataset. * indicates results cited from the original paper. The data split used differs from ours, yet the split ratio remains the same. Our method is highlighted in green. The best results and the second-best results are highlighted in bold and underlined, respectively.
METHOD YEAR CE Metrics NLG Metrics
Pre. Rec. F1 BL-1 BL-4 MTR RG-L
R2Gen [3] 2020 0.207 0.121 0.144 34.11 23.39 21.40 47.75
R2GenCMN [2] 2022 0.158 0.100 0.114 35.88 23.37 21.43 45.94
M2KT [23] 2023 0.220 0.119 0.145 46.09 21.93 25.20 36.47
PromptMRG [8] 2023 0.290 0.330 0.290 47.73 23.02 22.87 37.35
SL-DG [19] 2024 - - - - 23.70 21.90 43.80
RadFM [22] 2023 0.403 0.361 0.345 46.70 24.70 24.01 38.98
Ours - 0.421 0.387 0.372 51.16 29.64 26.28 42.15
Table 2: Ablation study of each module on CTRG-Chest-548K [19] dataset.
DPM DAA DTP CE Metrics NLG Metrics
Pre. Rec. F1 BL-1 BL-4 METEOR ROUGE-L
– – – 0.403 0.361 0.345 46.70 24.70 24.01 38.98
– – ✓ 0.415 0.336 0.347 45.74 27.05 24.80 42.29
– ✓ ✓ 0.424 0.347 0.358 44.22 26.38 24.34 42.68
✓ – ✓ 0.437 0.313 0.339 44.06 27.10 24.46 44.50
✓ ✓ ✓ 0.421 0.387 0.372 51.16 29.64 26.28 42.15

3.3.1 Ablation Study

To demonstrate the effectiveness of all the proposed components, we conducted a thorough ablation study, as shown in Table 2. We adopted RadFM [22] as the baseline, which lacks additional diagnostic information. For the method that solely incorporates DTP, we directly input the average-pooled patch features into a classification head to generate diagnostic prompts. We can see improvements in almost all metrics compared to the baseline, which confirms the significance of incorporating diagnostic information for guiding LLM in report generation. When the DAA is incorporated, the CE metrics show further improvement, which validates the effectiveness of the DAA. After integrating the DPM, our complete method with all designed components achieved SOTA performance in most metrics. We also tested the method without DAA, which resulted in a lower F1 score, underscoring the essential role of fine-grained disease features for diagnosis. A representative qualitative example is presented in Figure 3 (readers may refer to the supplementary materials for more examples). It demonstrates that our method captures more critical abnormal information compared to the baseline and achieves higher diagnostic accuracy.

We assessed the F1 scores for each disease separately to validate the diagnostic performance of our method across diseases, as presented in Figure 2. Note that we selected the eight diseases with an abnormal ratio greater than 4% for this evaluation. The last group in Figure 2 represents the average F1 score across diseases. We observed that the method with only DTP achieved poor performance when abnormal samples were limited, which demonstrates that diagnosis based on a classification head can be affected by data imbalance. In contrast, our complete method with DPM achieved higher F1 scores, particularly for diseases with fewer abnormal samples. This validates that our proposed method can alleviate the challenge posed by data imbalance, thereby improving overall diagnostic accuracy.

Moreover, we conducted an ablation study on different prompt types to find the most appropriate one, as presented in Table 3. Specifically, the None prompt indicates that no diagnostic result is used as a prompt. The Text prompt is the DTP proposed in Section 2.4, while the Token prompt indicates that we incorporated learnable special tokens <POS> and <NEG> to represent the disease diagnosis instead of text tokens. For the Feature prompt, we directly leveraged the disease prototypes $\mathbf{P}_{1}^{l}$ or $\mathbf{P}_{0}^{l}$ as prompts. The results indicate that the Text prompt obtained the most significant enhancement relative to the None prompt, so we adopted text prompts as the default prompt type. In contrast, the Token and Feature prompts appear to degrade performance. We speculate that this arises because the embedding layer of an LLM is learned through large-scale pre-training: the LLM naturally possesses robust text embeddings, which contribute to satisfactory performance with text prompts, whereas newly introduced learnable token embeddings are not aligned with the LLM and may therefore impair performance.

Table 3: The comparison of different prompt types. None represents the baseline with visual embedding as the prompt. Text represents the diagnostic textual prompt. Token represents the special token prompt, while Feature represents the disease prototype prompt.
Prompt B-4 Pre. Rec. F1
None 24.70 0.403 0.361 0.345
Text 29.64 0.421 0.387 0.372
Token 25.40 0.363 0.387 0.340
Feature 23.10 0.327 0.359 0.310
Figure 2: Comparison of the F1 score (%) of each disease across five settings. The diseases are sorted in ascending order of their number of abnormal samples.
Figure 3: Qualitative example of the baseline and our method. Green indicates abnormal information consistent with the reference report, while red indicates incorrect content.

4 Conclusion

In this work, we propose a novel CTRG framework called Dia-LLaMA, which adapts LLaMA2-7B [21] to generate reports with diagnostic guidance prompts. Specifically, we adopt a disease-aware attention module to obtain disease-level features, enabling fine-grained diagnosis tailored to different diseases. Additionally, a disease prototype memory bank is proposed to capture common representations of various diseases. The diagnostic results are obtained from feature similarities between disease-level features and prototypes, significantly reducing the negative impact of data imbalance. We then interpret the diagnostic results into textual prompts that guide the LLM with critical information for report generation, achieving both linguistic coherence and satisfactory diagnostic performance. Experiments on the CTRG-Chest-548K [19] dataset demonstrated the superiority of our method over the compared SOTA methods. A limitation of the current work is that the framework focuses only on CT report generation. In future work, we will continue to explore the potential of LLMs and develop a framework that can generate reports across radiology modalities.

References

  • [1] Chen, T., Xu, B., Zhang, C., Guestrin, C.: Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174 (2016)
  • [2] Chen, Z., Shen, Y., Song, Y., Wan, X.: Cross-modal memory networks for radiology report generation. arXiv preprint arXiv:2204.13258 (2022)
  • [3] Chen, Z., Song, Y., Chang, T.H., Wan, X.: Generating radiology reports via memory-driven transformer. arXiv preprint arXiv:2010.16056 (2020)
  • [4] Denkowski, M., Lavie, A.: Meteor 1.3: Automatic metric for reliable optimization and evaluation of machine translation systems. In: Proceedings of the sixth workshop on statistical machine translation. pp. 85–91 (2011)
  • [5] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021)
  • [6] Hyland, S.L., Bannur, S., Bouzid, K., Castro, D.C., Ranjit, M., Schwaighofer, A., Pérez-García, F., Salvatelli, V., Srivastav, S., Thieme, A., et al.: Maira-1: A specialised large multimodal model for radiology report generation. arXiv preprint arXiv:2311.13668 (2023)
  • [7] Jaegle, A., Gimeno, F., Brock, A., Vinyals, O., Zisserman, A., Carreira, J.: Perceiver: General perception with iterative attention. In: International conference on machine learning. pp. 4651–4664. PMLR (2021)
  • [8] Jin, H., Che, H., Lin, Y., Chen, H.: Promptmrg: Diagnosis-driven prompts for medical report generation. arXiv preprint arXiv:2308.12604 (2023)
  • [9] Johnson, A.E., Pollard, T.J., Greenbaum, N.R., Lungren, M.P., Deng, C.y., Peng, Y., Lu, Z., Mark, R.G., Berkowitz, S.J., Horng, S.: Mimic-cxr-jpg, a large publicly available database of labeled chest radiographs. arXiv preprint arXiv:1901.07042 (2019)
  • [10] Li, M., Liu, R., Wang, F., Chang, X., Liang, X.: Auxiliary signal-guided knowledge encoder-decoder for medical report generation. World Wide Web 26(1), 253–270 (2023)
  • [11] Lin, C.Y.: Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out. pp. 74–81 (2004)
  • [12] Liu, G., Liao, Y., Wang, F., Zhang, B., Zhang, L., Liang, X., Wan, X., Li, S., Li, Z., Zhang, S., et al.: Medical-vlbert: Medical visual language bert for covid-19 ct report generation with alternate learning. IEEE Transactions on Neural Networks and Learning Systems 32(9), 3786–3797 (2021)
  • [13] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
  • [14] Nicolson, A., Dowling, J., Koopman, B.: Improving chest x-ray report generation by leveraging warm starting. Artificial intelligence in medicine 144, 102633 (2023)
  • [15] Oord, A.v.d., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018)
  • [16] Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics. pp. 311–318 (2002)
  • [17] Rajbhandari, S., Rasley, J., Ruwase, O., He, Y.: Zero: Memory optimizations toward training trillion parameter models. In: SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. pp. 1–16. IEEE (2020)
  • [18] Smit, A., Jain, S., Rajpurkar, P., Pareek, A., Ng, A.Y., Lungren, M.: Combining automatic labelers and expert annotations for accurate radiology report labeling using bert. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 1500–1519 (2020)
  • [19] Tang, Y., Yang, H., Zhang, L., Yuan, Y.: Work like a doctor: Unifying scan localizer and dynamic generator for automated computed tomography report generation. Expert Systems with Applications 237, 121442 (2024)
  • [20] Thawkar, O., Shaker, A., Mullappilly, S.S., Cholakkal, H., Anwer, R.M., Khan, S., Laaksonen, J., Khan, F.S.: Xraygpt: Chest radiographs summarization using medical vision-language models. arXiv preprint arXiv:2306.07971 (2023)
  • [21] Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al.: Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)
  • [22] Wu, C., Zhang, X., Zhang, Y., Wang, Y., Xie, W.: Towards generalist foundation model for radiology. arXiv preprint arXiv:2308.02463 (2023)
  • [23] Yang, S., Wu, X., Ge, S., Zheng, Z., Zhou, S.K., Xiao, L.: Radiology report generation with a learned knowledge base and multi-modal alignment. Medical Image Analysis 86, 102798 (2023)
  • [24] Yang, S., Ji, J., Zhang, X., Liu, Y., Wang, Z.: Weakly guided hierarchical encoder-decoder network for brain ct report generation. In: 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). pp. 568–573. IEEE (2021)