College of Computer Science, Sichuan University, Sichuan, China ([email protected], [email protected])
Mohamed bin Zayed University of Artificial Intelligence, United Arab Emirates
China Agricultural University, Beijing, China
The Sixth People's Hospital of Chengdu, Sichuan, China
School of Cyber Science and Engineering, Sichuan University, Sichuan, China
Textual Inversion and Self-supervised Refinement for Radiology Report Generation
Abstract
Existing mainstream approaches follow the encoder-decoder paradigm for generating radiology reports. They focus on improving the network structure of encoders and decoders, which leads to two shortcomings: overlooking the modality gap and ignoring report content constraints. In this paper, we propose Textual Inversion and Self-supervised Refinement (TISR) to address these two issues. Specifically, textual inversion projects text and images into the same space by representing images as pseudo words, eliminating the cross-modal gap. Subsequently, self-supervised refinement refines these pseudo words by computing a contrastive loss between images and texts, enhancing the fidelity of generated reports to images. Notably, TISR is orthogonal to most existing methods and plug-and-play. We conduct experiments on two widely used public datasets and achieve significant improvements over various baselines, which demonstrates the effectiveness and generalization of TISR. The code will be available soon.
Keywords:
Radiology report generation · Cross-modal learning · Textual inversion · Auxiliary diagnosis
1 Introduction
Radiology report generation provides the basis for physician diagnosis [10]. However, observing radiographs and writing reports is time-consuming and laborious for doctors [24]. It is even error-prone for inexperienced doctors, who often struggle to accurately capture the abnormalities in images [2, 23]. Previous approaches adopt the image captioning framework [9] directly and adapt it to radiology report generation by improving image encoders [30, 19, 32] to better fit medical images or by refining text decoders to generate long paragraphs [34, 15, 11]. Building upon this, innovative techniques such as knowledge graphs [36], causal inference [4], and dynamic graphs [18] have been used to further improve performance.

Despite these notable advances, two challenges remain in generating accurate reports. (1) Existing methods cannot explicitly constrain the reports generated by the text decoder to be faithful to visual information. A prior method [5] suffers from hallucinated generation, as shown in Fig. 1: it generates "consolidation", which is not mentioned in the ground truth, while missing "atelectasis" and "pneumonia". Some works have enhanced grounding by extracting additional expert information, such as anchor boxes [30] or a sentence retrieval library [14]. However, these approaches require additional labels or reconstructing the entire dataset, which is expensive and not always feasible in clinical applications. (2) The inherent modality gap between images and language [16]. Previous approaches [30, 19, 32, 34, 15, 11, 35, 17, 33] adhere to the image encoder-text decoder paradigm [22], which lacks cross-modal interaction. Since images and text lie in distinct feature spaces separated by a feature gap, we propose to bridge this gap with pseudo words, constructing a unified hidden space for images and text, as shown in Fig. 1.
We propose Textual Inversion and Self-supervised Refinement (TISR) to solve the problems discussed above. We employ a lightweight mapping module, named textual inversion, to convert image features into text features [28]. Through textual inversion, the pseudo words obtained by transforming image embeddings carry both image features and linguistic spatial characteristics. Textual inversion effectively eliminates the spatial gap, allowing the features of the two modalities to be computed in a common compact space. We then perform self-supervised refinement by calculating a contrastive loss between the obtained pseudo words and the image features. Instead of relying on ground truth, TISR guides the network to generate reports faithful to the images by minimizing the contrastive loss [3]. Experimental results on two widely used datasets and three radiology report generation networks verify the efficacy and plug-and-play capability of TISR. In summary, the contributions of this paper are as follows:
• We bridge the modality gap by transforming visual features into the linguistic space through textual inversion.
• The self-supervised refinement module searches for text representations close to the image content by minimizing the contrastive loss. Consequently, we can generate reports faithful to radiographs, providing more credible diagnostic information for clinical practice.
• Our TISR is orthogonal to other radiology report generation networks and plug-and-play. Experimental results show that improving networks with TISR increases accuracy over the baselines.
2 Method

As shown in Fig. 2, our pipeline consists of two stages. We extract image features $V \in \mathbb{R}^{B \times N \times d}$ from a radiograph through an image encoder [12], where $B$ is the batch size, $N$ is the number of processed patches and $d$ is the feature dimension. With the image features $V$ and the previously generated text embeddings $\{e_1, \dots, e_{n-1}\}$, the text decoder [7] obtains the log probability of the next word:
$$p_n = f_d\big(f_e(\{e_1, \dots, e_{n-1}\}),\ V\big) \qquad (1)$$
where $f_e$, $f_d$ and $n$ denote the text encoder, the text decoder and the index of one word of the target report, respectively. By applying a linear layer and softmax to the log probability, we obtain the word in the vocabulary with the highest probability and use it as the $n$-th word of the report:
$$w_n = \operatorname*{arg\,max}\ \mathrm{Softmax}\big(\phi(p_n)\big) \qquad (2)$$
where $\phi$ refers to the linear function. After continuous autoregression of the text decoder [25], we finally obtain the complete text embeddings $E = \{e_1, \dots, e_T\}$ and the report $R = \{w_1, \dots, w_T\}$, where $T$ represents the length of the target sequence.
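For illustration, the following is a minimal PyTorch sketch of this image-encoder / text-decoder pipeline with greedy autoregressive decoding. Module names, dimensions and the BOS token id are assumptions of this sketch, not the exact components of the baselines.

```python
import torch
import torch.nn as nn

class ImageEncoder(nn.Module):
    """Stands in for a ResNet-style backbone: maps a radiograph to patch features V of shape (B, N, d)."""
    def __init__(self, d=512):
        super().__init__()
        self.proj = nn.Conv2d(3, d, kernel_size=32, stride=32)  # toy patchifier

    def forward(self, images):                           # images: (B, 3, 224, 224)
        return self.proj(images).flatten(2).transpose(1, 2)  # (B, N, d)

class TextDecoder(nn.Module):
    """Autoregressive decoder: given V and previously generated word embeddings, predicts the next word."""
    def __init__(self, vocab_size=1000, d=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)
        layer = nn.TransformerDecoderLayer(d_model=d, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.linear = nn.Linear(d, vocab_size)

    def forward(self, v, prev_tokens):                   # prev_tokens: (B, n-1)
        e = self.embed(prev_tokens)                      # previously generated text embeddings
        h = self.decoder(tgt=e, memory=v)                # cross-attend to image features
        return self.linear(h[:, -1])                     # logits of the n-th word, cf. Eqs. (1)-(2)

# Greedy generation (Eq. (2)): pick the highest-probability word at every step.
encoder, decoder = ImageEncoder(), TextDecoder()
v = encoder(torch.randn(2, 3, 224, 224))
tokens = torch.zeros(2, 1, dtype=torch.long)             # assumed BOS token id = 0
for _ in range(20):                                      # 20 stands in for the target length T
    logits = decoder(v, tokens)
    next_word = logits.softmax(-1).argmax(-1, keepdim=True)
    tokens = torch.cat([tokens, next_word], dim=1)
```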
Image features $V$ are processed by textual inversion to generate pseudo words $S$. In self-supervised refinement, we supervise the network to generate more refined pseudo words by calculating the contrastive loss between text features and image features instead of using the ground truth as the supervision signal. The details are illustrated in the following subsections.
2.1 Textual Inversion
Radiology report generation is an image-to-text cross-modal task, as medical images and radiology reports lie in two different feature spaces. Existing methods tend to improve overall performance by extracting refined image features [30, 19, 32] or improving the network structure of the text decoder [34, 15, 11, 35], while ignoring the gap between modalities. Therefore, we propose textual inversion, which reconstructs the image representation within the text embedding space to eliminate the spatial gap. In this module, we map image embeddings $V$ to pseudo words $S$ by feeding the image features into a three-layer fully-connected network, which can be formulated as:
$$S = \mathrm{MLP}(V) \qquad (3)$$
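A minimal sketch of such a textual inversion mapper is shown below; the hidden width and activation are assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class TextualInversion(nn.Module):
    """Three-layer fully-connected network (Eq. (3)) mapping image features V to pseudo words S in the text space."""
    def __init__(self, d_img=512, d_text=512, d_hidden=512):
        super().__init__()
        self.mapper = nn.Sequential(
            nn.Linear(d_img, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, d_text),
        )

    def forward(self, v):          # v: (B, N, d_img) image features
        return self.mapper(v)      # S: (B, N, d_text) pseudo words

pseudo_words = TextualInversion()(torch.randn(2, 49, 512))
```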
2.2 Self-supervised Refinement
After obtaining the pseudo words, we feed them, after a series of operations, into the text decoder to obtain the decoded text embeddings $\hat{E}$. We explicitly constrain the generated pseudo words to represent the image features sufficiently by calculating the contrastive loss between $\hat{E}$ and the image features $V$. This optimization process guides the network to generate reports that are faithful to images.
We first perform cross-modal interaction by employing a cross attention mechanism [31] between the pseudo words $S$ and the text embeddings $E$. Since the pseudo words are derived directly from image features, this interaction helps align visual and linguistic features. The process can be expressed as:
$$\tilde{S} = \mathrm{CrossAttn}(S,\ E,\ E) \qquad (4)$$
$$\tilde{E} = \mathrm{CrossAttn}(E,\ S,\ S) \qquad (5)$$
We assume that the pseudo words can compensate for information missing from, or correct information redundant in, the text features. Based on this intuition, we concatenate the aligned text features $\tilde{E}$ with the aligned pseudo words $\tilde{S}$. The concatenated features are fused through a multi-layer perceptron (MLP), yielding the processed pseudo words $\hat{S}$. This can be expressed as:
$$\hat{S} = \mathrm{MLP}\big([\tilde{E};\ \tilde{S}]\big) \qquad (6)$$
The log probability and the decoded text embeddings $\hat{E}$ are obtained by decoding $\hat{S}$. We then implement self-supervised refinement by calculating the contrastive loss between the text embeddings $\hat{E}$ and the image features $V$. By minimizing the contrastive loss, we encourage the network to generate $\hat{E}$ that closely resembles the expression of $V$. After continuous back-propagation and optimization, the generated pseudo words can adequately represent the image semantics, which benefits the generation of reports faithful to the original images.
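The alignment and fusion steps of Eqs. (4)-(6) can be sketched as follows. The head count, dimensions, and the mean-pooling used to reconcile the different sequence lengths of $\tilde{S}$ and $\tilde{E}$ before concatenation are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class Refinement(nn.Module):
    """Cross attention aligns pseudo words S with text embeddings E; the aligned
    features are concatenated and fused by an MLP into processed pseudo words (Eqs. (4)-(6))."""
    def __init__(self, d=512, heads=8):
        super().__init__()
        self.attn_s = nn.MultiheadAttention(d, heads, batch_first=True)  # S attends to E
        self.attn_e = nn.MultiheadAttention(d, heads, batch_first=True)  # E attends to S
        self.mlp = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, s, e):                                   # s: (B, N, d), e: (B, T, d)
        s_aligned, _ = self.attn_s(query=s, key=e, value=e)    # Eq. (4)
        e_aligned, _ = self.attn_e(query=e, key=s, value=s)    # Eq. (5)
        # pool S to the text length before concatenation (an assumption of this sketch)
        s_pooled = s_aligned.mean(dim=1, keepdim=True).expand_as(e_aligned)
        fused = torch.cat([e_aligned, s_pooled], dim=-1)       # concatenate aligned features
        return self.mlp(fused)                                 # processed pseudo words, Eq. (6)

out = Refinement()(torch.randn(2, 49, 512), torch.randn(2, 30, 512))
```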
2.3 Training Objective
We utilize $\mathcal{L}_{CE}$ to quantify the difference between the generated report and the ground truth [6], thus guiding the model to generate reports that are close to the ground truth. The formulation of $\mathcal{L}_{CE}$ is as follows:
$$\mathcal{L}_{CE} = -\frac{\sum_{i}\sum_{t} m_{i,t}\,\log p_{\theta}\big(y_{i,t} \mid y_{i,<t},\ V_i\big)}{\sum_{i}\sum_{t} m_{i,t}} \qquad (7)$$
The log probability of the output against the target sequence, $\log p_{\theta}(y_{i,t} \mid y_{i,<t}, V_i)$, is computed at position $t$ of the $i$-th sample. To ensure a consistent input sequence length, all sequences are padded to the same length. The mask $m_{i,t}$ indicates whether a real word exists at the position: 1 if present, 0 otherwise. The log probability of the padded part is zeroed out by multiplying with the mask, preventing padding from affecting $\mathcal{L}_{CE}$. Finally, we normalize by dividing by the sum of all mask values so that the loss is not affected by changes in sequence length.
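A minimal sketch of this masked generation loss (Eq. (7)) is given below; tensor names are illustrative.

```python
import torch
import torch.nn.functional as F

def masked_ce_loss(logits, targets, mask):
    """logits: (B, T, vocab), targets: (B, T) padded word ids, mask: (B, T) with 1 for real words."""
    log_probs = F.log_softmax(logits, dim=-1)
    token_ll = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # (B, T) per-token log-likelihood
    token_ll = token_ll * mask               # zero out padded positions
    return -token_ll.sum() / mask.sum()      # normalize by the number of real words

loss = masked_ce_loss(torch.randn(2, 5, 100), torch.randint(0, 100, (2, 5)), torch.ones(2, 5))
```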
In addition to optimizing the network to generate more accurate reports through $\mathcal{L}_{CE}$, we also constrain textual inversion to generate pseudo words close to the image representation through $\mathcal{L}_{CL}$. We obtain the score matrix by computing the correlation between image features and text features via dot product, expressed as $M = V \cdot \hat{E}^{\top}$, where $\cdot$ denotes matrix multiplication. In this way we evaluate the cosine similarity between image features and text features and obtain a score matrix of size $B \times B$. We optimize the network with a symmetric cross-entropy loss that maximizes the cosine similarity of real image-text pairs while minimizing the cosine similarity of unpaired image-text pairs [27]:
$$\mathcal{L}_{CL} = \frac{1}{2}\Big(\mathrm{CE}\big(M,\ I\big) + \mathrm{CE}\big(M^{\top},\ I\big)\Big) \qquad (8)$$
$I$ is a matrix of size $B \times B$ whose diagonal elements are 1, indicating positive samples, and whose off-diagonal elements are 0, indicating negative samples. The overall loss function of our network combines the two objectives, $\mathcal{L}_{CE}$ and $\mathcal{L}_{CL}$. Instead of relying on manually labeled datasets, we leverage contrastive learning to measure the similarity between text and image, guiding the network to optimize its parameters toward generating reports faithful to the visual content.
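The symmetric contrastive objective can be sketched in the spirit of CLIP [27] as below; the mean pooling over the sequence dimension and the temperature value are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_feats, text_feats, temperature=0.07):
    """image_feats: (B, N, d), text_feats: (B, T, d); paired samples share the same batch index."""
    img = F.normalize(image_feats.mean(dim=1), dim=-1)        # (B, d)
    txt = F.normalize(text_feats.mean(dim=1), dim=-1)         # (B, d)
    scores = img @ txt.t() / temperature                      # (B, B) score matrix M
    labels = torch.arange(scores.size(0))                     # diagonal entries are the positive pairs
    # symmetric cross-entropy over rows (image-to-text) and columns (text-to-image)
    return 0.5 * (F.cross_entropy(scores, labels) + F.cross_entropy(scores.t(), labels))

loss_cl = contrastive_loss(torch.randn(4, 49, 512), torch.randn(4, 30, 512))
```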
3 Experiment
3.1 Dataset and Evaluation Metrics
Dataset. We conducted experiments on two widely used datasets: the small IU X-ray dataset [8] (https://openi.nlm.nih.gov/), containing 7,470 chest X-ray images and 3,955 corresponding reports, and the large MIMIC-CXR dataset [13] (https://physionet.org/content/mimic-cxr/2.0.0/), containing 377,110 images and 227,835 corresponding reports. To ensure consistency and fairness in comparisons, we followed the data processing of the three baselines [4, 5, 6]. After excluding samples without corresponding radiology reports, IU X-ray is divided into training, validation and testing sets with a ratio of 7:1:2 [20], while MIMIC-CXR is divided according to the official splits [6].
Evaluation Metrics. We evaluate TISR on natural language generation (NLG) metrics including BLEU [26], METEOR [1] and ROUGE-L [21], which are widely used to assess the fluency and accuracy of generated reports. We not only focus on the quality of the generated reports but also on their ability to accurately capture lesions in the images. Therefore, we employ clinical efficacy (CE) metrics to evaluate the detection accuracy of generated reports. CheXbert [29] is applied to extract labels of 14 medical observations from reports. Precision, recall and F1 are calculated by comparing the labels of the generated reports with ground truth.
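For illustration, the CE evaluation can be sketched as below once the 14 observation labels have been extracted from the generated and reference reports (e.g., with CheXbert [29]). The random placeholder labels and the micro-averaging choice are assumptions of this sketch.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# Placeholder 14-dimensional binary observation labels for 100 reports.
gt_labels = np.random.randint(0, 2, size=(100, 14))    # labels extracted from ground-truth reports
gen_labels = np.random.randint(0, 2, size=(100, 14))   # labels extracted from generated reports

precision = precision_score(gt_labels, gen_labels, average="micro")
recall = recall_score(gt_labels, gen_labels, average="micro")
f1 = f1_score(gt_labels, gen_labels, average="micro")
```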
Table 1. Comparison with baselines on IU X-ray and MIMIC-CXR. BLEU-1–4, MTR and RG-L are NLG metrics; Precision, Recall and F1 are CE metrics. Δ denotes the improvement of +TISR over the corresponding baseline.

| Method | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | MTR | RG-L | Precision | Recall | F1 |
|---|---|---|---|---|---|---|---|---|---|
| *Experimental results on the IU X-ray dataset* | | | | | | | | | |
| R2Gen* [6] | 0.443 | 0.286 | 0.212 | 0.168 | 0.175 | 0.355 | - | - | - |
| +TISR (Ours) | 0.470 | 0.310 | 0.233 | 0.187 | 0.194 | 0.369 | - | - | - |
| Δ | +0.027 | +0.024 | +0.021 | +0.019 | +0.019 | +0.014 | - | - | - |
| R2GenCMN* [5] | 0.469 | 0.300 | 0.215 | 0.164 | 0.190 | 0.370 | - | - | - |
| +TISR (Ours) | 0.483 | 0.313 | 0.229 | 0.176 | 0.191 | 0.371 | - | - | - |
| Δ | +0.014 | +0.013 | +0.014 | +0.012 | +0.001 | +0.001 | - | - | - |
| VLCI* [4] | 0.467 | 0.306 | 0.225 | 0.175 | 0.193 | 0.377 | - | - | - |
| +TISR (Ours) | 0.485 | 0.318 | 0.232 | 0.179 | 0.199 | 0.382 | - | - | - |
| Δ | +0.018 | +0.012 | +0.007 | +0.004 | +0.006 | +0.005 | - | - | - |
| *Experimental results on the MIMIC-CXR dataset* | | | | | | | | | |
| R2Gen* [6] | 0.350 | 0.214 | 0.143 | 0.103 | 0.135 | 0.271 | 0.424 | 0.254 | 0.317 |
| +TISR (Ours) | 0.358 | 0.219 | 0.147 | 0.106 | 0.139 | 0.275 | 0.467 | 0.302 | 0.367 |
| Δ | +0.008 | +0.005 | +0.004 | +0.003 | +0.004 | +0.004 | +0.043 | +0.048 | +0.050 |
| R2GenCMN* [5] | 0.344 | 0.210 | 0.139 | 0.098 | 0.136 | 0.275 | 0.455 | 0.317 | 0.374 |
| +TISR (Ours) | 0.363 | 0.224 | 0.149 | 0.105 | 0.143 | 0.279 | 0.450 | 0.344 | 0.390 |
| Δ | +0.019 | +0.014 | +0.010 | +0.007 | +0.007 | +0.004 | -0.005 | +0.027 | +0.016 |
| VLCI* [4] | 0.393 | 0.239 | 0.159 | 0.113 | 0.150 | 0.276 | 0.439 | 0.283 | 0.344 |
| +TISR (Ours) | 0.396 | 0.242 | 0.161 | 0.115 | 0.149 | 0.278 | 0.453 | 0.306 | 0.366 |
| Δ | +0.003 | +0.003 | +0.002 | +0.002 | +0.001 | +0.002 | +0.014 | +0.023 | +0.022 |
3.2 Experimental Results and Analyses
Comparison with Baselines. To verify the generalization and effectiveness of TISR, we use R2Gen [6], R2GenCMN [5] and VLCI [4] as baseline models in our experiments. These baselines are improved with TISR, and the results are compared with the originals, as shown in Table 1. The results show that nearly all metrics are enhanced by improving the networks with TISR, which indicates that TISR can eliminate the gap between modalities and generate more accurate reports. Remarkably, our approach does not require additional data and can be seamlessly integrated into these baselines [4, 5, 25], which is of great importance for network migration and practical applications. Moreover, the results suggest that prior methods have overlooked the impact of the modality gap on radiology report generation; future research should therefore pay more attention to cross-modal interactions.
Ablation Study. To explore the effectiveness of each component of TISR and the rationality of the network structure, we conducted several ablation experiments. First, we examined the effectiveness of textual inversion and self-supervised refinement, as shown in Table 2. The significance of textual inversion is investigated by calculating the contrastive loss between image features and text embeddings, while the role of self-supervised refinement is explored by calculating the contrastive loss between image features and pseudo words.
Table 2. Ablation of textual inversion and self-supervised refinement.

| $\mathcal{L}_{CL}(V,\hat{E})$ | $\mathcal{L}_{CL}(V,S)$ | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L |
|---|---|---|---|---|---|---|---|
| | | 0.443 | 0.286 | 0.212 | 0.168 | 0.175 | 0.355 |
| ✓ | | 0.462 | 0.304 | 0.224 | 0.176 | 0.190 | 0.361 |
| | ✓ | 0.463 | 0.289 | 0.206 | 0.157 | 0.181 | 0.356 |
| ✓ | ✓ | 0.470 | 0.310 | 0.233 | 0.187 | 0.194 | 0.369 |
Table 3. Comparison of textual inversion architectures.

| Mapping module | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L |
|---|---|---|---|---|---|---|
| Transformer | 0.416 | 0.268 | 0.194 | 0.149 | 0.178 | 0.350 |
| MLP | 0.470 | 0.310 | 0.233 | 0.187 | 0.194 | 0.369 |
Secondly, we carried out experiments to explore the structure of TISR. To examine the structure of the textual inversion module, we replace the three-layer linear mapping with a three-layer transformer encoder. As Table 3 shows, the result is worse than the MLP with the same hidden dimension. We hypothesize that this is because medical image features are not complex, so a transformer may overfit while also increasing computational cost. We investigate the significance of cross-modal interaction and the MLP by removing cross attention and the MLP from the complete self-supervised refinement network, respectively. The significance of pseudo words is investigated by directly incorporating the image features into the self-supervised refinement network, and the importance of decoding the text embeddings is explored by computing the contrastive loss between the processed pseudo words and the image features. From Table 4 we can conclude that each module plays an important role in generating refined pseudo words, since removing any one of them degrades the performance of the network.
Table 4. Ablation of the self-supervised refinement components.

| Cross Attention | Pseudo Words | Decoding | MLP | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L |
|---|---|---|---|---|---|---|---|---|---|
| | | | | 0.442 | 0.277 | 0.198 | 0.152 | 0.176 | 0.352 |
| | | | | 0.446 | 0.291 | 0.218 | 0.175 | 0.183 | 0.365 |
| | | | | 0.459 | 0.298 | 0.221 | 0.173 | 0.186 | 0.358 |
| | | | | 0.432 | 0.301 | 0.226 | 0.175 | 0.190 | 0.398 |
| | | | | 0.470 | 0.310 | 0.233 | 0.187 | 0.194 | 0.369 |
Qualitative Analysis. We draw attention maps to examine which regions of the medical image each word of the generated report attends to. Fig. 3 illustrates that the model improved by TISR is more sensitive to the correct regions and generates reports closer to the ground truth. This demonstrates that our model can eliminate the cross-modal gap and thus generate reports faithful to images.
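Such word-level attention maps can be rendered roughly as follows: the cross-attention weights of one generated word over the image patches are reshaped to the patch grid, upsampled to the image resolution and overlaid on the radiograph. The patch grid size, image size and placeholder arrays are assumptions of this sketch.

```python
import numpy as np
import matplotlib.pyplot as plt

attn = np.random.rand(49)                      # attention of one generated word over a 7x7 patch grid
heatmap = attn.reshape(7, 7)
heatmap = np.kron(heatmap, np.ones((32, 32)))  # nearest-neighbor upsampling to 224x224
image = np.random.rand(224, 224)               # placeholder radiograph

plt.imshow(image, cmap="gray")
plt.imshow(heatmap, cmap="jet", alpha=0.4)     # overlay the attention heatmap on the image
plt.axis("off")
plt.savefig("attention_map.png")
```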

4 Conclusion
In this study, we propose Textual Inversion and Self-supervised Refinement (TISR) for radiology report generation. By inverting image features into pseudo words, textual inversion bridges the modality gap, representing visual features in the linguistic space. Self-supervised refinement iteratively improves the quality of the pseudo words by minimizing the contrastive loss between them and the image features. This iterative process helps generate radiology reports that are faithful to the radiology images. TISR complements most existing approaches seamlessly, offering a plug-and-play solution. Significant improvements across all three baselines illustrate the effectiveness and generalization of our proposed method.
References
- [1] Banerjee, S., Lavie, A.: Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In: Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization. pp. 65–72 (2005)
- [2] Brady, A., Laoide, R.Ó., McCarthy, P., McDermott, R.: Discrepancy and error in radiology: concepts, causes and consequences. The Ulster medical journal 81(1), 3 (2012)
- [3] Cao, M., Wei, F., Xu, C., Geng, X., Chen, L., Zhang, C., Zou, Y., Shen, T., Jiang, D.: Iterative proposal refinement for weakly-supervised video grounding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6524–6534 (2023)
- [4] Chen, W., Liu, Y., Wang, C., Li, G., Zhu, J., Lin, L.: Visual-linguistic causal intervention for radiology report generation. arXiv preprint arXiv:2303.09117 (2023)
- [5] Chen, Z., Shen, Y., Song, Y., Wan, X.: Cross-modal memory networks for radiology report generation. arXiv preprint arXiv:2204.13258 (2022)
- [6] Chen, Z., Song, Y., Chang, T.H., Wan, X.: Generating radiology reports via memory-driven transformer. arXiv preprint arXiv:2010.16056 (2020)
- [7] Chung, J., Gulcehre, C., Cho, K., Bengio, Y.: Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 (2014)
- [8] Demner-Fushman, D., Kohli, M.D., Rosenman, M.B., Shooshan, S.E., Rodriguez, L., Antani, S., Thoma, G.R., McDonald, C.J.: Preparing a collection of radiology examinations for distribution and retrieval. Journal of the American Medical Informatics Association 23(2), 304–310 (2016)
- [9] Deng, C., Ding, N., Tan, M., Wu, Q.: Length-controllable image captioning. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIII 16. pp. 712–729. Springer (2020)
- [10] Elliott, J.: The value of case reports in diagnostic radiography. Radiography 29(2), 416–420 (2023)
- [11] Harzig, P., Chen, Y.Y., Chen, F., Lienhart, R.: Addressing data bias problems for chest x-ray image report generation. arXiv preprint arXiv:1908.02123 (2019)
- [12] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)
- [13] Johnson, A.E., Pollard, T.J., Greenbaum, N.R., Lungren, M.P., Deng, C.y., Peng, Y., Lu, Z., Mark, R.G., Berkowitz, S.J., Horng, S.: Mimic-cxr-jpg, a large publicly available database of labeled chest radiographs. arXiv preprint arXiv:1901.07042 (2019)
- [14] Kong, M., Huang, Z., Kuang, K., Zhu, Q., Wu, F.: Transq: Transformer-based semantic query for medical report generation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 610–620. Springer (2022)
- [15] Li, C.Y., Liang, X., Hu, Z., Xing, E.P.: Knowledge-driven encode, retrieve, paraphrase for medical image report generation. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 33, pp. 6666–6673 (2019)
- [16] Li, H., Cao, M., Cheng, X., Li, Y., Zhu, Z., Zou, Y.: G2l: Semantically aligned and uniform video grounding via geodesic and game theory. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 12032–12042 (2023)
- [17] Li, H., Cao, M., Cheng, X., Li, Y., Zhu, Z., Zou, Y.: Exploiting auxiliary caption for video grounding. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 38, pp. 18508–18516 (2024)
- [18] Li, M., Lin, B., Chen, Z., Lin, H., Liang, X., Chang, X.: Dynamic graph enhanced contrastive learning for chest x-ray report generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3334–3343 (2023)
- [19] Li, M., Liu, R., Wang, F., Chang, X., Liang, X.: Auxiliary signal-guided knowledge encoder-decoder for medical report generation. World Wide Web 26(1), 253–270 (2023)
- [20] Li, Y., Liang, X., Hu, Z., Xing, E.P.: Hybrid retrieval-generation reinforced agent for medical image report generation. Advances in neural information processing systems 31 (2018)
- [21] Lin, C.Y.: Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out. pp. 74–81 (2004)
- [22] Liu, C., Tian, Y., Song, Y.: A systematic review of deep learning-based research on radiology report generation. arXiv preprint arXiv:2311.14199 (2023)
- [23] Manning, D., Ethell, S., Donovan, T., Crawford, T.: How do radiologists do it? the influence of experience and training on searching for chest nodules. Radiography 12(2), 134–142 (2006)
- [24] McGaghie, W.C.: Education for chest radiograph interpretation performance improvement. Chest 164(2), e57 (2023)
- [25] Nicolson, A., Dowling, J., Koopman, B.: Improving chest x-ray report generation by leveraging warm starting. Artificial intelligence in medicine 144, 102633 (2023)
- [26] Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics. pp. 311–318 (2002)
- [27] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)
- [28] Saito, K., Sohn, K., Zhang, X., Li, C.L., Lee, C.Y., Saenko, K., Pfister, T.: Pic2word: Mapping pictures to words for zero-shot composed image retrieval. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 19305–19314 (2023)
- [29] Smit, A., Jain, S., Rajpurkar, P., Pareek, A., Ng, A.Y., Lungren, M.P.: Chexbert: combining automatic labelers and expert annotations for accurate radiology report labeling using bert. arXiv preprint arXiv:2004.09167 (2020)
- [30] Tanida, T., Müller, P., Kaissis, G., Rueckert, D.: Interactive and explainable region-guided radiology report generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7433–7442 (2023)
- [31] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017)
- [32] Wang, Y., Wang, K., Liu, X., Gao, T., Zhang, J., Wang, G.: Self adaptive global-local feature enhancement for radiology report generation. In: 2023 IEEE International Conference on Image Processing (ICIP). pp. 2275–2279. IEEE (2023)
- [33] Wu, X., Li, H., Luo, Y., Cheng, X., Zhuang, X., Cao, M., Fu, K.: Uncertainty-aware sign language video retrieval with probability distribution modeling (2024)
- [34] Xue, Y., Xu, T., Rodney Long, L., Xue, Z., Antani, S., Thoma, G.R., Huang, X.: Multimodal recurrent model with attention for automated radiology report generation. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2018: 21st International Conference, Granada, Spain, September 16-20, 2018, Proceedings, Part I. pp. 457–466. Springer (2018)
- [35] Yan, B., Pei, M., Zhao, M., Shan, C., Tian, Z.: Prior guided transformer for accurate radiology reports generation. IEEE Journal of Biomedical and Health Informatics 26(11), 5631–5640 (2022)
- [36] Zhang, Y., Wang, X., Xu, Z., Yu, Q., Yuille, A., Xu, D.: When radiology report generation meets knowledge graph. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 34, pp. 12910–12917 (2020)