An Intrinsically Explainable Approach to Detecting Vertebral Compression Fractures in CT Scans via Neurosymbolic Modeling
Abstract
Vertebral compression fractures (VCFs) are a common and potentially serious consequence of osteoporosis, yet they often remain undiagnosed. Opportunistic screening, which involves automated analysis of medical imaging data acquired primarily for other purposes, is a cost-effective method to identify undiagnosed VCFs. In high-stakes scenarios like opportunistic medical diagnosis, model interpretability is a key factor for the adoption of AI recommendations. Rule-based methods are inherently explainable and closely align with clinical guidelines, but they are not immediately applicable to high-dimensional data such as CT scans. To address this gap, we introduce a neurosymbolic approach for VCF detection in CT volumes. The proposed model combines deep learning (DL) for vertebral segmentation with a shape-based algorithm (SBA) that analyzes vertebral height distributions in salient anatomical regions, allowing a rule set to be defined over the height distributions to detect VCFs. Evaluation on the VerSe19 dataset shows that our method achieves an accuracy of 96% and a sensitivity of 91% in VCF detection. In comparison, a black-box model, DenseNet, achieved an accuracy of 95% and a sensitivity of 91% on the same dataset. Our results demonstrate that our intrinsically explainable approach can match or surpass the performance of black-box deep neural networks while providing additional insight into why a prediction was made. This transparency can enhance clinicians' trust, thus supporting more informed decision-making in VCF diagnosis and treatment planning.
Keywords: Explainable AI, Transparency, Interpretable AI, Machine Learning (ML), Artificial Intelligence (AI), Medical Image Analysis
1 INTRODUCTION
Vertebral compression fractures (VCFs) are a significant health concern, particularly in aging populations.[1, 2, 3] As the most common complication of osteoporosis,[4] VCFs affect more than 700,000 Americans annually[5, 6, 2, 7] and have a global incidence rate of 10.7 per 1000 women and 5.7 per 1000 men as of 2012.[2] Automated detection of VCFs in computed tomography (CT) images is an important step toward low-cost opportunistic screening of VCFs. It enables straightforward screening of CT images acquired in the course of routine care[8] and supports large-scale extraction of the desired cohort from existing image databases. Ultimately, these cohorts can be used to generate large-scale in silico simulations of VCF cases in other medical imaging modalities,[9, 10] facilitating further research and development in the field.
In recent years, deep neural networks (DNNs) have yielded significant improvements to VCF detection and classification,[11, 12] but the lack of explainability in these algorithms is regarded as a barrier to their real-world implementation and adoption.[13] One avenue for explainability is the use of saliency maps to identify image regions that most influenced a DNN’s output.[14] However, recent work suggests that saliency-based explainability is vulnerable to imperceptible perturbations in the input, which cause a DNN to reverse its decision without affecting the corresponding saliency map.[15] Other promising strategies leverage deep learning techniques with interpretable algorithms, ensuring the decision-making process adheres to standardized medical guidelines. [16, 17, 18] There is, however, an unmet need for more interpretable VCF diagnosis algorithms based on CT imaging.
Neurosymbolic AI, which combines the strengths of neural networks and symbolic AI, has been explored to create more interpretable AI systems.[19] There is also growing interest in evaluating user trust and adherence under different explainability mechanisms.[20, 21] In the scope of VCF detection, Burns et al.[22] and Baum et al.[23] automated the extraction of vertebral height parameters and used them to detect fractures, employing a machine learning model and statistical tests, respectively. However, these approaches still fall short of providing a fully transparent and interpretable solution. To address this need, we propose an intrinsically explainable model that extracts symbolic representations of knowledge and defines logical rules to accurately detect VCFs. Our approach generates standardized 2D height maps for each vertebra and computes statistical measurements from multiple sections of the map. Unlike traditional methods that focus on anterior, middle, and posterior heights (AH, MH, PH),[24, 23, 22, 25] our model captures height measurements across the entire axial plane of the vertebra. This allows for a more complete representation of vertebral structures and deformations. Using these parameters, a predefined 2-rule set indicates whether the vertebra is moderately or severely fractured, ensuring generalization and interpretability.
2 METHODS
2.1 Data processing and parameter extraction overview
This method aims to provide an interpretable decision-making pipeline to identify VCFs in a straightforward and intuitive way. Our model generates rules based on domain knowledge about vertebral anatomy and fracture characteristics, ensuring that the decision-making process is interpretable and evidence-based.[26, 27, 24]
Figure 1a shows the pipeline followed to generate the vertebral shape measurements. First, for every CT scan, the vertebral bodies are segmented using TotalSegmentator.[28, 29] Then, individual 3D meshes are generated for each vertebral body using the marching cubes algorithm.[30] Height map generation consists of multiple steps: main surface detection by applying k-means clustering (k=6) to the point cloud normal vectors; mesh reorientation to ensure the posterior and inferior surfaces are aligned with their corresponding planes; and height projection by fitting a 3D grid to the point cloud and computing the maximum height per column. Finally, seven regions of interest (ROIs) are defined and their statistical measurements (mean and standard deviation) are extracted. Similar to traditional approaches that use the Anterior-Posterior ratio (APR), Middle-Posterior ratio (MPR), and Middle-Anterior ratio (MAR) to quantify bone height loss,[23] we compute the pair-wise ratios of our sections' average heights. Conventionally, radiologists compare the vertebral heights (AH, MH, and PH) with an anatomically proximate comparator to determine the percentage height loss.[22] For each vertebra, we quantified this observation by finding the vertebra with the highest central average height (C) and computing the section-wise ratios between both height maps.
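The height-projection and ROI-statistics steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the grid size, the three-section ROI layout (the paper uses seven ROIs), and the function names are assumptions made for brevity, and the point cloud is assumed to be already reoriented so heights run along the z-axis.

```python
import numpy as np

def height_map(points, grid=16):
    """Project a vertebral-body point cloud onto a 2D map holding the
    maximum height (z) found in each (x, y) grid column."""
    xy, z = points[:, :2], points[:, 2]
    mins, maxs = xy.min(axis=0), xy.max(axis=0)
    # Normalize xy coordinates into integer cell indices in [0, grid).
    idx = np.clip(((xy - mins) / (maxs - mins + 1e-9) * grid).astype(int), 0, grid - 1)
    hmap = np.zeros((grid, grid))
    for (i, j), h in zip(idx, z):
        hmap[i, j] = max(hmap[i, j], h)
    return hmap

def roi_stats(hmap):
    """Mean and standard deviation of illustrative anterior, central,
    and posterior thirds of the height map."""
    g = hmap.shape[0]
    rois = {
        "A": hmap[: g // 3],              # anterior third
        "C": hmap[g // 3 : 2 * g // 3],   # central third
        "P": hmap[2 * g // 3 :],          # posterior third
    }
    return {name: (roi.mean(), roi.std()) for name, roi in rois.items()}
```

Pair-wise ratios of the per-section means (e.g. `stats["A"][0] / stats["P"][0]`) then play the role of the APR/MPR/MAR-style features described above.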


2.2 Rule generation
Our model uses the RuleFit[31] algorithm to identify the optimal combination of feature thresholds that maximizes accuracy. RuleFit combines decision rules from tree-based models with linear regression to create a predictive model that can capture non-linear relationships. Given a tree ensemble, RuleFit creates rules from all the trees of the ensemble, with each rule defining specific conditions based on feature thresholds (Eq. 1).
$r_m(x) = \prod_{j \in T_m} I(x_j \in s_{jm})$ (1)
where $T_m$ is the set of features used in the m-th tree and $I$ is the indicator function, which is 1 when feature $x_j$ lies in the subset of values $s_{jm}$ for the j-th feature and 0 otherwise. In our case, the tree ensemble model used is GradientBoostingClassifier, and the 3 rules with the highest stratification power are included as features for the regression model. Finally, we train a sparse linear model with LASSO on the new rule features, which results in a linear parametrization of the model,
$\hat{y}(x) = \hat{\beta}_0 + \sum_{m=1}^{M} \hat{\beta}_m r_m(x)$ (2)
where $\hat{\beta} = (\hat{\beta}_1, \ldots, \hat{\beta}_M)$ is the estimated weight vector for the rule features and $\hat{\beta}_0$ is the intercept.
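The two-stage process described above can be illustrated with a minimal RuleFit-style sketch using scikit-learn. This is not the authors' implementation: the toy data, rule extraction helper, and the `Lasso` hyperparameters are illustrative assumptions showing how leaf-path rules from a gradient-boosted ensemble become binary features for a sparse linear model.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import Lasso

def extract_rules(tree):
    """Return each leaf's root-to-leaf path as a list of
    (feature, threshold, go_left) tests."""
    t = tree.tree_
    rules, stack = [], [(0, [])]
    while stack:
        node, path = stack.pop()
        if t.children_left[node] == -1:  # leaf node
            if path:
                rules.append(path)
        else:
            f, thr = t.feature[node], t.threshold[node]
            stack.append((t.children_left[node], path + [(f, thr, True)]))
            stack.append((t.children_right[node], path + [(f, thr, False)]))
    return rules

def rule_features(X, rules):
    """Binary design matrix: column m is 1 where a sample satisfies rule m."""
    cols = []
    for rule in rules:
        mask = np.ones(len(X), dtype=bool)
        for f, thr, go_left in rule:
            mask &= (X[:, f] <= thr) if go_left else (X[:, f] > thr)
        cols.append(mask)
    return np.column_stack(cols).astype(float)

# Toy data where the label is defined by a single threshold rule,
# which the ensemble should recover.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = (X[:, 0] < -0.3).astype(int)

gb = GradientBoostingClassifier(n_estimators=10, max_depth=2, random_state=0).fit(X, y)
rules = [r for est in gb.estimators_.ravel() for r in extract_rules(est)]
lasso = Lasso(alpha=0.01).fit(rule_features(X, rules), y)
```

The LASSO penalty drives most rule weights to zero, so only a handful of interpretable threshold rules survive into the final linear model, mirroring the sparsity that makes the approach readable.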
3 RESULTS
Dataset
Our data was obtained from the Large Scale Vertebrae Segmentation Challenge (VerSe19),[32] and the ground truth scores defined by the Genant semiquantitative grading system[33] are publicly available as well. The VerSe19 dataset originally included 160 CT scans from 141 patients, containing 1,491 vertebrae. In this study, vertebrae with foreign material such as screws and other metal prostheses were excluded. Moreover, scans with only one annotated vertebra were not used, as the absence of another vertebra impeded the calculation of intervertebral ratios. After these exclusions, the final dataset contains 1,460 annotated thoracolumbar vertebrae. The dataset followed the same train/validation/test split defined in [34].
Rules for VCF detection
RuleFit returns a set of coefficients associated with each binary rule. According to the coefficients provided by our trained linear model, the VCF prediction is determined by Eq. 2, resulting in the dot product,
$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 \, I\big(\mathrm{avg}(A_0) < 0.91 \cdot \mathrm{avg}(P)\big) + \hat{\beta}_2 \, I\big(\mathrm{avg}(C) < 0.81 \cdot \mathrm{avg}(C_{ref})\big)$ (3)
Here, A0 is the antero-centric section, P is the posterior section, C is the whole central section, C_ref is the central section of the reference vertebra mentioned in Section 2.1, and avg(X) indicates the average height of section X. Thus, a VCF will be predicted positive IF: the average height of the antero-centric section is smaller than 91% of the average height of the posterior section, OR that condition is not met but the average height of the central section is smaller than 81% of the average height of the reference-vertebra central section.
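The decision logic stated above is simple enough to write out directly. The helper below is an illustrative rendering, not the authors' code; the section averages are assumed to be precomputed, and only the thresholds (0.91 and 0.81) come from the fitted rules.

```python
def predict_vcf(avg_a0, avg_p, avg_c, avg_c_ref):
    """Return True (fracture predicted) when either rule fires.

    avg_a0     -- average height of the antero-centric section (A0)
    avg_p      -- average height of the posterior section (P)
    avg_c      -- average height of the central section (C)
    avg_c_ref  -- average height of the reference vertebra's central section
    """
    rule1 = avg_a0 < 0.91 * avg_p       # antero-centric loss vs. posterior
    rule2 = avg_c < 0.81 * avg_c_ref    # central loss vs. reference vertebra
    return rule1 or rule2
```

Because the model reduces to two human-readable comparisons, each positive prediction can be traced back to exactly which rule fired, which is the transparency the paper argues for.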
Black-box model benchmark
We trained two backbone deep learning models, ResNeXt50 and DenseNet, as benchmarks in our study. For each vertebra, we resampled a cropped volume to 1 mm isotropic resolution. A stack of 14 centered sagittal slices was then extracted. Augmentation consisted of random rotations and flips. A class-balanced data sampler was used during training, based on all four fracture levels. After training, the slice-level predictions and labels are binarized. The slice-level threshold that determines the vertebra-level predictions is obtained by maximizing the Youden J statistic during validation.[35] Similar to [25], we take advantage of the available vertebral body segmentations to mask the 2D samples and retrain the DL models.
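The Youden-based threshold selection can be sketched with scikit-learn's ROC utilities. This is a generic illustration with made-up validation scores, not the paper's pipeline: J = sensitivity + specificity − 1 = TPR − FPR, and the chosen threshold maximizes J over the ROC curve.

```python
import numpy as np
from sklearn.metrics import roc_curve

def youden_threshold(y_true, y_score):
    """Return the score threshold on the ROC curve maximizing J = TPR - FPR."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    return thresholds[np.argmax(tpr - fpr)]

# Illustrative validation scores with clean class separation.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])
thr = youden_threshold(y_true, y_score)
```

With perfectly separated scores as above, the selected threshold sits at the lowest positive-class score, where sensitivity and specificity are both 1.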
We evaluate every model on the VerSe19 test set (Table 1). Both DL models benefit from masked data, which can be explained by the importance of detecting the vertebral endplates in the prediction of VCFs. However, our neurosymbolic approach outperforms both ResNeXt and DenseNet with and without masking. Figures 2a and 2b highlight the interpretability of the proposed model by providing a clear and intuitive visual representation of its reasoning criteria. This contrasts with the more complex raw sagittal CT images and vertebra masks. The generated maps facilitate visual inspection of data during opportunistic screenings by simplifying the assessment process, enabling quicker and more straightforward evaluations of vertebral conditions.


| Model | F1 | Accuracy | Precision | Recall |
|---|---|---|---|---|
| ResNeXt_unmasked | 0.62 | 0.88 | 0.45 | 0.97 |
| ResNeXt_masked | 0.65 | 0.90 | 0.50 | 0.94 |
| DenseNet_unmasked | 0.65 | 0.91 | 0.50 | 0.94 |
| DenseNet_masked | 0.77 | 0.95 | 0.67 | 0.91 |
| Interpretable approach | 0.81 | 0.96 | 0.74 | 0.91 |
4 CONCLUSION
Our neurosymbolic rule-based model demonstrates superior performance in detecting VCFs in CT scans compared to traditional deep learning models. Its intrinsic transparency provides clear visual and logical cues for the decision-making process, which is especially appealing in the scope of opportunistic detection of VCFs. Future work will investigate whether our explainable approach will allow users to adequately calibrate their trust in the automated detection method, thus supporting usability.
References
- [1] James A. Simon and Carol J. Mack “Prevention and management of osteoporosis” In Clin. Cornerstone 5 Excerpta Medica, 2003, pp. S5–S12 DOI: 10.1016/S1098-3597(03)90042-1
- [2] Daniela Alexandru and William So “Evaluation and Management of Vertebral Compression Fractures” In Permanente Journal 16.4 Kaiser Permanente, 2012, pp. 46 DOI: 10.7812/tpp/12-037
- [3] Seung-Kwan Lee, Deuk-Soo Jun, Dong-Keun Lee and Jong-Min Baik “Clinical Characteristics of Elderly People with Osteoporotic Vertebral Compression Fracture Based on a 12-Year Single-Center Experience in Korea” In Geriatrics 7.6 Multidisciplinary Digital Publishing Institute, 2022, pp. 123 DOI: 10.3390/geriatrics7060123
- [4] Daniel Alsoof et al. “Diagnosis and Management of Vertebral Compression Fracture” In Am. J. Med. 135.7 Elsevier, 2022, pp. 815–821 DOI: 10.1016/j.amjmed.2022.02.035
- [5] J.D. Barr, M.S. Barr, T.J. Lemley and R.M. McCann “Percutaneous vertebroplasty for pain relief and spinal stabilization” In Spine 25.8, 2000, pp. 923–928 DOI: 10.1097/00007632-200004150-00005
- [6] Jason McCarthy and Amy Davis “Diagnosis and management of vertebral compression fractures” In American Family Physician 94.1, 2016, pp. 44–50
- [7] Cyrus C Wong and Matthew J McGirt “Vertebral compression fractures: a review of current management and multimodal therapy” In Journal of multidisciplinary healthcare Taylor & Francis, 2013, pp. 205–214
- [8] Klaus Engelke, Oliver Chaudry and Stefan Bartenschlager “Opportunistic Screening Techniques for Analysis of CT Scans” In Current Osteoporosis Reports, 2023
- [9] Benjamin D. Killeen et al. “In silico simulation: a key enabling technology for next-generation intelligent surgical systems” In Prog. Biomed. Eng. 5.3 IOP Publishing, 2023, pp. 032001 DOI: 10.1088/2516-1091/acd28b
- [10] Qianye Yang et al. “MRI Cross-Modality Image-to-Image Translation” In Sci. Rep. 10.3753 Nature Publishing Group, 2020, pp. 1–18 DOI: 10.1038/s41598-020-60520-6
- [11] Amir Bar et al. “Compression fractures detection on CT” In Proceedings Volume 10134, Medical Imaging 2017: Computer-Aided Diagnosis 10134 SPIE, 2017, pp. 1036–1043 DOI: 10.1117/12.2249635
- [12] Magnus Grønlund Bendtsen and Mette Friberg Hitz “Opportunistic Identification of Vertebral Compression Fractures on CT Scans of the Chest and Abdomen, Using an AI Algorithm, in a Real-Life Setting” In Calcif. Tissue Int. 114.5 Springer US, 2024, pp. 468–479 DOI: 10.1007/s00223-024-01196-2
- [13] Perry J. Pickhardt et al. “Opportunistic Screening: Radiology Scientific Expert Panel” In Radiology Radiological Society of North America, 2023 URL: https://pubs.rsna.org/doi/full/10.1148/radiol.222044
- [14] Katarzyna Borys et al. “Explainable AI in medical imaging: An overview for clinical practitioners – Saliency-based XAI approaches” In Eur. J. Radiol. 162 Elsevier, 2023, pp. 110787 DOI: 10.1016/j.ejrad.2023.110787
- [15] Jiajin Zhang et al. “Revisiting the Trustworthiness of Saliency Methods in Radiology AI” In Radiology: Artificial Intelligence 6.1 Radiological Society of North America, 2024 DOI: 10.1148/ryai.220221
- [16] H. Chen, D. Dreizin, C. Gomez, A. Zapaishchykova and M. Unberath “Interpretable Severity Scoring of Pelvic Trauma Through Automated Fracture Detection and Bayesian Inference” In IEEE Transactions on Medical Imaging, 2024
- [17] H. Chen, M. Unberath and D. Dreizin “Toward automated interpretable AAST grading for blunt splenic injury” In Emergency Radiology, 2022
- [18] A. Zapaishchykova, D. Dreizin, Z. Li, J. Wu, S. Faghihroohi and M. Unberath “An interpretable approach to automated severity scoring in pelvic trauma” In Medical Image Computing and Computer Assisted Intervention – MICCAI, 2021 DOI: 10.1007/s10140-022-02099-1
- [19] H. Chen et al. “An interpretable algorithm for uveal melanoma subtyping from whole slide cytology images”, 2021
- [20] C. Gomez, R. Wang, K. Breininger, C. Casey, C. Bradley, M. Pavlak and M. Unberath “Explainable AI Enhances Glaucoma Referrals, Yet the Human-AI Team Still Falls Short of the AI Alone”, 2024
- [21] C. Gomez, B. Smith, A. Zayas, M. Unberath and T. Canares “Explainable AI decision support improves accuracy during telehealth strep throat screening” In npj Digital Medicine 7.18, 2024 DOI: 10.1038/s41746-024-00962-8
- [22] Joseph E. Burns, Jianhua Yao and Ronald M. Summers “Vertebral Body Compression Fractures and Bone Density: Automated Detection and Classification on CT Images” In Radiology 284.3, 2017 DOI: 10.1148/radiol.2017162100
- [23] Thomas Baum et al. “Automatic detection of osteoporotic vertebral fractures in routine thoracic and abdominal MDCT” In European Radiology 24.4, 2014 DOI: 10.1007/s00330-013-3089-2
- [24] Leon Lenchik, Lee F. Rogers, Pierre D. Delmas and Harry K. Genant “Diagnosis of Osteoporotic Vertebral Fractures: Importance of Recognition and Description by Radiologists” In American Journal of Roentgenology, 2004
- [25] Yuhang Wang, Zhiqin He, Qinmu Wu and Tingsheng Lu “Spinal Vertebral Fracture Detection and Fracture Level Assessment Based on Deep Learning” In CMC 79, 2024 DOI: 10.32604/cmc.2024.047379
- [26] H. Chen, C. Gomez, C. Huang and M. Unberath “Explainable medical imaging AI needs human-centered design: guidelines and evidence from a systematic review” In npj Digital Medicine 5.156, 2022 DOI: 10.1038/s41746-022-00699-2
- [27] Mikayel Grigoryan et al. “Recognizing and reporting osteoporotic vertebral fractures” In European Spine Journal 12.Suppl 2, 2003
- [28] Jakob Wasserthal et al. “TotalSegmentator: Robust Segmentation of 104 Anatomic Structures in CT Images” In Radiology: Artificial Intelligence Radiological Society of North America, 2023 URL: https://pubs.rsna.org/doi/10.1148/ryai.230024
- [29] Fabian Isensee et al. “nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation” In Nat. Methods 18 Nature Publishing Group, 2021, pp. 203–211 DOI: 10.1038/s41592-020-01008-z
- [30] William E. Lorensen and Harvey E. Cline “Marching cubes: A high resolution 3D surface construction algorithm” In Proceedings of the 14th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH ’87) 21.4, 1987
- [31] Jerome H. Friedman and Bogdan E. Popescu “Predictive Learning via Rule Ensembles” In Annals of Applied Statistics, 2008
- [32] Anjany Sekuboyina et al. “VerSe: A Vertebrae Labelling and Segmentation Benchmark for Multi-Detector CT Images” In Medical Image Analysis, 2021
- [33] Maximilian T. Löffler et al. “A Vertebral Segmentation Dataset with Fracture Grading” In Radiology: Artificial Intelligence, 2020
- [34] Anjany Sekuboyina et al. “VerSe: A Vertebrae Labelling and Segmentation Benchmark for Multi-detector CT Images” In arXiv, 2020 DOI: 10.1016/j.media.2021.102166
- [35] Marcus D. Ruopp et al. “Youden Index and optimal cut-point estimated from observations affected by a lower limit of detection” In Biometrical Journal 50.3, 2008