1 School of Information Technology and Electrical Engineering, The University of Queensland, Brisbane, Australia.
2 The Commonwealth Scientific and Industrial Research Organisation, Canberra, Australia.
3 Center for Medical Imaging, Robotics, Analytic Computing & Learning (MIRACLE), School of Biomedical Engineering & Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou, China.
4 Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing, China.
5 School of Human Movement and Nutrition Sciences, The University of Queensland, Brisbane, Australia.

Evidence-aware multi-modal data fusion and its application to total knee replacement prediction

Xinwen Liu 1    Jing Wang 2    S. Kevin Zhou 3,4    Craig Engstrom 5    Shekhar S. Chandra 1
Abstract

Deep neural networks have been widely studied for predicting a medical condition, such as total knee replacement (TKR). It has been shown that data of different modalities, such as imaging data, clinical variables and demographic information, provide complementary information and can thus jointly improve prediction accuracy. However, the data sources of the various modalities may not always be of high quality, and each modality may carry only partial information about the medical condition. Thus, the predictions from different modalities can contradict each other, and the final prediction may fail in the presence of such a conflict. It is therefore important to consider the reliability of each data source and of the prediction output when making a final decision. In this paper, we propose an evidence-aware multi-modal data fusion framework based on the Dempster-Shafer theory (DST). The backbone models contain an image branch, a non-image branch and a fusion branch. For each branch, there is an evidence network that takes the extracted features as input and outputs an evidence score, which is designed to represent the reliability of the output from the current branch. The output probabilities along with the evidence scores from the multiple branches are combined with Dempster’s combination rule to make a final prediction. Experimental results on the public Osteoarthritis Initiative (OAI) dataset for the TKR prediction task show the superiority of the proposed fusion strategy on various backbone models.

Keywords:
Dempster-Shafer theory · Medical condition prediction · Multi-modal data fusion

1 Introduction

Deep learning (DL) has been increasingly investigated for early-stage prediction of medical conditions and diseases. DL-based recommendations for early interventions can potentially lead to better treatment outcomes, reduce healthcare costs, and save patients’ lives. Deep neural networks have been leveraged to support clinical decision-making by analyzing large amounts of data consisting of health records, medical images, and clinical information [8, 9, 15, 25, 28, 30]. Data from multiple modalities normally contain complementary information, and the effective combination of multi-modal information can improve prediction and diagnosis accuracy [1, 2, 7, 19, 20, 21].

Multi-modality learning has been shown to be effective in many predictive tasks such as Alzheimer’s disease (AD) progression prediction [8, 12, 16, 29], knee osteoarthritis (OA) trajectory forecasting [9, 13, 14, 22, 26], COVID-19 mortality risk prediction [24] and peripapillary atrophy forecasting [17]. For example, knee OA is the most common musculoskeletal (MSK) disorder. When diagnosed early, knee OA can be managed through non-invasive treatment options; otherwise, the only remaining option is total knee replacement (TKR) surgery. Successful prediction of TKR can therefore assist with the optimal choice of treatment options. The authors of [27] combine magnetic resonance (MR) images and clinical data for accurate TKR prediction.

However, the prediction outcomes from each modality can differ, because the quality of the source data may not always be good and the information inherent in one modality only provides partial support for decision-making [18]. Without properly accounting for this variability across sources, the final decision can be catastrophically wrong, even though one source alone could have provided the right prediction. For example, when analyzing TKR prediction experiments with an existing method [27], we find that nearly half of the failure cases could have been classified correctly by a single-modality prediction.

To alleviate this issue and improve prediction confidence and accuracy, we propose an evidence-aware multi-modal data fusion framework for medical condition prediction based on the Dempster-Shafer theory (DST). Specifically, an evidence network is placed in parallel with each backbone prediction branch. The evidence networks take the extracted features as input and return evidence scores that quantify the confidence of the corresponding outputs. The evidence scores and the decision outputs of all sources are combined based on Dempster’s combination rule for a final prediction. The resulting decision takes into account both the probabilities and the evidence from all sources, yielding a more reliable final prediction with good interpretability.

2 Methods

2.1 Overall framework

The overall framework of the proposed evidence-aware multi-modal data fusion is shown in Fig. 1. In general, it contains backbone classification networks, evidence estimation networks, and the DST fusion module. The backbone network commonly includes two branches [9, 22, 27], namely, an image branch and a non-image branch. Concretely, each branch comprises a feature extraction backbone and a classifier that outputs a probability based on the corresponding data. The features from the two branches are concatenated to form a third, fusion branch, which is attached to a third classifier for prediction.
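To make the branch structure concrete, the following is a minimal PyTorch sketch of the two-branch backbone with a concatenation-based fusion branch. The 2D DenseNet-121 image encoder, the clinical-encoder sizes, and the module and variable names are illustrative assumptions only; the actual backbone in [27] extracts features from 3D DESS volumes.

```python
# Minimal sketch of the two-branch backbone plus a concatenation-based fusion
# branch. A 2D DenseNet-121 stands in for the volumetric image encoder, and
# the clinical encoder / layer sizes are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision.models import densenet121

class TwoBranchBackbone(nn.Module):
    def __init__(self, n_clinical: int, n_classes: int = 2):
        super().__init__()
        densenet = densenet121(weights=None)                 # torchvision >= 0.13 API
        self.image_encoder = densenet.features               # image feature extractor
        self.image_pool = nn.AdaptiveAvgPool2d(1)
        img_dim = densenet.classifier.in_features            # 1024 for DenseNet-121
        # Non-image branch: a small fully-connected encoder for clinical variables.
        self.clinical_encoder = nn.Sequential(
            nn.Linear(n_clinical, 32), nn.LayerNorm(32), nn.ReLU())
        # One classifier per branch, plus a classifier on the concatenated features.
        self.image_cls = nn.Linear(img_dim, n_classes)
        self.clinical_cls = nn.Linear(32, n_classes)
        self.fusion_cls = nn.Linear(img_dim + 32, n_classes)

    def forward(self, image, clinical):
        f_img = self.image_pool(self.image_encoder(image)).flatten(1)
        f_cli = self.clinical_encoder(clinical)
        f_fus = torch.cat([f_img, f_cli], dim=1)
        # Return per-branch logits and features; the features later feed the evidence networks.
        return {
            "image": (self.image_cls(f_img), f_img),
            "clinical": (self.clinical_cls(f_cli), f_cli),
            "fusion": (self.fusion_cls(f_fus), f_fus),
        }
```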

Figure 1: Illustration of the evidence-aware multi-modal data fusion framework. It is composed of an image branch, a non-image branch, and a fusion branch. For each branch, there is an evidence network that takes the extracted features as input and outputs an evidence score for the prediction. The evidence score and the probability of each branch are combined using Dempster’s combination rule.

Decision-level fusion of multi-modal data has been shown to be effective in medical condition prediction [9, 22, 27]. In our task, since the quality of the various modalities may vary, the decisions of the individual branches can also differ, resulting in conflicting predictions. The aim of our work is to provide a reliable fusion strategy for multi-modal data that improves prediction accuracy. To this end, we introduce an evidence network that outputs an evidence score for each branch. The evidence network is designed to estimate the reliability of the data and of the output of each branch. Finally, Dempster’s combination rule is employed to fuse the outputs of the image, non-image, and fusion branches according to the estimated evidence scores and obtain the final prediction. In this way, each branch’s decision is weighted according to its estimated reliability, alleviating conflicts at the decision fusion stage and leading to a more reliable prediction.

2.2 Evidence network

To reliably use the decisions from multiple modalities, we need to estimate the reliability of each modality. However, the backbone networks alone are not able to determine whether the data source is useful and whether the output is reliable. Therefore, an additional network branch, named the evidence network, is employed to evaluate the reliability of the output results. As illustrated in Fig. 1, we introduce three evidence networks for the image branch, the non-image branch, and the feature fusion branch, respectively. The input of each evidence network is the extracted features of the corresponding branch, and its output indicates the reliability of that branch. The evidence network contains a few linear layers, layer normalisation layers, and non-linear ReLU activation functions.
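A minimal sketch of one evidence network is given below, assuming the three interleaved linear, layer-normalisation and ReLU layers with hidden size 32 reported in Section 3.2; the final sigmoid that maps the output into [0, 1] is our assumption, as the range enforcement is not stated explicitly.

```python
# Minimal sketch of an evidence network: a small MLP over a branch's features
# that outputs an evidence score in [0, 1]. The sigmoid head is an assumption.
import torch.nn as nn

class EvidenceNetwork(nn.Module):
    def __init__(self, in_dim: int, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.LayerNorm(hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.LayerNorm(hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.LayerNorm(hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid())   # evidence score s_i in [0, 1]

    def forward(self, features):
        return self.net(features).squeeze(-1)
```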

We define the reliability of the current output $p_i$ of branch $i$ as an evidence score $s_i \in [0,1]$. The more reliable a decision is, the larger $s_i$ should be, meaning that the current branch is making a confident and correct decision. Conversely, when the branch makes a wrong decision, $s_i$ should be small, indicating that the output is highly uncertain and not reliable. Given a well-trained classifier of branch $i$, we pass all samples $x^j|_{j=1}^{N}$, where $N$ is the number of samples, through the network for classification and record whether it makes a mistake. When the classifier of the $i^{th}$ branch returns a wrong decision for a sample $x^j$, the decision $p_i^j$ is not reliable, and $s_i^j$ for this sample should be 0. On the contrary, when the classifier of branch $i$ returns a correct result for the sample, $s_i^j$ should be 1, denoting a trustworthy decision. We take these binary scores as the evidence network’s learning targets $s_i^{j*}$. The evidence networks are trained to generate evidence scores $\hat{\mathbf{s}}$ that are close to the true evidence scores $\mathbf{s}^{*}$.
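The following sketch illustrates how the binary evidence-score targets $s_i^{j*}$ can be generated from a frozen branch; the data-loader and branch interfaces are hypothetical and for illustration only.

```python
# Sketch of generating binary evidence-score targets: pass every sample through
# the frozen classifier of one branch and set the target to 1 when the branch
# classifies it correctly and 0 otherwise.
import torch

@torch.no_grad()
def make_evidence_targets(branch_classifier, feature_extractor, loader, device="cpu"):
    branch_classifier.eval()
    feature_extractor.eval()
    features, targets = [], []
    for x, y in loader:                                 # (input, label) batches
        x, y = x.to(device), y.to(device)
        feats = feature_extractor(x)
        preds = branch_classifier(feats).argmax(dim=1)
        targets.append((preds == y).float())            # 1 = reliable decision, 0 = mistake
        features.append(feats)
    return torch.cat(features), torch.cat(targets)
```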

2.3 Dempster’s combination rule

Dempster-Shafer theory (DST) [3, 4], also known as evidence theory or the theory of belief functions, is a framework for reasoning under uncertainty. DST combines the evidence from various sources with a degree of belief. Studies have shown that DST offers superior fusion results in many fields [5, 6, 10, 11, 18]. Dempster’s combination rule is based on DST and is applied to reasonably fuse the estimated evidence scores $\hat{\mathbf{s}}$ with the probabilities $\mathbf{p}$ of all branches.

Specifically, let $\Omega=\{\omega_{1},\omega_{2},\ldots,\omega_{K}\}$ be a set of $K$ hypotheses, and the evidence can be represented by a basic probability assignment, which is a mapping $m(\cdot):2^{\Omega}\rightarrow[0,1]$. We have $\sum_{A\subseteq\Omega} m(A)=1$, and $m(\cdot)$ is called a mass function.

The evidence of two mass functions $m_{1}(X_{1})$ and $m_{2}(X_{2})$ can be fused by Dempster’s combination rule:

m_{f}(A)=\frac{1}{M}\sum_{X_{1}\cap X_{2}=A} m_{1}(X_{1})\, m_{2}(X_{2}), \qquad (1)

where $M=\sum_{X_{1}\cap X_{2}\neq\emptyset} m_{1}(X_{1})\, m_{2}(X_{2})$. Eq. (1) is extendable to multiple pieces of evidence:

m_{f}(A)=\frac{1}{M}\sum_{X_{1}\cap X_{2}\cap\ldots\cap X_{T}=A}\ \prod_{i=1}^{T} m_{i}(X_{i}), \qquad (2)

where $M=\sum_{X_{1}\cap X_{2}\cap\ldots\cap X_{T}\neq\emptyset}\prod_{i=1}^{T} m_{i}(X_{i})$.

In our task, we have two hypotheses, positive and negative, so we define $\Omega=\{T,F\}$, where $T$ represents the positive outcome and $F$ represents the negative outcome. Then, we can construct the set $\{\varnothing,T,F,U\}$, where $U=\{T,F\}$ represents the uncertain prediction.

The mass function for each branch $i$ can be defined as the evidence-score-calibrated output (we omit the sample index $j$ here for simplicity):

\begin{cases} m_{i}(T)=\hat{s_{i}}\, p_{i};\\ m_{i}(F)=\hat{s_{i}}\,(1-p_{i});\\ m_{i}(U)=1-\hat{s_{i}}. \end{cases} \qquad (3)

Taking two pieces of evidence as an example, the combination of two branches can be formulated as:

m_{f}(T)=\frac{1}{M}\,[\,m_{1}(T)m_{2}(T)+m_{1}(T)m_{2}(U)+m_{2}(T)m_{1}(U)\,]; \qquad (4)
m_{f}(F)=\frac{1}{M}\,[\,m_{1}(F)m_{2}(F)+m_{1}(F)m_{2}(U)+m_{2}(F)m_{1}(U)\,]; \qquad (5)
m_{f}(U)=\frac{1}{M}\, m_{1}(U)m_{2}(U), \qquad (6)

where the normalisation factor $M$ is the sum of the three unnormalised terms in Eqs. (4)-(6), ensuring that $m_{f}(T)+m_{f}(F)+m_{f}(U)=1$. The combination rule can be extended to multiple branches according to Eq. (2).
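A small sketch of this fusion step is given below: it builds the mass functions of Eq. (3) and combines branches pairwise following Eqs. (4)-(6), which extends to three or more branches because pairwise combination is associative. The function names and the example values are illustrative only.

```python
# Sketch of the DST fusion step for the binary case of Eqs. (3)-(6).

def branch_mass(p, s):
    """Mass function (m(T), m(F), m(U)) from a positive-class probability p and evidence score s, Eq. (3)."""
    return s * p, s * (1.0 - p), 1.0 - s

def dempster_combine(m1, m2):
    """Combine two mass functions over {T, F, U} with Dempster's rule, Eqs. (4)-(6)."""
    t1, f1, u1 = m1
    t2, f2, u2 = m2
    t = t1 * t2 + t1 * u2 + t2 * u1      # unnormalised m_f(T)
    f = f1 * f2 + f1 * u2 + f2 * u1      # unnormalised m_f(F)
    u = u1 * u2                          # unnormalised m_f(U)
    m = t + f + u                        # normalisation factor M
    return t / m, f / m, u / m

def fuse_branches(probs, scores):
    """Fuse positive-class probabilities and evidence scores from all branches into one mass function."""
    masses = [branch_mass(p, s) for p, s in zip(probs, scores)]
    fused = masses[0]
    for mass in masses[1:]:
        fused = dempster_combine(fused, mass)
    return fused                         # (m_f(T), m_f(F), m_f(U))

# Example with conflicting branches: a confident image branch, an unreliable
# non-image branch, and a moderately reliable fusion branch.
m_T, m_F, m_U = fuse_branches(probs=[0.9, 0.3, 0.6], scores=[0.95, 0.2, 0.7])
prediction = "positive" if m_T > m_F else "negative"
```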

2.4 Network training

We first train the backbone classification networks with the cross-entropy loss. Once the backbone network is fully trained, we fix its weights and generate the evidence-score labels as described in Section 2.2. For each branch, we train an evidence network with a mean absolute error (MAE) loss $L_{evid}=\frac{1}{N}\sum_{j=1}^{N}|\hat{\mathbf{s}}^{j}-\mathbf{s}^{j*}|$. The true evidence scores are imbalanced in the training samples, where the majority are 1 and the minority are 0. Therefore, similar to [27], we duplicate the minority cases during training to balance the dataset.
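The following sketch outlines the evidence-network training step under these settings; the batch size, the exact weight-decay value, and the duplication helper are assumptions for illustration (the optimiser and learning rate follow Section 3.2).

```python
# Sketch of evidence-network training with the MAE loss L_evid and simple
# duplication of the minority (misclassified) samples to balance the targets.
import torch
from torch.utils.data import TensorDataset, DataLoader

def balance_by_duplication(features, targets):
    """Duplicate minority samples (targets == 0, i.e. misclassified cases) until the classes match."""
    minority_idx = torch.where(targets == 0)[0]
    majority_idx = torch.where(targets == 1)[0]
    n_extra = len(majority_idx) - len(minority_idx)
    if n_extra <= 0 or len(minority_idx) == 0:
        return features, targets
    extra = minority_idx[torch.randint(len(minority_idx), (n_extra,))]
    keep = torch.cat([torch.arange(len(targets)), extra])
    return features[keep], targets[keep]

def train_evidence_network(evidence_net, features, targets, epochs=100):
    features, targets = balance_by_duplication(features, targets)
    loader = DataLoader(TensorDataset(features, targets), batch_size=64, shuffle=True)
    opt = torch.optim.Adam(evidence_net.parameters(), lr=1e-4, weight_decay=1e-4)
    for _ in range(epochs):
        for f, s_true in loader:
            s_hat = evidence_net(f)
            loss = torch.mean(torch.abs(s_hat - s_true))   # MAE loss L_evid
            opt.zero_grad()
            loss.backward()
            opt.step()
    return evidence_net
```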

3 Experiments and results

3.1 Datasets

In this study, we apply the proposed methods to TKR prediction and follow the settings in [27] to conduct the experiments. The data were obtained from the Osteoarthritis Initiative (OAI) dataset [23]. 3D Double Echo Steady-State (DESS) MRI images are used as the imaging data, and 27 non-imaging variables are used for the non-image predictions, as in [27].

The DESS MRIs were center-cropped to a size of $320\times 320\times 120$. Patients who underwent TKR within 5 years are labelled as positive cases; the remaining samples form the non-TKR control group. The class imbalance problem is significant in this dataset, as the majority of patients did not have their knee replaced. To avoid the bias introduced by class imbalance, which can affect the evaluation of the model [31], we randomly sample the non-TKR cases so that their number is similar to that of the TKR cases. We repeated this random sampling process to create three datasets for evaluating the proposed method. In total, there are 1,717 samples in each dataset, and we use 80% for training, 10% for validation, and 10% for testing.
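As a sketch of this balancing and splitting protocol, assuming hypothetical lists of TKR and non-TKR case identifiers, the construction of one balanced dataset could look as follows.

```python
# Sketch of building one balanced dataset: undersample the non-TKR controls to
# roughly match the number of TKR cases, then split 80/10/10. Identifier lists
# and the seed handling are assumptions for illustration.
import random

def build_balanced_split(tkr_ids, non_tkr_ids, seed):
    rng = random.Random(seed)
    controls = rng.sample(non_tkr_ids, k=len(tkr_ids))   # undersample the non-TKR majority
    samples = tkr_ids + controls
    rng.shuffle(samples)
    n = len(samples)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    return (samples[:n_train],                            # 80% training
            samples[n_train:n_train + n_val],             # 10% validation
            samples[n_train + n_val:])                    # 10% testing
```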

3.2 Network configuration and experimental setups

The experiments were conducted on an NVIDIA Tesla V100 GPU, and we implemented the networks with the PyTorch library. The evidence networks in our study contain three interleaved linear, normalisation and ReLU layers, and all linear layers have a size of 32. We train the evidence networks for 100 epochs and save the best model on the validation dataset. We use the Adam optimiser with a learning rate of 0.0001 and weight decay to prevent over-fitting.

There are three settings in our experiments: image-only, clinical-only, and multi-modal, each evaluated on the three datasets. The backbone networks are mainly based on [27] and [9, 22]. For the image-only setting, we employed DenseNet-121 as our backbone, using the code and configurations from [27]. For the clinical-only experiments, we trained three models. (1) Logistic regression (LR): we first used a simple LR method to perform the classification [27]. (2) Fully-connected network (FC): since LR is a simple approach and contains only one linear layer, we improved the non-imaging branch with a fully-connected network. In FC, the categorical variables go through embedding layers, and the continuous variables are fed into linear layers. After layer normalisation and a ReLU activation, the projected features are summed and fed into another linear layer. (3) Transformer (Trans): to further improve the non-imaging branch, we use the transformer architecture [9, 22].

We then implemented the multi-modality models and our proposed method. The image model is combined with each of the three clinical models, respectively. The LR entry in the multi-modal section of Table 1 combines imaging and non-imaging data with logistic regression for multi-modal prediction [27]. For FC and Trans in the multi-modal part of Table 1, the features of the corresponding imaging and non-imaging branches are concatenated and fed into a third fusion branch, which contains a linear layer for decision making. Finally, we implemented the proposed fusion method on all baselines, labelled as DST in Table 1. Specifically, the features from the backbone networks are used as the evidence network input. The input of the evidence network for the imaging branch is drawn from the DenseNet-121 and has a feature size of 1018. The inputs of the evidence networks for the LR, FC, Trans and fusion branches have feature sizes of 34, 32, 64, and 4, respectively. We use accuracy, specificity, sensitivity, precision, F1 score and area under the ROC curve (AUC) from the scikit-learn package to evaluate the algorithms’ performance.

3.3 Algorithms comparison

The comparison of the various methods over all datasets is shown in Table 1. The image-only predictions achieve good accuracy of around 82%, 81% and 83% on datasets 1, 2, and 3, respectively. In comparison, the clinical-only predictions are generally inferior, with accuracies below 80%, except that FC and Trans on dataset 3 are close to 85%.

Table 1: Performance comparison of different methods on the OAI dataset. Bold numbers indicate the improved results.
Data Method Acc. Spec. Sens. Prec. F1 AUC
1 Image-only 82.08 86.00 76.71 71.20 78.32 81.36
Clinical-only LR 72.25 92.00 45.21 59.51 57.89 68.60
FC 79.77 83.00 75.34 67.96 75.86 79.17
Trans 78.61 82.00 73.97 66.46 74.48 77.99
Multi-modal LR 84.39 86.00 82.19 74.16 81.63 84.10
LR (DST) 85.55 89.00 80.82 76.21 82.52 84.91
FC 84.39 89.00 78.08 74.70 80.85 83.54
FC (DST) 85.55 91.00 78.08 76.68 82.01 85.55
Trans 83.82 93.00 71.23 74.92 78.79 82.12
Trans (DST) 84.39 92.00 73.97 75.41 80.00 82.99
2 Image-only 80.92 85.00 75.34 69.60 76.92 80.17
Clinical-only LR 69.94 88.00 45.21 56.27 55.93 66.60
FC 77.46 84.00 68.49 65.18 71.94 76.25
Trans 80.35 84.00 75.34 68.77 76.39 79.67
Multi-modal LR 83.81 87.00 79.45 75.38 80.56 83.23
LR (DST) 86.71 91.00 80.82 78.22 83.69 85.91
FC 83.82 86.00 80.82 73.41 80.82 83.41
FC (DST) 84.97 85.00 84.93 74.74 82.67 84.97
Trans 84.97 89.00 79.45 75.46 81.69 84.23
Trans (DST) 85.55 89.00 80.82 76.01 82.76 85.10
3 Image-only 83.24 87.00 78.08 72.83 79.72 82.54
Clinical-only LR 80.35 94.00 61.64 70.58 72.58 77.82
FC 84.97 86.00 83.56 74.90 82.43 84.78
Trans 85.55 87.00 83.56 75.82 82.99 85.28
Multi-modal LR 87.86 88.00 87.67 79.03 85.91 87.84
LR (DST) 90.75 92.00 89.04 83.91 89.04 90.52
FC 89.02 90.00 87.67 81.03 87.07 88.84
FC (DST) 90.17 92.00 87.67 83.13 88.28 89.84
Trans 89.02 90.00 87.67 81.03 87.07 88.84
Trans (DST) 89.60 92.00 86.30 82.36 87.50 89.15

The multi-modal networks without DST combine the information from both sources and consistently obtain better results than the single-modality predictions on all datasets. The accuracy of the multi-modal networks is around 2-4% higher than that of the best single-modality model. This means that combining information from multiple sources improves prediction performance.

However, the multi-modal networks without DST still perform a simple combination that treats all data sources and branches equally, and they cannot make good predictions when multiple branches output conflicting results with significantly different reliability. The proposed evidence-aware approach, labelled as DST, considers the reliability of the source data and of the network predictions, greatly reducing the uncertainty of the prediction. As illustrated in Table 1, the methods with DST almost consistently outperform their counterparts in terms of all metrics on all datasets. This shows the effectiveness of the proposed evidence network and fusion strategy.

3.4 Additional analysis

We further investigated the effectiveness of the proposed method. First, we compared the proposed fusion strategy with the average fusion method [18], which simply averages the classification probabilities of all branches. The comparison results of the proposed method (DST) and the average fusion (Avg.) are shown in Table 2. The proposed method has around 1% higher accuracy than the average fusion method on dataset 1, and the improvement is also seen on datasets 2 and 3. Therefore, it is important to consider the reliability of each branch’s output when producing the final results.

Table 2: Prediction accuracy of the proposed method (DST) and the average fusion (Avg.) [18] on all datasets with different baseline networks. D1, D2, and D3 represent datasets 1, 2, and 3, respectively.
Acc. D1 LR D1 FC D1 Trans D2 LR D2 FC D2 Trans D3 LR D3 FC D3 Trans
DST 85.55 85.55 84.39 86.71 84.97 85.55 90.75 90.17 89.60
Avg. 84.60 84.97 83.24 83.24 84.97 85.55 87.86 89.60 89.02

Second, we examine the evidence score estimation to confirm that the evidence networks have indeed learned to perform their task. Specifically, we check the evidence scores of the correctly classified samples and of the misclassified samples, respectively. The correctly classified samples should have large evidence scores, and the evidence scores of the misclassified samples should be near 0. We use histograms to visualise this. Figs. S1-S4 in the supplementary material show example histograms for the LR (DST) backbone on dataset 1. For most of the correct results, the scores fall near 1; for the wrong classifications, the scores are much smaller and close to 0. This means the evidence networks can learn the reliability of the backbone networks.

4 Discussion and conclusion

In this paper, we propose a novel evidence-aware multi-modal data fusion strategy for medical condition prediction based on DST. The proposed method considers the reliability of the source data and of the output of each modality. We apply our method to the TKR prediction task, and the experiments show that the proposed approach increases prediction accuracy. Although the experiments are conducted on DESS MRI and clinical data from the OAI dataset, other sources of data, such as biomechanical analysis, can also be included by adding an additional branch with a corresponding evidence network. The proposed approach is also applicable to other medical condition prediction tasks that require information from multiple modalities.

References

  • [1] Bayoudh, K., Knani, R., Hamdaoui, F., Mtibaa, A.: A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets. The Visual Computer pp. 1–32 (2021)
  • [2] Cheng, J., Liu, Z., Guan, H., Wu, Z., Zhu, H., Jiang, J., Wen, W., Tao, D., Liu, T.: Brain age estimation from mri using cascade networks with ranking loss. IEEE Transactions on Medical Imaging 40(12), 3400–3412 (2021)
  • [3] Dempster, A.P.: Upper and lower probability inferences based on a sample from a finite univariate population. Biometrika 54(3-4), 515–528 (1967)
  • [4] Dempster, A.P., et al.: Upper and lower probabilities induced by a multivalued mapping. Classic works of the Dempster-Shafer theory of belief functions 219(2), 57–72 (2008)
  • [5] Denoeux, T.: A neural network classifier based on dempster-shafer theory. IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans 30(2), 131–150 (2000)
  • [6] Denoeux, T.: A k-nearest neighbor classification rule based on dempster-shafer theory. Classic works of the Dempster-Shafer theory of belief functions pp. 737–760 (2008)
  • [7] Guan, B., Liu, F., Mizaian, A.H., Demehri, S., Samsonov, A., Guermazi, A., Kijowski, R.: Deep learning approach to predict pain progression in knee osteoarthritis. Skeletal Radiology 51(2), 363–373 (2022)
  • [8] Guan, H., Wang, C., Tao, D.: Mri-based alzheimer’s disease prediction via distilling the knowledge in multi-modal data. NeuroImage 244, 118586 (2021)
  • [9] Hoang Nguyen, H., Blaschko, M.B., Saarakkala, S., Tiulpin, A.: Clinically-inspired multi-agent transformers for disease trajectory forecasting from multimodal data. arXiv e-prints pp. arXiv–2210 (2022)
  • [10] Huang, L., Denoeux, T., Vera, P., Ruan, S.: Evidence fusion with contextual discounting for multi-modality medical image segmentation. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2022: 25th International Conference, Singapore, September 18–22, 2022, Proceedings, Part V. pp. 401–411. Springer (2022)
  • [11] Huang, L., Ruan, S., Denoeux, T.: Covid-19 classification with deep neural network and belief functions. In: The Fifth International Conference on Biological Information and Biomedical Engineering. pp. 1–4 (2021)
  • [12] Huang, M., Wang, T., Chen, X., Zhang, X., Zhou, S., Feng, Q.: Multi-view imputation and cross-attention network based on incomplete longitudinal and multi-modal data for alzheimer’s disease prediction. arXiv preprint arXiv:2206.08019 (2022)
  • [13] Joseph, G., McCulloch, C., Nevitt, M., Link, T., Sohn, J.: Machine learning to predict incident radiographic knee osteoarthritis over 8 years using combined mr imaging features, demographics, and clinical factors: data from the osteoarthritis initiative. Osteoarthritis and Cartilage 30(2), 270–279 (2022)
  • [14] Karim, M.R., Jiao, J., Döhmen, T., Cochez, M., Beyan, O., Rebholz-Schuhmann, D., Decker, S.: Deepkneeexplainer: explainable knee osteoarthritis diagnosis from radiographs and magnetic resonance imaging. IEEE Access 9, 39757–39780 (2021)
  • [15] Kline, A., Wang, H., Li, Y., Dennis, S., Hutch, M., Xu, Z., Wang, F., Cheng, F., Luo, Y.: Multimodal machine learning in precision health: A scoping review. npj Digital Medicine 5(1),  171 (2022)
  • [16] Lee, G., Nho, K., et al.: Predicting alzheimer’s disease progression using multi-modal deep learning approach. Scientific reports 9(1),  1952 (2019)
  • [17] Li, J., Wu, B., Sun, X., Wang, Y.: Causal hidden markov model for time series disease forecasting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12105–12114 (2021)
  • [18] Li, Q., Zhang, C., et al.: Confidence-aware fusion using dempster-shafer theory for multispectral pedestrian detection. IEEE Transactions on Multimedia (2022)
  • [19] Li, X., Jia, M., Islam, M.T., Yu, L., Xing, L.: Self-supervised feature learning via exploiting multi-modal data for retinal disease diagnosis. IEEE Transactions on Medical Imaging 39(12), 4023–4033 (2020)
  • [20] Liu, Y., Fan, L., Zhang, C., Zhou, T., Xiao, Z., Geng, L., Shen, D.: Incomplete multi-modal representation learning for alzheimer’s disease diagnosis. Medical Image Analysis 69, 101953 (2021)
  • [21] Mallya, M., Hamarneh, G.: Deep multimodal guidance for medical image classification. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2022: 25th International Conference, Singapore, September 18–22, 2022, Proceedings, Part VII. pp. 298–308. Springer (2022)
  • [22] Nguyen, H.H., Saarakkala, S., Blaschko, M.B., Tiulpin, A.: Climat: Clinically-inspired multi-agent transformers for disease trajectory forecasting from multi-modal data. arXiv preprint arXiv:2104.03642 (2021)
  • [23] Peterfy, C.G., Schneider, E., Nevitt, M.: The osteoarthritis initiative: report on the design rationale for the magnetic resonance imaging protocol for the knee. Osteoarthritis and cartilage 16(12), 1433–1441 (2008)
  • [24] Rahman, T., Chowdhury, M.E., Khandakar, A., Mahbub, Z.B., Hossain, M.S.A., Alhatou, A., Abdalla, E., Muthiyal, S., Islam, K.F., Kashem, S.B.A., et al.: Bio-cxrnet: A robust multimodal stacking machine learning technique for mortality risk prediction of covid-19 patients using chest x-ray images and clinical data. arXiv preprint arXiv:2206.07595 (2022)
  • [25] Rana, S.S., Ma, X., Pang, W., Wolverson, E.: A multi-modal deep learning approach to the early prediction of mild cognitive impairment conversion to alzheimer’s disease. In: 2020 IEEE/ACM International Conference on Big Data Computing, Applications and Technologies (BDCAT). pp. 9–18. IEEE (2020)
  • [26] Tiulpin, A., Klein, S., Bierma-Zeinstra, S., Thevenot, J., Rahtu, E., Meurs, J.v., Oei, E.H., Saarakkala, S.: Multimodal machine learning-based knee osteoarthritis progression prediction from plain radiographs and clinical data. Scientific reports 9(1), 1–11 (2019)
  • [27] Tolpadi, A.A., Lee, J.J., Pedoia, V., Majumdar, S.: Deep learning predicts total knee replacement from magnetic resonance images. Scientific reports 10(1), 1–12 (2020)
  • [28] Venugopalan, J., Tong, L., Hassanzadeh, H.R., Wang, M.D.: Multimodal deep learning models for early detection of alzheimer’s disease stage. Scientific reports 11(1),  3254 (2021)
  • [29] Zhang, D., Shen, D.: Multi-modal multi-task learning for joint prediction of clinical scores in alzheimer’s disease. In: Multimodal Brain Image Analysis: First International Workshop, MBIA 2011, Held in Conjunction with MICCAI 2011, Toronto, Canada, September 18, 2011. Proceedings 1. pp. 60–67. Springer (2011)
  • [30] Zheng, S., Zhu, Z., Liu, Z., Guo, Z., Liu, Y., Yang, Y., Zhao, Y.: Multi-modal graph learning for disease prediction. IEEE Transactions on Medical Imaging 41(9), 2207–2216 (2022)
  • [31] Zhou, S.K., Greenspan, H., Davatzikos, C., Duncan, J.S., Van Ginneken, B., Madabhushi, A., Prince, J.L., Rueckert, D., Summers, R.M.: A review of deep learning in medical imaging: Imaging traits, technology trends, case studies with progress highlights, and future promises. Proceedings of the IEEE 109(5), 820–838 (2021)

5 Supplementary Materials

Figure S1: Histogram of the evidence scores for the image branch of the LR (DST) model on dataset 1.
Figure S2: Histogram of the evidence scores for the clinical branch of the LR (DST) model on dataset 1.
Figure S3: Histogram of the evidence scores for the fusion branch of the LR (DST) model on dataset 1. The orange is the scores for the misclassified samples, and the purple is the scores for the correct samples.
Figure S4: We collect all the correct samples from three branches of the LR (DST) model on dataset 1 and visualise the evidence scores in purple. The histogram in orange is the evidence score distribution of the misclassified samples of all branches