Completed Feature Disentanglement Learning for Multimodal MRIs Analysis
Abstract
Multimodal MRIs play a crucial role in clinical diagnosis and treatment. Feature disentanglement (FD)-based methods, which aim to learn superior feature representations for multimodal data analysis, have achieved significant success in multimodal learning (MML). Typically, existing FD-based methods separate multimodal data into modality-shared and modality-specific features, and employ concatenation or attention mechanisms to integrate these features. However, our preliminary experiments indicate that these methods can lose shared information among subsets of modalities when the inputs contain more than two modalities, and such information is critical for prediction accuracy. Furthermore, these methods do not adequately interpret the relationships between the decoupled features at the fusion stage. To address these limitations, we propose a novel Completed Feature Disentanglement (CFD) strategy that recovers the information lost during feature decoupling. Specifically, the CFD strategy not only identifies modality-shared and modality-specific features, but also decouples the shared features among subsets of the multimodal inputs, termed modality-partial-shared features. We further introduce a new Dynamic Mixture-of-Experts Fusion (DMF) module that dynamically integrates these decoupled features by explicitly learning the local-global relationships among them. The effectiveness of our approach is validated through classification tasks on three multimodal MRI datasets. Extensive experimental results demonstrate that our approach outperforms other state-of-the-art MML methods by clear margins.
Multimodal learning, Feature disentanglement, Dynamic fusion, MRIs.
1 Introduction


Multi-modality data contain multiple aspects of information about an object, and different modalities can provide complementary information. Numerous previous studies have demonstrated the remarkable success of multimodal learning (MML) [1] for medical image analysis. However, inappropriate processing of multimodal information can significantly impact the efficiency of MML. According to [2], the key to successful MML lies in achieving a higher quality of feature representation. Many previous works [3, 4, 5, 6, 7, 8, 9, 10, 11] have focused on enhancing feature representation quality, and they can be classified into three categories. The first two categories focus on extracting either the specific information of each modality [4, 3] or the information shared across multiple modalities [5, 6]. These methods cannot fully extract multimodal information, as they only focus on one type of feature, leading to information loss [11, 10]. The third category focuses on feature disentanglement (FD), which decouples modality-shared features as well as modality-specific features [7, 9, 8, 10], yielding promising results.
We revisit the relationship between the representation spaces of multimodal data. As illustrated in the concept map of the three-modal case in Fig. 1, we can intuitively consider that there exists modality-shared information (yellow area) as well as modality-specific information (green area). Upon further exploration, we discover that there is also shared information between subsets of modalities (blue area). However, existing FD methods could potentially ignore such information. Moreover, our preliminary experiments reveal that this lost information is crucial for accurate prediction (see the first and fourth rows of Table 4). In Fig. 2, we take the MEN dataset as an example for illustration. Both T1C and FLAIR-C highlight the tumor area, indicating shared tumor information. Similarly, FLAIR-C and ADC highlight the edema area, showing shared edema information. Additionally, T1C and ADC share information about cell density in the tumor area. In fact, such information shared among pair-wise modalities has been found to be relevant to the prediction of meningioma grade and invasion in clinical research [12, 13, 14].
Furthermore, current multimodal fusion studies mainly focus on uncertainty-based fusion methods [15, 16, 17] and attention-based methods [18, 19]. However, these approaches generally address modality-shared and modality-specific information fusion across multiple modalities, while overlooking the design of fusion mechanisms for shared information between subsets of modalities.
To tackle the above issues, we propose a completed feature disentanglement multimodal learning (CFDL) approach for multimodal MRIs analysis. First, we present a novel completed feature disentanglement (CFD) strategy to address the information loss of previous FD-based methods. In addition to decoupling modality-shared features among all modalities and modality-specific features, we further decouple features shared between subsets of modalities, referred to as modality-partial-shared features. The modality-partial-shared features within each group are expected to have high similarity to one another while being dissimilar from the other two kinds of features. As demonstrated in Section 4, these features play a critical role in prediction performance. Next, to improve the interpretability of feature fusion, we propose a new dynamic mixture-of-experts fusion (DMF) module, which explicitly captures the local-global interrelationships between the decoupled features to achieve more effective fusion. Finally, we evaluate our framework on three multimodal MRI datasets, demonstrating its effectiveness and superiority compared to state-of-the-art methods.

2 Related Work
2.1 Feature Representation Learning in MML
Feature representation learning is a crucial aspect of MML, and existing approaches can be grouped into three types of methods. Some approaches [4, 3] belong to the first type, which extracts specific features from each modality and subsequently fuses the obtained embeddings. Braman et al. [4] designed the Multimodal Orthogonalization (MMO) loss function to obtain maximally specific representations for each of radiology, pathology, genomic and clinical data. Several methods [5, 6] belong to the second type, which captures modality-shared features from multiple modalities. Ning et al. [6] built a bi-directional mapping between the original space and the shared space of multimodal data to effectively obtain a multimodal shared representation. However, the first two types of methods have primarily emphasized either modality-specific or modality-shared features, thus failing to learn a comprehensive representation of multimodal data. The third type, FD, has proven effective in separating multimodal information into meaningful components and has been successfully applied in various applications [7, 9, 8, 10]. Hu et al. [7] proposed a disentangled-multimodal adversarial autoencoder (DMM-AAE) model that employs a VAE to disentangle multimodal MRI information into modal-common and modal-specific features. However, this method only addresses the two-modal fusion scenario. Cheng et al. [9] extended this approach to the multimodal fusion scenario. It is worth noting that both of these methods cannot be trained end-to-end due to their reliance on hand-crafted features as model inputs. Hazarika et al. [8] decoupled multimodal information into modality-invariant and modality-specific features using the Central Moment Discrepancy metric, orthogonality constraints and a reconstruction loss. Li et al. [10] proposed decoupled multimodal distillation (DMD), which first separates the representation of each modality into a modality-irrelevant space and a modality-exclusive space; a graph distillation unit is then applied to each space to dynamically enhance the features of each modality.
The aforementioned FD methods have a common drawback that can result in incomplete feature representation learning in the case of three or more modalities, as depicted in Fig. 1. In contrast, the proposed CFD strategy addresses this limitation by decoupling multimodal information into modality-shared features, modality-specific features, and modality-partial-shared features, thereby enabling comprehensive feature representation learning.
2.2 Multimodal Feature Fusion
The fusion strategy is another crucial aspect of MML. Several approaches have involved concatenating features extracted from different modalities [20, 3] or from different representation spaces [7, 9, 10], such as modality-shared and modality-specific features. However, concatenation-based fusion does not effectively exploit the correlations between multiple modalities.
In recent years, there has been an increasing focus on exploring the correlations among multiple modalities to obtain effective features. Some methods achieve multimodal fusion by assigning weights or probabilities to each modality. The works in [15, 16, 17] explored the uncertainties of different modalities to obtain reliable multimodal fusion information. Choi et al. [21] proposed EmbraceNet, which performs multimodal representation fusion based on a probabilistic approach. Zhou et al. [22] introduced a canonical correlation analysis (CCA)-based method named ADCCA to exploit the correlation between multiple modalities and integrate their complementary information. Zhuang et al. [23] proposed a global-guided fusion method that considers both global and local correlations of multiple modalities.
With the proven ability of attention mechanisms to enhance feature representations and explore complex correlations between multiple modalities, many attention-based multimodal fusion methods have emerged [24, 18, 25, 19, 26]. Zhang et al. [18] proposed a modality-aware mutual learning (MAML) framework that weights the multimodal features using an attention-based modality-aware (MA) module. Zhu et al. [25] captured complementary information from multimodal data using self-attention and cross-modal attention, and further designed a triple network to obtain more discriminative information. Xing et al. [19] developed the NestedFormer framework, which includes a Nested Modality-aware Feature Aggregation (NMaFA) module to explore long-range correlations within and between modalities for effective and comprehensive information learning. However, these attention-based methods cannot explicitly reveal the contribution of each decoupled feature during the fusion process.
Mixture-of-Experts (MoE) [27] employs multiple experts to extract distinct representation spaces from the input and generates corresponding weights using a gating network, which allows it to dynamically capture the mixture of information from multiple experts. Several studies [28, 29] have extended MoE to handle multi-input scenarios, where each expert processes a specific input. These approaches leverage the dynamic nature of MoE. However, they concatenate all inputs to generate weights in the gating network without thoroughly considering the relationships between different inputs, which can limit the effectiveness of the fusion process. In contrast, we introduce a gating network that captures the local-global relationships between the decoupled features.
3 Method
Let us denote the input multimodal data as $\mathcal{D} = \{(\{x_i^m\}_{m=1}^{M}, y_i)\}_{i=1}^{N}$, where $N$ is the number of samples, $M$ denotes the number of modalities of each sample, and $y_i$ is the classification label of the $i$-th sample.
3.1 Overview
The proposed CFDL framework, illustrated in Fig. 3, comprises three parts: a) feature extraction from multimodal MRIs, b) completed feature disentanglement for feature decoupling, and c) dynamic MoE fusion for dynamically integrating the decoupled features. The framework employs the same type of backbone to extract a latent feature from each modality. To capture a comprehensive representation of the multimodal data, the latent features are decoupled into modality-shared features, modality-specific features, and modality-partial-shared features using the proposed CFD strategy. The decoupled features are then integrated via the DMF module. Within the DMF module, each decoupled feature is paired with a specific expert, and a gating network named LinG_GN generates weights for the experts. The fused feature is obtained by aggregating the weighted outputs of the experts. In the following, we take the three-modal case as an example to illustrate the CFD strategy and the DMF module.
3.2 Completed Feature Disentanglement Strategy
Inspired by previous FD methods [7, 8, 9], we first decouple the extracted latent features, denoted as $z_m$ for the $m$-th modality, into modality-shared features and modality-specific features. We employ a shared encoder $E_{S}$ to decouple the modality-shared features and three private encoders $\{E_{P}^{m}\}_{m=1}^{3}$ to decouple a modality-specific feature for each modality. The three modality-shared features can be formulated as follows:
$f_{S}^{m} = E_{S}(z_m), \quad m = 1, 2, 3, \quad (1)$
and the three modality-specific features can be obtained with:
$f_{P}^{m} = E_{P}^{m}(z_m), \quad m = 1, 2, 3. \quad (2)$
The final modality-shared feature is the mean of all modality-shared features, given by
$\bar{f}_{S} = \frac{1}{3}\sum_{m=1}^{3} f_{S}^{m}. \quad (3)$
We further consider modality-partial-shared features between pair-wise modalities. As a result, three groups of modality-partial-shared features are decoupled, with each group consisting of two features. The two features in the same group ($f_{PS}^{jk \to j}$, $f_{PS}^{jk \to k}$) are decoupled with the same partial-shared encoder, named $E_{PS}^{jk}$, i.e.,
$f_{PS}^{jk \to j} = E_{PS}^{jk}(z_j), \quad f_{PS}^{jk \to k} = E_{PS}^{jk}(z_k), \quad (j,k) \in \{(1,2), (1,3), (2,3)\}. \quad (4)$
Specifically, $f_{PS}^{jk \to j}$ represents the modality-partial-shared feature between the $j$-th modality and the $k$-th modality, which is decoupled from the $j$-th modality. The final modality-partial-shared feature of each group can be calculated by averaging the two modality-partial-shared features in that group,
$\bar{f}_{PS}^{jk} = \frac{1}{2}\left(f_{PS}^{jk \to j} + f_{PS}^{jk \to k}\right). \quad (5)$
In the end, we obtain three modality-shared features ($f_{S}^{1}$, $f_{S}^{2}$, $f_{S}^{3}$), three modality-specific features ($f_{P}^{1}$, $f_{P}^{2}$, $f_{P}^{3}$), and three groups of modality-partial-shared features, namely $\{f_{PS}^{12 \to 1}, f_{PS}^{12 \to 2}\}$, $\{f_{PS}^{13 \to 1}, f_{PS}^{13 \to 3}\}$ and $\{f_{PS}^{23 \to 2}, f_{PS}^{23 \to 3}\}$. Furthermore, we obtain 7 final decoupled features, denoted as a set $\mathcal{F} = \{\mathcal{F}_1, \ldots, \mathcal{F}_7\} = \{\bar{f}_{S}, f_{P}^{1}, f_{P}^{2}, f_{P}^{3}, \bar{f}_{PS}^{12}, \bar{f}_{PS}^{13}, \bar{f}_{PS}^{23}\}$.
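To make the decoupling step concrete, the following is a minimal PyTorch sketch of the CFD strategy for the three-modal case. The module and variable names (e.g., CFDecoupler, latent_dim) are our own illustrative choices rather than the authors' released code; the latent dimension of 512 and the encoder width $d = 32$ follow Sec. 3.4.

```python
import itertools
import torch
import torch.nn as nn

class CFDecoupler(nn.Module):
    """Completed feature disentanglement for the three-modal case (illustrative sketch)."""

    def __init__(self, latent_dim: int = 512, d: int = 32, num_modalities: int = 3):
        super().__init__()
        self.pairs = list(itertools.combinations(range(num_modalities), 2))  # (0,1), (0,2), (1,2)
        self.shared_enc = nn.Linear(latent_dim, d)                            # E_S, shared by all modalities
        self.private_encs = nn.ModuleList(
            nn.Linear(latent_dim, d) for _ in range(num_modalities))          # E_P^m, one per modality
        self.partial_encs = nn.ModuleList(
            nn.Linear(latent_dim, d) for _ in self.pairs)                     # E_PS^{jk}, one per modality pair

    def forward(self, z):  # z: list of M backbone features, each of shape (B, latent_dim)
        f_shared = [self.shared_enc(z_m) for z_m in z]                  # f_S^m
        f_spec = [enc(z_m) for enc, z_m in zip(self.private_encs, z)]   # f_P^m
        f_partial_groups = [
            (enc(z[j]), enc(z[k]))                                      # f_PS^{jk->j}, f_PS^{jk->k}
            for enc, (j, k) in zip(self.partial_encs, self.pairs)]
        # Final decoupled features: 1 shared + M specific + C(M,2) partial-shared = 7 for M = 3.
        final_shared = torch.stack(f_shared).mean(dim=0)                            # \bar{f}_S
        final_partial = [torch.stack(g).mean(dim=0) for g in f_partial_groups]      # \bar{f}_PS^{jk}
        final_feats = [final_shared] + f_spec + final_partial                       # the set F
        return f_shared, f_spec, f_partial_groups, final_feats
```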
To enhance the completeness of the decoupled representation, we use the following three constraints:
1) Modality-shared features should exhibit high similarity to one another.
2) Modality-partial-shared features within each group should have maximum similarity.
3) The final decoupled features should exhibit maximum dissimilarity from one another.
To ensure the effective decoupling of modality-shared features and modality-partial-shared features, we employ the mean squared error (MSE) loss as a constraint. The MSE loss measures the discrepancy between two features, and we aim to increase similarity between two features by minimizing this loss. The losses for modality-shared features and modality-partial-shared features are expressed as:
$\mathcal{L}_{sh} = \sum_{1 \le m < n \le 3} \mathrm{MSE}\left(f_{S}^{m}, f_{S}^{n}\right), \quad (6)$
$\mathcal{L}_{ps} = \sum_{(j,k) \in \{(1,2),(1,3),(2,3)\}} \mathrm{MSE}\left(f_{PS}^{jk \to j}, f_{PS}^{jk \to k}\right). \quad (7)$
We denote $\mathcal{L}_{sim}$ as the sum of $\mathcal{L}_{sh}$ and $\mathcal{L}_{ps}$.
To enhance the decoupling of modality-specific features and minimize redundancy among all final decoupled features, we incorporate cosine similarity as a constraint for better optimization. Our objective is to increase the dissimilarity between the final decoupled features by reducing the cosine similarity between each pair of these features. The loss for all final decoupled features is calculated by:
$\mathcal{L}_{diff} = \sum_{1 \le a < b \le 7} \mathrm{CosSim}\left(\mathcal{F}_a, \mathcal{F}_b\right), \quad (8)$
where $\mathrm{CosSim}(\cdot, \cdot)$ represents the cosine similarity function.
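As a companion to Eqs. (6)-(8), the snippet below sketches how the similarity and dissimilarity constraints could be computed from the outputs of the decoupler sketched above; the function name and the reduction over the batch are our assumptions, not the authors' exact implementation.

```python
import itertools
import torch
import torch.nn.functional as F

def cfd_losses(f_shared, f_partial_groups, final_feats):
    """Similarity (MSE) and dissimilarity (cosine) constraints of the CFD strategy (sketch)."""
    # Eq. (6): pull the modality-shared features together.
    l_sh = sum(F.mse_loss(f_a, f_b) for f_a, f_b in itertools.combinations(f_shared, 2))
    # Eq. (7): pull the two features of each partial-shared group together.
    l_ps = sum(F.mse_loss(f_j, f_k) for f_j, f_k in f_partial_groups)
    l_sim = l_sh + l_ps
    # Eq. (8): push every pair of final decoupled features apart via cosine similarity.
    l_diff = sum(F.cosine_similarity(f_a, f_b, dim=1).mean()
                 for f_a, f_b in itertools.combinations(final_feats, 2))
    return l_sim, l_diff
```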
3.3 Dynamic MoE Fusion Module
To ensure dynamic fusion of the final decoupled features, we introduce the DMF module based on the MoE architecture, which is shown in Fig. 3 (c). In the DMF module, each final decoupled feature $\mathcal{F}_i$ is associated with a specific expert $\mathrm{Expert}_i$, which is implemented as a fully-connected layer. We introduce the LinG_GN to dynamically generate weights for these experts, taking into account the relationships of the final decoupled features. The LinG_GN operates with two inputs to capture a comprehensive understanding of these features. Firstly, we concatenate all the final decoupled features as one input for the LinG_GN. To integrate this concatenated feature, we map it into a unified representation utilizing a fully-connected layer $\mathrm{FC}_g$. This unified representation captures the collective information from all the final decoupled features and is treated as the global feature $f_g$,
$f_g = \delta\left(\mathrm{FC}_g\left(\left[\mathcal{F}_1 \,\|\, \mathcal{F}_2 \,\|\, \cdots \,\|\, \mathcal{F}_7\right]\right)\right), \quad (9)$
where $\|$ denotes the column concatenation operation and $\delta(\cdot)$ is the activation function. Secondly, we stack all the final decoupled features as the second input for the LinG_GN. These stacked features retain the individual information of each final decoupled feature and are considered as the local features, denoted as $F_l \in \mathbb{R}^{7 \times d}$, where $d$ is the dimension of each final decoupled feature. By obtaining both the global feature and the local features, we can explore the importance of each local feature within the context of the global feature. This exploration allows us to determine the weights of the local features in contributing to the fused representation. The weights are calculated by:
$w = \mathrm{Softmax}\left(F_l \otimes f_g\right), \quad (10)$
where $\otimes$ represents the matrix multiplication operation and $w \in \mathbb{R}^{7}$. Each element $w_i$ in $w$ represents the weight of the corresponding expert network $\mathrm{Expert}_i$. The fused feature $f_{fuse}$ is obtained by concatenating the linearly weighted expert outputs,
$f_{fuse} = \left[\, w_1 \cdot \mathrm{Expert}_1(\mathcal{F}_1) \,\|\, w_2 \cdot \mathrm{Expert}_2(\mathcal{F}_2) \,\|\, \cdots \,\|\, w_7 \cdot \mathrm{Expert}_7(\mathcal{F}_7) \,\right]. \quad (11)$
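The DMF module of Eqs. (9)-(11) can be sketched as follows. The per-feature linear experts, the global feature computed from the concatenation, and the softmax over the local-global products follow the description above, while the module name DMFusion and the choice of ReLU as the activation $\delta$ are our assumptions.

```python
import torch
import torch.nn as nn

class DMFusion(nn.Module):
    """Dynamic MoE fusion with the LinG_GN gating network (illustrative sketch)."""

    def __init__(self, d: int = 32, num_feats: int = 7):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(d, d) for _ in range(num_feats))  # one expert per feature
        self.fc_g = nn.Linear(num_feats * d, d)        # maps the concatenation to the global feature f_g
        self.act = nn.ReLU()                           # activation delta; ReLU is an assumption

    def forward(self, final_feats):                    # list of num_feats tensors, each (B, d)
        local = torch.stack(final_feats, dim=1)        # local features F_l: (B, 7, d)
        f_g = self.act(self.fc_g(torch.cat(final_feats, dim=1)))                   # Eq. (9): (B, d)
        w = torch.softmax(torch.bmm(local, f_g.unsqueeze(-1)).squeeze(-1), dim=1)  # Eq. (10): (B, 7)
        # Eq. (11): concatenate the linearly weighted expert outputs.
        fused = torch.cat([w[:, i:i + 1] * expert(f)
                           for i, (expert, f) in enumerate(zip(self.experts, final_feats))], dim=1)
        return fused, w                                # fused: (B, 7*d); w exposes per-expert weights
```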
For the final prediction $\hat{y}$, we utilize a multi-layer perceptron ($\mathrm{MLP}$) as the classifier, i.e., $\hat{y} = \mathrm{MLP}(f_{fuse})$. The cross-entropy (CE) loss is employed as the supervision for the prediction. The classification loss is defined as:
$\mathcal{L}_{cls} = \mathrm{CE}\left(\hat{y}, y\right). \quad (12)$
The final loss is defined as the weighted sum of the aforementioned losses,
$\mathcal{L} = \mathcal{L}_{cls} + \alpha\,\mathcal{L}_{sim} + \beta\,\mathcal{L}_{diff}, \quad (13)$
where $\alpha$ and $\beta$ are balance factors.
3.4 Network Architecture
We utilize 3D ResNet18 [30] as the backbone for feature extraction, and the parameters of the backbones are not shared across modalities. The dimension of each extracted latent feature is 512. The shared encoder $E_{S}$, each private encoder $E_{P}^{m}$, each partial-shared encoder $E_{PS}^{jk}$, and each expert $\mathrm{Expert}_i$ are all implemented as one fully-connected layer with $d$ neurons. The $\mathrm{MLP}$ consists of two fully-connected layers followed by one output layer with $C$ neurons, where $C$ represents the number of classes. Each fully-connected layer in the $\mathrm{MLP}$ is followed by a ReLU layer and a Dropout layer. We empirically set $d$ to 32.
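Putting the pieces together, a hypothetical end-to-end assembly of CFDL could look like the sketch below, which reuses the CFDecoupler and DMFusion sketches given earlier. The per-modality backbones are passed in as arguments, and the hidden widths and dropout rate of the MLP head are assumptions, since these values are not fixed in the text above.

```python
import torch
import torch.nn as nn

class CFDL(nn.Module):
    """End-to-end assembly of the CFDL framework (illustrative sketch)."""

    def __init__(self, backbones, num_classes, latent_dim=512, d=32, dropout=0.5):
        super().__init__()
        m = len(backbones)                                   # number of modalities
        self.backbones = nn.ModuleList(backbones)            # one unshared 3D backbone per modality
        self.decoupler = CFDecoupler(latent_dim, d, m)       # CFD strategy (see earlier sketch)
        num_finals = 1 + m + m * (m - 1) // 2                # shared + specific + pair-wise partial-shared
        self.fusion = DMFusion(d, num_finals)                # DMF module (see earlier sketch)
        self.classifier = nn.Sequential(                     # MLP head; widths and dropout rate are assumptions
            nn.Linear(num_finals * d, d), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(d, d), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(d, num_classes))

    def forward(self, xs):                                    # xs: list of per-modality 3D volumes
        z = [bb(x) for bb, x in zip(self.backbones, xs)]      # latent features, each (B, latent_dim)
        f_shared, f_spec, f_partial_groups, finals = self.decoupler(z)
        fused, weights = self.fusion(finals)
        logits = self.classifier(fused)                       # prediction for the CE loss
        return logits, (f_shared, f_partial_groups, finals), weights
```

During training, the returned intermediate features can feed the cfd_losses sketch above, and the total objective combines the classification and CFD losses as in Eq. (13).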

4 Experiments
4.1 Datasets and Tasks
We utilize three multimodal MRI datasets, consisting of two public datasets and one private dataset, to verify the effectiveness of the proposed framework. Example cases from these three datasets are shown in Fig. 4.
4.1.1 Meniscal Tear Prediction
For the prediction of meniscal tear, we employ the MRNet dataset [31], a publicly available multimodal knee MRI dataset (more information about the MRNet dataset is available at https://stanfordmlgroup.github.io/competitions/mrnet/). There are 1130 cases in the training set and 120 cases in the validation set. The training set contains 397 meniscal tear cases and 733 control cases, and the validation set contains 52 meniscal tear cases and 68 control cases. Each case includes three MRIs: a sagittal plane T2-weighted series (T2-sagittal), a coronal plane T1-weighted series (T1-coronal) and an axial plane PD-weighted series (PD-axial). We resize the MRIs to 24×128×128 as the model input.
4.1.2 Meningiomas Grading Prediction
We collect the Meningiomas Grading Prediction dataset, referred to as MEN, from the Brain Medical Center of Tianjin University, Tianjin Huanhu Hospital (the Ethical Committee of Tianjin Huanhu Hospital approved scientific research using these MRIs and waived the need for informed patient consent; (Jinhuan) Ethical Review No. (2022-046)). This dataset consists of three grades of meningiomas: Grade 1 (G1), Grade 2 with invasion (G2inv) and Grade 2 without invasion (G2ninv). The total dataset comprises 798 cases, including 650 G1, 62 G2inv and 86 G2ninv cases. Each case includes three brain MRIs: a Contrast-Enhanced T1 series (T1C), a Contrast-Enhanced T2 FLAIR series (FLAIR-C) and an Apparent Diffusion Coefficient series (ADC). Following previous works [32, 33], we request radiologists to crop the regions of interest (ROIs). To maintain the shape of the tumor and edema regions, the ROIs are zero-padded into squares and resized to 24×128×128, which serve as the inputs to the model.
4.1.3 MGMT Promoter Status Prediction
The MGMT Promoter Status Prediction dataset, known as BraTS 2021 [34], is a publicly available multimodal brain MRI dataset. It encompasses cases with MGMT methylated (MGMT+) and unmethylated (MGMT-) status. The dataset comprises 580 available cases (we drop 5 of the original 585 cases during pre-processing: 3 cases have unexpected issues and the other 2 cannot be registered by CaPTk; see https://www.kaggle.com/c/rsna-miccai-brain-tumor-radiogenomic-classification for more information), with each case containing four modalities: T1, post-contrast T1-weighted (T1Gd), T2-weighted (T2), and T2 Fluid Attenuated Inversion Recovery (T2-FLAIR). Specifically, there are 275 MGMT- cases and 305 MGMT+ cases. Pre-processing for each modality involves image registration and skull-stripping using the Cancer Imaging Phenomics Toolkit (CaPTk) [35]. We crop the ROIs using masks generated by the pretrained Swin UNETR [36]. Finally, we zero-pad the ROIs into squares and resize them to 16×128×128 as the inputs for the proposed method.
4.2 Implementation Details
4.2.1 Training Details
We employ 3-fold cross-validation for private MEN and public BraTS 2021 datasets, and train three times using different seeds with already divided training and validation data for MRNet dataset. During model training, we implement several techniques to prevent overfitting, such as data augmentation, L2 regularization (weight decay) and dropout [37]. Data augmentation techniques include random clip, random crop, gaussian noise and random erasing [38]. The weight decay is set as , and the dropout value is set to . The network is optimized with the Adam optimizer [39]. We linearly warm up the learning rate from zero to the preset value over epochs and apply a learning rate decay strategy, reducing the learning rate to after every epochs. The batch size is set . For MRNet dataset, we initialize the learning rate value as , and set the number of epochs to . For MEN dataset, the learning rate is specified as , and number of epochs is fixed as . For BraTS 2021 dataset, we preset the learning rate value as and the number of epochs as . The balance factors, and , are set to {}, {}, {} for MRNet, MEN and BraTS, respectively. Details of the ablation analysis for the balance factors are provided in the supplementary material. All experiments are conducted with PyTorch on an NVIDIA RTX 3090 GPU.
4.2.2 Evaluation Metrics
For the two-class datasets, MRNet and BraTS 2021, we employ seven metrics to assess the effectiveness of the proposed framework: Sensitivity (SEN), Specificity (SPE), Accuracy (ACC), G-mean, Balanced Accuracy (Ba_ACC) [40], Area Under the Precision-Recall Curve (AUPRC), and Area Under the Curve (AUC). For the three-class dataset, MEN, we utilize seven evaluation metrics: Accuracy (ACC), Accuracy of G1 (ACC_G1), Accuracy of G2inv (ACC_G2inv), Accuracy of G2ninv (ACC_G2ninv), weighted F1 score (weighted-F1), macro F1 score (macro-F1), and AUC. For the statistical analysis, the Wilcoxon signed-rank test [41] is adopted to compare the metrics of our proposed framework with those of the other methods.
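For the binary tasks, the threshold-based metrics reduce to simple functions of sensitivity and specificity, as the short sketch below illustrates using scikit-learn; the helper name and the 0.5 decision threshold are our own choices for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score, average_precision_score

def binary_metrics(y_true, y_prob, threshold=0.5):
    """SEN, SPE, ACC, G-mean, balanced accuracy, AUPRC and AUC for a binary task (sketch)."""
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sen = tp / (tp + fn)                      # sensitivity (recall of the positive class)
    spe = tn / (tn + fp)                      # specificity (recall of the negative class)
    return {
        "SEN": sen,
        "SPE": spe,
        "ACC": (tp + tn) / (tp + tn + fp + fn),
        "G-mean": np.sqrt(sen * spe),
        "Ba_ACC": (sen + spe) / 2.0,          # balanced accuracy [40]
        "AUPRC": average_precision_score(y_true, y_prob),
        "AUC": roc_auc_score(y_true, y_prob),
    }
```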
4.2.3 Compared Methods
We compare the proposed framework with nine state-of-the-art (SOTA) MML methods: EmbraceNet [21], ETMC [16], ADCCA [22], MAML [18], NestedFormer [19], MISA [8], DMD [10], CCML [17] and GLoMo [23]. Specifically, ETMC and CCML are uncertainty-based MML methods, ADCCA and GLoMo are correlation-based MML methods, MAML and NestedFormer are attention-based MML methods originally designed for segmentation but adapted to the classification task by adding a classifier after the encoders, and MISA and DMD are FD-based MML methods. To ensure a fair comparison, except for the transformer-based NestedFormer, we set the backbones of the CNN-based comparison methods to be the same as that of the proposed framework.





Method | SEN | SPE | ACC | G-Mean | Ba_ACC | AUPRC | AUC
EmbraceNet [21] | 0.6410* ± 0.0111 | 0.7696 ± 0.0424 | 0.7139* ± 0.0192 | 0.7021* ± 0.0136 | 0.7053* ± 0.0157 | 0.5924* ± 0.0199 | 0.7509* ± 0.0136
MISA [8] | 0.6795* ± 0.0111 | 0.7451 ± 0.0557 | 0.7167* ± 0.0289 | 0.7111* ± 0.0232 | 0.7123* ± 0.0249 | 0.5963* ± 0.0314 | 0.7517* ± 0.0294
MAML [18] | 0.6923 ± 0.1071 | 0.7598 ± 0.1390 | 0.7306 ± 0.0337 | 0.7183* ± 0.0200 | 0.7261* ± 0.0186 | 0.6148 ± 0.0346 | 0.7609* ± 0.0068
ETMC [16] | 0.7949 ± 0.0675 | 0.6274* ± 0.0946 | 0.7000* ± 0.0300 | 0.7031* ± 0.0268 | 0.7111* ± 0.0224 | 0.5834* ± 0.0266 | 0.7655* ± 0.0169
NestedFormer [19] | 0.7308 ± 0.0193 | 0.7206* ± 0.0255 | 0.7250* ± 0.0167 | 0.7255* ± 0.0159 | 0.7257* ± 0.0160 | 0.6042* ± 0.0174 | 0.7540* ± 0.0113
ADCCA [22] | 0.7500 ± 0.0693 | 0.6765* ± 0.0589 | 0.7083* ± 0.0084 | 0.7104* ± 0.0096 | 0.7133* ± 0.0094 | 0.5882* ± 0.0073 | 0.7627* ± 0.0050
DMD [10] | 0.6154* ± 0.1201 | 0.7990 ± 0.0945 | 0.7195* ± 0.0048 | 0.6955* ± 0.0244 | 0.7072* ± 0.0136 | 0.5997* ± 0.0051 | 0.7668* ± 0.0112
CCML [17] | 0.6795* ± 0.0728 | 0.7696 ± 0.0473 | 0.7306 ± 0.0096 | 0.7214* ± 0.0199 | 0.7245* ± 0.0153 | 0.6100* ± 0.0100 | 0.7669* ± 0.0085
GLoMo [23] | 0.6667* ± 0.0867 | 0.7451 ± 0.1251 | 0.7111* ± 0.0337 | 0.6995* ± 0.0165 | 0.7059* ± 0.0199 | 0.5934* ± 0.0344 | 0.7670* ± 0.0154
Proposed | 0.7244 ± 0.0588 | 0.7500 ± 0.0778 | 0.7389 ± 0.0255 | 0.7351 ± 0.0178 | 0.7372 ± 0.0199 | 0.6207 ± 0.0301 | 0.8029 ± 0.0219
Method | ACC | ACC_G1 | ACC_G2inv | ACC_G2ninv | weighted-F1 | macro-F1 | AUC
EmbraceNet [21] | 0.9136* ± 0.0103 | 0.9570 ± 0.0206 | 0.7894* ± 0.0353 | 0.6730* ± 0.1192 | 0.9155* ± 0.0100 | 0.8053* ± 0.0329 | 0.8683* ± 0.0286
MISA [8] | 0.9099* ± 0.0103 | 0.9678 ± 0.0199 | 0.8848 ± 0.0603 | 0.4873* ± 0.0535 | 0.9056* ± 0.0074 | 0.7885* ± 0.0301 | 0.9411* ± 0.0092
MAML [18] | 0.9360* ± 0.0107 | 0.9707 ± 0.0209 | 0.9015 ± 0.0523 | 0.6992* ± 0.1059 | 0.9361* ± 0.0076 | 0.8524* ± 0.0229 | 0.9600* ± 0.0113
ETMC [16] | 0.8834* ± 0.0088 | 0.9385* ± 0.0138 | 0.8045* ± 0.0569 | 0.5198* ± 0.1783 | 0.8836* ± 0.0158 | 0.7595* ± 0.0512 | 0.9051* ± 0.0364
NestedFormer [19] | 0.9273* ± 0.0060 | 0.9584 ± 0.0167 | 0.8212* ± 0.0620 | 0.7682* ± 0.0468 | 0.9298* ± 0.0036 | 0.8436* ± 0.0030 | 0.9695* ± 0.0083
ADCCA [22] | 0.8295* ± 0.0446 | 0.9091* ± 0.0698 | 0.5803* ± 0.1253 | 0.4119* ± 0.2221 | 0.8284* ± 0.0301 | 0.6289* ± 0.0534 | 0.8753* ± 0.0421
DMD [10] | 0.8897* ± 0.0076 | 0.9677 ± 0.0002 | 0.8030* ± 0.1046 | 0.3619* ± 0.0644 | 0.8798* ± 0.0101 | 0.7303* ± 0.0279 | 0.9202* ± 0.0227
CCML [17] | 0.8550* ± 0.0419 | 0.9187* ± 0.0607 | 0.8106* ± 0.1344 | 0.4024* ± 0.2119 | 0.8551* ± 0.0270 | 0.7003* ± 0.0255 | 0.9164* ± 0.0257
GLoMo [23] | 0.9059* ± 0.0240 | 0.9507 ± 0.0416 | 0.8697* ± 0.0605 | 0.5952* ± 0.0899 | 0.9066* ± 0.0184 | 0.7968* ± 0.0211 | 0.9585* ± 0.0213
Proposed | 0.9462 ± 0.0113 | 0.9616 ± 0.0160 | 0.9182 ± 0.0315 | 0.8492 ± 0.0383 | 0.9483 ± 0.0101 | 0.8936 ± 0.0106 | 0.9776 ± 0.0021
4.3 Quantitative Results
4.3.1 Evaluation on Meniscal Tear Prediction Dataset
The comparison results on the MRNet dataset are summarized in Table 1. Among the comparison methods, the correlation-based method GLoMo [23] achieves the best AUC (), while the uncertainty-based method CCML [17] obtains the second place (). The attention-based method MAML [18] obtains better results than the other comparison methods in three metrics: ACC (), Ba_ACC () and AUPRC (). Our proposed framework achieves first place in five metrics: ACC (, better than the 2nd), G-Mean (, better than the 2nd), Ba_ACC (, better than the 2nd), AUPRC (, better than the 2nd) and AUC (, better than the 2nd). The higher ACC and AUC, along with the more balanced accuracy between positive and negative cases, demonstrate the effectiveness of the proposed framework on the MRNet dataset.
4.3.2 Evaluation on Meningiomas Grading Prediction Dataset
We further validate the proposed framework on the private MEN dataset, and the comparison results are shown in Table 2. Among the comparison methods, the attention-based method NestedFormer achieves the best AUC () and ACC_G2ninv (). The other attention-based method, MAML, achieves better results than the other comparison methods in five metrics, including ACC (), ACC_G1 (), ACC_G2inv (), weighted-F1 () and macro-F1 (). The reason for the poor performance of DMD is that the difficult prediction of class G2ninv causes the distillation to lean towards the other classes. In contrast, our proposed framework achieves first place in six metrics: ACC (, better than the 2nd), ACC_G2inv (, better than the 2nd), ACC_G2ninv (, better than the 2nd), weighted-F1 (, better than the 2nd), macro-F1 (, better than the 2nd) and AUC (, better than the 2nd). Benefiting from the CFD strategy and the DMF module, our proposed framework achieves relatively high and balanced accuracy on each class.
4.4 Ablation Analysis
We also verify the effectiveness of the proposed CFD strategy and DMF module. Ablation studies are conducted on both adopted datasets, and the results are summarized in Table 3 and Table 4, respectively. In the tables, the models are termed baseline1, baseline2, …, baseline6 from the top row to the bottom row, with baseline6 representing the proposed framework. The ablation studies consider three factors: dis_ps, MoE and LinG. The dis_ps factor, which relates to the CFD strategy, determines whether to decouple modality-partial-shared features during the feature decoupling process. The other two factors, MoE and LinG, relate to the DMF module. The MoE factor represents whether to adopt the MoE for feature fusion, with "✗" indicating fusion with the concatenation operation. The LinG factor denotes whether to utilize the proposed LinG_GN to generate weights in the MoE, with "✗" representing the use of a plain concatenation of the decoupled features as the input of the gating network. The concatenation operation for the inputs of the gating network is the common setting in MoE-based MML methods [42, 29].
Specifically, the order of the three modalities is PD-axial, T1-coronal and T2-sagittal on the MRNet dataset, and T1C, FLAIR-C and ADC on the MEN dataset.
dis_ps | MoE | LinG | SEN | SPE | ACC | G-Mean | Ba_ACC | AUPRC | AUC
✗ | ✗ | - | 0.6859 | 0.7157 | 0.7028 | 0.6988 | 0.7008 | 0.5817 | 0.7544
✗ | ✓ | ✗ | 0.6859 | 0.7108 | 0.7000 | 0.6874 | 0.6983 | 0.5810 | 0.7566
✗ | ✓ | ✓ | 0.7885 | 0.6863 | 0.7306 | 0.7326 | 0.7374 | 0.6114 | 0.7672
✓ | ✗ | - | 0.7436 | 0.7255 | 0.7333 | 0.7298 | 0.7346 | 0.6142 | 0.7745
✓ | ✓ | ✗ | 0.7820 | 0.6569 | 0.7111 | 0.7151 | 0.7195 | 0.5917 | 0.7591
✓ | ✓ | ✓ | 0.7244 | 0.7500 | 0.7389 | 0.7351 | 0.7372 | 0.6207 | 0.8029
dis_ps | MoE | LinG | ACC | ACC_G1 | ACC_G2inv | ACC_G2ninv | weighted-F1 | macro-F1 | AUC
✗ | ✗ | - | 0.9076 | 0.9371 | 0.9333 | 0.6635 | 0.9113 | 0.8091 | 0.9584
✗ | ✓ | ✗ | 0.9265 | 0.9539 | 0.8106 | 0.8072 | 0.9302 | 0.8549 | 0.9590
✗ | ✓ | ✓ | 0.9326 | 0.9525 | 0.8515 | 0.8397 | 0.9364 | 0.8634 | 0.9699
✓ | ✗ | - | 0.9101 | 0.9232 | 0.8833 | 0.8278 | 0.9171 | 0.8355 | 0.9638
✓ | ✓ | ✗ | 0.9297 | 0.9491 | 0.8364 | 0.8270 | 0.9314 | 0.8563 | 0.9745
✓ | ✓ | ✓ | 0.9462 | 0.9616 | 0.9182 | 0.8492 | 0.9483 | 0.8936 | 0.9776

4.4.1 Effectiveness of the CFD Strategy
In Table 3, using dis_ps consistently results in significant improvements under the same settings of the MoE and LinG factors on the MRNet dataset (compare baseline1 with baseline4, baseline2 with baseline5, and baseline3 with baseline6). Similar results are obtained on the MEN dataset (see Table 4). These ablation studies on the dis_ps factor validate the effectiveness of the proposed CFD strategy.
We also visualize the distribution of the decoupled features using t-SNE [43], and draw heatmaps displaying the cosine similarity between each pair of these features. Fig. 5 (b) shows the visualization results for the MRNet dataset. These visualization results satisfy the three principles described in Sec. 3.2.
1) The modality-shared features cluster together in the t-SNE visualization and exhibit high cosine similarity with one another in the corresponding heatmap.
2) The modality-partial-shared features in each group exhibit a high degree of similarity. For example, for the two features from the same group, the t-SNE visualization reveals overlap between their representations, and the heatmap shows a cosine similarity value of 1 between them.
3) There are relatively large distances between the final modality-shared feature, each final modality-partial-shared feature and each modality-specific feature in the t-SNE visualization, and these features have very small cosine similarity values with one another in the heatmap.
Similar visualization results are observed on the MEN dataset, as depicted in Fig. 5 (d).
Furthermore, we conduct ablation studies for $\alpha$ and $\beta$ on both adopted datasets, as shown in Fig. 5. Specifically, $\alpha$ and $\beta$ are the balance factors of the losses related to the CFD strategy ($\mathcal{L}_{sim}$ and $\mathcal{L}_{diff}$). For both datasets, the comparison results (see sub-figures (a) and (b) for the MRNet dataset and sub-figures (c) and (d) for the MEN dataset) clearly demonstrate that both the modality-shared features and the modality-partial-shared features are better learned when using the CFD-related losses.
4.4.2 Effectiveness of the DMF Module
The ablation results on the MRNet dataset are shown in Table 3. From this table, we observe that using the MoE and LinG factors can improve the performance when dis_ps is not used (see baseline1, baseline2 and baseline3). However, when dis_ps is used, the performance of the MoE with a concatenation-based gating network is lower than that of simple concatenation fusion without the MoE (see baseline4 and baseline5). The possible reason is that the simple concatenation used in the gating network cannot effectively capture the relationships between the final decoupled features as the number of these features increases. In contrast, our proposed LinG_GN can dynamically capture the complex relationships between these features, allowing for better weighting and ultimately achieving improved prediction performance (see baseline4, baseline5 and baseline6). The ablation studies on the MEN dataset are shown in Table 4. Our proposed DMF module achieves the best performance on both the MRNet and MEN datasets.
Moreover, we draw heatmaps of the weights learned by the LinG_GN for the final decoupled features. Fig. 6 (a) and (b) show the heatmaps on the MRNet dataset. Firstly, we plot the mean weight of all cases in the test set for each final decoupled feature in a heatmap named MRNet-mean (see Fig. 6 (a)). This heatmap illustrates that the final modality-partial-shared feature between PD-axial and T1-coronal and the modality-specific feature of T1-coronal play greater roles during feature integration. Additionally, we randomly display the heatmaps of two cases (see Fig. 6 (b)).
The heatmaps on the MEN dataset are shown in Fig. 6 (c) and (d), indicating that the final modality-shared feature, the final modality-partial-shared feature between FLAIR-C and ADC, and the modality-specific feature of T1C play more important roles during feature fusion, with the maximum weights almost reaching 0.5.
These heatmaps demonstrate that our proposed DMF module can dynamically capture the relationships between the decoupled features across different samples (see Fig. 6 (b) for the MRNet dataset and Fig. 6 (d) for the MEN dataset). Moreover, modality-partial-shared features play important roles during feature fusion on both datasets, further illustrating the necessity of the CFD strategy.
5 Discussion

5.1 Visualization analysis
To enhance the interpretability of our framework, Grad-CAM [44] is employed to visualize the activation maps of the network for each modality in the MEN dataset. As shown in Fig. 7, the model pays more attention to the tumor-edema junction across different modalities, with variations in the specific areas of focus. For the T1C modality, the model predominantly focuses on the tumor and its surrounding region. In the FLAIR-C modality, the model highlights the tumor and edema regions. Conversely, in the ADC modality, the model primarily attends to the edema region. Moreover, based on the feature weights in the DMF module (Fig. 6 (c)), the characteristics of the tumor-edema junction (captured by the final modality-shared feature), the edema (captured by the final modality-partial-shared feature between FLAIR-C and ADC) and the tumor-surrounding region (captured by the modality-specific feature of T1C) contribute significantly to the final task.


To validate the effectiveness of our framework, we visualized the feature space distribution using the Manifold Discovery and Analysis (MDA) [45] algorithm on the MEN dataset, randomly selecting balanced samples across categories for optimal visualization. Fig. 8 demonstrates that categorical separation becomes increasingly distinct in deeper network blocks, with B2-C2 through B5-C2 showing progressive improvement in class discrimination. The temporal evolution of feature distributions across training epochs (Fig. 9 (a-c)) reveals increasingly pronounced categorical boundaries. Furthermore, we examined the model’s robustness by analyzing feature distributions under Gaussian noise perturbation of inputs, as shown in Fig. 9 (d). The results showed that the feature distributions maintained clear categorical separation, demonstrating the method’s resilience to input variations.
5.2 Evaluation in the Four-modal Case
In order to assess the generalization of the proposed framework to a wider range of modalities, we extend it to the MGMT promoter status prediction dataset (BraTS 2021) in the four-modal case, and the concept map is shown in Fig. 10. In this case, the complexity increases, as we need to decouple not only the pair-wise modality-partial-shared features but also the triplet-wise modality-partial-shared features. This leads to a total of 32 decoupled features and 15 final decoupled features.
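The counts of 32 decoupled and 15 final features follow directly from enumerating the modality subsets; the short script below (written for illustration) reproduces them for an arbitrary number of modalities M.

```python
from math import comb

def feature_counts(M: int):
    """Number of decoupled and final features produced by the CFD strategy for M modalities."""
    # Decoupled: M shared + M specific + k features for every subset of size k (2 <= k <= M-1).
    decoupled = M + M + sum(comb(M, k) * k for k in range(2, M))
    # Final: 1 shared + M specific + one averaged feature per partial subset.
    final = 1 + M + sum(comb(M, k) for k in range(2, M))
    return decoupled, final

print(feature_counts(3))  # (12, 7)
print(feature_counts(4))  # (32, 15)
```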
The comparison results on the BraTS 2021 dataset are listed in Table 5. The compared SOTA methods obtain comparable results. Our proposed framework achieves the best results in five metrics, including ACC (, better than the 2nd), G-Mean (, better than the 2nd), Ba_ACC (, better than the 2nd), AUPRC (, better than the 2nd) and AUC (, better than the 2nd). The statistical test results obtained on the BraTS 2021 dataset are similar to those on the MRNet and MEN datasets.
However, it is important to acknowledge that as the number of modalities increases, the relationships between them become more intricate, posing challenges for the CFD strategy. Additionally, incorporating more modalities increases the number of network parameters, further complicating network optimization. Nevertheless, commonly used multimodal MRI datasets in clinical studies typically contain two to four modalities, and our proposed framework is designed to achieve effective performance under these conditions.

Method | SEN | SPE | ACC | G-Mean | Ba_ACC | AUPRC | AUC
EmbraceNet [21] | 0.6034 ± 0.1006 | 0.5200 ± 0.1484 | 0.5639* ± 0.0213 | 0.5506* ± 0.0358 | 0.5617* ± 0.0269 | 0.5610* ± 0.0180 | 0.5942* ± 0.0412
MISA [8] | 0.5515* ± 0.0991 | 0.6428 ± 0.0966 | 0.5948* ± 0.0288 | 0.5905* ± 0.0294 | 0.5972* ± 0.0281 | 0.5849* ± 0.0190 | 0.5992* ± 0.0373
MAML [18] | 0.7353 ± 0.1172 | 0.4427* ± 0.1313 | 0.5966* ± 0.0250 | 0.5607* ± 0.0503 | 0.5890* ± 0.0260 | 0.5765* ± 0.0164 | 0.6004* ± 0.0060
ETMC [16] | 0.7275 ± 0.1158 | 0.4620* ± 0.0974 | 0.6016* ± 0.0171 | 0.5732* ± 0.0154 | 0.5948* ± 0.0129 | 0.5798* ± 0.0071 | 0.5958* ± 0.0332
NestedFormer [19] | 0.5934* ± 0.1240 | 0.6037 ± 0.0879 | 0.5982* ± 0.0237 | 0.5921* ± 0.0214 | 0.5985* ± 0.0182 | 0.5842* ± 0.0104 | 0.5962* ± 0.0236
ADCCA [22] | 0.6390 ± 0.0446 | 0.5602 ± 0.0383 | 0.6016* ± 0.0171 | 0.5975* ± 0.0158 | 0.5996* ± 0.0167 | 0.5843* ± 0.0107 | 0.6003* ± 0.0303
DMD [10] | 0.6654 ± 0.0229 | 0.5059* ± 0.0604 | 0.5898* ± 0.0223 | 0.5792* ± 0.0293 | 0.5857* ± 0.0239 | 0.5750* ± 0.0160 | 0.5686* ± 0.0159
CCML [17] | 0.5998 ± 0.0335 | 0.5711 ± 0.0225 | 0.5861* ± 0.0072 | 0.5848* ± 0.0053 | 0.5854* ± 0.0057 | 0.5751* ± 0.0032 | 0.5965* ± 0.0295
GLoMo [23] | 0.6781 ± 0.1069 | 0.4628* ± 0.1495 | 0.5759* ± 0.0155 | 0.5495* ± 0.0431 | 0.5704* ± 0.0218 | 0.5656* ± 0.0150 | 0.5757* ± 0.0124
Proposed | 0.6397 ± 0.0605 | 0.5849 ± 0.0806 | 0.6137 ± 0.0075 | 0.6089 ± 0.0136 | 0.6123 ± 0.0108 | 0.5934 ± 0.0089 | 0.6177 ± 0.0205

5.3 Computational Complexity Analysis
As shown in Fig. 11, we compare the number of parameters and the GFLOPs of the proposed method and the comparison methods (we compare only with methods that utilize the same backbone as the proposed approach). For clarity, MAML is not shown in Fig. 11, as its parameter and GFLOPs values are significantly larger than those of the other methods: in the three-modality case, MAML has 202.61M parameters and 165.63 GFLOPs, while in the four-modality case, these values increase to 270.09M parameters and 167.89 GFLOPs. In the three-modal case (left part of Fig. 11), using MRNet as an example, the proposed method achieves the best performance with the fewest parameters and the lowest GFLOPs. In the four-modal case (right part of Fig. 11), while EmbraceNet and CCML have the fewest parameters and GFLOPs, our method achieves a significant performance improvement with a comparable number of parameters and relatively low GFLOPs. Although the number of parameters increases for all methods from the three-modal case to the four-modal case, our proposed method achieves superior performance with only a modest increase in parameters.
6 Conclusion and Future Work
In this paper, we propose an effective MML framework called CFDL, which incorporates a novel CFD strategy that separates multimodal information into modality-shared, modality-specific, and modality-partial-shared features, the last of which has been overlooked in previous FD-based methods. Our analysis and experiments demonstrate the critical role of modality-partial-shared features in prediction. Additionally, we present the DMF module, which explicitly and dynamically fuses the decoupled features. The LinG_GN within the DMF module generates the weights of the decoupled features by capturing their local-global relationships. This customized fusion module provides interpretability for clinical analysis, enabling a deeper understanding of the characteristics and behaviors of each decoupled feature. Furthermore, we consider that the underlying principles of the proposed framework can be extended to other medical imaging tasks. In the future, we plan to explore the application of our framework to medical segmentation tasks, which are closely related to medical classification tasks.
References
- [1] T. Baltrušaitis, C. Ahuja, and L.-P. Morency, “Multimodal machine learning: A survey and taxonomy,” IEEE transactions on pattern analysis and machine intelligence, vol. 41, no. 2, pp. 423–443, 2018.
- [2] Y. Huang, C. Du, Z. Xue, X. Chen, H. Zhao, and L. Huang, “What makes multi-modal learning better than single (provably),” Advances in Neural Information Processing Systems, vol. 34, pp. 10944–10956, 2021.
- [3] X. He, Y. Deng, L. Fang, and Q. Peng, “Multi-modal retinal image classification with modality-specific attention network,” IEEE transactions on medical imaging, vol. 40, no. 6, pp. 1591–1602, 2021.
- [4] N. Braman, J. W. Gordon, E. T. Goossens, C. Willis, M. C. Stumpe, and J. Venkataraman, “Deep orthogonal fusion: multimodal prognostic biomarker discovery integrating radiology, pathology, genomic, and clinical data,” in Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part V 24. Springer, 2021, pp. 667–677.
- [5] T. Zhou, M. Liu, H. Fu, J. Wang, J. Shen, L. Shao, and D. Shen, “Deep multi-modal latent representation learning for automated dementia diagnosis,” in Medical Image Computing and Computer Assisted Intervention–MICCAI 2019: 22nd International Conference, Shenzhen, China, October 13–17, 2019, Proceedings, Part IV 22. Springer, 2019, pp. 629–638.
- [6] Z. Ning, Q. Xiao, Q. Feng, W. Chen, and Y. Zhang, “Relation-induced multi-modal shared representation learning for alzheimer’s disease diagnosis,” IEEE Transactions on Medical Imaging, vol. 40, no. 6, pp. 1632–1645, 2021.
- [7] D. Hu, H. Zhang, Z. Wu, F. Wang, L. Wang, J. K. Smith, W. Lin, G. Li, and D. Shen, “Disentangled-multimodal adversarial autoencoder: Application to infant age prediction with incomplete multimodal neuroimages,” IEEE transactions on medical imaging, vol. 39, no. 12, pp. 4137–4149, 2020.
- [8] D. Hazarika, R. Zimmermann, and S. Poria, “Misa: Modality-invariant and-specific representations for multimodal sentiment analysis,” in Proceedings of the 28th ACM international conference on multimedia, 2020, pp. 1122–1131.
- [9] J. Cheng, M. Gao, J. Liu, H. Yue, H. Kuang, J. Liu, and J. Wang, “Multimodal disentangled variational autoencoder with game theoretic interpretability for glioma grading,” IEEE Journal of Biomedical and Health Informatics, vol. 26, no. 2, pp. 673–684, 2021.
- [10] Y. Li, Y. Wang, and Z. Cui, “Decoupled multimodal distilling for emotion recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 6631–6640.
- [11] S. Zheng, Z. Zhu, Z. Liu, Z. Guo, Y. Liu, Y. Yang, and Y. Zhao, “Multi-modal graph learning for disease prediction,” IEEE Transactions on Medical Imaging, vol. 41, no. 9, pp. 2207–2216, 2022.
- [12] K. Hess, D. C. Spille, A. Adeli, P. B. Sporns, C. Brokinkel, O. Grauer, C. Mawrin, W. Stummer, W. Paulus, and B. Brokinkel, “Brain invasion and the risk of seizures in patients with meningioma,” Journal of Neurosurgery, vol. 130, no. 3, pp. 789–796, 2018.
- [13] X. Li, Y. Lu, J. Xiong, D. Wang, D. She, X. Kuai, D. Geng, and B. Yin, “Presurgical differentiation between malignant haemangiopericytoma and angiomatous meningioma by a radiomics approach based on texture analysis,” Journal of Neuroradiology, vol. 46, no. 5, pp. 281–287, 2019.
- [14] W. C. Chen, C.-H. G. Lucas, S. T. Magill, C. L. Rogers, and D. R. Raleigh, “Radiotherapy and radiosurgery for meningiomas,” Neuro-Oncology Advances, vol. 5, no. Supplement_1, pp. i67–i83, 2023.
- [15] Z. Han, C. Zhang, H. Fu, and J. T. Zhou, “Trusted multi-view classification,” in International Conference on Learning Representations, 2020.
- [16] ——, “Trusted multi-view classification with dynamic evidential fusion,” IEEE transactions on pattern analysis and machine intelligence, vol. 45, no. 2, pp. 2551–2566, 2022.
- [17] Y. Liu, L. Liu, C. Xu, X. Song, Z. Guan, and W. Zhao, “Dynamic evidence decoupling for trusted multi-view learning,” in Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 7269–7277.
- [18] Y. Zhang, J. Yang, J. Tian, Z. Shi, C. Zhong, Y. Zhang, and Z. He, “Modality-aware mutual learning for multi-modal medical image segmentation,” in Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part I 24. Springer, 2021, pp. 589–599.
- [19] Z. Xing, L. Yu, L. Wan, T. Han, and L. Zhu, “Nestedformer: Nested modality-aware transformer for brain tumor segmentation,” in Medical Image Computing and Computer Assisted Intervention–MICCAI 2022: 25th International Conference, Singapore, September 18–22, 2022, Proceedings, Part V. Springer, 2022, pp. 140–150.
- [20] J. Gao, T. Lyu, F. Xiong, J. Wang, W. Ke, and Z. Li, “Mgnn: A multimodal graph neural network for predicting the survival of cancer patients,” in Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp. 1697–1700.
- [21] J.-H. Choi and J.-S. Lee, “Embracenet: A robust deep learning architecture for multimodal classification,” Information Fusion, vol. 51, pp. 259–270, 2019.
- [22] R. Zhou, H. Zhou, B. Y. Chen, L. Shen, Y. Zhang, and L. He, “Attentive deep canonical correlation analysis for diagnosing alzheimer’s disease using multimodal imaging genetics,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2023, pp. 681–691.
- [23] Y. Zhuang, Y. Zhang, Z. Hu, X. Zhang, J. Deng, and F. Ren, “Glomo: Global-local modal fusion for multimodal sentiment analysis,” in Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 1800–1809.
- [24] S. Li, Y. Xie, G. Wang, L. Zhang, and W. Zhou, “Adaptive multimodal fusion with attention guided deep supervision net for grading hepatocellular carcinoma,” IEEE Journal of Biomedical and Health Informatics, vol. 26, no. 8, pp. 4123–4131, 2022.
- [25] Q. Zhu, H. Wang, B. Xu, Z. Zhang, W. Shao, and D. Zhang, “Multimodal triplet attention network for brain disease diagnosis,” IEEE Transactions on Medical Imaging, vol. 41, no. 12, pp. 3884–3894, 2022.
- [26] P. Zhou, H. Chen, Y. Li, and Y. Peng, “Coco-attention for tumor segmentation in weakly paired multimodal mri images,” IEEE Journal of Biomedical and Health Informatics, 2023.
- [27] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, “Adaptive mixtures of local experts,” Neural computation, vol. 3, no. 1, pp. 79–87, 1991.
- [28] A. Goyal, N. Kumar, T. Guha, and S. S. Narayanan, “A multimodal mixture-of-experts model for dynamic emotion prediction in movies,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 2822–2826.
- [29] B. Cao, Y. Sun, P. Zhu, and Q. Hu, “Multi-modal gated mixture of local-to-global experts for dynamic image fusion,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 23555–23564.
- [30] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
- [31] N. Bien, P. Rajpurkar, R. L. Ball, J. Irvin, A. Park, E. Jones, M. Bereket, B. N. Patel, K. W. Yeom, K. Shpanskaya et al., “Deep-learning-assisted diagnosis for knee magnetic resonance imaging: development and retrospective validation of mrnet,” PLoS medicine, vol. 15, no. 11, p. e1002699, 2018.
- [32] A. Adeli, K. Hess, C. Mawrin, E. M. S. Streckert, W. Stummer, W. Paulus, A. Kemmling, M. Holling, W. Heindel, R. Schmidt et al., “Prediction of brain invasion in patients with meningiomas using preoperative magnetic resonance imaging,” Oncotarget, vol. 9, no. 89, p. 35974, 2018.
- [33] L. Joo, J. E. Park, S. Y. Park, S. J. Nam, Y.-H. Kim, J. H. Kim, and H. S. Kim, “Extensive peritumoral edema and brain-to-tumor interface mri features enable prediction of brain invasion in meningioma: Development and validation,” Neuro-oncology, vol. 23, no. 2, pp. 324–333, 2021.
- [34] U. Baid, S. Ghodasara, S. Mohan, M. Bilello, E. Calabrese, E. Colak, K. Farahani, J. Kalpathy-Cramer, F. C. Kitamura, S. Pati et al., “The rsna-asnr-miccai brats 2021 benchmark on brain tumor segmentation and radiogenomic classification,” arXiv preprint arXiv:2107.02314, 2021.
- [35] C. Davatzikos, S. Rathore, S. Bakas, S. Pati, M. Bergman, R. Kalarot, P. Sridharan, A. Gastounioti, N. Jahani, E. Cohen et al., “Cancer imaging phenomics toolkit: quantitative imaging analytics for precision diagnostics and predictive modeling of clinical outcome,” Journal of medical imaging, vol. 5, no. 1, p. 011018, 2018.
- [36] A. Hatamizadeh, V. Nath, Y. Tang, D. Yang, H. R. Roth, and D. Xu, “Swin unetr: Swin transformers for semantic segmentation of brain tumors in mri images,” in International MICCAI Brainlesion Workshop. Springer, 2021, pp. 272–284.
- [37] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The journal of machine learning research, vol. 15, no. 1, pp. 1929–1958, 2014.
- [38] Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang, “Random erasing data augmentation,” in Proceedings of the AAAI conference on artificial intelligence, vol. 34, no. 07, 2020, pp. 13001–13008.
- [39] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
- [40] K. H. Brodersen, C. S. Ong, K. E. Stephan, and J. M. Buhmann, “The balanced accuracy and its posterior distribution,” in 2010 20th international conference on pattern recognition. IEEE, 2010, pp. 3121–3124.
- [41] F. Wilcoxon, “Individual comparisons by ranking methods,” in Breakthroughs in Statistics: Methodology and Distribution. Springer, 1992, pp. 196–202.
- [42] Z. Xue and R. Marculescu, “Dynamic multimodal fusion,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 2574–2583.
- [43] L. Van der Maaten and G. Hinton, “Visualizing data using t-sne.” Journal of machine learning research, vol. 9, no. 11, 2008.
- [44] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-cam: Visual explanations from deep networks via gradient-based localization,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 618–626.
- [45] M. T. Islam, Z. Zhou, H. Ren, M. B. Khuzani, D. Kapp, J. Zou, L. Tian, J. C. Liao, and L. Xing, “Revealing hidden patterns in deep neural network feature space continuum via manifold learning,” Nature Communications, vol. 14, no. 1, p. 8506, 2023.