
Cohort-Individual Cooperative Learning for Multimodal Cancer Survival Analysis

Huajun Zhou    \IEEEmembershipMember, IEEE    Fengtao Zhou    and Hao Chen    \IEEEmembershipSenior Member, IEEE This work was supported by the Hong Kong Innovation and Technology Fund (Project No. ITS/028/21FP and No. PRP/034/22FX), Shenzhen Science and Technology Innovation Committee Fund (Project No. SGDX20210823103201011), and Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. R6003-22 and C4024-22GF). (Corresponding author: Hao Chen.)Huajun Zhou and Fengtao Zhou are with the Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong, China (e-mail: [email protected]; [email protected]).Hao Chen is with the Department of Computer Science and Engineering, Department of Chemical and Biological Engineering and Division of Life Science, Hong Kong University of Science and Technology, Hong Kong, China (e-mail: [email protected]).
Abstract

Recently, we have witnessed impressive achievements in cancer survival analysis by integrating multimodal data, e.g., pathology images and genomic profiles. However, the heterogeneity and high dimensionality of these modalities pose significant challenges for extracting discriminative representations while maintaining good generalization. In this paper, we propose a Cohort-individual Cooperative Learning (CCL) framework to advance cancer survival analysis by combining knowledge decomposition with cohort guidance. Specifically, first, we propose a Multimodal Knowledge Decomposition (MKD) module to explicitly decompose multimodal knowledge into four distinct components: redundancy, synergy, and the uniqueness of each of the two modalities. Such a comprehensive decomposition helps the model perceive easily overlooked yet important information, facilitating effective multimodal fusion. Second, we propose a Cohort Guidance Modeling (CGM) to mitigate the risk of overfitting task-irrelevant information. It promotes a more comprehensive and robust understanding of the underlying multimodal data, while avoiding the pitfalls of overfitting and enhancing the generalization ability of the model. By combining the knowledge decomposition and cohort guidance methods, we develop a robust multimodal survival analysis model with enhanced discrimination and generalization abilities. Extensive experimental results on five cancer datasets demonstrate the effectiveness of our model in integrating multimodal data for survival analysis. The code will be publicly available soon.

Index Terms

Cohort guidance, Knowledge decomposition, Multimodal learning, Prognosis prediction, Survival analysis.

Figure 1: Cohort knowledge offers a global view of multimodal data, assisting deep models in capturing general multimodal interactions and facilitating more effective fusion.

1 Introduction


Survival analysis, one of the most important tasks of cancer prognosis, aims to assess the probability of an event (typically death in survival analysis) occurring for a particular patient and accurately rank the risks of cancer patients. It offers insights into disease progression, treatment effectiveness, and patient prognosis, ultimately leading to improved decision-making and patient care in research and clinical scenarios. However, the complex nature of cancer necessitates a comprehensive evaluation of diverse personalized data, posing a significant challenge for survival analysis models to effectively capture and incorporate this heterogeneity. Therefore, the development of an effective multimodal integration approach is essential yet challenging for constructing robust and accurate survival analysis models.

Recent advances in Deep Learning (DL) [1, 2] have made survival analysis more efficient and accurate by leveraging patients' clinical data [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17], e.g., genomics and pathology images, considerably reducing the workload of clinicians. Genomic profiles provide molecular information, enabling personalized medicine and the understanding of cancer genetics, while pathology images offer visual details that assist in diagnosis, grading, and assessing tumor heterogeneity. Together, these high-dimensional data enhance the understanding of tumors, thereby facilitating advancements in patient care and cancer research. Recently, DL-based multimodal methods [18, 19, 20, 21, 22, 23, 24] have integrated these two modalities to acquire complementary information from diverse perspectives, promising more accurate survival analysis. For example, several works [25, 26, 27, 28, 29, 30] focused on enhancing modality representations by leveraging cross-modal interactions. However, the challenges posed by the heterogeneity gap are still not well addressed, which undermines the efficacy of multimodal integration. Furthermore, the high dimensionality increases the risk of overfitting task-irrelevant information, resulting in performance degradation on unseen samples.

To tackle the above issues, our goal is to construct a more general and effective multimodal survival analysis model by incorporating a comprehensive knowledge decomposition and a more general patient cohort guidance. First, different knowledge components contribute unequally to the integration process. For example, common knowledge shared by multiple modalities is typically redundant in multimodal integration and thus interferes with learning other discriminative information. Furthermore, synergy is new knowledge generated from multimodal interactions and may be overlooked without explicit modeling. In our framework, we present a comprehensive decomposition of multimodal knowledge into four distinct components: redundancy, synergy, and the uniqueness of each of the two modalities. This decomposition enables a deeper understanding of the underlying factors impacting survival outcomes. Second, extracting discriminative features from high-dimensional data while ensuring good generalization is a tough challenge. For example, a wealth of task-irrelevant information can produce spurious correlations between modalities, demanding elaborate solutions to learn multimodal interactions with enhanced generalization ability. Therefore, we seek to develop a more general patient cohort guidance mechanism, allowing a broader range of patient characteristics to be considered during model training. Such guidance prevents the model from overemphasizing task-irrelevant information and thus enhances its generalization ability. By combining these advancements, we construct an effective and robust multimodal survival analysis model with enhanced accuracy and applicability.

In this paper, we propose a Cohort-individual Cooperative Learning (CCL) framework to integrate genomics and pathology images for cancer survival analysis. Specifically, first, we propose a Multimodal Knowledge Decomposition (MKD) module to decompose multimodal knowledge into four distinct components: redundancy, synergy, and the uniqueness of each of the two modalities. Such a comprehensive decomposition serves as an illuminating framework, enabling models to discern often disregarded yet crucial information. It paves the way for effective multimodal fusion, enhancing the integration of diverse data modalities and their complementary information. Second, we propose a Cohort Guidance Modeling (CGM) to unleash the potential of the distinct knowledge components and to enhance the model's generalization ability. Our cohort guidance assists feature learning at both the knowledge and patient levels, capturing the essence of multifaceted data at various levels of granularity. By combining knowledge decomposition and cohort guidance, we enhance our model's discrimination and generalization abilities, effectively fusing diverse modalities while mitigating overfitting risks. Experimental results on five datasets from The Cancer Genome Atlas (TCGA) program demonstrate that our framework achieves state-of-the-art performance in survival analysis.

The main contributions are summarized as follows:

  • We propose a Multimodal Knowledge Decomposition (MKD) module to comprehensively and explicitly decompose multimodal knowledge into distinct components, facilitating an effective fusion of heterogeneous data.

  • We propose a Cohort Guidance Modeling (CGM) to enhance the generalization and discrimination abilities by mitigating the overfitting of task-irrelevant information.

  • Extensive experimental results demonstrate that the proposed framework achieves state-of-the-art performance on five datasets from The Cancer Genome Atlas (TCGA) program.

2 Related Works

2.1 Unimodal Survival Analysis

Survival analysis can be expressed as the estimation of a hazard function, which models a patient's probability of death at a certain time, conditioned on personal clinical records. In an early stage, Cox's proportional hazards regression model [31] conceptualizes the hazard function as the product of two components: 1) the underlying baseline hazard function, describing how the risk of an event per time unit changes over time at baseline levels of covariates; and 2) the hazard ratio, measuring the impact of covariates. Since the baseline hazard function for a given cancer is constant, the hazard ratio determines whether a patient is at high risk. Based on Cox's regression model, subsequent survival analysis methods [3, 4, 5, 7, 6, 8] aim to predict the personalized hazard ratio using quantitative data extracted from short-term clinical indicators or long-term follow-up reports. For example, Kappen et al. [6] summarized seventeen pretreatment characteristics, such as residual tumor size, age, and thrombocytes, to predict the treatment outcome using a neural network. Ohno-Machado et al. [8] predicted the survival of AIDS patients conditioned on demographics, laboratory markers, and clinical findings, such as age, hemoglobin, and albumin.
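To make the formulation concrete, the following is a minimal sketch of fitting Cox's proportional hazards model with the lifelines library; the covariates and values are purely illustrative and are not taken from the datasets used in this paper.

# Minimal sketch: fitting Cox's proportional-hazards model with lifelines.
# The covariate names and values below are hypothetical.
import pandas as pd
from lifelines import CoxPHFitter

df = pd.DataFrame({
    "age":        [63, 58, 71, 49, 66, 54],
    "tumor_size": [2.1, 3.4, 1.8, 2.9, 4.0, 1.5],
    "time":       [34, 12, 48, 27, 9, 60],   # follow-up time (months)
    "event":      [1, 1, 0, 1, 1, 0],        # 1 = death observed, 0 = censored
})

cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="event")
cph.print_summary()                      # per-covariate hazard ratios exp(coef)
risk = cph.predict_partial_hazard(df)    # relative risk score per patient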

Genomic profiles provide molecular information of tumors, which is important for cancer prognosis prediction. For example, certain genetic mutations or variations can affect tumor growth, metastasis, and response to chemotherapy or targeted therapies. Extensive studies [9, 10, 12, 11] have been conducted to predict cancer prognosis and treatment response by leveraging genomics data. Recently, Yousefi et al. [12] combined deep learning and Bayesian optimization methods to tackle high-dimensional cancer outcomes prediction tasks. Qiu et al. [11] built good predictive models with limited high-dimensional samples by using a meta-learning survival analysis framework.

Figure 2: An overview of the proposed Cohort-individual Cooperative Learning (CCL) strategy that learns comprehensive and discriminative knowledge components from multimodal data under the cohort guidance. Our CCL includes 1) a Multimodal Knowledge Decomposition (MKD) module to comprehensively and explicitly decompose heterogeneous multimodal knowledge; and 2) a Cohort Guidance Modeling (CGM) to learn more general and discriminative knowledge components.

Pathology images provide morphological features of tumors, offering valuable information about the aggressiveness of the tumor, its response to treatment, and the likelihood of disease recurrence. Some recent works [13, 14, 15, 16] constructed effective survival analysis models based on giga-pixel Whole Slide Images (WSIs). For example, Yao et al. [15] introduced a siamese MI-FCN and attention-based MIL pooling to efficiently learn imaging features from the WSI and then aggregate WSI-level information to the patient level. Zhu et al. [16] adaptively sampled and grouped hundreds of patches from each WSI into several clusters, and then employed an aggregation model to make patient-level predictions based on cluster-level survival prediction results.

Despite the impressive performance achieved on survival analysis datasets, unimodal models have a narrow perspective and potentially overlook important aspects or correlations present in other modalities. In contrast, multimodal models can offer improved performance, robustness, and a more comprehensive understanding of cancer prognosis.

2.2 Multimodal Survival Analysis

Integrating genomics and pathology images can enhance the predictive power of survival analysis models [18, 19, 20, 32, 25, 22, 21, 24, 23] and has thus attracted increasing attention recently. By exploiting the cellular graph embedded in the tissue, Nakhli et al. [22] constructed a unified representation for each patient, leveraging the hierarchical organization of the tissue. Chen et al. [21] captured the interactions between features across multiple modalities by combining the unimodal feature representations using the Kronecker product. Additionally, they incorporated a gating-based attention mechanism to control the expressive power of each representation. Furthermore, Chen et al. [25] proposed an interpretable, dense co-attention mapping between WSIs and genomic features formulated in the embedding space. Xu et al. [28] introduced optimal transport theory to match WSI patches and gene embeddings, selecting informative patches to represent gigapixel WSIs, and built an interpretable co-attention module to effectively fuse multimodal data. Moreover, Zhou et al. [29] found that a generated cross-modal representation can enhance and recalibrate the intra-modal representation, and thus significantly improve its discrimination for survival analysis.

Existing co-attention-based methods focus on extracting common knowledge by using multimodal interactions. Moreover, they are exposed to the risk of overfitting high-dimensional data, leading to performance degradation on unseen samples. To address these issues, multimodal knowledge is comprehensively and explicitly decomposed in our framework to facilitate effective multimodal fusion. Meanwhile, our framework leverages cohort guidance to improve the generalization ability of the decomposed knowledge components.

3 Our Approach

We propose a Cohort-individual Cooperative Learning (CCL) strategy to advance survival analysis by extracting more general and discriminative features, as depicted in Fig. 2. Our CCL includes a Multimodal Knowledge Decomposition (MKD) module to comprehensively and explicitly decompose multimodal knowledge into distinct components, and a Cohort Guidance Modeling (CGM) to enhance the generalization and discrimination abilities of our model.

3.1 Unimodal Feature Extraction

Genomics. Genomic profiles can identify specific genetic alterations or biomarkers associated with cancer prognosis. For example, certain genetic mutations, gene expression patterns, or alterations in DNA copy number can serve as prognostic markers, helping to predict the likely survival conditions of patients. In our framework, we partition the RNA sequencing (RNA-seq), Copy Number Variation (CNV), and Simple Nucleotide Variation (SNV) sequences into six sub-sequences, following previous methods [29, 28]. Each sub-sequence is transformed into a feature by a Self-normalizing Neural Network (SNN) [17], a more powerful alternative to the conventional Multi-Layer Perceptron (MLP). By integrating all sub-sequence features, we obtain the genomic representation $F_g\in\mathbb{R}^{1\times 256}$.
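As a concrete illustration, below is a minimal PyTorch sketch of an SNN block applied to one genomic sub-sequence; the input length, dropout rate, and layer width are assumptions for illustration rather than the exact configuration of our framework.

# Minimal sketch (PyTorch) of an SNN block for one genomic sub-sequence.
# The sub-sequence length (1024) and dropout rate are illustrative assumptions.
import torch
import torch.nn as nn

class SNNBlock(nn.Module):
    def __init__(self, in_dim, out_dim=256, dropout=0.25):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.SELU(),                 # self-normalizing activation
            nn.AlphaDropout(dropout),  # dropout variant that preserves self-normalization
        )

    def forward(self, x):
        return self.net(x)

sub_seq = torch.randn(1, 1024)           # one genomic sub-sequence (illustrative length)
feat = SNNBlock(in_dim=1024)(sub_seq)    # -> (1, 256); six such features are integrated into F_g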

Pathology. Whole Slide Images (WSIs), i.e., pathology images, describe the tumor immune microenvironment and provide valuable information for cancer prognosis prediction. Considering the huge resolution of a WSI, which exceeds the capacity of Convolutional Neural Networks (CNNs), we split the tissue regions within each WSI into non-overlapping patches at 20x magnification (256 x 256 resolution for each patch). Following previous works [28, 29, 21], we utilize an ImageNet pre-trained ResNet-50 [2] model to extract a 1024-dimensional embedding for each patch, and all patch embeddings of the same WSI are collected as an embedding set. It is worth noting that such an embedding set is still high-dimensional, as it typically contains tens of thousands of patches per WSI. To further reduce information redundancy, we employ the K-means algorithm [33] to cluster all patch embeddings into $k$ groups and use the cluster centers as pathology features. However, due to the stochastic nature of K-means, the cluster centers of different samples may be misaligned, meaning that two cluster centers at the same ordinal position of two samples may exhibit completely different phenotypes. Consequently, deep models tend to learn phenotype-independent knowledge instead of specialized knowledge tailored to each phenotype, overlooking crucial information in important patches. In our framework, we address this misalignment issue by Cluster Center Alignment (CCA), an optimal matching between an anchor and the cluster centers, as shown in Fig. 3. This involves assigning each cluster center to a feature in the anchor such that the similarities of matched pairs are maximized. To solve this, the Hungarian algorithm [34] is employed to calculate the permutation matrix that maps the current centers to their matched ordinal positions. We then obtain the aligned centers by multiplying the cluster centers with the permutation matrix. The aligned centers are used to update the anchor with a ratio of $\tau$ during training and generate the pathology representation $F_p\in\mathbb{R}^{1\times 256}$, ensuring accurate alignment and capturing crucial information.
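The following sketch illustrates one possible implementation of the clustering and alignment step using scikit-learn and SciPy; the cosine-similarity cost and the anchor initialization are our assumptions, while the Hungarian matching and the $\tau$-weighted anchor update follow the description above.

# Minimal sketch of the cluster center alignment (CCA) step: K-means centers
# of one WSI are matched to a running anchor with the Hungarian algorithm so
# that centers at the same index tend to share a phenotype. Variable names
# and the cosine cost are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans
from scipy.optimize import linear_sum_assignment

def align_centers(patch_embeddings, anchor, k=6, tau=0.1):
    # patch_embeddings: (num_patches, 1024) ResNet-50 features of one WSI
    centers = KMeans(n_clusters=k, n_init=10).fit(patch_embeddings).cluster_centers_

    # cosine similarity between anchor rows and current centers
    a = anchor / np.linalg.norm(anchor, axis=1, keepdims=True)
    c = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    sim = a @ c.T                                   # (k, k)

    # Hungarian matching maximizes the total similarity of matched pairs
    row_idx, col_idx = linear_sum_assignment(-sim)
    aligned = centers[col_idx]                      # reorder centers into anchor order

    # the anchor is updated with ratio tau during training
    new_anchor = (1 - tau) * anchor + tau * aligned
    return aligned, new_anchor

anchor = np.random.randn(6, 1024)                   # illustrative anchor initialization
patches = np.random.randn(5000, 1024)               # illustrative patch embeddings
aligned, anchor = align_centers(patches, anchor)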

Figure 3: Cluster center alignment for pathology features. Aligned centers at the same positions typically exhibit similar phenotypes across patients, enabling the network to extract specific information pertaining to these phenotypes.

3.2 Cohort-individual Cooperative Learning

Genomic profiles and pathology images provide both unique and collaborative insights into tumors. Multimodal methods that can effectively integrate these two modalities promise to learn more discriminative representations than unimodal ones. To achieve this goal, decomposing knowledge into modular components enables scalability and flexibility in handling high-dimensional and heterogeneous multimodal data. Previous multimodal methods [35, 36] simply decompose multimodal knowledge into common and specific components. However, synergy, the new knowledge that can only be generated through multimodal collaboration, is valuable yet overlooked in these methods. To better utilize all knowledge components in multimodal data, we propose a multimodal knowledge decomposition module and cohort guidance modeling to eliminate redundancy and strengthen valuable information under the guidance of patient cohorts.

Multimodal Knowledge Decomposition. In our framework, we decompose the knowledge within genomic profiles and pathology images into four distinct components: redundancy, synergy, and the uniqueness of each of the two modalities. Concretely, the large heterogeneity gap between pathology and genomics indicates a wealth of modality-specific knowledge. Meanwhile, there still exists considerable common knowledge between them. For example, particular genetic mutations or alterations can be observed at both the molecular level and the histopathological level simultaneously. Furthermore, another essential component, synergy, can provide new insights into tumors and enrich the knowledge in multimodal data, especially for heterogeneous modalities. Emerging from multimodal interactions, it surpasses common knowledge by forming a unique realm of insights unattainable through any individual modality alone. As an example, the diagnosis of glioma typically requires genetic markers (IDH mutation and 1p/19q co-deletion) to categorize gliomas into subtypes and, simultaneously, histopathological images to determine the presence of microvascular proliferation and necrosis. Overall, a comprehensive and explicit decomposition enhances integration, representation, synergy capture, and flexibility. It enables more effective modeling and utilization of multimodal data, leading to improved performance, insights, and decision-making in cancer prognosis.

To achieve a comprehensive decomposition of knowledge, we develop four encoders within our framework. These encoders are employed to capture and model the different components (redundancy, synergy, and uniqueness) present within the genomic profiles and pathology images. Concretely, we employ MLP layers as the modality encoders $\Phi_p$ and $\Phi_g$ to extract the specific knowledge $P$ and $G$, focusing on aspects distinct from the common knowledge shared by both modalities. The formula can be written as:

$P=\Phi_p(F_p),\quad G=\Phi_g(F_g).$ (1)

Diverging from $\Phi_p$ and $\Phi_g$, which receive only a single input, the common and synergistic encoders $\Phi_c$ and $\Phi_s$ require two inputs to build multimodal interactions from different perspectives. In these encoders, a co-attention block is utilized to bridge the interactions across modalities and produce modality attentions that integrate knowledge from different modalities into a fused feature $C$ or $S$, as shown in Fig. 4. The detailed operations can be formulated as:

$C=\Phi_c(F_p,F_g)=fc(A^T)*F_p+fc(A)*F_g,$ (2)
$A=fc(F_p)^T fc(F_g),$ (3)

where $A$ is the co-attention matrix and $fc$ denotes a fully-connected layer. The synergy $S$ is computed in the same way as the common knowledge $C$, but with different parameters. Using the above encoders, we decompose the modality representations $F_g$ and $F_p$ into genomic-specific $G$, pathology-specific $P$, common $C$, and synergistic $S$ components, respectively.
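A minimal PyTorch sketch of such a co-attention encoder is given below, following one possible reading of Eqs. (2)-(3); the reduction of the co-attention matrix to attention vectors and the sigmoid gating are assumptions rather than the authors' exact implementation.

# Sketch of the common/synergistic encoder under our reading of Eqs. (2)-(3):
# a co-attention matrix A between the projected modality features yields two
# attention vectors that weight F_p and F_g before summation. Layer sizes and
# the sigmoid reduction are assumptions.
import torch
import torch.nn as nn

class CoAttentionEncoder(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.proj_p = nn.Linear(dim, dim)   # fc(F_p)
        self.proj_g = nn.Linear(dim, dim)   # fc(F_g)
        self.attn_p = nn.Linear(dim, 1)     # fc(A^T): reduce each row of A^T to a weight
        self.attn_g = nn.Linear(dim, 1)     # fc(A)

    def forward(self, f_p, f_g):            # f_p, f_g: (1, 256)
        a = self.proj_p(f_p).t() @ self.proj_g(f_g)      # co-attention matrix A: (256, 256)
        w_p = torch.sigmoid(self.attn_p(a.t())).t()      # attention over F_p dims: (1, 256)
        w_g = torch.sigmoid(self.attn_g(a)).t()          # attention over F_g dims: (1, 256)
        return w_p * f_p + w_g * f_g                     # fused component C (or S)

common_enc, synergy_enc = CoAttentionEncoder(), CoAttentionEncoder()   # separate parameters
f_p, f_g = torch.randn(1, 256), torch.randn(1, 256)
C, S = common_enc(f_p, f_g), synergy_enc(f_p, f_g)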

Cohort Guidance Modeling. In the above section, we employ different encoders to capture different knowledge components from multimodal data. Besides that, an effective loss function is also significant for enhancing the discrimination and generalization abilities of our model. Due to the high dimensionality of the input modalities, multimodal interactions learned from each individual patient, as done in existing methods [25, 28, 29], may overfit task-irrelevant information, leading to reduced generalization and discrimination abilities. Therefore, we propose to utilize the correlations within patient cohorts to acquire more general knowledge components with cohort consistency, capture the heterogeneity of multimodal data, and gain a better understanding of cancer.

Figure 4: The structure of common and synergistic encoders. We generate attention vectors based on the co-attention matrix to integrate multimodal features. The learnable parameters for common and synergistic knowledge are different, enabling the same structure used to extract distinct knowledge components.

In our framework, we harness the cohort guidance to learn knowledge components at both the knowledge and patient levels, as shown in Fig. 5. At the knowledge level, the most prominent difference between the decomposed knowledge components is their correlation to the original modality representations. Specifically, synergy is unattainable through any individual modality alone, so it lies beyond the knowledge of both modalities. Redundancy is the intersection of knowledge, as it is shared by both modalities. In addition, uniqueness is located in the source modality and is distinct from the other modality. Therefore, we employ a set of similarity constraints between them as:

$l_k = |\cos(G,F_p)| - \cos(G,F_g) - \cos(P,F_p) + |\cos(P,F_g)| - \cos(C,F_p) - \cos(C,F_g) + |\cos(S,F_p)| + |\cos(S,F_g)|,$ (4)

where $\cos$ computes the cosine similarity between two inputs and $|\cdot|$ is the absolute operation used to constrain two inputs to be orthogonal to each other. We freeze the gradients of $F_p$ and $F_g$ from $l_k$ to prevent a potential model collapse in which the extracted pathology and genomic features would follow similar distributions. At the patient level, task-relevant information can be obtained by distinguishing patients conditioned on their risk scores, which is valuable for extracting more discriminative representations. Therefore, we split all patients into $r$ equal groups conditioned on their ground-truth survival time, and patients in the same group are considered similar. For an uncensored patient, its features should be closer to the features extracted from similar patients and distinct from those of patients in other groups. For a censored patient, its features may be closer to those of patients at similar or lower risk and distinctly different from those at higher risk. Such an assumption is in line with the definition of contrastive learning, which has been proven to have strong feature learning capability in both supervised and unsupervised settings. Therefore, we develop a cohort contrastive learning scheme to enhance the discrimination and generalization abilities of our model. In the training phase, we build a cohort bank to store the features of historical patients in each group. The cohort bank consists of $r$ queues of length $b$ and follows the first-in-first-out principle for dynamic feature updating. For the feature of an uncensored patient, we reduce its distances to similar patients in the cohort bank, while enlarging its distances to other patients. As an example, the formula for the synergy $S$ is:

$l_p=-\log\frac{\sum_{S^{\prime}\in\mathcal{S}_{+}}d(S,S^{\prime})}{\sum_{S^{\prime}\in\mathcal{S}_{+}}d(S,S^{\prime})+\sum_{S^{\prime}\in\mathcal{S}_{-}}d(S,S^{\prime})},$ (5)

where $\mathcal{S}_+$ is the group of patients with similar risk stored in the cohort bank, $\mathcal{S}_-$ is the collection of other patients, and $d$ is a similarity function. Note that $S$ can be similarly replaced by the other knowledge components $G$, $P$, or $C$. For a censored patient, we extend $\mathcal{S}_+$ to the collection of patients with similar or lower risk.

Combining knowledge and patient levels, the cohort guidance loss can be formulated as:

$L_{cohort}=l_k+l_p.$ (6)
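The sketch below illustrates how the two levels of cohort guidance in Eqs. (4)-(6) could be implemented in PyTorch; the choice of an exponentiated cosine similarity with a temperature as $d$ and the bank shapes are assumptions for illustration.

# Minimal sketch of the cohort guidance loss (Eqs. 4-6). The similarity d
# (exponentiated cosine with a temperature) and the bank layout are our
# assumptions; only the loss structure follows the description above.
import torch
import torch.nn.functional as F

def knowledge_level_loss(G, P, C, S, F_p, F_g):
    F_p, F_g = F_p.detach(), F_g.detach()        # freeze modality features (Eq. 4)
    cos = lambda a, b: F.cosine_similarity(a, b, dim=-1).mean()
    return (cos(G, F_p).abs() - cos(G, F_g) - cos(P, F_p)
            + cos(P, F_g).abs() - cos(C, F_p) - cos(C, F_g)
            + cos(S, F_p).abs() + cos(S, F_g).abs())

def patient_level_loss(feat, pos_bank, neg_bank, temp=0.1):
    # feat: (1, d); pos_bank / neg_bank: similar / dissimilar features from the cohort bank
    d = lambda bank: torch.exp(F.cosine_similarity(feat, bank, dim=-1) / temp).sum()
    pos, neg = d(pos_bank), d(neg_bank)
    return -torch.log(pos / (pos + neg))         # Eq. (5), here for one knowledge component

G, P, C, S = (torch.randn(1, 256, requires_grad=True) for _ in range(4))
F_p, F_g = torch.randn(1, 256), torch.randn(1, 256)
pos_bank, neg_bank = torch.randn(8, 256), torch.randn(24, 256)
L_cohort = knowledge_level_loss(G, P, C, S, F_p, F_g) + patient_level_loss(S, pos_bank, neg_bank)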
Figure 5: Graphical illustration of cohort guidance. Colors and shapes indicate knowledge components and patient groups, respectively.
Table 1: Survival analysis (C-index) on TCGA datasets. The best and second-best scores are in bold and underlined, respectively. All results are reproduced in our own environment with the same settings.
Methods  Modality  BLCA  BRCA  GBMLGG  LUAD  UCEC  Overall
SNN [17] Genomics 0.6339±0.0509 0.6327±0.0739 0.8370±0.0276 0.6171±0.0411 0.6900±0.0389 0.6821
SNNTrans Genomics 0.6456±0.0428 0.6478±0.0580 0.8284±0.0158 0.6335±0.0493 0.6324±0.0324 0.6775
MaxMIL Pathology 0.5509±0.0315 0.5966±0.0547 0.7136±0.0574 0.5958±0.0600 0.5626±0.0547 0.6039
MeanMIL Pathology 0.5847±0.0324 0.6110±0.0286 0.7896±0.0367 0.5763±0.0536 0.6653±0.0457 0.6454
AttMIL [37] Pathology 0.5673±0.0498 0.5899±0.0472 0.7974±0.0336 0.5753±0.0744 0.6507±0.0330 0.6361
CLAM-SB [38] Pathology 0.5487±0.0286 0.6091±0.0329 0.7969±0.0346 0.5962±0.0558 0.6780±0.0342 0.6458
CLAM-MB [38] Pathology 0.5620±0.0313 0.6203±0.0520 0.7986±0.0320 0.5918±0.0591 0.6821±0.0646 0.6510
TransMIL [39] Pathology 0.5466±0.0334 0.6430±0.0368 0.7916±0.0272 0.5788±0.0303 0.6799±0.0304 0.6480
DTFD [40] Pathology 0.5662±0.0353 0.5975±0.0406 0.7641±0.0297 0.5580±0.0404 0.6308±0.0190 0.6233
DualTrans Multimodal 0.6607±0.0319 0.6637±0.0621 0.8393±0.0174 0.6706±0.0343 0.6724±0.0192 0.7013
MCAT [25] Multimodal 0.6727±0.0320 0.6590±0.0418 0.8350±0.0233 0.6597±0.0279 0.6336±0.0506 0.6920
M3IF [41] Multimodal 0.6361±0.0197 0.6197±0.0707 0.8238±0.0170 0.6299±0.0312 0.6672±0.0293 0.6753
GPDBN [23] Multimodal 0.6354±0.0252 0.6549±0.0332 0.8510±0.0243 0.6400±0.0478 0.6839±0.0529 0.6930
Porpoise [27] Multimodal 0.6461±0.0338 0.6207±0.0544 0.8479±0.0128 0.6403±0.0412 0.6918±0.0488 0.6894
HFBSurv [19] Multimodal 0.6398±0.0277 0.6473±0.0346 0.8383±0.0128 0.6501±0.0495 0.6421±0.0445 0.6835
SurvPath [26] Multimodal 0.6581±0.0357 0.6306±0.0340 0.8422±0.0161 0.6600±0.0233 0.6636±0.0354 0.6909
MOTCat [28] Multimodal 0.6830±0.0260 0.6730±0.0060 0.8490±0.0280 0.6700±0.0380 0.6750±0.0400 0.7100
CMTA [29] Multimodal 0.6910±0.0426 0.6679±0.0434 0.8531±0.0116 0.6864±0.0359 0.6975±0.0409 0.7192
Ours Multimodal 0.6862±0.0253 0.6840±0.0339 0.8614±0.0149 0.6957±0.0231 0.7026±0.0475 0.7260

3.3 Multimodal Fusion and Prediction

Recently, the Transformer [42] has shown impressive performance on multimodal learning, and it is thus employed to integrate the decomposed components for survival prediction. Specifically, we concatenate a class token $U$ with the decomposed components $G$, $P$, $C$, and $S$ as the inputs of the Transformer $\Phi_t$. Moreover, a fully-connected layer with Sigmoid activation, denoted as $\sigma$, is appended to the output of the class token for predicting the hazard function $H$:

$H=\sigma(\Phi_t(U,G,P,S,C)).$ (7)
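A minimal PyTorch sketch of this fusion step is shown below; the encoder depth, number of attention heads, and number of time intervals are assumptions for illustration.

# Sketch of the fusion step in Eq. (7): a learnable class token is concatenated
# with the four knowledge components, passed through a Transformer encoder, and
# the class-token output predicts the hazards. Depth, heads, and n_bins are assumptions.
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    def __init__(self, dim=256, n_bins=4):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))            # class token U
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, n_bins)                          # hazards over n intervals

    def forward(self, G, P, S, C):                                  # each: (1, 256)
        tokens = torch.stack([G, P, S, C], dim=1)                   # (1, 4, 256)
        x = torch.cat([self.cls, tokens], dim=1)                    # prepend class token
        out = self.encoder(x)[:, 0]                                 # class-token output
        return torch.sigmoid(self.head(out))                        # hazard function H

G, P, S, C = (torch.randn(1, 256) for _ in range(4))
H = FusionHead()(G, P, S, C)                                        # (1, n_bins)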

For survival prediction, following previous works [25, 26, 28, 29], we construct a one-hot time-series vector $T$ based on $n$ equal parts of the ground-truth survival time. Under this setting, the original event time regression problem is simplified to a classification problem. We identify the time interval $t_k$ in which the ground-truth event occurred as the classification label $k$ for each patient. Each patient sample is defined as a triplet $\{H,c,k\}$, where $H=\{h_1,\dots,h_n\}$ is the predicted hazard vector measuring the probability that the ground-truth event time is located in the corresponding time interval $t_k$. Additionally, we define the discrete survival function $f_{sur}(H,k)=\prod_{j=1}^{k}(1-h_j)$. Following previous works [28, 29, 25], we generalize the negative log-likelihood (NLL) with censorship to supervise the survival prediction by:

$L_{surv} = -c\log(f_{sur}(H,k)) - (1-c)\log(f_{sur}(H,k-1)) - (1-c)\log(h_k).$ (8)

Finally, the overall loss function of our framework is:

$L=L_{surv}+\alpha L_{cohort},$ (9)

where $\alpha$ controls the effect of our cohort guidance.
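For clarity, the following is a minimal PyTorch sketch of the discrete NLL survival loss in Eq. (8) and the overall objective in Eq. (9), under our reading that $c=1$ denotes a censored patient; the number of time intervals is illustrative.

# Sketch of the discrete NLL survival loss with censorship (Eq. 8) and the
# overall objective (Eq. 9). We read c = 1 as "censored"; hazards come from
# the sigmoid output of the fusion Transformer. n = 4 intervals is illustrative.
import torch

def nll_surv_loss(hazards, k, c, eps=1e-7):
    # hazards: (n,) predicted hazard per time interval; k: ground-truth interval index
    surv = torch.cumprod(1.0 - hazards, dim=0)           # f_sur(H, j) for each interval j
    s_k = surv[k]                                        # survive through interval k
    s_k_prev = surv[k - 1] if k > 0 else torch.tensor(1.0)
    h_k = hazards[k]
    return (-c * torch.log(s_k + eps)
            - (1 - c) * torch.log(s_k_prev + eps)
            - (1 - c) * torch.log(h_k + eps))

hazards = torch.sigmoid(torch.randn(4))                  # n = 4 time intervals
L_surv = nll_surv_loss(hazards, k=2, c=0)                # uncensored patient, event in the third interval
alpha = 1.0                                              # weight of the cohort guidance loss (assumed value)
# L = L_surv + alpha * L_cohort                          # Eq. (9)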

4 Experiments

4.1 Experiment Setups

Dataset. To validate the effectiveness of the proposed methods, we evaluate our framework on five datasets from TCGA: Bladder Urothelial Carcinoma (BLCA, n=372), Breast Invasive Carcinoma (BRCA, n=956), Glioblastoma & Lower Grade Glioma (GBMLGG, n=569), Lung Adenocarcinoma (LUAD, n=453), and Uterine Corpus Endometrial Carcinoma (UCEC, n=480). These datasets contain hundreds of paired genomic profiles, pathology images, and follow-up records collected from multiple centers, enabling center-agnostic analysis for precision oncology. We collected all diagnostic WSIs used for primary diagnosis, resulting in 2,830 WSIs with an average of 15k patches per WSI at 20x magnification (256 x 256 pixels per patch). Genomic profiles provide molecular information of individuals, including RNA sequencing (RNA-seq), Copy Number Variation (CNV), Simple Nucleotide Variation (SNV), and DNA methylation. Following previous works [25, 29], we use the RNA-seq, CNV, and SNV sequences and further group them into six genomic sub-sequences: 1) Tumor Suppression, 2) Oncogenesis, 3) Protein Kinases, 4) Cellular Differentiation, 5) Transcription, and 6) Cytokines and Growth.

Evaluation Metrics. The concordance index (C-index) is the proportion of all comparable patient pairs for which the predicted outcome is consistent with the actual outcome. In survival analysis, a pair is concordant if the patient with the longer actual survival time is also predicted to survive longer. Moreover, Kaplan-Meier and T-test analyses are employed to assess the significance of differences in survival predictions between high- and low-risk groups.
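As an illustration, the C-index can be computed with the lifelines utility shown below; since the utility expects scores that increase with survival time, predicted risks are negated, and the arrays are purely illustrative.

# Minimal sketch of the C-index computation with lifelines. concordance_index
# expects scores that increase with survival time, so predicted risks are negated.
import numpy as np
from lifelines.utils import concordance_index

surv_time = np.array([12, 30, 45, 8, 60])     # observed / censoring times (months)
event = np.array([1, 1, 0, 1, 0])             # 1 = death observed, 0 = censored
risk = np.array([0.9, 0.4, 0.2, 0.8, 0.1])    # model-predicted risk scores (illustrative)

c_index = concordance_index(surv_time, -risk, event)
print(f"C-index: {c_index:.4f}")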

Implementation. For each dataset, we adopted five-fold cross-validation to evaluate our model and the compared methods. Our framework was implemented in PyTorch and run on a single NVIDIA RTX 3090 GPU. For optimization, we employed the SGD optimizer with a learning rate of 1e-3 to train our framework for 30 epochs.

Figure 6: Kaplan-Meier curves of the proposed framework on five cancer datasets: (a) BLCA, (b) BRCA, (c) GBMLGG, (d) LUAD, (e) UCEC.
Figure 7: T-test analysis of the proposed framework on five cancer datasets: (a) BLCA, (b) BRCA, (c) GBMLGG, (d) LUAD, (e) UCEC.

4.2 Comparative Methods

We compare our framework with different types of survival analysis models, including genomics-based (SNN [17]), pathology-based (AttMIL [37], CLAM [38], TransMIL [39], and DTFD [40]), and multimodal (MCAT [25], M3IF [41], GPDBN [23], Porpoise [27], HFBSurv [19], SurvPath [26], MOTCat [28], and CMTA [29]) models. Here we provide brief overviews of several representative multimodal competitors:

  • Porpoise [27] is a multimodal fusion model that aggregates WSI patch embeddings with the generated attention scores and processes genomic profiles using an SNN. The extracted multimodal features are simply concatenated to produce survival predictions.

  • SurvPath [26] learns biological pathway tokens from transcriptomics that can encode specific cellular functions and fuse two modalities using a memory-efficient multimodal Transformer to model interactions between pathway and histology patch tokens.

  • MOTCat [28] contains a global structure consistency, in which optimal transport (OT) is applied to match WSI patches and gene embeddings for selecting informative patches. More importantly, OT-based co-attention provides a global awareness to effectively capture structural interactions for survival prediction.

  • CMTA [29] explores the intrinsic cross-modal correlations and transfers potential complementary information by enhancing modality-specific representations through integration with multimodal representations.

In addition, we build several models for a comprehensive comparison, including: 1) SNNTrans: two SNN layers followed by a Transformer; 2) MaxMIL/MeanMIL: a maximum/average operation to aggregate multiple instances; and 3) DualTrans: combining SNN and TransMIL for genomic profiles and pathology images, respectively, with a Transformer employed for multimodal fusion. We unify the settings of the compared models to make a fair comparison and report the results reproduced in our own environment.

Table 2: Comparison of p-values from the Kaplan-Meier analysis. The best (lowest) p-value on each dataset is in bold.
Methods BLCA BRCA GBMLGG LUAD UCEC
SNN  $1.7e^{-6}$  $1.5e^{-2}$  $1.1e^{-27}$  $1.1e^{-3}$  $9.8e^{-4}$
TransMIL  $5.6e^{-2}$  $2.4e^{-3}$  $2.5e^{-25}$  $3.8e^{-2}$  $9.8e^{-2}$
MCAT  $7.8e^{-6}$  $7.5e^{-3}$  $1.6e^{-19}$  $6.9e^{-5}$  $4.8e^{-3}$
SurvPath  $8.2e^{-6}$  $1.4e^{-3}$  $1.0e^{-29}$  $2.2e^{-4}$  $1.4e^{-5}$
MOTCat  $2.9e^{-7}$  $4.9e^{-4}$  $3.4e^{-30}$  $1.1e^{-5}$  $\mathbf{3.8e^{-7}}$
CMTA  $2.0e^{-8}$  $3.1e^{-3}$  $\mathbf{1.8e^{-33}}$  $\mathbf{9.6e^{-7}}$  $1.7e^{-3}$
Ours  $\mathbf{5.7e^{-11}}$  $\mathbf{2.2e^{-7}}$  $9.8e^{-32}$  $1.1e^{-4}$  $3.6e^{-4}$

4.3 Quantitative Evaluation

C-index Comparison. As shown in Tab. 1, the proposed framework achieves an average score of 72.60%, which surpasses existing methods by a significant margin. On the individual datasets, our model obtains C-index scores of 68.62%, 68.40%, 86.14%, 69.57%, and 70.26% on BLCA, BRCA, GBMLGG, LUAD, and UCEC, respectively, which are state-of-the-art on most datasets and second best on BLCA, with a small margin to the best 69.10% reported by CMTA [29]. Compared to unimodal models, either genomics- or pathology-based, our framework achieves significant improvements on all datasets, indicating that it can effectively integrate the complementary information in genomic profiles and pathology images. Compared to multimodal models, our framework delivers more robust and significant performance on survival analysis. For example, CMTA [29] performs well on the BLCA dataset but is still inferior to our method on the others.

Table 3: Ablation study on the proposed modules. CCA, MKD and CGM indicate the proposed cluster center alignment, multimodal knowledge decomposition and cohort guidance modeling, respectively.
Models  BLCA  BRCA  GBMLGG  LUAD  UCEC  Overall
Baseline  0.6607±0.0319  0.6637±0.0621  0.8393±0.0174  0.6706±0.0343  0.6724±0.0192  0.7013
+CCA  0.6701±0.0287  0.6767±0.0532  0.8433±0.0241  0.6792±0.0417  0.6844±0.0331  0.7107
+CCA+MKD  0.6783±0.0296  0.6718±0.0416  0.8511±0.0233  0.6855±0.0397  0.6872±0.0388  0.7147
+CCA+MKD+CGM (full)  0.6862±0.0253  0.6840±0.0339  0.8614±0.0149  0.6957±0.0231  0.7026±0.0475  0.7260

Kaplan-Meier Analysis. The Kaplan-Meier (KM) analysis is a non-parametric statistic used to estimate the survival function from lifetime data. Specifically, we use the median survival time of the entire cohort to divide all patients into high- and low-risk groups, represented as red and blue curves, respectively. The p-value in the KM analysis indicates the probability of observing the observed difference in survival rates between the two groups under the null hypothesis that there is no true difference between them. As shown in Fig. 6, the p-values achieved by our framework are significantly lower than 0.05 on all five datasets, which indicates a statistically significant discrimination between high- and low-risk groups. We also compare the p-values with several representative methods in Tab. 2. Our model achieves the lowest p-values on BLCA and BRCA, and is competitive with state-of-the-art methods on GBMLGG, LUAD, and UCEC. Our framework generalizes across different cancers to produce robust predictions, holding the potential to significantly advance both clinical practice and cancer research.
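The analysis described above can be reproduced in outline as follows, using lifelines for the Kaplan-Meier curves and a log-rank test to obtain the reported p-value; the arrays are synthetic placeholders rather than model outputs.

# Sketch of the KM analysis: split patients into high-/low-risk groups at the
# median and compare them with a log-rank test (lifelines). Inputs are synthetic.
import numpy as np
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

risk = np.random.rand(200)                    # model-predicted risk per patient (placeholder)
time = np.random.exponential(30, 200)         # survival / censoring time (months)
event = np.random.binomial(1, 0.7, 200)       # 1 = death observed, 0 = censored

high = risk >= np.median(risk)
kmf_hi, kmf_lo = KaplanMeierFitter(), KaplanMeierFitter()
kmf_hi.fit(time[high], event[high], label="high risk")
kmf_lo.fit(time[~high], event[~high], label="low risk")

res = logrank_test(time[high], time[~high],
                   event_observed_A=event[high], event_observed_B=event[~high])
print(f"log-rank p-value: {res.p_value:.2e}")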

T-test Analysis. The T-test is a statistical method used to determine whether there is a significant difference between the means of two groups. In the T-test analysis, the t-value quantifies the size of the difference relative to the variability within the two groups, while the p-value is the probability that the observed difference occurred by chance. As shown in Fig. 7, our framework can distinguish between different groups with good p-values and t-values, especially on the BLCA, BRCA, and GBMLGG datasets (p-value < 0.05). However, the LUAD and UCEC datasets remain challenging for our framework (p-value > 0.05).

Figure 8: T-SNE visualization of the knowledge distribution.

4.4 Ablation study

Module Analysis. To demonstrate the effectiveness of our methods, we progressively append our modules to the baseline DualTrans, including Cluster Center Alignment (CCA), Multimodal Knowledge Decomposition (MKD), and Cohort Guidance Modeling (CGM). As shown in Tab. 3, the proposed methods effectively and consistently improve the performance of the baseline DualTrans model. CCA aligns the cluster centers after K-means, facilitating the extraction of crucial information from important patches. Moreover, MKD enables more effective multimodal fusion by comprehensively decomposing multimodal knowledge, reducing redundancy and reinforcing valuable information. Further, CGM leverages the cohort guidance to promote general multimodal interaction learning. Combining these modules, our full framework achieves significant improvements over the baseline model.

Table 4: Results of hyper-parameter experiments.
Param.  Value  BLCA  BRCA  GBMLGG  LUAD  UCEC  Overall
$k$  4  0.6758  0.6763  0.8433  0.6718  0.6775  0.7089
$k$  6  0.6862  0.6840  0.8614  0.6957  0.7026  0.7260
$k$  9  0.6792  0.6851  0.8573  0.6960  0.6982  0.7232
$\tau$  0.05  0.6834  0.6842  0.8598  0.6957  0.6997  0.7246
$\tau$  0.1  0.6862  0.6840  0.8614  0.6957  0.7026  0.7260
$\tau$  0.3  0.6855  0.6847  0.8602  0.6932  0.7011  0.7249
$b$  5  0.6764  0.6799  0.8631  0.6789  0.6911  0.7179
$b$  10  0.6862  0.6840  0.8614  0.6957  0.7026  0.7260
$b$  20  0.6835  0.6844  0.8597  0.6827  0.6943  0.7209
$r$  2  0.6778  0.6751  0.8501  0.6747  0.6818  0.7119
$r$  4  0.6862  0.6840  0.8614  0.6957  0.7026  0.7260
$r$  6  0.6754  0.6912  0.8486  0.6913  0.6988  0.7211

Knowledge Component Visualization. In Fig. 8, we visualize the feature distributions after MKD using t-SNE [43], where the orange, green, blue, and purple points indicate common, synergistic, genomic-specific, and pathology-specific knowledge, respectively. The distributions of these components differ from each other and are consistent with our intuition that common and synergistic features lie in the middle of the two modality-specific features. Moreover, the features of each component are clustered into several groups, which demonstrates that patients of similar risk tend to lie closer to each other than to other patients under our cohort guidance. For example, the genomic-specific features are basically clustered into four groups, which is consistent with the group number used in our framework. This shows that our framework can extract discriminative representations to facilitate multimodal fusion.

Figure 9: Attention visualization of different knowledge components on input modalities: (a) common knowledge, (b) synergistic knowledge, (c) pathology-specific knowledge, (d) attention on genomics.

Hyper-parameter Experiment. There are several hyper-parameters in our framework, including $k$ in K-means, the anchor updating ratio $\tau$, the cohort bank length $b$, and the number of patient groups $r$. We provide experimental results on these hyper-parameters in Tab. 4. From these results, we conclude that $k=6$, $\tau=0.1$, $b=10$, and $r=4$ work best in our framework, and they are thus employed as the main settings. Moreover, smaller values of $k$, $b$, and $r$ imply fewer features and lower computational loads, but the performance usually drops significantly. Beyond adequate values, enlarging them does not offer additional improvements. As for $\tau$, it controls the updating rate of the anchor used in cluster center alignment. After a period of training, the anchor becomes stable, so the effect of $\tau$ is not significant.

Knowledge Attention Analysis. To further illustrate the differences between knowledge components, we visualize their attention on the input data in Fig. 9. In sub-figures (a)-(c), we visualize the attention regions of redundancy, synergy, and pathology-specific knowledge on the input pathology images, while the last sub-figure (d) shows the attention scores on genomic profiles. For pathology images, the focused regions vary from component to component, indicating the distinctiveness between them. For genomic profiles, the first sample focuses more on synergy, whereas the second one relies more strongly on redundancy. This implies that both types of multimodal interactions are necessary for effective survival analysis.

5 Conclusion

In this paper, we propose a Cohort-individual Cooperative Learning (CCL) framework to effectively integrate genomics and pathology images for cancer survival analysis. Specifically, first, we propose a Multimodal Knowledge Decomposition (MKD) module to completely decompose multimodal knowledge into four distinct components. Second, we propose a Cohort Guidance Modeling (CGM) to enhance the generalization and discrimination abilities of the decomposed components. By combining knowledge decomposition and cohort guidance, our framework learns general multimodal interactions and facilitates effective fusion. Experimental results on five TCGA datasets demonstrate that the proposed framework achieves state-of-the-art performance in survival analysis.

Despite the strong performance achieved by our framework, there are several potential directions for future work. First, a soft similarity assessment among patients may improve the cohort guidance by providing a more effective learning strategy. Second, specialized encoders for different knowledge components may further facilitate the multimodal fusion process. Last but not least, more modalities describing other characteristics of tumors, such as diagnostic reports and clinical indicators, can also be incorporated into survival analysis models.

References

  • [1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Advances in Neural Information Processing Systems, vol. 25, 2012.
  • [2] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
  • [3] K. L. Lee, D. B. Pryor, F. E. Harrell Jr, R. M. Califf, V. S. Behar, W. L. Floyd, J. J. Morris, R. A. Waugh, R. E. Whalen, and R. A. Rosati, “Predicting outcome in coronary disease statistical models versus expert clinicians,” The American Journal of Medicine, vol. 80, no. 4, pp. 553–560, 1986.
  • [4] E. R. Dickson, P. M. Grambsch, T. R. Fleming, L. D. Fisher, and A. Langworthy, “Prognosis in primary biliary cirrhosis: model for decision making,” Hepatology, vol. 10, no. 1, pp. 1–7, 1989.
  • [5] R. B. D’Agostino, M.-L. Lee, A. J. Belanger, L. A. Cupples, K. Anderson, and W. B. Kannel, “Relation of pooled logistic regression to time dependent cox regression analysis: the framingham heart study,” Statistics in Medicine, vol. 9, no. 12, pp. 1501–1515, 1990.
  • [6] H. Kappen and J. Neijt, “Neural network analysis to predict treatment outcome,” Annals of Oncology, vol. 4, pp. S31–S34, 1993.
  • [7] P. Lapuerta, S. P. Azen, and L. LaBree, “Use of neural networks in predicting the risk of coronary artery disease,” Computers and Biomedical Research, vol. 28, no. 1, pp. 38–52, 1995.
  • [8] L. Ohno-Machado, “A comparison of cox proportional hazards and artificial neural network models for medical prognosis,” Computers in Biology and Medicine, vol. 27, no. 1, pp. 55–65, 1997.
  • [9] M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein, “Cluster analysis and display of genome-wide expression patterns,” Proceedings of the National Academy of Sciences, vol. 95, no. 25, pp. 14 863–14 868, 1998.
  • [10] A. A. Alizadeh, M. B. Eisen, R. E. Davis, C. Ma, I. S. Lossos, A. Rosenwald, J. C. Boldrick, H. Sabet, T. Tran, X. Yu et al., “Distinct types of diffuse large b-cell lymphoma identified by gene expression profiling,” Nature, vol. 403, no. 6769, pp. 503–511, 2000.
  • [11] Y. L. Qiu, H. Zheng, A. Devos, H. Selby, and O. Gevaert, “A meta-learning approach for genomic survival analysis,” Nature Communications, vol. 11, no. 1, p. 6350, 2020.
  • [12] S. Yousefi, F. Amrollahi, M. Amgad, C. Dong, J. E. Lewis, C. Song, D. A. Gutman, S. H. Halani, J. E. Velazquez Vega, D. J. Brat et al., “Predicting clinical outcomes from large scale cancer genomic profiles with deep survival models,” Scientific Reports, vol. 7, no. 1, p. 11707, 2017.
  • [13] E. Wulczyn, D. F. Steiner, Z. Xu, A. Sadhwani, H. Wang, I. Flament-Auvigne, C. H. Mermel, P.-H. C. Chen, Y. Liu, and M. C. Stumpe, “Deep learning-based survival prediction for multiple cancer types using histopathology images,” PloS One, vol. 15, no. 6, p. e0233678, 2020.
  • [14] R. J. Chen, C. Chen, Y. Li, T. Y. Chen, A. D. Trister, R. G. Krishnan, and F. Mahmood, “Scaling vision transformers to gigapixel images via hierarchical self-supervised learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16 144–16 155.
  • [15] J. Yao, X. Zhu, J. Jonnagaddala, N. Hawkins, and J. Huang, “Whole slide images based cancer survival prediction using attention guided deep multiple instance learning networks,” Medical Image Analysis, vol. 65, p. 101789, 2020.
  • [16] X. Zhu, J. Yao, F. Zhu, and J. Huang, “Wsisa: Making survival prediction from whole slide histopathological images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7234–7242.
  • [17] G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter, “Self-normalizing neural networks,” Advances in Neural Information Processing Systems, vol. 30, 2017.
  • [18] W. Shao, J. Liu, Y. Zuo, S. Qi, H. Hong, J. Sheng, Q. Zhu, and D. Zhang, “Fam3l: Feature-aware multi-modal metric learning for integrative survival analysis of human cancers,” IEEE Transactions on Medical Imaging, 2023.
  • [19] R. Li, X. Wu, A. Li, and M. Wang, “Hfbsurv: hierarchical multimodal fusion with factorized bilinear models for cancer survival prediction,” Bioinformatics, vol. 38, no. 9, pp. 2587–2594, 2022.
  • [20] N. Braman, J. W. Gordon, E. T. Goossens, C. Willis, M. C. Stumpe, and J. Venkataraman, “Deep orthogonal fusion: multimodal prognostic biomarker discovery integrating radiology, pathology, genomic, and clinical data,” in Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part V 24.   Springer, 2021, pp. 667–677.
  • [21] R. J. Chen, M. Y. Lu, J. Wang, D. F. Williamson, S. J. Rodig, N. I. Lindeman, and F. Mahmood, “Pathomic fusion: an integrated framework for fusing histopathology and genomic features for cancer diagnosis and prognosis,” IEEE Transactions on Medical Imaging, vol. 41, no. 4, pp. 757–770, 2020.
  • [22] R. Nakhli, P. A. Moghadam, H. Mi, H. Farahani, A. Baras, B. Gilks, and A. Bashashati, “Sparse multi-modal graph transformer with shared-context processing for representation learning of giga-pixel images,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 11 547–11 557.
  • [23] Z. Wang, R. Li, M. Wang, and A. Li, “Gpdbn: deep bilinear network integrating both genomic data and pathological images for breast cancer prognosis prediction,” Bioinformatics, vol. 37, no. 18, pp. 2963–2970, 2021.
  • [24] C. Cui, H. Liu, Q. Liu, R. Deng, Z. Asad, Y. Wang, S. Zhao, H. Yang, B. A. Landman, and Y. Huo, “Survival prediction of brain cancer with incomplete radiology, pathology, genomic, and demographic data,” in International Conference on Medical Image Computing and Computer-Assisted Intervention.   Springer, 2022, pp. 626–635.
  • [25] R. J. Chen, M. Y. Lu, W.-H. Weng, T. Y. Chen, D. F. Williamson, T. Manz, M. Shady, and F. Mahmood, “Multimodal co-attention transformer for survival prediction in gigapixel whole slide images,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 4015–4025.
  • [26] G. Jaume, A. Vaidya, R. Chen, D. Williamson, P. Liang, and F. Mahmood, “Modeling dense multimodal interactions between biological pathways and histology for survival prediction,” arXiv preprint arXiv:2304.06819, 2023.
  • [27] R. J. Chen, M. Y. Lu, D. F. Williamson, T. Y. Chen, J. Lipkova, Z. Noor, M. Shaban, M. Shady, M. Williams, B. Joo et al., “Pan-cancer integrative histology-genomic analysis via multimodal deep learning,” Cancer Cell, vol. 40, no. 8, pp. 865–878, 2022.
  • [28] Y. Xu and H. Chen, “Multimodal optimal transport-based co-attention transformer with global structure consistency for survival prediction,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, October 2023, pp. 21 241–21 251.
  • [29] F. Zhou and H. Chen, “Cross-modal translation and alignment for survival analysis,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 21 485–21 494.
  • [30] Y. Zhang, Y. Xu, J. Chen, F. Xie, and H. Chen, “Prototypical information bottlenecking and disentangling for multimodal cancer survival prediction,” in The International Conference on Learning Representations, 2024.
  • [31] D. R. Cox, “Regression models and life-tables,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 34, no. 2, pp. 187–202, 1972.
  • [32] Z. Lv, Y. Lin, R. Yan, Z. Yang, Y. Wang, and F. Zhang, “Pg-tfnet: transformer-based fusion network integrating pathological images and genomic data for cancer survival analysis,” in 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM).   IEEE, 2021, pp. 491–496.
  • [33] J. MacQueen et al., “Some methods for classification and analysis of multivariate observations,” in Proceedings of the fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, no. 14.   Oakland, CA, USA, 1967, pp. 281–297.
  • [34] H. W. Kuhn, “The hungarian method for the assignment problem,” Naval Research Logistics Quarterly, vol. 2, no. 1-2, pp. 83–97, 1955.
  • [35] Y. Li, Y. Wang, and Z. Cui, “Decoupled multimodal distilling for emotion recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 6631–6640.
  • [36] D. Hazarika, R. Zimmermann, and S. Poria, “Misa: Modality-invariant and-specific representations for multimodal sentiment analysis,” in Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 1122–1131.
  • [37] M. Ilse, J. Tomczak, and M. Welling, “Attention-based deep multiple instance learning,” in International Conference on Machine Learning.   PMLR, 2018, pp. 2127–2136.
  • [38] M. Y. Lu, D. F. Williamson, T. Y. Chen, R. J. Chen, M. Barbieri, and F. Mahmood, “Data-efficient and weakly supervised computational pathology on whole-slide images,” Nature Biomedical Engineering, vol. 5, no. 6, pp. 555–570, 2021.
  • [39] Z. Shao, H. Bian, Y. Chen, Y. Wang, J. Zhang, X. Ji et al., “Transmil: Transformer based correlated multiple instance learning for whole slide image classification,” Advances in Neural Information Processing Systems, vol. 34, pp. 2136–2147, 2021.
  • [40] H. Zhang, Y. Meng, Y. Zhao, Y. Qiao, X. Yang, S. E. Coupland, and Y. Zheng, “Dtfd-mil: Double-tier feature distillation multiple instance learning for histopathology whole slide image classification,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 18 802–18 812.
  • [41] H. Li, F. Yang, X. Xing, Y. Zhao, J. Zhang, Y. Liu, M. Han, J. Huang, L. Wang, and J. Yao, “Multi-modal multi-instance learning using weakly correlated histopathological images and tabular clinical information,” in Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part VIII 24.   Springer, 2021, pp. 529–539.
  • [42] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017.
  • [43] G. E. Hinton and S. Roweis, “Stochastic neighbor embedding,” Advances in Neural Information Processing Systems, vol. 15, 2002.