WSI-LLaVA: A Multimodal Large Language Model for Whole Slide Image

Yuci Liang¹, Xinheng Lyu²*, Meidan Ding¹, Wenting Chen³, Jipeng Zhang⁵, Yuexiang Ren⁶,⁷,
Xiangjian He², Song Wu¹,⁸, Sen Yang⁴, Xiyue Wang⁴, Xiaohan Xing⁴, Linlin Shen¹
¹Shenzhen University, ²University of Nottingham Ningbo China, ³City University of Hong Kong,
⁴Stanford University, ⁵Hong Kong University of Science and Technology, ⁶Nanchang University,
⁷The First Affiliated Hospital of Nanchang University,
⁸South China Hospital affiliated to Shenzhen University
*Equal contribution.

Abstract

Recent advancements in computational pathology have produced patch-level Multi-modal Large Language Models (MLLMs), but these models are limited by their inability to analyze whole slide images (WSIs) comprehensively and their tendency to bypass crucial morphological features that pathologists rely on for diagnosis. To address these challenges, we first introduce WSI-Bench, a large-scale morphology-aware benchmark containing 180k VQA pairs from 9,850 WSIs across 30 cancer types, designed to evaluate MLLMs’ understanding of morphological characteristics crucial for accurate diagnosis. Building upon this benchmark, we present WSI-LLaVA, a novel framework for gigapixel WSI understanding that employs a three-stage training approach: WSI-text alignment, feature space alignment, and task-specific instruction tuning. To better assess model performance in pathological contexts, we develop two specialized WSI metrics: WSI-Precision and WSI-Relevance. Experimental results demonstrate that WSI-LLaVA outperforms existing models across all capability dimensions, with a significant improvement in morphological analysis, establishing a clear correlation between morphological understanding and diagnostic accuracy.

1 Introduction

Recent advancements in computational pathology have introduced several advanced patch-level Multi-modal Large Language Models (MLLMs) [17, 22, 26, 13, 25]. While these models excel at a variety of pathology-related tasks, their scope is confined to localized regions within tissue samples. In contrast, histopathological diagnostics require a holistic analysis of whole slide images (WSIs), where pathologists integrate findings from extensive tissue areas to establish an accurate diagnosis [10]. The limited capacity of current patch-level MLLMs to capture comprehensive contextual information from WSIs significantly restricts their utility in clinical practice, highlighting the need for models that better incorporate this extensive contextual data.

Several pioneering works [6, 3, 12, 4, 23] on WSI-level multi-modal models have demonstrated the capability to utilize comprehensive contextual information from gigapixel WSIs for various tasks. Some WSI-level models are based on the transformer architecture and target pathology report generation or simple visual question answering (VQA) tasks, while others build on LLMs for complex VQA tasks. For instance, Chen et al. [6] propose SlideChat, which combines a patch-level and a slide-level encoder for WSI feature extraction with a large language model (LLM) for multi-modal understanding. However, the slide-level encoder [9] is pre-trained on general language tasks, which limits its compatibility with patch-level input. Additionally, directly using both encoders for multi-modal training with the LLM leaves a substantial cross-modal gap between WSI features and textual descriptions. These limitations hinder MLLMs from achieving the deep understanding of WSIs necessary for clinical applications. Therefore, it is crucial to address the cross-modal gap between WSIs and text during MLLM training.

Another significant challenge with current MLLMs in computational pathology is their tendency to map pathology images directly to diagnostic outcomes, ignoring crucial morphological features necessary for accurate diagnosis [29]. Pathologists depend on detailed observation of these morphological characteristics when diagnosing using pathology images. For instance, in the diagnosis of breast cancer, accurately distinguishing between invasive ductal carcinoma (IDC) and invasive lobular carcinoma (ILC) hinges on recognizing subtle differences in the arrangement of tumor cells. IDC is typically identified by its irregular cells forming duct-like structures [28]. When MLLMs fail to prioritize in-depth morphological analysis, they may overlook these critical features, thereby compromising the accuracy of the diagnosis. Therefore, it is vital to integrate detailed morphological observations into the MLLMs to ensure they align with traditional diagnostic methods employed by pathologists.

To address the challenges above, we introduce WSI-Bench, a gigapixel WSI benchmark that incorporates morphological characteristics into the evaluation of WSI-level MLLMs, and propose WSI-LLaVA, a novel WSI-level Large Language and Vision Assistant framework for gigapixel WSI understanding and analysis.

WSI-Bench: To incorporate morphological observations into comprehensive WSI understanding, we introduce WSI-Bench, a morphology-aware benchmark that evaluates gigapixel WSI analysis across 3 main capabilities and 11 tasks. It encompasses about 180k Visual Question Answering (VQA) pairs from 9,850 WSIs across 30 cancer types, sourced from 8,368 patients. As shown in Fig. 1, the main capabilities are morphological analysis, diagnosis, and treatment planning, which together cover 11 specific clinical tasks in pathology. To the best of our knowledge, it is the first and largest morphology-aware WSI benchmark, with the most comprehensive set of clinical tasks for training and evaluating WSI-level MLLMs.

WSI-LLaVA: To bridge the cross-modal gap between whole slide images and textual descriptions, we propose the WSI-LLaVA framework, which employs a three-stage training approach for gigapixel WSI analysis. In Fig. 3, the stages include WSI-text alignment, feature space alignment, and task-specific instruction tuning. In the WSI-text alignment stage, we align the WSI encoder with the text encoder through contrastive learning on 9.85k WSI-report pairs from the training set of WSI-Bench, enabling the WSI encoder to extract richer semantic features that correspond to textual descriptions. The WSI encoder consists of patch-level and slide-level components, designed to capture both local details and global contextual features within the WSIs. During the feature space alignment stage, we integrate the WSI encoder and LLM through a projection layer to form our WSI-LLaVA architecture. To align the feature spaces between the WSI encoder and LLM, we maintain their pre-trained parameters frozen while training the projection layer on WSI-report pairs. In the final task-specific instruction tuning stage, we fine-tune WSI-LLaVA on pathological tasks using the training set of WSI-Bench, which consists of 180k VQA pairs.

WSI Metrics: While Natural Language Understanding (NLU) metrics are commonly used to evaluate medical language tasks, they fall short in accurately assessing performance because pathology terminology is complex and often superficially similar [16, 11]. To address this limitation, we introduce two specialized WSI metrics: WSI-Precision, which verifies the accuracy of each claim derived from the ground truth against the model’s answers, and WSI-Relevance, which assesses the alignment of each claim in the model’s responses with the ground truth to ensure their relevance. Our evaluation of WSI MLLMs on WSI-Bench employs both traditional NLU metrics and these new pathology-specific metrics. Results show that WSI-LLaVA outperforms existing models across all three capability dimensions, with a notable improvement in morphological analysis compared to the next-best model. Our contributions are summarized as follows:

• We introduce WSI-Bench, the first large-scale morphology-aware benchmark for gigapixel WSI understanding and evaluation, encompassing 180k VQA pairs from 9,850 WSIs across 30 cancer types. This benchmark uniquely emphasizes morphological observations in evaluating WSI-level MLLMs.
• We propose WSI-LLaVA, a novel framework for gigapixel WSI analysis that bridges the cross-modal gap between WSIs and textual descriptions. The framework introduces a three-stage training approach: WSI-text alignment, feature space alignment, and task-specific instruction tuning.
• We develop WSI-specific evaluation metrics (WSI-Precision and WSI-Relevance) that provide a more accurate assessment of model performance in pathological contexts, which address the limitations of traditional NLU metrics by verifying claim accuracy and response relevance.
• Through comprehensive experiments, we demonstrate WSI-LLaVA’s superior performance compared to existing models, establishing a clear correlation between morphological capabilities and diagnostic accuracy.

2 Related Work

Figure 1: Overview of WSI-Bench, a morphology-aware benchmark for gigapixel WSI evaluation across 3 pathological capabilities and 11 tasks, which encompasses about 180k VQA pairs from 9,850 WSIs across 30 cancer types, sourced from 8,368 patients.

Patch-Level WSI MLLMs.

Prior works in patch-level MLLMs, such as PathChat [17] and Quilt-LLaVA [22], have demonstrated significant capabilities in localized histopathological analysis by focusing on analyzing small image patches extracted from WSIs. PathChat employs the UNI [5] visual encoder pre-trained on extensive histology image datasets and combines it with a large language model fine-tuned on visual language instructions, achieving notable diagnostic accuracy. Quilt-LLaVA introduces its visual encoder, QuiltNet [19], leveraging educational histopathology videos for spatial localization of narrations. Both models are limited to accepting patch-level input, which restricts their ability to capture the global tissue architecture and comprehensive morphological patterns essential for accurate histopathological diagnoses. Our proposed WSI-LLaVA framework addresses these limitations through an innovative dual-encoder architecture that simultaneously processes both patch-level details and slide-level contextual information, enabling comprehensive analysis of entire gigapixel WSIs and enhancing diagnostic accuracy.

WSI-Level WSI MLLMs.

WSI-level multi-modal models aim to interpret entire slides for tasks like pathology report generation and visual question answering. Transformer-based models such as WSI-Caption [3], HistGen [12], WSI-VQA [4], and PRISM [23] generate detailed pathology reports or answer specific questions but lack the flexibility to handle varied clinical inquiries or engage in complex dialogues. SlideChat [6] is the only known multi-modal large language model capable of handling WSI features effectively, combining a patch-level and a slide-level encoder with an LLM for multi-modal understanding. However, it directly learns the mapping between the WSI and the textual answer using both encoders and the LLM, and therefore faces a significant cross-modal gap between WSIs and text. Additionally, its dataset, SlideBench-VQA [6], covers mainly 10 cancer types, leaving the majority of cancers unrepresented. To address these limitations, we propose WSI-Bench, a large-scale, morphology-aware benchmark with about 180k carefully curated VQA pairs across 30 cancer types. Our WSI-LLaVA, equipped with a three-stage training scheme, addresses the cross-modal gap and ensures comprehensive morphological analysis capabilities, enhancing the model’s ability to process entire slides and engage in detailed, open-ended interactions.

3 WSI-Bench

In clinical practice, pathologists rely on morphological features for diagnosis, but current WSI MLLMs overlook these crucial details, compromising accuracy [2]. Thus, we introduce WSI-Bench, a morphology-aware benchmark for gigapixel WSI evaluation that incorporates morphological observations for comprehensive WSI understanding. It evaluates WSI MLLMs across 3 pathological capabilities with 11 tasks for WSI analysis. As shown in Fig. 1, the pathological capabilities are morphological analysis, diagnosis, and treatment planning.

3.1 Pathological Capabilities

Morphological Analysis. To assess the morphological analysis capability of WSI MLLMs, we introduce four hierarchical pathological tasks for a comprehensive evaluation, ranging from global to local perspectives. These tasks are global morphology description, key diagnostic description, regional structure description, and specific feature description. Each category is designed to capture distinct pathological aspects, enhancing the dataset’s comprehensiveness while providing clinically relevant insights.

1) Global Morphology Description evaluates the ability to provide a comprehensive overview of WSIs, establishing a context for global analysis. We construct the open-ended VQA pairs by a question with fixed content and a global morphological description. The question asks for the overall description of the WSI. The global morphological description is extracted from our WSI-report pairs, which should capture the general tissue architecture and overall tumor presentation, ensuring a broad understanding of the slide’s features.

2) Key Diagnostic Description focuses on the evaluation of crucial morphological features vital for accurate diagnosis, linking specific observations to their clinical implications. Each open-ended VQA pair includes a question about key morphological features for diagnosis and its answer. We construct the answer by extracting content related to the cancer type’s diagnosis from the WSI report.

3) Regional Structure Description evaluates the capability to make broad observations across specific regions, such as tumor infiltration, tissue architecture, and cellular morphology. We construct both open-ended and close-ended (multiple-choice questions, MCQs) VQA pairs, each of which includes a question about a specific region and its answer. The answer is obtained by extracting the relevant descriptions for that region from the WSI report.

4) Specific Feature Description evaluates the ability to perform detailed morphological examinations of 6 specific aspects, including vascular invasion, gland formation patterns, cellular pleomorphism, etc. We construct close-ended (MCQs) VQA pairs, each with a question about a specific aspect and its answer, where the answer indicates the presence of the specific feature as mentioned in the WSI report. This allows for precise and detailed insights into the pathological slides, enhancing the dataset’s utility for detailed morphological analysis.

Diagnosis. To evaluate the diagnostic capabilities of WSI MLLMs, we examine four fundamental diagnostic tasks: histological typing, grading, molecular subtyping, and staging. In order to provide an interpretable and accurate diagnosis, each diagnostic result includes not only the answer but also an explanation of how key morphological analysis leads to the final answer.

1) Histological Typing evaluates the ability to classify histological types. It includes both open-ended and close-ended (MCQs) VQA pairs, each with a question about histological typing and its answer, where the answer is obtained from the WSI report.

2) Grading assesses the ability to classify the grade with both open-ended and close-ended (MCQs) VQA pairs, where each VQA pair includes a question about grading and its answer obtained from the WSI report.

3) Molecular Subtyping evaluates the capabilities of identifying specific subtyping based on histological typing and immunohistochemical data. We design open-ended VQA pairs with a question and its answer. To ensure that subtype assignments are medically precise and relevant, each question includes histological typing and immunohistochemical data extracted from its WSI report. The answer for subtyping is the label from the original data.

4) Staging assesses the ability to predict the Tumor-Node-Metastasis (TNM) [21] labels based on detailed clinical scenarios. We construct both open-ended and close-ended (MCQs) VQA pairs with a question and its answer. The questions are created by translating specific TNM classifications into descriptive clinical information. These scenarios form the basis of the questions posed to users, who are tasked with inferring the TNM classification from the provided clinical context. The answer is the corresponding TNM label from the original data.

Treatment Planning. For treatment planning, WSI-Bench includes three pathological tasks: prognosis, treatment recommendations, and report generation. To imitate real-world settings, we provide both the WSI and its accompanying information, such as diagnostic and immunohistochemical results, for treatment planning.

1) Treatment Recommendations evaluates the ability to recommend treatments. We design open-ended VQA pairs, each with a question and its answer. The question includes the diagnostic results of the WSI, such as histological typing, grading, and staging information, as well as immunohistochemical outcomes. We use the GPT-4o model to generate initial treatment recommendations based on WSI reports, which are subsequently refined by clinical experts.

2) Prognosis evaluates the ability to predict whether the likely disease outcome is good or bad. We construct close-ended (Yes/No questions) VQA pairs, each with a question and its answer. The question includes the diagnostic results of the WSI, such as histological typing, grading, and staging information, as well as immunohistochemical outcomes. The answer is obtained by using the GPT-4o model to generate an initial prognosis, which is then reviewed and refined by clinical experts.

3) Report Generation assesses the model’s ability to generate reports from the input WSI. The ground-truth (GT) WSI reports are extracted from the original PDF version of the pathology report using the GPT-4o model. We exclude irrelevant content that cannot be observed in WSIs, and only include the comprehensive morphological descriptions and diagnostic results (i.e., histological typing and grading) in our final WSI report.
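The task taxonomy above can be summarized as a small record type. The following is a minimal sketch only: the capability and task names come from this section, while the `VQAPair` class, its field names, and the example values are hypothetical and do not describe the released data format.

```python
from dataclasses import dataclass
from typing import List, Optional

# Capability -> task names, as listed in Section 3.1.
TASKS = {
    "Morphological Analysis": [
        "Global Morphology Description",
        "Key Diagnostic Description",
        "Regional Structure Description",
        "Specific Feature Description",
    ],
    "Diagnosis": ["Histological Typing", "Grading", "Molecular Subtyping", "Staging"],
    "Treatment Planning": ["Treatment Recommendations", "Prognosis", "Report Generation"],
}


@dataclass
class VQAPair:
    """Hypothetical record layout; field names are illustrative assumptions."""
    wsi_id: str
    capability: str
    task: str
    question: str
    answer: str
    options: Optional[List[str]] = None  # present only for close-ended items

    def __post_init__(self):
        # Reject task names that do not belong to the stated capability.
        if self.task not in TASKS.get(self.capability, []):
            raise ValueError(f"{self.task!r} is not a task of {self.capability!r}")

    @property
    def is_close_ended(self) -> bool:
        return self.options is not None


example = VQAPair(
    wsi_id="slide-0001",  # placeholder identifier, not a real slide
    capability="Treatment Planning",
    task="Prognosis",
    question="Given the diagnostic results, is the prognosis likely to be good?",
    answer="Yes",
    options=["Yes", "No"],
)
```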

3.2 Data Collection

To create a comprehensive benchmark with a multifaceted evaluation, we meticulously curate a dataset from The Cancer Genome Atlas (TCGA) [24], which includes 10,756 WSIs representing 30 different cancer types, along with their corresponding 9,045 pathological reports. To enhance the quality and consistency of our dataset, we exclude instances where multiple reports correspond to a single WSI. The final WSI-Bench comprises 9,850 WSIs paired with 8,368 unique pathological reports. The specific distribution across different cancer types can be found in Appendix Section 2. As shown in Fig. 2, we use two main steps to clean the original WSI reports and construct VQA pairs. We first use GPT-4o to extract textual content from the original PDF version of the pathology report, and remove gross descriptions and immunohistochemical results. The resulting WSI reports only include the comprehensive morphological descriptions and diagnostic results, such as histological typing and grading. To enrich the morphological descriptions, we introduce a reverse-engineering scheme that provides a morphological description extension. It uses diagnostic labels to identify the corresponding morphological characteristics according to tumor classification standards such as the Nottingham grading system [8]. This extension is included in the WSI reports, which are then refined by clinical experts. Based on the final WSI reports, we construct the VQA pairs for each pathological task as required in Section 3.1. Please see details in Appendix Section 2.

Data Statistics and Split. The WSI-Bench comprises a total of 179,569 VQA pairs, which span 3 pathological capabilities with 11 tasks, as listed in Table 1. The WSI-Bench is split into a training set of 9,642 WSIs paired with 175,450 VQA pairs (122,133 open-ended and 53,317 closed-ended), along with a testing set of 208 WSIs containing 4,119 VQA pairs (2,838 open-ended and 1,281 closed-ended).

Table 1: Statistics Analysis on the WSI-Bench.

Capabilities	Tasks	VQA Pair Count
Morphological Analysis	Global Morphology Description	9,850
	Key Diagnostic Description	9,850
	Regional Structure Description	55,181
	Specific Feature Description	48,170
Diagnosis	Histological Typing	19,697
	Grading	15,335
	Molecular Subtyping	977
	Staging	6,743
Treatment Planning	Treatment Recommendations	1,958
	Prognosis	1,958
	Report Generation	9,850
Total		179,569 (180k)
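
As a quick consistency check, the per-task counts in Table 1 and the split sizes quoted in the text can be summed directly. The snippet below is only an arithmetic sketch; the dictionary keys are shorthand labels chosen here, and all numbers are copied from this section.

```python
# VQA pair counts per task, copied from Table 1.
table1 = {
    "Global Morphology Description": 9_850,
    "Key Diagnostic Description": 9_850,
    "Regional Structure Description": 55_181,
    "Specific Feature Description": 48_170,
    "Histological Typing": 19_697,
    "Grading": 15_335,
    "Molecular Subtyping": 977,
    "Staging": 6_743,
    "Treatment Recommendations": 1_958,
    "Prognosis": 1_958,
    "Report Generation": 9_850,
}

# Split sizes quoted in the "Data Statistics and Split" paragraph.
train = {"open": 122_133, "closed": 53_317, "wsis": 9_642}
test = {"open": 2_838, "closed": 1_281, "wsis": 208}

total_pairs = sum(table1.values())
train_pairs = train["open"] + train["closed"]
test_pairs = test["open"] + test["closed"]

print("Table 1 total:", total_pairs)
print("train + test pairs:", train_pairs + test_pairs)
print("train + test WSIs:", train["wsis"] + test["wsis"])
```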

3.3 WSI Metrics

Although NLU metrics are widely used to evaluate open-ended medical language tasks, they cannot accurately assess performance due to pathology’s complex and often similar terminology [14, 20]. Thus, we introduce two specialized WSI metrics for open-ended questions: WSI-Precision and WSI-Relevance. WSI-Precision verifies the accuracy of each claim derived from the ground truth against the model’s answers. WSI-Relevance measures the alignment of each claim in the model’s responses with the ground truth to ensure their relevance. For the accuracy of each claim, the scoring system ranges from 1 (complete accuracy with perfect alignment to established facts) to 0 (completely incorrect or irrelevant information), with intermediate scores of 0.7 (mostly correct with minor discrepancies) and 0.3 (significant errors in critical content affecting the core message). In addition, we use traditional NLU metrics for comparison, including BLEU, ROUGE-L, and METEOR. For the close-ended questions, we use accuracy for evaluation.
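
The four-level scoring scheme can be illustrated with a short sketch. It assumes that per-claim scores have already been assigned by some judge and that each metric is the unweighted mean of its claim scores; the text does not spell out the aggregation rule, so the averaging step, the function name `aggregate_claim_scores`, and the example scores are assumptions made for illustration.

```python
from statistics import mean
from typing import Sequence

# The four score levels defined in Section 3.3.
ALLOWED_SCORES = (0.0, 0.3, 0.7, 1.0)


def aggregate_claim_scores(scores: Sequence[float]) -> float:
    """Average per-claim scores (assumed aggregation, not stated in the text)."""
    if not scores:
        raise ValueError("at least one claim is required")
    for s in scores:
        if s not in ALLOWED_SCORES:
            raise ValueError(f"score {s} is not one of {ALLOWED_SCORES}")
    return mean(scores)


# WSI-Precision: one score per claim extracted from the ground truth,
# each judged against the model's answer (made-up example scores).
precision_claim_scores = [1.0, 0.7, 0.0, 1.0]

# WSI-Relevance: one score per claim extracted from the model's response,
# each judged against the ground truth (made-up example scores).
relevance_claim_scores = [1.0, 1.0, 0.3]

wsi_precision = aggregate_claim_scores(precision_claim_scores)
wsi_relevance = aggregate_claim_scores(relevance_claim_scores)
print(f"WSI-Precision={wsi_precision:.3f}, WSI-Relevance={wsi_relevance:.3f}")
```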

4 WSI-LLaVA

Since current WSI MLLMs [6] directly combine a WSI encoder and an LLM for multi-modal learning, a substantial cross-modal gap remains between WSIs and textual descriptions, which limits the MLLMs’ ability to understand WSIs. Hence, we introduce the WSI-LLaVA framework, which adopts a three-stage training approach for gigapixel WSI analysis. As depicted in Fig. 3, it consists of the WSI-text alignment, feature space alignment, and task-specific instruction tuning stages.

WSI-Text Alignment. In this stage, the WSI encoder includes a patch-level encoder that employs the tile-level encoder of Prov-GigaPath [27], pre-trained using the DINOv2 [18] framework, to process individual $256\times 256$ pixel patches extracted from WSIs. The slide-level encoder, based on the LongNet [9] architecture, generates global features from all the patch features. Then, a text encoder using Bio_ClinicalBERT [1] encodes the WSI reports into textual features. Finally, we align the global features with the textual features through contrastive learning, so that the global features carry more semantic information. We train the WSI and text encoders on the WSI-report pairs in the training set of WSI-Bench.
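
To make the alignment objective concrete, the sketch below computes a CLIP-style symmetric contrastive (InfoNCE) loss on toy feature vectors in plain Python. The section only states that contrastive learning is used; the specific loss form, the temperature value of 0.07, and the assumption that slide and report features already share a common dimension are illustrative choices, not details taken from the paper.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(slide_feats, text_feats, temperature=0.07):
    """Average of slide-to-text and text-to-slide InfoNCE losses (assumed form)."""
    s = [normalize(v) for v in slide_feats]
    t = [normalize(v) for v in text_feats]
    # Cosine similarity between every slide and every report, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(si, tj)) / temperature for tj in t] for si in s]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


random.seed(0)
# Toy batch: 4 report embeddings and slide embeddings that nearly match them.
text = [[random.gauss(0, 1) for _ in range(16)] for _ in range(4)]
matched = [[x + 0.01 * random.gauss(0, 1) for x in row] for row in text]
mismatched = matched[::-1]  # pair every slide with the wrong report

loss_matched = symmetric_contrastive_loss(matched, text)
loss_mismatched = symmetric_contrastive_loss(mismatched, text)
print(loss_matched, loss_mismatched)
```

Correctly paired slide and report features yield a much lower loss than mismatched pairs, which is the signal that pulls the slide-level global features toward the report semantics.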

Feature Space Alignment. Our WSI-LLaVA comprises the WSI encoder pre-trained in the first stage, a projection layer, and a pre-trained LLM. To align the feature spaces of the WSI encoder and the LLM, we freeze these pre-trained models and train only the projection layer on the WSI-report pairs of WSI-Bench.

Task-Specific Instruction Tuning. To learn the pathological tasks, we perform instruction tuning on the WSI-LLaVA for all the tasks in WSI-Bench. In this stage, we freeze the WSI encoder, while the projection layer and the LLM are trained to enhance the model’s ability to handle diverse instructional queries.
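
The three stages differ mainly in which components receive gradient updates. The following sketch records that schedule as plain data, based on the descriptions above; the component labels (`wsi_encoder`, `text_encoder`, `projection`, `llm`) and the `Stage` class are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Stage:
    name: str
    data: str
    trainable: FrozenSet[str]  # components updated in this stage


# Schedule as described in Section 4; component labels are illustrative.
STAGES = [
    Stage("WSI-text alignment", "WSI-report pairs",
          frozenset({"wsi_encoder", "text_encoder"})),
    Stage("Feature space alignment", "WSI-report pairs",
          frozenset({"projection"})),
    Stage("Task-specific instruction tuning", "VQA pairs",
          frozenset({"projection", "llm"})),
]


def is_trainable(component: str, stage: Stage) -> bool:
    """Return True if the component is updated during the given stage."""
    return component in stage.trainable


for stage in STAGES:
    print(f"{stage.name}: train {sorted(stage.trainable)} on {stage.data}")
```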

Table 2: Quantitative evaluation of WSI and general-purpose MLLMs on our WSI-Bench across three pathological capabilities. WSI-P: WSI-Precision, WSI-R: WSI-Relevance, Acc: accuracy, open: open-ended question, and close: close-ended question.

		Morphological Analysis			Diagnosis			Treatment Planning			Average
Model	Input	Open		Close	Open		Close	Open		Close
		WSI-P	WSI-R	Acc	WSI-P	WSI-R	Acc	WSI-P	WSI-R	Acc
Quilt-LLaVA[22]	Thumbnail	0.448	0.447	0.947	0.586	0.604	0.849	0.788	0.816	1.000	0.721
GPT-4o[15]	Thumbnail	0.220	0.204	0.471	0.472	0.457	0.530	0.496	0.841	0.875	0.507
WSI-VQA[4]	WSI	0.395	0.462	0.758	0.436	0.525	0.577	0.791	0.595	0.771	0.590
WSI-LLaVA	WSI	0.488	0.610	0.951	0.610	0.612	0.863	0.810	0.845	1.000	0.754

Table 3: Quantitative evaluation of WSI and general-purpose MLLMs on report generation task of our WSI-Bench.

Models	BLEU-1	BLEU-2	BLEU-3	BLEU-4	ROUGE-L	METEOR	WSI-Precision	WSI-Relevance
GPT-4o[15]	0.2017	0.0687	0.0296	0.0161	0.1319	0.1670	0.0670	0.1380
Quilt-LLaVA[22]	0.3343	0.1698	0.0928	0.0566	0.2463	0.2910	0.3240	0.3330
MI-Gen[3]	0.4027	0.3061	0.2481	0.2085	0.4464	0.407	0.3100	0.3770
Hist-Gen[12]	0.4058	0.3070	0.2482	0.2081	0.4484	0.4162	0.3000	0.3970
WSI-LLaVA	0.3531	0.1859	0.1058	0.0665	0.2626	0.3072	0.3800	0.4290

5 Experiment

5.1 Baseline Models

We collect various WSI MLLMs to evaluate on our WSI-Bench dataset. These include specialized models for WSI report generation, such as MI-Gen [3] and Hist-Gen [12], as well as models designed for pathological VQA tasks, like Quilt-LLaVA [22] and WSI-VQA [4]. Additionally, we assess GPT-4o’s performance to evaluate a general-purpose MLLM. For models with input size constraints (e.g., Quilt-LLaVA and GPT-4o), we resize the WSIs to $1024\times 1024$ pixels to fit within their input processing capabilities. To ensure a fair comparison, all WSI MLLMs are trained on WSI-Bench’s training set and evaluated on its test set.

5.2 Quantitative Evaluation

We quantitatively compare the WSI MLLMs with the general-purpose MLLM on our WSI-Bench across three pathological capabilities: morphological analysis, diagnosis, and treatment planning. As listed in Table 2, our proposed WSI-LLaVA achieves the best results across all the capabilities, surpassing Quilt-LLaVA by about 3% in average score (0.754 vs. 0.721). This indicates the superiority of our WSI-LLaVA over both the WSI and general-purpose MLLMs. Among the three capabilities, current MLLMs achieve their best performance on treatment planning tasks, suggesting that these MLLMs have stronger planning capabilities in pathology scenarios. Moreover, the models perform better on closed-ended questions than on open-ended questions, suggesting that open-ended questions are more challenging for current models.
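
The Average column of Table 2 is consistent with an unweighted mean over the nine per-capability columns. The short check below recomputes it from the table values; treating the average as an unweighted mean is an inference from the numbers rather than something the text states.

```python
# Nine metric columns per model, copied from Table 2, followed by the reported average.
table2 = {
    "Quilt-LLaVA": ([0.448, 0.447, 0.947, 0.586, 0.604, 0.849, 0.788, 0.816, 1.000], 0.721),
    "GPT-4o":      ([0.220, 0.204, 0.471, 0.472, 0.457, 0.530, 0.496, 0.841, 0.875], 0.507),
    "WSI-VQA":     ([0.395, 0.462, 0.758, 0.436, 0.525, 0.577, 0.791, 0.595, 0.771], 0.590),
    "WSI-LLaVA":   ([0.488, 0.610, 0.951, 0.610, 0.612, 0.863, 0.810, 0.845, 1.000], 0.754),
}

recomputed = {}
for model, (scores, reported) in table2.items():
    recomputed[model] = sum(scores) / len(scores)
    print(f"{model}: recomputed={recomputed[model]:.3f}, reported={reported:.3f}")

# Margin between WSI-LLaVA and the strongest baseline (Quilt-LLaVA).
margin = recomputed["WSI-LLaVA"] - recomputed["Quilt-LLaVA"]
print(f"margin over Quilt-LLaVA: {margin:.3f}")
```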

For the report generation task, we compare our WSI-LLaVA with WSI report generation models (MI-Gen and Hist-Gen), a WSI MLLM for the VQA task (Quilt-LLaVA), and a general-purpose MLLM (GPT-4o) on both NLU metrics and our proposed WSI-specialized metrics. As shown in Table 3, while the specialized WSI report generation models (Hist-Gen and MI-Gen) excel in traditional NLU metrics (BLEU, ROUGE-L, METEOR), achieving scores up to 0.4058 in BLEU-1, our WSI-LLaVA shows superior performance in domain-specific metrics with the highest WSI-Precision (0.3800) and WSI-Relevance (0.4290). The general-purpose model GPT-4o consistently underperforms across all metrics, highlighting the importance of domain-specific training. This pattern suggests a trade-off between general language generation capabilities and domain-specific expertise, with WSI-LLaVA striking a balance that favors accurate and relevant WSI analysis over pure linguistic similarity to reference reports.

Table 4: Ablation studies on WSI encoder across three pathological capabilities. WSI-P: WSI-Precision, WSI-R: WSI-Relevance, Acc: accuracy, Open: open-ended questions, Close: close-ended questions.

	Morphological Analysis			Diagnosis			Treatment Planning			Report Generation		Average
WSI Encoder	Open		Close	Open		Close	Open		Close	WSI-P	WSI-R
	WSI-P	WSI-R	Acc	WSI-P	WSI-R	Acc	WSI-P	WSI-R	Acc
✗	0.435	0.433	0.940	0.565	0.577	0.847	0.739	0.803	1.00	0.290	0.326	0.632
✓	0.488	0.487	0.951	0.610	0.612	0.863	0.810	0.845	1.00	0.380	0.429	0.680

Table 5: Sample Comparison of VQA. Correct responses are highlighted in green, and incorrect responses are highlighted in red.

5.3 Qualitative Evaluation

As shown in Table 5, in the pathological VQA tasks, WSI-LLaVA demonstrated superior performance by correctly identifying the Nottingham Grade 1 classification and accurately describing all three scoring components (tubule formation, nuclear pleomorphism, and mitotic count). Correct responses are highlighted in green, and incorrect responses are highlighted in red. Quilt-LLaVA, while partially accurate in identifying tubule formation and mitotic count scores, incorrectly assessed the overall grade as Grade 2 and misinterpreted the nuclear pleomorphism score, highlighting the challenges in consistent grading interpretation.

As displayed in Table 6, for the report generation task, WSI-LLaVA showed higher accuracy by correctly identifying the poorly differentiated squamous cell carcinoma and detecting key features like necrosis. In contrast, MI-Gen made several critical errors, including misclassifying the tumor type as adenocarcinoma and incorrectly stating the absence of vascular invasion and necrosis. This comparison reveals that WSI-LLaVA maintains better consistency with ground truth observations and demonstrates superior capability in detailed pathological feature recognition.

Table 7: Ablation studies on three-stage training scheme of WSI-LLaVA. WSI-P: WSI-Precision, WSI-R: WSI-Relevance, Acc: accuracy, Open: open-ended questions, Close: close-ended questions.

Stage 1	Stage 2	Morphological Analysis			Diagnosis			Treatment Planning			Report Generation		Average
		Open		Close	Open		Close	Open		Close	WSI-P	WSI-R
		WSI-P	WSI-R	Acc	WSI-P	WSI-R	Acc	WSI-P	WSI-R	Acc	WSI-P	WSI-R
✗	✗	0.425	0.483	0.945	0.561	0.541	0.837	0.761	0.798	1.000	0.356	0.367	0.643
✗	✓	0.442	0.478	0.945	0.589	0.602	0.856	0.744	0.776	1.000	0.380	0.377	0.654
✓	✓	0.448	0.487	0.951	0.610	0.612	0.863	0.810	0.845	1.000	0.380	0.429	0.676

Table 8: Comparison between WSI metrics (WSI-Precision and WSI-Relevance) and NLU metrics. Green text indicates content matching the ground-truth response, while red text highlights deviations from the ground truth. WSI-P: WSI-Precision, WSI-R: WSI-Relevance.

Model	Model Responses	WSI-P	WSI-R	NLU Metrics
Ground Truth	Microscopic observation reveals poorly differentiated squamous cell carcinoma characterized by sheets and nests of atypical squamous cells. The tumor cells exhibit marked nuclear pleomorphism, irregular nuclei, and prominent nucleoli. Necrosis and vascular invasion are evident, and mitotic figures are numerous, indicating aggressive behavior. The diagnosis is poorly differentiated squamous cell carcinoma.	1.0	1.0	N/A
WSI-LLaVA	Microscopic observation of the pathology slide reveals poorly differentiated squamous cell carcinoma characterized by sheets and nests of atypical squamous epithelial cells with minimal keratinization. The tumor cells exhibit marked nuclear pleomorphism, with enlarged, irregular nuclei and prominent nucleoli. There is a high rate of mitotic activity, indicative of aggressive cellular proliferation. Vascular invasion is evident, with tumor cells infiltrating lymphovascular spaces. Areas of necrosis are present within the tumor mass. Based on these observations, the diagnosis is poorly differentiated invasive squamous carcinoma.	0.8	0.83	BLEU-1: 0.5060 BLEU-2: 0.3848 BLEU-3: 0.3205 BLEU-4: 0.2803 ROUGE-L: 0.4727 METEOR: 0.6387
Hist-Gen	Microscopic observation of the pathology slide reveals a poorly differentiated squamous cell carcinoma characterized by irregular nests and sheets of atypical squamous cells. The tumor cells exhibit marked nuclear pleomorphism with prominent nucleoli and prominent nucleoli. There is no evidence of vascular invasion or necrosis within the examined tissue. Based on these observations, the diagnosis is squamous cell carcinoma poorly differentiated.	0.233	0.52	BLEU-1: 0.6029 BLEU-2: 0.4930 BLEU-3: 0.4156 BLEU-4: 0.3661 ROUGE-L: 0.5161 METEOR: 0.5909
MI-Gen	Microscopic observation of the pathology slide reveals a poorly differentiated adenocarcinoma characterized by a mixture of atypical squamous cell carcinoma. The tumor exhibits a high degree of differentiation with a high degree of nuclear pleomorphism and increased mitotic activity. There is no evidence of vascular invasion or necrosis or calcification. Based on these observations, the diagnosis is poorly differentiated.	0.1	0.05	BLEU-1: 0.5394 BLEU-2: 0.4066 BLEU-3: 0.3101 BLEU-4: 0.2512 ROUGE-L: 0.4783 METEOR: 0.4971

5.4 Ablation Studies

Effectiveness of WSI Encoder. To assess the contribution of the WSI encoder to WSI feature extraction, we ablate it from the proposed WSI-LLaVA framework, as listed in Table 4. Across all three pathological capabilities and both report generation tasks, the model with the WSI encoder consistently outperforms the variant without it. Notable improvements are observed in the WSI-R metrics for morphological analysis (0.447 to 0.610) and treatment planning (0.816 to 0.845). The overall average performance increases from 0.632 to 0.680, an absolute gain of 0.048 (4.8 percentage points). These results support the effectiveness of the WSI encoder in extracting meaningful features from whole slide images, contributing to better performance across all evaluation metrics.

Effectiveness of the three-stage training scheme. To demonstrate the necessity of the three-stage training scheme for the WSI-LLaVA framework, we ablate the first stage alone and the first and second stages together, and compare performance. As shown in Table 7, ablating the first stage lowers the average score by 2.2%. Ablating both the first and second stages lowers it further, by 3.3%, indicating that each stage contributes to performance to a different extent. This highlights the effectiveness of three-stage training in mitigating the cross-modal gap.

5.5 Comparison between NLU Metrics and WSI Metrics

Table 8 compares different models’ performance in generating pathology reports, evaluated using both our proposed WSI metrics and traditional natural language understanding (NLU) metrics such as BLEU, ROUGE-L, and METEOR. In the model responses, clinically correct information is highlighted in green, while incorrect or irrelevant content is highlighted in red. The comparison reveals significant discrepancies between the two families of metrics in their ability to evaluate model performance. While Hist-Gen achieves the highest scores across most NLU metrics (BLEU-1: 0.6029, BLEU-2: 0.4930, BLEU-3: 0.4156, BLEU-4: 0.3661, ROUGE-L: 0.5161), its WSI-Precision (0.233) and WSI-Relevance (0.52) scores are substantially lower than those of WSI-LLaVA (WSI-Precision: 0.8, WSI-Relevance: 0.83). Similarly, MI-Gen obtains moderate NLU metric scores but performs poorly on the WSI-specific metrics (WSI-Precision: 0.1, WSI-Relevance: 0.05), largely due to its misidentification of the tumor type and incorrect assessment of key pathological features.

This discrepancy between traditional NLU metrics and WSI metrics can be attributed to several factors. 1) Surface Similarity: High word overlap and similar sentence structures between the generated text and the reference lead to higher BLEU scores, even if the clinical content differs significantly. 2) n-gram Matching: Metrics like BLEU reward matching n-grams regardless of semantic accuracy. Shared phrases like “Microscopic observation reveals poorly differentiated squamous cell carcinoma” can inflate scores, even when used incorrectly. 3) Semantic Insensitivity: Traditional NLU metrics do not account for the correctness of medical facts; they may assign high scores to clinically incorrect statements if they share vocabulary and structure with the reference. These results suggest that traditional NLU metrics may not adequately capture the clinical accuracy and relevance of model-generated pathology reports, highlighting the importance of domain-specific evaluation metrics like WSI-Precision and WSI-Relevance.
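The n-gram matching effect can be illustrated with a minimal sketch. The function below computes clipped n-gram precision, the core quantity inside BLEU, without the brevity penalty or the geometric mean over n-gram orders. The three sentences are hypothetical examples written for this illustration (loosely modeled on the wording in Table 8), not outputs of any evaluated model.

```python
from collections import Counter


def ngram_precision(candidate: str, reference: str, n: int) -> float:
    """Clipped n-gram precision of `candidate` against a single `reference`."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(cand_ngrams.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref_ngrams[gram]) for gram, count in cand_ngrams.items())
    return matched / total


# Hypothetical sentences, for illustration only.
reference = "necrosis and vascular invasion are evident"
contradiction = "necrosis and vascular invasion are absent"            # clinically opposite
paraphrase = "the tumor shows necrosis and invades blood vessels"      # clinically consistent

p_contra = ngram_precision(contradiction, reference, 1)
p_para = ngram_precision(paraphrase, reference, 1)
print(f"contradiction: {p_contra:.3f}, paraphrase: {p_para:.3f}")
```

In this toy case the contradictory sentence shares five of its six words with the reference, while the consistent paraphrase shares only two of eight, so an overlap-based score ranks the clinically wrong statement higher.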

6 Conclusion

We present WSI-LLaVA, a novel framework for gigapixel WSI analysis, along with WSI-Bench, the first large-scale morphology-aware benchmark containing 180k VQA pairs. Our three-stage training approach effectively bridges the cross-modal gap between WSIs and text, while our specialized WSI-Precision and WSI-Relevance metrics enable more accurate evaluation of model performance. Experimental results demonstrate WSI-LLaVA’s superior performance across all capabilities, establishing a clear link between morphological understanding and diagnostic accuracy. These advances represent significant progress toward developing more clinically relevant computational pathology tools.

References

Alsentzer et al. [2019] Emily Alsentzer, John R Murphy, Willie Boag, Wei-Hung Weng, Di Jin, Tristan Naumann, and Matthew McDermott. Publicly available clinical bert embeddings. arXiv preprint arXiv:1904.03323, 2019.
Bradbury et al. [2022] Richard S Bradbury, Sarah GH Sapp, Idzi Potters, Blaine A Mathison, John Frean, Abhishek Mewara, Harsha Sheorey, Francesca Tamarozzi, Marc Roger Couturier, Peter Chiodini, et al. Where have all the diagnostic morphological parasitologists gone? Journal of clinical microbiology, 60(11):e00986–22, 2022.
Chen et al. [2024a] Pingyi Chen, Honglin Li, Chenglu Zhu, Sunyi Zheng, Zhongyi Shui, and Lin Yang. Wsicaption: Multiple instance generation of pathology reports for gigapixel whole-slide images. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 546–556. Springer, 2024a.
Chen et al. [2025] Pingyi Chen, Chenglu Zhu, Sunyi Zheng, Honglin Li, and Lin Yang. Wsi-vqa: Interpreting whole slide images by generative visual question answering. In European Conference on Computer Vision, pages 401–417. Springer, 2025.
Chen et al. [2024b] Richard J Chen, Tong Ding, Ming Y Lu, Drew FK Williamson, Guillaume Jaume, Andrew H Song, Bowen Chen, Andrew Zhang, Daniel Shao, Muhammad Shaban, et al. Towards a general-purpose foundation model for computational pathology. Nature Medicine, 30(3):850–862, 2024b.
Chen et al. [2024c] Ying Chen, Guoan Wang, Yuanfeng Ji, Yanjun Li, Jin Ye, Tianbin Li, Bin Zhang, Nana Pei, Rongshan Yu, Yu Qiao, et al. Slidechat: A large vision-language assistant for whole-slide pathology image understanding. arXiv preprint arXiv:2410.11761, 2024c.
Devlin [2018] Jacob Devlin. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
Dilani et al. [2019] Lokuhetty Dilani, AW Valerie, W Beiko, et al. Who classification of tumours, breast tumours. Lyon: International Agency for Research on Cancer, 2019.
Ding et al. [2023] Jiayu Ding, Shuming Ma, Li Dong, Xingxing Zhang, Shaohan Huang, Wenhui Wang, Nanning Zheng, and Furu Wei. Longnet: Scaling transformers to 1,000,000,000 tokens. arXiv preprint arXiv:2307.02486, 2023.
Farahani et al. [2015] Navid Farahani, Anil V Parwani, and Liron Pantanowitz. Whole slide imaging in pathology: advantages, limitations, and emerging perspectives. Pathology and Laboratory Medicine International, pages 23–33, 2015.
García-Barragán et al. [2024] Álvaro García-Barragán, Alberto González Calatayud, Oswaldo Solarte-Pabón, Mariano Provencio, Ernestina Menasalvas, and Víctor Robles. Gpt for medical entity recognition in spanish. Multimedia Tools and Applications, pages 1–20, 2024.
Guo et al. [2024] Zhengrui Guo, Jiabo Ma, Yingxue Xu, Yihui Wang, Liansheng Wang, and Hao Chen. Histgen: Histopathology report generation via local-global feature encoding and cross-modal context interaction. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 189–199. Springer, 2024.
He et al. [2024] Sunan He, Yuxiang Nie, Zhixuan Chen, Zhiyuan Cai, Hongmei Wang, Shu Yang, and Hao Chen. Meddr: Diagnosis-guided bootstrapping for large-scale medical vision-language learning. arXiv preprint arXiv:2404.15127, 2024.
He et al. [2023] Zexue He, Yu Wang, An Yan, Yao Liu, Eric Y Chang, Amilcare Gentili, Julian McAuley, and Chun-Nan Hsu. Medeval: A multi-level, multi-task, and multi-domain medical benchmark for language model evaluation. arXiv preprint arXiv:2310.14088, 2023.
Islam and Moushi [2024] Raisa Islam and Owana Marzia Moushi. Gpt-4o: The cutting-edge advancement in multimodal llm. Authorea Preprints, 2024.
López-Úbeda et al. [2022] Pilar López-Úbeda, Teodoro Martín-Noguerol, José Aneiros-Fernández, and Antonio Luna. Natural language processing in pathology: current trends and future insights. The American Journal of Pathology, 192(11):1486–1495, 2022.
Lu et al. [2023] Ming Y Lu, Bowen Chen, Drew FK Williamson, Richard J Chen, Kenji Ikamura, Georg Gerber, Ivy Liang, Long Phi Le, Tong Ding, Anil V Parwani, et al. A foundational multimodal vision language ai assistant for human pathology. arXiv preprint arXiv:2312.07814, 2023.
Oquab et al. [2023] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
Park et al. [2022] Jongho Park, HyukJun Kwon, Seowoo Kim, Junyoung Lee, Minho Ha, Euicheol Lim, Mohsen Imani, and Yeseong Kim. Quiltnet: Efficient deep learning inference on multi-chip accelerators using model partitioning. In Proceedings of the 59th ACM/IEEE Design Automation Conference, pages 1159–1164, 2022.
Santos et al. [2022] Thiago Santos, Amara Tariq, Susmita Das, Kavyasree Vayalpati, Geoffrey H Smith, Hari Trivedi, and Imon Banerjee. Pathologybert-pre-trained vs. a new transformer language model for pathology domain. In AMIA annual symposium proceedings, page 962. American Medical Informatics Association, 2022.
Scorsetti et al. [2016] Marta Scorsetti, Francesco Leo, Annalisa Trama, Rolando D’Angelillo, Danila Serpico, Marianna Macerelli, Paolo Zucali, Gemma Gatta, and Marina Chiara Garassino. Thymoma and thymic carcinomas. Critical reviews in oncology/hematology, 99:332–350, 2016.
Seyfioglu et al. [2024] Mehmet Saygin Seyfioglu, Wisdom O Ikezogwo, Fatemeh Ghezloo, Ranjay Krishna, and Linda Shapiro. Quilt-llava: Visual instruction tuning by extracting localized narratives from open-source histopathology videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13183–13192, 2024.
Shaikovski et al. [2024] George Shaikovski, Adam Casson, Kristen Severson, Eric Zimmermann, Yi Kan Wang, Jeremy D Kunz, Juan A Retamero, Gerard Oakley, David Klimstra, Christopher Kanan, et al. Prism: A multi-modal generative foundation model for slide-level histopathology. arXiv preprint arXiv:2405.10254, 2024.
Tomczak et al. [2015] Katarzyna Tomczak, Patrycja Czerwińska, and Maciej Wiznerowicz. Review the cancer genome atlas (tcga): an immeasurable source of knowledge. Contemporary Oncology/Współczesna Onkologia, 2015(1):68–77, 2015.
Wang et al. [2024] Xiyue Wang, Junhan Zhao, Eliana Marostica, Wei Yuan, Jietian Jin, Jiayu Zhang, Ruijiang Li, Hongping Tang, Kanran Wang, Yu Li, Fang Wang, Yulong Peng, Junyou Zhu, Jing Zhang, Christopher R Jackson, Jun Zhang, Deborah Dillon, Nancy U Lin, Lynette M. Sholl, Thomas Denize, David Meredith, Keith L. Ligon, Sabina Signoretti, Shuji Ogino, Jeffrey A Golden, MacLean P Nasrallah, Xiao Han, Sen Yang, and Kun-Hsing Yu. A pathology foundation model for cancer diagnosis and prognosis prediction. Nature, 2024.
Wu et al. [2024] Xiaomin Wu, Rui Xu, Pengchen Wei, Wenkang Qin, Peixiang Huang, Ziheng Li, and Lin Luo. Pathinsight: Instruction tuning of multimodal datasets and models for intelligence assisted diagnosis in histopathology. arXiv preprint arXiv:2408.07037, 2024.
Xu et al. [2024] Hanwen Xu, Naoto Usuyama, Jaspreet Bagga, Sheng Zhang, Rajesh Rao, Tristan Naumann, Cliff Wong, Zelalem Gero, Javier González, Yu Gu, et al. A whole-slide foundation model for digital pathology from real-world data. Nature, pages 1–8, 2024.
Yoder et al. [2007] Brian J Yoder, Edward J Wilkinson, and Nicole A Massoll. Molecular and morphologic distinctions between infiltrating ductal and lobular carcinoma of the breast. The breast journal, 13(2):172–179, 2007.
Zhao et al. [2023] Zihao Zhao, Yuxiao Liu, Han Wu, Mei Wang, Yonghao Li, Sheng Wang, Lin Teng, Disheng Liu, Zhiming Cui, Qian Wang, et al. Clip in medical imaging: A comprehensive survey. arXiv preprint arXiv:2312.07353, 2023.
Zheng et al. [2023] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36:46595–46623, 2023.


Supplementary Material

Abstract. In this supplementary material, we provide additional information about the proposed WSI-Bench, WSI-LLaVA, and WSI Metrics. Appendix 1 illustrates additional experiments of the proposed WSI-LLaVA on the proposed WSI-Bench. Appendix 2 demonstrates the details of WSI-Bench. Appendix 3 provides the details of the WSI metrics. Appendix 4 provides the implementation details of the proposed method.

1 Additional Experiments

1.1 Quantitative Evaluation across Each Task

To quantitatively evaluate each pathological task across three capabilities, we demonstrate the performance of existing MLLMs and our proposed WSI-LLaVA in Table 1. As shown in the results, WSI-LLaVA achieves superior overall performance with the highest average scores in both WSI-Precision (0.601) and WSI-Relevance (0.608). Specifically, in morphological analysis capability, WSI-LLaVA demonstrates leading WSI-Precision scores across all sub-tasks, with particular strength in specific feature description (0.760). For diagnostic capability, it exhibits robust performance in histological typing (0.410), grading (0.570), and staging (0.830), although Quilt-LLaVA shows slightly better results in molecular subtyping. Regarding treatment planning capability, WSI-LLaVA excels in treatment recommendations with the highest scores in both metrics (0.790 and 0.881), while maintaining competitive performance in prognosis. These comprehensive results validate WSI-LLaVA’s balanced and reliable performance across diverse pathological tasks compared to existing MLLMs.

Table 1: Quantitative comparison among MLLMs across three capabilities by using WSI-Precision and WSI-Relevance metrics.

Methods	Morphological Analysis				Diagnosis				Treatment Planning		Average
	Global Morph. Desc.	Key Diag. Desc.	Reg. Struct. Desc.	Spec. Feat. Desc.	Hist. Typing	Grading	Mol. Subtyping	Staging	Treat. Recom.	Prognosis
	WSI-Precision
Quilt-LLaVA	0.338	0.314	0.389	0.752	0.339	0.505	0.675	0.824	0.764	0.812	0.571
GPT-4o	0.221	0.243	0.153	0.263	0.282	0.363	0.578	0.655	0.498	0.493	0.375
WSI-VQA	0.322	0.313	0.389	0.554	0.377	0.430	0.388	0.550	0.708	0.874	0.490
WSI-LLaVA	0.390	0.350	0.450	0.760	0.410	0.570	0.630	0.830	0.790	0.830	0.601
	WSI-Relevance
Quilt-LLaVA	0.289	0.363	0.404	0.732	0.333	0.453	0.708	0.922	0.858	0.773	0.584
GPT-4o	0.171	0.326	0.216	0.103	0.177	0.312	0.169	0.745	0.792	0.889	0.390
WSI-VQA	0.383	0.370	0.511	0.586	0.475	0.514	0.476	0.634	0.444	0.746	0.514
WSI-LLaVA	0.333	0.382	0.465	0.766	0.380	0.511	0.675	0.882	0.881	0.808	0.608

Abbreviations: Global Morph. Desc. (Global Morphological Description), Key Diag. Desc. (Key Diagnostic Description),
Reg. Struct. Desc. (Regional Structure Description), Spec. Feat. Desc. (Specific Feature Description), Hist. Typing (Histological Typing),
Grading (Grading), Mol. Subtyping (Molecular Subtyping), Staging (Staging), Treat. Recom. (Treatment Recommendations), Prognosis (Prognosis).

1.2 Qualitative Evaluation across Each Task

To qualitatively evaluate each task, we show the responses of existing methods and our proposed WSI-LLaVA for 11 pathological tasks in Tables 16 to 26. We use green to highlight content that matches the ground truth and red to indicate content that contradicts it. In these tables, we observe that WSI-LLaVA’s answers are more consistent with the ground truth and exhibit higher accuracy across all tasks. While Quilt-LLaVA’s responses are similar in form to the ground truth, they contain more errors in specific content, especially in morphological description tasks. This may be due to its inability to extract the rich morphological features in WSIs. WSI-VQA’s answers often include repeated invalid information, and some sentences have grammatical errors, indicating the fundamental disadvantages of Transformer-based models compared to LLMs. GPT-4o provides a large amount of irrelevant information, likely because it cannot be trained to read WSIs.

In the radar chart displaying WSI-Precision metrics (Figure 4), WSI-LLaVA (our model) dominates with broader coverage and higher peaks in most diagnostic categories, particularly excelling in “Specific Feature Description,” “Staging,” and “Prognosis.” This suggests an excellent ability to accurately identify and describe critical pathological features and outcomes. Meanwhile, models like GPT-4o show considerably lower precision, particularly in detailed descriptions, which may limit their utility in nuanced diagnostic scenarios.

The WSI-Relevance radar chart reveals that WSI-LLaVA also leads in relevance (Figure 4), with outstanding performance in “Staging” and “Treatment Recommendations,” underscoring its capability to deliver clinically pertinent information that aids in treatment planning and prognosis estimation. In contrast, while GPT-4o and WSI-VQA provide valuable insights in specific areas such as “Prognosis” and “Regional Structure Description,” they exhibit a balanced but generally lower relevance compared to our model, indicating a need for targeted improvements to enhance their practical application in clinical settings.

2 WSI-Bench

To establish a comprehensive and large-scale WSI-Bench, we include an automated data construction process using GPT-4o and a subsequent validation process by clinical experts.

2.1 Data Distribution

In Fig. 5, we illustrate the distribution of WSIs across various cancer types as categorized in The Cancer Genome Atlas (TCGA) database.

Cancer types are commonly identified by standardized abbreviations. Specifically, TCGA-ACC refers to Adrenocortical Carcinoma; TCGA-BLCA is Bladder Urothelial Carcinoma; TCGA-BRCA denotes Breast Invasive Carcinoma; TCGA-CESC stands for Cervical Squamous Cell Carcinoma and Endocervical Adenocarcinoma; TCGA-CHOL represents Cholangiocarcinoma; TCGA-COAD is Colon Adenocarcinoma; TCGA-DLBC refers to Lymphoid Neoplasm Diffuse Large B-cell Lymphoma; TCGA-ESCA denotes Esophageal Carcinoma; TCGA-HNSC stands for Head and Neck Squamous Cell Carcinoma; TCGA-KICH represents Kidney Chromophobe; TCGA-KIRC is Kidney Renal Clear Cell Carcinoma; TCGA-KIRP denotes Kidney Renal Papillary Cell Carcinoma; TCGA-LGG refers to Brain Lower Grade Glioma; TCGA-LIHC stands for Liver Hepatocellular Carcinoma; TCGA-LUAD represents Lung Adenocarcinoma; TCGA-LUSC is Lung Squamous Cell Carcinoma; TCGA-MESO denotes Mesothelioma; TCGA-OV refers to Ovarian Serous Cystadenocarcinoma; TCGA-PCPG stands for Pheochromocytoma and Paraganglioma; TCGA-PRAD represents Prostate Adenocarcinoma; TCGA-READ is Rectum Adenocarcinoma; TCGA-SARC denotes Sarcoma; TCGA-SKCM refers to Skin Cutaneous Melanoma; TCGA-STAD stands for Stomach Adenocarcinoma; TCGA-TGCT represents Testicular Germ Cell Tumors; TCGA-THCA is Thyroid Carcinoma; TCGA-THYM denotes Thymoma; TCGA-UCEC stands for Uterine Corpus Endometrial Carcinoma; TCGA-UCS represents Uterine Carcinosarcoma; and TCGA-UVM refers to Uveal Melanoma.

2.2 Reverse-Engineer Scheme

To enrich morphological descriptions, we introduce a reverse-engineer scheme to provide a morphological description extension. It uses diagnostic labels to identify the corresponding morphological characteristics according to tumor classification standards like the Nottingham grading system. This extension is included in the WSI reports that are then refined by clinical experts. This reverse-engineering process is supported by the breadth of medical literature included in GPT-4o’s training data, ensuring a high level of precision that mirrors the expertise of professional pathologists.

Specifically, we take the instance of breast cancer where reports utilize the Nottingham grading system [8], a method frequently used in TCGA datasets. The grading system assesses three critical aspects: gland formation, nuclear pleomorphism, and mitotic count, assigning each a score from 1 to 3. Consider a report that assigns a score of 3 for both gland formation and nuclear pleomorphism. This high score indicates minimal gland formation and substantial variation in nuclear size and shape, suggestive of aggressive tumor characteristics. Furthermore, if the pathology report confirms a diagnosis of invasive ductal carcinoma, it characterizes the cancer cells as having breached duct walls and forming invasive clusters. Such diagnostic details are crucial as they provide clear visual markers that can be directly observed in WSIs. We incorporate these specific findings into our dataset to enrich the morphological descriptions, ensuring that they reflect real-world pathological assessments and are tightly aligned with established diagnostic criteria.

Table 4 presents an example of the Nottingham grading system applied to breast cancer, a method derived from the WHO Classification of Tumors [8]. This system is recognized for its rigor and authority in classifying tumor severity. It systematically scores three critical histological features: gland formation, nuclear pleomorphism, and mitotic count. Each feature is evaluated based on specific morphological characteristics observable in tissue samples. The scores correlate directly with these features, ensuring that the derived descriptions are not only standardized but also clinically meaningful. By adhering to this methodology, the descriptions generated are both accurate reflections of the histopathological findings and relevant for clinical assessment and treatment planning.
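As a small illustration of how the three component scores combine, the sketch below maps them to an overall grade. The total-score cut-offs (3 to 5, 6 to 7, 8 to 9) are the standard Nottingham cut-offs and are not stated in the text above. The function name `nottingham_grade` is hypothetical, and the sketch covers only the numeric grade, not the generation of morphological descriptions.

```python
def nottingham_grade(gland_formation: int, nuclear_pleomorphism: int, mitotic_count: int) -> int:
    """Map three component scores (each 1 to 3) to the overall Nottingham grade (1 to 3)."""
    scores = (gland_formation, nuclear_pleomorphism, mitotic_count)
    if any(s not in (1, 2, 3) for s in scores):
        raise ValueError("each component score must be 1, 2, or 3")
    total = sum(scores)  # ranges from 3 to 9
    if total <= 5:
        return 1  # well differentiated
    if total <= 7:
        return 2  # moderately differentiated
    return 3      # poorly differentiated


# The report discussed above scores 3 for gland formation and 3 for nuclear pleomorphism.
# The mitotic count score of 2 is a hypothetical value chosen for the example.
example_grade = nottingham_grade(3, 3, 2)
print(example_grade)
```

Because two components already contribute 6 points, any report with a score of 3 for both gland formation and nuclear pleomorphism is at least grade 2, and grade 3 once the mitotic count score reaches 2.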

To address hallucinations, we generate three versions of each report and use another LLM to verify if the content aligns with the original reports. The model performs five iterations of selection, and if one version is selected three or more times, it is deemed accurate. If none achieve this threshold, a pathologist reviews and selects the appropriate content. Furthermore, to avoid the pitfalls of incomplete filtering, especially concerning immunohistochemical data and macroscopic descriptions, we use regular expressions to identify and exclude 30 keywords, with any flagged text revised by a pathologist.
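The selection rule described above (three candidate versions, five selection rounds, acceptance at three or more votes, otherwise escalation to a pathologist) can be sketched as follows. The callable `pick_best` is a hypothetical stand-in for the verifying LLM, and the function name `select_report` is likewise an assumption made for this example.

```python
from collections import Counter
from typing import Callable, Optional, Sequence


def select_report(
    versions: Sequence[str],
    pick_best: Callable[[Sequence[str]], int],
    rounds: int = 5,
    threshold: int = 3,
) -> Optional[int]:
    """Return the index of the version chosen at least `threshold` times, else None.

    A return value of None means no version reached the threshold and the
    case should be passed to a pathologist for manual selection.
    """
    votes = Counter(pick_best(versions) for _ in range(rounds))
    winner, count = votes.most_common(1)[0]
    return winner if count >= threshold else None


# Simulated judge decisions (hypothetical): version 0 wins three of five rounds.
clear_picks = iter([0, 2, 0, 1, 0])
chosen = select_report(["v0", "v1", "v2"], lambda _: next(clear_picks))

# A 2-2-1 split leaves no version at the threshold, so the case is escalated.
split_picks = iter([0, 1, 2, 0, 1])
undecided = select_report(["v0", "v1", "v2"], lambda _: next(split_picks))
print(chosen, undecided)
```

With five rounds and a threshold of three, at most one version can be accepted, so the rule never has to break a tie between two accepted versions.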

2.3 Data Construction

Report Generation Task. Fig. 9 illustrates how prompts are used to guide GPT-4o in generating textual reports that encompass critical histopathological assessments. Table 9 shows an example of a WSI and its corresponding report. These examples demonstrate the process of crafting prompts that effectively elicit comprehensive and clinically relevant reports from the model, ensuring essential diagnostic information is accurately captured.

Other Pathological Tasks. We present detailed examples for the other pathological tasks, showing how each prompt is crafted to elicit precise responses from the model that align with clinical expectations and diagnostic standards. The prompts, shown in Figs. 10 to 19, define the scope and focus of each pathological task. Tables 10 to 19 show examples for each task.

2.4 Expert Review and Quality Control

To ensure the accuracy, clinical relevance, and educational value of the generated VQA pairs, we engaged a panel of clinical experts to meticulously review and validate each pair. This team comprises 2 senior pathologists and 5 junior doctors, providing a comprehensive blend of experience and perspectives.

Senior pathologists, leveraging their extensive clinical expertise, critically assess the medical accuracy and relevance of the model’s answers, verifying that the diagnostic conclusions are consistent with current medical standards and practices. Junior doctors focus on evaluating the clarity and practical usefulness of the VQA pairs for medical practitioners at earlier stages of their careers.

The review process for each VQA pair includes:

• Accuracy Verification: Ensuring that the medical facts and conclusions provided align precisely with the visual evidence presented in the Whole Slide Images (WSIs).
• Relevance Assessment: Confirming that each VQA pair addresses typical diagnostic questions pertinent to clinical practice.
• Clarity Evaluation: Checking that the information is communicated clearly and is understandable for educational purposes.

Any discrepancies or inaccuracies identified during the review are addressed by modifying or regenerating the VQA pairs to meet the required standards. This rigorous examination ensures that the final VQA pairs are not only medically accurate but also serve as effective tools for clinical application and training, ultimately enhancing the quality of diagnostic practice.

3 WSI Metrics

We introduce two key metrics developed to evaluate WSI MLLMs: WSI-Precision and WSI-Relevance. These metrics are essential for assessing the accuracy and relevance of model-generated responses in a clinical context.

3.1 WSI-Precision

WSI-Precision measures the accuracy of model responses by comparing them directly to claims derived from the ground-truth response. As shown in Table 2, the evaluation comprises three main steps: claims extraction, individual score calculation, and final score calculation.

Claims Extraction. To break down complex medical information into assessable units, we extract distinct claims from the ground-truth response by using GPT-4o. As shown in Fig. 6, we design the prompt to guide GPT-4o for claims extraction. The second row in Table 2 shows the extracted claims for the ground-truth response.

Individual Score Calculation. With the extracted claims, we compare them with the generated response using our scoring criterion, which is defined as follows:

• Score 1: The information in the model’s response is completely accurate with respect to the claim. The response perfectly aligns with established facts, correctly addressing all elements of the claim.
• Score 0.7: The information is mostly correct and closely aligns with the claim, with minor discrepancies that do not fundamentally alter the truth or clinical significance of the response.
• Score 0.3: The claim is acknowledged, but the response contains significant errors in critical content areas, such as the degree of differentiation or malignancy status, affecting the core message.
• Score 0: The information in the model’s response is completely incorrect or unrelated to the claim, indicating a fundamental misunderstanding or misrepresentation of the clinical facts.

Following this scoring criterion, we design the prompt to guide GPT-4o to compute the individual score for each claim, as demonstrated in Fig. 7. In Table 2, the third row shows the score of each claim.

Final Score Calculation. With the individual score for each claim, we calculate the overall WSI-Precision score by averaging all the individual claim scores.
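The three steps can be sketched as a short routine. This is a minimal sketch, not the released implementation: `judge` is a hypothetical callable standing in for the GPT-4o scoring prompt, and the claims and per-claim scores in the usage example are invented for illustration.

```python
from typing import Callable, Sequence

# The four score levels defined by the scoring criterion.
ALLOWED_SCORES = (0.0, 0.3, 0.7, 1.0)


def claim_based_score(
    claims: Sequence[str],
    target_text: str,
    judge: Callable[[str, str], float],
) -> float:
    """Score each claim against `target_text` and return the mean of the scores."""
    if not claims:
        raise ValueError("at least one claim is required")
    scores = []
    for claim in claims:
        score = judge(claim, target_text)
        if score not in ALLOWED_SCORES:
            raise ValueError(f"unexpected score {score!r} for claim {claim!r}")
        scores.append(score)
    return sum(scores) / len(scores)


# Hypothetical ground-truth claims and judge decisions, for illustration only.
gt_claims = [
    "The tumor is a squamous cell carcinoma.",
    "The tumor is moderately differentiated.",
    "Keratin pearls are present.",
    "Necrosis is not prominently noted.",
]
stub_scores = dict(zip(gt_claims, [1.0, 0.7, 1.0, 0.3]))
model_response = "(model response text)"

# For WSI-Precision, the claims come from the ground truth and are checked
# against the model response.
wsi_precision = claim_based_score(gt_claims, model_response, lambda c, _: stub_scores[c])
print(round(wsi_precision, 3))
```

In this toy case the four claim scores sum to 3.0, so the mean is 0.75.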

3.2 WSI-Relevance

WSI-Relevance assesses the relevance of the claims extracted from model responses in relation to the ground-truth response. As shown in Table 3, it likewise comprises three main steps: claims extraction, individual score calculation, and final score calculation.

Claims Extraction. To extract the claims from the model response, we design the prompt to guide GPT-4o for claims extraction, as illustrated in Fig. 6. Table 3 shows the extracted claims of the model response in the second row.

Individual Score Calculation. With the extracted claims, we compare them with the ground-truth response using our scoring criterion which is the same as that for WSI-Precision. Based on the scoring criterion, we devise the prompt to guide GPT-4o to compute the individual score for each claim, as shown in Fig. 8.

Final Score Calculation. With the individual scores for each claim, we average them and obtain the final WSI-Relevance score.
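The two metrics differ only in which response supplies the claims. Under notation introduced here for illustration (not used elsewhere in the paper), let $R$ be the model response, $G$ the ground-truth response, $C(\cdot)$ the set of claims extracted by GPT-4o, and $s(c, T)\in\{0, 0.3, 0.7, 1\}$ the score of claim $c$ against text $T$. The descriptions above then correspond to:

$$
\text{WSI-Precision}(R, G) = \frac{1}{|C(G)|}\sum_{c\in C(G)} s(c, R),
\qquad
\text{WSI-Relevance}(R, G) = \frac{1}{|C(R)|}\sum_{c\in C(R)} s(c, G).
$$

WSI-Precision therefore checks whether the ground-truth claims are reproduced correctly in the model response, while WSI-Relevance checks whether the claims made by the model are supported by the ground truth.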

By utilizing both WSI-Precision and WSI-Relevance, we aim to provide a comprehensive assessment of the diagnostic accuracy and clinical relevance of model outputs. This structured evaluation ensures that models are not only producing factually correct responses but are also providing information that is pertinent and useful in real-world medical settings.

4 Implementation Details

4.1 Network Architecture

Our model’s network architecture consists of four components: a WSI encoder, a text encoder, a projection layer, and a large language model (LLM). This design enables efficient processing of gigapixel pathology WSI data.

WSI Encoder. The WSI encoder is responsible for extracting meaningful features from gigapixel pathology images and consists of a patch-level encoder and a slide-level encoder. The patch-level encoder employs the tile-level encoder of Prov-GigaPath [27], which is specifically designed for pathology data. The tile-level encoder is pre-trained using the DINOv2 [18] framework on 1.3 billion $256\times 256$ pathology image tiles derived from over 170,000 WSIs spanning 31 major tissue types. Its task is to encode each $256\times 256$ image tile into a compact patch-level feature representation. The slide-level encoder is based on the LongNet [9] architecture. LongNet is a Transformer variant that uses dilated self-attention to reduce the computational complexity of standard self-attention from $O(n^{2})$ to approximately linear in the sequence length $n$. This enables the model to efficiently aggregate features from tens of thousands or even hundreds of thousands of tiles, capturing global spatial context while keeping computational cost manageable.

Text Encoder. We use Bio_ClinicalBERT [1] as our text encoder, a BERT [7] model pre-trained on biomedical and clinical text to optimize its performance in these domains.

Projection Layer. To achieve cross-modal alignment, a projection layer is designed between the WSI encoder and the LLM. The projection layer consists of a two-layer multi-layer perceptron (MLP) module, which maps visual features into a feature space compatible with the language model.
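Written out, a two-layer MLP projection has the following form. The symbols are introduced here for illustration, and the choice of nonlinearity $\sigma$ is not specified in this section; GELU, as used in LLaVA-1.5, is one plausible assumption.

$$
\mathbf{h} = W_{2}\,\sigma\!\left(W_{1}\mathbf{z} + \mathbf{b}_{1}\right) + \mathbf{b}_{2},
\qquad \mathbf{z}\in\mathbb{R}^{d_{v}},\quad \mathbf{h}\in\mathbb{R}^{d_{\text{LLM}}},
$$

where $\mathbf{z}$ is a visual feature produced by the WSI encoder, $d_{v}$ is its dimension, and $d_{\text{LLM}}$ is the embedding dimension of the language model.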

Large Language Model. We use Vicuna-7b-v1.5 [30] as the pre-trained LLM, an open-source language model tailored for dialogue generation. After fine-tuning, Vicuna adapts well to various medical scenarios, generating accurate and natural diagnostic reports.

4.2 Training Details

The training process is divided into three stages, i.e. WSI-text alignment, feature space alignment, and task-specific instruction tuning.

Stage 1: WSI-Text Alignment. The learning rate is set to 0.001, with a batch size of 64. The WSI encoder and text encoder are trained for 50 epochs, with a temperature parameter of 0.02 and hard negative sampling using $k=10$.
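To make the roles of the temperature and of $k$ concrete, the following is a minimal sketch of a temperature-scaled contrastive loss with hard negative sampling. The exact loss form is an assumption: this section gives only the temperature (0.02) and $k=10$. The sketch scores one direction only (WSI to text), uses cosine similarity, and takes the $k$ most similar non-matching texts in the batch as negatives. The function name `hard_negative_info_nce` is hypothetical.

```python
import math
from typing import List, Sequence


def _dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def _normalize(v: Sequence[float]) -> List[float]:
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v]


def hard_negative_info_nce(
    wsi_embs: Sequence[Sequence[float]],
    text_embs: Sequence[Sequence[float]],
    temperature: float = 0.02,
    k: int = 10,
) -> float:
    """Mean InfoNCE-style loss where row i of each input forms a matched pair."""
    wsi = [_normalize(v) for v in wsi_embs]
    txt = [_normalize(v) for v in text_embs]
    losses = []
    for i, w in enumerate(wsi):
        sims = [_dot(w, t) for t in txt]
        pos = sims[i] / temperature
        # Hard negatives: the k non-matching texts most similar to this WSI.
        negs = sorted((s for j, s in enumerate(sims) if j != i), reverse=True)[:k]
        logits = [pos] + [s / temperature for s in negs]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denom - pos)
    return sum(losses) / len(losses)


# Toy embeddings (hypothetical): orthogonal unit vectors.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shifted = [identity[1], identity[2], identity[0]]  # every text paired with the wrong WSI

aligned_loss = hard_negative_info_nce(identity, identity)
mismatched_loss = hard_negative_info_nce(identity, shifted)
print(aligned_loss, mismatched_loss)
```

A small temperature such as 0.02 sharpens the softmax: a cosine similarity gap of 1 becomes a logit gap of 50, so correctly aligned pairs give a loss near zero while mismatched pairs are penalized heavily.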

Stage 2: Feature Space Alignment. In this stage, we freeze both the WSI encoder and LLM, and only train the projection layer. The learning rate is set to 1e-3, with 1 training epoch. The maximum sequence length is configured as 2048, weight decay is set to 0, and the warmup ratio is set to 0.03.

Stage 3: Task-Specific Instruction Tuning. In this stage, we freeze the WSI encoder and projection layer and train only the LLM. The learning rate is set to 2e-5, with 1 training epoch. The maximum sequence length is 2048, weight decay is 0, and the warmup ratio remains 0.03. We use LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning, with a rank of 128 and an alpha value of 256, enabling adaptation to the target dataset with few additional parameters. To improve computational efficiency, we use DeepSpeed with ZeRO-3 optimization for distributed training. Batch-level modality grouping ensures balanced processing across multi-modal inputs. Additionally, BF16 precision is adopted, and TensorFloat-32 (TF32) is enabled to accelerate matrix operations while maintaining numerical stability.
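For reference, the rank and alpha values determine the scale of the low-rank update. In the standard LoRA formulation (the symbols below are introduced here for illustration), a frozen weight matrix $W_{0}\in\mathbb{R}^{d\times m}$ is adapted as

$$
W = W_{0} + \frac{\alpha}{r}\,BA,
\qquad B\in\mathbb{R}^{d\times r},\quad A\in\mathbb{R}^{r\times m},
$$

so with $r=128$ and $\alpha=256$ the update $BA$ is scaled by $\alpha/r = 2$. Each adapted matrix adds $r(d+m)$ trainable parameters instead of $dm$.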

Table 2: Evaluation process for WSI-Precision. Step 1 involves extracting claims from the ground-truth response, step 2 evaluates the correctness of the model responses against these claims using a defined scoring system, and step 3 calculates the average score from Step 2 as the final WSI-Precision score.

	Model Response	Ground-Truth Response
	Microscopic observation of the pathology slide reveals moderately differentiated squamous cell carcinoma characterized by nests and sheets of atypical squamous epithelial cells with intercellular bridges and keratinization. The tumor cells exhibit moderate nuclear pleomorphism with enlarged, hyperchromatic nuclei and prominent nucleoli. Mitotic figures are present, indicating active cell division. There is no evidence of vascular invasion, necrosis, or calcification within the examined tissue. Based on these observations, the diagnosis is moderately differentiated keratinizing squamous carcinoma.	Microscopic observation of the pathology slide reveals sheets and nests of atypical squamous cells with moderate differentiation. There is evidence of focal keratinization, characterized by the presence of keratin pearls within the tumor tissue. Intercellular bridges are noted among the squamous cells, indicating differentiation towards squamous epithelium. The cellular morphology shows moderate pleomorphism, with variability in nuclear size and shape, and occasional prominent nucleoli. Mitotic figures are present but not excessively numerous. Vascular invasion is not specified in the observations, and necrosis is not prominently noted. Based on these observations, the diagnosis is squamous cell carcinoma, moderately differentiated.
Steps	Descriptions	Outputs
Step 1	Carefully extract claims from the Ground Truth answers. This involves identifying distinct pieces of information that are directly relevant to the context, preserving logical connections within related ideas. The claims serve as the foundation for evaluating the model’s responses in subsequent steps.	• Claim₁: Microscopic observation reveals sheets and nests of atypical squamous cells with moderate differentiation. • Claim₂: There is evidence of focal keratinization characterized by keratin pearls within the tumor tissue. • Claim₃: Intercellular bridges indicate differentiation towards squamous epithelium. • Claim₄: The cellular morphology shows moderate pleomorphism with variability in nuclear size and shape and occasional prominent nucleoli. • Claim₅: Mitotic figures are present but not excessively numerous. • Claim₆: Vascular invasion is not specified. • Claim₇: Necrosis is not prominently noted. • Claim₈: The diagnosis is squamous cell carcinoma, moderately differentiated.
Step 2	Evaluate the correctness of the Model Responses for each extracted claim using the WSI-Precision scoring criteria. Assign a score to each claim (1, 0.7, 0.3, or 0) based on how accurately the model’s response aligns with the clinical information in the Ground Truth. Provide an explanation for each score and present the evaluations as a list of dictionaries.	• Claim₁: Microscopic observation reveals sheets and nests of atypical squamous cells with moderate differentiation. – Explanation: The dialogue correctly describes the presence of nests and sheets of atypical squamous epithelial cells as part of a moderately differentiated squamous cell carcinoma. – Score₁: 1 • Claim₂: There is evidence of focal keratinization characterized by keratin pearls within the tumor tissue. – Explanation: The dialogue mentions keratinization but does not specifically mention keratin pearls. However, keratinization typically involves keratin pearls, aligning closely with the claim. – Score₂: 0.7 • … • Claim₆: Vascular invasion is not specified. – Explanation: The dialogue explicitly states there is no evidence of vascular invasion, which contradicts the claim that vascular invasion is not specified. – Score₆: 0 • … (Additional evaluations omitted for brevity) …
Step 3	Compute the average score from Step 2 to obtain the final precision score. This score represents the overall correctness of the model’s responses in relation to the extracted claims.	$\text{Final Score}=\frac{\sum_{i=1}^{n}\text{Score}_{i}}{n}$, where $\text{Score}_{i}$ is the score assigned to the $i$-th claim and $n$ is the total number of evaluated claims.
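
Step 3 of Table 2 reduces to a plain average over the per-claim scores. The sketch below assumes the Step 2 evaluations arrive as a list of dictionaries with the hypothetical keys `claim` and `score`. It uses only the three scores that Table 2 actually shows (claims 1, 2, and 6), so the printed value illustrates the arithmetic and is not the final WSI-Precision for that example.

```python
ALLOWED_SCORES = {0.0, 0.3, 0.7, 1.0}


def average_claim_score(evaluations):
    """Average the per-claim scores produced in Step 2."""
    if not evaluations:
        raise ValueError("at least one evaluated claim is required")
    for item in evaluations:
        if item["score"] not in ALLOWED_SCORES:
            raise ValueError(f"unexpected score: {item['score']}")
    return sum(item["score"] for item in evaluations) / len(evaluations)


# Only the claims whose scores are shown in Table 2; the rest are omitted there.
shown_evaluations = [
    {"claim": "Claim 1: sheets and nests of atypical squamous cells", "score": 1},
    {"claim": "Claim 2: focal keratinization with keratin pearls", "score": 0.7},
    {"claim": "Claim 6: vascular invasion is not specified", "score": 0},
]
partial_score = average_claim_score(shown_evaluations)
print(round(partial_score, 3))  # 0.567
```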

Table 3: Evaluation process for WSI-Relevance. Step 1 involves extracting claims from the model response, step 2 assesses the relevance of these claims to the Ground-Truth Response, and step 3 calculates the average relevance score from Step 2 as the final WSI-Relevance score.

	Model Response	Ground-Truth Response
	Microscopic observation of the pathology slide reveals moderately differentiated squamous cell carcinoma characterized by nests and sheets of atypical squamous epithelial cells with intercellular bridges and keratinization. The tumor cells exhibit moderate nuclear pleomorphism with enlarged, hyperchromatic nuclei and prominent nucleoli. Mitotic figures are present, indicating active cell division. There is no evidence of vascular invasion, necrosis, or calcification within the examined tissue. Based on these observations, the diagnosis is moderately differentiated keratinizing squamous carcinoma.	Microscopic observation of the pathology slide reveals sheets and nests of atypical squamous cells with moderate differentiation. There is evidence of focal keratinization, characterized by the presence of keratin pearls within the tumor tissue. Intercellular bridges are noted among the squamous cells, indicating differentiation towards squamous epithelium. The cellular morphology shows moderate pleomorphism, with variability in nuclear size and shape, and occasional prominent nucleoli. Mitotic figures are present but not excessively numerous. Vascular invasion is not specified in the observations, and necrosis is not prominently noted. Based on these observations, the diagnosis is squamous cell carcinoma, moderately differentiated.
Steps	Descriptions	Outputs
Step 1	Extract claims from the Model Response, focusing on identifying distinct and relevant pieces of information that form the basis of the pathological assessment. These claims must preserve the logical and clinical coherence of the model’s response, serving as the foundation for relevance evaluation in the following steps.	• Claim₁: The pathology slide reveals moderately differentiated squamous cell carcinoma. • Claim₂: The carcinoma is characterized by nests and sheets of atypical squamous epithelial cells with intercellular bridges and keratinization. • Claim₃: Tumor cells exhibit moderate nuclear pleomorphism with enlarged, hyperchromatic nuclei and prominent nucleoli. • Claim₄: Mitotic figures are present, indicating active cell division. • Claim₅: There is no evidence of vascular invasion, necrosis, or calcification within the examined tissue. • Claim₆: The diagnosis is moderately differentiated keratinizing squamous carcinoma.
Step 2	Assess the relevance of each extracted claim from the Model Response to the Ground-Truth Response. Assign a relevance score (1, 0.7, 0.3, or 0) based on how well each claim reflects the clinical details and context provided in the Ground Truth. Provide a justification for each relevance score, ensuring that the evaluation clearly outlines how each claim aligns or deviates from the ground-truth observations.	• Claim₁: The pathology slide reveals moderately differentiated squamous cell carcinoma. – Explanation: The ground truth confirms the presence of moderately differentiated squamous cell carcinoma, matching the claim accurately. – Score₁: 1 • Claim₂: The carcinoma is characterized by nests and sheets of atypical squamous epithelial cells with intercellular bridges and keratinization. – Explanation: The ground truth describes similar cellular structures, but the focus on keratinization is less pronounced, leading to a slightly lower score. – Score₂: 0.7 • … • Claim₄: Mitotic figures are present, indicating active cell division. – Explanation: The presence of mitotic figures is confirmed, aligning with the claim of active cell division, and is fully supported by the ground truth. – Score₄: 1 • … (Additional evaluations omitted for brevity) …
Step 3	Compute the average score from Step 2 to obtain the final relevance score. This score quantifies how closely the model’s responses align with the key contextual elements of the ground truth.	$\text{Final Score}=\frac{\sum_{i=1}^{n}\text{Score}_{i}}{n}$, where $\text{Score}_{i}$ is the score assigned to the $i$-th claim and $n$ is the total number of evaluated claims.
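
WSI-Precision and WSI-Relevance share the same averaging step and differ in which text the claims are extracted from and which text they are checked against. The sketch below makes that asymmetry explicit. The `extract` and `judge` callables are deliberately naive placeholders (sentence splitting and substring matching) that stand in for the claim extraction and scoring described in Tables 2 and 3; they are assumptions, not the evaluation procedure itself.

```python
from typing import Callable, List

ClaimExtractor = Callable[[str], List[str]]
ClaimJudge = Callable[[str, str], float]  # (claim, reference_text) -> score


def claim_based_score(source_text, reference_text, extract, judge):
    # Extract claims from source_text and score each one against reference_text.
    claims = extract(source_text)
    return sum(judge(claim, reference_text) for claim in claims) / len(claims)


def wsi_precision(model_response, ground_truth, extract, judge):
    # Claims come from the ground truth and are checked against the model response.
    return claim_based_score(ground_truth, model_response, extract, judge)


def wsi_relevance(model_response, ground_truth, extract, judge):
    # Claims come from the model response and are checked against the ground truth.
    return claim_based_score(model_response, ground_truth, extract, judge)


# Placeholder components for the toy example only.
def extract(text):
    return [sentence.strip() for sentence in text.split(".") if sentence.strip()]


def judge(claim, reference):
    return 1.0 if claim.lower() in reference.lower() else 0.0


ground_truth = "Keratinization is present. Mitotic figures are present."
model_response = "Mitotic figures are present."
print(wsi_precision(model_response, ground_truth, extract, judge))  # 0.5
print(wsi_relevance(model_response, ground_truth, extract, judge))  # 1.0
```

In this toy case the model response omits a ground-truth finding, which lowers the precision-style score, while everything it does say is supported, which keeps the relevance-style score at its maximum.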

Table 4: Detailed diagnostic steps of the Nottingham grading system for breast cancer, illustrating the scoring and corresponding morphological descriptions based on gland formation, nuclear pleomorphism, and mitotic activity.

Diagnostic Step	Score	Morphological Description
Gland Formation	1 - Well-formed	Glands are well-differentiated
Gland Formation	2 - Moderately formed	Glands show moderate differentiation
Gland Formation	3 - Poorly formed	Minimal to no gland formation
Nuclear Pleomorphism	1 - Small, uniform	Nuclei are small and uniform
Nuclear Pleomorphism	2 - Moderate variability	Nuclei show moderate variability in size and shape
Nuclear Pleomorphism	3 - Marked variability	Nuclei show significant variation in size and shape
Mitotic Count	Low - Few mitoses	Few mitotic figures are present
Mitotic Count	Medium - Moderate mitoses	Moderate number of mitotic figures
Mitotic Count	High - Numerous mitoses	High number of mitotic figures
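
The three component scores in Table 4 combine into an overall grade by summation. The sketch below uses the standard Nottingham cut-offs (total 3 to 5 is Grade 1, 6 to 7 is Grade 2, 8 to 9 is Grade 3) and assumes the Low/Medium/High mitotic categories map to scores 1/2/3. The function name `nottingham_grade` is hypothetical.

```python
def nottingham_grade(gland_formation, nuclear_pleomorphism, mitotic_count):
    """Return (total_score, grade) from three component scores in {1, 2, 3}."""
    components = {
        "gland_formation": gland_formation,
        "nuclear_pleomorphism": nuclear_pleomorphism,
        "mitotic_count": mitotic_count,
    }
    for name, value in components.items():
        if value not in (1, 2, 3):
            raise ValueError(f"{name} must be 1, 2, or 3, got {value!r}")
    total = sum(components.values())
    if total <= 5:
        grade = 1
    elif total <= 7:
        grade = 2
    else:
        grade = 3
    return total, grade


print(nottingham_grade(3, 1, 1))  # (5, 1)
print(nottingham_grade(3, 2, 1))  # (6, 2)
print(nottingham_grade(3, 3, 2))  # (8, 3)
```

A single point of difference in one component can move a case across a grade boundary, as with totals of 5 and 6.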

Image
Question	How is this invasive lobular carcinoma graded according to the Nottingham system?
WSI-LLaVA	This invasive lobular carcinoma is graded as Nottingham Histologic Grade 1, with a total score of 5 out of 9. This grading is based on a tubule formation score of 3, a nuclear pleomorphism score of 1, and a mitotic count score of 1. A Grade 1 designation indicates a well-differentiated tumor, suggesting a more favorable prognosis compared to higher-grade tumors.
Quilt-LLaVA	This invasive lobular carcinoma is graded as Nottingham Histologic Grade 2, with a total score of 6 out of 9. This grading reflects a tubular formation score of 3, indicating poor differentiation, a nuclear pleomorphism score of 2, and a mitotic count score of 1, which corresponds to a moderately differentiated tumor.
Ground Truth	This invasive lobular carcinoma is graded as Nottingham Histologic Grade 1. This grading is determined by a tubular formation score of 3, indicating poor glandular differentiation, a nuclear pleomorphism score of 1, reflecting minimal variability in nuclear features, and a mitotic rate score of 1, indicating low level of cellular proliferation. These components collectively classify the tumor as low-grade, corresponding to a well-differentiated carcinoma.

Image
Question	What are the main observations in the whole slide histopathology image, and what is the likely diagnosis?
WSI-LLaVA	Microscopic observation of the pathology slide reveals poorly differentiated squamous cell carcinoma characterized by sheets and nests of atypical squamous epithelial cells with minimal keratinization. The tumor cells exhibit marked nuclear pleomorphism, with enlarged, irregular nuclei and prominent nucleoli. There is a high rate of mitotic activity, indicative of aggressive cellular proliferation. Vascular invasion is evident, with tumor cells infiltrating lymphovascular spaces. Areas of necrosis are present within the tumor mass. Based on these observations, the diagnosis is poorly differentiated invasive squamous carcinoma.
MI-Gen	Microscopic observation of the pathology slide reveals a poorly differentiated adenocarcinoma characterized by a mixture of atypical squamous cell carcinoma. The tumor exhibits a high degree of differentiation with a high degree of nuclear pleomorphism and increased mitotic activity. There is no evidence of vascular invasion or necrosis or calcification or calcification. Based on these observations the diagnosis is poorly differentiated.
Ground Truth	Microscopic observation of the pathology slide reveals poorly differentiated squamous cell carcinoma characterized by extensive invasion with minimal keratinization and significant cellular atypia. The tumor displays extensive necrosis alongside regions of calcification within the tissue. Additionally, there is evidence of tumor infiltration into surrounding structures, with observable lymphovascular invasion. Based on these observations, the diagnosis is squamous cell carcinoma, Grade 4 (of 4).

Example of Report Generation
WSI
Question	What are the main observations in the whole slide histopathology image, and what is the likely diagnosis?
Answer	Microscopic observation of the pathology slide reveals tumor cells infiltrating the surrounding stroma with minimal formation of glandular structures, significantly deviating from normal tissue architecture, and showing a poorly differentiated state with a tubule formation score of 3. The nuclei exhibit considerable size and shape variability, corresponding to a nuclear pleomorphism score of 3. There is a moderate rate of cell division observed, reflected by a mitotic count score of 2. Vascular invasion is not unequivocally present, and no necrosis or microcalcifications are observed. The intraductal component is noted to be less than 25% with prominent cancerization. Based on these observations, the diagnosis is infiltrating ductal carcinoma, no special type, Nottingham Grade III, poorly differentiated, with a total Nottingham score of 8/9.

Example of Regional Structure Description
WSI

Question	What observations can you make about the tissue architecture on this slide?
Answer	The tissue architecture is significantly disrupted, marked by minimal formation of glandular structures. The poorly differentiated state results in a loss of the typical organized glandular architecture, and instead, cells are haphazardly arranged. There is less than 25% of the slide showing an intraductal component, and this area exhibits prominent cancerization of lobules.

Example of Specific Feature Description
WSI

Question	Do the nuclei of the tumor cells show nuclear pleomorphism?
Answer	Indeed, the nuclei of the tumor cells exhibit considerable nuclear pleomorphism, featuring significant variability in size and shape. They present irregular contours, are substantially larger than normal, and possess dense chromatin, highlighting marked atypical nuclear features.

Example of Global Morphology Description
WSI

Question	Can you provide an overall description of the whole slide image (WSI)?
Answer	The whole slide image reveals malignant epithelial cells infiltrating the surrounding stroma with minimal glandular structures. The normal tissue architecture is significantly disrupted. The tumor cells are poorly differentiated, displaying large variability in nuclear size and shape, indicative of substantial nuclear pleomorphism. A moderate rate of cell division is observed among the tumor cells. There is no evident vascular invasion, necrosis, or microcalcifications. The intraductal component is less than 25% with notable cancerization present.

Example of Key Diagnostic Description
WSI

Question	Among the features observed, which are particularly critical for making a diagnosis?
Answer	Key features essential for diagnosis include the extensive nuclear pleomorphism and minimal glandular structure formation, both of which point to poor differentiation. The moderate mitotic rate is also significant in assessing cellular proliferation. Another important factor is the presence of a prominent cancerization of ducts, albeit the intraductal component is less than 25%. These aspects collectively depict a highly invasive and aggressive nature of the tumor.

Example of Histological Typing
WSI

Question	How would you classify the histological type of cancer observed on the slide?
Answer	The observed slide is indicative of invasive lobular carcinoma. This classification is due to the presence of epithelial cells with minimal glandular differentiation, infiltrating the stroma in a characteristic pattern. The moderate nuclear pleomorphism and the absence of vascular or lymphatic invasion are consistent with this histological type.

Example of Grading
WSI

Question	What is the Nottingham Histologic Grade of this tumor based on the observed features?
Answer	The tumor is classified as Nottingham Histologic Grade 3, with a total score of 8 out of 9. This grading is based on the poor tubule formation (score of 3), marked nuclear pleomorphism (score of 3), and moderate mitotic activity (score of 2). This indicates a high-grade tumor with poor differentiation and suggests a potentially more aggressive clinical course.

Example of Molecular Subtyping
WSI

Question	Based on the IHC results showing P16 negativity and HR, HPV ISH status negativity, what is the molecular subtype in this head and neck squamous cell carcinoma?
Answer	The molecular subtype is HNSC HPV-

Example of Treatment Recommendations
WSI

Question	What is the recommended treatment plan for a patient with invasive ductal carcinoma, Nottingham grade 2, TNM staging T2N1M0, ER-positive, PR-positive, and HER2-negative?
Answer	1. Surgery: Perform breast-conserving surgery or mastectomy, depending on tumor size and patient preference, followed by sentinel lymph node biopsy or axillary lymph node dissection. 2. Chemotherapy: Consider adjuvant chemotherapy due to nodal involvement and to reduce recurrence risk. 3. Endocrine Therapy: Initiate tamoxifen or an aromatase inhibitor for at least 5 years, given the ER/PR positivity. 4. Radiation Therapy: Advise adjuvant radiation therapy, especially if breast-conserving surgery was performed, to minimize local recurrence risk.

Sample Comparison of Treatment Recommendations
WSI
Question	What is the recommended treatment plan for a patient with a diagnosis of invasive ductal carcinoma, Nottingham Histologic Grade 3, TNM staging T3N1, ER-positive, PR-positive, and HER2-negative?
WSI-LLaVA	1. Surgery: Consider mastectomy or breast-conserving surgery with axillary lymph node dissection due to T3 tumor size and N1 nodal involvement. 2. Chemotherapy: Adjuvant chemotherapy is recommended given the high histologic grade and nodal involvement to reduce recurrence risk. 3. Endocrine Therapy: Initiate tamoxifen or aromatase inhibitors for at least 5 years due to ER/PR positivity. 4. Radiation Therapy: Post-surgical radiation therapy is advised, especially if breast conservation is performed or in the presence of significant nodal disease.
Quilt-LLaVA	1. Surgery: Consider mastectomy or breast-conserving surgery with axillary lymph node dissection due to T3N1 staging. 2. Chemotherapy: Adjuvant chemotherapy is recommended given the high grade and nodal involvement. 3. Endocrine Therapy: Initiate tamoxifen or aromatase inhibitors for at least 5 years due to ER/PR positivity. 4. Radiation Therapy: Adjuvant radiation therapy is advised, especially if breast-conserving surgery was performed or due to significant nodal involvement.
WSI-VQA	1 . surgery mastectomy or breast-conserving surgery lumpectomy or mastectomy based on patient preference followed by sentinel lymph node biopsy or axillary lymph node dissection due to nodal involvement . 2 . endocrine therapy initiate tamoxifen or aromatase inhibitors for at least 5 years due to er/pr positivity . 3 . radiation therapy consider adjuvant radiation therapy especially if breast-conserving surgery was performed to reduce local recurrence risk .
GPT-4o	For a patient with invasive ductal carcinoma, Nottingham Histologic Grade 3, TNM staging T3N1, ER-positive, PR-positive, and HER2-negative, the recommended treatment plan typically includes: 1. Surgery: Lumpectomy or mastectomy depending on the tumor size and patient preference. Axillary lymph node dissection or sentinel lymph node biopsy to address nodal involvement. 2. Adjuvant Chemotherapy: Given the tumor size (T3) and histologic grade (Grade 3), chemotherapy is often recommended. A regimen such as AC-T (doxorubicin and cyclophosphamide followed by paclitaxel) might be used. 3. Radiation Therapy: Typically recommended after surgery, especially if breast-conserving surgery (lumpectomy) is performed. It may also be considered after mastectomy if the tumor is large or if lymph nodes are positive. 4. Endocrine Therapy: Since the tumor is ER-positive and PR-positive, endocrine therapy is crucial. Options include tamoxifen or aromatase inhibitors, depending on menopausal status. 5. Follow-up and Monitoring: Regular follow-up visits for physical exams, mammograms (if breast tissue remains), and monitoring for any recurrence or metastasis.
Ground Truth	1. Surgery: Perform a mastectomy or breast-conserving surgery with axillary lymph node dissection based on tumor size and nodal involvement. 2. Chemotherapy: Adjuvant chemotherapy is recommended given the high tumor grade and nodal involvement. 3. Endocrine Therapy: Initiate tamoxifen or aromatase inhibitors for at least 5 years due to ER and PR positivity. 4. Radiation Therapy: Consider post-surgical radiation therapy, especially if breast-conserving surgery was chosen or if significant nodal involvement is present.

Example of Staging
WSI

Question	A patient has a 3 cm tumor, no regional lymph node involvement, and no distant metastasis. What is the TNM staging for this patient?
Answer	The TNM staging for this patient is T2 N0 M0.

Example of Prognosis
WSI

Question	Given the diagnosis of invasive ductal carcinoma (Nottingham Grade 3), TNM staging of T4BN1M0, and IHC results showing 10-75% positivity for hormone receptors with negative HER2 protein, is the prognosis good or poor?
Answer	The prognosis is poor. The high histologic grade, advanced tumor size, and lymph node involvement indicate aggressive disease, despite partial hormone receptor positivity. Negative HER2 limits targeted therapy options.

Sample Comparison of Report Generation
WSI
Question	What are the main observations in the whole slide histopathology image, and what is the likely diagnosis?
WSI-LLaVA	Microscopic observation of the pathology slide reveals moderately differentiated squamous cell carcinoma characterized by nests and sheets of atypical squamous epithelial cells with intercellular bridges and keratinization. The tumor cells exhibit moderate nuclear pleomorphism with enlarged, hyperchromatic nuclei and prominent nucleoli. Mitotic figures are present, indicating active cell division. There is no evidence of vascular invasion, necrosis, or calcification within the examined tissue. Based on these observations, the diagnosis is moderately differentiated keratinizing squamous carcinoma.
MI-Gen	microscopic observation of the pathology slide reveals a poorly differentiated squamous cell carcinoma characterized by a lack of squamous cell carcinoma . the tumor exhibits a lack of squamous differentiation with a lack of squamous differentiation . the tumor cells exhibit moderate nuclear pleomorphism with variability in nuclear size and shape and there is a moderate rate of cell division . vascular invasion is not identified . necrosis is not observed within the tumor . based on these observations the diagnosis is squamous cell carcinoma .
Hist-Gen	microscopic observation of the pathology slide reveals a poorly differentiated squamous cell carcinoma characterized by irregular nests and sheets of atypical squamous cells with intercellular bridges and keratinization . the tumor cells exhibit a high degree of nuclear pleomorphism . there is no evidence of vascular invasion or necrosis within the tumor tissue . the surrounding lung parenchyma shows areas of chronic inflammation and chronic inflammation . based on these observations the diagnosis is squamous cell carcinoma poorly differentiated .
Quilt-LLaVA	Microscopic observation of the pathology slide reveals sheets and nests of atypical squamous cells with moderate differentiation. There is evidence of focal keratinization, characterized by the presence of keratin pearls within the tumor tissue. Intercellular bridges are noted among the squamous cells, indicating differentiation towards squamous epithelium. The cellular morphology shows moderate pleomorphism, with variability in nuclear size and shape, and occasional prominent nucleoli. Mitotic figures are present but not excessively numerous. Vascular invasion is not specified in the observations, and necrosis is not prominently noted. Based on these observations, the diagnosis is squamous cell carcinoma, moderately differentiated.
GPT-4o	This histopathology image shows a well-circumscribed, nodular lesion with a dense, cellular stroma. There are areas of necrosis and hemorrhage, and the tissue architecture appears disrupted. The lesion has a heterogeneous appearance with varying cellular densities and some fibrotic regions. Diagnosis: The features are consistent with a uterine fibroid (leiomyoma), which is characterized by a smooth muscle cell growth forming a circumscribed mass with areas of degeneration.
Ground Truth	Microscopic observation of the pathology slide reveals sheets and nests of atypical squamous cells with moderate differentiation. There is evidence of focal keratinization, characterized by the presence of keratin pearls within the tumor tissue. Intercellular bridges are noted among the squamous cells, indicating differentiation towards squamous epithelium. The cellular morphology shows moderate pleomorphism, with variability in nuclear size and shape, and occasional prominent nucleoli. Mitotic figures are present but not excessively numerous. Vascular invasion is not specified in the observations, and necrosis is not prominently noted. Based on these observations, the diagnosis is squamous cell carcinoma, moderately differentiated.

Sample Comparison of Global Morphology Description
WSI
Question	Can you provide an overall description of the whole slide image (WSI)?
WSI-LLaVA	The whole slide image reveals exuberant granulation tissue characterized by notable fibroblastic proliferation and myxoid changes. The cellular component is primarily composed of spindle-shaped cells within a myxoid matrix. There is an absence of normal tissue architecture as the granulation tissue dominates the observed section. Importantly, no residual sarcoma tissue is identifiable within this examined section.
Quilt-LLaVA	The whole slide image reveals a tumor with a diffuse architectural pattern, characterized by a mixture of alveolar formations and glandular cell differentiation. There is a notable presence of large cells with prominent nucleoli and moderate nuclear atypia. The mitotic activity is moderate, with 8 mitoses per 10 high power fields. The tumor shows significant vascular invasion, with tumor cells infiltrating the renal vein. Additionally, there is focal necrosis within the tumor tissue.
WSI-VQA	the whole slide image reveals a biphasic tumor composed of spindle-shaped cells arranged in a storiform pattern . these spindle cells are arranged in a storiform pattern . the spindle cells are arranged in a storiform pattern . the spindle cells are arranged in a storiform pattern . there is a notable absence of necrosis or calcification within the tumor.
GPT-4o	The whole slide image appears to be a tissue section stained with hematoxylin and eosin (H&E), commonly used in histopathology. The sample is divided into several sections, likely representing different orientations or parts of the tissue. - Coloration: The pink and purple hues suggest typical H&E staining, where eosin stains cytoplasmic components pink and hematoxylin stains nuclear material blue or purple. - Tissue Architecture: The sections show varying thickness and include both smooth and irregular contours, indicating possible different tissue types or regions. - Other Observations: There are areas that seem to contain more densely packed cells and some regions with more open or less cellular areas, suggesting different tissue densities or types. A detailed examination at higher magnification would be necessary for further analysis, such as identifying specific pathological features or structures.
Ground Truth	The whole slide image demonstrates exuberant granulation tissue characterized by notable fibroblastic proliferation and myxoid changes. The cellular component is primarily composed of spindle-shaped cells within a myxoid matrix. There is an absence of normal tissue architecture as the granulation tissue dominates the observed section. Importantly, no residual sarcoma tissue is identifiable within this examined section.

Sample Comparison of Key Diagnostic Description
WSI
Question	Which observed features are most significant for making a diagnosis?
WSI-LLaVA	The most significant features include the papillary architecture with fibrovascular cores, the presence of psammoma bodies, and the nuclear characteristics such as overlapping, grooves, and intranuclear cytoplasmic inclusions. The absence of vascular invasion, necrosis, and calcification also plays a crucial role in the diagnostic process.
Quilt-LLaVA	The most significant features include the irregular clusters of epithelial cells with poor glandular differentiation, marked nuclear pleomorphism, and the presence of perineural invasion. The high mitotic activity and evidence of lymphovascular invasion are also critical, as they indicate aggressive tumor behavior.
WSI-VQA	the most significant features include the papillary architecture with fibrovascular cores and the presence of fibrovascular cores . the absence of vascular invasion and necrosis further supports the diagnosis . the absence of necrosis and calcification further supports the diagnosis .
GPT-4o	To make a diagnosis from this histopathological image, the following observed features are significant: 1. Cellular Architecture and Arrangement: Observe the overall structure and any abnormal formations or patterns within the tissue. 2. Presence of Lesions or Masses: Look for areas indicating hyperplasia, neoplasia, or other abnormal growths. 3. Inflammation Indicators: Note any signs of infiltration by immune cells, such as lymphocytes or macrophages. 4. Vascular Changes: Identify any alterations in blood vessels, such as congestion, hemorrhage, or new vessel formation. 5. Stromal Changes: Assess changes in the supporting tissue, like fibrosis or stromal desmoplasia. 6. Glandular Structures or Cyst Formation: If applicable, evaluate the formation of glands or cystic spaces. A detailed examination considering these features, along with clinical correlation, aids in reaching an accurate diagnosis.
Ground Truth	The key features of high diagnostic importance include the branching papillary formations and the unique nuclear features of the neoplastic epithelial cells. The presence of crowding, overlapping nuclei with prominent grooves, and occasional intranuclear inclusions are particularly crucial. The absence of vascular invasion, necrosis, or calcification also provides essential context in the diagnostic evaluation.

Sample Comparison of Regional Structure Description
WSI
Question	What observations can you make about the tissue architecture on this slide?
WSI-LLaVA	The tissue architecture is characterized by a sheet-like growth pattern, where the tumor cells are densely packed, exhibiting minimal stromal separation. This creates an overall cohesive mass with sparse connective tissue presence.
Quilt-LLaVA	The tissue architecture is characterized by nests and sheets of atypical melanocytes. These cells are arranged in a disorganized pattern, with a lack of the normal architectural structure typically seen in healthy tissue. The presence of necrosis within the tumor further disrupts the tissue architecture.
WSI-VQA	the tissue architecture is characterized by papillary structures lined by neoplastic epithelial cells . these papillae are lined by layers of cuboidal to columnar epithelial cells . the papillae are lined by a single layer of cuboidal to columnar epithelial cells
GPT-4o	The tissue slide shows several features: 1. Layered Structure: The tissue exhibits a layered architecture indicative of an organ with distinct layers, such as the gastrointestinal tract. 2. Heterogeneity: There are regions with varied staining intensity, suggesting variability in cellular composition or density. 3. Crypt Structures: There are glandular or crypt-like structures present, typical of mucosal tissues like the colon. 4. Villi or Projections: Some areas show protrusions or folds, which could be part of the mucosal surface architecture. 5. Potential Inflammation: If there is any increased cellularity in certain areas, it could indicate inflammation or infiltration. 6. Connective Tissue: Lighter stained areas may suggest the presence of connective or submucosal tissue. These observations can help identify the type of tissue and any pathological changes if present.
Ground Truth	The tissue architecture is characterized by a sheet-like growth pattern, where the tumor cells are densely packed, exhibiting minimal stromal separation. This creates an overall cohesive mass with sparse connective tissue presence.

Sample Comparison of Specific Feature Description
WSI
Question	Do the epithelial cells infiltrate the surrounding stromal tissue or exhibit invasiveness?
WSI-LLaVA	Yes, the epithelial cells are infiltrating the surrounding stromal tissue, demonstrating invasive characteristics. They have breached their normal boundaries and are penetrating into the stroma, disrupting the typical tissue architecture. This invasion is accompanied by a desmoplastic stromal response, indicating the body’s reaction to the invading tumor cells.
Quilt-LLaVA	Yes, the epithelial cells are infiltrating the surrounding stromal tissue, demonstrating invasive characteristics. They have breached their normal boundaries and are penetrating into the stroma, disrupting the typical tissue architecture. This invasion is accompanied by minimal glandular formation, indicating a significant deviation from normal tissue organization.
WSI-VQA	yes the epithelial cells are infiltrating the surrounding stromal tissue demonstrating invasive characteristics . they have breached their usual boundaries and are penetrating into the stroma disrupting the normal tissue architecture . this infiltration is a key feature of the epithelial cells .
GPT-4o	The image shows a histological section. However, at this magnification, it is difficult to determine if epithelial cells are infiltrating the surrounding stromal tissue or exhibiting invasiveness. To make an accurate assessment, a higher magnification would be necessary to examine the interactions between epithelial cells and stroma more closely. If invasiveness is suspected, it is best evaluated through additional histological analysis with higher resolution images.
Ground Truth	Yes, the epithelial cells are infiltrating the surrounding stromal tissue. They extend beyond their usual confines, permeating into the stroma and demonstrating distinct invasive characteristics. This infiltration disrupts the normal architecture of the surrounding tissue.

Sample Comparison of Histological Typing
WSI
Question	What is the histological classification based on your examination of the slide?
WSI-LLaVA	The histological classification is invasive ductal carcinoma. This is supported by the presence of infiltrating epithelial cells with minimal glandular formation, moderate nuclear pleomorphism, and a moderate mitotic rate. The absence of vascular invasion and necrosis also aligns with this classification.
Quilt-LLaVA	The histological classification is invasive ductal carcinoma. This is supported by the presence of infiltrating epithelial cells forming moderately differentiated glandular structures, moderate nuclear pleomorphism, and the presence of ductal carcinoma in situ (DCIS) with solid and cribriform patterns.
WSI-VQA	the histological classification based on the observed features is invasive ductal carcinoma . this classification is supported by the presence of infiltrating epithelial cells with minimal glandular formation moderate nuclear pleomorphism and the presence of ductal carcinoma in situ dcis with cribriform and solid patterns .
GPT-4o	I’m unable to make a diagnosis from the image. However, the slide appears to show fibrous and necrotic areas, characteristic of particular tumor types or other pathological conditions. For an accurate classification, a detailed analysis by a pathologist is required, considering factors such as cellular morphology, staining patterns, and clinical context.
Ground Truth	The histological findings are consistent with invasive ductal carcinoma. This classification is supported by the infiltrative growth pattern of the epithelial cells, minimal glandular differentiation, moderate nuclear pleomorphism, and low mitotic activity. The absence of vascular and lymphatic invasion, along with the lack of necrosis and calcifications, helps to further characterize this subtype based on its histopathological features.

Sample Comparison of Grading
WSI
Question	What is the Nottingham Histologic Grade for this carcinoma, and what does this grading reflect?
WSI-LLaVA	The carcinoma is classified as Nottingham Histologic Grade 2, with a total score of 6 out of 9. This grading reflects a tubular formation score of 3, indicating poor differentiation, a nuclear pleomorphism score of 2, representing moderate variability in nuclear size and shape, and a mitotic count score of 1, denoting a low rate of cell division. This intermediate grade suggests a moderately differentiated tumor, which has implications for prognosis and treatment planning.
Quilt-LLaVA	The carcinoma is classified as Nottingham Histologic Grade 2, with a total score of 6 out of 9. This grading reflects a tubular formation score of 3, indicating poor differentiation, a nuclear pleomorphism score of 2, showing moderate variability, and a mitotic count score of 1, indicating low mitotic activity. This intermediate grade suggests a moderately differentiated tumor, which has implications for prognosis and treatment planning.
WSI-VQA	this carcinoma is classified as nottingham histologic grade 2 with a total score of 7 out of 9 . this grading reflects a tubular differentiation score of 3 a nuclear pleomorphism score of 2 and a mitotic count score of 1 . this grade reflects a moderately differentiated tumor with a total score of 1 .
GPT-4o	I can’t determine the Nottingham Histologic Grade from the image alone, as it requires detailed examination of specific histologic features. The Nottingham grading system evaluates three components: 1. Tubule formation: The percentage of tumor forming tubular structures. 2. Nuclear pleomorphism: Variation in the size and shape of the tumor cell nuclei. 3. Mitotic count: The number of cells dividing. Each component is scored from 1 to 3, and the scores are summed to give a grade: - Grade 1 (3-5 points): Well-differentiated (low grade) - Grade 2 (6-7 points): Moderately differentiated (intermediate grade) - Grade 3 (8-9 points): Poorly differentiated (high grade). This grading reflects the aggressiveness of the cancer, with higher grades indicating more aggressive tumors.
Ground Truth	The carcinoma is classified as Nottingham Histologic Grade 2 (NHG2) with a total score of 6 out of 9. This grading reflects poor glandular (tubular) formation, moderateness in nuclear pleomorphism, and a low rate of mitotic activity. An NHG2 designation indicates a moderately differentiated tumor, suggesting an intermediate level of aggressiveness and guiding treatment choices accordingly.
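The scoring arithmetic behind this sample can be sketched in a few lines. This is a minimal illustration, not part of WSI-LLaVA or WSI-Bench: the function name `nottingham_grade` is hypothetical, and the thresholds are taken from the score ranges quoted in the GPT-4o response above (3-5 for Grade 1, 6-7 for Grade 2, 8-9 for Grade 3).

```python
def nottingham_grade(tubule: int, pleomorphism: int, mitosis: int) -> tuple[int, int]:
    """Return (total_score, grade) from three component scores.

    Hypothetical helper: each component is scored 1 to 3, the scores are
    summed, and the total is mapped to a grade using the ranges quoted in
    the sample above (3-5 -> 1, 6-7 -> 2, 8-9 -> 3).
    """
    components = {
        "tubule": tubule,
        "pleomorphism": pleomorphism,
        "mitosis": mitosis,
    }
    for name, score in components.items():
        if score not in (1, 2, 3):
            raise ValueError(f"{name} score must be 1, 2, or 3, got {score}")

    total = tubule + pleomorphism + mitosis
    if total <= 5:
        grade = 1
    elif total <= 7:
        grade = 2
    else:
        grade = 3
    return total, grade


# Component scores reported in the sample: tubule 3, pleomorphism 2, mitosis 1.
total, grade = nottingham_grade(3, 2, 1)
print(total, grade)
```

With the component scores reported in this sample (3, 2, and 1), the total is 6 and the grade is 2, which agrees with the ground truth. The same check shows that the WSI-VQA response is internally inconsistent: its stated total of 7 does not equal the sum of the component scores it lists.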

Sample Comparison of Molecular Subtyping
WSI
Question	Based on the IHC results showing P16 negativity and HR, HPV ISH status negativity, what is the molecular subtype in this head and neck squamous cell carcinoma?
WSI-LLaVA	The molecular subtype is HNSC_HPV.
Quilt-LLaVA	The molecular subtype is HNSC_HPV.
WSI-VQA	the molecular subtype is hnsc unk .
GPT-4o	The molecular subtype of head and neck squamous cell carcinoma with P16 negativity and HR HPV ISH negativity is typically HPV-negative. This subtype is generally associated with non-viral risk factors such as smoking and alcohol use.
Ground Truth	The molecular subtype is HNSC_HPV

Sample Comparison of Staging
WSI
Question	What is the TNM staging for a patient with a 7 cm tumor, several affected lymph nodes, but no distant metastasis?
WSI-LLaVA	The TNM staging for this patient is T3 N2 M0
Quilt-LLaVA	The TNM staging for this patient is T3 N2 M0.
WSI-VQA	the tnm staging for this patient is t2 n2 m0 .
GPT-4o	To determine the TNM staging, we need to consider the following components: - T (Tumor size): The tumor is 7 cm. This generally corresponds to T2 or T3, depending on the specific cancer type, but often T3 for many cancers. - N (Nodes): “Several affected lymph nodes” typically indicates N1 or N2, depending on the exact number and extent of nodal involvement. - M (Metastasis): No distant metastasis is indicated as M0. Without specific details about the type of cancer, a general staging based on the information given could be T3N1M0 or T3N2M0. However, the exact stage can vary based on cancer type and guidelines, so consultation with specific cancer staging resources is advised for precise staging.
Ground Truth	TNM staging for this patient is T3 N2 M0.

Sample Comparison of Prognosis
WSI
Question	Given the patient’s diagnosis of hepatocellular carcinoma, TNM staging of T1N0M0, and immunohistochemistry results showing positive Hepatocyte and negative CK19, is the prognosis good or poor?
WSI-LLaVA	The prognosis is generally good. The early-stage (T1N0M0) hepatocellular carcinoma with positive Hepatocyte and negative CK19 indicates a well-differentiated tumor, which typically has a better prognosis compared to more advanced stages.
Quilt-LLaVA	The prognosis is good. The early stage (T1N0M0) and absence of lymph node involvement (N0) suggest a less aggressive disease, and the positive Hepatocyte marker indicates the tumor is likely of hepatocellular origin, which is typically associated with a better outcome
WSI-VQA	B
GPT-4o	In the case of a T1N0M0 stage hepatocellular carcinoma (HCC), the prognosis is generally considered relatively good. The TNM staging indicates: - T1: Single tumor without vascular invasion. - N0: No regional lymph node involvement. - M0: No distant metastasis. This early-stage diagnosis suggests a higher likelihood of successful treatment options, potentially including surgery, and a better overall prognosis compared to more advanced stages. The immunohistochemistry results showing positive Hepatocyte and negative CK19 support the diagnosis of a well-differentiated HCC, which is generally associated with a better prognosis. CK19 negativity implies the tumor is less likely to be aggressive compared to CK19-positive HCCs. Overall, the combination of early stage and favorable immunohistochemistry results suggests a relatively good prognosis for this patient. However, individual prognosis can vary, and continuous follow-up and management are essential.
Ground Truth	The prognosis is good. The early stage of liver cancer (T1N0M0) with no lymph node involvement or metastasis suggests a limited disease burden, while the IHC profile indicates typical hepatocellular carcinoma without bile duct differentiation.