
These authors contributed equally to this work.

Corresponding author: Shaoting Zhang

Affiliations: 1. Shanghai Artificial Intelligence Laboratory; 2. National Biomedical Imaging Center, Peking University; 3. Ruijin Hospital, Shanghai Jiaotong University School of Medicine; 4. Department of Pathology, State Key Laboratory of Cancer Biology, Xijing Hospital; 5. Department of Computer Science, Rutgers University

Cost-effective Instruction Learning for Pathology Vision and Language Analysis

Kaitao Chen, Mianxin Liu, Fang Yan, Lei Ma, Xiaoming Shi, Lilong Wang, Xiaosong Wang, Lifeng Zhu, Zhe Wang, Mu Zhou, and Shaoting Zhang
Abstract

The advent of vision-language models fosters interactive conversations between AI-enabled models and humans. Yet applying these models in clinics must deal with daunting challenges around large-scale training data and financial and computational resources. Here we propose a cost-effective instruction learning framework for conversational pathology, named CLOVER. CLOVER only trains a lightweight module and uses instruction tuning while freezing the parameters of the large language model. Instead of using costly GPT-4, we propose well-designed prompts on GPT-3.5 for building generation-based instructions, emphasizing the utility of pathological knowledge derived from Internet sources. To augment the use of instructions, we construct a high-quality set of template-based instructions in the context of digital pathology. From two benchmark datasets, our findings reveal the strength of hybrid-form instructions for visual question answering in pathology. Extensive results show the cost-effectiveness of CLOVER in answering both open-ended and closed-ended questions, where CLOVER outperforms strong baselines that possess 37 times more training parameters and use instruction data generated from GPT-4. Through instruction tuning, CLOVER exhibits robust few-shot learning on an external clinical dataset. These findings demonstrate that the cost-effective modeling of CLOVER could accelerate the adoption of rapid conversational applications in the landscape of digital pathology.

keywords:
Vision-language model, Large language model (LLM), Generative AI, Digital pathology, Prompt engineering

1 Main

The rise of vision-language models (VLMs) opens remarkable opportunities to analyze pathological images in a visual question-answering manner [1, 2, 3]. This profound progress in multi-modal data integration leverages the power of large language models (LLMs) in cognition, reasoning, and content generation [4, 5, 6]. In essence, LLMs are large-scale parametric networks trained on vast amounts of data, enabling them to generate human-like responses and achieve remarkable accuracy. ChatGPT [7] and GPT-4 [8] are examples that simulate conversational interaction between AI-enabled models and humans. The generated conversational records can be reused to guide vision-language model refinement [9, 10, 11, 12], opening avenues for downstream tasks across domains [1]. In the landscape of digital pathology, providing in-depth language descriptions of cell morphology, tissue status, and treatment suggestions, equipped with human-like interactions, could enhance tissue understanding, characterization, and decision making in various clinical scenarios [13, 14, 15, 16, 17].

Emerging pathological vision-language models (PVLMs) (e.g., LLaVA-Med [18] and Quilt-LLaVA [19]) have demonstrated their utility in analyzing medical imaging. However, it is widely known that building a capable PVLM demands excessive training data, human labour, and financial and computational resources [20, 21]. Extending general-purpose models into pathology-oriented models starts with using pathological vision-language datasets. These datasets consist of (i) image-text pair (image-caption) datasets and (ii) instruction (image-question-answering) datasets [22, 23, 24, 25]. The image-text dataset aligns visual and language features [26], providing rich semantic information about pathological image contents [27, 28, 29]. Meanwhile, the instruction dataset is crucial for activating LLMs to complete visual-language question answering. Yet the process of instruction generation incurs a considerable financial cost when using GPT-4 (nearly $9,000) [19]. In addition to the instruction data, model optimization also requires substantial compute support. Directly tuning a billion-parameter LLM demands high computational resources and is impractical on consumer-grade GPUs. To overcome these bottlenecks, fundamental questions must be addressed towards building an effective but low-cost PVLM: (i) How can we come up with a lightweight tuning approach to complement LLaVA-like tuning of large numbers of training parameters? (ii) How can we generate a useful instruction dataset in a cost-effective way? (iii) How can we achieve few-shot generalized learning ability of a PVLM for clinical applications without using a large-scale instruction dataset?

2 Building Efficient PVLM via CLOVER

2.1 Vision-language Foundation Model

The family of LLMs [1, 30, 31, 8] and visual language pre-training [26, 32] are two indispensable building blocks for vision-language foundation models. Visual language pre-training aligns the encoding spaces of vision and language so that visual features from the visual encoder can be perceived by the LLM. Examples of multi-modal foundation models include Flamingo [10], BLIP-2 [33], FROMAGe [11], mPLUG-Owl [12], and LLaVA [9]. In particular, Flamingo, BLIP-2, and FROMAGe all freeze the LLM and allow vision to adapt to language information through different modules, while mPLUG-Owl and LLaVA train the LLM to fuse visual and language information. Their success strongly benefits from leveraging large-scale natural image-text datasets. For instance, LLaVA leverages GPT-based data generation to construct its instruction dataset. Despite the promise shown in general domains, these approaches face inherent challenges in the healthcare system. This is due to medical complexity that differs significantly from generic domains, requiring models to capture specialized disease concepts [28, 29]. Therefore, extending these general models to the medical domain requires fine-tuning and the careful use of specific medical image-text pairs and question-answer data [1].

2.2 Medical Vision-language Foundation Model

Building medical vision-language foundation models requires systematic efforts in both data curation and model inference in the healthcare system [1]. Large-scale resources from the literature (PMC-15M [22]), social media (PLIP [24] and Quilt-1M [23]), accessible textbooks [21], and private hospital data [25, 34] provide critical data support for domain-specific multi-modal models. Examples include PMC-LLaMA [20], LLaVA-Med [18], and Med-Flamingo [21]. LLaVA-Med is a representative work that aligns vision and language using 600K image-text pairs on the basis of LLaVA and fine-tunes the model using 60K dialogue-based instructions. Yet these instruction sets are generated by GPT-4 at a high cost. Similarly, in the pathological field, Quilt-LLaVA [19] uses Quilt-1M and GPT-4 to generate instruction data at a cost of up to $8,858. Towards a better utility of foundation models, instruction preparation and generation across various forms of structure and scale require substantial research efforts. Meanwhile, during the fine-tuning stage, LLaVA-like methods [9] update the parameters of the LLM, resulting in a high training cost that cannot be accommodated on consumer-grade GPUs. This pressing demand for high-performance training equipment becomes a daunting hurdle for researchers to extend general-purpose models into domain-specific vision-language models.

2.3 CLOVER


Figure 1: The workflow of CLOVER. CLOVER employs the training framework of BLIP-2 to achieve a fast domain tuning with lightweight parameters. The entire training process of CLOVER includes two major stages: (i) alignment of vision and language and (ii) supervised fine-tuning with instructions. The alignment compels the model to acquire valuable representations between vision and language. Instruction fine-tuning is vital here for activating LLMs to excel in visual language question answering. Stage 1 requires inputs of image-text pairs, where we use the large-scale Quilt-1M [23] dataset. Stage 2 demands domain-specific instruction data. As we have seen a significant lack of the required instruction data in the literature, we propose a low-cost solution of instruction data generation carefully designed for analyzing pathological data.

In this study, we propose a cost-effective learning framework for accurate pathology vision and language inference, named CLOVER (Fig. 1). To achieve fast domain tuning with low training resources, we adopt the BLIP-2 architecture [33] for visual language pre-training, using a lightweight trainable query transformer (Q-former), a frozen visual encoder, and a frozen LLM. We leverage the paired pathological images and text captions from the Quilt-1M [23] dataset to align vision and language. For the instruction fine-tuning, we meticulously design prompts tailored for pathological question answering and use GPT-3.5 [7] to generate an effective instruction dataset at low cost, referred to as the “generation-based instructions”. The instructions generated by our PVLM-oriented prompt show strong capabilities in model tuning. Additionally, we construct instructions by matching template questions with the original text captions to enhance the model’s vision understanding ability, referred to as the “template-based instructions”. Integrating these two types of instructions gives rise to the hybrid-form instructions, which remarkably enhance CLOVER’s conversational abilities in pathology. Using two benchmark datasets and one independent clinical dataset, we comprehensively validate the effectiveness of CLOVER and demonstrate its potential for assisting clinical pathological tasks by “talking to your pathology data” in resource-constrained settings.

Our study makes multifaceted contributions. First, we propose an effective PVLM training framework that operates under low computational resources and financial spending. Different from LLaVA-like tuning (e.g., LLaVA-Med), we emphasize the use of BLIP-2 [33] as an alternative choice for cost-effective lightweight inference. We find that tuning the wide spectrum of the LLM’s parameters is unnecessary for building a sufficiently usable PVLM. Second, the value of our designed PVLM-oriented prompt is pronounced on GPT-3.5 instead of the advanced GPT-4. The low-cost CLOVER model outperforms models trained with LLM tuning on instruction data generated by GPT-4 in multiple settings. We also recognize that template-based instructions, which do not rely on GPT, can significantly supplement generation-based instructions; a combined use of these two types of instructions further enhances CLOVER’s understanding capabilities. Third, we confirm that CLOVER can be effectively developed with a limited scale of instructions towards a frugal, real-world application. This is evidenced by the appealing performance in fine-tuning and zero-shot experiments on two visual question answering datasets, as well as in the few-shot experiment on a real-world clinical dataset.

3 Results

3.1 Quantitative Comparison

We systematically measure CLOVER’s visual question answering (VQA) performance using two public benchmark datasets, namely PathVQA [35] and QUILT-VQA [19], and one independent clinical dataset (see Sec. Datasets). We compare CLOVER’s performance with state-of-the-art (SOTA) PVLMs trained at notable costs to assess the cost-effectiveness of CLOVER.

Comparison with SOTA PVLMs on PathVQA. From Table 1, CLOVER outperforms major competing methods, showing a remarkable improvement of up to 26.13% in accuracy for closed-ended questions. Our model’s performance even approaches that of LLaVA-Med (37 times more trainable parameters) with a minor accuracy difference of 1.83%. Note that LLaVA and LLaVA-Med achieve their results through extensive parameter tuning of the LLM and training on a vast instruction dataset generated by GPT-4. In contrast, our method trains with a frozen LLM, adjusting only a small portion of parameters (about 1/37 of LLaVA-Med). Meanwhile, CLOVER only involves the use of the entry-level GPT-3.5 with a smaller instruction dataset (2/3 that of LLaVA-Med). As seen in Table 1, our model also shows superior performance improvement in open-ended question scenarios: CLOVER’s performance is twice that of BLIP-2 and approaches five times that of LLaVA. Compared with BLIP-2, our model improves in both closed-ended and open-ended question answering. This key finding suggests that vision-language models can evidently benefit from high-quality, pathology-sensitive instruction data towards low-cost model development.

Method | Number of instructions | Trainable parameters (pt) | Closed-end | Closed-end / log(pt) | Open-end | Open-end / log(pt)
VL Encoder–Decoder | N/A | 400M | 84.63 | 32.52 | - | -
Q2ATransformer | N/A | N/A | 88.85 | - | - | -
M2I2 | N/A | 236M | 88.00 | 37.09 | - | -
LLaVA | 158K | 7B | 63.20 | 16.41 | 7.74 | 2.01
LLaVA-Med | 60K | 7B | 91.21 | 23.68 | 37.95 | 9.85
BLIP-2 | N/A | 187M | 83.90 | 36.93 | 18.69 | 8.23
CLOVER (ours) | 45K | 187M | 89.38 | 39.34 | 36.95 | 16.26
Table 1: Comparison with SOTA methods on the PathVQA dataset.

Comparison with SOTA PVLMs on QUILT-VQA. We validate CLOVER’s zero-shot generalization capability on the QUILT-VQA dataset. From Table 2, CLOVER achieves the leading performance in precision and F1-score. Our results are also close to LLaVA and LLaVA-Med in terms of recall. Because LLaVA and LLaVA-Med produce longer answers, these models are advantaged on the recall metric: the average length (in words) of the reference answers is 18, whereas the average answer lengths of LLaVA and LLaVA-Med are 38 and 36, respectively, both higher than those of BLIP-2 and our model (4 and 22, respectively). In addition, our precision exceeds LLaVA’s by 9.77% and LLaVA-Med’s by 10.25%, while our F1-score surpasses LLaVA’s by 6.73% and LLaVA-Med’s by 6.53%. In terms of the performance-cost ratio, CLOVER achieves the leading values for all metrics. Overall, these findings indicate CLOVER’s strength in retaining high-level performance while minimizing costs.

Method | Recall | Recall / log(pt) | Precision | Precision / log(pt) | F1-score | F1-score / log(pt)
LLaVA | 57.36 | 14.89 | 30.97 | 8.04 | 36.83 | 9.56
LLaVA-Med | 57.03 | 14.81 | 30.49 | 7.92 | 37.03 | 9.62
BLIP-2 | 10.03 | 4.41 | 30.51 | 13.43 | 13.95 | 6.14
CLOVER (ours) | 54.33 | 23.91 | 40.74 | 17.93 | 43.56 | 19.17
Table 2: Comparison with prior SOTA methods on the QUILT-VQA dataset.
Question: What can be observed in the image? [Uncaptioned image]
Answer: In the image, a cross-section of a bone can be seen upon magnification, osteomas in the compact bone are visible the tissue in question shows a very eosinophilic cytoplasm and nuclei at the edge, indicating a cross-sectional profile of skeletal muscle cells. (Reference)
LLaVA: In the image, there is a person standing in front of a large body of water, which appears to be a lake. The person is wearing a backpack, and there is a boat visible in the water. The scene suggests that the person might be taking a break from boating or preparing to go out on the water.
LLaVA-Med: In the image, you can observe the histopathology and immunohistochemistry of a tissue sample. Histopathology refers to the microscopic examination of tissue to study the manifestations of disease, while immunohistochemistry is a technique used to visualize the presence and distribution of specific proteins or antigens in tissue samples using antibodies.
BLIP-2: A cell nucleus in a tissue.
CLOVER: Based on the provided image, it appears that the observer is looking at a section of bone tissue, the bone tissue consists of a honeycomb-like structure made up of osteocytes, which are specialized bone cells embedded within the bone matrix.
Table 3: Qualitative comparisons of visual question answering on QUILT-VQA. (Image source: QUILT-VQA [19])

3.2 Qualitative Comparison

To gain insight into model outputs, we present representative cases from VQA experiments comparing LLaVA, LLaVA-Med, BLIP-2, and our CLOVER on the QUILT-VQA and LLaVA-Med-17K datasets. In Table 3, showing results from QUILT-VQA, we observe that the output of LLaVA is unrelated to the actual content, reflecting general image contents such as “people”, “water”, and “lake”. While LLaVA-Med identifies a tissue sample and provides definitions for pathology and immunohistochemistry, it does not reveal the specific tissue type. BLIP-2 only provides a simple answer without specific information. In contrast, CLOVER identifies that the image shows a section of bone tissue and describes its composition and structural shape, which serve as the basis for model reasoning. In addition, we perform an evaluation on the LLaVA-Med-17K instructions. It is noted that the results of LLaVA-Med tend to be overly optimistic due to the known overlap with its own training set [18]. A case is shown in Table 4. LLaVA’s answers remain unrelated to pathology, while BLIP-2 only answers generally without image details. Our model identifies the inflammatory cells and provides a more detailed description of their specific types. This is facilitated by our proposed PVLM-oriented prompts and generation-based instructions enriched with pathology knowledge distilled from GPT-3.5. Overall, CLOVER demonstrates its descriptive ability in reasonably responding to professional queries related to tissue characteristics and pathological outcomes.

Question: What are the arrows indicating? [Uncaptioned image]
Answer: The arrows in the image are pointing to inflammatory infiltrates within the myocardial fibers inflammatory infiltrates are collections of immune cells that have migrated to the site of inflammation or injury in the tissue. (Reference)
LLaVA: The arrows indicate the direction of the flow of energy in the system. In this case, the arrows point from the heat source to the heat sink, indicating that heat is being transferred from the heat source to the heat sink. This is a common way to represent the flow of heat in a system and it helps visualize how the system is operating and how it can be optimized for better performance.
LLaVA-Med: The arrows in the image are pointing to areas of inflammatory infiltrate. Inflammatory infiltrate refers to the accumulation of immune cells, such as white blood cells, in the tissue as a response to injury, infection, or other factors that cause inflammation.
BLIP-2: The direction in which a tissue is moving.
CLOVER: The arrows in the image indicate the presence of inflammatory cells, such as neutrophils and lymphocytes, within the myocardium these inflammatory cells are a hallmark of myocarditis, which is an inflammation of the heart muscle.
Table 4: Qualitative comparisons of visual question answering on LLaVA-Med-17K. (Image source: https://www.ncbi.nlm.nih.gov/pubmed/26147524)


Figure 2: Comparison with prior SOTA methods on the intestine dataset.


Figure 3: Comparison with prior SOTA methods on the stomach dataset.

3.3 External Clinical Data Validation

To assess the efficiency of CLOVER in real-world applications with minimal human labour, we validate CLOVER’s few-shot learning capability on a challenging cancer detection task (i.e., cancer/non-cancer tissue classification). We implement a K-shot learning scheme, meaning that our model can only use K WSIs (K = 1, 2) from each class for model fine-tuning. We evaluate model performance on the external validation dataset covering two cancer tissue types. Fig. 2 details the performance of CLOVER in intestinal cancer detection. We report comparative results only for the BLIP-2 model, as it is not feasible on an RTX 3090 to fine-tune the resource-demanding LLaVA and LLaVA-Med. In addition, Fig. 3 presents results for gastric cancer detection. Overall, both Fig. 2 and Fig. 3 show that CLOVER demonstrates a performance advantage over BLIP-2. Even under the extreme condition of 1-shot learning, CLOVER achieves a commendable accuracy of 75.49% in the intestinal classification task in Fig. 2. In gastric cancer detection, CLOVER’s performance improves rapidly with the number of training samples: starting from an accuracy of 75.04% in the 1-shot scenario, the accuracy swiftly increases to 87.91% with 2-shot learning in Fig. 3. This marked improvement underscores CLOVER’s few-shot learning potential for cancer detection on both stomach and intestine tissues, especially under extreme sample conditions.

3.4 Ablation Studies

Comparison with SOTA instruction data. We conduct ablation studies assessing the value of instruction data. We compare our instruction data with SOTA instruction datasets under the same BLIP-2 model setting and use PathVQA as the test set. In detail, we use the generated 15K generation-based and 30K template-based instructions as our instruction data. In Table 5, the gains from using our instructions, generated by GPT-3.5 from noisy Internet data, are higher than those from instructions generated by GPT-4 from high-quality data. This finding can be attributed to the form of the instructions (generation-based and template-based) and their quality (generated through PVLM-oriented prompts). In particular, our costs are only $8, whereas the estimated API costs of LLaVA-Med are hundreds of times higher than our low-cost approach. When using the FlanT5XL model, our performance on closed-ended question answering closely approaches that of LLaVA-Med-Pathology-IM (LM-IM) [18], and evidently outperforms LLaVA-Med-Pathology (LM) [18] and LM-IM on open-ended question answering by 2.22% and 3.13%, respectively. With the Vicuna 7B model, our results surpass LM in both closed-ended and open-ended question answering.

LLM | Instruction data | Estimated cost | Closed-end | Open-end
FlanT5XL | LM (from GPT-4) | more than $1,000 | 86.49 | 24.55
FlanT5XL | LM-IM (from GPT-4) | more than $1,000 | 86.56 | 23.64
FlanT5XL | Ours (from GPT-3.5) | $8 | 85.84 | 26.77
Vicuna 7B | LM (from GPT-4) | more than $1,000 | 88.73 | 35.28
Vicuna 7B | LM-IM (from GPT-4) | more than $1,000 | 89.88 | 35.48
Vicuna 7B | Ours (from GPT-3.5) | $8 | 89.38 | 36.95
Table 5: The VQA performance of different models tuned with SOTA instruction datasets and with our instructions.
LLM | Template-based instructions | Closed-end | Open-end
FlanT5XL | without | 84.51 | 24.10
FlanT5XL | with | 85.84 | 26.77
Vicuna 7B | without | 88.76 | 35.08
Vicuna 7B | with | 89.38 | 36.95
Table 6: The VQA performance of different models tuned with and without template-based instructions.
Caption: Apoptotic keratinocytes are present within the epidermis, consistent with a clinical impression of pityriasis lichenoides at variola formis acuta (PLEVA). PLEVA is characterized by parakeratosis and lymphocytic exocytosis, and may also show lichenoid interface dermatitis. [Uncaptioned image]
Question and answer generated with PVLM-oriented prompt :
Question: What can you tell me about the presence of apoptotic keratinocytes in the image?
Answer: The image shows the presence of apoptotic keratinocytes within the epidermis. This finding is consistent with a clinical impression of pityriasis lichenoides at variola formis acuta (PLEVA). Apoptotic keratinocytes are cells undergoing programmed cell death, and their presence in the epidermis suggests an abnormality in the cellular turnover process.
Question: What are the characteristic features of PLEVA?
Answer: PLEVA is characterized by parakeratosis, which is an abnormal retention of nuclei in the outermost layer of the epidermis, lymphocytic exocytosis, which is the migration of immune cells from blood vessels into the epidermis, and lichenoid interface dermatitis, which is the infiltration of immune cells at the junction between the epidermis and dermis.
Question and answer generated with non-PVLM-oriented prompt (Now you are a pathologist, and your task is to generate question-answer pairs based on the captions of the images.):
Question: What type of cells are present within the epidermis that are consistent with PLEVA?
Answer: Apoptotic keratinocytes are present within the epidermis, consistent with PLEVA.
Question: What does PLEVA stand for?
Answer: PLEVA stands for pityriasis lichenoides et variolaformis acuta.
Table 7: Qualitative comparison of the questions and answers respectively generated with PVLM-oriented prompt and non-PVLM-oriented prompt. (Image source: QUILT-VQA [19])
LLM | PVLM-oriented prompts | Closed-end | Open-end
FlanT5XL | without | 85.67 | 12.64
FlanT5XL | with | 86.80 | 26.48
Vicuna 7B | without | 88.59 | 31.79
Vicuna 7B | with | 88.76 | 35.08
Table 8: VQA performance with and without PVLM-oriented prompts.

The effectiveness of template-based instruction data. We investigate the effectiveness of template-based instructions, which aim to provide the LLM with more direct and comprehensive language guidance. As shown in Table 6, adding template-based instructions to the original instruction data is effective. Especially for FlanT5XL, there is a remarkable improvement in open-ended question answering. We reason that template-based instructions provide a complete answer covering the image caption and can therefore compensate for the information loss incurred when creating generation-based instructions. Meanwhile, the template-based instruction design enables the language model to comprehensively capture visual contents and related contextual descriptions.

The effectiveness of PVLM-oriented prompts. To explore whether simplified question-answering prompts can achieve results comparable to our designed PVLM-oriented prompts, we again utilize GPT-3.5 to generate instruction data but reduce the prompt to a single sentence: “Now you are a pathologist, and your task is to generate question-answer pairs based on the captions of the images.” (examples are offered in Table 7). The observable difference is that the answers of GPT-3.5 become shorter and lack additional knowledge. Results in Table 8 show that for the relatively weaker-performing FlanT5XL, low-quality prompts have a serious impact, yielding a recall of only 12.64% in open-ended question scenarios, even lower than BLIP-2’s 18.69%. This finding indicates that our PVLM-oriented prompts play a crucial role in enriching the instruction data with medical knowledge, thereby contributing to the performance improvement of the pathological vision-language model.


Figure 4: Model performance at different scales of generation-based instruction data.


Figure 5: VQA performance of CLOVER with different numbers of training epochs at each stage. (a)-(b). The influence of different Stage 1 (alignment) training epochs on closed-ended and open-ended VQA performance. (c)-(d). The influence of different Stage 2 (instruction tuning) training epochs on closed-ended and open-ended VQA performance. (e)-(f). The influence of different fine-tuning epochs on closed-ended and open-ended VQA performance.

Impact of the scale of generation-based instructions. Due to the expense of the GPT API and the high training cost, our goal is to generate less instruction data while retaining high PVLM performance. We are particularly focused on measuring model performance under a low demand for instruction data. We conduct experiments using 5K, 10K, 15K, and 30K generation-based instructions for the FlanT5XL and Vicuna models. As seen in Fig. 4, our method achieves high-quality results on small-scale instruction data. When comparing the 15K and 30K cases, the difference in results is not statistically significant, indicating that large-scale instruction datasets may not always be instrumental. More critically, we recognize that the form of the instructions (whether they include template-based instructions) and their quality (whether they are generated through PVLM-oriented prompts) are more decisive factors than the quantity of instruction data.

Homogeneity of instruction data. The quality of the generated instruction data could vary across different subsets. To validate the homogeneity and robustness of the instruction dataset, we randomly divide the 15K generation-based instructions into three subsets of 5K each and use them for training separately. From the results in Table 9, we observe stable performance across subsets, which indicates a potential generalization ability and adaptability to variations within the dataset. This homogeneity is critical for measuring the robustness of instruction data, since variations and noise can be introduced during data collection.

LLM | Subset | Closed-end | Open-end
Vicuna 7B | 1 | 89.65 | 32.78
Vicuna 7B | 2 | 88.59 | 32.65
Vicuna 7B | 3 | 88.50 | 32.25
Table 9: The VQA performance of models trained with different 5K subsets of the generation-based instruction data.

Impact of epochs at each training stage. We conduct ablation experiments using Vicuna as the LLM in Stage 1 (alignment), Stage 2 (instruction tuning), and the fine-tuning stage, using the same model parameter initialization as BLIP-2. The experimental results in Fig. 5 indicate that sufficient training on pathological images in Stage 1 (Fig. 5(a) and (b)) and Stage 2 (Fig. 5(c) and (d)) is necessary for BLIP-2’s adaptation to the pathological domain. In the fine-tuning stage (Fig. 5(e) and (f)), increasing the number of epochs from 10 to 20 leads to a noticeable performance improvement on open-ended question tasks, as the model needs to adapt to the evaluation dataset through its corresponding training split.

4 Discussion

Development of cost-effective pathological vision-language models (PVLMs) is crucial to enable timely information retrieval and accelerate clinical decision making. In this study, we have proposed a framework to train a precise PVLM with low computational resources and financial expenditure. A key contribution is that we have designed PVLM-oriented prompts and leveraged GPT-3.5 to create question-and-answer instruction datasets. These generated instructions, combined with template-based instructions requiring no economic expenditure, exhibit superior capabilities in aiding instruction-based model tuning and downstream use. Our established CLOVER performs strongly compared to advanced models with significantly more parameters and the expensive costs of using GPT-4. In addition, CLOVER presents an appealing conversational capacity on real-world data from stomach and intestine cancer tissues. These findings demonstrate CLOVER’s potential as a cost-effective multimodal framework to support downstream clinical tasks without excessive model training and human labeling efforts.

Computational pathology is widely known for its high demands on data and computational resources, especially for building a useful PVLM application. Prior efforts were primarily focused on the use of human-examined clinical data [1, 36]. A key differentiation of our study is to build efficient PVLMs based on public Internet data and to harness the power of instruction data without extensive human examination. To address the challenge of image-to-text alignment, we show that using generation-based instructions and prompting GPT-3.5 can produce high-quality instruction data. In addition, we offer template-based instructions to improve the scale and diversity of the instructions, leading to improved generalizability of models in pathology. It is noteworthy that combining GPT-3.5 with a proper prompt can yield performance that even surpasses results obtained with GPT-4. Since we demonstrate this performance gain without advanced GPTs, our study can serve as a methodological baseline for guiding follow-up studies to measure cost-effectiveness.

Data generation has become increasingly crucial to diversify training data and enhance model robustness [37, 38, 39]. In our study, instruction data generation is a core strategy for CLOVER to achieve its strong efficiency. Especially in the era of large models, instruction data generation has proven useful in model fine-tuning as it can unlock the specialized use of LLMs [40, 41, 42]. From our results, we have gained key insights into the efficient use of instructions in model tuning and inference. First, instruction fine-tuning markedly improves performance compared to models without fine-tuning; even under low-resource conditions with frozen LLM parameters, instruction fine-tuning can still enhance performance. Second, small-scale, well-defined instructions are powerful in guiding PVLM inference, suggesting that large-scale instruction datasets are not always necessary. This finding is strongly aligned with prior research showing that less but higher-quality instruction data can yield superior results [43]. The comparison of PVLM-oriented versus non-PVLM-oriented prompts on model performance shows that well-crafted prompts help the LLM generate higher-quality instructions. Finally, our study confirms that the proper use of instructions is impactful in complex pathological visual question answering tasks. This key insight supports the observation that instruction tuning is more beneficial for complex and unseen tasks than for simpler ones [44].

It has come to our attention that instructions generated using PVLM-oriented prompts naturally incorporate the characteristics of chain-of-thought (CoT) reasoning [6]. From our results, these instructions typically combine explicit answers with explanations and contextual expansions of the medical knowledge. In addition, previous investigations have suggested that combining non-CoT instructions with CoT instruction fine-tuning achieves positive results, outperforming fine-tuning on CoT instructions alone [40]. Likewise, our finding affirms that a hybrid form of instructions, encompassing both generation-based and template-based instructions, can greatly contribute to the development of cost-efficient PVLMs.

Our primary focus has been on building a low-cost PVLM; we thus have not addressed automated noise removal or fine-scale image-to-text alignment, which could further improve model robustness. CLOVER performs patch-level pathology image processing for its efficiency, so extensive analysis of spatially-aware image contents could further enhance clinical diagnosis and report generation [45]. While our study aims to sweep away the barriers of data, computational resources, and financial costs for building PVLMs, other parallel cost-effective endeavours [46, 47] can be considered to enhance performance efficiency. As our instruction approach can potentially extend to other demanding human-AI interactions, continued exploration of external multi-modal clinical data could help validate the low-cost utility of CLOVER. Finally, our study demonstrates the power of instruction data for building a vision-language model, enabling rapid language-and-image information interaction and conversational decision making in the space of digital pathology.

5 Methods

In this section, we offer methodological details of the development of CLOVER. We then introduce the datasets involved and the experimental evaluation schemes.

5.1 Efficient Instructions Construction at Low-cost in CLOVER

Generating large-scale instructions via GPT-4 incurs significant financial costs. In this study, we focus on effective domain-specific instruction data generation at low cost. The proposed framework includes (i) generation-based instructions built with a specialized prompt for GPT-3.5, and (ii) template-based instructions built without any additional financial cost. The construction of generation-based instructions involves generating question-answer (QA) pairs from the captions using GPT with a PVLM-oriented prompt, while that of template-based instructions involves matching the captions, as answers, to a set of pre-designed template questions. Notably, the template-based QA dataset permits a comprehensive understanding of image contextual information, while the generation-based QAs emphasize the pathological knowledge distilled from GPT-3.5.

Generation-based Instruction. We meticulously design a prompt tailored for pathological question answering and use GPT-3.5 [7] for instruction dataset generation, enabling low-cost operation; the prompt is shown in Table 10. Given the high variance in prompt design, we follow four principles in prompt construction. First, we use GPT-3.5 to simulate a scenario where users (patients or doctors) and AI assistants (the CLOVER model) conduct question answering (QA). Since GPT-3.5 does not have access to images, the QAs generated by GPT-3.5 are based on the textual description of the image. To reduce over-reliance on textual descriptions, we emphasize avoiding minor information that cannot be obtained from the image (e.g., reference dates or magnification ratios). Second, we focus on adding detailed visual information such as tissue structure, cell morphology, potential lesions, and locations. Third, noise in the original textual description, such as vocabulary referring to the context or the narrator, is avoided. Finally, we steer GPT-3.5 to generate answers that exhibit cautiousness, aligning with the medical field’s expectation of prudent answers. To enhance the quality of the generated data, we additionally introduce few-shot examples in the prompts to inject more relevant information for in-context learning [48].

Messages = [ {“role” : “system”,“content”: “As a specialized AI assistant focusing on pathological images, you will receive textual descriptions (caption) of figures. Please note that you do not have access to the actual images. Your task is to generate a set of question-and-answer (QA) pairs between the person inquiring about the images (user) and you as the assistant responding. The QA should be conducted as if both the user and the assistant are examining the images, without referring to textual information.
The following are the requirements for generating question-and-answer pairs: Avoid referencing dates or magnification ratios. Focus on visual descriptions, including organizational structure, cellular morphology, potential pathological changes, location, etc. Avoid using phrases such as “mention”, “title”, “context”, or “narrator”. Instead, refer to information as being “in the image.” When responding to questions, adopt an objective and responsible attitude, avoiding overconfidence, and refrain from providing medical advice or diagnostic information. Encourage users to consult healthcare professionals for more accurate advice. The content should include 4-5 question-and-answer pairs related to visual aspects of the images. ” },
{few-shot examples},
{ “role”: “user”, “content”: description} ]
Table 10: The proposed PVLM-oriented prompt for generating high-quality pathological QA instruction data.
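As a concrete illustration, the snippet below sketches how a caption could be sent to GPT-3.5 with a prompt of this form through the OpenAI chat-completions interface. The abbreviated system message, model name, temperature, and empty few-shot list are illustrative assumptions rather than the exact settings used for CLOVER.

```python
# Hypothetical sketch of generation-based instruction creation with GPT-3.5.
# The system message condenses the PVLM-oriented prompt of Table 10; the model name,
# temperature, and (empty) few-shot list are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

SYSTEM_PROMPT = (
    "As a specialized AI assistant focusing on pathological images, you will receive "
    "textual descriptions (caption) of figures. Generate 4-5 question-and-answer pairs "
    "about visual aspects of the images. Avoid dates and magnification ratios; avoid "
    "words such as 'mention', 'title', 'context', or 'narrator'; answer cautiously and "
    "refrain from giving medical advice or diagnostic information."
)

FEW_SHOT_EXAMPLES = []  # caption -> QA exemplars would be added here for in-context learning


def generate_qa_instructions(caption: str) -> str:
    """Return the raw QA text produced by GPT-3.5 for one image caption."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT},
                *FEW_SHOT_EXAMPLES,
                {"role": "user", "content": caption}]
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed model identifier
        messages=messages,
        temperature=0.7,
    )
    return response.choices[0].message.content
```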

Template-based Instruction. The above generation-based instructions produced with GPT-3.5 are often based on partial captions derived from images; they capture only a portion of the original information, preventing LLMs from fully capturing the visual and descriptive content. To address this challenge, we construct template-based instructions, providing the LLM with more structured and comprehensive language guidance and overcoming the shortcoming that visual information is often overlooked or incomplete in pathology. We use the same 17 descriptive statements as LLaVA [9], which instruct the model to intricately describe the content of images using various expressions. These statements serve as the questions, and the corresponding answers come from the original captions of the images. In detail, we merge multiple captions for the same image and filter out those with a word count below 25. Subsequently, we randomly choose 30K image-text pairs for generating the template-based instructions. For each image-text pair, we randomly select one description from the 17 statements to form our template-based instruction. Note that this generation process requires no additional fee as GPT is not involved.
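A minimal sketch of this construction, under the stated merging and filtering rules, is given below; the two listed template questions are placeholders for the 17 LLaVA descriptive statements, and the data layout (a mapping from image identifiers to caption lists) is an assumption.

```python
# Minimal sketch of template-based instruction construction under the described rules.
# TEMPLATE_QUESTIONS stands in for the 17 descriptive statements borrowed from LLaVA
# (only two illustrative entries are listed); the input layout is an assumption.
import random

TEMPLATE_QUESTIONS = [
    "Describe the following image in detail.",
    "What is this photo about? Please describe it thoroughly.",
    # ... the remaining descriptive statements
]


def build_template_instructions(image_to_captions, n_pairs=30_000, min_words=25, seed=0):
    """Pair a randomly chosen template question with each merged image caption."""
    rng = random.Random(seed)
    # Merge multiple captions of the same image into one answer text.
    merged = {img: " ".join(caps) for img, caps in image_to_captions.items()}
    # Filter out merged captions with fewer than `min_words` words.
    eligible = [(img, cap) for img, cap in merged.items() if len(cap.split()) >= min_words]
    sampled = rng.sample(eligible, min(n_pairs, len(eligible)))
    return [{"image": img, "question": rng.choice(TEMPLATE_QUESTIONS), "answer": cap}
            for img, cap in sampled]
```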

5.2 Training Details of CLOVER

Model training involves two main stages: (i) alignment of vision and language and (ii) supervised fine-tuning with instructions. In the first training stage, to align pathological images and text, we train BLIP-2 on the original image-text pairs directly obtained from the Quilt-1M dataset [23]. BLIP-2 takes inputs in the form of a pair of image and text (caption or question). The visual encoder is utilized for extracting features from the image and generating visual tokens $\textbf{V}=\{v_1, v_2, \ldots, v_{n_v}\}$, where $n_v$ is the number of visual tokens. Next, the lightweight Q-former handles text and visual learnable queries, incorporating a self-attention [49] mechanism that shares the transformer between the visual and textual components. A cross-attention mechanism is used to facilitate the interaction between visual tokens and visual queries. We simultaneously optimize the Q-former using the image-text contrastive loss, the image-grounded text generation loss, and the image-text matching loss [32]. The image-text contrastive loss aims to maximize the similarity within the same image-text pair by comparing visual queries and text representations. The image-grounded text generation loss trains a text transformer to generate the corresponding text given an image. Meanwhile, the image-text matching loss is a binary classification loss used to determine whether an image and a text belong to the same pair.
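The Stage-1 objective can be read as the sum of these three terms. The sketch below is a simplified illustration (not BLIP-2's exact implementation): only a symmetric InfoNCE-style contrastive term is spelled out, and the generation and matching losses are assumed to be computed by the corresponding Q-former heads.

```python
# Simplified sketch (not BLIP-2's exact implementation) of the Stage-1 objective:
# total loss = ITC + ITG + ITM. Only a symmetric InfoNCE-style contrastive term (ITC)
# is spelled out; the generation (ITG) and matching (ITM) losses are assumed to come
# from the corresponding Q-former heads.
import torch
import torch.nn.functional as F


def itc_loss(query_feats: torch.Tensor, text_feats: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """Contrast pooled visual-query features against text features within a batch."""
    q = F.normalize(query_feats, dim=-1)               # (B, D)
    t = F.normalize(text_feats, dim=-1)                # (B, D)
    logits = q @ t.T / temperature                     # (B, B) similarity matrix
    targets = torch.arange(q.size(0), device=q.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))


def stage1_loss(query_feats, text_feats, itg_loss, itm_loss):
    """Joint Q-former objective for vision-language alignment."""
    return itc_loss(query_feats, text_feats) + itg_loss + itm_loss
```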

In the second training stage, we introduce the customized instruction dataset for activating the LLM and completing visual language question answering. To stimulate the domain speciality of the LLM, we add a task-driven prompt before the input question: “Now that you are a pathologist, please answer the following questions based on the images”. We utilize standard tokenization to obtain a sequence of text tokens, including prompt tokens $\textbf{P}=\{p_1, p_2, \ldots, p_{n_p}\}$, question tokens $\textbf{Q}=\{q_1, q_2, \ldots, q_{n_q}\}$, and answer tokens $\textbf{A}=\{a_1, a_2, \ldots, a_{n_a}\}$, where $n_p$, $n_q$, and $n_a$ are the token lengths of the prompt, question, and answer. We feed the Q-former both the visual tokens $\textbf{V}=\{v_1, v_2, \ldots, v_{n_v}\}$ and the text tokens, and then apply a linear layer to map the dimension of the image tokens into the dimension of the text tokens. Our optimization objective is to fine-tune the learnable parameters $\theta$ of the Q-former by maximizing the following likelihood:

$p(A \mid P, Q, V) = \prod_{i=1}^{n_a} p_{\theta}(a_i \mid P, Q, V, a_1, a_2, \ldots, a_{i-1})$,   (1)

where $p_{\theta}(a_i \mid P, Q, V, a_1, a_2, \ldots, a_{i-1})$ is the probability of generating $a_i$ given $P$, $Q$, $V$, and $a_1, a_2, \ldots, a_{i-1}$ under the model parameters $\theta$.
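In implementation terms, Eq. (1) corresponds to a causal language-modelling loss restricted to the answer tokens, with the visual, prompt, and question tokens acting only as conditioning context. The following is a hedged sketch under that reading; the tensor layout is an assumption about the batching.

```python
# Hedged sketch of the objective in Eq. (1): a causal language-modelling loss restricted
# to answer tokens, with visual, prompt, and question tokens acting as context only.
# The (B, L, vocab) / (B, L) tensor layout is an assumption about the batching.
import torch
import torch.nn.functional as F


def answer_nll(logits: torch.Tensor, input_ids: torch.Tensor,
               answer_mask: torch.Tensor) -> torch.Tensor:
    """
    logits:      (B, L, vocab) next-token predictions conditioned on the visual queries
    input_ids:   (B, L)        token ids of the concatenated prompt, question, and answer
    answer_mask: (B, L)        1 where the token belongs to the answer A, 0 elsewhere
    """
    # Shift so that position i predicts token i + 1.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:]
    shift_mask = answer_mask[:, 1:].float()
    token_loss = F.cross_entropy(shift_logits.reshape(-1, shift_logits.size(-1)),
                                 shift_labels.reshape(-1), reduction="none")
    token_loss = token_loss.reshape(shift_labels.shape)
    # Mean negative log-likelihood over answer tokens, i.e. the negative of Eq. (1).
    return (token_loss * shift_mask).sum() / shift_mask.sum().clamp(min=1.0)
```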

5.3 Implementation Details of CLOVER

We choose EVA-ViT-G/14 [50] as the frozen visual encoder, utilizing the output from its second-to-last layer as the visual feature representation. This choice has been validated as a superior solution in BLIP-2 [33]. Regarding the LLM, we select the decoder-only Vicuna 7B [31] and encoder-decoder based FlanT5XL [30], which could best utilize the pre-trained BLIP-2 parameters. To enhance the training efficiency, we convert the parameters of visual encoder and LLM to FP16. In the first stage of model training, we conduct training for 20 epochs with a batch size of 36. In the second stage, training continues for 30 epochs, with a batch size of 2 for Vicuna and 8 for FlanT5XL. The remaining hyperparameter settings remain consistent with BLIP-2. We complete the two-stage training using Vicuna in 4 days with 4 RTX-3090 GPUs.
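For quick reference, the settings reported in this subsection can be collected into a small configuration sketch; any value not stated above (e.g., learning rate or schedule) is left to the BLIP-2 defaults mentioned in the text.

```python
# Summary of the training configuration reported above; values not stated in the text
# (e.g., learning rate and schedule) are left to the BLIP-2 defaults.
CLOVER_TRAINING_CONFIG = {
    "visual_encoder": "EVA-ViT-G/14 (frozen, penultimate-layer features, fp16)",
    "llm": {"decoder_only": "Vicuna 7B (frozen, fp16)",
            "encoder_decoder": "FlanT5XL (frozen, fp16)"},
    "stage1_alignment": {"epochs": 20, "batch_size": 36, "data": "Quilt-1M image-text pairs"},
    "stage2_instruction_tuning": {
        "epochs": 30,
        "batch_size": {"vicuna_7b": 2, "flant5_xl": 8},
        "data": "15K generation-based + 30K template-based instructions",
    },
    "hardware": "4 x RTX 3090 GPUs, about 4 days for the two-stage Vicuna run",
}
```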

5.4 Datasets

5.4.1 CLOVER Training Dataset

Quilt-1M [23] is a multi-modal pathology dataset involving both pathological vision and language information. It includes 768,826 pathological images and their corresponding textual annotations obtained from YouTube (the Quilt subset). Additional established Internet-based, image-text paired data are integrated with Quilt to form a total dataset of one million image-text pairs, called Quilt-1M. In our study, we use Quilt-1M for two tasks: the first-stage vision-language alignment and instruction generation for supervised fine-tuning.

The CLOVER instruction data are generated with our proposed method as introduced above. They consist of 15K generation-based instructions produced with GPT-3.5 (the default size in major experiments) and 30K manually constructed template-based instructions. We develop this original instruction dataset for activating LLMs to complete visual-language question answering in the pathology domain. These instructions are used for instruction tuning (Stage 2). An example of question-and-answer instructions generated from the prompt is shown in Table 11.

Caption: Active inflammation in the stomach can cause epigastric pain and is often associated with H. pylori infection. Successful treatment of H. pylori can lead to regression of the inflammation and healing of the stomach. The described condition is H. pylori gastritis or chronic active gastritis with Helicobacter pylori organisms seen on H and E stain. [Uncaptioned image]
Question: What is the described condition?
Answer: The described condition is H. pylori gastritis or chronic active gastritis. This is an inflammation of the stomach that is often caused by Helicobacter pylori infection.
Question: What are the symptoms associated with active inflammation in the stomach?
Answer: Active inflammation in the stomach can cause epigastric pain, which is pain in the upper abdomen. Other symptoms that may be associated with this condition include nausea, vomiting, bloating, and loss of appetite.
Question: What can lead to regression of the inflammation and healing of the stomach in H. pylori gastritis?
Answer: Successful treatment of H. pylori infection can lead to regression of the inflammation and healing of the stomach. By eradication of the H. pylori bacteria, the inflammatory response can be reduced, allowing the stomach tissue to heal.
Question: What staining technique was used to visualize the H. pylori organisms?
Answer: The H. pylori organisms were visualized using Hematoxylin and Eosin (H&E) staining technique. This staining method allows for the detection of bacteria and other microscopic structures within the tissue.
Table 11: An example of the generated question and answer as instructions based on the caption. (Image source: QUILT-VQA [19])

5.4.2 Evaluation Datasets

PathVQA [35] is a pathological visual question-answering dataset comprising 4,998 pathological images and 32,799 question-answer pairs. The questions are categorized into two types: open-ended and closed-ended. Open-ended questions cover a wide range of topics, including why, what, and how, while closed-ended answers are limited to responses like “yes” or “no”. The training and testing splits are specified by the dataset. For experiments using PathVQA, the models are fine-tuned on the training split and evaluated on the test split. This setting is consistent with previous studies [18] and provides the basis for comparing our CLOVER with SOTA methods. Note that all ablation studies use PathVQA by default.

QUILT-VQA [19] is another pathological visual question-answering dataset, uniquely sourced from educational video content covering diverse topics. Researchers extract valuable texts from these videos, and GPT-4 is then utilized to extract question-answer pairs from the texts, with human intervention ensuring that the pairs align with medical themes. QUILT-VQA comprises 985 visual question-answer pairs, with an average answer length of 17 words. To be consistent with previous work [19], we use a zero-shot validation scheme on QUILT-VQA. In our study, we regard all questions as open-ended to fully utilize the dataset.

Clinical dataset is used to test CLOVER in a real-world clinical setting. We design a cancer detection task, where CLOVER operates in a visual question-answering (VQA) manner for a given image patch. We collect 38 cancer tissue WSIs from the Pathology Department of Xinhua Hospital, affiliated with Shanghai Jiao Tong University School of Medicine, between April 2024 and May 2024, including 13 WSIs from the stomach and 25 WSIs from the intestines. To identify specific pathological regions within the WSIs, we collaborate with two experienced pathologists to perform fine-grained annotation with cross-checking. These annotations, provided in standard XML files, delineate the negative and positive regions within the WSIs. For testing CLOVER on patch-level VQA, we extract non-overlapping image patches (of size 512 × 512 pixels) containing the annotated sub-regions. This process yields 7,112 image patches (1,136 tumor and 2,079 non-tumor patches from the stomach, and 1,846 tumor and 2,051 non-tumor patches from the intestines). For model training and evaluation, we formulate the cancer detection task in a VQA format. The uniform question is posed as: “Is this pathological image showing a negative or positive result?” The answers are set to “this is a negative pathological image” or “this is a positive pathological image” based on the ground-truth labels. For this experiment, we implement a few-shot learning setting to test whether our model can be quickly transferred to real-world data with a few annotated samples. We divide the data into training (15 WSIs) and testing (23 WSIs) sets at the WSI level, and the number of training WSIs used depends on the setting of our five independent experiments. For each 1-shot (patches of one random WSI from each class for training) and 2-shot (two WSIs from each class) experiment, we construct the training samples using different combinations drawn from the training set. We ensure that the training and testing sets contain no WSIs from overlapping patients. The test set remains constant throughout all experiments to facilitate evaluation across different experiments.
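For illustration, the sketch below casts patch-level cancer detection in this VQA format and samples K training WSIs per class from the fixed training pool; the field names and data layout are assumptions rather than the exact preprocessing code.

```python
# Illustrative sketch of casting patch-level cancer detection as VQA and sampling K
# training WSIs per class from the fixed training pool; field names and the data layout
# are assumptions, and the test WSIs stay fixed as described above.
import random

QUESTION = "Is this pathological image showing a negative or positive result?"
ANSWERS = {0: "this is a negative pathological image",
           1: "this is a positive pathological image"}


def patches_to_vqa(patches):
    """patches: iterable of dicts with keys 'image', 'label' (0/1), and 'wsi_id'."""
    return [{"image": p["image"], "wsi_id": p["wsi_id"], "question": QUESTION,
             "answer": ANSWERS[p["label"]]} for p in patches]


def sample_k_shot_wsis(train_wsi_ids_by_class, k, seed=0):
    """Pick K WSIs per class from the training pool for one few-shot fine-tuning run."""
    rng = random.Random(seed)
    return {label: rng.sample(sorted(ids), k)
            for label, ids in train_wsi_ids_by_class.items()}
```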

5.5 Compared Methods and Instructions

We conduct a comprehensive comparison with the standard BLIP-2 and other strong baseline approaches, including VL Encoder-Decoder [51], Q2ATransformer [52], M2I2 [3], LLaVA [9], and LLaVA-Med [18]. Since VL Encoder-Decoder, Q2ATransformer and M2I2 do not have zero-shot generalization ability, these models are not considered when evaluating the performance on QUILT-VQA.

LLaVA-Med-Pathology and LLaVA-Med-Pathology-IM [18] are the public instruction datasets that we use to compare with our proposed instruction dataset. LLaVA-Med-Pathology is a high-quality conversational instruction dataset focused on the pathological domain and is a subset of LLaVA-Med [18]. LLaVA-Med-Pathology-IM is another version of LLaVA-Med-Pathology, which adds inline content to the original image descriptions as supplementary textual information. These datasets are created by transforming parts of the PMC-15M dataset [22] into instruction datasets using GPT-4 based on specific prompts.

5.6 Evaluation Metrics

For PathVQA, we report the recall of the true answer appearing in the predicted answer for open-ended questions, and the accuracy for closed-ended questions. For QUILT-VQA, we find that under zero-shot learning the answer lengths of different models vary greatly, and we thus report recall, precision, and F1-score; the F1-score, integrating recall and precision, serves as the most comprehensive of the three metrics. All the above metrics are reported as percentages. To evaluate the cost-effectiveness of different models, we further compute the ratio of performance to the logarithm of the number of trainable parameters as a key metric.
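A hedged sketch of these metrics is given below; it assumes the open-ended scores are computed from word-level overlap between the prediction and the reference answer, and the performance-to-cost ratio follows the values reported in Tables 1 and 2, which are consistent with a base-10 logarithm of the trainable-parameter count expressed in millions.

```python
# Hedged sketch of the evaluation metrics, assuming word-level overlap between the
# predicted and reference answers; the exact tokenisation and matching rules may differ.
import math
from collections import Counter


def open_ended_scores(prediction: str, reference: str):
    """Recall, precision, and F1 (in %) from word overlap between prediction and reference."""
    pred, ref = Counter(prediction.lower().split()), Counter(reference.lower().split())
    overlap = sum((pred & ref).values())
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(pred.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return 100 * recall, 100 * precision, 100 * f1


def cost_effectiveness(score: float, trainable_params_in_millions: float) -> float:
    """Performance divided by log(pt); the reported ratios match a base-10 logarithm of
    the trainable-parameter count in millions (e.g. 89.38 / log10(187) ≈ 39.3)."""
    return score / math.log10(trainable_params_in_millions)
```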

Declarations

Funding

This study is supported in part by Shanghai Artificial Intelligence Laboratory.

Competing Interests

The authors declare no competing interest.

Ethics approval and consent to participate

The Ethics Committee of Xinhua Hospital approved this study after data anonymization.

Data Availability

QUILT-1M and QUILT-VQA can be accessed at https://quilt1m.github.io/, and PathVQA can be downloaded from https://paperswithcode.com/dataset/pathvqa. The CLOVER instruction data will be released upon acceptance. The clinical dataset from Xinhua Hospital is available upon request from the corresponding author due to hospital restrictions.

Code Availability

The code and model will be released upon acceptance at https://github.com/JLINEkai/CLOVER.

Author Contribution

KC, ML, MZ and SZ were major contributors to drafting and revising the manuscript for content and to analyzing the data. FY, LM, XS, LW, XW, LZ, and ZW played major roles in the acquisition of data. KC, ML, MZ, FY, XS, LW, LM and XW substantially revised the manuscript. KC, ML, MZ and SZ conceptualized and designed the study. ML, LZ and ZW interpreted the data. All authors read and approved the final manuscript.

Appendix A Supplementary Case Study

To gain more insight into the model outputs, we present cases from visual question answering (VQA) experiments comparing LLaVA [9], LLaVA-Med [18], BLIP-2 [33], and our CLOVER on QUILT-VQA [19]. Cases are shown in Tables 12, 13, and 14. In Table 12, LLaVA describes a scenario that is irrelevant to the given question and unrelated to the reference answer. LLaVA-Med provides an answer related to cystic spaces, which does not match the reference answer. BLIP-2’s response is incorrect, overly brief, and lacks diagnostic value. In contrast, CLOVER provides an answer related to angiomyolipoma, which closely matches the reference answer and includes detailed related descriptions. In Table 13, the question involves medical knowledge, testing the model’s stored knowledge about stains. LLaVA-Med provides an incorrect response, addressing H&E staining while the question is about the specific stains 3 and 20. BLIP-2’s answers are still very brief and irrelevant. LLaVA and CLOVER perform well, providing relevant and professional answers. In Table 14, CLOVER answers correctly, staying focused on the core of the question and mentioning the absence of cellular atypia, mitotic activity, or necrosis. This closely aligns with the reference answer and clearly outperforms the other models.

Question: What is the diagnostic feature in the image? [Uncaptioned image]
Answer: The diagnostic feature in the image is the presence of fibrin thrombi within the little vessels these fibrin microthrombi in the lumens of the vessels are so characteristic, and they’re present almost 100% of the time in angiolipoma. (Reference)
LLaVA: The diagnostic feature in the image is a large, round, raised area on the back of the dog’s neck. This is likely a lump or a mass, which could be a cause for concern. It is important to have the dog examined by a veterinarian to determine the cause of the lump and to ensure that it is not a sign of a more serious condition.
LLaVA-Med: The diagnostic feature in the image is the presence of cystic spaces separated by thin fibrous septa. These cystic spaces are filled with mucin and lined by columnar epithelium. This histopathological appearance is characteristic of a mucoepidermoid cyst.
BLIP-2: Eosinophilic cytoplasm.
CLOVER: The diagnostic feature in the image is the presence of angiomyolipoma, which is a benign tumor made up of blood vessels, muscle cells, and fat cells.
Table 12: Qualitative comparisons of visual question answering on QUILT-VQA. (Image source: QUILT-VQA [19])
Question: How often are stains like 3 and 20 done without a clinical suspicion? [Uncaptioned image]
Answer: Stains like 3 and 20 are not done very often without a clinical suspicion. (Reference)
LLaVA: Stains like 3 and 20 are typically performed in the context of a clinical suspicion, such as when a patient presents with symptoms that suggest a possible infection or inflammation. These stains are often used to help diagnose certain types of infections, such as those caused by bacteria or fungi. They are not typically done as routine screening tests.
LLaVA-Med: According to the information provided, staining with hematoxylin and eosin (H&E) is done without clinical suspicion in 95% of the cases. This means that H&E staining is performed in the vast majority of cases, even if there is no specific clinical indication for it.
BLIP-2: Not often enough to be a problem.
CLOVER: Stains like 3 and 20 are not typically done without a clinical suspicion they are usually ordered when there is a concern or suspicion of liver disease based on symptoms or other clinical findings.
Table 13: Qualitative comparisons of visual question answering on QUILT-VQA. (Image source: QUILT-VQA [19])
Question: What should not be seen in the image if it is a typical example of nodular fasciitis? [Uncaptioned image]
Answer: In a typical example of nodular fasciitis, you should not see big hyperchromatic ugly pleomorphic cells like those seen in sarcoma, or pleomorphism and really marked atypia or atypical mites. (Reference)
LLaVA: If the image is a typical example of nodular fasciitis, it should not show any significant inflammation or redness around the affected area. Nodular fasciitis is a benign condition characterized by the formation of small, round, firm lumps under the skin. These lumps are usually painless and mobile, and they do not cause any significant inflammation or redness.
LLaVA-Med: In a typical example of nodular fasciitis, you would not expect to see inflammatory cells, vascularity, or fat within the mass. The image should primarily display densely packed, variably sized, fusiform mesenchymal cells with a bland nucleus, along with delicate, dense collagen fiber bundles that separate the cells.
BLIP-2: Nodular granuloma.
CLOVER: In a typical example of nodular fasciitis, there should not be any cellular atypia, mitotic activity, or necrosis these features are not typically present in nodular fasciitis.
Table 14: Qualitative comparisons of visual question answering on QUILT-VQA. (Image source: QUILT-VQA [19])

References

  • Zhang et al. [2024] Zhang, Y., Gao, J., Tan, Z., Zhou, L., Ding, K., Zhou, M., Zhang, S., Wang, D.: Data-centric foundation models in computational healthcare: A survey. arXiv preprint arXiv:2401.02458 (2024)
  • van Sonsbeek et al. [2023] van Sonsbeek, T., Derakhshani, M.M., Najdenkoska, I., Snoek, C.G., Worring, M.: Open-ended medical visual question answering through prefix tuning of language models. arXiv preprint arXiv:2303.05977 (2023)
  • Li et al. [2023] Li, P., Liu, G., Tan, L., Liao, J., Zhong, S.: Self-supervised vision-language pretraining for medical visual question answering. In: IEEE International Symposium on Biomedical Imaging, pp. 1–5 (2023)
  • Thirunavukarasu et al. [2023] Thirunavukarasu, A.J., Ting, D.S.J., Elangovan, K., Gutierrez, L., Tan, T.F., Ting, D.S.W.: Large language models in medicine. Nature Medicine 29(8), 1930–1940 (2023)
  • Singhal et al. [2023] Singhal, K., Azizi, S., Tu, T., Mahdavi, S.S., Wei, J., Chung, H.W., Scales, N., Tanwani, A., Cole-Lewis, H., Pfohl, S., et al.: Large language models encode clinical knowledge. Nature 620(7972), 172–180 (2023)
  • Wei et al. [2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al.: Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems 35, 24824–24837 (2022)
  • Schulman et al. [2022] Schulman, J., Zoph, B., Kim, C., Hilton, J., Menick, J., Weng, J., Uribe, J.F.C., Fedus, L., Metz, L., Pokorny, M., et al.: Chatgpt: Optimizing language models for dialogue. OpenAI Blog (2022)
  • Achiam et al. [2023] Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)
  • Liu et al. [2023] Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. arXiv preprint arXiv:2304.08485 (2023)
  • Alayrac et al. [2022] Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al.: Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems 35, 23716–23736 (2022)
  • Koh et al. [2023] Koh, J.Y., Salakhutdinov, R., Fried, D.: Grounding language models to images for multimodal generation. arXiv preprint arXiv:2301.13823 (2023)
  • Ye et al. [2023] Ye, Q., Xu, H., Xu, G., Ye, J., Yan, M., Zhou, Y., Wang, J., Hu, A., Shi, P., Shi, Y., et al.: mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178 (2023)
  • Lu et al. [2024] Lu, M.Y., Chen, B., Williamson, D.F., Chen, R.J., Zhao, M., Chow, A.K., Ikemura, K., Kim, A., Pouli, D., Patel, A., et al.: A multimodal generative ai copilot for human pathology. Nature, 1–3 (2024)
  • Chen et al. [2024] Chen, R.J., Ding, T., Lu, M.Y., Williamson, D.F., Jaume, G., Song, A.H., Chen, B., Zhang, A., Shao, D., Shaban, M., et al.: Towards a general-purpose foundation model for computational pathology. Nature Medicine, 1–13 (2024)
  • Song et al. [2023] Song, A.H., Jaume, G., Williamson, D.F., Lu, M.Y., Vaidya, A., Miller, T.R., Mahmood, F.: Artificial intelligence for digital and computational pathology. Nature Reviews Bioengineering 1(12), 930–949 (2023)
  • Schwalbe and Wahl [2020] Schwalbe, N., Wahl, B.: Artificial intelligence and the future of global health. The Lancet 395(10236), 1579–1586 (2020)
  • Baxi et al. [2022] Baxi, V., Edwards, R., Montalto, M., Saha, S.: Digital pathology and artificial intelligence in translational medicine and clinical practice. Modern Pathology 35(1), 23–32 (2022)
  • Li et al. [2023] Li, C., Wong, C., Zhang, S., Usuyama, N., Liu, H., Yang, J., Naumann, T., Poon, H., Gao, J.: Llava-med: Training a large language-and-vision assistant for biomedicine in one day. arXiv preprint arXiv:2306.00890 (2023)
  • Seyfioglu et al. [2023] Seyfioglu, M.S., Ikezogwo, W.O., Ghezloo, F., Krishna, R., Shapiro, L.: Quilt-llava: Visual instruction tuning by extracting localized narratives from open-source histopathology videos. arXiv preprint arXiv:2312.04746 (2023)
  • Wu et al. [2023] Wu, C., Zhang, X., Zhang, Y., Wang, Y., Xie, W.: pmc-llama: Further finetuning llama on medical papers. arXiv preprint arXiv:2304.14454 (2023)
  • Moor et al. [2023] Moor, M., Huang, Q., Wu, S., Yasunaga, M., Dalmia, Y., Leskovec, J., Zakka, C., Reis, E.P., Rajpurkar, P.: Med-flamingo: a multimodal medical few-shot learner. In: Machine Learning for Health, pp. 353–367 (2023)
  • Zhang et al. [2023] Zhang, S., Xu, Y., Usuyama, N., Bagga, J., Tinn, R., Preston, S., Rao, R., Wei, M., Valluri, N., Wong, C., et al.: Large-scale domain-specific pretraining for biomedical vision-language processing. arXiv preprint arXiv:2303.00915 (2023)
  • Ikezogwo et al. [2023] Ikezogwo, W.O., Seyfioglu, M.S., Ghezloo, F., Geva, D.S.C., Mohammed, F.S., Anand, P.K., Krishna, R., Shapiro, L.: Quilt-1m: One million image-text pairs for histopathology. arXiv preprint arXiv:2306.11207 (2023)
  • Huang et al. [2023] Huang, Z., Bianchi, F., Yuksekgonul, M., Montine, T.J., Zou, J.: A visual–language foundation model for pathology image analysis using medical twitter. Nature Medicine 29(9), 2307–2316 (2023)
  • Lu et al. [2024] Lu, M.Y., Chen, B., Williamson, D.F., Chen, R.J., Liang, I., Ding, T., Jaume, G., Odintsov, I., Le, L.P., Gerber, G., et al.: A visual-language foundation model for computational pathology. Nature Medicine 30(3), 863–874 (2024)
  • Radford et al. [2021] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021)
  • Gao et al. [2024] Gao, Y., Gu, D., Zhou, M., Metaxas, D.: Aligning human knowledge with visual concepts towards explainable medical image classification. arXiv preprint arXiv:2406.05596 (2024)
  • Zhang et al. [2023] Zhang, Y., Gao, J., Zhou, M., Wang, X., Qiao, Y., Zhang, S., Wang, D.: Text-guided foundation model adaptation for pathological image classification. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 272–282 (2023)
  • Ding et al. [2023] Ding, K., Zhou, M., Metaxas, D.N., Zhang, S.: Pathology-and-genomics multimodal transformer for survival outcome prediction. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 622–631 (2023)
  • Chung et al. [2022] Chung, H.W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, Y., Wang, X., Dehghani, M., Brahma, S., et al.: Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416 (2022)
  • Chiang et al. [2023] Chiang, W.-L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J.E., et al.: Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. https://vicuna.lmsys.org (2023)
  • Li et al. [2022] Li, J., Li, D., Xiong, C., Hoi, S.: Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022)
  • Li et al. [2023] Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597 (2023)
  • Lu et al. [2024] Lu, M.Y., Chen, B., Williamson, D.F., Chen, R.J., Zhao, M., Chow, A.K., Ikemura, K., Kim, A., Pouli, D., Patel, A., et al.: A multimodal generative ai copilot for human pathology. Nature, 1–3 (2024)
  • He et al. [2020] He, X., Zhang, Y., Mou, L., Xing, E.P., Xie, P.: Pathvqa: 30000+ questions for medical visual question answering. arXiv preprint arXiv:2003.10286 (2020)
  • Chen et al. [2022] Chen, X., Wang, X., Zhang, K., Fung, K.-M., Thai, T.C., Moore, K., Mannel, R.S., Liu, H., Zheng, B., Qiu, Y.: Recent advances and clinical applications of deep learning in medical image analysis. Medical Image Analysis 79, 102444 (2022)
  • Chang et al. [2023] Chang, Q., Yan, Z., Zhou, M., Qu, H., He, X., Zhang, H., Baskaran, L., Al’Aref, S., Li, H., Zhang, S., et al.: Mining multi-center heterogeneous medical data with distributed synthetic learning. Nature Communications 14(1), 5510 (2023)
  • Graikos et al. [2024] Graikos, A., Yellapragada, S., Le, M.-Q., Kapse, S., Prasanna, P., Saltz, J., Samaras, D.: Learned representation-guided diffusion models for large-image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8532–8542 (2024)
  • Ding et al. [2023] Ding, K., Zhou, M., Wang, H., Gevaert, O., Metaxas, D., Zhang, S.: A large-scale synthetic pathological dataset for deep learning-enabled segmentation of breast cancer. Scientific Data 10(1), 231 (2023)
  • Chung et al. [2024] Chung, H.W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, Y., Wang, X., Dehghani, M., Brahma, S., et al.: Scaling instruction-finetuned language models. Journal of Machine Learning Research 25(70), 1–53 (2024)
  • Liu et al. [2024] Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 26296–26306 (2024)
  • Peng et al. [2023] Peng, B., Li, C., He, P., Galley, M., Gao, J.: Instruction tuning with gpt-4. arXiv preprint arXiv:2304.03277 (2023)
  • Zhou et al. [2024] Zhou, C., Liu, P., Xu, P., Iyer, S., Sun, J., Mao, Y., Ma, X., Efrat, A., Yu, P., Yu, L., et al.: Lima: Less is more for alignment. Advances in Neural Information Processing Systems 36 (2024)
  • Wei et al. [2021] Wei, J., Bosma, M., Zhao, V.Y., Guu, K., Yu, A.W., Lester, B., Du, N., Dai, A.M., Le, Q.V.: Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652 (2021)
  • Ding et al. [2022] Ding, K., Zhou, M., Wang, H., Zhang, S., Metaxas, D.N.: Spatially aware graph neural networks and cross-level molecular profile prediction in colon cancer histopathology: a retrospective multi-cohort study. The Lancet Digital Health 4(11), 787–795 (2022)
  • Hu et al. [2021] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021)
  • Jin et al. [2024] Jin, Y., Li, J., Liu, Y., Gu, T., Wu, K., Jiang, Z., He, M., Zhao, B., Tan, X., Gan, Z., et al.: Efficient multimodal large language models: A survey. arXiv preprint arXiv:2405.10739 (2024)
  • Liu et al. [2022] Liu, H., Tam, D., Muqeeth, M., Mohta, J., Huang, T., Bansal, M., Raffel, C.A.: Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Advances in Neural Information Processing Systems 35, 1950–1965 (2022)
  • Vaswani et al. [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
  • Fang et al. [2023] Fang, Y., Wang, W., Xie, B., Sun, Q., Wu, L., Wang, X., Huang, T., Wang, X., Cao, Y.: Eva: Exploring the limits of masked visual representation learning at scale. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19358–19369 (2023)
  • Bazi et al. [2023] Bazi, Y., Rahhal, M.M.A., Bashmal, L., Zuair, M.: Vision–language model for visual question answering in medical imagery. Bioengineering 10(3), 380 (2023)
  • Liu et al. [2023] Liu, Y., Wang, Z., Xu, D., Zhou, L.: Q2atransformer: Improving medical vqa via an answer querying decoder. In: International Conference on Information Processing in Medical Imaging, pp. 445–456 (2023)