
Progressive Vision-Language Prompt for Multi-Organ Multi-Class Cell Semantic Segmentation with Single Branch

Qing Zhang1, Hang Guo1, Siyuan Yang2, Qingli Li1, Yan Wang1
1 Shanghai Key Laboratory of Multidimensional Information Processing,
East China Normal University, Shanghai, China.
2 The School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore.
e-mail: [email protected]
Corresponding author.
Abstract

Pathological cell semantic segmentation is a fundamental technology in computational pathology, essential for applications like cancer diagnosis and effective treatment. Given that multiple cell types exist across various organs, with subtle differences in cell size and shape, multi-organ, multi-class cell segmentation is particularly challenging. Most existing methods employ multi-branch frameworks to enhance feature extraction, but often result in complex architectures. Moreover, reliance on visual information limits performance in multi-class analysis due to intricate textural details. To address these challenges, we propose a Multi-OrgaN multi-Class cell semantic segmentation method with a single brancH (MONCH) that leverages vision-language input. Specifically, we design a hierarchical feature extraction mechanism to provide coarse-to-fine-grained features for segmenting cells of various shapes, including high-frequency, convolutional, and topological features. Inspired by the synergy of textual and multi-grained visual features, we introduce a progressive prompt decoder to harmonize multimodal information, integrating features from fine to coarse granularity for better context capture. Extensive experiments on the PanNuke dataset, which has significant class imbalance and subtle cell size and shape variations, demonstrate that MONCH outperforms state-of-the-art cell segmentation methods and vision-language models. Codes and implementations will be made publicly available.

1 Introduction

Multi-organ, multi-class, multi-cell segmentation and classification involves segmenting cell contours for different cell types in digitized patient specimens, such as Whole Slide Images (WSIs) [32, 39, 10, 30]. It is a foundational technology in computational pathology, enabling both quantification and visualization of various cell types to provide a reliable basis for diagnosis in clinical applications such as prognosis evaluation, cancer grading, and treatment planning [25, 35, 34]. Accurate multi-organ, multi-cell segmentation models are crucial in determining the nature, grade, and stage of diseases, making them highly valuable for clinical practice and worthy of further exploration.

Deep learning has rapidly advanced, achieving impressive performance in nuclei instance segmentation [45, 18] and multi-class cell semantic segmentation [3, 2]. Despite recent progress, comprehensively analyzing multi-class cells from various organs remains challenging due to the difficulty in capturing accurate features of different cell types across organs. Additionally, the high imbalance in these datasets further challenges a model’s feature extraction capability. Learning from less-diverse data makes it difficult for these models to transfer to other organs, as training datasets do not match the data distribution found in ‘the clinical wild’ [11, 12]. One solution to better extract pathological features across organs is to split the cell segmentation task into multiple branches, such as nuclei instance segmentation, position prediction, and classification [14, 9]. While these multi-branch methods have shown good performance in multi-organ multi-class cell segmentation, they are considerably more complex and less efficient than single-branch methods. In physiological environments, various cell types coexist, each exhibiting distinct textures and sizes. Fully extracting latent information and strong semantic features from feature maps is therefore crucial for downstream tasks; however, few of these methods consider multi-grained features. Existing feature fusion methods simply concatenate multi-level features, which often leads to redundant information. To address this, feature pyramid and attention-based methods have been proposed for effective multi-scale feature fusion [15, 5, 48]. Feature Pyramid Networks (FPNs), however, may lose important information during pooling or downsampling operations, and their performance can degrade when applied to datasets with significantly different distributions. Attention-based methods enable information transfer between multi-grained feature maps, enhancing feature representation by capturing more comprehensive contextual information [28, 43].

As mentioned above, traditional single-branch methods also struggle to capture diverse features, as they rely solely on visual semantic information, which limits performance when the data exhibit significant diversity. Vision-Language Models (VLMs) integrate computer vision and natural language processing, enhancing semantic features by incorporating textual information [47, 46, 13]. VLMs offer strong feature extraction capabilities due to pre-training on large-scale datasets. Their ability to align image and text features enables the use of textual information to supplement image segmentation. Despite their successes, the potential of VLMs for pathological cell segmentation remains largely underexplored.

To address these issues, we propose a network for Multi-OrgaN multi-class multi-Cell segmentation with a single-brancH, named MONCH, which combines progressive prompts with textual attributes and multi-grained visual features. First, textual attributes of different cell types are generated using GPT-4, followed by visual-textual feature fusion. Given the varying shapes and sizes of different cell types, a multi-grained visual feature extraction block (MGFE) is designed to extract comprehensive visual features, including high-frequency details, semantic information, and topological features that capture mutual information among multiple cells. To fully integrate textual and multi-grained visual information, we introduce a progressive prompt decoder (PPD) that gradually merges features from fine-grained visual features to the coarse-grained textual feature, applying the finer feature as the query for the coarser one.

The main contributions of this work are as follows:

  • We propose MONCH, a single-branch network leveraging textual and visual information, which effectively segments multiple cell types across various organs through comprehensive feature extraction and a progressive prompt decoder.

  • We design the MGFE block to extract multi-grained features from image features enhanced by a pre-trained VLM, enabling MONCH to capture detailed visual information.

  • We introduce the PPD block to integrate features from fine to coarse granularity, aiding in capturing global context while preserving visual details.

  • Extensive experiments conducted on the public PanNuke dataset demonstrate that MONCH achieves state-of-the-art performance in multi-organ, multi-class, multi-cell segmentation.

2 Related Work

2.1 Multi-Class Cell Detection and Segmentation

The distribution of multi-class pathological cells provides crucial auxiliary diagnostic information for pathologists. A common approach to analyzing multi-class cell distribution is to divide the cell segmentation framework into multiple branches, such as performing object segmentation followed by classification [1, 44]. This approach helps eliminate background interference from the original pathological images, leading to improved cell classification. Some research further adds branches to enhance contour or texture information of different cell types [19, 14]. For instance, Meta-MTL proposes a multi-task nuclei segmentation network comprising contour detection and segmentation tasks, with a feature attention module to amplify shape information [16]. Similarly, AL-Net [52] introduces an attention-based learning network that uses a multi-task learning strategy to enhance segmentation feature extraction by predicting nuclei boundaries. Beyond boundary detection, several studies focus on enhancing texture feature learning through semantic information analysis. For example, SMILE [33] leverages cell semantic segmentation and instantiation to generate a distance transformation map, improving the accuracy of multi-class cell instance segmentation. GSN-HVNET [53] employs an encoder-decoder framework to extract precise cell features for segmentation and classification tasks.

While multi-branch cell segmentation networks have shown strong performance in multi-class cell segmentation and classification, their increased computational complexity is a significant disadvantage. Therefore, designing a single-branch network that efficiently merges diverse features to achieve accurate multi-class cell segmentation remains an important research question.

Refer to caption
Figure 1: Overview of the proposed method. Textual feature extraction: textual features are extracted via a frozen text encoder based on GPT-generated cell attributes. Multi-grained visual feature extraction: multi-grained visual features are obtained from a pre-trained image encoder and enhanced via dedicated feature extraction modules, where HF$^{2}$EM is a high-frequency extraction module, Conv3*3 is a convolutional block, and TSEM is a topological structure extraction module. Feature fusion: multi-scale visual features are integrated using a feature pyramid fusion block. Progressive prompt decoder: multimodal features are progressively input into the cross-attention module as prompts to lower-level features, harmonizing the discrepancy between multi-grained visual and linguistic features.

2.2 Multi-Scale Feature Learning

Given the diverse structural and scale details present across multiple cell types, multi-scale feature fusion is crucial for capturing information ranging from low-level textures and edges to high-level semantics and contextual details. Deep learning networks extensively used for image fusion are generally categorized into CNN-based and attention-based architectures. CNN-based methods leverage their architectural design to integrate features across various scales, e.g., FPN [15, 24] and skip connections [40, 49]. However, CNN-based feature fusion methods typically operate on static input features, which increases computational demands. Attention-based methods, on the other hand, provide nonlinear strategies for multi-scale feature fusion, effectively handling features of varying semantics and scales. These methods can adjust input features dynamically, leading to the proliferation of attention-based fusion networks [38, 43]. For example, MS-CAM [8] proposes a multi-scale channel attention module for feature fusion. HiFuse [22] introduces a three-branch hierarchical medical image classification network with a self-attention module to integrate local multi-scale features. In the field of medical image analysis, several attention-based multi-scale feature fusion networks have been proposed, including AF-Net [20], MAXFormer [27], and MSA-Net [41]. While these attention-based methods generally outperform CNN-based approaches in multi-scale feature integration, current cross-attention mechanisms still struggle to fully capture the complex relationships between features at different scales.

2.3 Vision-Language Based Fine-Tuning

Most visual analysis research focuses on training a single-branch deep neural network using large annotated datasets, which is a laborious and time-consuming process [31, 50]. Recently, Vision-Language Models (VLMs) have gained significant attention for their ability to learn intricate correlations between visual and linguistic features using web-scale image-text pairs [51, 54]. Leveraging VLMs has had a substantial impact on computational pathology, particularly since the introduction of Contrastive Language-Image Pre-training (CLIP) [36]. For example, PLIP [21] introduces a novel approach to pathology by developing a language-image pre-training model capable of analyzing multimodal data. UNI [6] effectively adapts VLMs originally trained on natural images to various pathological tasks by leveraging large-scale pathological image-text pairs. CONCH [29] proposes a pioneering vision-language foundation model that employs contrastive learning on image-caption pairs, addressing several downstream tasks such as image analysis, text-to-image, and image-to-text retrieval.

While vision-language foundation models have shown promising results in various tasks, they primarily focus on organ-level analysis. However, subtle changes and interactions within cells can provide pathologists with more detailed structural and functional insights, allowing for the identification and examination of heterogeneity among cellular populations [17]. Therefore, further exploration of cell-level VLMs is a promising area of research.

3 Method

3.1 Problem Setting and Network Architecture

Mathematically, given a WSI $\mathbf{X}_{w}\in\mathbb{R}^{W_{w}\times H_{w}\times 3}$ of size $W_{w}\times H_{w}$, a set of patches $\mathbf{X}\in\mathbb{R}^{W_{p}\times H_{p}\times 3}$ is cropped from the WSI. These images contain $N$ different cell types and are collected from multiple organs. Our goal is to predict the pixel-level label map $\hat{\mathbf{Y}}\in\{0,1,\dots,N\}^{W_{p}\times H_{p}}$, where $N$ is the number of cell types, based on both image and textual prompts.

To address this problem, we propose a novel framework for Multi-OrgaN multi-class multi-Cell segmentation with a single-brancH, named MONCH. As shown in Fig. 1, the proposed MONCH incorporates a coarse-to-fine visual feature extraction mechanism with a progressive vision-language prompt decoder to efficiently fuse textual and multi-grained visual features. Specifically, we first generate cell attributes $\mathcal{T}$ using GPT-4, which encompass a background description and $N$ cell-type descriptions; each of the $N+1$ types is described with three sentences. We then encode the cell descriptions $\mathbf{S}$ and the pathological cell image $\mathbf{X}$ using the image and text encoders inherited from a pre-trained VLM, thereby obtaining the textual feature $\mathbf{F}^{\mathcal{T}}$ and the multi-grained image features $\mathbf{F}^{\mathbf{X}}$, calculated as:

$\mathbf{F}^{\mathbf{X}}=\{\mathbf{F}^{c},\mathbf{F}^{m},\mathbf{F}^{f}\},$ (1)

where $\mathbf{F}^{m}=\mathcal{G}_{VLM}(\mathbf{X},\mathbf{S})$, $\mathbf{F}^{c}=\mathcal{G}_{ds}(\mathbf{F}^{m})$, and $\mathbf{F}^{f}=\mathcal{G}_{us}(\mathbf{F}^{m})$. Here, $\mathcal{G}_{VLM}(\cdot)$ is the image encoder of the pre-trained VLM, $\mathcal{G}_{ds}(\cdot)$ is the downscale block, and $\mathcal{G}_{us}(\cdot)$ is the upscale block. The generated multi-grained features are the middle-grained $\mathbf{F}^{m}\in\mathbb{R}^{B\times C\times W\times H}$, the fine-grained $\mathbf{F}^{f}\in\mathbb{R}^{B\times C\times 2W\times 2H}$, and the coarse-grained $\mathbf{F}^{c}\in\mathbb{R}^{B\times C\times W/2\times H/2}$, where $B$ is the batch size and $C$ is the channel size. Given the varying textures and scale information across different cell types, we introduce a coarse-to-fine feature extraction module to enhance the multi-grained features generated from the pre-trained image encoder.
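For concreteness, a minimal PyTorch sketch of Eq. (1) is given below; the strided convolution for the downscale block $\mathcal{G}_{ds}$, the transposed convolution for the upscale block $\mathcal{G}_{us}$, and the way the text prompt enters $\mathcal{G}_{VLM}$ are assumptions for illustration only.

```python
import torch
import torch.nn as nn


class MultiGrainedFeatures(nn.Module):
    """Sketch of Eq. (1): derive coarse/medium/fine features from a pre-trained
    VLM image encoder. `vlm_image_encoder` is a placeholder for G_VLM."""

    def __init__(self, vlm_image_encoder: nn.Module, channels: int):
        super().__init__()
        self.encoder = vlm_image_encoder
        # G_ds: strided convolution halves the spatial resolution (W/2 x H/2).
        self.downscale = nn.Conv2d(channels, channels, kernel_size=3, stride=2, padding=1)
        # G_us: transposed convolution doubles the spatial resolution (2W x 2H).
        self.upscale = nn.ConvTranspose2d(channels, channels, kernel_size=2, stride=2)

    def forward(self, image: torch.Tensor):
        f_m = self.encoder(image)      # medium-grained F^m: B x C x W x H
        f_c = self.downscale(f_m)      # coarse-grained  F^c: B x C x W/2 x H/2
        f_f = self.upscale(f_m)        # fine-grained    F^f: B x C x 2W x 2H
        return f_c, f_m, f_f
```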

Fine-grained features capture intricate details of pathological cells, which motivates the integration of a high-frequency information extraction module to enhance cell texture features. Coarse-grained features, while lacking detailed information, retain essential cell distribution insights; to further enrich their structural and semantic content, we propose learning topological features. To preserve the original characteristics of the image, we incorporate a convolutional block to capture local and textural nuances.

The aforementioned comprehensive visual features, ranging from coarse- to fine-grained, are then integrated with the embedded features derived from the attribute prompts, resulting in enhanced image features that simultaneously capture the visual and textual attributes of cells. To fully integrate these multimodal features, we design a progressive vision-language prompt decoder that iteratively adopts the finer feature set as the query for the subsequent, coarser feature set. The coarser features are refined by fusing them through the multi-head self-attention mechanism, with the finer-grained features serving as the guiding query. Based on this progressive prompt learning approach, all features are effectively integrated, facilitating superior performance in multi-class cell semantic segmentation.

Refer to caption
Figure 2: Progressive Vision-Language Prompt Decoder: multimodal information, including textual features and multi-grained visual features, progressively serves as queries in the multi-head self-attention modules to harmonize features from fine-coarse-fine granularity.

3.2 Coarse-to-Fine Visual Feature Enhancement

To enhance the representation of pathological cells, we introduce a coarse-to-fine visual feature extraction module for multi-grained visual feature enhancement. Our approach utilizes text and image encoders from a pre-trained VLM to align textual and visual features. The initial medium-grained image feature, $\mathbf{F}^{m}$, is obtained from the pre-trained image encoder, capturing fundamental local details such as cell edges, textures, and essential structures. We further process $\mathbf{F}^{m}$ by upscaling it to obtain a fine-grained feature $\mathbf{F}^{f}$ and downscaling it to derive a coarse-grained feature $\mathbf{F}^{c}$. These transformations provide multi-grained perspectives of the pathological cells. A convolutional module is then applied to adapt $\mathbf{F}^{m}$ to the new task, yielding a convolutional feature $\mathbf{F}^{v}=\text{conv}(\mathbf{F}^{m})$. For fine-grained feature enhancement, we design a high-pass filter module that produces a refined high-frequency image feature $\mathbf{F}^{h}$. This process begins by transforming the fine-grained feature $\mathbf{F}^{f}$ at position $(x,y)$ into the frequency domain using a Fourier transform $\mathcal{F}(\cdot)$:

$\mathbf{F}^{f^{\prime}}(x,y)=\mathcal{F}(\mathbf{F}^{f}(x,y)).$ (2)

We then extract the high-frequency feature by applying a high-pass filter $\mathcal{H}(\cdot)$, based on the resolution of $\mathbf{F}^{f}$:

$\mathbf{F}^{f^{\prime\prime}}(x,y)=\mathcal{H}(\mathbf{F}^{f^{\prime}}(x,y)),$ (3)

where $\mathbf{F}^{f^{\prime\prime}}$ represents the high-frequency component. The refined high-frequency feature $\mathbf{F}^{h}$ is obtained through an inverse Fourier transform $\mathcal{F}^{-1}(\cdot)$ and a residual operation:

$\mathbf{F}^{h}(x,y)=\mathcal{F}^{-1}(\mathbf{F}^{f^{\prime\prime}}(x,y))+\mathbf{F}^{f}(x,y).$ (4)
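As a concrete illustration, the following PyTorch sketch implements Eqs. (2)-(4) with `torch.fft`; the box-shaped high-pass mask and its `cutoff_ratio` are assumptions, since the text only states that the filter depends on the resolution of $\mathbf{F}^{f}$.

```python
import torch


def high_frequency_enhance(f_fine: torch.Tensor, cutoff_ratio: float = 0.25) -> torch.Tensor:
    """Sketch of Eqs. (2)-(4): FFT -> high-pass mask -> inverse FFT -> residual."""
    _, _, H, W = f_fine.shape
    spectrum = torch.fft.fftshift(torch.fft.fft2(f_fine))        # Eq. (2), centred spectrum

    # High-pass mask: zero out a central low-frequency box (assumed filter shape).
    mask = torch.ones(H, W, device=f_fine.device)
    ch, cw = H // 2, W // 2
    rh, rw = int(H * cutoff_ratio / 2), int(W * cutoff_ratio / 2)
    mask[ch - rh:ch + rh, cw - rw:cw + rw] = 0.0

    high = spectrum * mask                                        # Eq. (3)
    f_high = torch.fft.ifft2(torch.fft.ifftshift(high)).real      # back to the spatial domain
    return f_high + f_fine                                        # Eq. (4), residual connection
```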

For the coarse-grained visual feature, we introduce a topological feature extraction module to better capture intrinsic structural and shape information. Given the coarse-grained feature $\mathbf{F}^{c}$, we first normalize it to reduce its channel dimension, obtaining a blended feature $\mathbf{F}^{c^{\prime}}$ with dimensions $B\times C/2\times W/2\times H/2$. We then apply dilated KNN to compute pixel-level pairwise distances, obtaining the $k=9$ nearest neighbors for each point, with dimensions $B\times C/2\times k\times k$. For each image feature $\mathbf{F}^{c^{\prime}}$ in its spatial dimension, the pixel-level pairwise distance is calculated as:

$\mathbf{D}=||\mathbf{F}^{c^{\prime}}||^{2},\quad\hat{\mathbf{D}}=\mathbf{D}-2\times(\mathbf{F}^{c^{\prime}}\cdot{\mathbf{F}^{c^{\prime}}}^{\top})+\mathbf{D}^{\top},$ (5)

where $\mathbf{D}$ is the pairwise distance of $\mathbf{F}^{c^{\prime}}$. The $k$ nearest neighbors of each spatial-level feature are then extracted based on the distance $\hat{\mathbf{D}}$, resulting in the topological structure $\mathbf{F}^{k}\in\mathbb{R}^{B\times C/2\times k\times k}$. The enhanced topological feature $\mathbf{F}^{t}$ is obtained through a residual connection:

$\mathbf{F}^{t}=\mathbf{F}^{k}+\mathbf{F}^{c}.$ (6)
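A minimal PyTorch sketch of Eqs. (5)-(6) is shown below; the 1x1 channel reduction, the mean aggregation over neighbours, and the 1x1 expansion used to restore the channel count before the residual are assumptions about details the text leaves open.

```python
import torch
import torch.nn as nn


class TopoEnhance(nn.Module):
    """Sketch of Eqs. (5)-(6): pairwise distances, k nearest neighbours,
    and a residual connection on the coarse-grained feature."""

    def __init__(self, channels: int, k: int = 9):
        super().__init__()
        self.k = k
        self.reduce = nn.Conv2d(channels, channels // 2, kernel_size=1)  # blend to C/2 channels
        self.expand = nn.Conv2d(channels // 2, channels, kernel_size=1)  # restore C channels (assumed)

    def forward(self, f_coarse: torch.Tensor) -> torch.Tensor:
        B, C, H, W = f_coarse.shape
        f = self.reduce(f_coarse).flatten(2).transpose(1, 2)             # B x N x C/2, N = H*W
        # Eq. (5): squared Euclidean pairwise distances D_hat = D - 2 F F^T + D^T.
        sq = (f ** 2).sum(dim=-1, keepdim=True)                          # B x N x 1
        dist = sq - 2 * torch.bmm(f, f.transpose(1, 2)) + sq.transpose(1, 2)
        # k nearest neighbours per pixel (smallest distances), gathered and averaged.
        idx = dist.topk(self.k, dim=-1, largest=False).indices           # B x N x k
        nbrs = torch.gather(
            f.unsqueeze(1).expand(B, f.size(1), f.size(1), f.size(2)),
            2, idx.unsqueeze(-1).expand(-1, -1, -1, f.size(2)))
        f_topo = nbrs.mean(dim=2).transpose(1, 2).reshape(B, C // 2, H, W)
        return self.expand(f_topo) + f_coarse                            # Eq. (6): residual connection
```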

By combining these enhanced multi-grained visual features, the proposed MONCH comprehensively captures both local fine details and global structural information, leading to a more complete representation of pathological cells.

Refer to caption
Figure 3: PanNuke Cell Distribution Map. Distribution of each of the 19 organ types and 5 cell types.

3.3 Progressive Prompt Decoder

To effectively harmonize the linguistic features with the multi-grained visual features, we introduce a progressive vision-language prompt decoder that generates a comprehensive representation for enhanced analysis, as illustrated in Fig. 2. In this framework, the finer-grained features are sequentially used as queries for the coarser-grained features to obtain a robust representation.

To reconcile knowledge differences between these modalities and extract additional textual information, we propose an attention mechanism that leverages the complementarity of visual and language features. A finer query implies higher distinctiveness in the attention mechanism, enabling better differentiation between various parts of the input feature and the capture of more meaningful contextual information. This attention-based hierarchical approach therefore ensures that the query feature effectively captures the core requirements of the task, providing key details for refinement at each level. MONCH progressively sets the finer-grained feature as the query for the coarser-grained feature, from $\mathbf{F}^{h}$ to $\mathbf{F}^{\mathcal{T}}$, i.e., $\mathbf{F}^{h}\to\mathbf{F}^{v}\to\mathbf{F}^{t}\to\mathbf{F}^{\mathcal{T}}$. Specifically, using $\mathbf{F}^{h}$ as the query for $\mathbf{F}^{v}$, the multi-head self-attention module is calculated as follows:

$\mathcal{G}_{ms}(\mathbf{F}^{h},\mathbf{F}^{v})=\mathtt{softmax}\left(\frac{g_{q}(\mathbf{F}^{h})\cdot{g_{k}(\mathbf{F}^{v})}^{\top}}{\sqrt{d_{k}}}\right)\cdot g_{v}(\mathbf{F}^{v}),$ (7)

where $\mathcal{G}_{ms}(\cdot)$ represents the multi-head self-attention layer, $g_{q}(\cdot)$, $g_{k}(\cdot)$, and $g_{v}(\cdot)$ are projection functions, and $d_{k}$ is the dimension of the key. The subsequent multi-head self-attention modules are calculated in the same manner.
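The attention step of Eq. (7) can be sketched with PyTorch's built-in multi-head attention, with the finer feature supplying the query and the coarser feature supplying keys and values; the residual connection, the shared channel dimension, and the exact chaining of the three stages are assumptions.

```python
import torch
import torch.nn as nn


class ProgressivePromptStep(nn.Module):
    """Sketch of Eq. (7): the finer feature acts as the query, the coarser
    feature provides keys and values."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, fine_feat: torch.Tensor, coarse_feat: torch.Tensor) -> torch.Tensor:
        # fine_feat, coarse_feat: B x L x C token sequences (flattened feature maps),
        # assumed to share the channel dimension C.
        out, _ = self.attn(query=fine_feat, key=coarse_feat, value=coarse_feat)
        return out + fine_feat   # residual keeps the fine-grained details (assumed)


# Possible chaining F^h -> F^v -> F^t -> F^T (f_text projected to the same dim):
# step1, step2, step3 = (ProgressivePromptStep(256) for _ in range(3))
# x = step1(f_h, f_v); x = step2(x, f_t); x = step3(x, f_text)
```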

As shown in Fig. 2, after the three-iteration attention architecture progressing from fine-grained to coarse-grained features, the interactively learned features are ultimately merged with a blended feature, which is generated by fusing the three-scale features $\{\mathbf{F}^{c},\mathbf{F}^{m},\mathbf{F}^{f}\}$ obtained from the pre-trained image encoder using the FPN-like feature fusion block in Fig. 1. To enhance the detailed cell features, we set the iteratively generated visual feature as the query to this blended feature. Through this progressive prompt decoding strategy, we effectively harmonize the multimodal and multi-granular features of pathological cells, ensuring accurate integration and representation of diverse characteristics. A final merge step further enhances feature richness, resulting in a more robust representation.
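One simple reading of the FPN-like fusion block that produces the blended feature is sketched below; resampling both $\mathbf{F}^{c}$ and $\mathbf{F}^{f}$ to the medium resolution and summing them is an assumption about its internals.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeaturePyramidFusion(nn.Module):
    """Sketch of an FPN-like block blending {F^c, F^m, F^f} into one map at
    the medium resolution; the exact fusion used in Fig. 1 may differ."""

    def __init__(self, channels: int):
        super().__init__()
        self.smooth = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, f_c: torch.Tensor, f_m: torch.Tensor, f_f: torch.Tensor) -> torch.Tensor:
        # Resample coarse and fine features to the medium resolution and sum.
        f_c_up = F.interpolate(f_c, size=f_m.shape[-2:], mode="bilinear", align_corners=False)
        f_f_down = F.interpolate(f_f, size=f_m.shape[-2:], mode="bilinear", align_corners=False)
        return self.smooth(f_c_up + f_m + f_f_down)
```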

3.4 Loss Function

The proposed MONCH builds upon a pre-trained VLM architecture, enhanced with several specifically designed adapters. Our network takes as input a combination of limited pathological cell descriptions and their associated images. To optimize learning, we freeze the text encoder while fine-tuning the image encoder, allowing better adaptation to pathological cell feature analysis. Following the progressive vision-language prompt decoder, we obtain a merged feature $\mathbf{F}^{m}$ that effectively integrates both textual and visual features. To further refine this representation, we introduce a forward propagation module that takes the textual feature $\mathbf{F}^{\mathcal{T}}$ and the merged feature $\mathbf{F}^{m}$ as inputs:

$\mathbf{F}^{m}=\mathcal{G}_{conv}(\mathcal{G}_{reshape}(\mathcal{G}_{v}(\mathbf{F}^{m}),\mathcal{G}_{weight}(\mathcal{G}_{linear}(\mathbf{F}^{\mathcal{T}})))),$ (8)

where $\mathcal{G}_{reshape}$ is an operation that transforms a feature with a resolution of $B\times C\times W\times H$ into a matrix with dimensions $(1,B\times C,W,H)$, $\mathcal{G}_{v}$ represents a series of sequential convolution layers for processing visual features, $\mathcal{G}_{linear}$ is a linear transformation applied to the textual feature, and $\mathcal{G}_{weight}$ reshapes the resulting weights for the subsequent convolution operation $\mathcal{G}_{conv}$.
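Eq. (8) can be read as a dynamic (grouped) convolution whose kernels are generated from the textual feature; a sketch under that reading is given below, where the per-sample pooled text embedding, the 1x1 kernel size, and the class-wise output are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TextConditionedHead(nn.Module):
    """Sketch of Eq. (8): text-derived kernels applied to the merged visual
    feature through a grouped convolution."""

    def __init__(self, vis_dim: int, text_dim: int, num_classes: int):
        super().__init__()
        self.g_v = nn.Sequential(                                  # G_v: refine the visual feature
            nn.Conv2d(vis_dim, vis_dim, 3, padding=1), nn.ReLU(inplace=True))
        self.g_linear = nn.Linear(text_dim, num_classes * vis_dim)  # G_linear
        self.num_classes = num_classes

    def forward(self, f_merged: torch.Tensor, f_text: torch.Tensor) -> torch.Tensor:
        # f_merged: B x C x H x W; f_text: B x text_dim (pooled text embedding, assumed).
        B, C, H, W = f_merged.shape
        v = self.g_v(f_merged).reshape(1, B * C, H, W)             # G_reshape: (1, B*C, H, W)
        # G_weight: per-sample 1x1 kernels of shape (B*num_classes, C, 1, 1).
        w = self.g_linear(f_text).reshape(B * self.num_classes, C, 1, 1)
        # G_conv: grouped conv applies each sample's text-derived kernels to its own feature.
        logits = F.conv2d(v, w, groups=B)                          # (1, B*num_classes, H, W)
        return logits.reshape(B, self.num_classes, H, W)
```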

For accurate multi-class cell semantic segmentation, we introduce a segmentation loss function $\mathcal{L}_{seg}$. We employ binary cross-entropy loss to facilitate precise segmentation of each cell class:

$\mathcal{L}_{seg}(y,p)=-\frac{1}{N}\frac{1}{B}\sum_{n=1}^{N}\sum_{i=1}^{B}[y_{in}\log(p_{in})+(1-y_{in})\log(1-p_{in})],$ (9)

where $y_{in}$ is the $i^{th}$ ground truth for cell class $n$, and $p_{in}$ is the $i^{th}$ predicted segmentation map for the $n^{th}$ cell class.
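A minimal PyTorch equivalent of Eq. (9), assuming the network outputs one logit map per cell class, is the following sketch.

```python
import torch
import torch.nn.functional as F


def segmentation_loss(pred_logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Sketch of Eq. (9): per-class binary cross-entropy averaged over the batch
    and the N cell classes. pred_logits and target are both B x N x H x W;
    the sigmoid is applied internally by the `with_logits` variant."""
    return F.binary_cross_entropy_with_logits(pred_logits, target.float(), reduction="mean")


# If the ground truth is a single index map (B x H x W, values 0..N), the binary
# targets can be built with one-hot encoding, dropping the background channel:
# target = F.one_hot(labels, num_classes=N + 1).permute(0, 3, 1, 2)[:, 1:].float()
```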

4 Results

Table 1: Evaluation against SOTA cell segmentation methods in cell types from PanNuke. The best results are highlighted in bold.
Model Neoplastic Epithelial Inflammatory Connective Dead Inference (ms)
IoU Precision F1 IoU Precision F1 IoU Precision F1 IoU Precision F1 IoU Precision F1
HoVer-Net [14] 0.6343 0.7372 0.7762 0.5171 0.6999 0.6817 0.4515 0.7188 0.6221 0.4316 0.6496 0.6030 0.1479 0.2130 0.2577 98.218
TSFD-Net [23] 0.6573 0.7418 0.7932 0.5893 0.7195 0.7416 0.5153 0.6983 0.6801 0.4600 0.6649 0.6301 0.0118 0.4215 0.0234 742.81
CPP-Net [7] 0.6497 0.7898 0.7876 0.5749 0.7410 0.7300 0.4717 0.6276 0.6410 0.4402 0.5977 0.6113 0.1543 0.2856 0.2673 352.353
Med-SA [42] 0.5970 0.7405 0.7477 0.3954 0.7773 0.5667 0.3007 0.6247 0.4624 0.3413 0.5424 0.5089 0.0009 0.6513 0.0018 182.901
MA-SAM [4] 0.3235 0.3848 0.4889 0.2127 0.2164 0.3507 0.0508 0.0895 0.0967 0.1576 0.1858 0.2724 0.0401 0.0414 0.0772 33.129
MONCH 0.6679 0.7898 0.8009 0.6572 0.7841 0.7931 0.4984 0.6624 0.6652 0.4677 0.6709 0.6373 0.1767 0.5194 0.3003 214.456

4.1 Experimental Setup

Datasets. We adopt the well-known PanNuke pathological cell segmentation dataset [11], which consists of H&E-stained samples from 19 organs, with annotations for various cell types, including neoplastic, epithelial, inflammatory, connective, and dead cells. The dataset contains 189,744 annotated cells, each labeled with both class and shape information. It comprises 7,094 patches captured at 40x magnification, each with a resolution of 256×256 pixels. As shown in Fig. 3, PanNuke is highly imbalanced in terms of both cell types and organ types, making it one of the most challenging datasets for cell semantic segmentation.

Implementation details. The proposed MONCH framework leverages pre-trained text and image encoders from CLIP, with the image encoder further fine-tuned for our specific task. For PanNuke, we use the three-fold splits provided by the dataset organizers to divide the data into training, validation, and testing sets. We train MONCH for 40 epochs using the Adam optimizer, with an initial learning rate of 5e-5 and a learning rate decay factor of 0.1. The text input length for MONCH is set to 77, with an embedding dimension of 1024. We implement our method using PyTorch, and all experiments were conducted on NVIDIA GeForce RTX 3090 GPUs.
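A minimal training-loop sketch reflecting this setup is given below; the step size of the learning-rate decay, the attribute name of the frozen text encoder, and the data-loader interface are assumptions, while the optimizer, learning rate, decay factor, and number of epochs follow the text.

```python
import torch

# `model`, `train_loader`, and `segmentation_loss` stand in for the components
# described in this paper; only the stated hyper-parameters are taken from the text.
for p in model.text_encoder.parameters():          # text encoder kept frozen (attribute name assumed)
    p.requires_grad_(False)

optimizer = torch.optim.Adam(
    filter(lambda p: p.requires_grad, model.parameters()), lr=5e-5)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)  # decay factor 0.1

for epoch in range(40):
    for images, texts, labels in train_loader:
        logits = model(images.cuda(), texts)        # vision-language forward pass
        loss = segmentation_loss(logits, labels.cuda())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```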

Refer to caption
Figure 4: Visualization of multi-organ, multi-cell semantic segmentation in PanNuke.

Metrics. For evaluation, we report Pixel Accuracy (PA), Intersection over Union (IoU), Frequency Weighted Intersection over Union (FWIoU), Precision, and F1 Score for the multi-class cell semantic segmentation task.
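For reference, the per-class metrics can be computed from integer label maps as sketched below (IoU, Precision, and F1 per cell class); PA and FWIoU follow the standard confusion-matrix definitions and are omitted for brevity.

```python
import numpy as np


def per_class_metrics(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> dict:
    """Sketch of the reported per-class metrics from integer label maps,
    where class 0 is assumed to be background."""
    metrics = {}
    for c in range(1, num_classes + 1):
        tp = np.logical_and(pred == c, gt == c).sum()
        fp = np.logical_and(pred == c, gt != c).sum()
        fn = np.logical_and(pred != c, gt == c).sum()
        iou = tp / (tp + fp + fn + 1e-8)
        precision = tp / (tp + fp + 1e-8)
        recall = tp / (tp + fn + 1e-8)
        f1 = 2 * precision * recall / (precision + recall + 1e-8)
        metrics[c] = {"IoU": iou, "Precision": precision, "F1": f1}
    return metrics
```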

Table 2: Evaluation against SOTA cell segmentation methods in organ types from PanNuke. The best results are highlighted in bold.
Organ HoVer-Net [14] TSFD-Net [23] CPP-Net [7] MONCH
IoU FWIoU F1 IoU FWIoU F1 IoU FWIoU F1 IoU FWIoU F1
Adrenal 0.4404 0.9112 0.5573 0.4625 0.9127 0.6309 0.4981 0.9202 0.6543 0.5104 0.9159 0.6145
Bile Duct 0.4445 0.8942 0.6004 0.5738 0.8929 0.7092 0.5622 0.8945 0.6973 0.6164 0.8892 0.7484
Bladder 0.4377 0.9242 0.5618 0.5166 0.9190 0.6156 0.5154 0.9206 0.6103 0.5362 0.9169 0.6290
Breast 0.4898 0.8747 0.6046 0.5254 0.8808 0.6907 0.5154 0.9206 0.6103 0.5305 0.8793 0.6364
Cervix 0.4161 0.8873 0.5137 0.4329 0.8896 0.5959 0.4333 0.8812 0.5904 0.5382 0.8810 0.6421
Colon 0.5305 0.8630 0.6715 0.5211 0.8670 0.6912 0.5017 0.8661 0.6714 0.5961 0.8679 0.7434
Esophagus 0.5090 0.8830 0.6349 0.5069 0.8816 0.6725 0.4733 0.8791 0.6366 0.5313 0.8713 0.6942
Head & Neck 0.4630 0.8961 0.5915 0.4844 0.8989 0.6516 0.4780 0.9038 0.6378 0.5472 0.8913 0.6776
Kidney 0.3494 0.9035 0.4921 0.4792 0.9128 0.6495 0.5746 0.9187 0.7116 0.6170 0.9091 0.7480
Liver 0.4807 0.9028 0.6013 0.5151 0.9069 0.6821 0.4923 0.9071 0.6646 0.5330 0.9057 0.6381
Lung 0.4807 0.9028 0.6013 0.3539 0.8307 0.5253 0.3890 0.8367 0.5568 0.4591 0.8244 0.5941
Ovarian 0.4941 0.8412 0.6754 0.6208 0.8440 0.7563 0.6507 0.8516 0.7714 0.6413 0.8352 0.7680
Pancreatic 0.3910 0.8641 0.5907 0.5191 0.8681 0.6806 0.5315 0.8735 0.6847 0.6390 0.8804 0.7686
Prostate 0.4126 0.8786 0.5536 0.5242 0.8704 0.6467 0.4471 0.8657 0.5903 0.5811 0.8736 0.7204
Skin 0.3386 0.8018 0.4748 0.4662 0.8010 0.6347 0.4317 0.8147 0.6174 0.5895 0.8172 0.7202
Stomach 0.5049 0.8724 0.6624 0.5697 0.8677 0.7361 0.5639 0.8739 0.7292 0.5461 0.8466 0.6836
Testis 0.5121 0.8823 0.6867 0.6515 0.8838 0.7805 0.6507 0.8900 0.7792 0.6622 0.8820 0.7897
Thyroid 0.4310 0.8670 0.5643 0.4314 0.8673 0.6025 0.4555 0.8778 0.6245 0.5088 0.8861 0.6754
Uterus 0.3837 0.8225 0.4753 0.3921 0.8222 0.5189 0.3895 0.8245 0.5183 0.4610 0.8044 0.6088
Average 0.5203 0.8738 0.6563 0.5282 0.8765 0.6735 0.5383 0.8772 0.6680 0.5662 0.8743 0.7035
STD 0.0032 0.0009 0.0042 0.0050 0.0010 0.0044 0.0053 0.0009 0.0044 0.0032 0.0010 0.0035
Refer to caption
Figure 5: F1 Score of evaluation against SOTA cell segmentation methods in organ types from PanNuke.

4.2 Comparison with State-of-the-Art

We evaluate MONCH against state-of-the-art (SOTA) pathological cell segmentation methods that utilize visual inputs, including HoVer-Net [14], TSFD-Net [23], and CPP-Net [7]. As shown in Table 1, MONCH achieves superior performance across almost all evaluation metrics while maintaining a simpler single-branch architecture. Notably, our method demonstrates robust performance on PanNuke, which exhibits significant class imbalance, particularly for the epithelial and dead cell categories. The vision-language approach of MONCH proves especially effective in handling limited-data scenarios compared to purely vision-based alternatives. A key advantage of MONCH lies in its architectural efficiency: while competing methods often employ multi-branch networks, our approach achieves comparable or superior results with a single branch, and its inference time is comparable to that of those methods, as shown in Table 1. Given that MONCH leverages vision-language model fine-tuning, we also benchmark against SOTA fine-tuned medical image segmentation approaches, specifically Med-SA [42] and MA-SAM [4], in Table 1. This comparison provides insights into the effectiveness of our fine-tuning strategy within the medical imaging domain. MA-SAM is designed for medical imaging by fine-tuning SAM, but it fails to represent pathological data of higher complexity; Med-SA performs much better due to its ability to represent multimodal medical images. Our method combines textual and multi-grained visual information, resulting in strong characterization capabilities on diverse pathological cell samples. Fig. 4 illustrates that MONCH segments the diverse cell types considerably better.

The multi-organ composition of PanNuke makes assessing organ-specific performance crucial for validating model robustness. Table 2 presents the organ-wise comparison between MONCH and SOTA methods. The results show that MONCH maintains consistent performance across diverse organ types while achieving comparable or superior metrics to existing approaches, indicating its strong generalization capability. Fig. 5 shows the F1 Score of the proposed MONCH against SOTA cell segmentation methods, demonstrating that MONCH achieves the most balanced evaluation results across all organ types.

Table 3: F1 Score value comparison of the proposed MONCH with different strategies. MGFE is the multi-grained visual feature extraction block, and PPD is the progressive prompt decoder.
Dataset MGFE PPD Neoplastic Epithelial Inflammatory Connective Dead
PanNuke × ✓ 0.7720 0.7566 0.6125 0.5959 0.1333
✓ × 0.7289 0.7192 0.5638 0.5343 0.0815
✓ ✓ 0.8009 0.7931 0.6652 0.6373 0.3003

4.3 Ablation Studies

Proposed strategies. As listed in Table 3, the proposed MONCH achieves the best performance when both strategies are employed together. Without the multi-grained visual feature extraction block, the multi-grained features generated directly by the pre-trained VLM are fed into the subsequent progressive prompt decoder; Table 3 shows that the F1 Score then decreases by over 6% for almost all cell types except epithelial cells, which have adequate data in PanNuke. To evaluate the effectiveness of the progressive prompt decoder block, we remove it from the proposed method and simply fuse the preceding multi-grained visual features with the fusion module in Fig. 1. The F1 Score drops by over 9% for all cell types of PanNuke, showing that iteratively learning from fine to coarse features, finally merged with a blended feature, effectively integrates multimodal features. Table 3 also shows that removing either module leads to a large performance degradation for dead cells, indicating that both modules enhance model robustness on this imbalanced dataset.

Table 4: F1 Score value comparison of the proposed MONCH with different visual feature extraction strategies. HF represents the high-frequency feature enhancement module, Conv represents the convolution module, and Topo represents the topological structure enhancement module. The × symbol means that the corresponding feature extraction block is replaced with the convolution block.
Dataset HF Conv Topo Neoplastic Epithelial Inflammatory Connective Dead
PanNuke × 0.8004 0.7888 0.6630 0.6380 0.2449
× 0.7915 0.7806 0.6498 0.6322 0.3041
0.8009 0.7931 0.6652 0.6373 0.3003

Visual feature extraction setting. To evaluate the necessity of the multi-grained visual feature extraction modules, we conduct ablation experiments; the F1 Score values are listed in Table 4. Compared with the simple convolution feature alone, adding either complementary visual feature yields performance improvements. The table also shows that both the high-frequency feature extraction block and the topological structure enhancement block are important for representing dead cells, for which very limited data are available.

Table 5: F1 Score value comparison of the proposed MONCH with different prompt decoders. $\{q_{h},q_{v},q_{t},q_{T}\}$ represent the queries for the next-level feature in the multi-head self-attention of Fig. 1. The × symbol means that the feature from the previous level is carried over to the multi-head self-attention mechanism at the next level.
$q_{h}$ $q_{v}$ $q_{t}$ $q_{T}$ Neoplastic Epithelial Inflammatory Connective Dead
× ✓ ✓ ✓ 0.7890 0.7806 0.6460 0.6213 0.2136
✓ × ✓ ✓ 0.7912 0.7800 0.6477 0.6221 0.3001
✓ ✓ × ✓ 0.7895 0.7744 0.6489 0.6062 0.1804
✓ ✓ ✓ × 0.7750 0.7643 0.6273 0.5781 0.0620
✓ ✓ ✓ ✓ 0.8009 0.7931 0.6652 0.6373 0.3003

Progressive prompt decoder setting. We evaluate the proposed method with different progressive prompt decoders, i.e., by separately removing each of the stages in $\{q_{h},q_{v},q_{t},q_{T}\}$. Table 5 demonstrates that removing any one of these stages significantly weakens the model's performance. The deeper the removed stage, the greater the degradation, which illustrates that the finer feature indeed provides reliable information for the coarser feature.

Table 6: F1 Score value comparison of the proposed MONCH with different pre-trained backbones on PanNuke.
Backbone Neoplastic Epithelial Inflammatory Connective Dead
PLIP 0.7346 0.7212 0.4864 0.5161 0.1364
CONCH 0.7892 0.7779 0.6382 0.6102 0.3237
CLIP 0.8009 0.7931 0.6652 0.6373 0.3003

Different backbones. We evaluate MONCH using three different vision-language models (VLMs) as backbones on PanNuke: CLIP, PLIP, and CONCH. Table 6 shows the F1 scores for each cell type, averaged across all organ types. CLIP-based MONCH achieves the best overall performance, with the CONCH-based implementation showing competitive results. However, PLIP-based MONCH shows notably lower performance, particularly for cell types with limited training samples (inflammatory, connective, and dead cells). We attribute this performance gap to PLIP's architectural design, which was originally optimized for whole slide image (WSI) classification. Its pre-trained image encoder produces feature maps with approximately half the resolution of CLIP's encoder, limiting the model's ability to capture fine-grained cellular details.

We will provide feature maps of different enhancement modules and more visualized results in the supplementary.

5 Conclusion

In this paper, we propose a novel progressive multi-modal prompt learning method with a single-branch architecture for multi-organ, multi-class cell semantic segmentation, achieving coarse-to-fine-grained feature extraction using text-image pairs. Specifically, our single-branch network effectively analyzes multimodal pathological cell features. The multi-grained visual feature extraction module enhances visual features from coarse to fine levels. Subsequently, the progressive prompt decoder fully integrates these multimodal features through a sequence of fine-coarse-fine queries, enabling the multi-head self-attention modules to refine and improve feature representations at the next level. Evaluations conducted on a complex cell segmentation dataset demonstrate that our proposed method outperforms state-of-the-art cell segmentation techniques and vision-language models in semantic segmentation tasks, highlighting its effectiveness in capturing intricate cellular structures.


Supplementary Material
[Uncaptioned image]

Figure 6: Visualization of multi-grained visual features and multimodal features in progressive prompt decoder block. Multimodal features can emphasize cells of specific types as guided by linguistic prompts.

6 Additional Results

6.1 Different backbones

We evaluate MONCH with three vision-language models (VLMs) as backbones: PLIP [21], CONCH [29], and CLIP [36]. Additionally, we replace the image encoder with three vision-only models, SAM [26], UNI [6], and SAM2 [37], each paired with the text encoder from pre-trained CLIP. Table 7 reports the F1 scores for each cell type, averaged across all organ types.

Among these, CLIP-based MONCH achieves the highest overall performance, followed by competitive results from SAM-based, CONCH-based, and SAM2-based implementations. PLIP-based MONCH shows a significantly lower performance, particularly for cell types with limited training samples. This gap can be attributed to PLIP’s architectural design, which was optimized for whole-slide image (WSI) classification. PLIP’s pre-trained image encoder generates feature maps with approximately half the resolution compared to the CLIP encoder, reducing the model’s ability to capture fine-grained cellular details. Furthermore, MONCH with UNI-based backbones demonstrates lower performance, likely due to the misalignment between text and image encoders.

Table 7: F1 Score value comparison of the proposed MONCH with different pre-trained backbones on PanNuke.
Backbone Pretraining Neoplastic Epithelial Inflammatory Connective Dead
SAM [26] Vision 0.7952 0.7857 0.6542 0.6271 0.2038
UNI [6] Vision 0.7451 0.7452 0.5837 0.5606 0.2309
SAM2 [37] Vision 0.7960 0.7856 0.6553 0.6311 0.2918
PLIP [21] Vision-language 0.7346 0.7212 0.4864 0.5161 0.1364
CONCH [29] Vision-language 0.7892 0.7779 0.6382 0.6102 0.3237
CLIP [36] Vision-language 0.8009 0.7931 0.6652 0.6373 0.3003
Refer to caption
Figure 7: Visualization of multi-organ, multi-cell semantic segmentation in PanNuke. Red represents neoplastic cell, green represents inflammatory cell, blue represents connective cell, cyan represents epithelial cell, and yellow represents dead cell. Cells are outlined with contours in their respective annotation colors of MONCH.

6.2 Feature visualization

The progressive prompt decoder block plays a crucial role in harmonizing linguistic and visual features. To demonstrate its effectiveness, we visualize the iteratively generated visual and multimodal features in Fig. 6. The high-frequency visual features, $\mathbf{F}^{h}$, effectively capture detailed information about pathological cells. By utilizing these fine-grained features as the query for the subsequent coarser features, the visual representations are progressively refined, as shown by $\mathbf{F}^{h}\to\mathbf{F}^{v}\to\mathbf{F}^{t}$. These multi-grained visual features excel at preserving the semantic information of pathological cell texture features. Subsequently, the visual features are used as query inputs to the text features, facilitating the capture of multimodal features that integrate linguistic and visual data. Finally, by merging visual and textual features in $\mathbf{F}^{t}\to\mathbf{F}^{\mathcal{T}}$, the multimodal features effectively highlight features specific to different cell types.

6.3 Visualization of cell segmentation

Fig. 7 presents the cell segmentation results of MONCH compared to state-of-the-art cell segmentation methods and large-scale models. MONCH clearly outperforms competing approaches, leveraging both linguistic and visual features to effectively segment diverse cell data. Notably, MONCH delivers superior semantic segmentation performance compared to CPP-Net, the second-best performing method. By integrating linguistic information with visual learning, MONCH accurately identifies cell types, overcoming challenges faced by other methods. As shown in Fig. 7, cells marked with yellow arrows are misclassified by CPP-Net but are correctly segmented by MONCH. Furthermore, MONCH effectively segments rare instances, such as dead cells in PanNuke, despite the limited data available. This underscores MONCH's robustness in handling imbalanced datasets, making it a powerful solution for challenging segmentation tasks.

References

  • Abousamra et al. [2021] Shahira Abousamra, David Belinsky, John Van Arnam, Felicia Allard, Eric Yee, Rajarsi Gupta, Tahsin Kurc, Dimitris Samaras, Joel Saltz, and Chao Chen. Multi-class cell detection using spatial context representation. In ICCV, pages 4005–4014, 2021.
  • Berger et al. [2024] Alexander H Berger, Laurin Lux, Nico Stucki, Vincent Bürgin, Suprosanna Shit, Anna Banaszak, Daniel Rueckert, Ulrich Bauer, and Johannes C Paetzold. Topologically faithful multi-class segmentation in medical images. In MICCAI, pages 721–731. Springer, 2024.
  • Bokhorst et al. [2023] John-Melle Bokhorst, Iris D Nagtegaal, Filippo Fraggetta, Simona Vatrano, Wilma Mesker, Michael Vieth, Jeroen van der Laak, and Francesco Ciompi. Deep learning for multi-class semantic segmentation enables colorectal cancer detection and classification in digital pathology images. Scientific Reports, 13(1):8398, 2023.
  • Chen et al. [2024a] Cheng Chen, Juzheng Miao, Dufan Wu, Aoxiao Zhong, Zhiling Yan, Sekeun Kim, Jiang Hu, Zhengliang Liu, Lichao Sun, Xiang Li, et al. Ma-sam: Modality-agnostic sam adaptation for 3d medical image segmentation. Medical Image Analysis, 98:103310, 2024a.
  • Chen et al. [2021] Chun-Fu Richard Chen, Quanfu Fan, and Rameswar Panda. Crossvit: Cross-attention multi-scale vision transformer for image classification. In CVPR, pages 357–366, 2021.
  • Chen et al. [2024b] Richard J Chen, Tong Ding, Ming Y Lu, Drew FK Williamson, Guillaume Jaume, Andrew H Song, Bowen Chen, Andrew Zhang, Daniel Shao, Muhammad Shaban, et al. Towards a general-purpose foundation model for computational pathology. Nature Medicine, 30(3):850–862, 2024b.
  • Chen et al. [2023] Shengcong Chen, Changxing Ding, Minfeng Liu, Jun Cheng, and Dacheng Tao. Cpp-net: Context-aware polygon proposal network for nucleus segmentation. IEEE Transactions on Image Processing, 32:980–994, 2023.
  • Dai et al. [2021] Yimian Dai, Fabian Gieseke, Stefan Oehmcke, Yiquan Wu, and Kobus Barnard. Attentional feature fusion. In WACV, pages 3560–3569, 2021.
  • Dogar et al. [2023] G Murtaza Dogar, Muhammad Shahzad, and Muhammad Moazam Fraz. Attention augmented distance regression and classification network for nuclei instance segmentation and type classification in histology images. Biomedical Signal Processing and Control, 79:104199, 2023.
  • Fehri et al. [2019] Hamid Fehri, Ali Gooya, Yuanjun Lu, Erik Meijering, Simon A Johnston, and Alejandro F Frangi. Bayesian polytrees with learned deep features for multi-class cell segmentation. IEEE Transactions on Image Processing, 28(7):3246–3260, 2019.
  • Gamper et al. [2019] Jevgenij Gamper, Navid Alemi Koohbanani, Ksenija Benet, Ali Khuram, and Nasir Rajpoot. Pannuke: an open pan-cancer histology dataset for nuclei instance segmentation and classification. In ECDP 2019, pages 11–19. Springer, 2019.
  • Gamper et al. [2020] Jevgenij Gamper, Navid Alemi Koohbanani, Ksenija Benes, Simon Graham, Mostafa Jahanifar, Syed Ali Khurram, Ayesha Azam, Katherine Hewitt, and Nasir Rajpoot. Pannuke dataset extension, insights and baselines. arXiv preprint arXiv:2003.10778, 2020.
  • Gao et al. [2024] Yuan Gao, Kunyu Shi, Pengkai Zhu, Edouard Belval, Oren Nuriel, Srikar Appalaraju, Shabnam Ghadar, Zhuowen Tu, Vijay Mahadevan, and Stefano Soatto. Enhancing vision-language pre-training with rich supervisions. In CVPR, pages 13480–13491, 2024.
  • Graham et al. [2019] Simon Graham, Quoc Dang Vu, Shan E Ahmed Raza, Ayesha Azam, Yee Wah Tsang, Jin Tae Kwak, and Nasir Rajpoot. Hover-net: Simultaneous segmentation and classification of nuclei in multi-tissue histology images. Medical Image Analysis, 58:101563, 2019.
  • Guo et al. [2020] Chaoxu Guo, Bin Fan, Qian Zhang, Shiming Xiang, and Chunhong Pan. Augfpn: Improving multi-scale feature learning for object detection. In CVPR, pages 12595–12604, 2020.
  • Han et al. [2022] Chu Han, Huasheng Yao, Bingchao Zhao, Zhenhui Li, Zhenwei Shi, Lei Wu, Xin Chen, Jinrong Qu, Ke Zhao, Rushi Lan, et al. Meta multi-task nuclei segmentation with fewer training samples. Medical Image Analysis, 80:102481, 2022.
  • Hao et al. [2024] Minsheng Hao, Jing Gong, Xin Zeng, Chiming Liu, Yucheng Guo, Xingyi Cheng, Taifeng Wang, Jianzhu Ma, Xuegong Zhang, and Le Song. Large-scale foundation model on single-cell transcriptomics. Nature Methods, pages 1–11, 2024.
  • He et al. [2023a] Hongliang He, Jun Wang, Pengxu Wei, Fan Xu, Xiangyang Ji, Chang Liu, and Jie Chen. Toposeg: Topology-aware nuclear instance segmentation. In CVPR, pages 21307–21316, 2023a.
  • He et al. [2023b] Zhenqi He, Mathias Unberath, Jing Ke, and Yiqing Shen. Transnuseg: A lightweight multi-task transformer for nuclei segmentation. In MICCAI, pages 206–215. Springer, 2023b.
  • Hou et al. [2021] Guimin Hou, Jiaohua Qin, Xuyu Xiang, Yun Tan, and Neal N Xiong. Af-net: A medical image segmentation network based on attention mechanism and feature fusion. Computers, Materials & Continua, 69(2):1877–1891, 2021.
  • Huang et al. [2023] Zhi Huang, Federico Bianchi, Mert Yuksekgonul, Thomas J Montine, and James Zou. A visual–language foundation model for pathology image analysis using medical twitter. Nature Medicine, 29(9):2307–2316, 2023.
  • Huo et al. [2024] Xiangzuo Huo, Gang Sun, Shengwei Tian, Yan Wang, Long Yu, Jun Long, Wendong Zhang, and Aolun Li. Hifuse: Hierarchical multi-scale feature fusion network for medical image classification. Biomedical Signal Processing and Control, 87:105534, 2024.
  • Ilyas et al. [2022] Talha Ilyas, Zubaer Ibna Mannan, Abbas Khan, Sami Azam, Hyongsuk Kim, and Friso De Boer. Tsfd-net: Tissue specific feature distillation network for nuclei segmentation and classification. Neural Networks, 151:1–15, 2022.
  • Ju et al. [2021] Moran Ju, Jiangning Luo, Zhongbo Wang, and Haibo Luo. Adaptive feature fusion with attention mechanism for multi-scale target detection. Neural Computing and Applications, 33:2769–2781, 2021.
  • Kassim et al. [2020] Yasmin M Kassim, Kannappan Palaniappan, Feng Yang, Mahdieh Poostchi, Nila Palaniappan, Richard J Maude, Sameer Antani, and Stefan Jaeger. Clustering-based dual deep learning architecture for detecting red blood cells in malaria diagnostic smears. IEEE Journal of Biomedical and Health Informatics, 25(5):1735–1746, 2020.
  • Kirillov et al. [2023] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In ICCV, pages 4015–4026, 2023.
  • Liang et al. [2023] Zhiwei Liang, Kui Zhao, Gang Liang, Siyu Li, Yifei Wu, and Yiping Zhou. Maxformer: Enhanced transformer for medical image segmentation with multi-attention and multi-scale features fusion. Knowledge-Based Systems, 280:110987, 2023.
  • Liu et al. [2021] Jinyuan Liu, Xin Fan, Ji Jiang, Risheng Liu, and Zhongxuan Luo. Learning a deep multi-scale feature ensemble and an edge-attention guidance for image fusion. IEEE Transactions on Circuits and Systems for Video Technology, 32(1):105–119, 2021.
  • Lu et al. [2024] Ming Y Lu, Bowen Chen, Drew FK Williamson, Richard J Chen, Ivy Liang, Tong Ding, Guillaume Jaume, Igor Odintsov, Long Phi Le, Georg Gerber, et al. A visual-language foundation model for computational pathology. Nature Medicine, 30(3):863–874, 2024.
  • Meng et al. [2018] Nan Meng, Edmund Y Lam, Kevin K Tsia, and Hayden Kwok-Hay So. Large-scale multi-class image-based cell classification with deep learning. IEEE Journal of Biomedical and Health Informatics, 23(5):2091–2098, 2018.
  • Minaee et al. [2021] Shervin Minaee, Yuri Boykov, Fatih Porikli, Antonio Plaza, Nasser Kehtarnavaz, and Demetri Terzopoulos. Image segmentation using deep learning: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(7):3523–3542, 2021.
  • Oh and Jeong [2024] Hyun-Jic Oh and Won-Ki Jeong. Controllable and efficient multi-class pathology nuclei data augmentation using text-conditioned diffusion models. In MICCAI, pages 36–46. Springer, 2024.
  • Pan et al. [2023] Xipeng Pan, Jijun Cheng, Feihu Hou, Rushi Lan, Cheng Lu, Lingqiao Li, Zhengyun Feng, Huadeng Wang, Changhong Liang, Zhenbing Liu, et al. Smile: Cost-sensitive multi-task learning for nuclear segmentation and classification with imbalanced annotations. Medical Image Analysis, 88:102867, 2023.
  • Petukhov et al. [2022] Viktor Petukhov, Rosalind J Xu, Ruslan A Soldatov, Paolo Cadinu, Konstantin Khodosevich, Jeffrey R Moffitt, and Peter V Kharchenko. Cell segmentation in imaging-based spatial transcriptomics. Nature Biotechnology, 40(3):345–354, 2022.
  • Punn and Agarwal [2020] Narinder Singh Punn and Sonali Agarwal. Inception u-net architecture for semantic segmentation to identify nuclei in microscopy cell images. ACM Transactions on Multimedia Computing, Communications, and Applications, 16(1):1–15, 2020.
  • Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763. PMLR, 2021.
  • Ravi et al. [2024] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024.
  • Tong et al. [2023] Xiaozhong Tong, Shaojing Su, Peng Wu, Runze Guo, Junyu Wei, Zhen Zuo, and Bei Sun. Msaffnet: A multiscale label-supervised attention feature fusion network for infrared small target detection. IEEE Transactions on Geoscience and Remote Sensing, 61:1–16, 2023.
  • Verma et al. [2021] Ruchika Verma, Neeraj Kumar, Abhijeet Patil, Nikhil Cherian Kurian, Swapnil Rane, Simon Graham, Quoc Dang Vu, Mieke Zwager, Shan E Ahmed Raza, Nasir Rajpoot, et al. Monusac2020: A multi-organ nuclei segmentation and classification challenge. IEEE Transactions on Medical Imaging, 40(12):3413–3423, 2021.
  • Wang et al. [2022] Haonan Wang, Peng Cao, Jiaqi Wang, and Osmar R Zaiane. Uctransnet: rethinking the skip connections in u-net from a channel-wise perspective with transformer. In AAAI, pages 2441–2449, 2022.
  • Wang et al. [2024] Shuo Wang, Yuanhong Wang, Yanjun Peng, and Xue Chen. Msa-net: Multi-scale feature fusion network with enhanced attention module for 3d medical image segmentation. Computers and Electrical Engineering, 120:109654, 2024.
  • Wu et al. [2023a] Junde Wu, Wei Ji, Yuanpei Liu, Huazhu Fu, Min Xu, Yanwu Xu, and Yueming Jin. Medical sam adapter: Adapting segment anything model for medical image segmentation, 2023a.
  • Wu et al. [2023b] Peishu Wu, Zidong Wang, Baixun Zheng, Han Li, Fuad E Alsaadi, and Nianyin Zeng. Aggn: Attention-based glioma grading network with multi-scale feature extraction and multi-modal information fusion. Computers in Biology and Medicine, 152:106457, 2023b.
  • Xie et al. [2024] Feng Xie, Fengxiang Zhang, and Shuoyu Xu. Db-fcn: An end-to-end dual-branch fully convolutional nucleus detection model. Expert Systems with Applications, 257:125139, 2024.
  • Xie et al. [2020] Xinpeng Xie, Jiawei Chen, Yuexiang Li, Linlin Shen, Kai Ma, and Yefeng Zheng. Instance-aware self-supervised learning for nuclei segmentation. In MICCAI, pages 341–350. Springer, 2020.
  • Xu et al. [2023a] Jilan Xu, Junlin Hou, Yuejie Zhang, Rui Feng, Yi Wang, Yu Qiao, and Weidi Xie. Learning open-vocabulary semantic segmentation models from natural language supervision. In CVPR, pages 2935–2944, 2023a.
  • Xu et al. [2022] Mengde Xu, Zheng Zhang, Fangyun Wei, Yutong Lin, Yue Cao, Han Hu, and Xiang Bai. A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In ECCV, pages 736–753. Springer, 2022.
  • Xu et al. [2023b] Zhenghua Xu, Biao Tian, Shijie Liu, Xiangtao Wang, Di Yuan, Junhua Gu, Junyang Chen, Thomas Lukasiewicz, and Victor CM Leung. Collaborative attention guided multi-scale feature fusion network for medical image segmentation. IEEE Transactions on Network Science and Engineering, 11(2):1857 – 1871, 2023b.
  • Yan et al. [2020] Qingsen Yan, Bo Wang, Wei Zhang, Chuan Luo, Wei Xu, Zhengqing Xu, Yanning Zhang, Qinfeng Shi, Liang Zhang, and Zheng You. Attention-guided deep neural network with multi-scale feature fusion for liver vessel segmentation. IEEE Journal of Biomedical and Health Informatics, 25(7):2629–2642, 2020.
  • Yu et al. [2023] Ying Yu, Chunping Wang, Qiang Fu, Renke Kou, Fuyu Huang, Boxiong Yang, Tingting Yang, and Mingliang Gao. Techniques and challenges of image segmentation: A review. Electronics, 12(5):1199, 2023.
  • Zhang et al. [2024] Jingyi Zhang, Jiaxing Huang, Sheng Jin, and Shijian Lu. Vision-language models for vision tasks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
  • Zhao et al. [2021] Jing Zhao, Yong-Jun He, Si-Qi Zhao, Jin-Jie Huang, and Wang-Meng Zuo. Al-net: Attention learning network based on multi-task learning for cervical nucleus segmentation. IEEE Journal of Biomedical and Health Informatics, 26(6):2693–2702, 2021.
  • Zhao et al. [2023] Tengfei Zhao, Chong Fu, Yunjia Tian, Wei Song, and Chiu-Wing Sham. Gsn-hvnet: A lightweight, multi-task deep learning framework for nuclei segmentation and classification. Bioengineering, 10(3):393, 2023.
  • Zhou et al. [2022] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9):2337–2348, 2022.