
SAM-FNet: SAM-Guided Fusion Network for Laryngo-Pharyngeal Tumor Detection

1st Jia Wei1 This work is partially supported by the Basic and Applied Basic Research Project of Guangdong Province (2022B1515130009), the Special subject on Agriculture and Social Development, Key Research and Development Plan in Guangzhou (2023B03J0172), and the Natural Science Foundation of Top Talent of SZTU (GDRC202318). College of Big Data and Internet
Shenzhen Technology University
Shenzhen, China
[email protected]
   2nd Yun Li1 Jia Wei and Yun Li are the co-first authors. Otorhinolaryngology Hospital
The First Affiliated Hospital, Sun Yat-Sen University
Guangzhou, China
[email protected]
3rd Meiyu Qiu School of Applied Technology
Shenzhen University
Shenzhen, China
[email protected]
4th Hongyu Chen College of Big Data and Internet
Shenzhen Technology University
Shenzhen, China
[email protected]
   5th Xiaomao Fan2 College of Big Data and Internet
Shenzhen Technology University
Shenzhen, China
[email protected]
   6th Wenbin Lei2 Xiaomao Fan and Wenbin Lei are the corresponding authors. Otorhinolaryngology Hospital
The First Affiliated Hospital, Sun Yat-Sen University
Guangzhou, China
[email protected]
Abstract

Laryngo-pharyngeal cancer (LPC) is a highly fatal malignant disease affecting the head and neck region. Previous studies on endoscopic tumor detection, particularly those leveraging dual-branch network architectures, have shown significant advancements in tumor detection. These studies highlight the potential of dual-branch networks to improve diagnostic accuracy by effectively integrating global and local (lesion) feature extraction. However, they are still limited in their ability to accurately locate the lesion region and to capture the discriminative feature information between the global and local branches. To address these issues, we propose a novel SAM-guided fusion network (SAM-FNet), a dual-branch network for laryngo-pharyngeal tumor detection. Leveraging the powerful object segmentation capabilities of the Segment Anything Model (SAM), we introduce the SAM into SAM-FNet to accurately segment the lesion region. Furthermore, we propose a GAN-like feature optimization (GFO) module to capture the discriminative features between the global and local branches, enhancing the complementarity of the fused features. Additionally, we collect two LPC datasets from the First Affiliated Hospital (FAHSYSU) and the Sixth Affiliated Hospital (SAHSYSU) of Sun Yat-sen University. The FAHSYSU dataset is used as the internal dataset for training the model, while the SAHSYSU dataset is used as the external dataset for evaluating the model’s performance. Extensive experiments on both the FAHSYSU and SAHSYSU datasets demonstrate that SAM-FNet achieves competitive results, outperforming state-of-the-art counterparts. The source code of SAM-FNet is available at https://github.com/VVJia/SAM-FNet.

Index Terms:
Laryngo-Pharyngeal Tumor Detection, Dual-Branch Network, Endoscopic Images, SAM, LoRA

I Introduction

Laryngo-pharyngeal cancer (LPC) is a malignant disease with high mortality among head and neck tumors. Reports [1] from 2020 showed that LPC caused more than 130,000 deaths globally. Early-stage LPC can often be effectively treated using minimally invasive procedures, boasting a 5-year survival rate of up to 90% while preserving the patient’s voice [2]. Clinically, laryngologists typically rely on biopsy under laryngoscopy as the gold standard [3] for LPC diagnosis. However, the visual inspection process is time-consuming and subject to the laryngologist’s skill and experience [4, 5]. Limitations of the laryngologist’s expertise may lead to missed diagnoses and unnecessary repeated biopsies [6, 7]. Therefore, developing an automatic approach to assist laryngologists in detecting LPC is of great significance.

In recent studies, many researchers have attempted to utilize deep learning techniques to build models for tumor detection. These methods fall mainly into two groups: single-branch networks [8, 9, 10, 11, 12, 13, 14] and dual-branch networks [15, 16, 17, 18, 19]. Single-branch networks focus solely on the global features of the endoscopic images as input. For instance, Luo et al. [12] proposed UC-DenseNet, which combines a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN) to recognize ulcerative colitis and utilizes an attention mechanism to localize the relevant feature information. Ling et al. [13] developed a multi-task attention network termed MTANet based on a Transformer and CNN architecture, which achieved excellent performance in the detection of multiple types of tumors. Though these methods have achieved promising results, they overlook the valuable local information within lesion regions, which is informative for accurate tumor detection. Therefore, dual-branch methods have been proposed that fuse global and local (lesion) features to enhance the model’s capability for tumor detection. Wang et al. [15] proposed DLGNet, a dual-branch lesion-aware neural network for colorectal lesion classification, which explicitly extracts global and local features from colorectal endoscopic images. Similarly, Basu et al. [16] proposed a Transformer-based network termed RadFormer that integrates global and local attention branches of the gallbladder lesion, achieving competitive results in gallbladder tumor detection tasks.

Although the existing methods have integrated global and local (lesion) information and made notable improvements in cancer diagnosis, some significant challenges remain. One key challenge is their inability to accurately locate the lesion region in endoscopic images, which can be attributed to the fact that the lesion (foreground) region and the background often appear quite similar, making them difficult to differentiate. Moreover, the existing dual-branch networks typically concatenate the global and local features directly and then feed them into a module such as a transformer to generate a fused feature [16]. These approaches still struggle to fully capture the complementary features from the global and local branches, underutilizing the feature fusion mechanism that could further improve laryngo-pharyngeal tumor detection performance.

To address the aforementioned challenges, we propose a Segment Anything Model-Guided Fusion Network (SAM-FNet), a novel dual-branch architecture that captures global and local features for accurate laryngo-pharyngeal tumor detection. Specifically, it consists of five key components: a SAM-guided lesion location (SLL) module, a global feature extractor (GFE) module, a local feature extractor (LFE) module, a GAN-like feature optimization (GFO) module, and a classifier. In particular, to accurately locate the lesion region, we leverage the SAM’s powerful segmentation capabilities to segment tumors from laryngo-pharyngeal endoscopic images. To fully capture the discriminative information between the global and local branches, we propose a GAN-like optimization module that better learns complementary feature representations. Extensive experiments on two datasets collected from the First Affiliated Hospital of Sun Yat-sen University (FAHSYSU) and the Sixth Affiliated Hospital of Sun Yat-sen University (SAHSYSU) demonstrate that our proposed SAM-FNet achieves competitive results, surpassing state-of-the-art counterparts on the LPC detection task. Overall, our main contributions can be summarized as follows:

  • We propose a novel SAM-guided fusion network for laryngo-pharyngeal tumor detection, which is the first to utilize a dual-branch network architecture for this specific task.

  • We introduce the SAM, with its powerful object segmentation capabilities, to segment the laryngo-pharyngeal endoscopic images for accurately locating the lesion region.

  • We utilize a GAN-like feature optimization module to further capture the complementary features between the global and local branches.

  • Extensive experiments on the two datasets of FAHSYSU and SAHSYSU demonstrate that SAM-FNet achieves competitive results, outperforming state-of-the-art counterparts.

II Methodology

II-A Overview

Motivated by DLGNet [15], in which a dual-branch network captures both global and local (lesion) information to enable accurate lesion detection, we propose a SAM-guided fusion network for laryngo-pharyngeal tumor detection that follows the dual-branch architecture [15], as shown in Fig. 1. Specifically, it consists of five parts: an SLL (see Section II-B), a GFE (see Section II-C), an LFE (see Section II-D), a GFO (see Section II-E), and a classifier (see Section II-F). Unlike DLGNet, we introduce the SAM to accurately locate the lesion region. Additionally, we design a GAN-like feature optimization (GFO) module to enhance the global and local feature representations, thereby improving the model’s ability to extract more discriminative features.

Formally, the dataset is denoted as $\mathcal{D}=\{(x_{g}^{(i)},y^{(i)})\}_{i=1}^{N}$, where $N$ is the total number of laryngo-pharyngeal endoscopic images, $x_{g}^{(i)}$ represents a holistic laryngoscopic image, and $y^{(i)}\in\{0,1,2\}$ represents the labels for the three categories: normal laryngo-pharyngeal tissues (normal for short), benign tumors (benign for short), and malignant tumors (malignant for short). The lesion area image $x_{l}^{(i)}$ corresponding to $x_{g}^{(i)}$ is obtained through the SLL module and shares the same label $y^{(i)}$ as $x_{g}^{(i)}$. Following the paradigm of DLGNet, we adopt a multi-task learning framework with several loss functions tailored for each branch, which can leverage domain-specific information from complementary tasks, enhancing prediction accuracy and generalization for each task [20]. Specifically, the global, local, and fused feature representations are fed into their respective classifiers for the final prediction. The discrepancy between these predictions and the ground-truth labels is measured by cross-entropy losses denoted as $\mathcal{L}_{g}$, $\mathcal{L}_{l}$, and $\mathcal{L}_{f}$, respectively. Additionally, a GAN-like loss $(\mathcal{L}_{s}+\mathcal{L}_{d})$, comprising a similarity loss $\mathcal{L}_{s}$ (to align global and local features) and a binary cross-entropy loss $\mathcal{L}_{d}$ (to differentiate between the global and local feature distributions), is introduced to further refine the feature representations, ensuring the capture of complementary and distinctive characteristics. In this study, we employ a joint learning scheme, where the total objective loss is defined as:

$\mathcal{L}_{t}=\mathcal{L}_{f}+\alpha\mathcal{L}_{g}+\beta\mathcal{L}_{l}+\gamma(\mathcal{L}_{s}+\mathcal{L}_{d})$ (1)

where the weights $\alpha$, $\beta$, and $\gamma$ are trade-off hyperparameters for each component loss.
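As a minimal sketch, the joint objective in Eq. (1) can be assembled in PyTorch as below; the argument names are illustrative, the GFO losses $\mathcal{L}_{s}$ and $\mathcal{L}_{d}$ are assumed to be computed as described in Section II-E, and the default weights follow the values reported later in Section III-A3.

```python
import torch.nn as nn

ce = nn.CrossEntropyLoss()

def total_loss(logits_g, logits_l, logits_f, labels, l_s, l_d,
               alpha=1.0, beta=0.3, gamma=0.01):
    """L_t = L_f + alpha*L_g + beta*L_l + gamma*(L_s + L_d), as in Eq. (1)."""
    l_g = ce(logits_g, labels)   # global-branch cross-entropy
    l_l = ce(logits_l, labels)   # local-branch cross-entropy
    l_f = ce(logits_f, labels)   # fused-branch cross-entropy
    return l_f + alpha * l_g + beta * l_l + gamma * (l_s + l_d)
```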

Figure 1: The architecture of the proposed SAM-FNet includes several key components: a SAM-guided lesion location (SLL) module to generate the lesion area image from the entire image; a global feature extractor (GFE) to extract features from the whole image; a local feature extractor (LFE) to derive features from the lesion region; a GAN-like feature optimization (GFO) module to align global and local features while differentiating their distributions; and a classifier that predicts based on global, local, and fused features.

II-B SAM-guided Lesion Location

To improve the model’s capability in extracting lesion features, recent studies have concentrated on dual-branch networks [15, 16, 21]. Specifically, these networks typically use segmentation methods, such as Mask R-CNN [22], to identify and crop the lesion region as the local input [15]. However, these approaches often face challenges in accurately locating lesions because of the high similarity between the lesions and the surrounding tissues. To address this issue, we employ the SAM, which has demonstrated excellent performance in segmenting objects at the pixel level. However, the SAM is primarily trained on annotations for natural images, which differ significantly from medical images in domain characteristics. This domain discrepancy can lead to a substantial drop in performance when the model is directly applied to medical images [23, 24]. Therefore, we utilize the low-rank adaptation (LoRA) fine-tuning strategy to adapt the SAM to laryngo-pharyngeal endoscopic images. Specifically, we freeze the SAM’s image encoder and apply LoRA layers to the query and value projection layers of the multi-head attention mechanism within each transformer block of the encoder, while keeping all parameters in the mask decoder and prompt encoder trainable. Furthermore, the rank of the LoRA layers is empirically set to 4 to balance efficiency and performance.
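The following is a minimal sketch of how such a LoRA adaptation could be attached to the attention projections, assuming rank 4 as stated above; the encoder attribute names (blocks, attn.q_proj, attn.v_proj) are placeholders that depend on the particular SAM implementation, not names confirmed by the paper.

```python
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a rank-r update: W x + B(A x)."""
    def __init__(self, base: nn.Linear, r: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                  # keep the pretrained weight frozen
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)           # start as a zero (identity) update

    def forward(self, x):
        return self.base(x) + self.lora_b(self.lora_a(x))

def add_lora_to_image_encoder(image_encoder, r=4):
    # Hypothetical attachment points: wrap the query/value projections of each block.
    for blk in image_encoder.blocks:
        blk.attn.q_proj = LoRALinear(blk.attn.q_proj, r)
        blk.attn.v_proj = LoRALinear(blk.attn.v_proj, r)
    return image_encoder
```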

After the fine-tuning process, we employ the LoRA-based SAM model to segment the entire image $x_{g}^{(i)}$, obtaining the corresponding lesion mask $x_{s}^{(i)}$, which is defined as

$x_{s}^{(i)}=\mathcal{F}_{SAM}(x_{g}^{(i)}),$ (2)

where $\mathcal{F}_{SAM}(\cdot)$ represents the function of the LoRA-based SAM that generates the lesion mask from the input image. Next, we apply pixel-wise multiplication between the entire image $x_{g}^{(i)}$ and the corresponding lesion mask $x_{s}^{(i)}$. This process isolates the lesions, effectively removing background information. The resulting image, containing only the lesions, is then used for lesion-based cropping to extract the lesion area image $x_{l}^{(i)}$. The entire process can be defined as follows:

$x_{l}^{(i)}=Crop(x_{g}^{(i)}\odot x_{s}^{(i)}),$ (3)

where $Crop(\cdot)$ denotes the lesion-based cropping function, and $\odot$ indicates pixel-wise multiplication.
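A minimal NumPy sketch of this masking-and-cropping step is given below; the padding margin and function name are illustrative choices, not values specified in the paper.

```python
import numpy as np

def crop_lesion(image: np.ndarray, mask: np.ndarray, pad: int = 16):
    """Isolate the lesion by pixel-wise masking, then crop its bounding box.

    `image` is H x W x 3 and `mask` is a binary H x W array from the LoRA-based SAM.
    """
    masked = image * mask[..., None]                 # x_g ⊙ x_s: zero out the background
    ys, xs = np.nonzero(mask)
    if len(ys) == 0:                                 # no lesion predicted by the SAM
        return None
    y0, y1 = max(ys.min() - pad, 0), min(ys.max() + pad, image.shape[0])
    x0, x1 = max(xs.min() - pad, 0), min(xs.max() + pad, image.shape[1])
    return masked[y0:y1, x0:x1]                      # x_l: lesion-area image
```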

II-C Global Feature Extractor

To capture comprehensive contextual information, we propose the GFE, which extracts global features from the whole LPC image. The GFE includes three components: an encoder ($E_{g}$), a Feature Pyramid Network ($FPN_{g}$), and a Convolutional Block Attention Module ($CBAM_{g}$). Specifically, we adopt ResNet-50 [25], initialized with ImageNet-pretrained weights from the model zoo, as the backbone of the encoder $E_{g}$ and remove its final fully connected layer. We then utilize the Feature Pyramid Network [26] to further learn fine-to-coarse granular information from the feature maps output by $E_{g}$. The FPN architecture, as proposed by Lin et al. [26], takes four-scale feature maps as inputs; in this study, we use the feature representations extracted from the last four blocks of the ResNet-50 model as these four-scale inputs. Since the FPN was originally proposed for object detection, it derives five projection heads from the top-down pathway for further processing. However, our task is classification and requires only one projection head, so we retain the bottom projection head, which preserves the rich semantic information of the images. To reduce the model’s focus on irrelevant information in the feature maps, we employ the CBAM [27], with a channel attention module and a spatial attention module, to highlight the inter-channel and inter-spatial relationships of features pertinent to laryngo-pharyngeal tumors. Finally, global average pooling maps the feature maps into a global feature vector $F_{g}^{(i)}$, which is defined as

$F_{g}^{(i)}=\mathcal{F}_{GFE}(x_{g}^{(i)}),$ (4)

where $\mathcal{F}_{GFE}(\cdot)$ represents the global feature extractor.
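A possible PyTorch realization of the GFE is sketched below, assuming torchvision's ResNet-50 and FeaturePyramidNetwork; for brevity, the CBAM is reduced to its channel-attention part here, whereas the actual module also applies spatial attention [27].

```python
import torch.nn as nn
from torchvision.models import resnet50
from torchvision.models.feature_extraction import create_feature_extractor
from torchvision.ops import FeaturePyramidNetwork

class GlobalFeatureExtractor(nn.Module):
    """Sketch of the GFE: ResNet-50 stages -> FPN -> attention -> global average pooling."""
    def __init__(self, out_dim=256):
        super().__init__()
        backbone = resnet50(weights="IMAGENET1K_V1")
        # take the outputs of the last four residual stages as the four-scale FPN inputs
        self.body = create_feature_extractor(
            backbone, return_nodes={f"layer{i}": f"c{i + 1}" for i in range(1, 5)})
        self.fpn = FeaturePyramidNetwork([256, 512, 1024, 2048], out_dim)
        self.channel_attn = nn.Sequential(           # simplified CBAM (channel part only)
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(out_dim, out_dim // 16, 1), nn.ReLU(),
            nn.Conv2d(out_dim // 16, out_dim, 1), nn.Sigmoid())
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):
        feats = self.fpn(self.body(x))               # dict of four pyramid levels
        p2 = list(feats.values())[0]                 # keep the bottom (high-resolution) head
        p2 = p2 * self.channel_attn(p2)              # re-weight channels
        return self.pool(p2).flatten(1)              # global feature vector F_g
```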

II-D Local Feature Extractor

As shown in the middle bottom part of Fig. 1, the Local Feature Extractor (LFE) employs the same network architecture as the GFE described in Section II-C to learn the local feature information, that is, the feature information of the lesion region. This allows the model to effectively capture the relevant characteristics of the localized lesion area, which is important for accurately identifying and analyzing the medical condition. Specifically, the LFE consists of an encoder ($E_{l}$), an FPN ($FPN_{l}$), and a CBAM ($CBAM_{l}$). Importantly, the LFE does not share parameters with the GFE, thereby enhancing its ability to learn lesion-specific information. Formally, given a lesion area image $x_{l}^{(i)}$, the LFE produces a local feature vector $F_{l}^{(i)}$, which is defined as

$F_{l}^{(i)}=\mathcal{F}_{LFE}(x_{l}^{(i)}),$ (5)

where $\mathcal{F}_{LFE}(\cdot)$ represents the local feature extractor.

II-E GAN-Like Feature Optimization

The GAN-Like Feature Optimization (GFO) module is proposed to improve the learning of complementary features between the global and local branches of the SAM-FNet architecture. Inspired by the Generative Adversarial Network (GAN) framework [28], the GFO module introduces an adversarial training process to better align the features learned by the global and local branches. Concretely, the GFO utilizes two loss functions, the similarity loss $\mathcal{L}_{s}$ and the discrimination loss $\mathcal{L}_{d}$, which are shown in the middle part of Fig. 1. The similarity loss aims to align the global feature vector $F_{g}^{(i)}$ (see Eq. 4) and the local feature vector $F_{l}^{(i)}$ (see Eq. 5), promoting the GFE and the LFE to better capture coherent feature information. In contrast, the discrimination loss encourages the model to learn complementary feature information between the GFE and LFE. In this context, the similarity loss is calculated only for laryngo-pharyngeal endoscopic images that contain tumors; the labels for these images are either benign (label value 1) or malignant (label value 2). The similarity loss is computed with the cosine function and is defined as

$\mathcal{L}_{s}=\frac{1}{N_{t}}\sum_{i=1}^{N_{t}}\left(1-\frac{F_{g}^{(i)}\cdot F_{l}^{(i)}}{\|F_{g}^{(i)}\|\,\|F_{l}^{(i)}\|}\right),$ (6)

where $N_{t}$ is the number of laryngo-pharyngeal endoscopic images with tumor labels, and $\|F_{g}^{(i)}\|$ and $\|F_{l}^{(i)}\|$ are the Euclidean norms of the global and local feature vectors, respectively. However, simply reducing the distance between $F_{g}^{(i)}$ and $F_{l}^{(i)}$ in the feature space may lead to the loss of their specific advantages. Therefore, inspired by the adversarial approach in GANs, we introduce a discriminator network. The objective of this discriminator is to ensure that $F_{g}^{(i)}$ and $F_{l}^{(i)}$ maintain their unique characteristics: the global features should retain their broad contextual understanding, and the local features should focus on the detailed information of tumors. Specifically, the discriminator, denoted as $FC_{d}$, is a fully connected layer that processes either the global feature vector $F_{g}^{(i)}$ or the local feature vector $F_{l}^{(i)}$. The output of the discriminator is given by:

$\hat{y}^{(i)}_{d}=\mathcal{F}^{d}_{FC}(F_{g}^{(i)}\,||\,F_{l}^{(i)}),$ (7)

where $||$ denotes “or”. The binary cross-entropy loss $\mathcal{L}_{d}$ is then computed by comparing the predicted probability $\hat{y}^{(i)}_{d}$ with the ground-truth label $y^{(i)}_{d}$, which is defined as:

$\mathcal{L}_{d}=\text{BinaryCrossEntropyLoss}(\hat{y}^{(i)}_{d},y^{(i)}_{d}),$ (8)

where $y^{(i)}_{d}=0$ indicates a global feature, and $y^{(i)}_{d}=1$ indicates a local feature. This loss function optimizes the discriminator’s ability to correctly classify the features as either global or local, ensuring that each feature retains its distinctive characteristics. With this adversarial optimization strategy, SAM-FNet can effectively extract discriminative features between the global and local branches, enhancing the richness and complementarity of the fused feature representation.
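A hedged PyTorch sketch of the GFO losses is shown below; the discriminator is a single fully connected layer as described above, while the batching and naming conventions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureDiscriminator(nn.Module):
    """FC_d: predicts whether a feature vector came from the global (0) or local (1) branch."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.fc = nn.Linear(feat_dim, 1)

    def forward(self, feat):
        return self.fc(feat).squeeze(1)              # one logit per feature vector

def gfo_losses(feat_g, feat_l, labels, disc):
    """Similarity loss (Eq. 6, tumor images only) and discrimination loss (Eq. 8)."""
    tumor = labels > 0                               # 1 = benign, 2 = malignant
    l_s = (1 - F.cosine_similarity(feat_g[tumor], feat_l[tumor], dim=1)).mean() \
        if tumor.any() else feat_g.new_zeros(())
    logits = torch.cat([disc(feat_g), disc(feat_l)])
    targets = torch.cat([torch.zeros(len(feat_g)),   # global features labeled 0
                         torch.ones(len(feat_l))]).to(logits.device)
    l_d = F.binary_cross_entropy_with_logits(logits, targets)
    return l_s, l_d
```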

II-F Classifier

In the training phase, following DLGNet [15], we adopt a multi-task learning framework that incorporates three cross-entropy losses from the global, local, and fused branches. This framework has the potential to enhance prediction accuracy and generalization across each task [20]. As shown in the right part of Fig. 1, the fused feature vector $F_{f}^{(i)}$ is obtained by concatenating the global feature vector $F_{g}^{(i)}$ (see Eq. (4)) and the local feature vector $F_{l}^{(i)}$ (see Eq. (5)). The vectors $F_{g}^{(i)}$, $F_{l}^{(i)}$, and $F_{f}^{(i)}$ are then passed through their respective classifiers, implemented as fully connected layers $FC_{g}$, $FC_{l}$, and $FC_{f}$. This produces the probability distributions $\hat{y}_{g}^{(i)}$, $\hat{y}_{l}^{(i)}$, and $\hat{y}_{f}^{(i)}$, respectively, which are defined as follows:

$\hat{y}_{g}^{(i)}=\mathcal{F}_{FC}^{g}(F_{g}^{(i)}),$ (9)
$\hat{y}_{l}^{(i)}=\mathcal{F}_{FC}^{l}(F_{l}^{(i)}),$ (10)
$\hat{y}_{f}^{(i)}=\mathcal{F}_{FC}^{f}(F_{f}^{(i)}).$ (11)

To optimize the SAM-FNet model, this study employs three separate cross-entropy loss functions, denoted as $\mathcal{L}_{g}$, $\mathcal{L}_{l}$, and $\mathcal{L}_{f}$, corresponding to the respective fully connected layers $FC_{g}$, $FC_{l}$, and $FC_{f}$. The formulation of the loss functions can be expressed as follows:

$\mathcal{L}_{g}=\text{CrossEntropyLoss}(\hat{y}_{g}^{(i)},y^{(i)}),$ (12)
$\mathcal{L}_{l}=\text{CrossEntropyLoss}(\hat{y}_{l}^{(i)},y^{(i)}),$ (13)
$\mathcal{L}_{f}=\text{CrossEntropyLoss}(\hat{y}_{f}^{(i)},y^{(i)}).$ (14)

In the inference phase, inspired by ensemble learning, we generate the final output by averaging the values of $\hat{y}_{g}^{(i)}$, $\hat{y}_{l}^{(i)}$, and $\hat{y}_{f}^{(i)}$. This approach leverages the strengths of different branches, helping to mitigate individual prediction errors and enhance overall model robustness and accuracy.
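A minimal sketch of this averaging scheme is given below, assuming the model returns the three branch logits; the interface is illustrative rather than taken from the released code.

```python
import torch

@torch.no_grad()
def predict(model, image_g, image_l):
    """Average the class probabilities from the global, local, and fused heads."""
    logits_g, logits_l, logits_f = model(image_g, image_l)
    probs = (logits_g.softmax(dim=1) + logits_l.softmax(dim=1) + logits_f.softmax(dim=1)) / 3
    return probs.argmax(dim=1)                       # 0 = normal, 1 = benign, 2 = malignant
```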

III Experiment

III-A Experiment Settings

III-A1 Datasets

TABLE I: Statistical description of the FAHSYSU and SAHSYSU datasets.

            FAHSYSU (Total=25,256)      SAHSYSU (Total=2,788)
            NBI        WLI              NBI        WLI
Normal      695        7,310            9          2,202
Benign      1,488      3,332            27         218
Malignant   5,954      6,477            99         233
Total       8,137      17,119           135        2,653

All LPC endoscopic images were provided by two hospitals, FAHSYSU and SAHSYSU. All laryngo-pharyngeal endoscopic images were collected during routine clinical practice using standard laryngoscopes (ENF-VT2, ENF-VT3, or ENF-V3; Olympus Medical Systems, Tokyo, Japan) and imaging systems (VISERA ELITE OTV-S190, EVIS EXERA III CV-190; Olympus Medical Systems) at an original resolution of $512\times 512$ pixels. Notably, our dataset includes laryngoscopic images captured in both narrow-band imaging (NBI) mode and white light imaging (WLI) mode. Furthermore, all malignant tumors were pathologically confirmed. Tumors were annotated by experienced laryngologists, followed by cross-checking and expert review for quality control. The statistical description of the laryngo-pharyngeal endoscopic images in each dataset is presented in Table I.

  • FAHSYSU: As the internal dataset, FAHSYSU contains 25,256 images, with 8,137 in NBI mode and 17,119 in WLI mode. It was used for model training, validation, and internal testing.

  • SAHSYSU: As the external dataset, SAHSYSU contains 2,788 images, with 135 in NBI mode and 2,653 in WLI mode. It was used only for external testing, with no data leakage into model training.

III-A2 Evaluation Metrics

Accuracy, precision, recall, and $F_{1}$ score were utilized as evaluation metrics for the classification of laryngo-pharyngeal tumors. Moreover, we employed the Dice coefficient (Dice) to analyze the segmentation performance of the LoRA-based SAM.
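The sketch below illustrates how these metrics could be computed with scikit-learn and NumPy; the macro-averaging choice for precision, recall, and F1 is an assumption, as the paper does not state its averaging scheme.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def dice_coefficient(pred_mask, gt_mask, eps=1e-7):
    """Dice = 2|A ∩ B| / (|A| + |B|) for binary segmentation masks."""
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    return 2.0 * np.logical_and(pred, gt).sum() / (pred.sum() + gt.sum() + eps)

def classification_report(y_true, y_pred):
    """Overall accuracy plus macro-averaged precision, recall, and F1."""
    acc = accuracy_score(y_true, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
    return {"accuracy": acc, "precision": prec, "recall": rec, "f1": f1}
```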

III-A3 Implementation Details

All experiments were conducted on a computing server equipped with an NVIDIA RTX A6000 GPU, with CUDA version 11.8. All code was written in Python 3.10.14 with PyTorch 2.2.0. The first stage involved fine-tuning the SAM, closely adhering to the settings described in SAMed [29]. However, due to differences in the input data (SAMed uses CT images) and the number of prediction categories, several adjustments were made to adapt the model to our specific task: (1) we resized the input images to $224\times 224$; (2) the model predicted a single category, distinguishing foreground from background; (3) we empirically assigned a weight of 0.9 to the Dice loss and 0.1 to the cross-entropy loss.

In the second stage, the parameters of the LoRA-based SAM were frozen, and only the subsequent dual-branch network was trained. The masks generated by the LoRA-based SAM were used to crop the lesion region, which served as the input to the LFE. If no mask was detected (e.g., in cases where the image did not contain tumors), a central region of size $256\times 256$ was cropped from the entire image. Both the holistic images and the lesion region images were resized to the same size. We then trained all networks using Stochastic Gradient Descent (SGD) as the optimizer, with a learning rate of 0.003, a momentum of 0.9, a weight decay of 5e-4, and a mini-batch size of 256 for 60 epochs. We also applied exponential learning rate decay with a decay factor of 0.965. To improve the model’s generalization and robustness, we further employed online data augmentation, such as random affine transformations, horizontal flipping, and color jittering. Empirically, we set the hyperparameters $\alpha$, $\beta$, and $\gamma$ to 1.0, 0.3, and 0.01, respectively.
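The optimizer, scheduler, and augmentation settings above can be expressed in PyTorch as in the hedged sketch below; the augmentation strengths and the stand-in model are illustrative placeholders, while the optimizer and scheduler values follow the reported configuration.

```python
import torch
import torch.nn as nn
from torchvision import transforms

# Online augmentations; the specific strengths are illustrative, not from the paper.
train_transform = transforms.Compose([
    transforms.RandomAffine(degrees=10, translate=(0.1, 0.1)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.ToTensor(),
])

model = nn.Linear(10, 3)  # stand-in for the SAM-FNet dual-branch network
optimizer = torch.optim.SGD(model.parameters(), lr=0.003,
                            momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.965)

for epoch in range(60):
    # ... one training pass over the FAHSYSU training split would go here ...
    scheduler.step()
```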

TABLE II: Experiment performance comparison with state-of-the-art counterparts on the FAHSYSU and SAHSYSU datasets.

                                   Overall results                               Recall for different classes
Dataset    Method                  Accuracy   Precision   Recall    F1 score     Normal    Benign    Malignant
FAHSYSU    ResNet [25]             89.45%     86.14%      85.13%    85.52%       93.82%    67.89%    93.68%
           EfficientNet [10]       89.31%     85.76%      85.71%    85.69%       93.16%    71.23%    92.75%
           ViT [11]                87.74%     82.93%      84.57%    83.65%       90.96%    71.89%    90.85%
           RadFormer [16]          87.01%     82.87%      82.42%    82.63%       89.08%    65.55%    92.61%
           DLGNet [15]             88.95%     85.04%      84.60%    84.80%       91.18%    68.47%    94.13%
           SAM-FNet (Ours)         92.14%     89.57%      88.68%    89.08%       93.95%    75.81%    96.27%
SAHSYSU    ResNet [25]             91.07%     80.18%      82.37%    81.22%       94.81%    67.05%    85.24%
           EfficientNet [10]       87.88%     74.69%      81.58%    77.50%       90.52%    68.97%    85.24%
           ViT [11]                89.67%     78.14%      79.95%    78.68%       94.03%    67.82%    78.01%
           RadFormer [16]          86.80%     71.57%      78.64%    74.61%       90.30%    63.98%    81.63%
           DLGNet [15]             86.76%     72.89%      80.80%    75.96%       89.20%    67.05%    86.14%
           SAM-FNet (Ours)         92.29%     82.71%      84.52%    83.59%       95.58%    69.73%    88.25%
The best performance is in bold.

III-B Experiment Results

III-B1 Baselines

To demonstrate the effectiveness of our proposed SAM-FNet, we compared our method with three types of state-of-the-art classification methods on both the internal and external datasets: CNN-based methods (ResNet and EfficientNet), a Transformer-based method (ViT), and dual-branch methods (RadFormer and DLGNet).

  • ResNet [25]: ResNet is a deep convolutional neural network that employs residual connections to enable the training of very deep networks, effectively improving image recognition accuracy.

  • EfficientNet [10]: EfficientNet is a convolutional neural network that scales depth, width, and resolution using a compound scaling method, optimizing accuracy and efficiency for image classification tasks.

  • ViT [11]: ViT is a vision transformer model that applies the transformer architecture, originally designed for natural language processing, to image classification by treating images as sequences of patches.

  • RadFormer [16]: RadFormer is a dual-branch network that uses a transformer architecture to combine global and local feature maps for accurate gallbladder cancer detection from ultrasound images.

  • DLGNet [15]: DLGNet is also a dual-branch network that integrates contextual lesion information by learning global and local features for colon lesion classification.

III-B2 Internal Dataset Results

Test data from the FAHSYSU dataset were initially randomly selected by laryngologists, with the data partitioned according to individual patients. For hyperparameter tuning, the remaining data were divided into training and validation sets in a 90% to 10% ratio. The detailed distribution is shown in Table III, with a total of 16,222 images in the training set, 1,806 images in the validation set, and 7,229 images in the test set. It should be noted that for a fair comparison, we downloaded the code for all baseline methods from their open-source repositories and retrained them on the FAHSYSU dataset using the same training settings as our proposed SAM-FNet.
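The sketch below illustrates one way to perform such a patient-level partition so that no patient’s images appear in more than one subset; the record format and the assumption that the 90%/10% train/validation split is also done per patient are illustrative, not details confirmed by the paper.

```python
import random
from collections import defaultdict

def patient_level_split(samples, val_ratio=0.1, seed=0):
    """Split (patient_id, image_path, label) records so that all images from a
    patient land in the same subset."""
    by_patient = defaultdict(list)
    for pid, path, label in samples:
        by_patient[pid].append((path, label))
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)
    n_val = int(len(patients) * val_ratio)
    val_ids = set(patients[:n_val])
    train = [s for pid in patients[n_val:] for s in by_patient[pid]]
    val = [s for pid in val_ids for s in by_patient[pid]]
    return train, val
```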

TABLE III: Data distribution for the training, validation, and test sets of the FAHSYSU dataset.

            Training   Validation   Test
Normal      5,134      571          2,301
Benign      3,276      366          1,178
Malignant   7,812      869          3,750
Total       16,222     1,806        7,229

Table II shows the evaluation results of our method compared with other state-of-the-art counterparts on the FAHSYSU dataset. It can be observed that SAM-FNet achieves promising results, reaching 92.14%, 89.57%, 88.68%, and 89.08% in terms of accuracy, precision, recall, and $F_{1}$ score, respectively. Notably, in terms of single-class recall, SAM-FNet achieves 75.81% for benign tumors and 96.27% for malignant tumors.

It is noteworthy that our proposed SAM-FNet outperforms other state-of-the-art counterparts by a significant margin. Specifically, SAM-FNet surpasses the second-best counterpart by 2.69% in accuracy and 2.97% in recall. Additionally, SAM-FNet achieves a 3.92% higher recall for benign tumors compared to the second-best counterpart.

Moreover, as shown in Fig. 2, we plotted the Receiver Operating Characteristic (ROC) curve for each category (normal, benign, malignant) compared with state-of-the-art counterparts. It can be observed that our method, represented by the red line, achieves the largest area under the curve (AUC) across all classes, demonstrating the superior classification performance of our proposed SAM-FNet. Notably, SAM-FNet significantly surpasses the other counterparts in accurately classifying benign tumors. This may be due to the small size of benign tumors and their blurred borders with surrounding normal tissue, which make it difficult for the compared methods to extract lesion-related features. In contrast, the SLL module in our method can accurately segment the contours of the lesion and crop the lesion region image, allowing the subsequent LFE to capture more discriminative tumor features.

Figure 2: Receiver Operating Characteristic (ROC) curves for the experimental results on the FAHSYSU dataset: (a) Normal, (b) Benign, (c) Malignant. Our proposed SAM-FNet, represented by the red line, achieves the best classification performance across all classes.

III-B3 External Dataset Results

The external dataset may suffer from inconsistent data distribution with the internal dataset, posing challenges for the model’s accuracy in tumor detection. To assess the generalization performance of our method compared to state-of-the-art counterparts, we conducted comparative experiments on the SAHSYSU dataset. The detailed data distribution for this dataset is presented in Table I.

Table II also presents the performance of our proposed SAM-FNet in comparison with state-of-the-art counterparts on the external SAHSYSU dataset. The results clearly show that SAM-FNet achieves superior performance across all metrics. Specifically, SAM-FNet achieves an accuracy of 92.29%, a precision of 82.71%, a recall of 84.52%, and an $F_{1}$ score of 83.59%. In terms of single-class recall, SAM-FNet reaches 69.73% for benign tumors and 88.25% for malignant tumors.

In comparison with other cutting-edge methods, SAM-FNet shows remarkable improvements in recall. Specifically, it surpasses the second-best counterpart by 2.15% in overall recall. Even though DLGNet achieves a recall of 86.14% for malignant tumors, it is still 2.11% lower than our method.

III-B4 Ablation Experiments

To verify the effectiveness of each component in SAM-FNet, including the GFE, LFE, and GFO modules, we conducted ablation experiments on the FAHSYSU dataset.

TABLE IV: The results of the ablation experiments.

Variant           GFE   LFE   GFO   Accuracy   Precision   Recall    F1 score
V1                ✓                 89.45%     86.14%      85.13%    85.52%
V2                      ✓           84.52%     79.38%      80.07%    79.64%
V3                ✓     ✓           91.38%     88.41%      87.99%    88.19%
V4 (SAM-FNet)     ✓     ✓     ✓     92.14%     89.57%      88.68%    89.08%
The best performance is in bold.

Table IV shows the results of the ablation experiments. It is observed that variants V3 and V4, which utilize both the global and local feature extractors, outperform variants V1 and V2, which use only a single feature extractor. Specifically, variants V3 and V4 show improvements of at least 1.93% in each evaluation metric. This means that the fusion of global and local features can effectively improve laryngo-pharyngeal tumor detection performance. As for variants V1 and V2, V1 achieves significantly better performance than V2 across all evaluation metrics, with improvements of at least 4.75%. This demonstrates the importance of global semantic understanding of holistic laryngo-pharyngeal images for their classification. From the last row of the table, it can be observed that introducing the GFO module yields further gains: variant V4 improves by 0.78%, 1.16%, 0.69%, and 0.89% in terms of accuracy, precision, recall, and $F_{1}$ score, respectively, compared with variant V3. These results indicate that the variant with the GFO module outperforms the variant without it in detecting tumors.

III-C Visualization Analysis

Figure 3: Illustrations of the Grad-CAM visualization for tumor images in both NBI and WLI modalities. Compared with other state-of-the-art counterparts, SAM-FNet is able to focus on and highlight effective tumor characteristics more precisely.
Figure 4: Illustrations of predicted lesion masks generated by the LoRA-based SAM within the SLL module in both NBI and WLI modalities. The predicted masks produced by our LoRA-based SAM demonstrate a high level of correspondence with the ground truth masks across these imaging modalities.

To intuitively demonstrate the effectiveness of our proposed architecture, Gradient-weighted Class Activation Mapping (Grad-CAM) [30] was applied to show which parts of the image the model pays attention to. Fig. 3 shows examples of Grad-CAM visualizations on laryngoscopic images. It is observed that SAM-FNet correctly identifies tumors and is able to focus on and highlight effective tumor characteristics more precisely than other state-of-the-art counterparts.
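A hypothetical visualization pipeline is sketched below using the open-source pytorch-grad-cam package; the paper does not specify its tooling, and the backbone, target layer, target class, and input here are placeholders.

```python
import numpy as np
import torch
from torchvision.models import resnet50
from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget
from pytorch_grad_cam.utils.image import show_cam_on_image

# Placeholder backbone and target layer; in practice this would be a branch of SAM-FNet.
model = resnet50(weights="IMAGENET1K_V1").eval()
target_layers = [model.layer4[-1]]

rgb = np.random.rand(224, 224, 3).astype(np.float32)          # stand-in laryngoscopic image
inp = torch.from_numpy(rgb).permute(2, 0, 1).unsqueeze(0)

cam = GradCAM(model=model, target_layers=target_layers)
heatmap = cam(input_tensor=inp, targets=[ClassifierOutputTarget(2)])[0]  # class index is illustrative
overlay = show_cam_on_image(rgb, heatmap, use_rgb=True)       # heatmap overlaid on the image
```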

Furthermore, we conducted experiments and visualizations using the LoRA-based SAM on the FAHSYSU dataset to validate its effectiveness in lesion localization. Specifically, the LoRA-based SAM achieved a mean Dice coefficient of 0.5918 for benign tumors and 0.7966 for malignant tumors. In particular, the LoRA-based SAM demonstrates promising results in the segmentation of malignant tumors, potentially enhancing the lesion feature extraction capabilities of the LFE. To visually demonstrate the effectiveness of the LoRA-based SAM, we also present several segmentation results in Fig. 4. The results indicate that the LoRA-based SAM exhibits strong segmentation performance on tumors, even when the tumors are very small.

IV Conclusion

In this study, we propose a novel SAM-guided fusion network (SAM-FNet), a dual-branch architecture specifically designed for laryngo-pharyngeal tumor detection. SAM-FNet consists of five key components: a SAM-guided lesion location (SLL) module, a global feature extractor (GFE), a local feature extractor (LFE), a GAN-like feature optimization (GFO) module, and a classifier. To capture the critical lesion information in laryngo-pharyngeal endoscopic images, we introduce the SLL module, which leverages the powerful object segmentation capabilities of the SAM to accurately identify and segment the lesion regions for subsequent feature extraction. This ensures that the network focuses on the most relevant areas of the tumor, improving the overall detection performance. Furthermore, to better capture the comprehensive characteristics of laryngo-pharyngeal tumors, we propose the GFO module, which utilizes a GAN-like mechanism to learn complementary features between the global and local branches of the network. By fusing the global and local representations, the model gains a more thorough understanding of the tumor’s morphological and textural features, leading to enhanced classification accuracy. We evaluate the proposed SAM-FNet on two datasets, FAHSYSU as the internal dataset and SAHSYSU as the external dataset, and the results demonstrate its effectiveness. SAM-FNet achieves an overall accuracy of 92.14% and 92.29% on the FAHSYSU and SAHSYSU datasets, respectively, surpassing state-of-the-art approaches and showcasing its competitive performance on the laryngo-pharyngeal tumor detection task.

References

  • [1] H. Sung, J. Ferlay, R. L. Siegel, M. Laversanne, I. Soerjomataram, A. Jemal, and F. Bray, “Global cancer statistics 2020: Globocan estimates of incidence and mortality worldwide for 36 cancers in 185 countries,” CA: a cancer journal for clinicians, vol. 71, no. 3, pp. 209–249, 2021.
  • [2] E. Rudolph, G. Dyckhoff, H. Becher, A. Dietz, and H. Ramroth, “Effects of tumour stage, comorbidity and therapy on survival of laryngeal cancer patients: a systematic review and a meta-analysis,” European Archives of Oto-Rhino-Laryngology, vol. 268, pp. 165–179, 2011.
  • [3] C. Sampieri, M. A. Azam, A. Ioppi, C. Baldini, S. Moccia, D. Kim, A. Tirrito, A. Paderno, C. Piazza, L. S. Mattos et al., “Real-time laryngeal cancer boundaries delineation on white light and narrow-band imaging laryngoscopy with deep learning,” The Laryngoscope, vol. 134, no. 6, pp. 2826–2834, 2024.
  • [4] H. Irjala, N. Matar, M. Remacle, and L. Georges, “Pharyngo-laryngeal examination with the narrow band imaging technology: early experience,” European archives of oto-rhino-laryngology, vol. 268, pp. 801–806, 2011.
  • [5] M. A. Azam, C. Sampieri, A. Ioppi, S. Africano, A. Vallin, D. Mocellin, M. Fragale, L. Guastini, S. Moccia, C. Piazza et al., “Deep learning applied to white light and narrow band imaging videolaryngoscopy: toward real-time laryngeal cancer detection,” The Laryngoscope, vol. 132, no. 9, pp. 1798–1806, 2022.
  • [6] X.-G. Ni, G.-Q. Wang, F.-Y. Hu, X.-M. Xu, L. Xu, X.-Q. Liu, X.-S. Chen, L. Liu, X.-L. Ren, Y. Yang et al., “Clinical utility and effectiveness of a training programme in the application of a new classification of narrow-band imaging for vocal cord leukoplakia: A multicentre study,” Clinical Otolaryngology, vol. 44, no. 5, pp. 729–735, 2019.
  • [7] J. Chen, Z. Li, T. Wu, and X. Chen, “Accuracy of narrow-band imaging for diagnosing malignant transformation of vocal cord leukoplakia: A systematic review and meta-analysis,” Laryngoscope Investigative Otolaryngology, vol. 8, no. 2, pp. 508–517, 2023.
  • [8] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18. Springer, 2015, pp. 234–241.
  • [9] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 4, pp. 834–848, 2017.
  • [10] M. Tan and Q. Le, “Efficientnet: Rethinking model scaling for convolutional neural networks,” in International conference on machine learning. PMLR, 2019, pp. 6105–6114.
  • [11] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
  • [12] X. Luo, J. Zhang, Z. Li, and R. Yang, “Diagnosis of ulcerative colitis from endoscopic images based on deep learning,” Biomedical Signal Processing and Control, vol. 73, p. 103443, 2022, doi: 10.1016/j.bspc.2021.103443.
  • [13] Y. Ling, Y. Wang, W. Dai, J. Yu, P. Liang, and D. Kong, “Mtanet: Multi-task attention network for automatic medical image segmentation and classification,” IEEE Transactions on Medical Imaging, vol. 43, no. 2, pp. 674–685, 2024, doi: 10.1109/TMI.2023.3317088.
  • [14] J. Ma, Y. He, F. Li, L. Han, C. You, and B. Wang, “Segment anything in medical images,” Nature Communications, vol. 15, no. 1, p. 654, 2024, doi: 10.1038/s41467-024-44824-z.
  • [15] K.-N. Wang, S. Zhuang, Q.-Y. Ran, P. Zhou, J. Hua, G.-Q. Zhou, and X. He, “Dlgnet: A dual-branch lesion-aware network with the supervised gaussian mixture model for colon lesions classification in colonoscopy images,” Medical Image Analysis, vol. 87, p. 102832, 2023, doi: 10.1016/j.media.2023.102832.
  • [16] S. Basu, M. Gupta, P. Rana, P. Gupta, and C. Arora, “Radformer: Transformers with global–local attention for interpretable and accurate gallbladder cancer detection,” Medical Image Analysis, vol. 83, p. 102676, 2023, doi: 10.1016/j.media.2022.102676.
  • [17] M. Zhu, Z. Chen, and Y. Yuan, “Dsi-net: Deep synergistic interaction network for joint classification and segmentation with endoscope images,” IEEE Transactions on Medical Imaging, vol. 40, no. 12, pp. 3315–3325, 2021, doi: 10.1109/TMI.2021.3083586.
  • [18] J. Li, P. Zhang, T. Wang, L. Zhu, R. Liu, X. Yang, K. Wang, D. Shen, and B. Sheng, “Dsmt-net: Dual self-supervised multi-operator transformation for multi-source endoscopic ultrasound diagnosis,” IEEE Transactions on Medical Imaging, vol. 43, no. 1, pp. 64–75, 2024, doi: 10.1109/TMI.2023.3289859.
  • [19] Z. Yang, M. Qiu, X. Fan, G. Dai, W. Ma, X. Peng, X. Fu, and Y. Li, “cvan: A novel sleep staging method via cross-view alignment network,” IEEE Journal of Biomedical and Health Informatics, 2024, doi: 10.1109/JBHI.2024.3413081.
  • [20] Y. Komeda, H. Handa, T. Watanabe, T. Nomura, M. Kitahashi, T. Sakurai, A. Okamoto, T. Minami, M. Kono, T. Arizumi et al., “Computer-aided diagnosis based on convolutional neural network system for colorectal polyp classification: preliminary experience,” Oncology, vol. 93, no. Suppl. 1, pp. 30–34, 2017.
  • [21] L. Cai, L. Chen, J. Huang, Y. Wang, and Y. Zhang, “Know your orientation: A viewpoint-aware framework for polyp segmentation,” Medical Image Analysis, p. 103288, 2024, doi: 10.1016/j.media.2024.103288.
  • [22] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2961–2969.
  • [23] M. A. Mazurowski, H. Dong, H. Gu, J. Yang, N. Konz, and Y. Zhang, “Segment anything model for medical image analysis: an experimental study,” Medical Image Analysis, vol. 89, p. 102918, 2023, doi: 10.1016/j.media.2023.102918.
  • [24] Y. Huang, X. Yang, L. Liu, H. Zhou, A. Chang, X. Zhou, R. Chen, J. Yu, J. Chen, C. Chen et al., “Segment anything model for medical images?” Medical Image Analysis, vol. 92, p. 103061, 2024, doi: 10.1016/j.media.2023.103061.
  • [25] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [26] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2117–2125.
  • [27] S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon, “Cbam: Convolutional block attention module,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 3–19.
  • [28] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial networks,” Communications of the ACM, vol. 63, no. 11, pp. 139–144, 2020.
  • [29] K. Zhang and D. Liu, “Customized segment anything model for medical image segmentation,” arXiv preprint arXiv:2304.13785, 2023.
  • [30] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-cam: Visual explanations from deep networks via gradient-based localization,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 618–626.