
† Equal contribution. * Corresponding author.
1 School of Microelectronics and Communication Engineering, Chongqing University, Chongqing 400044, China. 2 Huawei Technologies Co., Ltd., China. 3 Huawei Noah's Ark Lab, Beijing 100196, China.
Email: [email protected]

Urban Waterlogging Detection: A Challenging Benchmark and Large-Small Model Co-Adapter

Suqi Song1 (0009-0007-6132-9212), Chenxu Zhang1 (0000-0002-7079-7284), Peng Zhang1 (0009-0008-1123-0115), Pengkun Li2, Fenglong Song3, Lei Zhang1 (0000-0002-5305-8543)
Abstract

Urban waterlogging poses a major risk to public safety and infrastructure. Conventional methods based on water-level sensors require high maintenance and can hardly achieve full coverage. Recent advances employ surveillance camera imagery and deep learning for detection, yet these struggle amid scarce data and adverse environmental conditions. In this paper, we establish a challenging Urban Waterlogging Benchmark (UW-Bench) under diverse adverse conditions to advance real-world applications. We propose a Large-Small Model co-adapter paradigm (LSM-adapter), which harnesses the substantial generic segmentation potential of the large model and the specific task-directed guidance of the small model. Specifically, a Triple-S Prompt Adapter module together with a Dynamic Prompt Combiner is proposed to generate and then merge multiple prompts for mask decoder adaptation. Meanwhile, a Histogram Equalization Adapter module is designed to infuse image-specific information for image encoder adaptation. Results and analysis show the challenge and superiority of our developed benchmark and algorithm. Project page: https://github.com/zhang-chenxu/LSM-Adapter

Keywords:
Urban waterlogging detection · Segment anything model · Benchmark · Adaptation · Large-small model

1 Introduction

Road water accumulation is a hidden danger: it not only causes structural damage to the pavement, such as cracks and depressions, but also obstructs traffic flow, posing risks of accidents and threats to public safety. Therefore, early identification of waterlogged areas on urban roads is critical and essential.

Traditional urban waterlogging detection methods install sensors on roadways to measure water levels, but these sensors are challenging to maintain and can hardly achieve full coverage [1, 24]. Recently, deep learning approaches leveraging surveillance cameras have been explored for flood detection [17, 22, 29, 30]. However, due to the lighting variability of water reflections and the complexity of urban backgrounds, urban waterlogging detection faces several challenges: 1) Waterlogged areas vary in shape, size and depth, making it difficult to learn a uniform set of features; 2) Reflections on the water surface, along with shallow and clear standing water, render water texture information indistinct; 3) Under low-light conditions, waterlogging features are not prominent, further intensifying the difficulty of detection. Owing to these challenges, existing methods struggle to detect waterlogging or provide accurate segmentation in real-world urban scenarios. In particular, the very limited scale and insufficient diversity of labeled data also diminish the generalizability of current methods, making urban waterlogging detection a hard nut to crack.

Figure 1: Waterlogging detection under general and hard conditions, such as strong-light reflection, low-light conditions and clear water. The first four rows show general samples and the last four rows show hard samples, illustrating the practical difficulty of this task.
Figure 2: The proposed Large-Small Model co-adapter paradigm, which includes a histogram equalization adapter, a Triple-S prompt adapter and a dynamic prompt combiner. All components except the image encoder of SAM are trained for prompt generation, learning and adaptation toward adverse waterlogging detection.

Recently, Meta AI released an innovative visual foundation model known as the Segment Anything Model (SAM) [14]. Through prompt engineering and training on a corpus of over 1 billion masks, SAM exhibits formidable zero-shot capabilities and impressive segmentation performance in numerous application fields [39]. However, lacking task-specific knowledge and relying on manual prompts, SAM shows sub-optimal outcomes in downstream tasks [38]. Accordingly, parameter fine-tuning [20, 18], integrating learnable adapters [5, 40, 25] and devising automated prompting [31, 2, 8, 23] have been explored to improve SAM on downstream tasks. Yet these techniques still leave a gap with real-world waterlogging under adverse conditions.

To advance urban waterlogging detection, the first challenge is data scarcity. Existing datasets are of limited scale or lack diversity and comprise only samples that are easy to recognize [30]. Models trained on such data tend to exhibit poor generalizability and struggle to be deployed in real-world applications. To solve this practical issue, we first construct a challenging benchmark tailored for real-world urban waterlogging detection, covering adverse conditions such as low light, strong-light reflections and clear water. A total of 7,677 waterlogging images are collected with manual labels, containing frames from surveillance cameras and handheld mobile devices. Fig. 1 shows the visual challenge of waterlogging images and the effectiveness of our approach.

To combine the generalization of the large model across diverse conditions with the specificity of the small model in the downstream task, we propose a SAM-guided Large-Small Model co-adapter paradigm (LSM-adapter), exploring combined prompt tuning and adaptation for efficient yet robust urban waterlogging detection. We design a Triple-S Prompt adapter (TSP-Adapt) comprising a small-model-based spatial prompter, a prototype-based semantic prompter and a spectrum-based style prompter, which generate prompts from the small model, the large model and the raw input, respectively. The origins and functions of these prompts are distinct, offering complementary and counterbalancing benefits and thereby furnishing the large model with more comprehensive and diverse information. Meanwhile, we propose a Dynamic Prompt Combiner (DPC) composed of a set of learnable weights and an adaptive embedding to dynamically weigh and blend the above prompts for the mask decoder. Given that the features of waterlogging images are often not prominent, we design a Histogram Equalization adapter (HE-Adapt) to infuse enhanced task-relevant information (e.g., texture and contrast) into the image encoder. The proposed LSM-adapter paradigm is illustrated in Fig. 2.

In summary, our main contributions are as follows:

  • We first construct a challenging real-world urban waterlogging benchmark (UW-Bench) under adverse conditions, advancing the field towards application deployment with large models.

  • We propose an innovative large-small model co-adapter paradigm (LSM-adapter), aiming at a win-win regime. To learn a robust prompter, a Triple-S prompt adapter (TSP-Adapt) with a dynamic prompt combiner is formulated, enabling successful adaptation.

  • We pioneer the use of a vision foundation model, i.e., SAM, for urban waterlogging detection, providing new insights for future research.

2 Related Work

2.1 Urban Waterlogging Detection

Urban waterlogging detection is crucial for traffic management, urban planning, and disaster early warning systems. Early methods are based on water-level sensors [1, 24], which detect water accumulation within a certain area through sensor devices placed at specific locations in a city. However, this approach is costly to maintain and very limited in detection range. Remote sensing satellite imagery, with its wide monitoring range, has thus been introduced [15, 27]. Since remote sensing-based methods lack local detail information, some studies have explored utilizing image or video data from surveillance cameras to detect waterlogging [21, 7, 37, 13, 34]. [21] combines local spatial-temporal features and brightness signals to detect water in videos using decision forests. [7] estimates flood extent from crowdsourced images using brown color segmentation to identify flood water. Further efforts explored CNN-based deep learning approaches [17, 22, 29, 30], such as Mask R-CNN [10] and DeepLabv3+ [3], and improved waterlogging detection performance. In this paper, we pioneer the use of a vision foundation model (SAM) with innovative designs on a newly developed urban waterlogging benchmark to advance this field fundamentally.

2.2 SAM Adaptation

SAM [14] is composed of a vision transformer-based image encoder, a lightweight mask decoder and a flexible prompt encoder that processes diverse inputs such as points, bounding boxes, masks and text. Numerous SAM variants have emerged, aiming to explore its potential in various tasks such as medical image analysis [12, 18, 20, 40], camouflaged object detection [33, 5] and mirror and transparent object detection [9]. Adapting SAM to downstream tasks remains a challenge. Early attempts directly fine-tune a part of SAM (e.g., the decoder) on downstream datasets [12, 18, 20]. As full fine-tuning of the image encoder is computationally intensive, some methods are inspired by adapters in natural language processing (NLP) and insert adapters into SAM, achieving efficient fine-tuning by training the adapters only [5, 25, 40]. For example, SAM-Adapter [5] adds adapters between transformer blocks of the image encoder, and SAMed [40] employs LoRA [11] to approximate low-rank updates of the parameters in the image encoder. Several studies accomplish adaptation by generating automatic task-specific prompts [31, 2, 8, 23]. For example, RSPrompter [2] generates appropriate prompts based on semantic information to yield semantically clear segmentation results for remote sensing images. In this paper, we consider dual adaptation at the image and prompt levels, along with the collaboration of large and small models.

3 Method

3.1 SAM-Based Task-Generalized Large Model Branch

The SAM-based large model is the main part of the entire framework and predicts the final segmentation mask. We retain three core components of SAM: the image encoder, frozen with pretrained parameters, the lightweight mask decoder and the prompt encoder. As previously mentioned, directly deploying SAM to downstream tasks produces unsatisfactory results due to the frozen image encoder [5]. To facilitate image encoder adaptation, we design a histogram equalization adapter laterally connected with the image encoder.

3.1.1 Histogram Equalization Adapter Module (HE-Adapt).

The internal structure of the histogram equalization adapter module is presented in Fig. 3 (a); it mainly consists of a histogram equalization operation, a high-frequency filter and MLP blocks. Given that the features of water are not pronounced in most challenging scenarios, we first conduct histogram equalization to highlight the contrast and texture of the input image. The enhanced image is then passed through a high-frequency filter to extract high-frequency information beneficial for segmentation, and converted into a frequency patch embedding. The patch embedding of the original input image is reduced in dimension by a fully-connected (FC) layer and added to the frequency patch embedding. This fused feature is mapped by $N$ individual MLP blocks and one parameter-shared MLP, and then merged with the original features of each transformer block in the SAM image encoder.
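To make the data flow concrete, a minimal PyTorch sketch of this adapter is given below. The module and parameter names (HEAdapter, mid_dim), the FFT-based high-pass filter, and the assumption that histogram equalization is applied beforehand (e.g., with torchvision's equalize on uint8 images) are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn


def high_pass(x: torch.Tensor, ratio: float = 0.25) -> torch.Tensor:
    """Keep high-frequency content by zeroing a centered low-frequency square in the FFT domain."""
    freq = torch.fft.fftshift(torch.fft.fft2(x, norm="ortho"), dim=(-2, -1))
    _, _, h, w = x.shape
    ch, cw, rh, rw = h // 2, w // 2, int(h * ratio / 2), int(w * ratio / 2)
    mask = torch.ones_like(freq)
    mask[..., ch - rh:ch + rh, cw - rw:cw + rw] = 0  # suppress low frequencies
    return torch.fft.ifft2(torch.fft.ifftshift(freq * mask, dim=(-2, -1)), norm="ortho").real


class HEAdapter(nn.Module):
    """Sketch of HE-Adapt: fuse high-frequency cues of the equalized image with the
    reduced patch embedding, then emit one residual feature per transformer block."""

    def __init__(self, patch: int = 16, embed_dim: int = 768, mid_dim: int = 32, num_blocks: int = 12):
        super().__init__()
        self.freq_embed = nn.Conv2d(3, mid_dim, kernel_size=patch, stride=patch)  # frequency patch embedding
        self.reduce = nn.Linear(embed_dim, mid_dim)                               # FC dimension reduction
        self.block_mlps = nn.ModuleList(
            [nn.Sequential(nn.Linear(mid_dim, mid_dim), nn.GELU()) for _ in range(num_blocks)]
        )
        self.shared_mlp = nn.Linear(mid_dim, embed_dim)                           # parameter-shared MLP

    def forward(self, eq_image: torch.Tensor, patch_embed: torch.Tensor):
        # eq_image: histogram-equalized image (B, 3, H, W); patch_embed: SAM patch embedding (B, h, w, D)
        freq = self.freq_embed(high_pass(eq_image)).flatten(2).transpose(1, 2)    # (B, h*w, mid_dim)
        fused = freq + self.reduce(patch_embed.flatten(1, 2))                     # add reduced patch embedding
        # one residual per transformer block of the SAM image encoder
        return [self.shared_mlp(mlp(fused)) for mlp in self.block_mlps]
```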

Figure 3: Details of the proposed histogram equalization adapter and the prototype learning based semantic prompter.

3.2 CNN-Based Task-Specific Small Model Branch

Waterlogging is reflective and transparent, allowing it to easily camouflage itself under varying lighting and complex environmental backgrounds. To this end, we adopt SINet [6], which succeeds in camouflaged object detection, as our task-specific small model. To accommodate diverse requirements, the choice of the small model is flexible; it can be substituted with any other network without altering the overarching framework.

Acting as a domain expert, the small model interacts with the large model through the spatial prompt, furnishing it with prior knowledge and directional task guidance. Given an input image, the spatial prompt is generated by a spatial prompter built on the small model; it could be the predicted mask or a further processed version such as a bounding box or a set of points, encapsulating the spatial location of the object to be detected (see Sec. 3.3.1 for details).

3.3 Triple-S Prompt Adapter Module

The Triple-S prompt adapter module (TSP-Adapt) consists of a spatial prompter, a semantic prompter and a style prompter.

3.3.1 Spatial Prompter.

SAM originally considers two sets of prompts, sparse (boxes or points) and dense (masks), both of which provide spatial location information for the object to be segmented. We propose to generate such prompts via a spatial prompter utilizing the outputs of the small model. The mask $M_{\mathrm{small}}$ predicted by the small model can be directly used as the dense prompt, and further processing of $M_{\mathrm{small}}$ yields either boxes or points as sparse prompts. For box prompts, we take the bounding boxes of the regions composed of all pixels predicted as foreground in the mask, represented by the coordinates of the top-left and bottom-right corners. For point prompts, we divide the mask into multiple grid regions. In each grid area $G_{g\times g}$, all pixels are divided into a positive point set $I_{P}=\{(i,j)\mid M_{\mathrm{small}}(i,j)\geq\tau\}$ and a negative point set $I_{N}=\{(i,j)\mid M_{\mathrm{small}}(i,j)<\tau\}$, where $\tau$ is a preset threshold. If $I_{P}$ is not empty, we select the point $p\in I_{P}$ with the highest prediction confidence as a positive prompt and set its label to 1; otherwise, we take the point $p\in I_{N}$ with the lowest prediction confidence as a negative prompt and set its label to 0. The points from all grids together form the grid point prompts, represented by their coordinates and labels. Although three types of prompts are available, to avoid redundant information we select only one as the final spatial prompt fed into the prompt encoder, obtaining the spatial prompt embedding $\boldsymbol{e}_{\mathrm{Spa}}$. It is noteworthy that the prompt encoder processes dense and sparse prompts differently: dense prompts are embedded using convolutions before being added to the image embedding, while sparse prompts are encoded with positional encoding to generate the corresponding sparse embeddings.
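The mask-to-prompt conversion described above can be sketched as follows; the grid size, threshold and fallback behavior are illustrative assumptions, and the input is a single-image probability mask.

```python
import torch


def mask_to_box(mask: torch.Tensor, thr: float = 0.5) -> torch.Tensor:
    """Bounding box (x1, y1, x2, y2) of all pixels predicted as foreground in an (H, W) mask."""
    ys, xs = torch.nonzero(mask >= thr, as_tuple=True)
    if ys.numel() == 0:                                  # no foreground: fall back to the full image
        h, w = mask.shape
        return torch.tensor([0.0, 0.0, w - 1.0, h - 1.0])
    return torch.stack([xs.min(), ys.min(), xs.max(), ys.max()]).float()


def mask_to_grid_points(mask: torch.Tensor, grid: int = 8, thr: float = 0.5):
    """One point per grid cell: the most confident pixel if the cell contains any value >= thr
    (label 1), otherwise the least confident pixel (label 0)."""
    h, w = mask.shape
    gh, gw = h // grid, w // grid
    points, labels = [], []
    for gy in range(grid):
        for gx in range(grid):
            cell = mask[gy * gh:(gy + 1) * gh, gx * gw:(gx + 1) * gw]
            if (cell >= thr).any():
                idx, label = torch.argmax(cell), 1       # most confident positive pixel
            else:
                idx, label = torch.argmin(cell), 0       # least confident pixel as negative point
            py, px = divmod(int(idx), cell.shape[1])
            points.append([gx * gw + px, gy * gh + py])  # (x, y) coordinates
            labels.append(label)
    return torch.tensor(points, dtype=torch.float), torch.tensor(labels)
```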

3.3.2 Semantic Prompter.

The image embedding of the large model contains rich semantic information. We therefore propose a prototype learning-based semantic prompter, which leverages useful foreground features from the large model to generate semantic prompts; the process is detailed in Fig. 3 (b). A projector first maps the image embedding $\boldsymbol{Z}\in\mathbb{R}^{H\times W\times D}$ into the projected embedding $\boldsymbol{\bar{Z}}\in\mathbb{R}^{H\times W\times D}$. Inspired by [42], we randomly initialize a group of $C\cdot K$ prototypes $\{\boldsymbol{p}_{c,k}\in\mathbb{R}^{D}\}^{C,K}_{c,k=1}$ in the embedding space, where $C$ is the number of categories and each class is represented by $K$ prototypes. For each pixel sample $\boldsymbol{\bar{z}}_{i,j}\in\mathbb{R}^{D}$, $i\in\{1\cdots W\}$, $j\in\{1\cdots H\}$ in the projected image embedding $\boldsymbol{\bar{Z}}$, we compute its cosine similarity with each prototype $\boldsymbol{p}_{c,k}$ to obtain a similarity vector $\boldsymbol{s}_{i,j}\in\mathbb{R}^{J}$ at position $(i,j)$ of the similarity matrix $\boldsymbol{S}\in\mathbb{R}^{H\times W\times J}$, where $J=C\cdot K$. The category of the prototype corresponding to the maximum value in $\boldsymbol{s}_{i,j}$ is assigned to the pixel sample $\boldsymbol{\bar{z}}_{i,j}$ as the pseudo label $c_{i,j}^{*}$. The pseudo mask generation (PMG) process can be represented as follows:

\boldsymbol{M}=\{c^{*}_{i,j}\}^{H,W}_{i,j=1},\ \text{with}\ \left(c^{*}_{i,j},k^{*}_{i,j}\right)=\underset{(c,k)}{\arg\max}\left\{\langle\boldsymbol{\bar{z}}_{i,j},\boldsymbol{p}_{c,k}\rangle\right\}_{c,k=1}^{C,K}, \qquad (1)

where $\langle\cdot,\cdot\rangle$ denotes the cosine similarity operator. The pseudo mask $\boldsymbol{M}$ is then one-hot encoded and used in conjunction with the original image embedding $\boldsymbol{Z}$ to compute the masked average pooling (MAP), which filters out irrelevant background features, preserves significant foreground features, and yields the semantic embedding as follows:

\boldsymbol{e}_{\mathrm{Sem}}=\mathrm{Concat}(\boldsymbol{e}_{\mathrm{Sem}}^{1},\boldsymbol{e}_{\mathrm{Sem}}^{2},\cdots,\boldsymbol{e}_{\mathrm{Sem}}^{C}), \qquad (2)

where $\mathrm{Concat}(\cdot)$ denotes the concatenation operator and $\boldsymbol{e}_{\mathrm{Sem}}^{c}$ represents the semantic embedding of class $c$ ($c\in\{1\cdots C\}$), computed as follows:

\boldsymbol{e}_{\mathrm{Sem}}^{c}=\frac{\sum_{i,j}\boldsymbol{Z}(i,j)\odot\boldsymbol{M}^{c}(i,j)}{\sum_{i,j}\boldsymbol{M}^{c}(i,j)}, \qquad (3)

where $\odot$ denotes the Hadamard product. The prototype $\boldsymbol{p}_{c,k}$ is momentum-updated after each training iteration according to the center of the $k$-th sub-cluster of the training samples assigned to the $c$-th class via online clustering. Meanwhile, a prototype loss $\mathcal{L}_{\mathrm{proto}}$ from [42] is utilized to optimize the large model.
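A compact sketch of the pseudo-mask generation (Eq. 1) and masked average pooling (Eqs. 2-3) is given below; tensor layouts and function names are assumptions for illustration, and the momentum update of the prototypes is omitted.

```python
import torch
import torch.nn.functional as F


def semantic_prompt(z: torch.Tensor, z_proj: torch.Tensor, prototypes: torch.Tensor) -> torch.Tensor:
    """Pseudo-mask generation (Eq. 1) followed by masked average pooling (Eqs. 2-3).

    z, z_proj : (H, W, D) image embedding and its projected version.
    prototypes: (C, K, D) K prototypes per class.
    Returns the concatenated class-wise semantic embedding of size C*D.
    """
    h, w, d = z_proj.shape
    c, k, _ = prototypes.shape
    sim = F.normalize(z_proj.reshape(-1, d), dim=-1) @ F.normalize(
        prototypes.reshape(-1, d), dim=-1).t()            # (H*W, C*K) cosine similarities
    pseudo = (sim.argmax(dim=-1) // k).reshape(h, w)       # class of the best-matching prototype
    one_hot = F.one_hot(pseudo, num_classes=c).float()     # (H, W, C) one-hot pseudo mask
    # masked average pooling: per-class mean of the original embedding over its pseudo mask
    num = torch.einsum("hwd,hwc->cd", z, one_hot)
    den = one_hot.sum(dim=(0, 1)).clamp(min=1.0).unsqueeze(-1)
    return (num / den).reshape(-1)                         # e_Sem = Concat(e_Sem^1, ..., e_Sem^C)
```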

3.3.3 Style Prompter.

We introduce a spectrum-based style prompter that extracts an image-specific style embedding from the input image as the third type of prompt. The style of an image refers to features such as color and texture. In the context of urban waterlogging detection, these features to some extent reflect information about illumination and the scene, where illumination is a critical factor causing difficulty. Specifically, we first perform a 2D Fast Fourier Transform (FFT) on the input image $f(x,y)$ to acquire its frequency spectrum $F(u,v)$:

F(u,v)=\mathrm{FFT}\{f(x,y)\}=A(u,v)e^{j\Phi(u,v)}, \qquad (4)

where $A(u,v)$ is the amplitude spectrum, $\Phi(u,v)$ is the phase spectrum, and $u$ and $v$ are the frequency coordinates. Since the amplitude spectrum reflects the image style while the phase spectrum carries the image content, we reconstruct the image from the amplitude spectrum alone using the 2D inverse Fast Fourier Transform (iFFT):

\bar{f}(x,y)=\mathrm{iFFT}\{A(u,v)\}, \qquad (5)

The reconstructed amplitude-only image contains the style information, which is then encoded into the style prompt embedding $\boldsymbol{e}_{\mathrm{Sty}}$ by a convolutional block.
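A minimal sketch of this style prompter is shown below; the convolutional block and embedding size are illustrative assumptions.

```python
import torch
import torch.nn as nn


class StylePrompter(nn.Module):
    """Sketch of the spectrum-based style prompter: rebuild an amplitude-only image
    (Eqs. 4-5) and encode it with a small convolutional block."""

    def __init__(self, embed_dim: int = 256, patch: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(                        # illustrative convolutional block
            nn.Conv2d(3, embed_dim, kernel_size=patch, stride=patch),
            nn.GELU(),
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (B, 3, H, W); F(u, v) = A(u, v) * exp(j * Phi(u, v))
        spectrum = torch.fft.fft2(image, norm="ortho")
        amplitude = torch.abs(spectrum)                       # style-carrying amplitude spectrum
        style_img = torch.fft.ifft2(amplitude, norm="ortho").real
        return self.encoder(style_img).flatten(2).transpose(1, 2)  # style embedding e_Sty (B, N, D)
```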

3.4 Dynamic Prompt Combiner

The dynamic prompt combiner (DPC) is designed to find the optimal combination of the above three types of prompts. DPC comprises three sets of dynamic weights $\{w_{1},w_{2},w_{3}\}$ assigned to the spatial, semantic and style prompts, respectively, and a learnable adaptive embedding $\boldsymbol{e}_{\mathrm{Ada}}$ that compensates for potential bias. The dynamically weighted prompts and the adaptive embedding are then concatenated to generate the final prompt, as described in Fig. 2:

\boldsymbol{e}_{\mathrm{P}}=\mathrm{Concat}\{w_{1}\odot\boldsymbol{e}_{\mathrm{Spa}},w_{2}\odot\boldsymbol{e}_{\mathrm{Sem}},w_{3}\odot\boldsymbol{e}_{\mathrm{Sty}},\boldsymbol{e}_{\mathrm{Ada}}\}, \qquad (6)

where $\odot$ denotes the element-wise product. During training, the weights are dynamically updated to encourage well-performing prompts while diminishing less effective ones. The motivation for the learnable embedding $\boldsymbol{e}_{\mathrm{Ada}}$ arises from two aspects: 1) it enables the attention blocks within the decoder to comprehend nonlinear combinations among these embeddings, compensating for the bias that a purely linear combination may neglect; 2) it has the flexibility to capture useful implicit prompt information.
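The combination in Eq. 6 can be sketched as follows; the number of adaptive tokens and the weight initialization are assumptions for illustration.

```python
import torch
import torch.nn as nn


class DynamicPromptCombiner(nn.Module):
    """Sketch of the DPC (Eq. 6): learnable per-prompt weights plus an adaptive embedding,
    concatenated along the token dimension."""

    def __init__(self, embed_dim: int = 256, num_adaptive: int = 4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(3, embed_dim))                       # w1, w2, w3
        self.e_ada = nn.Parameter(torch.zeros(1, num_adaptive, embed_dim))    # adaptive embedding e_Ada

    def forward(self, e_spa: torch.Tensor, e_sem: torch.Tensor, e_sty: torch.Tensor) -> torch.Tensor:
        # each prompt: (B, N_i, embed_dim); the weights broadcast over batch and tokens
        b = e_spa.shape[0]
        return torch.cat(
            [self.w[0] * e_spa, self.w[1] * e_sem, self.w[2] * e_sty, self.e_ada.expand(b, -1, -1)],
            dim=1,
        )
```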

Figure 4: One-stage and Two-stage training strategies of the proposed large-small model paradigm for collaborative optimization.

3.5 Optimization

Two training strategies are proposed to explore suitable joint training of models with diverse architectures, as illustrated in Fig. 4.

3.5.1 One-stage Training.

We introduce a straightforward one-stage training strategy, as depicted in Fig. 4 (a). The image encoder of the large model is frozen and the remaining parts are optimized together. We employ a combination of the focal loss $\mathcal{L}_{\mathrm{focal}}$ [19], cross-entropy loss $\mathcal{L}_{\mathrm{ce}}$, IoU loss $\mathcal{L}_{\mathrm{iou}}$ and the prototype loss $\mathcal{L}_{\mathrm{proto}}$ [42] for the large model:

\mathcal{L}_{\mathrm{large}}=\mathcal{L}_{\mathrm{focal}}+\mathcal{L}_{\mathrm{ce}}+\mathcal{L}_{\mathrm{iou}}+\mathcal{L}_{\mathrm{proto}}. \qquad (7)

The total loss is given as:

\mathcal{L}_{\mathrm{total}}=\mathcal{L}_{\mathrm{large}}+\lambda\mathcal{L}_{\mathrm{small}}, \qquad (8)

where $\mathcal{L}_{\mathrm{small}}$ is the original loss of the small model (the loss function depends on the specific model; Mask R-CNN [10], U2Net [26] and SINet [6] are tested in our experiments) and $\lambda$ is a hyper-parameter.
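For reference, a sketch of the one-stage objective (Eqs. 7-8) is given below using standard formulations of the focal and soft-IoU losses; these are not necessarily the exact variants used in the implementation, and the small-model and prototype losses are passed in as precomputed scalars.

```python
import torch
import torch.nn.functional as F


def focal_loss(logits, target, alpha: float = 0.25, gamma: float = 2.0):
    """Standard binary focal loss on per-pixel logits."""
    ce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * target + (1 - p) * (1 - target)
    alpha_t = alpha * target + (1 - alpha) * (1 - target)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()


def soft_iou_loss(logits, target, eps: float = 1e-6):
    """1 - soft IoU between the predicted probability map and the ground-truth mask."""
    p = torch.sigmoid(logits)
    inter = (p * target).sum(dim=(-2, -1))
    union = (p + target - p * target).sum(dim=(-2, -1))
    return (1.0 - (inter + eps) / (union + eps)).mean()


def one_stage_loss(large_logits, target, proto_loss, small_loss, lam: float = 1.0):
    """Eqs. 7-8: L_total = (focal + ce + iou + proto) + lambda * L_small."""
    l_large = (focal_loss(large_logits, target)
               + F.binary_cross_entropy_with_logits(large_logits, target)
               + soft_iou_loss(large_logits, target)
               + proto_loss)
    return l_large + lam * small_loss
```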

3.5.2 Two-stage Training.

A two-stage training strategy is provided to mitigate issues related to synchronization difficulties and gradient conflicts that may arise during the joint optimization of large-small models, as shown in Fig. 4 (b).

In the first stage, the Triple-S prompt adapter module, the dynamic prompt combiner and the prompt encoder are not involved. The image encoder remains frozen, while the remaining modules of the large model and the small model are independently optimized by their own loss functions, i.e., $\mathcal{L}_{\mathrm{large}}^{\mathrm{s1}}$ and $\mathcal{L}_{\mathrm{small}}^{\mathrm{s1}}$, respectively. The loss of the small model is the same as $\mathcal{L}_{\mathrm{small}}$ in Eq. 8. The training loss of the large model for the first stage is defined as:

\mathcal{L}_{\mathrm{large}}^{\mathrm{s1}}=\mathcal{L}_{\mathrm{focal}}+\mathcal{L}_{\mathrm{ce}}+\mathcal{L}_{\mathrm{iou}}. \qquad (9)

In the second stage, we load the parameters of the modules trained in the first stage (the small model, HE-Adapt and the mask decoder), while integrating the modules that were not considered previously (TSP-Adapt, DPC and the prompt encoder) for training. With the parameters of the image encoder, HE-Adapt and the small model fixed, the optimization objective for the second stage is as follows:

\mathcal{L}_{\mathrm{total}}^{\mathrm{s2}}=\mathcal{L}_{\mathrm{focal}}+\mathcal{L}_{\mathrm{ce}}+\mathcal{L}_{\mathrm{iou}}+\mathcal{L}_{\mathrm{proto}}. \qquad (10)
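A minimal sketch of the stage-two parameter freezing is given below; the attribute names (image_encoder, he_adapter) are placeholders for the corresponding modules.

```python
def configure_stage_two(large_model, small_model):
    """Freeze what stage two keeps fixed and return the parameters that remain trainable."""
    for p in small_model.parameters():                   # small model is fixed
        p.requires_grad_(False)
    for p in large_model.image_encoder.parameters():     # SAM image encoder stays frozen
        p.requires_grad_(False)
    for p in large_model.he_adapter.parameters():        # HE-Adapt is loaded from stage one and fixed
        p.requires_grad_(False)
    # TSP-Adapt, DPC, the prompt encoder and the mask decoder remain trainable
    return [p for p in large_model.parameters() if p.requires_grad]
```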

4 Experiments

4.1 Experimental Setup

4.1.1 Datasets.

To advance the urban waterlogging detection challenge, we develop the UW-Bench dataset containing a total of 7,677 images from various scenarios, including waterlogging scenes, dry roads, and hard cases such as slippery roads and nighttime roads. Using keywords such as urban waterlogging, waterlogged roads, and monitoring viewpoints, we crawled and filtered relevant images from surveillance videos and handheld cameras. The training set includes 5,584 images, while the test set, provided by Huawei Inc., contains 2,093 images of urban scenes captured by surveillance cameras only. For the test set, we consider general-sample and hard-sample cases. Examples from the training and test sets of our UW-Bench are shown in Fig. 5, which indicates the difficulty of detecting waterlogging. In the labeling phase, we use EasyData to annotate the dataset with masks. The pixel-level annotation process is divided into several stages: training, annotation, validation, and correction. We first create some annotation samples and train the annotators to understand the annotation standard. We also assign an inspector to verify the mask annotations. For failed annotations, the inspector gives an explanation and feedback to each annotator to further improve the annotation quality. The overall annotation process ensures the accuracy and reliability of the masks in waterlogging regions.

Figure 5: Training and testing examples in the developed UW-Bench. To objectively evaluate the capability of the model in real-world applications, we consider both general-sample and hard-sample cases in the test set.

4.1.2 Evaluation Metrics.

Waterlogging detection can be viewed as a pixel-level binary classification task for segmentation, where the waterlogged region is the foreground of interest. Based on the ground-truth masks and the predicted waterlogging masks, we adopt the commonly used segmentation metrics Precision, Recall, F1-score, and Intersection over Union (IoU) to evaluate detection performance.
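These metrics can be computed from binary masks as follows; this is a per-image sketch with pixel-level aggregation, which may differ from the exact aggregation used over the test set.

```python
import torch


def segmentation_metrics(pred: torch.Tensor, gt: torch.Tensor, eps: float = 1e-6):
    """Precision, Recall, F1 and IoU for binary waterlogging masks (values in {0, 1})."""
    pred, gt = pred.float(), gt.float()
    tp = (pred * gt).sum()                # true positive pixels
    fp = (pred * (1 - gt)).sum()          # false positives
    fn = ((1 - pred) * gt).sum()          # false negatives
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    iou = tp / (tp + fp + fn + eps)
    return precision.item(), recall.item(), f1.item(), iou.item()
```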

4.1.3 Implementation Details.

For the large model, we choose the ViT-B version of the pre-trained SAM as the backbone, and the input image is resized to 1024×1024. In the training phase, the input images are randomly flipped horizontally, and the batch size is set to 2. The AdamW optimizer is used with an initial learning rate of 0.0005, and cosine annealing decay is applied. In the testing phase, the final binary mask is obtained by a simple thresholding operation with the threshold set to 0.5. To evaluate our approach more comprehensively, we choose the classic Mask R-CNN [10] for semantic segmentation, U2Net [26] for salient object detection and SINet [6] for camouflaged object detection as our task-specific small models, though the framework is not limited to these. All experiments are implemented in PyTorch on an NVIDIA Tesla V100S GPU (32 GB memory). See the appendix for more implementation details.

Table 1: Comparison with existing methods on the proposed UW-Bench. SAM-Adapter denotes training without prompts. The subscripts M, U, S denote Mask R-CNN, U2Net and SINet, respectively, used as the small model to provide spatial prompts. Numbers in bold indicate the best results.
Test Set UW-all UW-hard
Method Precision  Recall  F1-Score   IoU Precision  Recall  F1-Score   IoU
UNet [28] 54.77 45.58 49.75 33.12 61.64 30.40 40.72 25.56
DeeplabV3 [4] 74.10 47.17 57.64 40.50 69.56 36.04 47.48 31.13
SETR[41] 85.20 54.01 66.11 49.37 81.71 44.41 57.54 40.39
Segformer[36] 86.63 60.11 70.98 55.01 81.22 49.81 61.75 44.67
Mask R-CNN [10] 58.34 51.86 54.91 37.84 69.06 42.15 52.35 35.46
U2Net [26] 78.56 49.86 61.00 43.89 77.28 39.89 52.62 35.70
SINet[6] 80.09 59.02 67.96 51.47 77.69 52.00 62.30 45.24
SAM-Adapter [5] 72.13 63.77 67.69 51.16 69.70 58.36 63.53 46.55
SAM-AdapterM 79.34 60.94 68.93 52.60 85.04 58.49 69.31 53.03
SAM-AdapterU 80.63 60.87 69.37 53.11 77.63 35.69 48.90 32.36
SAM-AdapterS 84.52 61.25 71.03 55.07 81.43 54.57 65.35 48.53
LSM-AdapterM 71.20 75.30 73.19 57.73 73.39 74.16 73.77 58.45
LSM-AdapterU 74.99 72.56 73.75 58.42 75.02 70.85 72.88 57.32
LSM-AdapterS 79.47 70.57 74.76 59.69 79.19 67.29 72.76 57.18

4.2 Experimental Results

We evaluate the performance of the proposed LSM-Adapter on our developed UW-Bench with two test sets: UW-all and UW-hard (a challenging subset of hard samples). We compare with representative segmentation models, including UNet [28], DeeplabV3 [4], SETR [41], Segformer [36], Mask R-CNN [10], U2Net [26] and SINet [6], as well as SAM-Adapter [5], a large model based on SAM. In the experiments, we adopt the two-stage training strategy in our LSM-Adapter and select the mask as the output type of the spatial prompter (experiments on different training strategies and spatial prompt types are discussed in Section 4.3). Additionally, SAM-Adapter uses a default prompt embedding as one of the dual inputs of the mask decoder and omits the prompt encoder. For a fair comparison, we also feed prompts generated by the three small models into SAM-Adapter, following the same setting.

The quantitative comparisons are tabulated in Tab. 1. From the results, we observe that our proposed method achieves state-of-the-art performance on both test sets and significantly outperforms existing methods, particularly in Recall, F1-score and IoU under different small models. Specifically, LSM-AdapterM, LSM-AdapterU, and LSM-AdapterS demonstrate increments of 6.8% to 18.28% in F1-score and 8.22% to 19.89% in IoU compared with their respective small models. In particular, LSM-AdapterM exhibits an increment of 19.89% over Mask R-CNN in IoU, indicating that small models with inferior standalone performance can realize more pronounced improvements when co-trained with the large model. Compared to the large model, i.e., SAM-Adapter, our approach improves by 7.45% to 9.02% in F1-score and 6.57% to 8.53% in IoU. Moreover, the competitive small model, i.e., SINet, yields an even greater gain in overall performance when integrated with the large model.

For qualitative analysis, we illustrate the waterlogging segmentation results on several general and hard test samples in Fig. 1. Evidently, the predicted masks of LSM-Adapter better approach the ground truth, further demonstrating its superiority over other methods. We further exploit precision-recall (PR) curves to compare different methods. Fig. 6 illustrates the PR curves of our methods and other existing methods, where each subplot corresponds to a different small model. In each subplot, the PR curve of our method is closer to the top-right corner, exhibiting better performance than existing CNN-based segmentation models and the Transformer-based SAM-Adapter.

Figure 6: Precision-Recall curves of our models and other existing methods. M, B, P denote that the mask, box and point are used as the spatial prompt, respectively. M, U, S denote that Mask R-CNN (a), U2Net (b) and SINet (c) are used as the small model, respectively.
Table 2: Effects of different training strategies and spatial prompts. 1-S and 2-S denotes the one-stage and two-stage training strategy, respectively, as discussed in Section 3.5. The spatial prompts include mask, box and point. For each model, the numbers in bold mean the best results across the same training strategy.
Method Train Prompt Precision Recall F1-Score IoU
LSM-AdapterM 1-S Mask 73.06 53.71 61.91 44.86
Box 77.80 51.01 61.62 44.53
Point 68.11 47.04 55.65 38.55
2-S Mask 71.20 75.30 73.19 57.73
Box 73.60 69.46 71.47 55.61
Point 74.73 68.85 71.67 55.85
LSM-AdapterU 1-S Mask 66.84 66.34 66.59 49.94
Box 72.21 63.75 69.01 52.68
Point 71.65 54.96 62.20 45.14
2-S Mask 74.99 72.56 73.75 58.42
Box 78.98 69.22 73.78 58.45
Point 77.51 67.36 72.08 56.35
LSM-AdapterS 1-S Mask 63.88 62.58 63.22 46.22
Box 61.84 55.27 58.37 41.22
Point 63.60 60.58 62.05 44.98
2-S Mask 79.47 70.57 74.76 59.69
Box 75.75 72.21 73.94 58.65
Point 78.00 69.85 73.70 58.35

4.3 Discussion on Training Strategies and Spatial Prompts

We explore the impact of employing different training strategies and spatial prompts on model performance. The results are presented in Tab. 2.

4.3.1 Results based on different training strategies.

The proposed LSM-Adapter with the one-stage training strategy is significantly inferior to the two-stage training strategy when comparing their best performances, demonstrating that the proposed two-stage training strategy is more stable for adaptation. We posit that the following factors may impede the effective implementation of the one-stage training strategy. During the early stages of training, the predicted output of the small model has low accuracy, which adds complexity to the training of the large model and results in slow convergence. Concurrently, the joint training of two networks with distinct architectures is highly contingent upon the selection of suitable hyper-parameters to achieve a synchronized optimization process; otherwise, the optimization objectives may conflict with each other, preventing the joint model from attaining optimal performance.

4.3.2 Results based on different spatial prompts.

Under identical conditions concerning the small model and training strategy, we compare three types of spatial prompts. Except for LSM-AdapterU, the performance using the mask as the spatial prompt consistently surpasses that of the other two types (box and point) in all scenarios. Although the box prompt is the best for LSM-AdapterU, its gap from the mask prompt is very small, a mere 0.03% for both F1-score and IoU under the two-stage training strategy. Moreover, the performance of models employing mask prompts predominantly exceeds that of both the large models and their respective small models. A possible explanation is that, in comparison to the sparse box and point prompts, mask prompts furnish more abundant referential information.

Table 3: Ablation studies for the innovative components. The first row (no components) corresponds to the SAM-Adapter baseline.
HE-Adapt SpaP SemP StyP DPC    LSM-AdapterU    LSM-AdapterS
F1-Score IoU F1-Score IoU
–  –  –  –  –    65.74  51.16    65.74  51.16
✓  –  –  –  –    71.65  55.83    71.65  55.83
✓  ✓  –  –  –    71.97  56.22    73.95  58.67
✓  ✓  ✓  –  –    72.16  56.45    73.98  58.71
✓  ✓  ✓  ✓  –    72.72  57.14    74.63  59.52
✓  ✓  ✓  ✓  ✓    73.75  58.42    74.76  59.69

4.4 Ablation Study

To verify the effectiveness of each module in our proposed framework, we adopt the two-stage training strategy, select the mask as the spatial prompt, and use LSM-AdapterU and LSM-AdapterS for the ablation experiments. SAM-Adapter is the baseline model, i.e., without any of our proposed modules. Experimental results are shown in Tab. 3. Adding the histogram equalization adapter (HE-Adapt) improves the F1-score and IoU, which proves the effect of HE-Adapt for image encoder adaptation. Additionally, as the spatial prompt (SpaP), semantic prompt (SemP), and style prompt (StyP) are added gradually, the performance improves accordingly, which proves the effectiveness of the Triple-S prompt adapter (TSP-Adapt) for mask decoder adaptation. On this basis, exploiting the dynamic prompt combiner (DPC) achieves the optimal performance, which proves that the proposed DPC can effectively combine different prompts via a dynamic optimization strategy and mitigate the potential bias implied by an individual prompt.

5 Conclusion and Outlook

In this paper, we pioneer the use of a large vision model (i.e., SAM) for a challenging downstream task, i.e., urban waterlogging detection, advancing its real-world application. Considering the generic segmentation capability of the large model and the task-specificity of the small model, we propose a large-small model co-adapter paradigm following a win-win mechanism. To address the data scarcity of real-world waterlogging detection, we further contribute a large benchmark to advance this field fundamentally, on which more powerful algorithms can be developed. In the experiments, we provide new perspectives on the training strategy of large-small model collaboration, owing to their architectural differences. This paper sheds light on the possibility of large models adapting to challenging downstream tasks with congenital data scarcity and adverse conditions.

In future work, we expect to further enrich the benchmark to facilitate the pre-training and fine-tuning of large models. We hope our proposed large-small model paradigm and perspectives can inspire future work, particularly for downstream tasks with limited resources.

Acknowledgements

This work was partially supported by National Key R&D Program of China (2021YFB3100800), National Natural Science Fund of China (62271090, 61771079), Chongqing Natural Science Fund (cstc2021jcyj-jqX0023) and National Youth Talent Project. This work is also supported by Huawei computational power of Chongqing Artificial Intelligence Innovation Center.

References

  • [1] Basha, E.A., Ravela, S., Rus, D.: Model-based monitoring for early warning flood detection. In: Proceedings of the 6th ACM conference on Embedded network sensor systems. pp. 295–308 (2008)
  • [2] Chen, K., Liu, C., Chen, H., Zhang, H., Li, W., Zou, Z., Shi, Z.: Rsprompter: Learning to prompt for remote sensing instance segmentation based on visual foundation model. IEEE Transactions on Geoscience and Remote Sensing (2024)
  • [3] Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence 40(4), 834–848 (2017)
  • [4] Chen, L.C., et al.: Rethinking atrous convolution for semantic image segmentation. arXiv (2017)
  • [5] Chen, T., Zhu, L., Deng, C., Cao, R., Wang, Y., Zhang, S., Li, Z., Sun, L., Zang, Y., Mao, P.: Sam-adapter: Adapting segment anything in underperformed scenes. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3367–3375 (2023)
  • [6] Fan, D.P., Ji, G.P., Cheng, M.M., Shao, L.: Concealed object detection. IEEE transactions on pattern analysis and machine intelligence 44(10), 6024–6042 (2021)
  • [7] Geetha, M., Manoj, M., Sarika, A., Mohan, M., Rao, S.N.: Detection and estimation of the extent of flood from crowd sourced images. In: 2017 international conference on communication and signal processing (ICCSP). pp. 0603–0608. IEEE (2017)
  • [8] Guo, A., Fei, G., Pasupuletic, H., Wang, J.: Clicksam: Fine-tuning segment anything model using click prompts for ultrasound image segmentation. arXiv preprint arXiv:2402.05902 (2024)
  • [9] Han, D., Zhang, C., Qiao, Y., Qamar, M., Jung, Y., Lee, S., Bae, S.H., Hong, C.S.: Segment anything model (sam) meets glass: Mirror and transparent objects cannot be easily detected. arXiv preprint arXiv:2305.00278 (2023)
  • [10] He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask r-cnn. In: Proceedings of the IEEE international conference on computer vision. pp. 2961–2969 (2017)
  • [11] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021)
  • [12] Hu, M., Li, Y., Yang, X.: Skinsam: Empowering skin cancer segmentation with segment anything model. arXiv preprint arXiv:2304.13973 (2023)
  • [13] Jiang, J., Qin, C.Z., Yu, J., Cheng, C., Liu, J., Huang, J.: Obtaining urban waterlogging depths from video images using synthetic image data. Remote Sensing 12(6),  1014 (2020)
  • [14] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
  • [15] Klemas, V.: Remote sensing of floods and flood-prone areas: An overview. Journal of Coastal Research 31(4), 1005–1013 (2015)
  • [16] Le, T.N., Nguyen, T.V., Nie, Z., Tran, M.T., Sugimoto, A.: Anabranch network for camouflaged object segmentation. Computer vision and image understanding 184, 45–56 (2019)
  • [17] Li, W., Zhu, H., Feng, X., Li, F.: Semantic segmentation-based algorithm for urban road waterlogging disaster detection. In: Proceedings of the 2021 5th International Conference on Video and Image Processing. pp. 104–110 (2021)
  • [18] Li, Y., Hu, M., Yang, X.: Polyp-sam: Transfer sam for polyp segmentation. arXiv preprint arXiv:2305.00293 (2023)
  • [19] Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE international conference on computer vision. pp. 2980–2988 (2017)
  • [20] Ma, J., He, Y., Li, F., Han, L., You, C., Wang, B.: Segment anything in medical images. Nature Communications 15(1),  654 (2024)
  • [21] Mettes, P., Tan, R.T., Veltkamp, R.: On the segmentation and classification of water in videos. In: 2014 International Conference on Computer Vision Theory and Applications (VISAPP). vol. 1, pp. 283–292. IEEE (2014)
  • [22] Muhadi, N.A., Abdullah, A.F., Bejo, S.K., Mahadi, M.R., Mijic, A.: Deep learning semantic segmentation for water level estimation using surveillance camera. Applied Sciences 11(20),  9691 (2021)
  • [23] Na, S., Guo, Y., Jiang, F., Ma, H., Huang, J.: Segment any cell: A sam-based auto-prompting fine-tuning framework for nuclei segmentation. arXiv preprint arXiv:2401.13220 (2024)
  • [24] Pasi, A.A., Bhave, U.: Flood detection system using wireless sensor network. International Journal of Advanced Research in Computer Science and Software Engineering 5(2) (2015)
  • [25] Pu, X., Jia, H., Zheng, L., Wang, F., Xu, F.: Classwise-sam-adapter: Parameter efficient fine-tuning adapts segment anything to sar domain for semantic segmentation. arXiv preprint arXiv:2401.02326 (2024)
  • [26] Qin, X., Zhang, Z., Huang, C., Dehghan, M., Zaiane, O.R., Jagersand, M.: U2-net: Going deeper with nested u-structure for salient object detection. Pattern recognition 106, 107404 (2020)
  • [27] Robertson, N.M., Chan, T.: Aerial image segmentation for flood risk analysis. In: 2009 16th IEEE International Conference on Image Processing (ICIP). pp. 597–600. IEEE (2009)
  • [28] Ronneberger, O., et al.: U-net: Convolutional networks for biomedical image segmentation. In: MICCAI (2015)
  • [29] Sarp, S., Kuzlu, M., Cetin, M., Sazara, C., Guler, O.: Detecting floodwater on roadways from image data using mask-r-cnn. In: 2020 International Conference on INnovations in Intelligent SysTems and Applications (INISTA). pp. 1–6. IEEE (2020)
  • [30] Sazara, C., Cetin, M., Iftekharuddin, K.M.: Detecting floodwater on roadways from image data with handcrafted features and deep transfer learning. In: 2019 IEEE intelligent transportation systems conference (ITSC). pp. 804–809. IEEE (2019)
  • [31] Shaharabany, T., Dahan, A., Giryes, R., Wolf, L.: Autosam: Adapting sam to medical images by overloading the prompt encoder. arXiv preprint arXiv:2306.06370 (2023)
  • [32] Skurowski, P., Abdulameer, H., Błaszczyk, J., Depta, T., Kornacki, A., Kozieł, P.: Animal camouflage analysis: Chameleon database. Unpublished manuscript 2(6),  7 (2018)
  • [33] Tang, L., Xiao, H., Li, B.: Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:2304.04709 (2023)
  • [34] Tang, X., Wu, Z., Liu, W., Tian, J., Liu, L.: Exploring effective ways to increase reliable positive samples for machine learning-based urban waterlogging susceptibility assessments. Journal of Environmental Management 344, 118682 (2023)
  • [35] Wang, J., Li, X., Yang, J.: Stacked conditional generative adversarial networks for jointly learning shadow detection and shadow removal. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1788–1797 (2018)
  • [36] Xie, E., et al.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS (2021)
  • [37] Xue, F., Tian, J., Song, X., Yan, Y.: Urban waterlogging monitoring and early warning based on video images. International Journal of Embedded Systems 13(4), 380–386 (2020)
  • [38] Zhang, C., Puspitasari, F.D., Zheng, S., Li, C., Qiao, Y., Kang, T., Shan, X., Zhang, C., Qin, C., Rameau, F., et al.: A survey on segment anything model (sam): Vision foundation model meets prompt engineering. arXiv preprint arXiv:2306.06211 (2023)
  • [39] Zhang, C., Liu, L., Cui, Y., Huang, G., Lin, W., Yang, Y., Hu, Y.: A comprehensive survey on segment anything model for vision and beyond. arXiv preprint arXiv:2305.08196 (2023)
  • [40] Zhang, K., Liu, D.: Customized segment anything model for medical image segmentation. arXiv preprint arXiv:2304.13785 (2023)
  • [41] Zheng, S., et al.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR (2021)
  • [42] Zhou, T., Wang, W., Konukoglu, E., Van Gool, L.: Rethinking semantic segmentation: A prototype view. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2582–2593 (2022)

A. Additional details on the algorithm.

In Algorithm 1, we summarize the detailed training process of our proposed LSM-Adapter using a two-stage training strategy.

Algorithm 1 Training Procedure of LSM-Adapter with two-stage training strategy.

input: Training set $\mathcal{D}_{tr}$.
output: LSM-Adapter $\mathcal{M}$.

1:  /* Stage one */
2:  Freeze the image encoder of the large model $\mathcal{M}_{\mathrm{large}}$
3:  while the maximal iterations are not reached do
4:     $\{x_{i},y_{i}\}_{i=1}^{n}\in\mathcal{D}_{tr}$ // sample mini-batch
5:     Obtain the predictions of the small model $M_{\mathrm{small}}=\mathcal{M}_{\mathrm{small}}(x_{i})$
6:     Obtain the predictions of the large model $M_{\mathrm{large}}=\mathcal{M}_{\mathrm{large}}(x_{i})$
7:     Update $\mathcal{M}_{\mathrm{small}}$ and $\mathcal{M}_{\mathrm{large}}$ by minimizing the loss of the small model and Eq. 9, respectively
8:  end while
9:  /* Stage two */
10:  Load the well-trained $\mathcal{M}_{\mathrm{small}}$ and $\mathcal{M}_{\mathrm{large}}$
11:  Freeze $\mathcal{M}_{\mathrm{small}}$, and freeze the image encoder and HE-Adapt of $\mathcal{M}_{\mathrm{large}}$
12:  while the maximal iterations are not reached do
13:     $\{x_{i},y_{i}\}_{i=1}^{n}\in\mathcal{D}_{tr}$ // sample mini-batch
14:     $Z=\mathrm{LSM\_Encoder}(x_{i})$
15:     Obtain $\boldsymbol{e}_{\mathrm{Spa}}$ from the small model $\mathcal{M}_{\mathrm{small}}$ and the spatial prompter // spatial prompt
16:     Obtain $\boldsymbol{e}_{\mathrm{Sem}}$ by Eq. 1, Eq. 2 and Eq. 3 // semantic prompt
17:     Obtain $\boldsymbol{e}_{\mathrm{Sty}}$ by Eq. 4, Eq. 5 and a convolutional block // style prompt
18:     Obtain $\boldsymbol{e}_{\mathrm{P}}$ by Eq. 6 // final prompt
19:     Obtain the predictions $M=\mathrm{LSM\_Decoder}(Z,\boldsymbol{e}_{\mathrm{P}})$
20:     Update $\mathcal{M}$ by minimizing Eq. 10
21:  end while

B. Additional experimental results.

B.1 Additional implementation details.

For the one-stage training, the batch size is set to 2, and the large model and small model are optimized jointly using the AdamW optimizer. The learning rate of the large model is set to 0.0005. For the three small models, i.e., Mask R-CNN [10], U2Net [26], and SINet [6], the learning rate is set to 0.0005, 0.001, and 0.0005, respectively. Cosine annealing decay is applied, and the number of epochs is set to 40.

For the two-stage training, the large and small models are first optimized individually. The training settings for the large model are the same as in the one-stage training. Mask R-CNN [10] is trained using the SGD optimizer with a learning rate of 0.001, a batch size of 8, and 40 epochs. U2Net [26] is trained using the Adam optimizer with a learning rate of 0.001, a batch size of 8, and 40 epochs. SINet [6] is trained using the Adam optimizer with a learning rate of 0.0001, a batch size of 16, and 100 epochs. In the second stage, the number of epochs is set to 20, and the remaining settings for the large model are the same as in the first stage.

B.2 Model efficiency.

We evaluate the efficiency of LSM-AdapterS, including the parameter count and per-image inference time, in Tab. 4. Due to the introduction of the small model, LSM-AdapterS has a larger overall parameter count than SAM-Adapter [5]. However, the trainable parameters in the second training stage increase by only 1M, and the inference time per image is comparable to that of SAM-AdapterS.

Table 4: Model efficiency.
Method Params (train / total) Inference time
SAM-Adapter 4.1M / 93.8M 0.2088s
SAM-AdapterS 4.1M / 120.8M 0.3034s
LSM-AdapterS 5.1M / 121.8M 0.3273s
Table 5: Ablation of $r$.
$r$ Precision Recall F1-Score IoU
0.25 71.59 55.00 62.21 45.15
0.50 79.46 65.29 71.68 55.86
1.00 79.47 70.57 74.76 59.69

B.3 Additional ablation study.

B.3.1 Effect of the scale of the dataset.

To investigate the impact of the scale of the annotated dataset, we randomly select training data with and without waterlogging according to a ratio $r$ and then merge them to form a subset for training. The ratio $r$ is set to 1, 0.5, and 0.25, respectively, where a value of 1 indicates training with the original fully annotated data. Tab. 5 shows that models trained with more annotated data perform better on the downstream task, indicating that although the foundation model possesses powerful generalization capabilities, additional annotated data are still needed to help the model adapt to downstream tasks.

Table 6: Ablation of $\tau$.
$\tau$ Precision Recall F1-Score IoU
0.3 81.50 66.77 73.40 57.98
0.4 80.55 68.66 74.13 58.89
0.5 78.00 69.85 73.70 58.35
0.6 78.73 69.38 73.76 58.42
0.7 80.18 67.92 73.54 58.15
Table 7: Ablation of $\lambda$.
$\lambda$ Precision Recall F1-Score IoU
10 52.00 68.15 58.99 41.83
1 63.88 62.58 63.22 46.22
0.1 63.22 59.76 61.44 44.34
0.01 70.50 51.42 59.47 42.31

B.3.2 Effect of hyper-parameter selection.

We analyze the impact of two hyper-parameters on performance: the threshold $\tau$ for the point-type spatial prompt and the coefficient $\lambda$ in the total loss (Eq. 8) of the one-stage training. As the threshold $\tau$ varies from 0.3 to 0.7, the overall performance of the model changes minimally, reaching the optimum at a threshold of 0.4. The choice of the coefficient $\lambda$ significantly affects the model's performance. When SINet [6] is used as the small model and $\lambda$ is set to 1, the overall performance of LSM-AdapterS is optimal; for other values, both the F1-score and IoU decline noticeably. We therefore recommend careful selection of this coefficient when choosing other models as the task-specific small model. Generally, the coefficient should be chosen to keep the loss values of the large and small models within the same order of magnitude.

B.4 Experiments on additional datasets.

We use four additional datasets, namely CHAMELEON [32], CAMO [16] and COD10K [6] for camouflaged object detection and ISTD [35] for shadow detection, to evaluate the proposed method on more downstream tasks. Tab. 8 shows that LSM-AdapterS outperforms the other methods on all datasets.

Table 8: Experimental results on additional datasets.
Method CHAMELEON [32] CAMO [16] COD10K [6] ISTD [35]
$S_{\alpha}\uparrow$ $E_{\phi}\uparrow$ $F_{\beta}^{\omega}\uparrow$ $\mathrm{MAE}\downarrow$   $S_{\alpha}\uparrow$ $E_{\phi}\uparrow$ $F_{\beta}^{\omega}\uparrow$ $\mathrm{MAE}\downarrow$   $S_{\alpha}\uparrow$ $E_{\phi}\uparrow$ $F_{\beta}^{\omega}\uparrow$ $\mathrm{MAE}\downarrow$   $\mathrm{BER}\downarrow$
Mask RCNN [10] 0.771 0.798 0.659 0.050 0.680 0.676 0.515 0.107 0.714 0.698 0.521 0.044 4.19
U2Net [26] 0.830 0.877 0.699 0.059 0.642 0.690 0.488 0.140 0.688 0.747 0.457 0.076 3.65
SINet [6] 0.888 0.942 0.816 0.030 0.820 0.882 0.743 0.070 0.815 0.887 0.680 0.037 2.35
SAM-Adapter [5] 0.834 0.858 0.680 0.055 0.800 0.816 0.657 0.094 0.820 0.856 0.657 0.044 1.65
LSM-AdapterM 0.843 0.893 0.765 0.040 0.784 0.849 0.723 0.081 0.817 0.883 0.710 0.035 2.02
LSM-AdapterU 0.868 0.906 0.740 0.044 0.723 0.762 0.596 0.115 0.778 0.821 0.580 0.053 2.22
LSM-AdapterS 0.903 0.955 0.836 0.024 0.825 0.889 0.756 0.066 0.839 0.903 0.727 0.031 1.55

B.5 Additional qualitative results.

We provide more qualitative results to further demonstrate the effectiveness and superiority of the proposed LSM-Adapter. Fig. 7 illustrates the visual comparison results between LSM-Adapter and other existing methods, including Mask R-CNN [10], U2Net [26], SINet [6], and SAM-Adapter [5]. LSM-Adapter selects masks from three different small models as spatial prompts and utilizes the two-stage training strategy. It is evident that LSM-Adapter has the prediction results closest to the ground truth, whether compared with the small or large models, even in the case of challenging samples, such as the first, second, eighth, and ninth rows. Meanwhile, we observe that spatial prompts derived from the small model can compensate for the knowledge gap of the large model when the prediction output of the small model is superior, resulting in LSM-Adapter’s predictions that more closely align with the ground truth. Concurrently, even when the small model’s inferior predictive masks are employed as prompts, LSM-Adapter remains relatively unaffected, maintaining a prediction quality comparable to, if not slightly superior to, that of the standalone large model.

Figure 7: Visual comparison results of different models. M, U, S denote that Mask R-CNN, U2Net and SINet are used as the small model, respectively.