
† Equal contribution. * Corresponding author.
1 School of Microelectronics and Communication Engineering, Chongqing University, Chongqing 400044, China. 2 Huawei Technologies Co., Ltd., China. 3 Huawei Noah's Ark Lab, Beijing 100196, China.
Email: [email protected]

Urban Waterlogging Detection: A Challenging Benchmark and Large-Small Model Co-Adapter

Suqi Song1 (0009-0007-6132-9212), Chenxu Zhang1 (0000-0002-7079-7284), Peng Zhang1 (0009-0008-1123-0115), Pengkun Li2, Fenglong Song3, Lei Zhang1 (0000-0002-5305-8543)
Abstract

Urban waterlogging poses a major risk to public safety and infrastructure. Conventional methods based on water-level sensors require high maintenance and can hardly achieve full coverage. Recent advances employ surveillance camera imagery and deep learning for detection, yet these struggle amid scarce data and adverse environmental conditions. In this paper, we establish a challenging Urban Waterlogging Benchmark (UW-Bench) under diverse adverse conditions to advance real-world applications. We propose a Large-Small Model co-adapter paradigm (LSM-adapter), which harnesses the substantial generic segmentation potential of the large model and the specific task-directed guidance of the small model. Specifically, a Triple-S Prompt Adapter module together with a Dynamic Prompt Combiner is proposed to generate and then merge multiple prompts for mask decoder adaptation. Meanwhile, a Histogram Equalization Adapter module is designed to infuse image-specific information for image encoder adaptation. Results and analysis show the challenge and superiority of our developed benchmark and algorithm. Project page: https://github.com/zhang-chenxu/LSM-Adapter

Keywords:
Urban waterlogging detection · Segment anything model · Benchmark · Adaptation · Large-small model

1 Introduction

Road water accumulation is a hidden danger: it not only causes structural damage to the pavement, such as cracks and depressions, but also obstructs traffic flow, posing risks of accidents and threats to public safety. Therefore, early identification of waterlogged areas on urban roads is critical and essential.

Traditional urban waterlogging detection methods install sensors on roadways to measure water levels, but these sensors are challenging to maintain and can hardly achieve full coverage [1, 24]. Recently, deep learning approaches leveraging surveillance cameras have been explored for flood detection [17, 22, 29, 30]. However, due to the lighting variability of water reflections and the complexity of urban backgrounds, urban waterlogging detection faces several challenges: 1) Waterlogged areas vary in shape, size and depth, making it difficult to learn a uniform set of features; 2) Reflections on the water surface, along with shallow and clear standing water, render water texture information indistinct; 3) Under low-light conditions, waterlogging features are not prominent, further intensifying the difficulty of detection. Owing to these challenges, existing methods struggle to detect waterlogging or provide accurate segmentation in real-world urban scenarios. In particular, the very limited scale and insufficient diversity of labeled data also diminish the generalizability of current methods, making urban waterlogging detection a hard nut to crack.

Figure 1: Waterlogging detection under general and hard conditions, such as strong-light reflection, low-light conditions and clear water. The first four rows show general samples and the last four rows show hard samples, illustrating the practical difficulty of this task.
Figure 2: The proposed Large-Small Model co-adapter paradigm, which includes a histogram equalization adapter, a Triple-S prompt adapter and a dynamic prompt combiner. All components except the image encoder of SAM are trained for prompt generation, learning and adaptation toward adverse waterlogging detection.

Recently, Meta AI released an innovative visual foundation model known as the Segment Anything Model (SAM) [14]. Through prompt engineering and training on a corpus of over 1 billion masks, SAM exhibits formidable zero-shot capabilities and impressive segmentation performance in numerous application fields [39]. However, lacking task-specific knowledge and relying on manual prompts, SAM shows sub-optimal outcomes in downstream tasks [38]. Accordingly, parameter fine-tuning [20, 18], integrating learnable adapters [5, 40, 25] and devising automated prompting [31, 2, 8, 23] have been explored to improve SAM on downstream tasks. Yet these techniques still leave a gap with real-world waterlogging under adverse conditions.

To advance urban waterlogging detection, the first challenge is data scarcity. Existing datasets are of limited scale or lack diversity and comprise only samples that are easy to recognize [30]. Models trained on such data tend to exhibit poor generalizability and struggle to be deployed in real-world applications. To solve this practical issue, we first construct a challenging benchmark tailored for real-world urban waterlogging detection, covering adverse conditions such as low light, strong-light reflections and clear water. A total of 7,677 waterlogging images are collected with manual labels, containing frames from surveillance cameras and handheld mobile devices. Fig. 1 shows the visual challenge of waterlogging images and the effectiveness of our approach.

To combine the generalization of the large model across diverse conditions with the specificity of the small model in the downstream task, we propose a SAM-guided Large-Small Model co-adapter paradigm (LSM-adapter), exploring combined prompt tuning and adaptation for efficient yet robust urban waterlogging detection. We design a Triple-S Prompt adapter (TSP-Adapt) comprising a small-model-based spatial prompter, a prototype-based semantic prompter and a spectrum-based style prompter, which generate prompts from the small model, the large model and the raw input, respectively. The origins and functions of these prompts are distinct, offering complementary and counterbalancing benefits and thereby furnishing the large model with more comprehensive and diverse information. Meanwhile, we propose a Dynamic Prompt Combiner (DPC) composed of a set of learnable weights and an adaptive embedding to dynamically weigh and blend the above prompts for the mask decoder. Given that the features of waterlogging images are often not prominent, we design a Histogram Equalization adapter (HE-Adapt) to infuse enhanced task-relevant information (e.g., texture and contrast) into the image encoder. The proposed LSM-adapter paradigm is illustrated in Fig. 2.

In summary, our main contributions are as follows:

  • We first construct a challenging real-world urban waterlogging benchmark (UW-Bench) under adverse conditions, advancing the field towards application deployment with large models.

  • We propose an innovative large-small model co-adapter paradigm (LSM-adapter), aiming at a win-win regime. To learn a robust prompter, a Triple-S prompt adapter (TSP-Adapt) with a dynamic prompt combiner is formulated, enabling successful adaptation.

  • We pioneer the use of a vision foundation model, i.e., SAM, for urban waterlogging detection, providing new insights for future research.

2 Related Work

2.1 Urban Waterlogging Detection

Urban waterlogging detection is crucial for traffic management, urban planning, and disaster early warning systems. Early methods are based on water-level sensors [1, 24], which detect water accumulation within a certain area through sensor devices placed at specific locations in a city. However, this approach is costly to maintain and very limited in detection range. Remote sensing satellite imagery, with its wide monitoring range, has thus been introduced [15, 27]. Since remote sensing-based methods lack local detail information, some studies have explored utilizing image or video data from surveillance cameras to detect waterlogging [21, 7, 37, 13, 34]. [21] combines local spatial-temporal features and brightness signals to detect water in videos using decision forests. [7] estimates flood extent from crowdsourced images using brown color segmentation to identify flood water. Further efforts explored CNN-based deep learning approaches [17, 22, 29, 30], such as Mask R-CNN [10] and DeepLabv3+ [3], and improved waterlogging detection performance. In this paper, we pioneer the use of a vision foundation model (SAM) with innovative designs on a newly developed urban waterlogging benchmark to advance this field fundamentally.

2.2 SAM Adaptation

SAM [14] is composed of a vision transformer-based image encoder, a lightweight mask decoder and a flexible prompt encoder that processes diverse inputs such as points, bounding boxes, masks and text. Numerous SAM variants have emerged, aiming to explore its potential in various tasks such as medical image analysis [12, 18, 20, 40], camouflaged object detection [33, 5] and mirror and transparent object detection [9]. Adapting SAM to downstream tasks remains a challenge. Early attempts directly fine-tune a part of SAM (e.g., the decoder) on downstream datasets [12, 18, 20]. As full fine-tuning of the image encoder is computationally intensive, some methods are inspired by adapters in natural language processing (NLP) and insert adapters into SAM, achieving efficient fine-tuning by training the adapters only [5, 25, 40]. For example, SAM-Adapter [5] adds adapters between transformer blocks of the image encoder, and SAMed [40] employs LoRA [11] to approximate low-rank updates of the parameters in the image encoder. Several studies accomplish adaptation by generating automatic task-specific prompts [31, 2, 8, 23]. For example, RSPrompter [2] generates appropriate prompts based on semantic information to yield semantically clear segmentation results for remote sensing images. In this paper, we consider dual adaptation at the image and prompt levels, along with the collaboration of large and small models.

3 Method

3.1 SAM-Based Task-Generalized Large Model Branch

The SAM-based large model is the main part of the entire framework and predicts the final segmentation mask. We retain three core components of SAM: the image encoder, frozen with pretrained parameters, the lightweight mask decoder and the prompt encoder. As previously mentioned, directly deploying SAM to downstream tasks produces unsatisfactory results due to the frozen image encoder [5]. To facilitate image encoder adaptation, we design a histogram equalization adapter laterally connected with the image encoder.

3.1.1 Histogram Equalization Adapter Module (HE-Adapt).

The internal structure of the histogram equalization adapter module is presented in Fig. 3 (a); it mainly consists of a histogram equalization operation, a high-frequency filter and MLP blocks. Given that the features of water are not pronounced in most challenging scenarios, we first conduct histogram equalization to highlight the contrast and texture of the input image. The enhanced image is then passed through a high-frequency filter to extract high-frequency information beneficial for segmentation, and converted into a frequency patch embedding. The patch embedding of the original input image is reduced in dimension by a fully-connected (FC) layer and added to the frequency patch embedding. This fused feature is mapped by $N$ individual MLP blocks and one parameter-shared MLP, and then merged with the original features of each transformer block in the SAM image encoder.
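To make the data flow concrete, a minimal PyTorch sketch of this adapter is given below. The module and parameter names (HEAdapter, mid_dim), the FFT-based high-pass filter, and the assumption that histogram equalization is applied beforehand (e.g., with torchvision's equalize on uint8 images) are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn


def high_pass(x: torch.Tensor, ratio: float = 0.25) -> torch.Tensor:
    """Keep high-frequency content by zeroing a centered low-frequency square in the FFT domain."""
    freq = torch.fft.fftshift(torch.fft.fft2(x, norm="ortho"), dim=(-2, -1))
    _, _, h, w = x.shape
    ch, cw, rh, rw = h // 2, w // 2, int(h * ratio / 2), int(w * ratio / 2)
    mask = torch.ones_like(freq)
    mask[..., ch - rh:ch + rh, cw - rw:cw + rw] = 0  # suppress low frequencies
    return torch.fft.ifft2(torch.fft.ifftshift(freq * mask, dim=(-2, -1)), norm="ortho").real


class HEAdapter(nn.Module):
    """Sketch of HE-Adapt: fuse high-frequency cues of the equalized image with the
    reduced patch embedding, then emit one residual feature per transformer block."""

    def __init__(self, patch: int = 16, embed_dim: int = 768, mid_dim: int = 32, num_blocks: int = 12):
        super().__init__()
        self.freq_embed = nn.Conv2d(3, mid_dim, kernel_size=patch, stride=patch)  # frequency patch embedding
        self.reduce = nn.Linear(embed_dim, mid_dim)                               # FC dimension reduction
        self.block_mlps = nn.ModuleList(
            [nn.Sequential(nn.Linear(mid_dim, mid_dim), nn.GELU()) for _ in range(num_blocks)]
        )
        self.shared_mlp = nn.Linear(mid_dim, embed_dim)                           # parameter-shared MLP

    def forward(self, eq_image: torch.Tensor, patch_embed: torch.Tensor):
        # eq_image: histogram-equalized image (B, 3, H, W); patch_embed: SAM patch embedding (B, h, w, D)
        freq = self.freq_embed(high_pass(eq_image)).flatten(2).transpose(1, 2)    # (B, h*w, mid_dim)
        fused = freq + self.reduce(patch_embed.flatten(1, 2))                     # add reduced patch embedding
        # one residual per transformer block of the SAM image encoder
        return [self.shared_mlp(mlp(fused)) for mlp in self.block_mlps]
```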

Figure 3: Details of the proposed histogram equalization adapter and the prototype learning based semantic prompter.

3.2 CNN-Based Task-Specific Small Model Branch

Waterlogging is reflective and transparent, allowing it to easily camouflage itself under varying lighting and complex environmental backgrounds. To this end, we adopt SINet [6], which succeeds in camouflaged object detection, as our task-specific small model. To accommodate diverse requirements, the choice of the small model is flexible; it can be substituted with any other network without altering the overarching framework.

Acting as a domain expert, the small model interacts with the large model through the spatial prompt, furnishing it with prior knowledge and directional task guidance. Given an input image, the spatial prompt is generated by a spatial prompter built on the small model; it could be the predicted mask or a further processed version such as a bounding box or a set of points, encapsulating the spatial location of the object to be detected (see Sec. 3.3.1 for details).

3.3 Triple-S Prompt Adapter Module

The Triple-S prompt adapter module (TSP-Adapt) consists of a spatial prompter, a semantic prompter and a style prompter.

3.3.1 Spatial Prompter.

SAM originally considers two sets of prompts, sparse (boxes or points) and dense (masks), both of which provide spatial location information for the object to be segmented. We propose to generate such prompts via a spatial prompter utilizing the outputs of the small model. The mask $M_{\mathrm{small}}$ predicted by the small model can be directly used as the dense prompt, and further processing of $M_{\mathrm{small}}$ yields either boxes or points as sparse prompts. For box prompts, we take the bounding boxes of the regions composed of all pixels predicted as foreground in the mask, represented by the coordinates of the top-left and bottom-right corners. For point prompts, we divide the mask into multiple grid regions. In each grid area $G_{g\times g}$, all pixels are divided into a positive point set $I_{P}=\{(i,j)\mid M_{\mathrm{small}}(i,j)\geq\tau\}$ and a negative point set $I_{N}=\{(i,j)\mid M_{\mathrm{small}}(i,j)<\tau\}$, where $\tau$ is a preset threshold. If $I_{P}$ is not empty, we select the point $p\in I_{P}$ with the highest prediction confidence as a positive prompt and set its label to 1; otherwise, we take the point $p\in I_{N}$ with the lowest prediction confidence as a negative prompt and set its label to 0. The points from all grids together form the grid point prompts, represented by their coordinates and labels. Although three types of prompts are available, to avoid redundant information we select only one as the final spatial prompt fed into the prompt encoder, obtaining the spatial prompt embedding $\boldsymbol{e}_{\mathrm{Spa}}$. It is noteworthy that the prompt encoder processes dense and sparse prompts differently: dense prompts are embedded using convolutions before being added to the image embedding, while sparse prompts are encoded with positional encoding to generate the corresponding sparse embeddings.
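The mask-to-prompt conversion described above can be sketched as follows; the grid size, threshold and fallback behavior are illustrative assumptions, and the input is a single-image probability mask.

```python
import torch


def mask_to_box(mask: torch.Tensor, thr: float = 0.5) -> torch.Tensor:
    """Bounding box (x1, y1, x2, y2) of all pixels predicted as foreground in an (H, W) mask."""
    ys, xs = torch.nonzero(mask >= thr, as_tuple=True)
    if ys.numel() == 0:                                  # no foreground: fall back to the full image
        h, w = mask.shape
        return torch.tensor([0.0, 0.0, w - 1.0, h - 1.0])
    return torch.stack([xs.min(), ys.min(), xs.max(), ys.max()]).float()


def mask_to_grid_points(mask: torch.Tensor, grid: int = 8, thr: float = 0.5):
    """One point per grid cell: the most confident pixel if the cell contains any value >= thr
    (label 1), otherwise the least confident pixel (label 0)."""
    h, w = mask.shape
    gh, gw = h // grid, w // grid
    points, labels = [], []
    for gy in range(grid):
        for gx in range(grid):
            cell = mask[gy * gh:(gy + 1) * gh, gx * gw:(gx + 1) * gw]
            if (cell >= thr).any():
                idx, label = torch.argmax(cell), 1       # most confident positive pixel
            else:
                idx, label = torch.argmin(cell), 0       # least confident pixel as negative point
            py, px = divmod(int(idx), cell.shape[1])
            points.append([gx * gw + px, gy * gh + py])  # (x, y) coordinates
            labels.append(label)
    return torch.tensor(points, dtype=torch.float), torch.tensor(labels)
```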

3.3.2 Semantic Prompter.

The image embedding of the large model contains rich semantic information. We therefore propose a prototype learning-based semantic prompter, which leverages useful foreground features from the large model to generate semantic prompts; the process is detailed in Fig. 3 (b). A projector first maps the image embedding $\boldsymbol{Z}\in\mathbb{R}^{H\times W\times D}$ into the projected embedding $\boldsymbol{\bar{Z}}\in\mathbb{R}^{H\times W\times D}$. Inspired by [42], we randomly initialize a group of $C\cdot K$ prototypes $\{\boldsymbol{p}_{c,k}\in\mathbb{R}^{D}\}^{C,K}_{c,k=1}$ in the embedding space, where $C$ is the number of categories and each class is represented by $K$ prototypes. For each pixel sample $\boldsymbol{\bar{z}}_{i,j}\in\mathbb{R}^{D}$, $i\in\{1\cdots W\}$, $j\in\{1\cdots H\}$ in the projected image embedding $\boldsymbol{\bar{Z}}$, we compute its cosine similarity with each prototype $\boldsymbol{p}_{c,k}$ to obtain a similarity vector $\boldsymbol{s}_{i,j}\in\mathbb{R}^{J}$ at position $(i,j)$ of the similarity matrix $\boldsymbol{S}\in\mathbb{R}^{H\times W\times J}$, where $J=C\cdot K$. The category of the prototype corresponding to the maximum value in $\boldsymbol{s}_{i,j}$ is assigned to the pixel sample $\boldsymbol{\bar{z}}_{i,j}$ as the pseudo label $c_{i,j}^{*}$. The pseudo mask generation (PMG) process can be represented as follows:

\boldsymbol{M}=\{c^{*}_{i,j}\}^{H,W}_{i,j=1},\ \text{with}\ \left(c^{*}_{i,j},k^{*}_{i,j}\right)=\underset{(c,k)}{\arg\max}\left\{\langle\boldsymbol{\bar{z}}_{i,j},\boldsymbol{p}_{c,k}\rangle\right\}_{c,k=1}^{C,K}, \qquad (1)

where $\langle\cdot,\cdot\rangle$ denotes the cosine similarity operator. The pseudo mask $\boldsymbol{M}$ is then one-hot encoded and used in conjunction with the original image embedding $\boldsymbol{Z}$ to compute the masked average pooling (MAP), which filters out irrelevant background features, preserves significant foreground features, and yields the semantic embedding as follows:

\boldsymbol{e}_{\mathrm{Sem}}=\mathrm{Concat}(\boldsymbol{e}_{\mathrm{Sem}}^{1},\boldsymbol{e}_{\mathrm{Sem}}^{2},\cdots,\boldsymbol{e}_{\mathrm{Sem}}^{C}), \qquad (2)

where $\mathrm{Concat}(\cdot)$ denotes the concatenation operator and $\boldsymbol{e}_{\mathrm{Sem}}^{c}$ represents the semantic embedding of class $c$ ($c\in\{1\cdots C\}$), computed as follows:

\boldsymbol{e}_{\mathrm{Sem}}^{c}=\frac{\sum_{i,j}\boldsymbol{Z}(i,j)\odot\boldsymbol{M}^{c}(i,j)}{\sum_{i,j}\boldsymbol{M}^{c}(i,j)}, \qquad (3)

where $\odot$ denotes the Hadamard product. The prototype $\boldsymbol{p}_{c,k}$ is momentum-updated after each training iteration according to the center of the $k$-th sub-cluster of the training samples assigned to the $c$-th class via online clustering. Meanwhile, a prototype loss $\mathcal{L}_{\mathrm{proto}}$ from [42] is utilized to optimize the large model.
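A compact sketch of the pseudo-mask generation (Eq. 1) and masked average pooling (Eqs. 2-3) is given below; tensor layouts and function names are assumptions for illustration, and the momentum update of the prototypes is omitted.

```python
import torch
import torch.nn.functional as F


def semantic_prompt(z: torch.Tensor, z_proj: torch.Tensor, prototypes: torch.Tensor) -> torch.Tensor:
    """Pseudo-mask generation (Eq. 1) followed by masked average pooling (Eqs. 2-3).

    z, z_proj : (H, W, D) image embedding and its projected version.
    prototypes: (C, K, D) K prototypes per class.
    Returns the concatenated class-wise semantic embedding of size C*D.
    """
    h, w, d = z_proj.shape
    c, k, _ = prototypes.shape
    sim = F.normalize(z_proj.reshape(-1, d), dim=-1) @ F.normalize(
        prototypes.reshape(-1, d), dim=-1).t()            # (H*W, C*K) cosine similarities
    pseudo = (sim.argmax(dim=-1) // k).reshape(h, w)       # class of the best-matching prototype
    one_hot = F.one_hot(pseudo, num_classes=c).float()     # (H, W, C) one-hot pseudo mask
    # masked average pooling: per-class mean of the original embedding over its pseudo mask
    num = torch.einsum("hwd,hwc->cd", z, one_hot)
    den = one_hot.sum(dim=(0, 1)).clamp(min=1.0).unsqueeze(-1)
    return (num / den).reshape(-1)                         # e_Sem = Concat(e_Sem^1, ..., e_Sem^C)
```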

3.3.3 Style Prompter.

We introduce a spectrum-based style prompter that extracts an image-specific style embedding from the input image as the third type of prompt. The style of an image refers to features such as color and texture. In the context of urban waterlogging detection, these features to some extent reflect information about illumination and the scene, where illumination is a critical factor causing difficulty. Specifically, we first perform a 2D Fast Fourier Transform (FFT) on the input image $f(x,y)$ to acquire its frequency spectrum $F(u,v)$:

F(u,v)=\mathrm{FFT}\{f(x,y)\}=A(u,v)e^{j\Phi(u,v)}, \qquad (4)

where $A(u,v)$ is the amplitude spectrum, $\Phi(u,v)$ is the phase spectrum, and $u$ and $v$ are the frequency coordinates. Since the amplitude spectrum reflects the image style while the phase spectrum carries the image content, we reconstruct the image from the amplitude spectrum alone using the 2D inverse Fast Fourier Transform (iFFT):

\bar{f}(x,y)=\mathrm{iFFT}\{A(u,v)\}, \qquad (5)

The reconstructed amplitude-only image contains the style information, which is then encoded into the style prompt embedding $\boldsymbol{e}_{\mathrm{Sty}}$ by a convolutional block.
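A minimal sketch of this style prompter is shown below; the convolutional block and embedding size are illustrative assumptions.

```python
import torch
import torch.nn as nn


class StylePrompter(nn.Module):
    """Sketch of the spectrum-based style prompter: rebuild an amplitude-only image
    (Eqs. 4-5) and encode it with a small convolutional block."""

    def __init__(self, embed_dim: int = 256, patch: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(                        # illustrative convolutional block
            nn.Conv2d(3, embed_dim, kernel_size=patch, stride=patch),
            nn.GELU(),
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (B, 3, H, W); F(u, v) = A(u, v) * exp(j * Phi(u, v))
        spectrum = torch.fft.fft2(image, norm="ortho")
        amplitude = torch.abs(spectrum)                       # style-carrying amplitude spectrum
        style_img = torch.fft.ifft2(amplitude, norm="ortho").real
        return self.encoder(style_img).flatten(2).transpose(1, 2)  # style embedding e_Sty (B, N, D)
```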

3.4 Dynamic Prompt Combiner

The dynamic prompt combiner (DPC) is designed to find the optimal combination of the above three types of prompts. DPC comprises three sets of dynamic weights $\{w_{1},w_{2},w_{3}\}$ assigned to the spatial, semantic and style prompts, respectively, and a learnable adaptive embedding $\boldsymbol{e}_{\mathrm{Ada}}$ that compensates for potential bias. The dynamically weighted prompts and the adaptive embedding are then concatenated to generate the final prompt, as described in Fig. 2:

\boldsymbol{e}_{\mathrm{P}}=\mathrm{Concat}\{w_{1}\odot\boldsymbol{e}_{\mathrm{Spa}},w_{2}\odot\boldsymbol{e}_{\mathrm{Sem}},w_{3}\odot\boldsymbol{e}_{\mathrm{Sty}},\boldsymbol{e}_{\mathrm{Ada}}\}, \qquad (6)

where $\odot$ denotes the element-wise product. During training, the weights are dynamically updated to encourage well-performing prompts while diminishing less effective ones. The motivation for the learnable embedding $\boldsymbol{e}_{\mathrm{Ada}}$ arises from two aspects: 1) it enables the attention blocks within the decoder to comprehend nonlinear combinations among these embeddings, compensating for the bias that a purely linear combination may neglect; 2) it has the flexibility to capture useful implicit prompt information.
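The combination in Eq. 6 can be sketched as follows; the number of adaptive tokens and the weight initialization are assumptions for illustration.

```python
import torch
import torch.nn as nn


class DynamicPromptCombiner(nn.Module):
    """Sketch of the DPC (Eq. 6): learnable per-prompt weights plus an adaptive embedding,
    concatenated along the token dimension."""

    def __init__(self, embed_dim: int = 256, num_adaptive: int = 4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(3, embed_dim))                       # w1, w2, w3
        self.e_ada = nn.Parameter(torch.zeros(1, num_adaptive, embed_dim))    # adaptive embedding e_Ada

    def forward(self, e_spa: torch.Tensor, e_sem: torch.Tensor, e_sty: torch.Tensor) -> torch.Tensor:
        # each prompt: (B, N_i, embed_dim); the weights broadcast over batch and tokens
        b = e_spa.shape[0]
        return torch.cat(
            [self.w[0] * e_spa, self.w[1] * e_sem, self.w[2] * e_sty, self.e_ada.expand(b, -1, -1)],
            dim=1,
        )
```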

Figure 4: One-stage and Two-stage training strategies of the proposed large-small model paradigm for collaborative optimization.

3.5 Optimization

Two training strategies are proposed to explore suitable joint training of models with diverse architectures, as illustrated in Fig. 4.

3.5.1 One-stage Training.

We introduce a straightforward one-stage training strategy, as depicted in Fig. 4 (a). The image encoder of the large model is frozen and the remaining parts are optimized together. We employ a combination of the focal loss $\mathcal{L}_{\mathrm{focal}}$ [19], cross-entropy loss $\mathcal{L}_{\mathrm{ce}}$, IoU loss $\mathcal{L}_{\mathrm{iou}}$ and the prototype loss $\mathcal{L}_{\mathrm{proto}}$ [42] for the large model:

\mathcal{L}_{\mathrm{large}}=\mathcal{L}_{\mathrm{focal}}+\mathcal{L}_{\mathrm{ce}}+\mathcal{L}_{\mathrm{iou}}+\mathcal{L}_{\mathrm{proto}}. \qquad (7)

The total loss is given as:

\mathcal{L}_{\mathrm{total}}=\mathcal{L}_{\mathrm{large}}+\lambda\mathcal{L}_{\mathrm{small}}, \qquad (8)

where $\mathcal{L}_{\mathrm{small}}$ is the original loss of the small model (the loss function depends on the specific model; Mask R-CNN [10], U2Net [26] and SINet [6] are tested in our experiments) and $\lambda$ is a hyper-parameter.
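For reference, a sketch of the one-stage objective (Eqs. 7-8) is given below using standard formulations of the focal and soft-IoU losses; these are not necessarily the exact variants used in the implementation, and the small-model and prototype losses are passed in as precomputed scalars.

```python
import torch
import torch.nn.functional as F


def focal_loss(logits, target, alpha: float = 0.25, gamma: float = 2.0):
    """Standard binary focal loss on per-pixel logits."""
    ce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * target + (1 - p) * (1 - target)
    alpha_t = alpha * target + (1 - alpha) * (1 - target)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()


def soft_iou_loss(logits, target, eps: float = 1e-6):
    """1 - soft IoU between the predicted probability map and the ground-truth mask."""
    p = torch.sigmoid(logits)
    inter = (p * target).sum(dim=(-2, -1))
    union = (p + target - p * target).sum(dim=(-2, -1))
    return (1.0 - (inter + eps) / (union + eps)).mean()


def one_stage_loss(large_logits, target, proto_loss, small_loss, lam: float = 1.0):
    """Eqs. 7-8: L_total = (focal + ce + iou + proto) + lambda * L_small."""
    l_large = (focal_loss(large_logits, target)
               + F.binary_cross_entropy_with_logits(large_logits, target)
               + soft_iou_loss(large_logits, target)
               + proto_loss)
    return l_large + lam * small_loss
```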

3.5.2 Two-stage Training.

A two-stage training strategy is provided to mitigate issues related to synchronization difficulties and gradient conflicts that may arise during the joint optimization of large-small models, as shown in Fig. 4 (b).

In the first stage, the Triple-S prompt adapter module, the dynamic prompt combiner and the prompt encoder are not involved. The image encoder remains frozen, while the remaining modules of the large model and the small model are independently optimized by their own loss functions, i.e., $\mathcal{L}_{\mathrm{large}}^{\mathrm{s1}}$ and $\mathcal{L}_{\mathrm{small}}^{\mathrm{s1}}$, respectively. The loss of the small model is the same as $\mathcal{L}_{\mathrm{small}}$ in Eq. 8. The training loss of the large model for the first stage is defined as:

\mathcal{L}_{\mathrm{large}}^{\mathrm{s1}}=\mathcal{L}_{\mathrm{focal}}+\mathcal{L}_{\mathrm{ce}}+\mathcal{L}_{\mathrm{iou}}. \qquad (9)

In the second stage, we load the parameters of the modules trained in the first stage (the small model, HE-Adapt and the mask decoder), while integrating the modules that were not considered previously (TSP-Adapt, DPC and the prompt encoder) for training. With the parameters of the image encoder, HE-Adapt and the small model fixed, the optimization objective for the second stage is as follows:

\mathcal{L}_{\mathrm{total}}^{\mathrm{s2}}=\mathcal{L}_{\mathrm{focal}}+\mathcal{L}_{\mathrm{ce}}+\mathcal{L}_{\mathrm{iou}}+\mathcal{L}_{\mathrm{proto}}. \qquad (10)
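A minimal sketch of the stage-two parameter freezing is given below; the attribute names (image_encoder, he_adapter) are placeholders for the corresponding modules.

```python
def configure_stage_two(large_model, small_model):
    """Freeze what stage two keeps fixed and return the parameters that remain trainable."""
    for p in small_model.parameters():                   # small model is fixed
        p.requires_grad_(False)
    for p in large_model.image_encoder.parameters():     # SAM image encoder stays frozen
        p.requires_grad_(False)
    for p in large_model.he_adapter.parameters():        # HE-Adapt is loaded from stage one and fixed
        p.requires_grad_(False)
    # TSP-Adapt, DPC, the prompt encoder and the mask decoder remain trainable
    return [p for p in large_model.parameters() if p.requires_grad]
```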

4 Experiments

4.1 Experimental Setup

4.1.1 Datasets.

To advance the urban waterlogging detection challenge, we develop the UW-Bench dataset containing a total of 7,677 images from various scenarios, including waterlogging scenes, dry roads, and hard cases such as slippery roads and nighttime roads. Using keywords such as urban waterlogging, waterlogged roads, and monitoring viewpoints, we crawled and filtered relevant images from surveillance videos and handheld cameras. The training set includes 5,584 images, while the test set, provided by Huawei Inc., contains 2,093 images of urban scenes captured by surveillance cameras only. For the test set, we consider general-sample and hard-sample cases. Examples from the training and test sets of our UW-Bench are shown in Fig. 5, which indicates the difficulty of detecting waterlogging. In the labeling phase, we use EasyData to annotate the dataset with masks. The pixel-level annotation process is divided into several stages: training, annotation, validation, and correction. We first create some annotation samples and train the annotators to understand the annotation standard. We also assign an inspector to verify the mask annotations. For failed annotations, the inspector gives an explanation and feedback to each annotator to further improve the annotation quality. The overall annotation process ensures the accuracy and reliability of the masks in waterlogging regions.

Figure 5: Training and testing examples in the developed UW-Bench. To objectively evaluate the capability of the model in real-world applications, we consider both general-sample and hard-sample cases in the test set.

4.1.2 Evaluation Metrics.

Waterlogging detection can be viewed as a pixel-level binary classification task for segmentation, where the waterlogged region is the foreground of interest. Based on the ground-truth masks and the predicted waterlogging masks, we adopt the commonly used segmentation metrics Precision, Recall, F1-score, and Intersection over Union (IoU) to evaluate detection performance.
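These metrics can be computed from binary masks as follows; this is a per-image sketch with pixel-level aggregation, which may differ from the exact aggregation used over the test set.

```python
import torch


def segmentation_metrics(pred: torch.Tensor, gt: torch.Tensor, eps: float = 1e-6):
    """Precision, Recall, F1 and IoU for binary waterlogging masks (values in {0, 1})."""
    pred, gt = pred.float(), gt.float()
    tp = (pred * gt).sum()                # true positive pixels
    fp = (pred * (1 - gt)).sum()          # false positives
    fn = ((1 - pred) * gt).sum()          # false negatives
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    iou = tp / (tp + fp + fn + eps)
    return precision.item(), recall.item(), f1.item(), iou.item()
```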

4.1.3 Implementation Details.

For the large model, we choose the ViT-B version of the pre-trained SAM as the backbone, and the input image is resized to 1024×1024. In the training phase, the input images are randomly flipped horizontally, and the batch size is set to 2. The AdamW optimizer is used with an initial learning rate of 0.0005, and cosine annealing decay is applied. In the testing phase, the final binary mask is obtained by a simple thresholding operation with the threshold set to 0.5. To evaluate our approach more comprehensively, we choose the classic Mask R-CNN [10] for semantic segmentation, U2Net [26] for salient object detection and SINet [6] for camouflaged object detection as our task-specific small models, though the framework is not limited to these. All experiments are implemented in PyTorch on an NVIDIA Tesla V100S GPU (32 GB memory). See the appendix for more implementation details.

Table 1: Comparison with existing methods on the proposed UW-Bench. SAM-Adapter denotes training without prompts. The subscripts M, U, S denote Mask R-CNN, U2Net and SINet, respectively, used as the small model to provide spatial prompts. Numbers in bold indicate the best results.
Test Set UW-all UW-hard
Method Precision  Recall  F1-Score   IoU Precision  Recall  F1-Score   IoU
UNet [28] 54.77 45.58 49.75 33.12 61.64 30.40 40.72 25.56
DeeplabV3 [4] 74.10 47.17 57.64 40.50 69.56 36.04 47.48 31.13
SETR[41] 85.20 54.01 66.11 49.37 81.71 44.41 57.54 40.39
Segformer[36] 86.63 60.11 70.98 55.01 81.22 49.81 61.75 44.67
Mask R-CNN [10] 58.34 51.86 54.91 37.84 69.06 42.15 52.35 35.46
U2Net [26] 78.56 49.86 61.00 43.89 77.28 39.89 52.62 35.70
SINet[6] 80.09 59.02 67.96 51.47 77.69 52.00 62.30 45.24
SAM-Adapter [5] 72.13 63.77 67.69 51.16 69.70 58.36 63.53 46.55
SAM-AdapterM 79.34 60.94 68.93 52.60 85.04 58.49 69.31 53.03
SAM-AdapterU 80.63 60.87 69.37 53.11 77.63 35.69 48.90 32.36
SAM-AdapterS 84.52 61.25 71.03 55.07 81.43 54.57 65.35 48.53
LSM-AdapterM 71.20 75.30 73.19 57.73 73.39 74.16 73.77 58.45
LSM-AdapterU 74.99 72.56 73.75 58.42 75.02 70.85 72.88 57.32
LSM-AdapterS 79.47 70.57 74.76 59.69 79.19 67.29 72.76 57.18

4.2 Experimental Results

We evaluate the performance of the proposed LSM-Adapter on our developed UW-Bench with two test sets: UW-all and UW-hard (a challenging subset of hard samples). We compare with representative segmentation models, including UNet [28], DeeplabV3 [4], SETR [41], Segformer [36], Mask R-CNN [10], U2Net [26] and SINet [6], as well as SAM-Adapter [5], a large model based on SAM. In the experiments, we adopt the two-stage training strategy in our LSM-Adapter and select the mask as the output type of the spatial prompter (experiments on different training strategies and spatial prompt types are discussed in Section 4.3). Additionally, SAM-Adapter uses a default prompt embedding as one of the dual inputs of the mask decoder and omits the prompt encoder. For a fair comparison, we also feed prompts generated by the three small models into SAM-Adapter, following the same setting.

The quantitative comparisons are tabulated in Tab. 1. From the results, we observe that our proposed method achieves state-of-the-art performance on both test sets and significantly outperforms existing methods, particularly in Recall, F1-score and IoU under different small models. Specifically, LSM-AdapterM, LSM-AdapterU, and LSM-AdapterS demonstrate increments of 6.8% to 18.28% in F1-score and 8.22% to 19.89% in IoU compared with their respective small models. In particular, LSM-AdapterM exhibits an increment of 19.89% over Mask R-CNN in IoU, indicating that small models with inferior standalone performance can realize more pronounced improvements when co-trained with the large model. Compared to the large model, i.e., SAM-Adapter, our approach improves by 7.45% to 9.02% in F1-score and 6.57% to 8.53% in IoU. Moreover, the competitive small model, i.e., SINet, yields an even greater gain in overall performance when integrated with the large model.

For qualitative analysis, we illustrate the waterlogging segmentation results on several general and hard test samples in Fig. 1. Evidently, the predicted masks of LSM-Adapter better approach the ground truth, further demonstrating its superiority over other methods. We further exploit precision-recall (PR) curves to compare different methods. Fig. 6 illustrates the PR curves of our methods and other existing methods, where each subplot corresponds to a different small model. In each subplot, the PR curve of our method is closer to the top-right corner, exhibiting better performance than existing CNN-based segmentation models and the Transformer-based SAM-Adapter.

Figure 6: Precision-Recall curves of our models and other existing methods. M, B, P denote that the mask, box and point are used as the spatial prompt, respectively. M, U, S denote that Mask R-CNN (a), U2Net (b) and SINet (c) are used as the small model, respectively.
Table 2: Effects of different training strategies and spatial prompts. 1-S and 2-S denotes the one-stage and two-stage training strategy, respectively, as discussed in Section 3.5. The spatial prompts include mask, box and point. For each model, the numbers in bold mean the best results across the same training strategy.
Method Train Prompt Precision Recall F1-Score IoU
LSM-AdapterM 1-S Mask 73.06 53.71 61.91 44.86
Box 77.80 51.01 61.62 44.53
Point 68.11 47.04 55.65 38.55
2-S Mask 71.20 75.30 73.19 57.73
Box 73.60 69.46 71.47 55.61
Point 74.73 68.85 71.67 55.85
LSM-AdapterU 1-S Mask 66.84 66.34 66.59 49.94
Box 72.21 63.75 69.01 52.68
Point 71.65 54.96 62.20 45.14
2-S Mask 74.99 72.56 73.75 58.42
Box 78.98 69.22 73.78 58.45
Point 77.51 67.36 72.08 56.35
LSM-AdapterS 1-S Mask 63.88 62.58 63.22 46.22
Box 61.84 55.27 58.37 41.22
Point 63.60 60.58 62.05 44.98
2-S Mask 79.47 70.57 74.76 59.69
Box 75.75 72.21 73.94 58.65
Point 78.00 69.85 73.70 58.35

4.3 Discussion on Training Strategies and Spatial Prompts

We explore the impact of employing different training strategies and spatial prompts on model performance. The results are presented in Tab. 2.

4.3.1 Results based on different training strategies.

The proposed LSM-Adapter with the one-stage training strategy is significantly inferior to the two-stage training strategy when comparing their best performances, demonstrating that the proposed two-stage training strategy is more stable for adaptation. We posit that the following factors may impede the effective implementation of the one-stage training strategy. During the early stages of training, the predicted output of the small model has low accuracy, which adds complexity to the training of the large model and results in slow convergence. Concurrently, the joint training of two networks with distinct architectures is highly contingent upon the selection of suitable hyper-parameters to achieve a synchronized optimization process; otherwise, the optimization objectives may conflict with each other, preventing the joint model from attaining optimal performance.

4.3.2 Results based on different spatial prompts.

Under identical conditions concerning the small model and training strategy, we compare three types of spatial prompts. Except for LSM-AdapterU, the performance using the mask as the spatial prompt consistently surpasses that of the other two types (box and point) in all scenarios. Although the box prompt is the best for LSM-AdapterU, its gap from the mask prompt is very small, a mere 0.03% for both F1-score and IoU under the two-stage training strategy. Moreover, the performance of models employing mask prompts predominantly exceeds that of both the large models and their respective small models. A possible explanation is that, in comparison to the sparse box and point prompts, mask prompts furnish more abundant referential information.

Table 3: Ablation studies for the innovative components. The first row (no components) corresponds to the SAM-Adapter baseline.
HE-Adapt SpaP SemP StyP DPC    LSM-AdapterU    LSM-AdapterS
F1-Score IoU F1-Score IoU
–  –  –  –  –    65.74  51.16    65.74  51.16
✓  –  –  –  –    71.65  55.83    71.65  55.83
✓  ✓  –  –  –    71.97  56.22    73.95  58.67
✓  ✓  ✓  –  –    72.16  56.45    73.98  58.71
✓  ✓  ✓  ✓  –    72.72  57.14    74.63  59.52
✓  ✓  ✓  ✓  ✓    73.75  58.42    74.76  59.69

4.4 Ablation Study

To verify the effectiveness of each module in our proposed framework, we adopt the two-stage training strategy, select the mask as the spatial prompt, and use LSM-AdapterU and LSM-AdapterS for the ablation experiments. SAM-Adapter is the baseline model, i.e., without any of our proposed modules. Experimental results are shown in Tab. 3. Adding the histogram equalization adapter (HE-Adapt) improves the F1-score and IoU, which proves the effect of HE-Adapt for image encoder adaptation. Additionally, as the spatial prompt (SpaP), semantic prompt (SemP), and style prompt (StyP) are added gradually, the performance improves accordingly, which proves the effectiveness of the Triple-S prompt adapter (TSP-Adapt) for mask decoder adaptation. On this basis, exploiting the dynamic prompt combiner (DPC) achieves the optimal performance, which proves that the proposed DPC can effectively combine different prompts via a dynamic optimization strategy and mitigate the potential bias implied by an individual prompt.

5 Conclusion and Outlook

In this paper, we pioneer the use of a large vision model (i.e., SAM) for a challenging downstream task, i.e., urban waterlogging detection, advancing its real-world application. Considering the generic segmentation capability of the large model and the task-specificity of the small model, we propose a large-small model co-adapter paradigm following a win-win mechanism. To address the data scarcity of real-world waterlogging detection, we further contribute a large benchmark to advance this field fundamentally, on which more powerful algorithms can be developed. In the experiments, we provide new perspectives on the training strategy of large-small model collaboration, owing to their architectural differences. This paper sheds light on the possibility of large models adapting to challenging downstream tasks with congenital data scarcity and adverse conditions.

In future work, we expect to further enrich the benchmark to facilitate the pre-training and fine-tuning of large models. We hope our proposed large-small model paradigm and perspectives can inspire future work, particularly for downstream tasks with limited resources.

Acknowledgements

This work was partially supported by National Key R&D Program of China (2021YFB3100800), National Natural Science Fund of China (62271090, 61771079), Chongqing Natural Science Fund (cstc2021jcyj-jqX0023) and National Youth Talent Project. This work is also supported by Huawei computational power of Chongqing Artificial Intelligence Innovation Center.

References

  • [1] Basha, E.A., Ravela, S., Rus, D.: Model-based monitoring for early warning flood detection. In: Proceedings of the 6th ACM conference on Embedded network sensor systems. pp. 295–308 (2008)
  • [2] Chen, K., Liu, C., Chen, H., Zhang, H., Li, W., Zou, Z., Shi, Z.: Rsprompter: Learning to prompt for remote sensing instance segmentation based on visual foundation model. IEEE Transactions on Geoscience and Remote Sensing (2024)
  • [3] Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence 40(4), 834–848 (2017)
  • [4] Chen, L.C., et al.: Rethinking atrous convolution for semantic image segmentation. arXiv (2017)
  • [5] Chen, T., Zhu, L., Deng, C., Cao, R., Wang, Y., Zhang, S., Li, Z., Sun, L., Zang, Y., Mao, P.: Sam-adapter: Adapting segment anything in underperformed scenes. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3367–3375 (2023)
  • [6] Fan, D.P., Ji, G.P., Cheng, M.M., Shao, L.: Concealed object detection. IEEE transactions on pattern analysis and machine intelligence 44(10), 6024–6042 (2021)
  • [7] Geetha, M., Manoj, M., Sarika, A., Mohan, M., Rao, S.N.: Detection and estimation of the extent of flood from crowd sourced images. In: 2017 international conference on communication and signal processing (ICCSP). pp. 0603–0608. IEEE (2017)
  • [8] Guo, A., Fei, G., Pasupuletic, H., Wang, J.: Clicksam: Fine-tuning segment anything model using click prompts for ultrasound image segmentation. arXiv preprint arXiv:2402.05902 (2024)
  • [9] Han, D., Zhang, C., Qiao, Y., Qamar, M., Jung, Y., Lee, S., Bae, S.H., Hong, C.S.: Segment anything model (sam) meets glass: Mirror and transparent objects cannot be easily detected. arXiv preprint arXiv:2305.00278 (2023)
  • [10] He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask r-cnn. In: Proceedings of the IEEE international conference on computer vision. pp. 2961–2969 (2017)
  • [11] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021)
  • [12] Hu, M., Li, Y., Yang, X.: Skinsam: Empowering skin cancer segmentation with segment anything model. arXiv preprint arXiv:2304.13973 (2023)
  • [13] Jiang, J., Qin, C.Z., Yu, J., Cheng, C., Liu, J., Huang, J.: Obtaining urban waterlogging depths from video images using synthetic image data. Remote Sensing 12(6),  1014 (2020)
  • [14] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
  • [15] Klemas, V.: Remote sensing of floods and flood-prone areas: An overview. Journal of Coastal Research 31(4), 1005–1013 (2015)
  • [16] Le, T.N., Nguyen, T.V., Nie, Z., Tran, M.T., Sugimoto, A.: Anabranch network for camouflaged object segmentation. Computer vision and image understanding 184, 45–56 (2019)
  • [17] Li, W., Zhu, H., Feng, X., Li, F.: Semantic segmentation-based algorithm for urban road waterlogging disaster detection. In: Proceedings of the 2021 5th International Conference on Video and Image Processing. pp. 104–110 (2021)
  • [18] Li, Y., Hu, M., Yang, X.: Polyp-sam: Transfer sam for polyp segmentation. arXiv preprint arXiv:2305.00293 (2023)
  • [19] Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE international conference on computer vision. pp. 2980–2988 (2017)
  • [20] Ma, J., He, Y., Li, F., Han, L., You, C., Wang, B.: Segment anything in medical images. Nature Communications 15(1),  654 (2024)
  • [21] Mettes, P., Tan, R.T., Veltkamp, R.: On the segmentation and classification of water in videos. In: 2014 International Conference on Computer Vision Theory and Applications (VISAPP). vol. 1, pp. 283–292. IEEE (2014)
  • [22] Muhadi, N.A., Abdullah, A.F., Bejo, S.K., Mahadi, M.R., Mijic, A.: Deep learning semantic segmentation for water level estimation using surveillance camera. Applied Sciences 11(20),  9691 (2021)
  • [23] Na, S., Guo, Y., Jiang, F., Ma, H., Huang, J.: Segment any cell: A sam-based auto-prompting fine-tuning framework for nuclei segmentation. arXiv preprint arXiv:2401.13220 (2024)
  • [24] Pasi, A.A., Bhave, U.: Flood detection system using wireless sensor network. International Journal of Advanced Research in Computer Science and Software Engineering 5(2) (2015)
  • [25] Pu, X., Jia, H., Zheng, L., Wang, F., Xu, F.: Classwise-sam-adapter: Parameter efficient fine-tuning adapts segment anything to sar domain for semantic segmentation. arXiv preprint arXiv:2401.02326 (2024)
  • [26] Qin, X., Zhang, Z., Huang, C., Dehghan, M., Zaiane, O.R., Jagersand, M.: U2-net: Going deeper with nested u-structure for salient object detection. Pattern recognition 106, 107404 (2020)
  • [27] Robertson, N.M., Chan, T.: Aerial image segmentation for flood risk analysis. In: 2009 16th IEEE International Conference on Image Processing (ICIP). pp. 597–600. IEEE (2009)
  • [28] Ronneberger, O., et al.: U-net: Convolutional networks for biomedical image segmentation. In: MICCAI (2015)
  • [29] Sarp, S., Kuzlu, M., Cetin, M., Sazara, C., Guler, O.: Detecting floodwater on roadways from image data using mask-r-cnn. In: 2020 International Conference on INnovations in Intelligent SysTems and Applications (INISTA). pp. 1–6. IEEE (2020)
  • [30] Sazara, C., Cetin, M., Iftekharuddin, K.M.: Detecting floodwater on roadways from image data with handcrafted features and deep transfer learning. In: 2019 IEEE intelligent transportation systems conference (ITSC). pp. 804–809. IEEE (2019)
  • [31] Shaharabany, T., Dahan, A., Giryes, R., Wolf, L.: Autosam: Adapting sam to medical images by overloading the prompt encoder. arXiv preprint arXiv:2306.06370 (2023)
  • [32] Skurowski, P., Abdulameer, H., Błaszczyk, J., Depta, T., Kornacki, A., Kozieł, P.: Animal camouflage analysis: Chameleon database. Unpublished manuscript 2(6),  7 (2018)
  • [33] Tang, L., Xiao, H., Li, B.: Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:2304.04709 (2023)
  • [34] Tang, X., Wu, Z., Liu, W., Tian, J., Liu, L.: Exploring effective ways to increase reliable positive samples for machine learning-based urban waterlogging susceptibility assessments. Journal of Environmental Management 344, 118682 (2023)
  • [35] Wang, J., Li, X., Yang, J.: Stacked conditional generative adversarial networks for jointly learning shadow detection and shadow removal. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1788–1797 (2018)
  • [36] Xie, E., et al.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS (2021)
  • [37] Xue, F., Tian, J., Song, X., Yan, Y.: Urban waterlogging monitoring and early warning based on video images. International Journal of Embedded Systems 13(4), 380–386 (2020)
  • [38] Zhang, C., Puspitasari, F.D., Zheng, S., Li, C., Qiao, Y., Kang, T., Shan, X., Zhang, C., Qin, C., Rameau, F., et al.: A survey on segment anything model (sam): Vision foundation model meets prompt engineering. arXiv preprint arXiv:2306.06211 (2023)
  • [39] Zhang, C., Liu, L., Cui, Y., Huang, G., Lin, W., Yang, Y., Hu, Y.: A comprehensive survey on segment anything model for vision and beyond. arXiv preprint arXiv:2305.08196 (2023)
  • [40] Zhang, K., Liu, D.: Customized segment anything model for medical image segmentation. arXiv preprint arXiv:2304.13785 (2023)
  • [41] Zheng, S., et al.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR (2021)
  • [42] Zhou, T., Wang, W., Konukoglu, E., Van Gool, L.: Rethinking semantic segmentation: A prototype view. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2582–2593 (2022)

A. Additional details on the algorithm.

In Algorithm 1, we summarize the detailed training process of our proposed LSM-Adapter using a two-stage training strategy.

Algorithm 1 Training Procedure of LSM-Adapter with two-stage training strategy.

input: Training set $\mathcal{D}_{tr}$.
output: LSM-Adapter $\mathcal{M}$.

1:  /* Stage one */
2:  Freeze the image encoder of the large model $\mathcal{M}_{\mathrm{large}}$
3:  while the maximal iterations are not reached do
4:     $\{x_{i},y_{i}\}_{i=1}^{n}\in\mathcal{D}_{tr}$ // sample mini-batch
5:     Obtain the predictions of the small model $M_{\mathrm{small}}=\mathcal{M}_{\mathrm{small}}(x_{i})$
6:     Obtain the predictions of the large model $M_{\mathrm{large}}=\mathcal{M}_{\mathrm{large}}(x_{i})$
7:     Update $\mathcal{M}_{\mathrm{small}}$ and $\mathcal{M}_{\mathrm{large}}$ by minimizing the loss of the small model and Eq. 9, respectively
8:  end while
9:  /* Stage two */
10:  Load the well-trained $\mathcal{M}_{\mathrm{small}}$ and $\mathcal{M}_{\mathrm{large}}$
11:  Freeze $\mathcal{M}_{\mathrm{small}}$, and freeze the image encoder and HE-Adapt of $\mathcal{M}_{\mathrm{large}}$
12:  while the maximal iterations are not reached do
13:     $\{x_{i},y_{i}\}_{i=1}^{n}\in\mathcal{D}_{tr}$ // sample mini-batch
14:     $Z=\mathrm{LSM\_Encoder}(x_{i})$
15:     Obtain $\boldsymbol{e}_{\mathrm{Spa}}$ from the small model $\mathcal{M}_{\mathrm{small}}$ and the spatial prompter // spatial prompt
16:     Obtain $\boldsymbol{e}_{\mathrm{Sem}}$ by Eq. 1, Eq. 2 and Eq. 3 // semantic prompt
17:     Obtain $\boldsymbol{e}_{\mathrm{Sty}}$ by Eq. 4, Eq. 5 and a convolutional block // style prompt
18:     Obtain $\boldsymbol{e}_{\mathrm{P}}$ by Eq. 6 // final prompt
19:     Obtain the predictions $M=\mathrm{LSM\_Decoder}(Z,\boldsymbol{e}_{\mathrm{P}})$
20:     Update $\mathcal{M}$ by minimizing Eq. 10
21:  end while

B. Additional experimental results.

B.1 Additional implementation details.

For the one-stage training, the batch size is set to 2, and the large model and small model are optimized jointly using the AdamW optimizer. The learning rate of the large model is set to 0.0005. For the three small models, i.e., Mask R-CNN [10], U2Net [26], and SINet [6], the learning rate is set to 0.0005, 0.001, and 0.0005, respectively. Cosine annealing decay is applied, and the number of epochs is set to 40.

For the two-stage training, the large and small models are first optimized individually. The training settings for the large model are the same as in the one-stage training. Mask R-CNN [10] is trained using the SGD optimizer with a learning rate of 0.001, a batch size of 8, and 40 epochs. U2Net [26] is trained using the Adam optimizer with a learning rate of 0.001, a batch size of 8, and 40 epochs. SINet [6] is trained using the Adam optimizer with a learning rate of 0.0001, a batch size of 16, and 100 epochs. In the second stage, the number of epochs is set to 20, and the remaining settings for the large model are the same as in the first stage.

B.2 Model efficiency.

We evaluate the efficiency of LSM-AdapterS, including the parameter count and per-image inference time, in Tab. 4. Due to the introduction of the small model, LSM-AdapterS has a larger overall parameter count than SAM-Adapter [5]. However, the trainable parameters in the second training stage increase by only 1M, and the inference time per image is comparable to that of SAM-AdapterS.

Table 4: Model efficiency.
Method Params (train / total) Inference time
SAM-Adapter 4.1M / 93.8M 0.2088s
SAM-AdapterS 4.1M / 120.8M 0.3034s
LSM-AdapterS 5.1M / 121.8M 0.3273s
Table 5: Ablation of $r$.
$r$ Precision Recall F1-Score IoU
0.25 71.59 55.00 62.21 45.15
0.50 79.46 65.29 71.68 55.86
1.00 79.47 70.57 74.76 59.69

B.3 Additional ablation study.

B.3.1 Effect of the scale of the dataset.

To investigate the impact of the scale of the annotated dataset, we randomly select training data with and without waterlogging according to a ratio $r$ and then merge them to form a subset for training. The ratio $r$ is set to 1, 0.5, and 0.25, respectively, where a value of 1 indicates training with the original fully annotated data. Tab. 5 shows that models trained with more annotated data perform better on the downstream task, indicating that although the foundation model possesses powerful generalization capabilities, additional annotated data are still needed to help the model adapt to downstream tasks.

Table 6: Ablation of $\tau$.
$\tau$ Precision Recall F1-Score IoU
0.3 81.50 66.77 73.40 57.98
0.4 80.55 68.66 74.13 58.89
0.5 78.00 69.85 73.70 58.35
0.6 78.73 69.38 73.76 58.42
0.7 80.18 67.92 73.54 58.15
Table 7: Ablation of $\lambda$.
$\lambda$ Precision Recall F1-Score IoU
10 52.00 68.15 58.99 41.83
1 63.88 62.58 63.22 46.22
0.1 63.22 59.76 61.44 44.34
0.01 70.50 51.42 59.47 42.31

B.3.2 Effect of hyper-parameter selection.

We analyze the impact of two hyper-parameters on performance: the threshold $\tau$ for the point-type spatial prompt and the coefficient $\lambda$ in the total loss (Eq. 8) of the one-stage training. As the threshold $\tau$ varies from 0.3 to 0.7, the overall performance of the model changes minimally, reaching the optimum at a threshold of 0.4. The choice of the coefficient $\lambda$ significantly affects the model's performance. When SINet [6] is used as the small model and $\lambda$ is set to 1, the overall performance of LSM-AdapterS is optimal; for other values, both the F1-score and IoU decline noticeably. We therefore recommend careful selection of this coefficient when choosing other models as the task-specific small model. Generally, the coefficient should be chosen to keep the loss values of the large and small models within the same order of magnitude.

B.4 Experiments on additional datasets.

We use four additional datasets, namely CHAMELEON [32], CAMO [16] and COD10K [6] for camouflaged object detection and ISTD [35] for shadow detection, to evaluate the proposed method on more downstream tasks. Tab. 8 shows that LSM-AdapterS outperforms the other methods on all datasets.

Table 8: Experimental results on additional datasets.
Method CHAMELEON [32] CAMO [16] COD10K [6] ISTD [35]
$S_{\alpha}\uparrow$ $E_{\phi}\uparrow$ $F_{\beta}^{\omega}\uparrow$ $\mathrm{MAE}\downarrow$   $S_{\alpha}\uparrow$ $E_{\phi}\uparrow$ $F_{\beta}^{\omega}\uparrow$ $\mathrm{MAE}\downarrow$   $S_{\alpha}\uparrow$ $E_{\phi}\uparrow$ $F_{\beta}^{\omega}\uparrow$ $\mathrm{MAE}\downarrow$   $\mathrm{BER}\downarrow$
Mask RCNN [10] 0.771 0.798 0.659 0.050 0.680 0.676 0.515 0.107 0.714 0.698 0.521 0.044 4.19
U2Net [26] 0.830 0.877 0.699 0.059 0.642 0.690 0.488 0.140 0.688 0.747 0.457 0.076 3.65
SINet [6] 0.888 0.942 0.816 0.030 0.820 0.882 0.743 0.070 0.815 0.887 0.680 0.037 2.35
SAM-Adapter [5] 0.834 0.858 0.680 0.055 0.800 0.816 0.657 0.094 0.820 0.856 0.657 0.044 1.65
LSM-AdapterM 0.843 0.893 0.765 0.040 0.784 0.849 0.723 0.081 0.817 0.883 0.710 0.035 2.02
LSM-AdapterU 0.868 0.906 0.740 0.044 0.723 0.762 0.596 0.115 0.778 0.821 0.580 0.053 2.22
LSM-AdapterS 0.903 0.955 0.836 0.024 0.825 0.889 0.756 0.066 0.839 0.903 0.727 0.031 1.55

B.5 Additional qualitative results.

We provide more qualitative results to further demonstrate the effectiveness and superiority of the proposed LSM-Adapter. Fig. 7 illustrates the visual comparison results between LSM-Adapter and other existing methods, including Mask R-CNN [10], U2Net [26], SINet [6], and SAM-Adapter [5]. LSM-Adapter selects masks from three different small models as spatial prompts and utilizes the two-stage training strategy. It is evident that LSM-Adapter has the prediction results closest to the ground truth, whether compared with the small or large models, even in the case of challenging samples, such as the first, second, eighth, and ninth rows. Meanwhile, we observe that spatial prompts derived from the small model can compensate for the knowledge gap of the large model when the prediction output of the small model is superior, resulting in LSM-Adapter’s predictions that more closely align with the ground truth. Concurrently, even when the small model’s inferior predictive masks are employed as prompts, LSM-Adapter remains relatively unaffected, maintaining a prediction quality comparable to, if not slightly superior to, that of the standalone large model.

Figure 7: Visual comparison results of different models. M, U, S denote that Mask R-CNN, U2Net and SINet are used as the small model, respectively.