
Bilateral Reference for High-Resolution Dichotomous Image Segmentation

Peng Zheng1,4,5,6†   Dehong Gao2   Deng-Ping Fan1∗   Li Liu3
Jorma Laaksonen4   Wanli Ouyang5   Nicu Sebe6  
1 Nankai University, 2 Northwest Polytechnical University,
3 National University of Defense Technology, 4 Aalto University,
5 Shanghai AI Laboratory, 6 University of Trento
[email protected]
Abstract

We introduce a novel bilateral reference framework (BiRefNet) for high-resolution dichotomous image segmentation (DIS). It comprises two essential components: the localization module (LM) and the reconstruction module (RM) with our proposed bilateral reference (BiRef). LM aids in object localization using global semantic information. Within the RM, we utilize BiRef for the reconstruction process, where hierarchical patches of images provide the source reference, and gradient maps serve as the target reference. These components collaborate to generate the final predicted maps. We also introduce auxiliary gradient supervision to enhance the focus on regions with finer details. In addition, we outline practical training strategies tailored for DIS to improve map quality and the training process. To validate the general applicability of our approach, we conduct extensive experiments on four tasks to evince that BiRefNet exhibits remarkable performance, outperforming task-specific cutting-edge methods across all benchmarks. Our codes are publicly available at https://github.com/ZhengPeng7/BiRefNet.

Figure 1: Visual comparison between the results of our proposed BiRefNet and the latest state-of-the-art methods (e.g., IS-Net [37] and UDUN [34]) for high-resolution dichotomous image segmentation (DIS). Details of segmentation are zoomed in for better display.
† Peng Zheng finished the majority of this work while he was a visiting scholar at Nankai University.
∗ Corresponding author ([email protected]).

1 Introduction

Figure 2: Comparison between our proposed BiRefNet and other existing methods for HR segmentation tasks. (a) Common framework [38]; (b) Image pyramid as input [13, 56]; (c) Scaled images as inward reference [21, 27]; (d) BiRefNet: patches of original images at original scales as inward reference and gradient priors as outward reference. Enc = encoder, Dec = decoder.

With the advancement in high-resolution image acquisition, image segmentation technology has evolved from traditional coarse localization to achieving high-precision object segmentation. This task, whether it involves salient [12] or concealed object detection [11, 10], is referred to as high-resolution dichotomous image segmentation (DIS) [37] and has attracted widespread attention and use in the industry, e.g., by Samsung, Adobe, and Disney.

For the new DIS task, recent works have considered strategies such as intermediate supervision [37], frequency priors [58], and unite-divide-unite [34], and have achieved favorable results. Essentially, they either split the supervision at the feature level [34, 37] or introduce an additional prior [58] to enhance feature extraction. These strategies are, however, still insufficient to capture very fine features (see Fig. 1). We observe that fine and non-salient structures of objects are well reflected by gradient features obtained through derivative operations on the original image. However, when certain regions are highly similar to the background in color and texture, their gradient features may be too weak. For such cases, we further introduce ground-truth (GT) features for side supervision, allowing the framework to learn the characteristics of these regions. We refer to the combination of the image reference with the gradient and GT references as the bilateral reference.

We propose a novel progressive bilateral reference network, BiRefNet, to handle the high-resolution DIS task with separate localization and reconstruction modules. For the localization module, we extract hierarchical features from a vision transformer backbone, which are combined and squeezed to obtain coarse low-resolution predictions in the deep layers. For the reconstruction module, we design the inward and outward references that form the bilateral reference (BiRef), in which the source image and the gradient map are fed into the decoder at different stages. Instead of resizing the original images to lower-resolution versions to match the decoding features at each stage [21, 27], we keep the original resolution to preserve intact detail features in the inward reference and adaptively crop the images into patches compatible with the decoding features. In addition, we investigate and summarize practical strategies for training on high-resolution (HR) data, including long training and region-level losses for better segmentation of fine details, and multi-stage supervision to accelerate their learning.

Our main contributions are summarized as follows:

  1. We present a bilateral reference network (BiRefNet), which is a simple yet strong baseline to perform high-quality dichotomous image segmentation.

  2. We propose a bilateral reference module, which consists of an inward reference with source image guidance and an outward reference with gradient supervision. It shows great efficacy in reconstructing the predicted HR results.

  3. We explore and summarize various practical strategies tailored for DIS that easily improve performance, prediction quality, and convergence speed.

  4. The proposed BiRefNet demonstrates excellent performance and strong generalization, achieving state-of-the-art results not only on DIS5K but also on HRSOD and COD, with average $S_m$ [7] improvements of 6.8%, 2.0%, and 5.6%, respectively.

Figure 3: Pipeline of the proposed bilateral reference network (BiRefNet). BiRefNet mainly consists of the localization module (LM) and the reconstruction module (RM) with bilateral reference (BiRef) blocks. Please refer to Sec. 3.1 for details.

2 Related Works

2.1 High-Resolution Class-agnostic Segmentation

High-resolution class-agnostic segmentation has been a typical computer vision objective for decades, and many related tasks have been proposed and attracted much attention, such as dichotomous image segmentation (DIS) [37], high-resolution salient object detection (HRSOD) [53], and concealed object detection (COD) [10]. To provide standard HRSOD benchmarks, several typical HRSOD datasets (e.g., HRSOD [53], UHRSD [46], HRS10K [6]) and numerous approaches [21, 42, 53, 46] have been proposed. Zeng et al. [53] employed a global-local fusion of the multi-scale input in their network. Xie et al. [46] used cross-model grafting modules to process images at different scales from multiple backbones (lightweight [16] and heavy [30]). Pyramid blending was also used in [21] for a lower computational cost. Concealed objects are difficult to locate due to similar-looking surrounding distractors [9]. Therefore, image priors, such as frequency [57], boundary [40], and gradient [19], are used as auxiliary guidance to train COD models. Furthermore, a higher resolution has been found beneficial for detecting targets [19, 18, 17]. To produce more precise and fine-detail segmentation results, Yin et al. [50] employed progressive refinement with masked separable attention. Li et al. [27] incorporated the original images at different scales to aid in the refining process.

High-resolution DIS is a newly proposed task that focuses more on the complex, slender structures of target objects in high-resolution images, making it even more challenging. Qin et al. [37] proposed the DIS5K dataset and IS-Net with intermediate supervision to alleviate the loss of fine areas. In addition, Zhou et al. [58] embedded a frequency prior in their DIS network to capture more details. Pei et al. [34] applied a label decoupling strategy [45] to the DIS task and achieved competitive segmentation performance in the boundary areas of objects. Yu et al. [52] used patches of HR images to accelerate training in a more memory-efficient way. Unlike previous models that used compressed/resized images to enhance HR segmentation, we utilize intact HR images as supplementary information for better high-resolution predictions.

2.2 Progressive Refinement in Segmentation

In the image matting task, trimaps have been used as a pre-positioning technique for more precise segmentation results [26, 47]. In Fig. 2, we illustrate the architectures of relevant networks and compare their differences [38, 55, 37, 58, 34]. Many approaches have been built on the progressive refinement strategy. Yu et al. [51] used the predicted LR alpha matte as a guide for refined HR maps. In BASNet [35], the initial results are revised with an additional refiner network. CRM [39] continuously aligns the feature map with the refinement target to aggregate detailed features. In ICNet [56], the original images are downscaled and added to the decoder output at different stages for refinement. In ICEG [15], the generator and the detector iteratively evolve through interaction to obtain better COD segmentation results. In addition to images and GT, auxiliary information is also used in existing methods; for example, Tang et al. [41] cropped patches along the boundary to further refine them, and the LapSRN network [23, 24] for image super-resolution generates Laplacian pyramids to help reconstruct images at higher resolution. Although these methods successfully employ refinement to achieve better results, their models are not guided to focus on particular areas, which is a problem in DIS. Therefore, we introduce gradient supervision in our outward reference to make the features sensitive to areas with richer fine details.

3 Methodology

3.1 Overview

As shown in Fig. 2(d), our proposed BiRefNet is different from the previous DIS methods. On the one hand, our BiRefNet explicitly decomposes the DIS task on HR data into two modules, i.e., a localization module (LM) and a reconstruction module (RM). On the other hand, instead of directly adding the source images [13] or the priors [58] to the input, BiRefNet employs our proposed bilateral reference in the RM, making full use of the source images at the original scales and the gradient priors. The complete framework of our BiRefNet is illustrated in Fig. 3.

Figure 4: Pipeline of the proposed bilateral reference blocks. The source images at the original scale are combined with decoder features as the inward reference and fed into the reconstruction block, where deformable convolutions with hierarchical receptive fields are employed. The aggregated features are then used to predict the gradient maps in the outward reference. Gradient-aware features are then turned into the attention map to act on the original features.

3.2 Localization Module

For a batch of HR images $\mathcal{I} \in \mathbb{R}^{N\times 3\times H\times W}$ as input, the transformer encoder [30] extracts features at different stages, i.e., $\mathcal{F}_1^e, \mathcal{F}_2^e, \mathcal{F}_3^e, \mathcal{F}^e$, with resolutions $\{[\frac{H}{k}, \frac{W}{k}],\ k=4,8,16,32\}$. The features of the first three stages $\{\mathcal{F}_i^e\}_{i=1}^{3}$ are transferred to the corresponding decoder stages via lateral connections (1$\times$1 convolution layers). Meanwhile, they are stacked and concatenated in the last encoder block to generate $\mathcal{F}^e$.

The encoder output feature $\mathcal{F}^e$ is then fed into a classification module, where it passes through a global average pooling layer and a fully connected layer for classification over the category set $C$, yielding a better semantic representation for localization. The HR features are squeezed in the bottleneck. To simultaneously enlarge the receptive fields to cover large objects and focus on local features for high precision [49], which is important for the HR tasks involved, we employ an ASPP module [3] here for multi-context fusion. $\mathcal{F}^e$ is squeezed to $\mathcal{F}^d$ and transferred to the reconstruction module.
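A minimal PyTorch sketch of this localization path is given below. It assumes a generic four-stage backbone with Swin-L-like channel widths; the single-branch squeeze stands in for the ASPP block, and all channel sizes are illustrative placeholders rather than the exact implementation.

```python
import torch
import torch.nn as nn

class LocalizationModule(nn.Module):
    """Sketch of the LM: 1x1 lateral convs, an auxiliary classification head on the
    deepest feature, and a squeeze (ASPP stand-in) producing the decoder feature F^d."""

    def __init__(self, channels=(192, 384, 768, 1536), num_classes=219, d_channels=256):
        super().__init__()
        # Lateral 1x1 convolutions for the first three encoder stages.
        self.laterals = nn.ModuleList([nn.Conv2d(c, d_channels, 1) for c in channels[:3]])
        # Auxiliary classification head on the deepest feature F^e (supervised by CE loss).
        self.cls_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(channels[-1], num_classes))
        # Simplified stand-in for the ASPP block that squeezes F^e to F^d.
        self.squeeze = nn.Sequential(
            nn.Conv2d(channels[-1], d_channels, 3, padding=1),
            nn.BatchNorm2d(d_channels), nn.ReLU(inplace=True))

    def forward(self, enc_feats):
        f1, f2, f3, fe = enc_feats                      # strides 4, 8, 16, 32
        laterals = [lat(f) for lat, f in zip(self.laterals, (f1, f2, f3))]
        cls_logits = self.cls_head(fe)                  # semantic supervision
        fd = self.squeeze(fe)                           # F^d, handed to the RM
        return laterals, fd, cls_logits
```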

3.3 Reconstruction Module

The setting of the receptive field (RF) has been a challenge in HR segmentation. Small RFs provide inadequate context to locate the right target on a large background, whereas large RFs often result in insufficient feature extraction in detailed areas. To achieve a balance, we propose the reconstruction block (RB) in each BiRef block as a replacement for vanilla residual blocks. In the RB, we employ deformable convolutions [4] with hierarchical receptive fields (i.e., 1$\times$1, 3$\times$3, 7$\times$7) and an adaptive average pooling layer to extract features with RFs of various scales. The features extracted by the different RFs are concatenated as $\mathcal{F}_i^{\theta}$, followed by a 1$\times$1 convolution layer and a batch normalization layer to generate the output feature of the RB, $\mathcal{F}_i^{d'}$. In the reconstruction module, the squeezed feature $\mathcal{F}^d$ is fed into the first BiRef block as the feature $\mathcal{F}_3^d$. With $\mathcal{F}_3^l$, the first BiRef block predicts coarse maps, which are then reconstructed into higher-resolution versions through the following BiRef blocks. Following [29], the output feature of each BiRef block $\mathcal{F}_i^d$ is added to its lateral feature $\mathcal{F}_i^l$ from the LM at each stage, i.e., $\{\mathcal{F}_i^{d+} = \textit{Upsample}\uparrow(\mathcal{F}_i^d + \mathcal{F}_i^l),\ i = 3, 2, 1\}$. Meanwhile, all BiRef blocks generate intermediate predictions $\{\mathcal{M}_i\}_{i=3}^{1}$ under multi-stage supervision, with resolutions in ascending order. Finally, the last decoding feature $\mathcal{F}_1^{d+}$ is passed through a 1$\times$1 convolution layer to obtain the final predicted maps $\mathcal{M} \in \mathbb{R}^{N\times 1\times H\times W}$.
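As a rough illustration, the sketch below mirrors the multi-receptive-field design of the RB. Plain convolutions stand in for the deformable convolutions of the paper (offset prediction omitted), and channel sizes are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReconstructionBlock(nn.Module):
    """Sketch of the RB: parallel 1x1 / 3x3 / 7x7 branches plus a global pooling
    branch, concatenated (F^theta) and fused by a 1x1 conv + BN (-> F^{d'})."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, out_ch, 1)
        self.branch3 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.branch7 = nn.Conv2d(in_ch, out_ch, 7, padding=3)
        self.pool = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(in_ch, out_ch, 1))
        self.fuse = nn.Sequential(nn.Conv2d(4 * out_ch, out_ch, 1), nn.BatchNorm2d(out_ch))

    def forward(self, x):
        h, w = x.shape[-2:]
        pooled = F.interpolate(self.pool(x), size=(h, w), mode='bilinear', align_corners=False)
        f_theta = torch.cat([self.branch1(x), self.branch3(x), self.branch7(x), pooled], dim=1)
        return self.fuse(f_theta)          # F_i^{d'}
```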

3.4 Bilateral Reference

In DIS, HR training images are very important for deep models to learn details and perform highly accurate segmentation. However, most segmentation models follow previous works [38, 29] in designing the network as an encoder-decoder structure with down-sampling and up-sampling. Besides, due to the large input size, concentrating on the target objects becomes more challenging. To deal with these two main problems, we propose the bilateral reference, consisting of an inward reference (InRef) and an outward reference (OutRef), as illustrated in Fig. 4. The inward and outward references supplement HR information and draw attention to areas with dense details, respectively.

In InRef, the images $\mathcal{I}$ at the original high resolution are cropped into patches $\{\mathcal{P}_k\}_{k=1}^{N}$ whose size is consistent with the output features of the corresponding decoder stage. These patches are stacked with the original feature $\mathcal{F}_i^{d+}$ and fed into the RM. Existing methods with similar techniques either add $\mathcal{I}$ only at the last decoding stage [34] or resize $\mathcal{I}$ to make it compatible with the low-resolution features. Our inward reference avoids both problems through adaptive cropping and supplies the necessary HR information at every stage.
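A possible realization of this adaptive cropping is sketched below. It assumes the image side lengths are exact multiples of the decoder feature size; how the stacked tensor is consumed afterwards is up to the BiRef block and not shown.

```python
import torch

def inward_reference(image: torch.Tensor, feat: torch.Tensor) -> torch.Tensor:
    """Crop the full-resolution image into non-overlapping patches whose spatial
    size matches the decoder feature, then stack them with the feature along the
    channel dimension (a sketch of the inward reference)."""
    B, C, H, W = image.shape
    h, w = feat.shape[-2:]
    # (B, C, H//h, W//w, h, w): grid of h x w patches cut from the original image.
    patches = image.unfold(2, h, h).unfold(3, w, w)
    # Fold the patch grid into channels -> (B, C * (H//h) * (W//w), h, w).
    patches = patches.reshape(B, C * (H // h) * (W // w), h, w)
    return torch.cat([feat, patches], dim=1)

# Example: a 1024x1024 image and a 1/8-scale feature (128x128) yield 64 RGB patches,
# i.e., 192 extra channels stacked onto the decoder feature.
```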

In OutRef, we use gradient labels to draw more attention to areas with richer gradient information, which is essential for segmenting fine structures. First, we extract the gradient maps of the input images as $G_i^{gt}$. Meanwhile, $\mathcal{F}_i^{\theta}$ is used to generate the feature $\mathcal{F}_i^{G}$, which produces the predicted gradient maps $\hat{G}_i$. With this gradient supervision, $\mathcal{F}_i^{G}$ becomes sensitive to gradients. It then passes through a convolution and a sigmoid layer to generate the gradient-referring attention $\mathcal{A}_i^{G}$, which is multiplied by $\mathcal{F}_i^{d'}$ to generate the output of the BiRef block, $\mathcal{F}_{i-1}^{d}$.

Considering that the background may contain non-target noise with abundant gradient information, we apply a masking strategy to alleviate the influence of non-target areas. We perform morphological operations on the intermediate predictions $\mathcal{M}_i$ and use the dilated $\mathcal{M}_i$ as a mask. The mask is multiplied by the gradient map $G_i^{gt}$ to generate $G_i^{m}$, in which the gradients outside the masked area are removed.
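A sketch of the two label-side steps (gradient extraction and foreground masking) follows. The finite-difference operator, the 0.5 threshold, and the dilation kernel size are illustrative choices; the learned sigmoid attention $\mathcal{A}_i^{G}$ that re-weights $\mathcal{F}_i^{d'}$ is a 1$\times$1 convolution not shown here.

```python
import torch
import torch.nn.functional as F

def gradient_label(gray: torch.Tensor) -> torch.Tensor:
    """Finite-difference gradient magnitude of a (B, 1, H, W) grayscale image,
    used as the gradient label G^gt (a Sobel operator would also work)."""
    dx = F.pad(gray[..., :, 1:] - gray[..., :, :-1], (0, 1, 0, 0))
    dy = F.pad(gray[..., 1:, :] - gray[..., :-1, :], (0, 0, 0, 1))
    return torch.sqrt(dx ** 2 + dy ** 2)

def masked_gradient_label(grad_gt: torch.Tensor, pred: torch.Tensor, ksize: int = 7) -> torch.Tensor:
    """Dilate the intermediate prediction M_i (morphological dilation approximated
    by max-pooling) and keep only the gradients inside the dilated region -> G^m."""
    mask = F.max_pool2d((pred > 0.5).float(), ksize, stride=1, padding=ksize // 2)
    return grad_gt * mask
```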

3.5 Objective Function

In HR segmentation tasks, using only pixel-level supervision (BCE loss) usually results in the deterioration of detailed structural information in HR data. Inspired by the strong results of the hybrid loss in [35], we combine BCE, IoU, SSIM, and CE losses for supervision at the pixel, region, boundary, and semantic levels, respectively. The final objective function is a weighted combination of these losses and can be formulated as:

$\begin{aligned} L &= L_{\text{pixel}} + L_{\text{region}} + L_{\text{boundary}} + L_{\text{semantic}} \\ &= \lambda_1 L_{\text{BCE}} + \lambda_2 L_{\text{IoU}} + \lambda_3 L_{\text{SSIM}} + \lambda_4 L_{\text{CE}}, \end{aligned}$    (1)

where $\lambda_1$, $\lambda_2$, $\lambda_3$, and $\lambda_4$ are set to 30, 0.5, 10, and 5, respectively, to keep all losses at the same quantitative level at the beginning of training. Here, BCE, IoU, SSIM, and CE denote the binary cross-entropy, intersection over union, structural similarity index measure, and cross-entropy losses. The complete definitions of the losses are given below, followed by a sketch of the combined objective.

  • BCE loss: pixel-level supervision for the generation of binary maps:

    $L_{\text{BCE}} = -\sum_{(i,j)} \big[ G(i,j)\log(\mathcal{M}(i,j)) + (1 - G(i,j))\log(1 - \mathcal{M}(i,j)) \big],$    (2)

    where $G(i,j)$ and $\mathcal{M}(i,j)$ denote the values of the GT and predicted maps, respectively, at pixel $(i,j)$.

  • IoU loss: region-aware supervision for the enhancement of binary map predictions:

    $L_{\text{IoU}} = 1 - \dfrac{\sum_{i=1}^{H}\sum_{j=1}^{W} \mathcal{M}(i,j)\,G(i,j)}{\sum_{i=1}^{H}\sum_{j=1}^{W} \big[\mathcal{M}(i,j) + G(i,j) - \mathcal{M}(i,j)\,G(i,j)\big]}.$    (3)
  • SSIM loss: boundary-aware supervision to improve accuracy in boundary regions. Given GT maps $G$ and predicted maps $\mathcal{M}$, $\mathbf{y}=\{y_j : j=1,\dots,N^2\}$ and $\mathbf{x}=\{x_j : j=1,\dots,N^2\}$ represent the pixel values of two corresponding $N\times N$ patches derived from $G$ and $\mathcal{M}$, respectively. The SSIM loss is then defined as:

    $L_{\text{SSIM}} = 1 - \dfrac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)},$    (4)

    where $\mu_x$, $\mu_y$ and $\sigma_x$, $\sigma_y$ are the means and standard deviations of $\mathbf{x}$ and $\mathbf{y}$, respectively, and $\sigma_{xy}$ is their covariance. $C_1$ and $C_2$ are small constants used to avoid division by zero.

  • CE loss: semantic-aware supervision, which is used to learn a better semantic representation:

    $L_{\text{CE}} = -\sum_{c=1}^{N} y_{o,c}\log(p_{o,c}),$    (5)

    where $N$ is the number of classes, $y_{o,c}$ indicates whether class label $c$ is the correct classification for observation $o$, and $p_{o,c}$ denotes the predicted probability that $o$ belongs to class $c$.
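For concreteness, a compact sketch of the combined objective of Eq. (1) is given below. It assumes the prediction is already a sigmoid probability map and uses mean reduction; the window-based SSIM uses simple average-pooling statistics rather than the Gaussian windows of [35], so the constants and window size are illustrative.

```python
import torch
import torch.nn.functional as F

def iou_loss(pred, gt, eps=1e-6):
    """Region-level IoU loss on probability maps (Eq. 3)."""
    inter = (pred * gt).sum(dim=(-2, -1))
    union = (pred + gt - pred * gt).sum(dim=(-2, -1))
    return (1 - inter / (union + eps)).mean()

def ssim_loss(pred, gt, C1=0.01 ** 2, C2=0.03 ** 2, win=11):
    """Single-scale SSIM loss over local windows (Eq. 4), simplified with avg-pooling."""
    mu_x = F.avg_pool2d(pred, win, 1, win // 2)
    mu_y = F.avg_pool2d(gt, win, 1, win // 2)
    sigma_x = F.avg_pool2d(pred * pred, win, 1, win // 2) - mu_x ** 2
    sigma_y = F.avg_pool2d(gt * gt, win, 1, win // 2) - mu_y ** 2
    sigma_xy = F.avg_pool2d(pred * gt, win, 1, win // 2) - mu_x * mu_y
    ssim = ((2 * mu_x * mu_y + C1) * (2 * sigma_xy + C2)) / \
           ((mu_x ** 2 + mu_y ** 2 + C1) * (sigma_x ** 2 + sigma_y ** 2 + C2))
    return (1 - ssim).mean()

def hybrid_loss(pred, gt, cls_logits, cls_labels, w=(30.0, 0.5, 10.0, 5.0)):
    """Weighted sum of BCE, IoU, SSIM, and CE losses (Eq. 1)."""
    l_bce = F.binary_cross_entropy(pred, gt)       # pixel level (mean-reduced)
    l_iou = iou_loss(pred, gt)                     # region level
    l_ssim = ssim_loss(pred, gt)                   # boundary level
    l_ce = F.cross_entropy(cls_logits, cls_labels) # semantic level
    return w[0] * l_bce + w[1] * l_iou + w[2] * l_ssim + w[3] * l_ce
```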

3.6 Training Strategies Tailored for DIS

Due to the high cost of training models on HR data, we have explored training tricks for HR segmentation tasks to improve performance and reduce training costs.

First, we found that our model converges relatively quickly in localizing targets and segmenting rough structures (measured by the F-measure [1] and S-measure [7]) on DIS5K (e.g., within 200 epochs). However, the performance in segmenting fine parts keeps increasing even after very long training (e.g., 400 epochs), which is reflected in metrics such as $F_\beta^\omega$ and $HCE_\gamma$. Second, although long training can achieve good results in terms of both structure and edges, it consumes too much computation; we found that multi-stage supervision can dramatically accelerate the learning of fine details and enables the model to achieve similar performance with only 30% of the training epochs. Third, we also found that fine-tuning with only region-level losses can easily improve the binarization of the predicted results and the metric scores (e.g., $F_\beta^\omega$, $E_\phi^m$, $HCE$) that are closer to practical use. Finally, we use context feature fusion and image pyramid inputs on the backbone, which are commonly used tricks for processing HR images with deep models. In our experiments, these two modifications to the backbone yield a general improvement on DIS and similar HR segmentation tasks.
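A minimal sketch of how such multi-stage supervision can be applied to the intermediate maps $\{\mathcal{M}_i\}$ is given below; equal weighting of the stages and bilinear upsampling to the GT size are assumptions of this sketch, not a description of the exact implementation.

```python
import torch
import torch.nn.functional as F

def multi_stage_loss(stage_logits, gt, loss_fn):
    """Sum the supervision loss over every intermediate prediction, e.g.
    [M_3, M_2, M_1, final map], after upsampling each to the GT resolution."""
    total = 0.0
    for logits in stage_logits:
        logits = F.interpolate(logits, size=gt.shape[-2:], mode='bilinear', align_corners=False)
        total = total + loss_fn(torch.sigmoid(logits), gt)
    return total

# Region-level fine-tuning can reuse the same loop with only a region loss
# (e.g., the IoU loss from the sketch above) as loss_fn.
```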

Tab. 1 shows the effect of the number of training epochs and of multi-stage supervision. As the results show, our BiRefNet achieves relatively good results after 200 epochs of training. Continuing training to 400 epochs slightly improves the metrics measuring structural information (e.g., $F_\beta^x$, $S_m$), while bringing a larger improvement in metrics measuring fine details (e.g., $HCE_\gamma$).

Although simple long training can achieve better results, the improvement is relatively small considering its high computational cost on HR data. We therefore investigated multi-stage supervision (MSS), a widely used training strategy in binary segmentation works [54, 37]. Unlike MSS in those works, which targets higher precision, here it serves to accelerate training convergence. As the results in Tab. 1 show, our BiRefNet trained for 200 epochs with MSS achieves performance similar to that of the model trained for 400 epochs. MSS thus cuts the training time in half and can be adopted in further HR segmentation tasks for more efficient training.

Table 1: Quantitative ablation studies of the proposed multi-stage supervision for acceleration and training epochs.
Settings DIS-VD
MSS Epoch $F_\beta^x$↑ $F_\beta^\omega$↑ $\mathcal{M}$↓ $S_m$↑ $E_\phi^m$↑ $HCE_\gamma$↓
✗ 200 .875 .848 .041 .886 .914 1207
✗ 400 .897 .863 .036 .905 .937 1039
✓ 200 .892 .858 .037 .901 .932 1043

4 Experiments

4.1 Datasets

Training Sets. For DIS, we follow [37, 58, 34] to use DIS5K-TR as our training set in experiments. For HRSOD, we follow [46] to set different combinations of HRSOD, UHRSD, and DUTS as the training set. For COD, we follow [10, 18] to use the concealed samples in CAMO-TR and COD10K-TR as the training set.

Test Sets. To obtain a complete evaluation of our BiRefNet, we tested it on all test sets in DIS5K (DIS-TE1, DIS-TE2, DIS-TE3, and DIS-TE4). We also conducted an evaluation of BiRefNet on the HRSOD test sets (DAVIS-S [53], HRSOD-TE [53], and UHRSD-TE [46]) and the COD test sets (CAMO-TE [25], COD10K-TE [9], and NC4K [31]). Low-resolution SOD test sets (DUTS-TE [44] and DUT-OMRON [48]) are additionally used for supplementary experiments.

4.2 Evaluation Protocol

For a comprehensive evaluation, we employ the widely used metrics, i.e., S-measure [7] ($S_m$), max/mean/weighted F-measure [1] ($F_\beta^x$/$F_\beta^m$/$F_\beta^\omega$), max/mean E-measure [8] ($E_\xi^x$/$E_\xi^m$), mean absolute error (MAE), and relaxed HCE [37] ($HCE_\gamma$). Detailed descriptions of these metrics are given below, followed by a small reference sketch of two of them.

  • S-measure [7] (structure measure, $S_\alpha$) measures the structural similarity between a saliency map and its corresponding GT map. It can be computed quickly and without binarization. $S_\alpha$ is computed as:

    $S_\alpha = \alpha \cdot S_o + (1-\alpha)\cdot S_r,$    (6)

    where $S_o$ and $S_r$ denote object-aware and region-aware structural similarity, and $\alpha$ is set to 0.5 by default, as suggested by Fan et al. in [7].

  • F-measure [1] ($F_\beta$) is the weighted harmonic mean of precision and recall. The predicted saliency map is binarized with different thresholds to obtain a set of binary predictions, which are compared with the GT maps to obtain precision and recall values. The F-measure is computed as:

    $F_\beta = \dfrac{(1+\beta^2)\cdot Precision \cdot Recall}{\beta^2 \cdot Precision + Recall},$    (7)

    where $\beta^2$ is set to 0.3 to emphasize precision over recall, following [2]. The maximum F-measure score obtained with the best threshold for the entire dataset is used and denoted as $F_\beta^x$.

  • E-measure [8] (enhanced-alignment measure, $E_\xi$) is designed as a perceptual metric to evaluate the similarity between the predicted maps and the GT maps both locally and globally. The E-measure is defined as:

    $E_\xi = \dfrac{1}{WH}\sum_{x=1}^{W}\sum_{y=1}^{H}\phi_\xi(x,y),$    (8)

    where $\phi_\xi$ denotes the enhanced alignment matrix. Similarly to the F-measure, we adopt both the maximum E-measure ($E_\xi^x$) and the mean E-measure ($E_\xi^m$) as evaluation metrics.

  • MAE (mean absolute error, $\epsilon$) is a simple pixel-level evaluation metric that measures the absolute difference between the non-binarized predicted results $\mathcal{M}$ and the GT maps $G$. It is defined as:

    $\epsilon = \dfrac{1}{WH}\sum_{x=1}^{W}\sum_{y=1}^{H}\big|\mathcal{M}(x,y) - G(x,y)\big|.$    (9)
  • $\mathbf{HCE_\gamma}$ [37] (human correction efforts, $HCE$) is a newly proposed metric that evaluates the human effort required to correct faulty predictions to meet specific accuracy requirements in real-world applications. Specifically, $HCE$ is quantified by the approximate number of mouse clicks. Since minor prediction errors can be tolerated in practical applications, the relaxed $HCE$ ($HCE_\gamma$) is introduced, where $\gamma$ denotes the tolerance. In experiments, we use $HCE_5$ (relaxed $HCE$ with $\gamma=5$) to stay consistent with the original paper [37], where detailed descriptions of $HCE_\gamma$ are provided.
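To make the protocol concrete, here is a small NumPy sketch of two of these metrics (MAE and the max F-measure). Sweeping the threshold per image, as done here, is a simplification of the dataset-level best threshold described above; both functions expect arrays in [0, 1] at the GT resolution.

```python
import numpy as np

def mae(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean absolute error between a non-binarized prediction and the GT (Eq. 9)."""
    return float(np.abs(pred - gt).mean())

def max_f_measure(pred: np.ndarray, gt: np.ndarray, beta2: float = 0.3, steps: int = 255) -> float:
    """Max F-measure: sweep binarization thresholds and keep the best score (Eq. 7)."""
    gt_bin = gt > 0.5
    best = 0.0
    for t in np.linspace(0.0, 1.0, steps):
        binarized = pred >= t
        tp = np.logical_and(binarized, gt_bin).sum()
        precision = tp / max(binarized.sum(), 1)
        recall = tp / max(gt_bin.sum(), 1)
        f = (1 + beta2) * precision * recall / max(beta2 * precision + recall, 1e-8)
        best = max(best, float(f))
    return best
```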

Table 2: Quantitative ablation studies of the components of the proposed BiRefNet, including the reconstruction module (RM), inward reference (InRef), outward reference (OutRef), and their combinations.
Modules DIS-VD
RM InRef OutRef $F_\beta^x$↑ $F_\beta^\omega$↑ $\mathcal{M}$↓ $S_m$↑ $E_\phi^m$↑ $HCE_\gamma$↓
.837 .785 .056 .845 .887 1204
.855 .831 .048 .865 .895 1167
.848 .825 .050 .857 .903 1152
.869 .834 .041 .886 .912 1093
.863 .831 .042 .891 .918 1106
.861 .839 .044 .881 .911 1114
.889 .851 .038 .900 .924 1065
Table 3: Effectiveness of practical strategies for training high-resolution segmentation. An experimental comparison of the proposed tricks for HR segmentation tasks is provided, including context feature fusion (CFF), image pyramid inputs (IPT), region-level loss fine-tuning (RLFT), and their combinations. The results are obtained with our final model.
Modules DIS-VD
CFF IPT RLFT $F_\beta^x$↑ $F_\beta^\omega$↑ $\mathcal{M}$↓ $S_m$↑ $E_\phi^m$↑ $HCE_\gamma$↓
.889 .851 .038 .900 .924 1065
.893 .856 .038 .904 .928 1054
.895 .857 .037 .904 .927 1051
.890 .861 .036 .899 .932 1043
.897 .863 .036 .905 .937 1039

4.3 Implementation Details

All images are resized to 1024$\times$1024 for training and testing. The generated segmentation maps are resized (via bilinear interpolation) to the original size of the corresponding GT maps for evaluation. Horizontal flip is the only data augmentation used during training. The number of categories $C$ is set to 219, as given in DIS-TR. The proposed BiRefNet is trained with the Adam optimizer [22] for 600/150/150 epochs on the DIS/HRSOD/COD tasks, respectively. The model is fine-tuned with the IoU loss for the last 20 epochs. The initial learning rate is set to $10^{-4}$ for DIS and $10^{-5}$ for the others. Models are trained with PyTorch [33] on eight NVIDIA A100 GPUs. The batch size is set to $N$=4 per GPU during training.
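The following snippet sketches this configuration; the convolution is only a placeholder for the real network, and in practice the resize/flip must be applied jointly to each image and its GT map.

```python
import torch
from torch import nn, optim
from torchvision import transforms

# Illustrative setup following Sec. 4.3 (placeholders, not the exact training script).
image_size = 1024
transform = transforms.Compose([
    transforms.Resize((image_size, image_size)),
    transforms.RandomHorizontalFlip(p=0.5),      # the only augmentation used
    transforms.ToTensor(),
])
model = nn.Conv2d(3, 1, 3, padding=1)            # placeholder for BiRefNet
optimizer = optim.Adam(model.parameters(), lr=1e-4)   # 1e-5 for HRSOD/COD
epochs, batch_per_gpu = 600, 4                   # DIS schedule; last 20 epochs: IoU-only fine-tuning
```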

4.4 Ablation Study

We study the effectiveness of each component (i.e., RM and BiRef) and practical strategies (i.e., CFF, IPT, and RLFT) introduced for our BiRefNet and conduct an investigation about their contributions to improved DIS results. Quantitative results regarding each module and strategy are shown in Tab. 2 and 3, respectively.

Figure 5: Quantitative comparisons of the proposed BiRefNet and the best task-specific models. S-measure [7] is used for the comparison here. UDUN [34], FSPNet [18], PGNet-UH [46], and PGNet-DH [46] are currently the best models for the DIS, COD, HRSOD, and SOD tasks, respectively.
[Figure 6 layout: rows DIS-TE1, DIS-TE2, DIS-TE3, DIS-TE4, DIS-VD; columns Image, GT, Ours, UDUN, IS-Net, U2Net, HRNet.]
Figure 6: Qualitative comparisons of the proposed BiRefNet and previous methods on the DIS5K benchmark. The results of the previous methods are from [34], where all models are trained with 1024×1024 images. Zoom in for a better view.
Table 4: Quantitative comparisons between our BiRefNet and the state-of-the-art methods on DIS5K. “↑” (“↓”) means that higher (lower) is better. We use the results from [34], where all methods take 1024×1024 input.
Methods DIS-TE1 (500) DIS-TE2 (500) DIS-TE3 (500)
Metrics per test set: $F_\beta^x$↑ $F_\beta^\omega$↑ $\mathcal{M}$↓ $S_m$↑ $E_\phi^m$↑ $HCE_\gamma$↓
BASNet19 [35] .663 .577 .105 .741 .756 155 .738 .653 .096 .781 .808 341 .790 .714 .080 .816 .848 681
U2Net20 [36] .701 .601 .085 .762 .783 165 .768 .676 .083 .798 .825 367 .813 .721 .073 .823 .856 738
HRNet20 [43] .668 .579 .088 .742 .797 262 .747 .664 .087 .784 .840 555 .784 .700 .080 .805 .869 1049
PGNet22 [46] .754 .680 .067 .800 .848 162 .807 .743 .065 .833 .880 375 .843 .785 .056 .844 .911 797
IS-Net22 [37] .740 .662 .074 .787 .820 149 .799 .728 .070 .823 .858 340 .830 .758 .064 .836 .883 687
FP-DIS23 [58] .784 .713 .060 .821 .860 160 .827 .767 .059 .845 .893 373 .868 .811 .049 .871 .922 780
UDUN23 [34] .784 .720 .059 .817 .864 140 .829 .768 .058 .843 .886 325 .865 .809 .050 .865 .917 658
BiRefNet .860 .819 .037 .885 .911 106 .894 .857 .036 .900 .930 266 .925 .893 .028 .919 .955 569
BiRefNetSwinB .857 .819 .038 .884 .912 110 .890 .854 .037 .898 .930 275 .919 .886 .030 .915 .953 597
BiRefNetSwinT .823 .774 .048 .855 .887 117 .862 .821 .046 .877 .912 290 .899 .860 .036 .897 .942 627
BiRefNetPVTv2b2 .839 .796 .042 .870 .903 111 .881 .842 .040 .888 .925 280 .903 .866 .036 .901 .941 614
Methods DIS-TE4 (500) DIS-TE (1-4) (2,000) DIS-VD (470)
Metrics per test set: $F_\beta^x$↑ $F_\beta^\omega$↑ $\mathcal{M}$↓ $S_m$↑ $E_\phi^m$↑ $HCE_\gamma$↓
BASNet19 [35] .785 .713 .087 .806 .844 2852 .744 .664 .092 .786 .814 1007 .737 .656 .094 .781 .809 1132
U2Net20 [36] .800 .707 .085 .814 .837 2898 .771 .676 .082 .799 .825 1042 .753 .656 .089 .785 .809 1139
HRNet20 [43] .772 .687 .092 .792 .854 3864 .743 .658 .087 .781 .840 1432 .726 .641 .095 .767 .824 1560
PGNet22 [46] .831 .774 .065 .841 .899 3361 .809 .746 .063 .830 .885 1173 .798 .733 .067 .824 .879 1326
IS-Net22 [37] .827 .753 .072 .830 .870 2888 .799 .726 .070 .819 .858 1016 .791 .717 .074 .813 .856 1116
FP-DIS23 [58] .846 .788 .061 .852 .906 3347 .831 .770 .047 .847 .895 1165 .823 .763 .062 .843 .891 1309
UDUN23 [34] .846 .792 .059 .849 .901 2785 .831 .772 .057 .844 .892 977 .823 .763 .059 .838 .892 1097
BiRefNet .904 .864 .039 .900 .939 2723 .896 .858 .035 .901 .934 916 .891 .854 .038 .898 .931 989
BiRefNetSwinB .899 .860 .040 .895 .938 2836 .891 .855 .036 .898 .933 954 .881 .844 .039 .890 .925 1029
BiRefNetSwinT .880 .834 .049 .878 .925 2888 .866 .822 .045 .877 .916 980 .862 .819 .045 .874 .917 1070
BiRefNetPVTv2b2 .890 .846 .045 .886 .929 2871 .878 .838 .041 .886 .925 969 .868 .827 .044 .880 .919 1073

Baseline. We provide a simple but strong encoder-decoder network as the baseline for the DIS task. To capture better hierarchical features at various scales, we chose the Swin Transformer large [30] as our default backbone network. Then, to obtain a better semantic representation for the DIS task, we divided the images in DIS-TR into 219 classes according to their label names and added an auxiliary classification head at the end of the encoder. In the decoder of the baseline network, each decoder block is made up of two residual blocks [16]. All stages of the encoder and decoder are connected with a 1×1 convolution, except the deepest stage, where an ASPP [3] block is used. With this setup, our baseline network already outperforms existing DIS models in most metrics, as shown in Tab. 2 and Tab. 4.

Reconstruction Module. As shown in Tab. 2, our model gains an overall improvement with the proposed RM. The RM provides multi-scale receptive fields on the HR features for both local details and overall semantics. It brings a ∼2.2% relative improvement in $F_\beta^x$ with little extra computational cost.

Bilateral Reference. We separately investigate the effectiveness of the inward reference (InRef, with source images) and the outward reference (OutRef, with gradient labels) in BiRef. InRef supplements lossless HR information globally, while OutRef draws more attention to the fine-detail parts to achieve higher precision in those areas. As shown in Tab. 2, they work jointly to bring a 2.9% relative improvement in $F_\beta^x$ to BiRefNet. RM and BiRef combined achieve a 6.2% relative improvement in $F_\beta^x$.

Training Strategies. As shown in Tab. 3, the proposed strategies improve performance from different perspectives. CFF and IPT improve overall performance, while RLFT specifically improves precision in edge details, which is reflected in metrics such as $F_\beta^\omega$ and $HCE_\gamma$.

4.5 State-of-the-Art Comparison

To validate the general applicability of our method, we conduct extensive experiments on four tasks, i.e., high-resolution dichotomous image segmentation (DIS), high-resolution salient object detection (HRSOD), concealed object detection (COD), and salient object detection (SOD). We compare our proposed BiRefNet with all the latest task-specific models on existing benchmarks [37, 53, 46, 44, 48, 25, 10, 31].

Table 5: Quantitative comparisons between our BiRefNet and the state-of-the-art methods on high-resolution and low-resolution SOD datasets. TR denotes the training set. To provide a fair comparison, we train our BiRefNet with different combinations of training sets, where 1, 2, and 3 represent DUTS [44], HRSOD [53], and UHRSD [46], respectively.
Test Sets High-Resolution Benchmarks Low-Resolution Benchmarks
Methods TR DAVIS-S (92) HRSOD-TE (400) UHRSD-TE (988) DUTS-TE (5,019) DUT-OMRON (5,168)
Metrics per test set: $S_m$↑ $F_\beta^x$↑ $E_\phi^m$↑ $\mathcal{M}$↓
LDF20 [45] 1 .922 .911 .947 .019 .904 .904 .919 .032 .888 .913 .891 .047 .892 .898 .910 .034 .838 .820 .873 .051
HRSOD19 [53] 1,2 .876 .899 .955 .026 .896 .905 .934 .030 - - - - .824 .835 .885 .050 .762 .743 .831 .065
DHQ21 [42] 1,2 .920 .938 .947 .012 .920 .922 .947 .022 .900 .911 .905 .039 .894 .900 .919 .031 .836 .820 .873 .045
PGNet22 [46] 1 .935 .936 .947 .015 .930 .931 .944 .021 .912 .931 .904 .037 .911 .917 .922 .027 .855 .835 .887 .045
PGNet22 [46] 1,2 .948 .950 .975 .012 .935 .937 .946 .020 .912 .935 .905 .036 .912 .919 .925 .028 .858 .835 .887 .046
PGNet22 [46] 2,3 .954 .957 .979 .010 .938 .945 .946 .020 .935 .949 .916 .026 .859 .871 .897 .038 .786 .772 .884 .058
BiRefNet 1 .967 .966 .984 .008 .957 .958 .972 .014 .931 .933 .943 .030 .939 .937 .958 .019 .868 .813 .878 .040
BiRefNet 1,2 .973 .976 .990 .006 .962 .963 .976 .011 .937 .942 .951 .024 .938 .935 .960 .018 .868 .818 .882 .040
BiRefNet 1,3 .975 .977 .989 .006 .959 .958 .972 .014 .952 .960 .965 .019 .942 .942 .961 .018 .881 .837 .896 .036
BiRefNet 2,3 .976 .980 .990 .006 .956 .953 .967 .016 .952 .958 .964 .019 .933 .928 .954 .020 .864 .810 .879 .040
BiRefNet 1,2,3 .975 .979 .989 .006 .962 .961 .973 .013 .957 .963 .969 .016 .944 .943 .962 .018 .882 .839 .896 .038
Table 6: Comparison of BiRefNet with recent methods on three COD benchmarks. As seen, BiRefNet performs much better than previous methods.
Methods CAMO (250) COD10K (2,026) NC4K (4,121)
Metrics per test set: $S_m$↑ $F_\beta^\omega$↑ $F_\beta^m$↑ $E_\phi^m$↑ $E_\phi^x$↑ $\mathcal{M}$↓
SINet20 [9] .751 .606 .675 .771 .831 .100 .771 .551 .634 .806 .868 .051 .808 .723 .769 .871 .883 .058
BGNet22 [40] .812 .749 .789 .870 .882 .073 .831 .722 .753 .901 .911 .033 .851 .788 .820 .907 .916 .044
SegMaR22 [20] .815 .753 .795 .874 .884 .071 .833 .724 .757 .899 .906 .034 .841 .781 .820 .896 .907 .046
ZoomNet22 [32] .820 .752 .794 .878 .892 .066 .838 .729 .766 .888 .911 .029 .853 .784 .818 .896 .912 .043
SINetv222 [10] .820 .743 .782 .882 .895 .070 .815 .680 .718 .887 .906 .037 .847 .770 .805 .903 .914 .048
FEDER23 [14] .802 .738 .781 .867 .873 .071 .822 .716 .751 .900 .905 .032 .847 .789 .824 .907 .915 .044
HitNet23 [17] .849 .809 .831 .906 .910 .055 .871 .806 .823 .935 .938 .023 .875 .834 .853 .926 .929 .037
FSPNet23 [18] .856 .799 .830 .899 .928 .050 .851 .735 .769 .895 .930 .026 .879 .816 .843 .915 .937 .035
BiRefNet .904 .890 .904 .954 .959 .030 .913 .874 .888 .960 .967 .014 .914 .894 .909 .953 .960 .023
[Figure 7 layout: rows Tiny, Slim, Occluded, Multiple; columns Input, GT, Ours, HitNet, FSPNet, ZoomNet, SINetv2.]
Figure 7: Visual comparisons of the proposed BiRefNet and other competitors on COD10K benchmark. Samples with different challenges are provided here to show the superiority of BiRefNet from different perspectives.
Table 7: Comparison of different DIS methods in terms of performance, efficiency, and model complexity. Full details are available at https://drive.google.com/drive/u/0/folders/1s2Xe0cjq-2ctnJBR24563yMSCOu4CcxM.
Model Runtime (ms) #Params (M) MACs (G) DIS-TEs ($HCE_\gamma$↓, $F_\beta^\omega$↑)
BiRefNetSwinL 83.3 215 1143 916, .858
BiRefNetSwinL_cp 78.3 215 1143 916, .858
BiRefNetSwinB 61.4 101 561 954, .855
BiRefNetSwinT 40.9 39 231 980, .822
BiRefNetPVTv2b2 47.8 35 195 969, .838
BiRefNetPVTv2b1 36.6 23 147 978, .817
BiRefNetPVTv2b0 32.9 11 89 1013, .806
IS-Net 16.0 44 160 1016, .726
UDUNRes50 33.5 25 142 977, .772
Figure 8: Comparison of the efficiency, size, complexity, and performance of BiRefNet and existing DIS methods.

Quantitative Results. Tab. 4 shows a quantitative comparison between the proposed BiRefNet and previous state-of-the-art methods. Our BiRefNet outperforms all previous methods on the widely used metrics. The complexities of DIS-TE1∼DIS-TE4 are in ascending order. The metrics for structural similarity (e.g., $S_\alpha$, $E_\phi^x$) focus more on global information, pixel-level metrics such as MAE ($\mathcal{M}$) emphasize the precision of details, and metrics based on mean values (e.g., $E_\phi^m$, $F_\beta^m$) better match the requirements of practical applications where maps are thresholded. As seen in Tab. 4, our BiRefNet outperforms previous methods not only in the accuracy of the global shape but also in pixel-level details. Notably, the gains are especially large in the metrics that cater more to practical applications.

Additionally, our BiRefNet outperforms existing task-specific models on the HRSOD and COD tasks. As shown in Tab. 5, BiRefNet achieves much higher accuracy on both high-resolution and low-resolution SOD benchmarks. Compared with the previous SOTA method [46], our BiRefNet achieves an average improvement of 2.0% in $S_m$. Furthermore, as shown in Tab. 6, on the COD task BiRefNet also performs much better than the previous SOTA models, with an average improvement of 5.6% in $S_m$ on the three widely used COD benchmarks. These results show the remarkable generalization ability of our BiRefNet to similar HR tasks.

For a clearer illustration of the generalizability and strong performance of BiRefNet, we provide a radar chart in Fig. 5, where we run our model and the best task-specific models on the DIS/HRSOD/COD/SOD tasks. As the results show, our BiRefNet achieves leading results on all four tasks, whereas the task-specific models show weaknesses on the other, similar HR segmentation tasks. For example, FSPNet [18] ranks second on the COD benchmarks but only fourth/third/third on the DIS/HRSOD/SOD tasks, respectively.

Qualitative Results. Fig. 6 shows segmentation maps produced by the most competitive existing DIS models and the proposed BiRefNet on samples from all test sets and the validation set. BiRefNet outperforms the previous DIS methods from two perspectives, i.e., locating the target objects and segmenting their details more accurately. For example, in the DIS-TE4 and DIS-TE2 samples, neighboring distractors attract the attention of other models and cause false positives, whereas our BiRefNet suppresses the distractors and accurately segments the target. In the DIS-TE3 and DIS-VD samples, BiRefNet is far more precise in areas rich in fine details. Compared with previous methods, our BiRefNet clearly segments slim shapes and curved edges.

We also provide a qualitative comparison on the COD task. Fig. 7 shows hard samples with different challenges. For example, in the row with the occluded frog, the frog is split into fragments by the branch that covers it, yet our BiRefNet accurately segments the scattered fragments almost identically to the GT map. In contrast, the other methods struggle to find all the fragments, let alone segment them precisely. For tiny and slim objects, BiRefNet is better at locating the right target, and it also shows superiority in finding multiple concealed objects.

Figure 9: Potential applications, selected third-party applications based on BiRefNet, and visual comparisons on social media. (A) Potential application #1: building crack detection for maintaining architectural health. (B) Potential application #2: highly accurate object extraction in high-resolution natural images. (C) A project by viperyl first packages our BiRefNet as a ComfyUI node, making this SOTA model easier for everyone to use. (D) ZHO also provides a ComfyUI-based project that further improves the UI for our BiRefNet, especially for video data. (E) Fal.AI hosts our BiRefNet online with more useful options in the UI and an API to call the model. (F) ZHO provides a visual comparison between our BiRefNet and the previous SOTA method BRIA RMBG v1.4, which was additionally trained on their private dataset. (G) Toyxyz compares our BiRefNet with previous competitive human matting methods (e.g., BRIAAI, RemBG, Robust Video Matting, and Person YOLOv8s) on both videos and images.

Efficiency and Complexity Comparison. We equip our BiRefNet with different backbones to obtain models of different sizes. Their runtime, number of parameters, MACs, and performance are tested to provide a comprehensive comparison with other methods. First, we provide a quantitative comparison in Tab. 7. The largest BiRefNet runs at more than 10 FPS, which is acceptable for most practical applications. We also compile BiRefNetSwinL with PyTorch 2.0 [33] (BiRefNetSwinL_cp), which accelerates its inference by 13%. In addition, we plot the performance and runtime of each model in Fig. 8 for a clearer display. BiRefNet with different backbones is evaluated and compared with existing DIS methods on DIS-TEs and DIS-VD; different methods are drawn in different colors and markers. All tests are conducted on a single NVIDIA A100 GPU and an AMD EPYC 7J13 CPU.
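The compiled variant can be reproduced, in spirit, with the PyTorch 2.0 compiler as sketched below; the module is only a placeholder for the real BiRefNetSwinL, and the measured 13% speed-up is specific to our hardware setup.

```python
import torch

model = torch.nn.Conv2d(3, 1, 3, padding=1).eval()   # placeholder for BiRefNetSwinL
compiled_model = torch.compile(model)                 # requires PyTorch >= 2.0
with torch.no_grad():
    x = torch.randn(1, 3, 1024, 1024)
    y = compiled_model(x)   # first call triggers compilation; later calls run the optimized graph
```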

5 Potential Applications

We envisage that the generated fine segmentation maps have the potential to be utilized in various practical applications.

Potential Application #1: Crack Detection. The condition of walls is important for the health of buildings [59]. However, segmentation models trained on commonly used datasets (e.g., COCO [28]) can only segment regular foreground objects. The proposed BiRefNet trained on the DIS5K dataset is more aware of fine details and can also segment targets with higher shape complexity. As shown in Fig. 9 (A), our BiRefNet can accurately find cracks in walls and help determine when to repair them.

Potential Application #2: Highly Accurate Object Extraction. Foreground object extraction and background removal have been popular applications in recent years. However, common methods fail to generate high-quality results when target objects have highly complex shapes [36, 35], or they require manual guidance (e.g., scribbles, points, or coarse masks) for more accurate segmentation [26, 5]. The proposed BiRefNet trained on DIS5K can generate results at much higher resolution and segment thin, hair-level threads without any mask, as shown in Fig. 9 (B). Such refined results may enable numerous successful downstream applications in the future.

6 Third-Party Creations

Since the release of our project on Mar 7, 2024, it has attracted much attention from researchers and developers in the community, who have promoted it spontaneously. Moreover, great third-party applications have been built on top of our BiRefNet. Due to the rapid growth of related works, we only list some typical ones.

#1 Practical Applications. Because of the excellent performance of our BiRefNet, more and more third-party applications have been created by developers in the community (https://github.com/comfyanonymous/ComfyUI, https://github.com/ZHO-ZHO-ZHO/ComfyUI-BiRefNet-ZHO). As shown in Fig. 9 (C and D), some developers have integrated our BiRefNet into ComfyUI as a node, which helps to matte foregrounds for better processing in subsequent Stable Diffusion models. For easier online access, Fal.AI has established an online demo of our BiRefNet running on an A6000 GPU (https://fal.ai/models/birefnet; we thank FAL.AI for providing additional computation resources for further exploration of more practical applications), as shown in Fig. 9 (E). In addition to the regular prediction of results, this online application also provides an API service for easy use via HTTP requests.

#2 Social Media. Recently, our BiRefNet has drawn attention from the community, and many posts have been made on the X platform (formerly Twitter) (https://twitter.com/search?q=birefnet&src=typed_query). ZHO provides a visual comparison between our BiRefNet and other methods, as given in Fig. 9 (F); BiRefNet achieves results competitive with the previous SOTA method BRIA RMBG v1.4 in their tests (https://twitter.com/ZHOZHO672070/status/1771026516388041038). It should be noted that our BiRefNet was trained on the training set of the open-source DIS5K dataset [37] under the MIT license, whereas the other model was trained on carefully selected private data and cannot be used commercially. As shown in Fig. 9 (G), more comparisons on both video and image data between our BiRefNet and previous strong foreground/human matting methods have been provided by Toyxyz on X (https://twitter.com/toyxyz3/status/1771413245267746952). In addition to these posts, PurzBeats has made an animation with our BiRefNet and uploaded the relevant videos (https://twitter.com/i/status/1772323682934775896). A video tutorial on how to use our BiRefNet in ComfyUI can also be found on YouTube, by ‘AI is in wonderland’, in Japanese (https://www.youtube.com/watch?v=o2_nMDUYk6s).

7 Conclusions

This work proposes the BiRefNet framework equipped with a bilateral reference, which can perform dichotomous image segmentation, high-resolution (HR) salient object detection, and concealed object detection within the same framework. Through comprehensive experiments, we find that unscaled source images and a focus on regions of rich information are vital for generating fine, detailed areas in HR images. To this end, we propose the bilateral reference to fill in the missing information in the fine parts (inward reference) and to guide the model to focus more on regions with richer details (outward reference), which significantly improves the model's ability to capture fine pixel-level features. To alleviate the high cost of training on HR data, we also provide various practical tricks that deliver higher-quality predictions and faster convergence. Competitive results on 13 benchmarks demonstrate the outstanding performance and strong generalization ability of our BiRefNet. We also show that the techniques of BiRefNet can be transferred to many practical applications. We hope that the proposed framework can encourage the development of unified models for various tasks in the academic community and that our model can empower and inspire the developer community to create more great works.

References

  • Achanta et al. [2009] Radhakrishna Achanta, Sheila Hemami, Francisco Estrada, and Sabine Susstrunk. Frequency-tuned salient region detection. In IEEE / CVF Computer Vision and Pattern Recognition Conference, 2009.
  • Borji et al. [2015] Ali Borji, Ming-Ming Cheng, Huaizu Jiang, and Jia Li. Salient object detection: A benchmark. IEEE Transactions on Image Processing, 24:5706–5722, 2015.
  • Chen et al. [2018] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In European Conference on Computer Vision Workshop, 2018.
  • Dai et al. [2017] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In IEEE / CVF International Conference on Computer Vision, 2017.
  • Dai et al. [2022] Linhui Dai, Xiang Song, Xiaohong Liu, Chengqi Li, Zhihao Shi, Martin Brooks, and Jun Chen. Enabling trimap-free image matting with a frequency-guided saliency-aware network via joint learning. IEEE Transactions on Multimedia, 25:4868–4879, 2022.
  • Deng et al. [2023] Xinhao Deng, Pingping Zhang, Wei Liu, and Huchuan Lu. Recurrent multi-scale transformer for high-resolution salient object detection. In ACM International Conference on Multimedia, 2023.
  • Fan et al. [2017] Deng-Ping Fan, Ming-Ming Cheng, Yun Liu, Tao Li, and Ali Borji. Structure-measure: A new way to evaluate foreground maps. In IEEE / CVF International Conference on Computer Vision, 2017.
  • Fan et al. [2018] Deng-Ping Fan, Cheng Gong, Yang Cao, Bo Ren, Ming-Ming Cheng, and Ali Borji. Enhanced-alignment measure for binary foreground map evaluation. In International Joint Conference on Artificial Intelligence, 2018.
  • Fan et al. [2020] Deng-Ping Fan, Ge-Peng Ji, Guolei Sun, Ming-Ming Cheng, Jianbing Shen, and Ling Shao. Camouflaged object detection. In IEEE / CVF Computer Vision and Pattern Recognition Conference, 2020.
  • Fan et al. [2022] Deng-Ping Fan, Ge-Peng Ji, Ming-Ming Cheng, and Ling Shao. Concealed object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(10):6024–6042, 2022.
  • Fan et al. [2023a] Deng-Ping Fan, Ge-Peng Ji, Peng Xu, Ming-Ming Cheng, Christos Sakaridis, and Luc Van Gool. Advances in deep concealed scene understanding. Visual Intelligence, 1(1):16, 2023a.
  • Fan et al. [2023b] Deng-Ping Fan, Jing Zhang, Gang Xu, Ming-Ming Cheng, and Ling Shao. Salient objects in clutter. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(2):2344–2366, 2023b.
  • Grosky and Jain [1986] William I Grosky and Ramesh Jain. A pyramid-based approach to segmentation applied to region matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(5):639–650, 1986.
  • He et al. [2023a] Chunming He, Kai Li, Yachao Zhang, Longxiang Tang, Yulun Zhang, Zhenhua Guo, and Xiu Li. Camouflaged object detection with feature decomposition and edge reconstruction. In IEEE / CVF Computer Vision and Pattern Recognition Conference, 2023a.
  • He et al. [2023b] Chunming He, Kai Li, Yachao Zhang, Yulun Zhang, Chenyu You, Zhenhua Guo, Xiu Li, Martin Danelljan, and Fisher Yu. Strategic preys make acute predators: Enhancing camouflaged object detectors by generating camouflaged objects. In International Conference on Learning Representations, 2023b.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE / CVF Computer Vision and Pattern Recognition Conference, 2016.
  • Hu et al. [2023] Xiaobin Hu, Shuo Wang, Xuebin Qin, Hang Dai, Wenqi Ren, Donghao Luo, Ying Tai, and Ling Shao. High-resolution iterative feedback network for camouflaged object detection. In AAAI Conference on Artificial Intelligence, 2023.
  • Huang et al. [2023] Zhou Huang, Hang Dai, Tian-Zhu Xiang, Shuo Wang, Huai-Xin Chen, Jie Qin, and Huan Xiong. Feature shrinkage pyramid for camouflaged object detection with transformers. In IEEE / CVF Computer Vision and Pattern Recognition Conference, 2023.
  • Ji et al. [2023] Ge-Peng Ji, Deng-Ping Fan, Yu-Cheng Chou, Dengxin Dai, Alexander Liniger, and Luc Van Gool. Deep gradient learning for efficient camouflaged object detection. Machine Intelligence Research, 20(1):92–108, 2023.
  • Jia et al. [2022] Qi Jia, Shuilian Yao, Yu Liu, Xin Fan, Risheng Liu, and Zhongxuan Luo. Segment, magnify and reiterate: Detecting camouflaged objects the hard way. In IEEE / CVF Computer Vision and Pattern Recognition Conference, 2022.
  • Kim et al. [2022] Taehun Kim, Kunhee Kim, Joonyeong Lee, Dongmin Cha, Jiho Lee, and Daijin Kim. Revisiting image pyramid structure for high resolution salient object detection. In Asian Conference on Computer Vision, 2022.
  • Kingma and Ba [2015] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
  • Lai et al. [2017] Wei-Sheng Lai, Jia-Bin Huang, Narendra Ahuja, and Ming-Hsuan Yang. Deep laplacian pyramid networks for fast and accurate super-resolution. In IEEE / CVF Computer Vision and Pattern Recognition Conference, 2017.
  • Lai et al. [2018] Wei-Sheng Lai, Jia-Bin Huang, Narendra Ahuja, and Ming-Hsuan Yang. Fast and accurate image super-resolution with deep laplacian pyramid networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(11):2599–2613, 2018.
  • Le et al. [2019] Trung-Nghia Le, Tam V Nguyen, Zhongliang Nie, Minh-Triet Tran, and Akihiro Sugimoto. Anabranch network for camouflaged object segmentation. Computer Vision and Image Understanding, 184:45–56, 2019.
  • Li et al. [2022] Jizhizi Li, Jing Zhang, Stephen J Maybank, and Dacheng Tao. Bridging composite and real: towards end-to-end deep image matting. International Journal of Computer Vision, 130(2):246–266, 2022.
  • Li et al. [2023] Xiaofei Li, Jiaxin Yang, Shuohao Li, Jun Lei, Jun Zhang, and Dong Chen. Locate, refine and restore: A progressive enhancement network for camouflaged object detection. In International Joint Conference on Artificial Intelligence, 2023.
  • Lin et al. [2014] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, 2014.
  • Lin et al. [2017] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In IEEE / CVF Computer Vision and Pattern Recognition Conference, 2017.
  • Liu et al. [2021] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In IEEE / CVF International Conference on Computer Vision, 2021.
  • Lv et al. [2021] Yunqiu Lv, Jing Zhang, Yuchao Dai, Aixuan Li, Bowen Liu, Nick Barnes, and Deng-Ping Fan. Simultaneously localize, segment and rank the camouflaged objects. In IEEE / CVF Computer Vision and Pattern Recognition Conference, 2021.
  • Pang et al. [2022] Youwei Pang, Xiaoqi Zhao, Tian-Zhu Xiang, Lihe Zhang, and Huchuan Lu. Zoom in and out: A mixed-scale triplet network for camouflaged object detection. In IEEE / CVF Computer Vision and Pattern Recognition Conference, 2022.
  • Paszke et al. [2019] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 2019.
  • Pei et al. [2023] Jialun Pei, Zhangjun Zhou, Yueming Jin, He Tang, and Pheng-Ann Heng. Unite-divide-unite: Joint boosting trunk and structure for high-accuracy dichotomous image segmentation. In ACM International Conference on Multimedia, 2023.
  • Qin et al. [2019] Xuebin Qin, Zichen Zhang, Chenyang Huang, Chao Gao, Masood Dehghan, and Martin Jagersand. Basnet: Boundary-aware salient object detection. In IEEE / CVF Computer Vision and Pattern Recognition Conference, 2019.
  • Qin et al. [2020] Xuebin Qin, Zichen Zhang, Chenyang Huang, Masood Dehghan, Osmar R Zaiane, and Martin Jagersand. U2-net: Going deeper with nested u-structure for salient object detection. Pattern Recognition, 106:107404, 2020.
  • Qin et al. [2022] Xuebin Qin, Hang Dai, Xiaobin Hu, Deng-Ping Fan, Ling Shao, et al. Highly accurate dichotomous image segmentation. In European Conference on Computer Vision, 2022.
  • Ronneberger et al. [2015] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer Assisted Interventions, 2015.
  • Shen et al. [2022] Tiancheng Shen, Yuechen Zhang, Lu Qi, Jason Kuen, Xingyu Xie, Jianlong Wu, Zhe Lin, and Jiaya Jia. High quality segmentation for ultra high-resolution images. In IEEE / CVF Computer Vision and Pattern Recognition Conference, 2022.
  • Sun et al. [2022] Yujia Sun, Shuo Wang, Chenglizhao Chen, and Tian-Zhu Xiang. Boundary-guided camouflaged object detection. In International Joint Conference on Artificial Intelligence, 2022.
  • Tang et al. [2021a] Chufeng Tang, Hang Chen, Xiao Li, Jianmin Li, Zhaoxiang Zhang, and Xiaolin Hu. Look closer to segment better: Boundary patch refinement for instance segmentation. In IEEE / CVF Computer Vision and Pattern Recognition Conference, 2021a.
  • Tang et al. [2021b] Lv Tang, Bo Li, Yijie Zhong, Shouhong Ding, and Mofei Song. Disentangled high quality salient object detection. In IEEE / CVF Computer Vision and Pattern Recognition Conference, 2021b.
  • Wang et al. [2020] Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong Liu, Yadong Mu, Mingkui Tan, Xinggang Wang, et al. Deep high-resolution representation learning for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(10):3349–3364, 2020.
  • Wang et al. [2017] Lijun Wang, Huchuan Lu, Yifan Wang, Mengyang Feng, Dong Wang, Baocai Yin, and Xiang Ruan. Learning to detect salient objects with image-level supervision. In IEEE / CVF Computer Vision and Pattern Recognition Conference, 2017.
  • Wei et al. [2020] Jun Wei, Shuhui Wang, Zhe Wu, Chi Su, Qingming Huang, and Qi Tian. Label decoupling framework for salient object detection. In IEEE / CVF Computer Vision and Pattern Recognition Conference, 2020.
  • Xie et al. [2022] Chenxi Xie, Changqun Xia, Mingcan Ma, Zhirui Zhao, Xiaowu Chen, and Jia Li. Pyramid grafting network for one-stage high resolution saliency detection. In IEEE / CVF Computer Vision and Pattern Recognition Conference, 2022.
  • Xu et al. [2017] Ning Xu, Brian Price, Scott Cohen, and Thomas Huang. Deep image matting. In IEEE / CVF Computer Vision and Pattern Recognition Conference, 2017.
  • Yang et al. [2013] Chuan Yang, Lihe Zhang, Huchuan Lu, Xiang Ruan, and Ming-Hsuan Yang. Saliency detection via graph-based manifold ranking. In IEEE / CVF Computer Vision and Pattern Recognition Conference, 2013.
  • Yang et al. [2018] Maoke Yang, Kun Yu, Chi Zhang, Zhiwei Li, and Kuiyuan Yang. Denseaspp for semantic segmentation in street scenes. In IEEE / CVF Computer Vision and Pattern Recognition Conference, 2018.
  • Yin et al. [2022] Bowen Yin, Xuying Zhang, Qibin Hou, Bo-Yuan Sun, Deng-Ping Fan, and Luc Van Gool. Camoformer: Masked separable attention for camouflaged object detection. arXiv preprint arXiv:2212.06570, 2022.
  • Yu et al. [2021] Qihang Yu, Jianming Zhang, He Zhang, Yilin Wang, Zhe Lin, Ning Xu, Yutong Bai, and Alan Yuille. Mask guided matting via progressive refinement network. In IEEE / CVF Computer Vision and Pattern Recognition Conference, 2021.
  • Yu et al. [2024] Qian Yu, Xiaoqi Zhao, Youwei Pang, Lihe Zhang, and Huchuan Lu. Multi-view aggregation network for dichotomous image segmentation. arXiv preprint arXiv:2404.07445, 2024.
  • Zeng et al. [2019] Yi Zeng, Pingping Zhang, Jianming Zhang, Zhe Lin, and Huchuan Lu. Towards high-resolution salient object detection. In IEEE / CVF Computer Vision and Pattern Recognition Conference, 2019.
  • Zhang et al. [2020] Zhao Zhang, Wenda Jin, Jun Xu, and Ming-Ming Cheng. Gradient-induced co-saliency detection. In European Conference on Computer Vision, 2020.
  • Zhao et al. [2017] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In IEEE / CVF Computer Vision and Pattern Recognition Conference, 2017.
  • Zhao et al. [2018] Hengshuang Zhao, Xiaojuan Qi, Xiaoyong Shen, Jianping Shi, and Jiaya Jia. Icnet for real-time semantic segmentation on high-resolution images. In European Conference on Computer Vision, 2018.
  • Zhong et al. [2022] Yijie Zhong, Bo Li, Lv Tang, Senyun Kuang, Shuang Wu, and Shouhong Ding. Detecting camouflaged object in frequency domain. In IEEE / CVF Computer Vision and Pattern Recognition Conference, 2022.
  • Zhou et al. [2023] Yan Zhou, Bo Dong, Yuanfeng Wu, Wentao Zhu, Geng Chen, and Yanning Zhang. Dichotomous image segmentation with frequency priors. In International Joint Conference on Artificial Intelligence, 2023.
  • Zou et al. [2018] Qin Zou, Zheng Zhang, Qingquan Li, Xianbiao Qi, Qian Wang, and Song Wang. Deepcrack: Learning hierarchical convolutional features for crack detection. IEEE Transactions on Image Processing, 28(3):1498–1512, 2018.