
Bilateral Reference for High-Resolution Dichotomous Image Segmentation

Peng Zheng1,4,5,6†   Dehong Gao2   Deng-Ping Fan1∗   Li Liu3
Jorma Laaksonen4   Wanli Ouyang5   Nicu Sebe6  
1 Nankai University, 2 Northwest Polytechnical University,
3 National University of Defense Technology, 4 Aalto University,
5 Shanghai AI Laboratory, 6 University of Trento
[email protected]
Abstract

We introduce a novel bilateral reference framework (BiRefNet) for high-resolution dichotomous image segmentation (DIS). It comprises two essential components: the localization module (LM) and the reconstruction module (RM) with our proposed bilateral reference (BiRef). LM aids in object localization using global semantic information. Within the RM, we utilize BiRef for the reconstruction process, where hierarchical patches of images provide the source reference, and gradient maps serve as the target reference. These components collaborate to generate the final predicted maps. We also introduce auxiliary gradient supervision to enhance the focus on regions with finer details. In addition, we outline practical training strategies tailored for DIS to improve map quality and the training process. To validate the general applicability of our approach, we conduct extensive experiments on four tasks to evince that BiRefNet exhibits remarkable performance, outperforming task-specific cutting-edge methods across all benchmarks. Our codes are publicly available at https://github.com/ZhengPeng7/BiRefNet.

Figure 1: Visual comparison between the results of our proposed BiRefNet and the latest state-of-the-art methods (e.g., IS-Net [37] and UDUN [34]) for high-resolution dichotomous image segmentation (DIS). Details of segmentation are zoomed in for better display.
† Peng Zheng finished the majority of this work while he was a visiting scholar at Nankai University.
∗ Corresponding author ([email protected]).

1 Introduction

Figure 2: Comparison between our proposed BiRefNet and other existing methods for HR segmentation tasks. (a) Common framework [38]; (b) Image pyramid as input [13, 56]; (c) Scaled images as inward reference [21, 27]; (d) BiRefNet: patches of original images at original scales as inward reference and gradient priors as outward reference. Enc = encoder, Dec = decoder.

With the advancement in high-resolution image acquisition, image segmentation technology has evolved from traditional coarse localization to achieving high-precision object segmentation. This task, whether it involves salient [12] or concealed object detection [11, 10], is referred to as high-resolution dichotomous image segmentation (DIS) [37] and has attracted widespread attention and use in the industry, e.g., by Samsung, Adobe, and Disney.

For the new DIS task, recent works have considered strategies such as intermediate supervision [37], frequency priors [58], and unite-divide-unite [34], and have achieved favorable results. Essentially, they either split the supervision at the feature level [34, 37] or introduce an additional prior [58] to enhance feature extraction. These strategies are, however, still insufficient to capture very fine features (see Fig. 1). We observe that fine and non-salient structures of objects are well reflected by gradient features obtained through derivative operations on the original image. However, when certain regions are highly similar to the background in color and texture, their gradient features may be too weak. For such cases, we further introduce ground-truth (GT) features for side supervision, allowing the framework to learn the characteristics of these regions. We refer to the combination of the image reference with the gradient and GT references as the bilateral reference.

We propose a novel progressive bilateral reference network, BiRefNet, to handle the high-resolution DIS task with separate localization and reconstruction modules. For the localization module, we extract hierarchical features from a vision transformer backbone, which are combined and squeezed to obtain coarse low-resolution predictions in the deep layers. For the reconstruction module, we design the inward and outward references that form the bilateral reference (BiRef), in which the source image and the gradient map are fed into the decoder at different stages. Instead of resizing the original images to lower-resolution versions to match the decoding features at each stage [21, 27], we keep the original resolution to preserve intact detail features in the inward reference and adaptively crop the images into patches compatible with the decoding features. In addition, we investigate and summarize practical strategies for training on high-resolution (HR) data, including long training and region-level losses for better segmentation of fine details, and multi-stage supervision to accelerate their learning.

Our main contributions are summarized as follows:

  1. We present a bilateral reference network (BiRefNet), which is a simple yet strong baseline to perform high-quality dichotomous image segmentation.

  2. We propose a bilateral reference module, which consists of an inward reference with source image guidance and an outward reference with gradient supervision. It shows great efficacy in reconstructing the predicted HR results.

  3. We explore and summarize various practical strategies tailored for DIS that easily improve performance, prediction quality, and convergence speed.

  4. The proposed BiRefNet demonstrates excellent performance and strong generalization, achieving state-of-the-art results not only on DIS5K but also on HRSOD and COD, with average $S_m$ [7] improvements of 6.8%, 2.0%, and 5.6%, respectively.

Figure 3: Pipeline of the proposed bilateral reference network (BiRefNet). BiRefNet mainly consists of the localization module (LM) and the reconstruction module (RM) with bilateral reference (BiRef) blocks. Please refer to Sec. 3.1 for details.

2 Related Works

2.1 High-Resolution Class-agnostic Segmentation

High-resolution class-agnostic segmentation has been a typical computer vision objective for decades, and many related tasks have been proposed and attracted much attention, such as dichotomous image segmentation (DIS) [37], high-resolution salient object detection (HRSOD) [53], and concealed object detection (COD) [10]. To provide standard HRSOD benchmarks, several typical HRSOD datasets (e.g., HRSOD [53], UHRSD [46], HRS10K [6]) and numerous approaches [21, 42, 53, 46] have been proposed. Zeng et al. [53] employed a global-local fusion of the multi-scale input in their network. Xie et al. [46] used cross-model grafting modules to process images at different scales from multiple backbones (lightweight [16] and heavy [30]). Pyramid blending was also used in [21] for a lower computational cost. Concealed objects are difficult to locate due to similar-looking surrounding distractors [9]. Therefore, image priors, such as frequency [57], boundary [40], and gradient [19], are used as auxiliary guidance to train COD models. Furthermore, a higher resolution has been found beneficial for detecting targets [19, 18, 17]. To produce more precise and fine-detail segmentation results, Yin et al. [50] employed progressive refinement with masked separable attention. Li et al. [27] incorporated the original images at different scales to aid in the refining process.

High-resolution DIS is a newly proposed task that focuses more on the complex, slender structures of target objects in high-resolution images, making it even more challenging. Qin et al. [37] proposed the DIS5K dataset and IS-Net with intermediate supervision to alleviate the loss of fine areas. In addition, Zhou et al. [58] embedded a frequency prior in their DIS network to capture more details. Pei et al. [34] applied a label decoupling strategy [45] to the DIS task and achieved competitive segmentation performance in the boundary areas of objects. Yu et al. [52] used patches of HR images to accelerate training in a more memory-efficient way. Unlike previous models that used compressed/resized images to enhance HR segmentation, we utilize intact HR images as supplementary information for better high-resolution predictions.

2.2 Progressive Refinement in Segmentation

In the image matting task, trimaps have been used as a pre-positioning technique for more precise segmentation results [26, 47]. In Fig. 2, we illustrate the architectures of relevant networks and compare their differences [38, 55, 37, 58, 34]. Many approaches have been built on the progressive refinement strategy. Yu et al. [51] used the predicted LR alpha matte as a guide for refined HR maps. In BASNet [35], the initial results are revised with an additional refiner network. CRM [39] continuously aligns the feature map with the refinement target to aggregate detailed features. In ICNet [56], the original images are downscaled and added to the decoder output at different stages for refinement. In ICEG [15], the generator and the detector iteratively evolve through interaction to obtain better COD segmentation results. In addition to images and GT, auxiliary information is also used in existing methods; for example, Tang et al. [41] cropped patches along the boundary to further refine them, and the LapSRN network [23, 24] for image super-resolution generates Laplacian pyramids to help reconstruct images at higher resolution. Although these methods successfully employ refinement to achieve better results, their models are not guided to focus on particular areas, which is a problem in DIS. Therefore, we introduce gradient supervision in our outward reference to make the features sensitive to areas with richer fine details.

3 Methodology

3.1 Overview

As shown in Fig. 2(d), our proposed BiRefNet is different from the previous DIS methods. On the one hand, our BiRefNet explicitly decomposes the DIS task on HR data into two modules, i.e., a localization module (LM) and a reconstruction module (RM). On the other hand, instead of directly adding the source images [13] or the priors [58] to the input, BiRefNet employs our proposed bilateral reference in the RM, making full use of the source images at the original scales and the gradient priors. The complete framework of our BiRefNet is illustrated in Fig. 3.

Figure 4: Pipeline of the proposed bilateral reference blocks. The source images at the original scale are combined with decoder features as the inward reference and fed into the reconstruction block, where deformable convolutions with hierarchical receptive fields are employed. The aggregated features are then used to predict the gradient maps in the outward reference. Gradient-aware features are then turned into the attention map to act on the original features.

3.2 Localization Module

For a batch of HR images $\mathcal{I} \in \mathbb{R}^{N\times 3\times H\times W}$ as input, the transformer encoder [30] extracts features at different stages, i.e., $\mathcal{F}_1^e, \mathcal{F}_2^e, \mathcal{F}_3^e, \mathcal{F}^e$, with resolutions $\{[\frac{H}{k}, \frac{W}{k}],\ k=4,8,16,32\}$. The features of the first three stages $\{\mathcal{F}_i^e\}_{i=1}^{3}$ are transferred to the corresponding decoder stages via lateral connections (1$\times$1 convolution layers). Meanwhile, they are stacked and concatenated in the last encoder block to generate $\mathcal{F}^e$.

The encoder output feature $\mathcal{F}^e$ is then fed into a classification module, where it passes through a global average pooling layer and a fully connected layer for classification over the category set $C$, yielding a better semantic representation for localization. The HR features are squeezed in the bottleneck. To simultaneously enlarge the receptive fields to cover large objects and focus on local features for high precision [49], which is important for the HR tasks involved, we employ an ASPP module [3] here for multi-context fusion. $\mathcal{F}^e$ is squeezed to $\mathcal{F}^d$ and transferred to the reconstruction module.
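A minimal PyTorch sketch of this localization path is given below. It assumes a generic four-stage backbone with Swin-L-like channel widths; the single-branch squeeze stands in for the ASPP block, and all channel sizes are illustrative placeholders rather than the exact implementation.

```python
import torch
import torch.nn as nn

class LocalizationModule(nn.Module):
    """Sketch of the LM: 1x1 lateral convs, an auxiliary classification head on the
    deepest feature, and a squeeze (ASPP stand-in) producing the decoder feature F^d."""

    def __init__(self, channels=(192, 384, 768, 1536), num_classes=219, d_channels=256):
        super().__init__()
        # Lateral 1x1 convolutions for the first three encoder stages.
        self.laterals = nn.ModuleList([nn.Conv2d(c, d_channels, 1) for c in channels[:3]])
        # Auxiliary classification head on the deepest feature F^e (supervised by CE loss).
        self.cls_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(channels[-1], num_classes))
        # Simplified stand-in for the ASPP block that squeezes F^e to F^d.
        self.squeeze = nn.Sequential(
            nn.Conv2d(channels[-1], d_channels, 3, padding=1),
            nn.BatchNorm2d(d_channels), nn.ReLU(inplace=True))

    def forward(self, enc_feats):
        f1, f2, f3, fe = enc_feats                      # strides 4, 8, 16, 32
        laterals = [lat(f) for lat, f in zip(self.laterals, (f1, f2, f3))]
        cls_logits = self.cls_head(fe)                  # semantic supervision
        fd = self.squeeze(fe)                           # F^d, handed to the RM
        return laterals, fd, cls_logits
```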

3.3 Reconstruction Module

The setting of the receptive field (RF) has been a challenge in HR segmentation. Small RFs provide inadequate context to locate the right target on a large background, whereas large RFs often result in insufficient feature extraction in detailed areas. To achieve a balance, we propose the reconstruction block (RB) in each BiRef block as a replacement for vanilla residual blocks. In the RB, we employ deformable convolutions [4] with hierarchical receptive fields (i.e., 1$\times$1, 3$\times$3, 7$\times$7) and an adaptive average pooling layer to extract features with RFs of various scales. The features extracted by the different RFs are concatenated as $\mathcal{F}_i^{\theta}$, followed by a 1$\times$1 convolution layer and a batch normalization layer to generate the output feature of the RB, $\mathcal{F}_i^{d'}$. In the reconstruction module, the squeezed feature $\mathcal{F}^d$ is fed into the first BiRef block as the feature $\mathcal{F}_3^d$. With $\mathcal{F}_3^l$, the first BiRef block predicts coarse maps, which are then reconstructed into higher-resolution versions through the following BiRef blocks. Following [29], the output feature of each BiRef block $\mathcal{F}_i^d$ is added to its lateral feature $\mathcal{F}_i^l$ from the LM at each stage, i.e., $\{\mathcal{F}_i^{d+} = \textit{Upsample}\uparrow(\mathcal{F}_i^d + \mathcal{F}_i^l),\ i = 3, 2, 1\}$. Meanwhile, all BiRef blocks generate intermediate predictions $\{\mathcal{M}_i\}_{i=3}^{1}$ under multi-stage supervision, with resolutions in ascending order. Finally, the last decoding feature $\mathcal{F}_1^{d+}$ is passed through a 1$\times$1 convolution layer to obtain the final predicted maps $\mathcal{M} \in \mathbb{R}^{N\times 1\times H\times W}$.
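As a rough illustration, the sketch below mirrors the multi-receptive-field design of the RB. Plain convolutions stand in for the deformable convolutions of the paper (offset prediction omitted), and channel sizes are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReconstructionBlock(nn.Module):
    """Sketch of the RB: parallel 1x1 / 3x3 / 7x7 branches plus a global pooling
    branch, concatenated (F^theta) and fused by a 1x1 conv + BN (-> F^{d'})."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, out_ch, 1)
        self.branch3 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.branch7 = nn.Conv2d(in_ch, out_ch, 7, padding=3)
        self.pool = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(in_ch, out_ch, 1))
        self.fuse = nn.Sequential(nn.Conv2d(4 * out_ch, out_ch, 1), nn.BatchNorm2d(out_ch))

    def forward(self, x):
        h, w = x.shape[-2:]
        pooled = F.interpolate(self.pool(x), size=(h, w), mode='bilinear', align_corners=False)
        f_theta = torch.cat([self.branch1(x), self.branch3(x), self.branch7(x), pooled], dim=1)
        return self.fuse(f_theta)          # F_i^{d'}
```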

3.4 Bilateral Reference

In DIS, HR training images are very important for deep models to learn details and perform highly accurate segmentation. However, most segmentation models follow previous works [38, 29] in designing the network as an encoder-decoder structure with down-sampling and up-sampling. Besides, due to the large input size, concentrating on the target objects becomes more challenging. To deal with these two main problems, we propose the bilateral reference, consisting of an inward reference (InRef) and an outward reference (OutRef), as illustrated in Fig. 4. The inward and outward references supplement HR information and draw attention to areas with dense details, respectively.

In InRef, the images $\mathcal{I}$ at the original high resolution are cropped into patches $\{\mathcal{P}_k\}_{k=1}^{N}$ whose size is consistent with the output features of the corresponding decoder stage. These patches are stacked with the original feature $\mathcal{F}_i^{d+}$ and fed into the RM. Existing methods with similar techniques either add $\mathcal{I}$ only at the last decoding stage [34] or resize $\mathcal{I}$ to make it compatible with the low-resolution features. Our inward reference avoids both problems through adaptive cropping and supplies the necessary HR information at every stage.
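A possible realization of this adaptive cropping is sketched below. It assumes the image side lengths are exact multiples of the decoder feature size; how the stacked tensor is consumed afterwards is up to the BiRef block and not shown.

```python
import torch

def inward_reference(image: torch.Tensor, feat: torch.Tensor) -> torch.Tensor:
    """Crop the full-resolution image into non-overlapping patches whose spatial
    size matches the decoder feature, then stack them with the feature along the
    channel dimension (a sketch of the inward reference)."""
    B, C, H, W = image.shape
    h, w = feat.shape[-2:]
    # (B, C, H//h, W//w, h, w): grid of h x w patches cut from the original image.
    patches = image.unfold(2, h, h).unfold(3, w, w)
    # Fold the patch grid into channels -> (B, C * (H//h) * (W//w), h, w).
    patches = patches.reshape(B, C * (H // h) * (W // w), h, w)
    return torch.cat([feat, patches], dim=1)

# Example: a 1024x1024 image and a 1/8-scale feature (128x128) yield 64 RGB patches,
# i.e., 192 extra channels stacked onto the decoder feature.
```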

In OutRef, we use gradient labels to draw more attention to areas with richer gradient information, which is essential for segmenting fine structures. First, we extract the gradient maps of the input images as $G_i^{gt}$. Meanwhile, $\mathcal{F}_i^{\theta}$ is used to generate the feature $\mathcal{F}_i^{G}$, which produces the predicted gradient maps $\hat{G}_i$. With this gradient supervision, $\mathcal{F}_i^{G}$ becomes sensitive to gradients. It then passes through a convolution and a sigmoid layer to generate the gradient-referring attention $\mathcal{A}_i^{G}$, which is multiplied by $\mathcal{F}_i^{d'}$ to generate the output of the BiRef block, $\mathcal{F}_{i-1}^{d}$.

Considering that the background may contain non-target noise with abundant gradient information, we apply a masking strategy to alleviate the influence of non-target areas. We perform morphological operations on the intermediate predictions $\mathcal{M}_i$ and use the dilated $\mathcal{M}_i$ as a mask. The mask is multiplied by the gradient map $G_i^{gt}$ to generate $G_i^{m}$, in which the gradients outside the masked area are removed.
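A sketch of the two label-side steps (gradient extraction and foreground masking) follows. The finite-difference operator, the 0.5 threshold, and the dilation kernel size are illustrative choices; the learned sigmoid attention $\mathcal{A}_i^{G}$ that re-weights $\mathcal{F}_i^{d'}$ is a 1$\times$1 convolution not shown here.

```python
import torch
import torch.nn.functional as F

def gradient_label(gray: torch.Tensor) -> torch.Tensor:
    """Finite-difference gradient magnitude of a (B, 1, H, W) grayscale image,
    used as the gradient label G^gt (a Sobel operator would also work)."""
    dx = F.pad(gray[..., :, 1:] - gray[..., :, :-1], (0, 1, 0, 0))
    dy = F.pad(gray[..., 1:, :] - gray[..., :-1, :], (0, 0, 0, 1))
    return torch.sqrt(dx ** 2 + dy ** 2)

def masked_gradient_label(grad_gt: torch.Tensor, pred: torch.Tensor, ksize: int = 7) -> torch.Tensor:
    """Dilate the intermediate prediction M_i (morphological dilation approximated
    by max-pooling) and keep only the gradients inside the dilated region -> G^m."""
    mask = F.max_pool2d((pred > 0.5).float(), ksize, stride=1, padding=ksize // 2)
    return grad_gt * mask
```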

3.5 Objective Function

In HR segmentation tasks, using only pixel-level supervision (BCE loss) usually results in the deterioration of detailed structural information in HR data. Inspired by the strong results of the hybrid loss in [35], we combine BCE, IoU, SSIM, and CE losses for supervision at the pixel, region, boundary, and semantic levels, respectively. The final objective function is a weighted combination of these losses and can be formulated as:

$\begin{aligned} L &= L_{\text{pixel}} + L_{\text{region}} + L_{\text{boundary}} + L_{\text{semantic}} \\ &= \lambda_1 L_{\text{BCE}} + \lambda_2 L_{\text{IoU}} + \lambda_3 L_{\text{SSIM}} + \lambda_4 L_{\text{CE}}, \end{aligned}$    (1)

where $\lambda_1$, $\lambda_2$, $\lambda_3$, and $\lambda_4$ are set to 30, 0.5, 10, and 5, respectively, to keep all losses at the same quantitative level at the beginning of training. Here, BCE, IoU, SSIM, and CE denote the binary cross-entropy, intersection over union, structural similarity index measure, and cross-entropy losses. The complete definitions of the losses are given below, followed by a sketch of the combined objective.

  • BCE loss: pixel-level supervision for the generation of binary maps:

    $L_{\text{BCE}} = -\sum_{(i,j)} \big[ G(i,j)\log(\mathcal{M}(i,j)) + (1 - G(i,j))\log(1 - \mathcal{M}(i,j)) \big],$    (2)

    where $G(i,j)$ and $\mathcal{M}(i,j)$ denote the values of the GT and predicted maps, respectively, at pixel $(i,j)$.

  • IoU loss: region-aware supervision for the enhancement of binary map predictions:

    $L_{\text{IoU}} = 1 - \dfrac{\sum_{i=1}^{H}\sum_{j=1}^{W} \mathcal{M}(i,j)\,G(i,j)}{\sum_{i=1}^{H}\sum_{j=1}^{W} \big[\mathcal{M}(i,j) + G(i,j) - \mathcal{M}(i,j)\,G(i,j)\big]}.$    (3)
  • SSIM loss: boundary-aware supervision to improve accuracy in boundary regions. Given GT maps $G$ and predicted maps $\mathcal{M}$, $\mathbf{y}=\{y_j : j=1,\dots,N^2\}$ and $\mathbf{x}=\{x_j : j=1,\dots,N^2\}$ represent the pixel values of two corresponding $N\times N$ patches derived from $G$ and $\mathcal{M}$, respectively. The SSIM loss is then defined as:

    $L_{\text{SSIM}} = 1 - \dfrac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)},$    (4)

    where $\mu_x$, $\mu_y$ and $\sigma_x$, $\sigma_y$ are the means and standard deviations of $\mathbf{x}$ and $\mathbf{y}$, respectively, and $\sigma_{xy}$ is their covariance. $C_1$ and $C_2$ are small constants used to avoid division by zero.

  • CE loss: semantic-aware supervision, which is used to learn a better semantic representation:

    $L_{\text{CE}} = -\sum_{c=1}^{N} y_{o,c}\log(p_{o,c}),$    (5)

    where $N$ is the number of classes, $y_{o,c}$ indicates whether class label $c$ is the correct classification for observation $o$, and $p_{o,c}$ denotes the predicted probability that $o$ belongs to class $c$.
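For concreteness, a compact sketch of the combined objective of Eq. (1) is given below. It assumes the prediction is already a sigmoid probability map and uses mean reduction; the window-based SSIM uses simple average-pooling statistics rather than the Gaussian windows of [35], so the constants and window size are illustrative.

```python
import torch
import torch.nn.functional as F

def iou_loss(pred, gt, eps=1e-6):
    """Region-level IoU loss on probability maps (Eq. 3)."""
    inter = (pred * gt).sum(dim=(-2, -1))
    union = (pred + gt - pred * gt).sum(dim=(-2, -1))
    return (1 - inter / (union + eps)).mean()

def ssim_loss(pred, gt, C1=0.01 ** 2, C2=0.03 ** 2, win=11):
    """Single-scale SSIM loss over local windows (Eq. 4), simplified with avg-pooling."""
    mu_x = F.avg_pool2d(pred, win, 1, win // 2)
    mu_y = F.avg_pool2d(gt, win, 1, win // 2)
    sigma_x = F.avg_pool2d(pred * pred, win, 1, win // 2) - mu_x ** 2
    sigma_y = F.avg_pool2d(gt * gt, win, 1, win // 2) - mu_y ** 2
    sigma_xy = F.avg_pool2d(pred * gt, win, 1, win // 2) - mu_x * mu_y
    ssim = ((2 * mu_x * mu_y + C1) * (2 * sigma_xy + C2)) / \
           ((mu_x ** 2 + mu_y ** 2 + C1) * (sigma_x ** 2 + sigma_y ** 2 + C2))
    return (1 - ssim).mean()

def hybrid_loss(pred, gt, cls_logits, cls_labels, w=(30.0, 0.5, 10.0, 5.0)):
    """Weighted sum of BCE, IoU, SSIM, and CE losses (Eq. 1)."""
    l_bce = F.binary_cross_entropy(pred, gt)       # pixel level (mean-reduced)
    l_iou = iou_loss(pred, gt)                     # region level
    l_ssim = ssim_loss(pred, gt)                   # boundary level
    l_ce = F.cross_entropy(cls_logits, cls_labels) # semantic level
    return w[0] * l_bce + w[1] * l_iou + w[2] * l_ssim + w[3] * l_ce
```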

3.6 Training Strategies Tailored for DIS

Due to the high cost of training models on HR data, we have explored training tricks for HR segmentation tasks to improve performance and reduce training costs.

First, we found that our model converges relatively quickly in localizing targets and segmenting rough structures (measured by the F-measure [1] and S-measure [7]) on DIS5K (e.g., within 200 epochs). However, the performance in segmenting fine parts keeps increasing even after very long training (e.g., 400 epochs), which is reflected in metrics such as $F_\beta^\omega$ and $HCE_\gamma$. Second, although long training can achieve good results in terms of both structure and edges, it consumes too much computation; we found that multi-stage supervision can dramatically accelerate the learning of fine details and enables the model to achieve similar performance with only 30% of the training epochs. Third, we also found that fine-tuning with only region-level losses can easily improve the binarization of the predicted results and the metric scores (e.g., $F_\beta^\omega$, $E_\phi^m$, $HCE$) that are closer to practical use. Finally, we use context feature fusion and image pyramid inputs on the backbone, which are commonly used tricks for processing HR images with deep models. In our experiments, these two modifications to the backbone yield a general improvement on DIS and similar HR segmentation tasks.
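A minimal sketch of how such multi-stage supervision can be applied to the intermediate maps $\{\mathcal{M}_i\}$ is given below; equal weighting of the stages and bilinear upsampling to the GT size are assumptions of this sketch, not a description of the exact implementation.

```python
import torch
import torch.nn.functional as F

def multi_stage_loss(stage_logits, gt, loss_fn):
    """Sum the supervision loss over every intermediate prediction, e.g.
    [M_3, M_2, M_1, final map], after upsampling each to the GT resolution."""
    total = 0.0
    for logits in stage_logits:
        logits = F.interpolate(logits, size=gt.shape[-2:], mode='bilinear', align_corners=False)
        total = total + loss_fn(torch.sigmoid(logits), gt)
    return total

# Region-level fine-tuning can reuse the same loop with only a region loss
# (e.g., the IoU loss from the sketch above) as loss_fn.
```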

Tab. 1 shows the effect of the number of training epochs and of multi-stage supervision. As the results show, our BiRefNet achieves relatively good results after 200 epochs of training. Continuing training to 400 epochs slightly improves the metrics measuring structural information (e.g., $F_\beta^x$, $S_m$), while bringing a larger improvement in metrics measuring fine details (e.g., $HCE_\gamma$).

Although simple long training can achieve better results, the improvement is relatively small considering its high computational cost on HR data. We therefore investigated multi-stage supervision (MSS), a widely used training strategy in binary segmentation works [54, 37]. Unlike MSS in those works, which targets higher precision, here it serves to accelerate training convergence. As the results in Tab. 1 show, our BiRefNet trained for 200 epochs with MSS achieves performance similar to that of the model trained for 400 epochs. MSS thus cuts the training time in half and can be adopted in further HR segmentation tasks for more efficient training.

Table 1: Quantitative ablation studies of the proposed multi-stage supervision for acceleration and training epochs.
Settings DIS-VD
MSS Epoch $F_\beta^x$↑ $F_\beta^\omega$↑ $\mathcal{M}$↓ $S_m$↑ $E_\phi^m$↑ $HCE_\gamma$↓
✗ 200 .875 .848 .041 .886 .914 1207
✗ 400 .897 .863 .036 .905 .937 1039
✓ 200 .892 .858 .037 .901 .932 1043

4 Experiments

4.1 Datasets

Training Sets. For DIS, we follow [37, 58, 34] to use DIS5K-TR as our training set in experiments. For HRSOD, we follow [46] to set different combinations of HRSOD, UHRSD, and DUTS as the training set. For COD, we follow [10, 18] to use the concealed samples in CAMO-TR and COD10K-TR as the training set.

Test Sets. To obtain a complete evaluation of our BiRefNet, we tested it on all test sets in DIS5K (DIS-TE1, DIS-TE2, DIS-TE3, and DIS-TE4). We also conducted an evaluation of BiRefNet on the HRSOD test sets (DAVIS-S [53], HRSOD-TE [53], and UHRSD-TE [46]) and the COD test sets (CAMO-TE [25], COD10K-TE [9], and NC4K [31]). Low-resolution SOD test sets (DUTS-TE [44] and DUT-OMRON [48]) are additionally used for supplementary experiments.

4.2 Evaluation Protocol

For a comprehensive evaluation, we employ the widely used metrics, i.e., S-measure [7] ($S_m$), max/mean/weighted F-measure [1] ($F_\beta^x$/$F_\beta^m$/$F_\beta^\omega$), max/mean E-measure [8] ($E_\xi^x$/$E_\xi^m$), mean absolute error (MAE), and relaxed HCE [37] ($HCE_\gamma$). Detailed descriptions of these metrics are given below, followed by a small reference sketch of two of them.

  • S-measure [7] (structure measure, $S_\alpha$) measures the structural similarity between a saliency map and its corresponding GT map. It can be computed quickly and without binarization. $S_\alpha$ is computed as:

    $S_\alpha = \alpha \cdot S_o + (1-\alpha)\cdot S_r,$    (6)

    where $S_o$ and $S_r$ denote object-aware and region-aware structural similarity, and $\alpha$ is set to 0.5 by default, as suggested by Fan et al. in [7].

  • F-measure [1] ($F_\beta$) is the weighted harmonic mean of precision and recall. The predicted saliency map is binarized with different thresholds to obtain a set of binary predictions, which are compared with the GT maps to obtain precision and recall values. The F-measure is computed as:

    $F_\beta = \dfrac{(1+\beta^2)\cdot Precision \cdot Recall}{\beta^2 \cdot Precision + Recall},$    (7)

    where $\beta^2$ is set to 0.3 to emphasize precision over recall, following [2]. The maximum F-measure score obtained with the best threshold for the entire dataset is used and denoted as $F_\beta^x$.

  • E-measure [8] (enhanced-alignment measure, $E_\xi$) is designed as a perceptual metric to evaluate the similarity between the predicted maps and the GT maps both locally and globally. The E-measure is defined as:

    $E_\xi = \dfrac{1}{WH}\sum_{x=1}^{W}\sum_{y=1}^{H}\phi_\xi(x,y),$    (8)

    where $\phi_\xi$ denotes the enhanced alignment matrix. Similarly to the F-measure, we adopt both the maximum E-measure ($E_\xi^x$) and the mean E-measure ($E_\xi^m$) as evaluation metrics.

  • MAE (mean absolute error, $\epsilon$) is a simple pixel-level evaluation metric that measures the absolute difference between the non-binarized predicted results $\mathcal{M}$ and the GT maps $G$. It is defined as:

    $\epsilon = \dfrac{1}{WH}\sum_{x=1}^{W}\sum_{y=1}^{H}\big|\mathcal{M}(x,y) - G(x,y)\big|.$    (9)
  • $\mathbf{HCE_\gamma}$ [37] (human correction efforts, $HCE$) is a newly proposed metric that evaluates the human effort required to correct faulty predictions to meet specific accuracy requirements in real-world applications. Specifically, $HCE$ is quantified by the approximate number of mouse clicks. Since minor prediction errors can be tolerated in practical applications, the relaxed $HCE$ ($HCE_\gamma$) is introduced, where $\gamma$ denotes the tolerance. In experiments, we use $HCE_5$ (relaxed $HCE$ with $\gamma=5$) to stay consistent with the original paper [37], where detailed descriptions of $HCE_\gamma$ are provided.
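To make the protocol concrete, here is a small NumPy sketch of two of these metrics (MAE and the max F-measure). Sweeping the threshold per image, as done here, is a simplification of the dataset-level best threshold described above; both functions expect arrays in [0, 1] at the GT resolution.

```python
import numpy as np

def mae(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean absolute error between a non-binarized prediction and the GT (Eq. 9)."""
    return float(np.abs(pred - gt).mean())

def max_f_measure(pred: np.ndarray, gt: np.ndarray, beta2: float = 0.3, steps: int = 255) -> float:
    """Max F-measure: sweep binarization thresholds and keep the best score (Eq. 7)."""
    gt_bin = gt > 0.5
    best = 0.0
    for t in np.linspace(0.0, 1.0, steps):
        binarized = pred >= t
        tp = np.logical_and(binarized, gt_bin).sum()
        precision = tp / max(binarized.sum(), 1)
        recall = tp / max(gt_bin.sum(), 1)
        f = (1 + beta2) * precision * recall / max(beta2 * precision + recall, 1e-8)
        best = max(best, float(f))
    return best
```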

Table 2: Quantitative ablation studies of the components of the proposed BiRefNet, including the reconstruction module (RM), inward reference (InRef), outward reference (OutRef), and their combinations.
Modules DIS-VD
RM InRef OutRef $F_\beta^x$↑ $F_\beta^\omega$↑ $\mathcal{M}$↓ $S_m$↑ $E_\phi^m$↑ $HCE_\gamma$↓
.837 .785 .056 .845 .887 1204
.855 .831 .048 .865 .895 1167
.848 .825 .050 .857 .903 1152
.869 .834 .041 .886 .912 1093
.863 .831 .042 .891 .918 1106
.861 .839 .044 .881 .911 1114
.889 .851 .038 .900 .924 1065
Table 3: Effectiveness of practical strategies for training high-resolution segmentation. An experimental comparison of the proposed tricks for HR segmentation tasks is provided, including context feature fusion (CFF), image pyramid inputs (IPT), region-level loss fine-tuning (RLFT), and their combinations. The results are obtained with our final model.
Modules DIS-VD
CFF IPT RLFT $F_\beta^x$↑ $F_\beta^\omega$↑ $\mathcal{M}$↓ $S_m$↑ $E_\phi^m$↑ $HCE_\gamma$↓
.889 .851 .038 .900 .924 1065
.893 .856 .038 .904 .928 1054
.895 .857 .037 .904 .927 1051
.890 .861 .036 .899 .932 1043
.897 .863 .036 .905 .937 1039

4.3 Implementation Details

All images are resized to 1024$\times$1024 for training and testing. The generated segmentation maps are resized (via bilinear interpolation) to the original size of the corresponding GT maps for evaluation. Horizontal flip is the only data augmentation used during training. The number of categories $C$ is set to 219, as given in DIS-TR. The proposed BiRefNet is trained with the Adam optimizer [22] for 600/150/150 epochs on the DIS/HRSOD/COD tasks, respectively. The model is fine-tuned with the IoU loss for the last 20 epochs. The initial learning rate is set to $10^{-4}$ for DIS and $10^{-5}$ for the others. Models are trained with PyTorch [33] on eight NVIDIA A100 GPUs. The batch size is set to $N$=4 per GPU during training.
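The following snippet sketches this configuration; the convolution is only a placeholder for the real network, and in practice the resize/flip must be applied jointly to each image and its GT map.

```python
import torch
from torch import nn, optim
from torchvision import transforms

# Illustrative setup following Sec. 4.3 (placeholders, not the exact training script).
image_size = 1024
transform = transforms.Compose([
    transforms.Resize((image_size, image_size)),
    transforms.RandomHorizontalFlip(p=0.5),      # the only augmentation used
    transforms.ToTensor(),
])
model = nn.Conv2d(3, 1, 3, padding=1)            # placeholder for BiRefNet
optimizer = optim.Adam(model.parameters(), lr=1e-4)   # 1e-5 for HRSOD/COD
epochs, batch_per_gpu = 600, 4                   # DIS schedule; last 20 epochs: IoU-only fine-tuning
```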

4.4 Ablation Study

We study the effectiveness of each component (i.e., RM and BiRef) and practical strategies (i.e., CFF, IPT, and RLFT) introduced for our BiRefNet and conduct an investigation about their contributions to improved DIS results. Quantitative results regarding each module and strategy are shown in Tab. 2 and 3, respectively.

Figure 5: Quantitative comparisons of the proposed BiRefNet and the best task-specific models. S-measure [7] is used for the comparison here. UDUN [34], FSPNet [18], PGNet-UH [46], and PGNet-DH [46] are currently the best models for the DIS, COD, HRSOD, and SOD tasks, respectively.
[Figure 6 layout: rows DIS-TE1, DIS-TE2, DIS-TE3, DIS-TE4, DIS-VD; columns Image, GT, Ours, UDUN, IS-Net, U2Net, HRNet.]
Figure 6: Qualitative comparisons of the proposed BiRefNet and previous methods on the DIS5K benchmark. The results of the previous methods are from [34], where all models are trained with 1024×1024 images. Zoom in for a better view.
Table 4: Quantitative comparisons between our BiRefNet and the state-of-the-art methods on DIS5K. “↑” (“↓”) means that higher (lower) is better. We use the results from [34], where all methods take 1024×1024 input.
Methods DIS-TE1 (500) DIS-TE2 (500) DIS-TE3 (500)
Metrics per test set: $F_\beta^x$↑ $F_\beta^\omega$↑ $\mathcal{M}$↓ $S_m$↑ $E_\phi^m$↑ $HCE_\gamma$↓
BASNet19 [35] .663 .577 .105 .741 .756 155 .738 .653 .096 .781 .808 341 .790 .714 .080 .816 .848 681
U2Net20 [36] .701 .601 .085 .762 .783 165 .768 .676 .083 .798 .825 367 .813 .721 .073 .823 .856 738
HRNet20 [43] .668 .579 .088 .742 .797 262 .747 .664 .087 .784 .840 555 .784 .700 .080 .805 .869 1049
PGNet22 [46] .754 .680 .067 .800 .848 162 .807 .743 .065 .833 .880 375 .843 .785 .056 .844 .911 797
IS-Net22 [37] .740 .662 .074 .787 .820 149 .799 .728 .070 .823 .858 340 .830 .758 .064 .836 .883 687
FP-DIS23 [58] .784 .713 .060 .821 .860 160 .827 .767 .059 .845 .893 373 .868 .811 .049 .871 .922 780
UDUN23 [34] .784 .720 .059 .817 .864 140 .829 .768 .058 .843 .886 325 .865 .809 .050 .865 .917 658
BiRefNet .860 .819 .037 .885 .911 106 .894 .857 .036 .900 .930 266 .925 .893 .028 .919 .955 569
BiRefNetSwinB .857 .819 .038 .884 .912 110 .890 .854 .037 .898 .930 275 .919 .886 .030 .915 .953 597
BiRefNetSwinT .823 .774 .048 .855 .887 117 .862 .821 .046 .877 .912 290 .899 .860 .036 .897 .942 627
BiRefNetPVTv2b2 .839 .796 .042 .870 .903 111 .881 .842 .040 .888 .925 280 .903 .866 .036 .901 .941 614
Methods DIS-TE4 (500) DIS-TE (1-4) (2,000) DIS-VD (470)
Metrics per test set: $F_\beta^x$↑ $F_\beta^\omega$↑ $\mathcal{M}$↓ $S_m$↑ $E_\phi^m$↑ $HCE_\gamma$↓
BASNet19 [35] .785 .713 .087 .806 .844 2852 .744 .664 .092 .786 .814 1007 .737 .656 .094 .781 .809 1132
U2Net20 [36] .800 .707 .085 .814 .837 2898 .771 .676 .082 .799 .825 1042 .753 .656 .089 .785 .809 1139
HRNet20 [43] .772 .687 .092 .792 .854 3864 .743 .658 .087 .781 .840 1432 .726 .641 .095 .767 .824 1560
PGNet22 [46] .831 .774 .065 .841 .899 3361 .809 .746 .063 .830 .885 1173 .798 .733 .067 .824 .879 1326
IS-Net22 [37] .827 .753 .072 .830 .870 2888 .799 .726 .070 .819 .858 1016 .791 .717 .074 .813 .856 1116
FP-DIS23 [58] .846 .788 .061 .852 .906 3347 .831 .770 .047 .847 .895 1165 .823 .763 .062 .843 .891 1309
UDUN23 [34] .846 .792 .059 .849 .901 2785 .831 .772 .057 .844 .892 977 .823 .763 .059 .838 .892 1097
BiRefNet .904 .864 .039 .900 .939 2723 .896 .858 .035 .901 .934 916 .891 .854 .038 .898 .931 989
BiRefNetSwinB .899 .860 .040 .895 .938 2836 .891 .855 .036 .898 .933 954 .881 .844 .039 .890 .925 1029
BiRefNetSwinT .880 .834 .049 .878 .925 2888 .866 .822 .045 .877 .916 980 .862 .819 .045 .874 .917 1070
BiRefNetPVTv2b2 .890 .846 .045 .886 .929 2871 .878 .838 .041 .886 .925 969 .868 .827 .044 .880 .919 1073

Baseline. We provide a simple but strong encoder-decoder network as the baseline for the DIS task. To capture better hierarchical features at various scales, we chose the Swin Transformer large [30] as our default backbone network. Then, to obtain a better semantic representation for the DIS task, we divided the images in DIS-TR into 219 classes according to their label names and added an auxiliary classification head at the end of the encoder. In the decoder of the baseline network, each decoder block is made up of two residual blocks [16]. All stages of the encoder and decoder are connected with a 1×1 convolution, except the deepest stage, where an ASPP [3] block is used. With this setup, our baseline network already outperforms existing DIS models in most metrics, as shown in Tab. 2 and Tab. 4.

Reconstruction Module. As shown in Tab. 2, our model gains an overall improvement with the proposed RM. The RM provides multi-scale receptive fields on the HR features for both local details and overall semantics. It brings a ∼2.2% relative improvement in $F_\beta^x$ with little extra computational cost.

Bilateral Reference. We separately investigate the effectiveness of the inward reference (InRef, with source images) and the outward reference (OutRef, with gradient labels) in BiRef. InRef supplements lossless HR information globally, while OutRef draws more attention to the fine-detail parts to achieve higher precision in those areas. As shown in Tab. 2, they work jointly to bring a 2.9% relative improvement in $F_\beta^x$ to BiRefNet. RM and BiRef combined achieve a 6.2% relative improvement in $F_\beta^x$.

Training Strategies. As shown in Tab. 3, the proposed strategies improve performance from different perspectives. CFF and IPT improve overall performance, while RLFT specifically improves precision in edge details, which is reflected in metrics such as $F_\beta^\omega$ and $HCE_\gamma$.

4.5 State-of-the-Art Comparison

To validate the general applicability of our method, we conduct extensive experiments on four tasks, i.e., high-resolution dichotomous image segmentation (DIS), high-resolution salient object detection (HRSOD), concealed object detection (COD), and salient object detection (SOD). We compare our proposed BiRefNet with all the latest task-specific models on existing benchmarks [37, 53, 46, 44, 48, 25, 10, 31].

Table 5: Quantitative comparisons between our BiRefNet and the state-of-the-art methods on high-resolution and low-resolution SOD datasets. TR denotes the training set. To provide a fair comparison, we train our BiRefNet with different combinations of training sets, where 1, 2, and 3 represent DUTS [44], HRSOD [53], and UHRSD [46], respectively.
Test Sets High-Resolution Benchmarks Low-Resolution Benchmarks
Methods TR DAVIS-S (92) HRSOD-TE (400) UHRSD-TE (988) DUTS-TE (5,019) DUT-OMRON (5,168)
Metrics per test set: $S_m$↑ $F_\beta^x$↑ $E_\phi^m$↑ $\mathcal{M}$↓
LDF20 [45] 1 .922 .911 .947 .019 .904 .904 .919 .032 .888 .913 .891 .047 .892 .898 .910 .034 .838 .820 .873 .051
HRSOD19 [53] 1,2 .876 .899 .955 .026 .896 .905 .934 .030 - - - - .824 .835 .885 .050 .762 .743 .831 .065
DHQ21 [42] 1,2 .920 .938 .947 .012 .920 .922 .947 .022 .900 .911 .905 .039 .894 .900 .919 .031 .836 .820 .873 .045
PGNet22 [46] 1 .935 .936 .947 .015 .930 .931 .944 .021 .912 .931 .904 .037 .911 .917 .922 .027 .855 .835 .887 .045
PGNet22 [46] 1,2 .948 .950 .975 .012 .935 .937 .946 .020 .912 .935 .905 .036 .912 .919 .925 .028 .858 .835 .887 .046
PGNet22 [46] 2,3 .954 .957 .979 .010 .938 .945 .946 .020 .935 .949 .916 .026 .859 .871 .897 .038 .786 .772 .884 .058
BiRefNet 1 .967 .966 .984 .008 .957 .958 .972 .014 .931 .933 .943 .030 .939 .937 .958 .019 .868 .813 .878 .040
BiRefNet 1,2 .973 .976 .990 .006 .962 .963 .976 .011 .937 .942 .951 .024 .938 .935 .960 .018 .868 .818 .882 .040
BiRefNet 1,3 .975 .977 .989 .006 .959 .958 .972 .014 .952 .960 .965 .019 .942 .942 .961 .018 .881 .837 .896 .036
BiRefNet 2,3 .976 .980 .990 .006 .956 .953 .967 .016 .952 .958 .964 .019 .933 .928 .954 .020 .864 .810 .879 .040
BiRefNet 1,2,3 .975 .979 .989 .006 .962 .961 .973 .013 .957 .963 .969 .016 .944 .943 .962 .018 .882 .839 .896 .038
Table 6: Comparison of BiRefNet with recent methods on three COD benchmarks. As seen, BiRefNet performs much better than previous methods.
Methods CAMO (250) COD10K (2,026) NC4K (4,121)
Metrics per test set: $S_m$↑ $F_\beta^\omega$↑ $F_\beta^m$↑ $E_\phi^m$↑ $E_\phi^x$↑ $\mathcal{M}$↓
SINet20 [9] .751 .606 .675 .771 .831 .100 .771 .551 .634 .806 .868 .051 .808 .723 .769 .871 .883 .058
BGNet22 [40] .812 .749 .789 .870 .882 .073 .831 .722 .753 .901 .911 .033 .851 .788 .820 .907 .916 .044
SegMaR22 [20] .815 .753 .795 .874 .884 .071 .833 .724 .757 .899 .906 .034 .841 .781 .820 .896 .907 .046
ZoomNet22 [32] .820 .752 .794 .878 .892 .066 .838 .729 .766 .888 .911 .029 .853 .784 .818 .896 .912 .043
SINetv222 [10] .820 .743 .782 .882 .895 .070 .815 .680 .718 .887 .906 .037 .847 .770 .805 .903 .914 .048
FEDER23 [14] .802 .738 .781 .867 .873 .071 .822 .716 .751 .900 .905 .032 .847 .789 .824 .907 .915 .044
HitNet23 [17] .849 .809 .831 .906 .910 .055 .871 .806 .823 .935 .938 .023 .875 .834 .853 .926 .929 .037
FSPNet23 [18] .856 .799 .830 .899 .928 .050 .851 .735 .769 .895 .930 .026 .879 .816 .843 .915 .937 .035
BiRefNet .904 .890 .904 .954 .959 .030 .913 .874 .888 .960 .967 .014 .914 .894 .909 .953 .960 .023
[Figure 7 layout: rows Tiny, Slim, Occluded, Multiple; columns Input, GT, Ours, HitNet, FSPNet, ZoomNet, SINetv2.]
Figure 7: Visual comparisons of the proposed BiRefNet and other competitors on COD10K benchmark. Samples with different challenges are provided here to show the superiority of BiRefNet from different perspectives.
Table 7: Comparison of different DIS methods in terms of performance, efficiency, and model complexity. Full details are available at https://drive.google.com/drive/u/0/folders/1s2Xe0cjq-2ctnJBR24563yMSCOu4CcxM.
Model Runtime (ms) #Params (M) MACs (G) DIS-TEs ($HCE_\gamma$↓, $F_\beta^\omega$↑)
BiRefNetSwinL 83.3 215 1143 916, .858
BiRefNetSwinL_cp 78.3 215 1143 916, .858
BiRefNetSwinB 61.4 101 561 954, .855
BiRefNetSwinT 40.9 39 231 980, .822
BiRefNetPVTv2b2 47.8 35 195 969, .838
BiRefNetPVTv2b1 36.6 23 147 978, .817
BiRefNetPVTv2b0 32.9 11 89 1013, .806
IS-Net 16.0 44 160 1016, .726
UDUNRes50 33.5 25 142 977, .772
Figure 8: Comparison of the efficiency, size, complexity, and performance of BiRefNet and existing DIS methods.

Quantitative Results. Tab. 4 shows a quantitative comparison between the proposed BiRefNet and previous state-of-the-art methods. Our BiRefNet outperforms all previous methods on the widely used metrics. The complexities of DIS-TE1∼DIS-TE4 are in ascending order. The metrics for structural similarity (e.g., $S_\alpha$, $E_\phi^x$) focus more on global information, pixel-level metrics such as MAE ($\mathcal{M}$) emphasize the precision of details, and metrics based on mean values (e.g., $E_\phi^m$, $F_\beta^m$) better match the requirements of practical applications where maps are thresholded. As seen in Tab. 4, our BiRefNet outperforms previous methods not only in the accuracy of the global shape but also in pixel-level details. Notably, the gains are especially large in the metrics that cater more to practical applications.

Additionally, our BiRefNet outperforms existing task-specific models on the HRSOD and COD tasks. As shown in Tab. 5, BiRefNet achieves much higher accuracy on both high-resolution and low-resolution SOD benchmarks. Compared with the previous SOTA method [46], our BiRefNet achieves an average improvement of 2.0% in $S_m$. Furthermore, as shown in Tab. 6, on the COD task BiRefNet also performs much better than the previous SOTA models, with an average improvement of 5.6% in $S_m$ on the three widely used COD benchmarks. These results show the remarkable generalization ability of our BiRefNet to similar HR tasks.

For a clearer illustration of the generalizability and strong performance of BiRefNet, we provide a radar chart in Fig. 5, where we run our model and the best task-specific models on the DIS/HRSOD/COD/SOD tasks. As the results show, our BiRefNet achieves leading results on all four tasks, whereas the task-specific models show weaknesses on the other, similar HR segmentation tasks. For example, FSPNet [18] ranks second on the COD benchmarks but only fourth/third/third on the DIS/HRSOD/SOD tasks, respectively.

Qualitative Results. Fig. 6 shows segmentation maps produced by the most competitive existing DIS models and the proposed BiRefNet on samples from all test sets and the validation set. BiRefNet outperforms the previous DIS methods from two perspectives, i.e., locating the target objects and segmenting their details more accurately. For example, in the DIS-TE4 and DIS-TE2 samples, neighboring distractors attract the attention of other models and cause false positives, whereas our BiRefNet suppresses the distractors and accurately segments the target. In the DIS-TE3 and DIS-VD samples, BiRefNet is far more precise in areas rich in fine details. Compared with previous methods, our BiRefNet clearly segments slim shapes and curved edges.

We also provide a qualitative comparison on the COD task. Fig. 7 shows hard samples with different challenges. For example, in the row with the occluded frog, the frog is split into fragments by the branch that covers it, yet our BiRefNet accurately segments the scattered fragments almost identically to the GT map. In contrast, the other methods struggle to find all the fragments, let alone segment them precisely. For tiny and slim objects, BiRefNet is better at locating the right target, and it also shows superiority in finding multiple concealed objects.

Figure 9: Potential applications, selected third-party applications based on BiRefNet, and visual comparisons on social media. (A) Potential application #1: building crack detection for maintaining architectural health. (B) Potential application #2: highly accurate object extraction in high-resolution natural images. (C) A project by viperyl first packages our BiRefNet as a ComfyUI node, making this SOTA model easier for everyone to use. (D) ZHO also provides a ComfyUI-based project that further improves the UI for our BiRefNet, especially for video data. (E) Fal.AI hosts our BiRefNet online with more useful options in the UI and an API to call the model. (F) ZHO provides a visual comparison between our BiRefNet and the previous SOTA method BRIA RMBG v1.4, which was additionally trained on their private dataset. (G) Toyxyz compares our BiRefNet with previous competitive human matting methods (e.g., BRIAAI, RemBG, Robust Video Matting, and Person YOLOv8s) on both videos and images.

Efficiency and Complexity Comparison. We equip our BiRefNet with different backbones to obtain models of different sizes. Their runtime, number of parameters, MACs, and performance are tested to provide a comprehensive comparison with other methods. First, we provide a quantitative comparison in Tab. 7. The largest BiRefNet runs at more than 10 FPS, which is acceptable for most practical applications. We also compile BiRefNetSwinL with PyTorch 2.0 [33] (BiRefNetSwinL_cp), which accelerates its inference by 13%. In addition, we plot the performance and runtime of each model in Fig. 8 for a clearer display. BiRefNet with different backbones is evaluated and compared with existing DIS methods on DIS-TEs and DIS-VD; different methods are drawn in different colors and markers. All tests are conducted on a single NVIDIA A100 GPU and an AMD EPYC 7J13 CPU.
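The compiled variant can be reproduced, in spirit, with the PyTorch 2.0 compiler as sketched below; the module is only a placeholder for the real BiRefNetSwinL, and the measured 13% speed-up is specific to our hardware setup.

```python
import torch

model = torch.nn.Conv2d(3, 1, 3, padding=1).eval()   # placeholder for BiRefNetSwinL
compiled_model = torch.compile(model)                 # requires PyTorch >= 2.0
with torch.no_grad():
    x = torch.randn(1, 3, 1024, 1024)
    y = compiled_model(x)   # first call triggers compilation; later calls run the optimized graph
```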

5 Potential Applications

We envisage that the generated fine segmentation maps have the potential to be utilized in various practical applications.

Potential Application #1: Crack Detection. The condition of walls is important for the health of buildings [59]. However, segmentation models trained on commonly used datasets (e.g., COCO [28]) can only segment regular foreground objects. The proposed BiRefNet trained on the DIS5K dataset is more aware of fine details and can also segment targets with higher shape complexity. As shown in Fig. 9 (A), our BiRefNet can accurately find cracks in walls and help determine when to repair them.

Potential Application #2: Highly Accurate Object Extraction. Foreground object extraction and background removal have been popular applications in recent years. However, common methods fail to generate high-quality results when target objects have highly complex shapes [36, 35], or they require manual guidance (e.g., scribbles, points, or coarse masks) for more accurate segmentation [26, 5]. The proposed BiRefNet trained on DIS5K can generate results at much higher resolution and segment thin, hair-level threads without any mask, as shown in Fig. 9 (B). Such refined results may enable numerous successful downstream applications in the future.

6 Third-Party Creations

Since the release of our project on Mar 7, 2024, it has attracted much attention from researchers and developers in the community, who have promoted it spontaneously. Moreover, great third-party applications have been built on top of our BiRefNet. Due to the rapid growth of related works, we only list some typical ones.

#1 Practical Applications. Because of the excellent performance of our BiRefNet, more and more third-party applications have been created by developers in the community (https://github.com/comfyanonymous/ComfyUI, https://github.com/ZHO-ZHO-ZHO/ComfyUI-BiRefNet-ZHO). As shown in Fig. 9 (C and D), some developers have integrated our BiRefNet into ComfyUI as a node, which helps to matte foregrounds for better processing in subsequent Stable Diffusion models. For easier online access, Fal.AI has established an online demo of our BiRefNet running on an A6000 GPU (https://fal.ai/models/birefnet; we thank FAL.AI for providing additional computation resources for further exploration of more practical applications), as shown in Fig. 9 (E). In addition to the regular prediction of results, this online application also provides an API service for easy use via HTTP requests.

#2 Social Media. Recently, our BiRefNet has drawn attention from the community, and many posts have been made on the X platform (formerly Twitter) (https://twitter.com/search?q=birefnet&src=typed_query). ZHO provides a visual comparison between our BiRefNet and other methods, as given in Fig. 9 (F); BiRefNet achieves results competitive with the previous SOTA method BRIA RMBG v1.4 in their tests (https://twitter.com/ZHOZHO672070/status/1771026516388041038). It should be noted that our BiRefNet was trained on the training set of the open-source DIS5K dataset [37] under the MIT license, whereas the other model was trained on carefully selected private data and cannot be used commercially. As shown in Fig. 9 (G), more comparisons on both video and image data between our BiRefNet and previous strong foreground/human matting methods have been provided by Toyxyz on X (https://twitter.com/toyxyz3/status/1771413245267746952). In addition to these posts, PurzBeats has made an animation with our BiRefNet and uploaded the relevant videos (https://twitter.com/i/status/1772323682934775896). A video tutorial on how to use our BiRefNet in ComfyUI can also be found on YouTube, by ‘AI is in wonderland’, in Japanese (https://www.youtube.com/watch?v=o2_nMDUYk6s).

7 Conclusions

This work proposes the BiRefNet framework equipped with a bilateral reference, which can perform dichotomous image segmentation, high-resolution (HR) salient object detection, and concealed object detection within the same framework. Through comprehensive experiments, we find that unscaled source images and a focus on regions of rich information are vital for generating fine, detailed areas in HR images. To this end, we propose the bilateral reference to fill in the missing information in the fine parts (inward reference) and to guide the model to focus more on regions with richer details (outward reference), which significantly improves the model's ability to capture fine pixel-level features. To alleviate the high cost of training on HR data, we also provide various practical tricks that deliver higher-quality predictions and faster convergence. Competitive results on 13 benchmarks demonstrate the outstanding performance and strong generalization ability of our BiRefNet. We also show that the techniques of BiRefNet can be transferred to many practical applications. We hope that the proposed framework can encourage the development of unified models for various tasks in the academic community and that our model can empower and inspire the developer community to create more great works.

References

  • Achanta et al. [2009] Radhakrishna Achanta, Sheila Hemami, Francisco Estrada, and Sabine Susstrunk. Frequency-tuned salient region detection. In IEEE / CVF Computer Vision and Pattern Recognition Conference, 2009.
  • Borji et al. [2015] Ali Borji, Ming-Ming Cheng, Huaizu Jiang, and Jia Li. Salient object detection: A benchmark. IEEE Transactions on Image Processing, 24:5706–5722, 2015.
  • Chen et al. [2018] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In European Conference on Computer Vision Workshop, 2018.
  • Dai et al. [2017] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In IEEE / CVF International Conference on Computer Vision, 2017.
  • Dai et al. [2022] Linhui Dai, Xiang Song, Xiaohong Liu, Chengqi Li, Zhihao Shi, Martin Brooks, and Jun Chen. Enabling trimap-free image matting with a frequency-guided saliency-aware network via joint learning. IEEE Transactions on Multimedia, 25:4868–4879, 2022.
  • Deng et al. [2023] Xinhao Deng, Pingping Zhang, Wei Liu, and Huchuan Lu. Recurrent multi-scale transformer for high-resolution salient object detection. In ACM International Conference on Multimedia, 2023.
  • Fan et al. [2017] Deng-Ping Fan, Ming-Ming Cheng, Yun Liu, Tao Li, and Ali Borji. Structure-measure: A new way to evaluate foreground maps. In IEEE / CVF International Conference on Computer Vision, 2017.
  • Fan et al. [2018] Deng-Ping Fan, Cheng Gong, Yang Cao, Bo Ren, Ming-Ming Cheng, and Ali Borji. Enhanced-alignment measure for binary foreground map evaluation. In International Joint Conference on Artificial Intelligence, 2018.
  • Fan et al. [2020] Deng-Ping Fan, Ge-Peng Ji, Guolei Sun, Ming-Ming Cheng, Jianbing Shen, and Ling Shao. Camouflaged object detection. In IEEE / CVF Computer Vision and Pattern Recognition Conference, 2020.
  • Fan et al. [2022] Deng-Ping Fan, Ge-Peng Ji, Ming-Ming Cheng, and Ling Shao. Concealed object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(10):6024–6042, 2022.
  • Fan et al. [2023a] Deng-Ping Fan, Ge-Peng Ji, Peng Xu, Ming-Ming Cheng, Christos Sakaridis, and Luc Van Gool. Advances in deep concealed scene understanding. Visual Intelligence, 1(1):16, 2023a.
  • Fan et al. [2023b] Deng-Ping Fan, Jing Zhang, Gang Xu, Ming-Ming Cheng, and Ling Shao. Salient objects in clutter. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(2):2344–2366, 2023b.
  • Grosky and Jain [1986] William I Grosky and Ramesh Jain. A pyramid-based approach to segmentation applied to region matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(5):639–650, 1986.
  • He et al. [2023a] Chunming He, Kai Li, Yachao Zhang, Longxiang Tang, Yulun Zhang, Zhenhua Guo, and Xiu Li. Camouflaged object detection with feature decomposition and edge reconstruction. In IEEE / CVF Computer Vision and Pattern Recognition Conference, 2023a.
  • He et al. [2023b] Chunming He, Kai Li, Yachao Zhang, Yulun Zhang, Chenyu You, Zhenhua Guo, Xiu Li, Martin Danelljan, and Fisher Yu. Strategic preys make acute predators: Enhancing camouflaged object detectors by generating camouflaged objects. In International Conference on Learning Representations, 2023b.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE / CVF Computer Vision and Pattern Recognition Conference, 2016.
  • Hu et al. [2023] Xiaobin Hu, Shuo Wang, Xuebin Qin, Hang Dai, Wenqi Ren, Donghao Luo, Ying Tai, and Ling Shao. High-resolution iterative feedback network for camouflaged object detection. In AAAI Conference on Artificial Intelligence, 2023.
  • Huang et al. [2023] Zhou Huang, Hang Dai, Tian-Zhu Xiang, Shuo Wang, Huai-Xin Chen, Jie Qin, and Huan Xiong. Feature shrinkage pyramid for camouflaged object detection with transformers. In IEEE / CVF Computer Vision and Pattern Recognition Conference, 2023.
  • Ji et al. [2023] Ge-Peng Ji, Deng-Ping Fan, Yu-Cheng Chou, Dengxin Dai, Alexander Liniger, and Luc Van Gool. Deep gradient learning for efficient camouflaged object detection. Machine Intelligence Research, 20(1):92–108, 2023.
  • Jia et al. [2022] Qi Jia, Shuilian Yao, Yu Liu, Xin Fan, Risheng Liu, and Zhongxuan Luo. Segment, magnify and reiterate: Detecting camouflaged objects the hard way. In IEEE / CVF Computer Vision and Pattern Recognition Conference, 2022.
  • Kim et al. [2022] Taehun Kim, Kunhee Kim, Joonyeong Lee, Dongmin Cha, Jiho Lee, and Daijin Kim. Revisiting image pyramid structure for high resolution salient object detection. In Asian Conference on Computer Vision, 2022.
  • Kingma and Ba [2015] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
  • Lai et al. [2017] Wei-Sheng Lai, Jia-Bin Huang, Narendra Ahuja, and Ming-Hsuan Yang. Deep laplacian pyramid networks for fast and accurate super-resolution. In IEEE / CVF Computer Vision and Pattern Recognition Conference, 2017.
  • Lai et al. [2018] Wei-Sheng Lai, Jia-Bin Huang, Narendra Ahuja, and Ming-Hsuan Yang. Fast and accurate image super-resolution with deep laplacian pyramid networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(11):2599–2613, 2018.
  • Le et al. [2019] Trung-Nghia Le, Tam V Nguyen, Zhongliang Nie, Minh-Triet Tran, and Akihiro Sugimoto. Anabranch network for camouflaged object segmentation. Computer Vision and Image Understanding, 184:45–56, 2019.
  • Li et al. [2022] Jizhizi Li, Jing Zhang, Stephen J Maybank, and Dacheng Tao. Bridging composite and real: towards end-to-end deep image matting. International Journal of Computer Vision, 130(2):246–266, 2022.
  • Li et al. [2023] Xiaofei Li, Jiaxin Yang, Shuohao Li, Jun Lei, Jun Zhang, and Dong Chen. Locate, refine and restore: A progressive enhancement network for camouflaged object detection. In International Joint Conference on Artificial Intelligence, 2023.
  • Lin et al. [2014] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, 2014.
  • Lin et al. [2017] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In IEEE / CVF Computer Vision and Pattern Recognition Conference, 2017.
  • Liu et al. [2021] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In IEEE / CVF International Conference on Computer Vision, 2021.
  • Lv et al. [2021] Yunqiu Lv, Jing Zhang, Yuchao Dai, Aixuan Li, Bowen Liu, Nick Barnes, and Deng-Ping Fan. Simultaneously localize, segment and rank the camouflaged objects. In IEEE / CVF Computer Vision and Pattern Recognition Conference, 2021.
  • Pang et al. [2022] Youwei Pang, Xiaoqi Zhao, Tian-Zhu Xiang, Lihe Zhang, and Huchuan Lu. Zoom in and out: A mixed-scale triplet network for camouflaged object detection. In IEEE / CVF Computer Vision and Pattern Recognition Conference, 2022.
  • Paszke et al. [2019] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 2019.
  • Pei et al. [2023] Jialun Pei, Zhangjun Zhou, Yueming Jin, He Tang, and Pheng-Ann Heng. Unite-divide-unite: Joint boosting trunk and structure for high-accuracy dichotomous image segmentation. In ACM International Conference on Multimedia, 2023.
  • Qin et al. [2019] Xuebin Qin, Zichen Zhang, Chenyang Huang, Chao Gao, Masood Dehghan, and Martin Jagersand. Basnet: Boundary-aware salient object detection. In IEEE / CVF Computer Vision and Pattern Recognition Conference, 2019.
  • Qin et al. [2020] Xuebin Qin, Zichen Zhang, Chenyang Huang, Masood Dehghan, Osmar R Zaiane, and Martin Jagersand. U2-net: Going deeper with nested u-structure for salient object detection. Pattern Recognition, 106:107404, 2020.
  • Qin et al. [2022] Xuebin Qin, Hang Dai, Xiaobin Hu, Deng-Ping Fan, Ling Shao, et al. Highly accurate dichotomous image segmentation. In European Conference on Computer Vision, 2022.
  • Ronneberger et al. [2015] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer Assisted Interventions, 2015.
  • Shen et al. [2022] Tiancheng Shen, Yuechen Zhang, Lu Qi, Jason Kuen, Xingyu Xie, Jianlong Wu, Zhe Lin, and Jiaya Jia. High quality segmentation for ultra high-resolution images. In IEEE / CVF Computer Vision and Pattern Recognition Conference, 2022.
  • Sun et al. [2022] Yujia Sun, Shuo Wang, Chenglizhao Chen, and Tian-Zhu Xiang. Boundary-guided camouflaged object detection. In International Joint Conference on Artificial Intelligence, 2022.
  • Tang et al. [2021a] Chufeng Tang, Hang Chen, Xiao Li, Jianmin Li, Zhaoxiang Zhang, and Xiaolin Hu. Look closer to segment better: Boundary patch refinement for instance segmentation. In IEEE / CVF Computer Vision and Pattern Recognition Conference, 2021a.
  • Tang et al. [2021b] Lv Tang, Bo Li, Yijie Zhong, Shouhong Ding, and Mofei Song. Disentangled high quality salient object detection. In IEEE / CVF Computer Vision and Pattern Recognition Conference, 2021b.
  • Wang et al. [2020] Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong Liu, Yadong Mu, Mingkui Tan, Xinggang Wang, et al. Deep high-resolution representation learning for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(10):3349–3364, 2020.
  • Wang et al. [2017] Lijun Wang, Huchuan Lu, Yifan Wang, Mengyang Feng, Dong Wang, Baocai Yin, and Xiang Ruan. Learning to detect salient objects with image-level supervision. In IEEE / CVF Computer Vision and Pattern Recognition Conference, 2017.
  • Wei et al. [2020] Jun Wei, Shuhui Wang, Zhe Wu, Chi Su, Qingming Huang, and Qi Tian. Label decoupling framework for salient object detection. In IEEE / CVF Computer Vision and Pattern Recognition Conference, 2020.
  • Xie et al. [2022] Chenxi Xie, Changqun Xia, Mingcan Ma, Zhirui Zhao, Xiaowu Chen, and Jia Li. Pyramid grafting network for one-stage high resolution saliency detection. In IEEE / CVF Computer Vision and Pattern Recognition Conference, 2022.
  • Xu et al. [2017] Ning Xu, Brian Price, Scott Cohen, and Thomas Huang. Deep image matting. In IEEE / CVF Computer Vision and Pattern Recognition Conference, 2017.
  • Yang et al. [2013] Chuan Yang, Lihe Zhang, Huchuan Lu, Xiang Ruan, and Ming-Hsuan Yang. Saliency detection via graph-based manifold ranking. In IEEE / CVF Computer Vision and Pattern Recognition Conference, 2013.
  • Yang et al. [2018] Maoke Yang, Kun Yu, Chi Zhang, Zhiwei Li, and Kuiyuan Yang. Denseaspp for semantic segmentation in street scenes. In IEEE / CVF Computer Vision and Pattern Recognition Conference, 2018.
  • Yin et al. [2022] Bowen Yin, Xuying Zhang, Qibin Hou, Bo-Yuan Sun, Deng-Ping Fan, and Luc Van Gool. Camoformer: Masked separable attention for camouflaged object detection. arXiv preprint arXiv:2212.06570, 2022.
  • Yu et al. [2021] Qihang Yu, Jianming Zhang, He Zhang, Yilin Wang, Zhe Lin, Ning Xu, Yutong Bai, and Alan Yuille. Mask guided matting via progressive refinement network. In IEEE / CVF Computer Vision and Pattern Recognition Conference, 2021.
  • Yu et al. [2024] Qian Yu, Xiaoqi Zhao, Youwei Pang, Lihe Zhang, and Huchuan Lu. Multi-view aggregation network for dichotomous image segmentation. arXiv preprint arXiv:2404.07445, 2024.
  • Zeng et al. [2019] Yi Zeng, Pingping Zhang, Jianming Zhang, Zhe Lin, and Huchuan Lu. Towards high-resolution salient object detection. In IEEE / CVF Computer Vision and Pattern Recognition Conference, 2019.
  • Zhang et al. [2020] Zhao Zhang, Wenda Jin, Jun Xu, and Ming-Ming Cheng. Gradient-induced co-saliency detection. In European Conference on Computer Vision, 2020.
  • Zhao et al. [2017] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In IEEE / CVF Computer Vision and Pattern Recognition Conference, 2017.
  • Zhao et al. [2018] Hengshuang Zhao, Xiaojuan Qi, Xiaoyong Shen, Jianping Shi, and Jiaya Jia. Icnet for real-time semantic segmentation on high-resolution images. In European Conference on Computer Vision, 2018.
  • Zhong et al. [2022] Yijie Zhong, Bo Li, Lv Tang, Senyun Kuang, Shuang Wu, and Shouhong Ding. Detecting camouflaged object in frequency domain. In IEEE / CVF Computer Vision and Pattern Recognition Conference, 2022.
  • Zhou et al. [2023] Yan Zhou, Bo Dong, Yuanfeng Wu, Wentao Zhu, Geng Chen, and Yanning Zhang. Dichotomous image segmentation with frequency priors. In International Joint Conference on Artificial Intelligence, 2023.
  • Zou et al. [2018] Qin Zou, Zheng Zhang, Qingquan Li, Xianbiao Qi, Qian Wang, and Song Wang. Deepcrack: Learning hierarchical convolutional features for crack detection. IEEE Transactions on Image Processing, 28(3):1498–1512, 2018.