Towards Complex Backgrounds: A Unified Difference-Aware Decoder for Binary Segmentation
Abstract
Binary segmentation is used to distinguish objects of interest from the background, and is an active area of convolutional encoder-decoder network research. Current decoders are designed for specific objects on top of common backbones used as encoders, but they cannot deal with complex backgrounds. Inspired by the way human eyes detect objects of interest, a new unified dual-branch decoder paradigm named the difference-aware decoder is proposed in this paper to explore the difference between the foreground and the background and separate the objects of interest in optical images. The difference-aware decoder imitates the human eye in three stages using the multi-level features output by the encoder. In Stage A, the first branch decoder of the difference-aware decoder is used to obtain a guide map. The highest-level features are enhanced with a novel field expansion module and a dual residual attention module, and are combined with the lowest-level features to obtain the guide map. In Stage B, the other branch decoder adopts a middle feature fusion module to make trade-offs between textural details and semantic information and generate background-aware features. In Stage C, the proposed difference-aware extractor, consisting of a difference guidance module and a difference enhancement module, fuses the guide map from Stage A and the background-aware features from Stage B, to enlarge the differences between the foreground and the background and output a final detection result. To verify the performance of the proposed difference-aware decoder, we choose four well-known backbones, i.e., VGG, ResNet, Res2Net, and PVT, and four binary segmentation tasks, i.e., salient object detection, camouflaged object detection, polyp segmentation, and mirror detection, for comparative experiments. The results demonstrate that the difference-aware decoder can achieve a higher accuracy than the other state-of-the-art binary segmentation methods for these tasks. The source code will be available at https://github.com/Henryjiepanli/DAD.
Index Terms:
Dual-branch decoder, binary segmentation, salient object detection, camouflaged object detection, polyp segmentation, mirror detection, difference-aware.

1 Introduction
Binary segmentation is aimed at distinguishing pixels belonging to the foreground from the background, which is a problem that has attracted much attention in the field of computer vision. Binary segmentation has been extended to various tasks, such as salient object detection (SOD)[1], camouflaged object detection (COD)[2], polyp segmentation[3], and mirror detection[4]. It has also been applied in autonomous driving[5], robot navigation[6], disaster assessment[7], medical diagnosis (e.g., lung infection segmentation [8], polyp segmentation [3]), and even military applications[9].




In these tasks and applications, the foreground objects can be either similar to or very different from the background, making binary segmentation a difficult task. This task can be addressed with convolutional neural network (CNN)-based models[13], as these models have outstanding feature representation capabilities and show a strong generalization performance. In these networks, the encoder-decoder structure[14] has become a major paradigm, where the encoder extracts multi-level features by performing a series of convolution operations on an input image, and the decoder[15] is responsible for utilizing the multi-level features to generate the segmentation results. The commonly used encoders (which are also called backbones) include VGG[16], ResNet[17], MobileNet[18], Res2Net[19], and the recently proposed pyramid vision transformer (PVT)[20]. The low-level features [21] are rich in detailed information (texture, color, etc.), but their semantic information is insufficient. Meanwhile, the high-level features [21] are highly semantic, but are lacking in detailed information, due to the downsampling. Therefore, the key to binary segmentation is how to take advantage of these multi-level features in the decoder design, which is a problem that has been well studied over the past few years.
The general decoding strategy for binary segmentation is to aggregate the multi-level encoded features by expanding the receptive field ([22, 23, 24]), enriching the contextual information ([25, 26, 27, 28]), or refining the features ([29, 30]), to improve the final segmentation performance. Specifically, for the SOD task ([21, 31, 32, 4]), the decoder is designed to highlight the salient features during the decoding process, whereas, for the COD task ([33, 34, 35, 36]), the decoder simulates animal hunting in nature, with the procedure of searching for prey and then capturing the prey. However, as shown in Fig. 1, the complex backgrounds of the various binary segmentation tasks increase the difficulty of efficient feature extraction. In summary, although various decoding strategies have been specifically developed to highlight the object features from the encoder for the different binary segmentation tasks, how to design a general and efficient decoder that takes the various backgrounds into account remains a challenge.
In order to deal with the various complex backgrounds, we propose a unified dual-branch decoder paradigm named the difference-aware decoder (DAD) for the binary segmentation task. This is inspired by the procedure of how human eyes detect special objects [37]. Specifically, the eyes first observe the foreground objects and obtain a coarse detection map of the objects of interest. Secondly, the eyes pay substantial attention to the background area adjacent to the foreground objects. Finally, the differences between the foreground and the background are enlarged by the processing of the brain, and the real objects can be precisely outlined. To imitate the procedure of human eyes, on the basis of the multi-level features from the encoder, the proposed difference-aware decoder has the following three stages. In Stage A, the first branch decoder adopts the general decoding strategies to obtain a coarse guide map. In detail, the highest-level features are enhanced with a novel field expansion module (FEM) and a dual residual attention (DRA) module[29], and are then combined with the lowest-level features to obtain the coarse guide map. In Stage B, the other branch decoder adopts a middle feature fusion (MFF) module to trade off the textural details against the semantic information and obtain the background-aware features. In Stage C, the proposed difference-aware extractor (DAE), which consists of a difference guidance module (DGM) and a difference enhancement module (DEM), fuses the foreground guide map from Stage A and the background-aware features from Stage B, to enlarge the differences between the foreground and the background and output the final segmentation results. The proposed difference-aware decoder benefits from the differences between the guide map and the background-aware features, and can overcome the variable background while achieving a superior performance, as illustrated in Fig. 1.
The rest of this paper is organized as follows. Section 2 introduces the related work. Section 3 presents the full details of the proposed difference-aware decoder method. The comprehensive experimental results and the analysis, including a parametric analysis and ablation study, are provided in Section 4. Section 5 draws the conclusions from this study. Due to the page limitation, further details about the experiments are presented in the Supplementary Material.
2 RELATED WORKS
In this paper, we mainly review the development of deep learning methods for two typical binary segmentation tasks, i.e., SOD and COD. We also review the development of the dual-branch decoder for various applications. For more binary segmentation tasks, we refer the reader to [3],[38],[39],[40],[41],[12], and [4].
2.1 Salient Object Detection
CNNs can be used to locate the boundary of detected salient regions and conduct the segmentation. For example, Zhao et al.[42] utilized a CNN to simulate the saliency in the image by modeling the global context, and modeled the local context for the saliency prediction in refined regions; Li et al. [43] combined the multi-level features extracted by a CNN with a network with multiple fully connected layers, which can achieve a very high level of regression accuracy; and Liu et al.[44] utilized an end-to-end network named DHSNet to realize hierarchical detection of salient targets from global to local and from coarse to fine. Other researchers have focused on the deficiencies in the precise localization of high-level features and have proposed some improved approaches[45]. However, the CNN-based methods have some problems, such as blurred and inaccurate predictions near the boundaries of salient objects[46].

Subsequently, the fully convolutional network (FCN)[14] with encoder-decoder structure was successfully introduced for the SOD task. Based on the common encoders, researchers have developed various decoders to interpret the multi-level features. For example, Wu et al.[21] put forward a cascaded partial decoder to suppress the distractors in the features and improve the feature representation ability; Liu et al.[32] decoded the high-level features with the low-level features by proposing a feature aggregation model; Zhao et al.[1] designed a decoder to integrate the local edge information and global location information to obtain the salient edge features; Qin et al.[47] proposed a densely supervised encoder-decoder network and utilized a residual refinement module to refine the final saliency map; Pang et al.[48] proposed a decoder embedded with self-interaction modules to obtain more efficient multi-scale features from the integrated features; Wang et al.[49] used a pyramid attention structure to concentrate more on salient regions while considering multi-scale saliency information; and Zhao et al.[50] designed a decoder based on the consideration of the disparity of the contributions of the different encoder blocks. However, in general, most of the above methods can distinguish the objects of interest from a simple background, but they are incapable of discovering the objects of interest in the case of a complex background, especially for camouflaged objects. In this paper, we investigate how best to deal with complex backgrounds for various binary segmentation tasks.
2.2 Camouflaged Object Detection
The COD task is much more difficult than the SOD task. The development of encoder-decoder based COD methods can be classified into two aspects, i.e., encoders and decoders. With regard to encoders, ResNet[2] was the first to be introduced, which was followed by Res2Net[11] and PVT[51]. With regard to decoders, SINet[2] simulates the predation process in nature and searches for and identifies the potential objects of interest. PFNet[52] improves the search for potential objects with a global positioning module and the identification process with a focus module. In addition, Lv et al. [36] suggested that explicitly modeling the estimation of the conspicuousness of a camouflaged object against its surroundings can not only better explain animal camouflage and evolution, but can also provide guidance for designing more complex camouflage techniques. On this basis, Lv et al. [36] proposed a ranking-based COD decoder. Zhai et al.[53] designed a mutual graph learning (MGL) decoder model to generalize the idea of conventional mutual learning from regular grids to the graph domain. Liu et al. [54] integrated the multi-level features with informative attention coefficients and obtained multi-scale feature representations for exploiting the rich global context information. On the basis of transformer backbones, more advanced decoders have also been developed [51]. Although simulating the predation process is a relatively efficient approach for COD, it mainly focuses on the foreground camouflaged objects and ignores the contributions from the background. The proposed difference-aware decoder recognizes the importance of the difference between the foreground and the background, to imitate the object detection procedure of human eyes.
2.3 Dual-branch Decoder Architectures
The dual-branch decoder architecture has been introduced into various binary segmentation tasks, to obtain a better accuracy. For example, Wu et al.[21] designed a dual-branch decoder named the cascaded partial decoder, which utilizes the initial saliency map of the first branch to refine the features of the second branch. In addition, Fan et al.[2, 11] proposed a dual-branch decoder that uses a holistic attention mechanism to refine the segmentation map in the first decoder and output a more accurate result. Other works have utilized dual decoders for complementary tasks to improve the learning efficiency and generalization across different tasks. For example, Zhang et al.[55] utilized a joint task-recursive learning framework to refine the results of both semantic segmentation and monocular depth estimation through serialized task-level interactions; Zhen et al.[56] proposed a dual-branch decoder framework to fuse the feature maps generated for semantic segmentation and boundary detection; and Li et al. [35] used a dual-branch decoder to achieve both a higher-order similarity measure and network confidence estimation. Typically, the two branches of a dual-branch decoder act as different routes toward accomplishing the segmentation task.
In this paper, we also propose a unified dual-branch decoder architecture to imitate the object detection procedure of human eyes for binary segmentation. Specifically, one branch is used to discover the coarse foreground, while the other is used to exploit the background-aware features. Finally, the guide map and the background-aware features are fused to extract the difference-aware features for the final binary object segmentation.
3 METHODOLOGY
3.1 Overview
In this section, we describe the proposed difference-aware decoder in detail, as illustrated in Fig. 2. The process of how objects are detected by human eyes is imitated on the basis of the multi-level features from the backbone. Three stages are utilized to simulate the procedure[37] of how human eyes detect special objects. To be specific, Stage A outputs a guide map, which can represent the coarse foreground information; Stage B fuses the three middle-level features to obtain the background-aware features; and Stage C makes use of the guide map and the background-aware features to enhance the difference between the foreground and the background and obtain a refined detection map.

3.2 Stage A: Guide Map Generator
As presented in Fig. 2, the aim of Stage A is to obtain a guide map of the objects of interest, which is achieved by the guide map generator (GMG) in the difference-aware decoder. Fig. 3 illustrates the details of the proposed GMG architecture, which takes the highest- and lowest-layer encoded features as input. It has been reported that CNNs are apt to extract information mainly from much smaller regions within the receptive field[57], since the effective receptive field highlights the importance of the regions closer to the center and increases the insensitivity to small spatial shifts. However, for the binary segmentation task, although increasing the number of encoding layers can enlarge the receptive field, this is an inefficient way to cover the camouflaged/salient objects, which limits the final performance. We therefore propose the GMG to enlarge the receptive field as much as possible via atrous convolution[58], so as to cover the foreground objects while keeping the feature map resolution unchanged. To further enrich the detailed information of the objects, the lowest-layer encoding features are combined with the enlarged features to generate the guide map.
As shown in Fig. 3, on the basis of the multi-scale features extracted from the backbone, the FEM consists of four parallel paths to enlarge the receptive field. The first path is made up of a convolution block followed by four consecutive convolution blocks, whose dilation rates are 4, 8, 16, and 32, respectively. The second path is a simple convolution block that reduces the channel dimension of the input features. The third path is similar to the first, with the dilation rates of the convolution blocks replaced by 2, 4, 8, and 16 in turn. The outputs of the first three paths are then concatenated to fuse the features with sufficient contextual information, followed by a convolution block. Finally, a modified residual structure is adopted to enhance the final output of the FEM, which is implemented via a convolution block that matches the number of concatenated features. It is worth emphasizing that all the convolution blocks are composed of a convolutional layer, a batch normalization layer, and an activation layer.
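For concreteness, a minimal PyTorch sketch of the FEM structure described above is given below. The kernel sizes (1×1 reduction blocks and 3×3 atrous blocks) and the channel width are our assumptions, since the exact settings are not reproduced in this section.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Convolution + batch normalization + activation, as used throughout the decoder."""
    def __init__(self, in_ch, out_ch, k=3, dilation=1):
        super().__init__()
        pad = dilation * (k - 1) // 2  # keep the spatial resolution unchanged
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, k, padding=pad, dilation=dilation, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class FEM(nn.Module):
    """Field expansion module: three parallel paths plus a modified residual branch."""
    def __init__(self, in_ch, out_ch=32):
        super().__init__()
        # Path 1: a reduction block followed by four consecutive atrous blocks (rates 4, 8, 16, 32).
        self.path1 = nn.Sequential(
            ConvBlock(in_ch, out_ch, k=1),
            *[ConvBlock(out_ch, out_ch, k=3, dilation=d) for d in (4, 8, 16, 32)])
        # Path 2: a simple convolution block that only reduces the channel dimension.
        self.path2 = ConvBlock(in_ch, out_ch, k=1)
        # Path 3: the same as path 1, but with dilation rates 2, 4, 8, 16.
        self.path3 = nn.Sequential(
            ConvBlock(in_ch, out_ch, k=1),
            *[ConvBlock(out_ch, out_ch, k=3, dilation=d) for d in (2, 4, 8, 16)])
        self.fuse = ConvBlock(3 * out_ch, 3 * out_ch, k=3)
        # Modified residual branch: a convolution block matching the concatenated channel number.
        self.residual = ConvBlock(in_ch, 3 * out_ch, k=1)

    def forward(self, x):
        feat = torch.cat([self.path1(x), self.path2(x), self.path3(x)], dim=1)
        return self.fuse(feat) + self.residual(x)
```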




Previous works, such as ASPP[22], RFB[59], and DenseASPP[60], have proved the importance of enlarging the receptive field. In contrast, in the proposed approach, the dilation rate is increased over the consecutive convolution blocks to increase the receptive field of each path as much as possible. What is more, the two paths with simple convolution blocks can keep the original information in both the channel dimension and the spatial dimension. As shown in Fig. 4, the FEM makes effective use of this parallel structure to obtain a more appropriate receptive field.
Aiming at modeling long-range dependencies to capture rich contextual relationships for better feature representations with intra-class compactness, an attention module is further utilized to enhance the FEM output features with an enlarged receptive field[29]. In detail, the DRA module from[29] is introduced to enhance the highest-level features. Through this sequential operation of the FEM and DRA[29], the global information can be captured and the long-distance dependency between the foreground and the background can be established. This reduces the interference from extraneous information, which is of great benefit to binary segmentation, especially in some difficult tasks. Subsequently, the enhanced features obtained via the DRA module are combined with the lowest-layer encoded features to provide enough detailed information. The highest-level features are upsampled to the same size as the lowest-level features. These two types of features are then concatenated and processed with two convolution blocks, to finally obtain the guide map, which is further used for the guided loss calculation.
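Putting Stage A together, a hedged sketch of the GMG data flow might look as follows; it reuses the `ConvBlock` and `FEM` classes from the sketch above, treats the DRA module of [29] as a black box (an identity mapping here), and the channel widths are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GuideMapGenerator(nn.Module):
    """Stage A: enhance the highest-level features (FEM, then DRA) and fuse them with
    the lowest-level features to produce a single-channel guide map (logits)."""
    def __init__(self, high_ch, low_ch, mid_ch=96, dra_module=None):
        super().__init__()
        self.fem = FEM(high_ch, out_ch=mid_ch // 3)   # the FEM outputs 3 * (mid_ch // 3) channels
        self.dra = dra_module if dra_module is not None else nn.Identity()  # DRA module from [29]
        self.fuse = nn.Sequential(
            ConvBlock(mid_ch + low_ch, mid_ch),
            ConvBlock(mid_ch, mid_ch),
        )
        self.predict = nn.Conv2d(mid_ch, 1, kernel_size=1)

    def forward(self, f_low, f_high):
        x = self.dra(self.fem(f_high))                                  # enlarge the receptive field, then attend
        x = F.interpolate(x, size=f_low.shape[2:], mode='bilinear',     # upsample to the lowest-level size
                          align_corners=False)
        x = self.fuse(torch.cat([x, f_low], dim=1))                     # concatenate and apply two conv blocks
        return self.predict(x)                                          # guide map, supervised by the guided loss
```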
3.3 Stage B: Middle Feature Fusion
As presented in Fig. 2, Stage B involves extracting the background-aware features with the proposed MFF strategy, which takes the three middle-level features as inputs. As mentioned previously, in the object detection procedure of human eyes, the background information is important for outlining the foreground objects of interest. However, for the binary segmentation task, background extraction is a challenge, due to the complexity of the background compared to the foreground. From another perspective, the lowest-level encoding features are highly detailed, which makes the background they describe complex, whereas the highest-level encoding features are semantic and lose the detailed background information. We therefore choose the three middle-level encoding features for the background-aware feature extraction.
Differing from the mainstream top-down and bottom-up architectures, which resize the multi-level features to the biggest or the smallest size, we propose fusing the three features at the middle-level size. As shown in Fig. 5, the MFF module selects the middle size as the base size and resizes the other features to fit this base size. An upsampling operation is used to resize layer 4 to the size of layer 3. Convolution blocks are utilized to reduce the channels of layer 2, layer 3, and layer 4 to the same number (which is set to 32); the convolution block applied to layer 2 has a stride of 2, so it also acts as a downsampling operation. Before concatenating these three processed layers, the FEM is used to obtain a larger receptive field with richer contextual information. Finally, the concatenated three-layer features are utilized to represent the background-aware features.
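A hedged sketch of the MFF wiring is shown below (again reusing `ConvBlock` and `FEM`); the channel number of 32 follows the text, while the kernel sizes and the use of one FEM per resized layer are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MiddleFeatureFusion(nn.Module):
    """Stage B: resize layers 2 and 4 to the spatial size of layer 3, enlarge the receptive
    fields with the FEM, and concatenate the results as the background-aware features."""
    def __init__(self, ch2, ch3, ch4, out_ch=32):
        super().__init__()
        # The stride-2 convolution block also acts as the downsampling operation for layer 2.
        self.reduce2 = nn.Sequential(
            nn.Conv2d(ch2, out_ch, kernel_size=3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.reduce3 = ConvBlock(ch3, out_ch, k=1)
        self.reduce4 = ConvBlock(ch4, out_ch, k=1)
        self.fem2 = FEM(out_ch, out_ch)
        self.fem3 = FEM(out_ch, out_ch)
        self.fem4 = FEM(out_ch, out_ch)

    def forward(self, f2, f3, f4):
        f3 = self.reduce3(f3)
        f2 = self.reduce2(f2)                                            # downsampled to the layer-3 size
        f4 = F.interpolate(self.reduce4(f4), size=f3.shape[2:],          # upsampled to the layer-3 size
                           mode='bilinear', align_corners=False)
        return torch.cat([self.fem2(f2), self.fem3(f3), self.fem4(f4)], dim=1)
```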

3.4 Stage C: Difference-Aware Extractor
In the first two stages, the two decoder branches are used to obtain the guide map and the background-aware features. Therefore, in Stage C, as presented in Fig. 2, we focus on exploiting the differences between the foreground and the background for the final segmentation of the objects of interest, which is achieved by the proposed DAE. The DAE can take advantage of the prior knowledge in the guide map to guide the background-aware features to learn more subtle differences. As shown in Fig. 6, the DAE consists of a DGM and a DEM. The DGM fuses the guide map and the background-aware features to separate the foreground objects from the background. The DEM then separately learns the foreground and background features, and fuses them adaptively to generate the final refined segmentation map. The details of the DGM and DEM are presented in the following.

3.4.1 Difference Guidance Module
As presented in Fig. 6, the inputs of the DGM are the guide map (denoted as $G$) from the first branch and the background-aware features (denoted as $F_B$) from the second branch.
Firstly, the background-aware features $F_B$ are upsampled to the same spatial size as $G$, while a copy operation is performed on $G$ to extend the channel number of the guide map to that of $F_B$. The formulation is as follows:

$F_{B}^{'} = \mathrm{Up}(F_B), \quad G^{'} = \mathrm{Copy}(G), \qquad (1)$

where $\mathrm{Up}(\cdot)$ denotes the upsampling operation and $\mathrm{Copy}(\cdot)$ repeats the single-channel guide map along the channel dimension.
Secondly, a cross-attention operation is performed between $G^{'}$ and $F_{B}^{'}$. $G^{'}$ is reshaped and transposed to obtain $Q \in \mathbb{R}^{N \times C}$, and $F_{B}^{'}$ is reshaped to obtain $K \in \mathbb{R}^{C \times N}$, where $C$ is the channel number and $N = H \times W$ (with $H$ and $W$ the height and width of $G$). After this, the guidance is performed between $Q$ and $K$. In detail, matrix multiplication is performed between $Q$ and $K$ to obtain the map $A \in \mathbb{R}^{N \times N}$, which is then processed by softmax normalization to obtain probability values representing the relationship of the difference between the foreground and the background. The process can be summarized as:

$A = \mathrm{Softmax}(Q \times K). \qquad (2)$
Meanwhile, matrix multiplication is performed between the reshaped $F_{B}^{'}$ and $A$, and the result is reshaped to obtain the attention features $F_{att}$. Finally, a residual structure is introduced to obtain the final enhanced features $\hat{F}_{B}$:

$\hat{F}_{B} = \gamma \cdot F_{att} + F_{B}^{'}, \qquad (3)$

where $\gamma$ is a learnable parameter.
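A minimal PyTorch sketch of the DGM, following the reconstruction of Eqs. (1)-(3) above, is given below; which operand serves as the query and which provides the value reflects our reading of the description, and the bilinear interpolation mode is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DifferenceGuidanceModule(nn.Module):
    """Cross-attention between the guide map G and the background-aware features F_B."""
    def __init__(self):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1))  # learnable residual weight in Eq. (3)

    def forward(self, guide_map, f_b):
        # Eq. (1): upsample F_B to the guide-map size, and repeat the single-channel
        # guide map along the channel dimension.
        f_b = F.interpolate(f_b, size=guide_map.shape[2:], mode='bilinear', align_corners=False)
        g = guide_map.expand(-1, f_b.shape[1], -1, -1)

        b, c, h, w = f_b.shape
        q = g.reshape(b, c, h * w).permute(0, 2, 1)          # reshape + transpose: (B, N, C)
        k = f_b.reshape(b, c, h * w)                         # reshape: (B, C, N)
        attn = torch.softmax(torch.bmm(q, k), dim=-1)        # Eq. (2): (B, N, N) difference relationships

        v = f_b.reshape(b, c, h * w)                         # value taken from the background features
        f_att = torch.bmm(v, attn.permute(0, 2, 1)).reshape(b, c, h, w)
        return self.gamma * f_att + f_b                      # Eq. (3): residual with learnable gamma
```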
3.4.2 Difference Enhancement Module
As shown in Fig. 6, the DEM is used to separately learn the foreground and background features, while adaptively fusing the two to enhance the difference between the foreground and the background. As mentioned before, the guide map can be directly used to generate the coarse map $M$ by applying the sigmoid function. In the coarse map $M$, a pixel value closer to one means that the pixel belongs to the foreground, to a large extent, and vice versa. Therefore, $M$ and $1 - M$ are used to represent the probabilities of being foreground and background, respectively. These two probability maps are then used to extract the features from the enhanced features $\hat{F}_{B}$:

$F_f = M \odot \hat{F}_{B}, \quad F_b = (1 - M) \odot \hat{F}_{B}, \qquad (4)$

where $F_f$ represents the foreground features, $F_b$ represents the background features, and $\odot$ denotes element-wise multiplication. To enlarge the difference between the foreground and the background, subtraction between $F_f$ and $F_b$ is utilized. Furthermore, two learnable parameters $\alpha$ and $\beta$ are introduced to adaptively fuse the two types of features and obtain the difference-aware features $F_d$:

$F_d = \alpha \cdot F_f - \beta \cdot F_b. \qquad (5)$

In fact, the subtraction used in Eq. (5) can be replaced with addition, due to the introduction of the two learnable parameters. Finally, the difference-aware features are processed by two convolution blocks and converted to the refined map by the sigmoid function.
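A hedged sketch of the DEM (Eqs. (4)-(5)) and of a DAE that chains the DGM and DEM follows; the two refinement convolution blocks and the 1-channel prediction head producing the refined map are our assumptions about the output layers, and `ConvBlock` is reused from the FEM sketch.

```python
import torch
import torch.nn as nn

class DifferenceEnhancementModule(nn.Module):
    """Split the guided features into foreground/background parts and fuse them adaptively."""
    def __init__(self, channels):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1))   # learnable fusion weights in Eq. (5)
        self.beta = nn.Parameter(torch.ones(1))
        self.refine = nn.Sequential(ConvBlock(channels, channels), ConvBlock(channels, channels))
        self.predict = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, guide_map, f_guided):
        m = torch.sigmoid(guide_map)                         # coarse foreground probability map M
        f_f = m * f_guided                                   # Eq. (4): foreground features
        f_b = (1.0 - m) * f_guided                           #          background features
        f_d = self.alpha * f_f - self.beta * f_b             # Eq. (5): difference-aware features
        return self.predict(self.refine(f_d))                # refined map (logits)

class DifferenceAwareExtractor(nn.Module):
    """Stage C: DGM followed by DEM; the output map can serve as the guide map of the next DAE."""
    def __init__(self, channels):
        super().__init__()
        self.dgm = DifferenceGuidanceModule()
        self.dem = DifferenceEnhancementModule(channels)

    def forward(self, guide_map, f_b):
        f_guided = self.dgm(guide_map, f_b)
        return self.dem(guide_map, f_guided)
```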
3.5 Loss Function
To enhance the difference between the foreground and the background as much as possible, the DAE module is utilized twice. As shown in Fig. 2, the output of the first DAE is regarded as a new guide map, which is again merged with the background-aware features and processed by the second DAE. The output of the first DAE is processed by the sigmoid function to produce the map $M_1$, while the second DAE produces the refined map $M_2$. Together with the guide map output $M_g$ from Stage A, there are three output maps, $M_g$, $M_1$, and $M_2$, which are used to build the loss function with the ground truth $GT$. Weighted binary cross-entropy (BCE) loss and weighted intersection over union (IoU) loss are used to construct the loss function[62] for each output map. Therefore, the total loss function can be described as:

$\mathcal{L}_{total} = \mathcal{L}(M_g, GT) + \mathcal{L}(M_1, GT) + \mathcal{L}(M_2, GT), \quad \mathcal{L} = \mathcal{L}_{BCE}^{w} + \mathcal{L}_{IoU}^{w}. \qquad (6)$
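The weighted BCE + weighted IoU combination of [62] is commonly implemented as the "structure loss" sketched below; the boundary-weighting neighbourhood of 31 pixels and the equal weighting of the three output maps are assumptions of this sketch rather than settings confirmed by the text.

```python
import torch
import torch.nn.functional as F

def structure_loss(pred, mask):
    """Weighted BCE + weighted IoU for one output map (pred holds logits, mask is in {0, 1})."""
    # Pixels whose local neighbourhood differs from their own label (i.e., boundary regions)
    # receive larger weights.
    weit = 1 + 5 * torch.abs(F.avg_pool2d(mask, kernel_size=31, stride=1, padding=15) - mask)
    wbce = F.binary_cross_entropy_with_logits(pred, mask, reduction='none')
    wbce = (weit * wbce).sum(dim=(2, 3)) / weit.sum(dim=(2, 3))

    pred = torch.sigmoid(pred)
    inter = ((pred * mask) * weit).sum(dim=(2, 3))
    union = ((pred + mask) * weit).sum(dim=(2, 3))
    wiou = 1 - (inter + 1) / (union - inter + 1)
    return (wbce + wiou).mean()

def total_loss(guide_logits, dae1_logits, dae2_logits, gt):
    """Eq. (6): supervise the three output maps (guide map and two DAE outputs) with the ground truth."""
    return (structure_loss(guide_logits, gt) + structure_loss(dae1_logits, gt)
            + structure_loss(dae2_logits, gt))
```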
3.6 Analysis of the Three Stages
To verify the proposed paradigm, the feature maps of each stage were visualized in a real-data experiment. As can be seen in Fig. 7 (c) and (d), it is clear that the output of the guide map is a coarse segmentation result for the camouflaged objects, whereas the MFF module produces the background-aware features. The proposed DAE, consisting of the DGM and DEM, is then used to exploit the differences between the two. In detail, the DGM outputs the enhanced background features, as shown in Fig. 7 (e), and the DEM outputs the difference-aware features, as shown in Fig. 7 (f). By comparing Fig. 7 (c) and (f), it can be concluded that the binary object with fused difference-aware features is much clearer than the coarse map, indicating the advantage of the proposed paradigm.






4 Experiments

Baseline | Backbone | DUT-OMRON | SOD | ECSSD | DUTS-TE | HKU-IS | PASCAL-S | ||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
CPD | VGG-16 | 0.845 | 0.057 | 0.787 | 0.113 | 0.938 | 0.040 | 0.902 | 0.043 | - | - | 0.883 | 0.072 |
EGNet | VGG-16 | 0.848 | 0.056 | 0.802 | 0.110 | 0.936 | 0.041 | 0.898 | 0.044 | - | - | 0.878 | 0.077 |
MINet | VGG-16 | 0.841 | 0.057 | - | - | 0.941 | 0.037 | 0.907 | 0.039 | - | - | 0.891 | 0.065 |
GateNet | VGG-16 | 0.836 | 0.061 | - | - | 0.931 | 0.042 | 0.893 | 0.045 | - | - | 0.891 | 0.065 |
PoolNet+ | VGG-16 | 0.851 | 0.056 | 0.815 | 0.105 | 0.936 | 0.042 | 0.904 | 0.040 | 0.940 | 0.033 | 0.876 | 0.075 |
DAD | VGG-16 | 0.869 | 0.055 | 0.829 | 0.098 | 0.951 | 0.033 | 0.919 | 0.039 | 0.951 | 0.030 | 0.897 | 0.064 |
CPD | ResNet-50 | 0.845 | 0.057 | 0.787 | 0.113 | 0.938 | 0.040 | 0.902 | 0.043 | - | - | 0.883 | 0.072 |
PoolNet | ResNet-50 | 0.854 | 0.056 | 0.818 | 0.102 | 0.940 | 0.039 | 0.904 | 0.040 | 0.940 | 0.033 | 0.876 | 0.075 |
BANet | ResNet-50 | 0.861 | 0.061 | 0.813 | 0.109 | 0.940 | 0.041 | 0.897 | 0.046 | 0.938 | 0.037 | 0.875 | 0.078 |
EGNet | ResNet-50 | 0.848 | 0.053 | 0.820 | 0.097 | 0.943 | 0.037 | 0.907 | 0.039 | - | - | 0.881 | 0.074 |
MINet | ResNet-50 | 0.855 | 0.056 | - | - | 0.948 | 0.034 | 0.917 | 0.037 | - | - | 0.893 | 0.064 |
GateNet | ResNet-50 | 0.851 | 0.055 | - | - | 0.934 | 0.041 | 0.906 | 0.040 | - | - | 0.884 | 0.068 |
PoolNet+ | ResNet-50 | 0.841 | 0.054 | 0.805 | 0.104 | 0.945 | 0.035 | 0.910 | 0.037 | - | - | 0.897 | 0.065 |
DFI | ResNet-50 | 0.864 | 0.055 | 0.812 | 0.102 | 0.924 | 0.035 | 0.892 | 0.039 | 0.951 | 0.031 | 0.863 | 0.066 |
DAD | ResNet-50 | 0.867 | 0.052 | 0.825 | 0.095 | 0.953 | 0.032 | 0.925 | 0.035 | 0.953 | 0.028 | 0.901 | 0.060 |
CSF | Res2Net-50 | 0.868 | 0.057 | 0.822 | 0.102 | 0.947 | 0.034 | 0.907 | 0.042 | 0.948 | 0.030 | 0.878 | 0.072 |
PoolNet # | Res2Net-50 | 0.857 | 0.053 | 0.815 | 0.100 | 0.946 | 0.036 | 0.902 | 0.041 | 0.942 | 0.032 | 0.884 | 0.069 |
DAD | Res2Net-50 | 0.882 | 0.049 | 0.834 | 0.091 | 0.959 | 0.028 | 0.934 | 0.032 | 0.958 | 0.026 | 0.909 | 0.058 |
PVT-SOD | PVT-v2-b2 | 0.883 | 0.044 | - | - | 0.958 | 0.028 | 0.933 | 0.030 | 0.960 | 0.026 | 0.906 | 0.057 |
DAD | PVT-v2-b2 | 0.903 | 0.045 | 0.859 | 0.082 | 0.965 | 0.024 | 0.949 | 0.026 | 0.967 | 0.022 | 0.918 | 0.052 |
# means that the results were obtained by ourselves.
4.1 Experimental Setting
In this study, we aimed to design a unified decoder network to capture and enlarge the difference between the foreground and the background for binary segmentation. To evaluate the proposed model, experiments were conducted on SOD, COD, polyp segmentation, and mirror detection. All the models were implemented in PyTorch 1.7.0, and the training and testing were performed on an NVIDIA GeForce RTX 3090 GPU with 24 GB of memory. The input images were resized to a fixed spatial size. The Adam optimizer[63] was used in the training process, with the initial learning rate attenuated by a factor of 10 every 50 epochs. The difference-aware decoder was trained for 200 epochs with a batch size of 36.
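The optimization schedule described above can be expressed as in the hedged sketch below; the initial learning rate of 1e-4 is a placeholder (the exact value is not restated here), while the step decay, epoch count, and batch size follow the text.

```python
import torch.nn as nn
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR

NUM_EPOCHS = 200   # as stated in the text
BATCH_SIZE = 36    # as stated in the text

def build_optimizer(model: nn.Module):
    """Adam with the learning rate attenuated by a factor of 10 every 50 epochs."""
    optimizer = Adam(model.parameters(), lr=1e-4)           # initial learning rate: placeholder value
    scheduler = StepLR(optimizer, step_size=50, gamma=0.1)  # 10x decay every 50 epochs
    return optimizer, scheduler
```

In this setting, `scheduler.step()` would be called once per epoch, after the loop over the training batches.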
4.2 Evaluation Metrics
Following the previous works, we utilize the structure-measure ($S_\alpha$) [64], E-measure ($E_\phi$) [65], weighted F-measure ($F_\beta^w$) [66], and mean absolute error ($\mathcal{M}$) as the evaluation metrics. The structure-measure ($S_\alpha$) [64] evaluates the region-aware and object-aware structural similarity between the predicted map and the ground truth, and is calculated as:

$S_\alpha = \alpha \cdot S_o + (1 - \alpha) \cdot S_r, \qquad (7)$

where $\alpha$ is a balance parameter, which was set to 0.5, and $S_r$ and $S_o$ represent the region-aware and object-aware structural similarity, respectively. The E-measure ($E_\phi$) [65] captures image-level statistics and pixel-level matching information at the same time. It is calculated as:

$E_\phi = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} \phi(x, y), \qquad (8)$

where $\phi$ is the enhanced alignment matrix, and $W$ and $H$ respectively represent the width and height of the image. The weighted F-measure ($F_\beta^w$) [66] evaluates the segmentation result with weighted precision and recall. It is calculated as:

$F_\beta^w = \frac{(1 + \beta^2) \cdot \mathrm{Precision}^w \cdot \mathrm{Recall}^w}{\beta^2 \cdot \mathrm{Precision}^w + \mathrm{Recall}^w}. \qquad (9)$

The mean absolute error ($\mathcal{M}$) calculates the error between the predicted map $P$ and the ground truth $GT$. It is formulated as:

$\mathcal{M} = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} \left| P(x, y) - GT(x, y) \right|, \qquad (10)$

where $W$ and $H$ respectively represent the width and height of the image.
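MAE (Eq. (10)) is straightforward to implement, as in the sketch below; the structure-, E-, and weighted F-measures are more involved and are usually computed with public SOD/COD evaluation toolboxes.

```python
import numpy as np

def mae(pred, gt):
    """Eq. (10): mean absolute error between a predicted map and the ground truth.
    Both inputs are H x W arrays with values scaled to [0, 1]."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    return float(np.abs(pred - gt).mean())
```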
For both the SOD and COD tasks, we selected the structure-measure ($S_\alpha$) [64], E-measure ($E_\phi$) [65], weighted F-measure ($F_\beta^w$) [66], and mean absolute error ($\mathcal{M}$). For the polyp segmentation, we not only selected the structure-measure ($S_\alpha$) [64], weighted F-measure ($F_\beta^w$) [66], and mean absolute error ($\mathcal{M}$), but also the mean Dice coefficient (mDice)[67] and the mean intersection over union (mIoU), which mainly focus on the internal consistency of the segmentation results.

Baseline | Backbone | CAMO ($S_\alpha$ / $E_\phi$ / $F_\beta^w$ / $\mathcal{M}$) | CHAMELEON ($S_\alpha$ / $E_\phi$ / $F_\beta^w$ / $\mathcal{M}$) | COD10K ($S_\alpha$ / $E_\phi$ / $F_\beta^w$ / $\mathcal{M}$) | NC4K ($S_\alpha$ / $E_\phi$ / $F_\beta^w$ / $\mathcal{M}$)
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
SINet | ResNet-50 | 0.751 | 0.771 | 0.606 | 0.100 | 0.869 | 0.891 | 0.740 | 0.044 | 0.771 | 0.806 | 0.551 | 0.051 | 0.808 | 0.871 | 0.738 | 0.058 |
LSR | ResNet-50 | 0.787 | 0.838 | 0.696 | 0.080 | 0.890 | 0.935 | 0.822 | 0.030 | 0.804 | 0.877 | 0.660 | 0.040 | 0.839 | 0.883 | 0.777 | 0.053 |
MGL-R | ResNet-50 | 0.775 | 0.812 | 0.673 | 0.088 | 0.893 | 0.917 | 0.808 | 0.031 | 0.814 | 0.851 | 0.666 | 0.035 | 0.833 | 0.866 | 0.754 | 0.053 |
MGL-S | ResNet-50 | 0.772 | 0.806 | 0.664 | 0.089 | 0.892 | 0.911 | 0.799 | 0.032 | 0.811 | 0.844 | 0.654 | 0.037 | 0.827 | 0.860 | 0.747 | 0.055 |
UGTR | ResNet-50 | 0.784 | 0.821 | 0.683 | 0.086 | 0.888 | 0.910 | 0.787 | 0.031 | 0.817 | 0.852 | 0.665 | 0.036 | 0.839 | 0.874 | 0.755 | 0.052 |
PFNet | ResNet-50 | 0.782 | 0.842 | 0.695 | 0.085 | 0.882 | 0.931 | 0.808 | 0.032 | 0.800 | 0.877 | 0.660 | 0.040 | 0.829 | 0.886 | 0.747 | 0.053 |
EINet | ResNet-50 | 0.626 | 0.675 | 0.424 | 0.143 | 0.793 | 0.850 | 0.631 | 0.069 | 0.636 | 0.708 | 0.363 | 0.090 | 0.676 | 0.735 | 0.483 | 0.112 |
DAD | ResNet-50 | 0.795 | 0.863 | 0.713 | 0.076 | 0.885 | 0.941 | 0.801 | 0.028 | 0.803 | 0.886 | 0.659 | 0.038 | 0.834 | 0.900 | 0.758 | 0.049 |
SINet V2 | Res2Net-50 | 0.820 | 0.882 | 0.743 | 0.070 | 0.888 | 0.942 | 0.816 | 0.030 | 0.815 | 0.887 | 0.680 | 0.037 | 0.847 | 0.903 | 0.767 | 0.048 |
EINet | Res2Net-50 | 0.817 | 0.872 | 0.740 | 0.070 | 0.891 | 0.939 | 0.819 | 0.030 | 0.815 | 0.887 | 0.682 | 0.036 | 0.845 | 0.900 | 0.768 | 0.047 |
DAD | Res2Net-50 | 0.830 | 0.895 | 0.774 | 0.063 | 0.899 | 0.947 | 0.842 | 0.027 | 0.827 | 0.905 | 0.720 | 0.032 | 0.851 | 0.911 | 0.792 | 0.044 |
SINet V2 # | PVT-v2-b2 | 0.863 | 0.918 | 0.791 | 0.051 | 0.892 | 0.946 | 0.810 | 0.028 | 0.847 | 0.915 | 0.719 | 0.028 | 0.878 | 0.928 | 0.811 | 0.036 |
DTIT | PVT-v2-b2 | 0.857 | 0.916 | 0.796 | 0.050 | - | - | - | - | 0.824 | 0.896 | 0.695 | 0.034 | 0.863 | 0.917 | 0.792 | 0.041
EINet | PVT-v2-b2 | 0.856 | 0.910 | 0.801 | 0.054 | 0.895 | 0.944 | 0.837 | 0.027 | 0.847 | 0.915 | 0.742 | 0.028 | 0.875 | 0.926 | 0.817 | 0.037 |
DAD | PVT-v2-b2 | 0.867 | 0.929 | 0.821 | 0.047 | 0.905 | 0.963 | 0.855 | 0.022 | 0.864 | 0.935 | 0.776 | 0.023 | 0.882 | 0.935 | 0.839 | 0.033 |
# means that the results were obtained by ourselves.
4.3 Experiments in SOD
4.3.1 Datasets
We kept the same experimental setup as used for other SOD methods[10]. The training dataset had 10,553 images, and we conducted experiments on six datasets. The DUT-OMRON dataset[68] has 5,168 images with complex objects. The PASCAL-S dataset[69] consists of 850 challenging images. The HKU-IS dataset [43] contains 4,447 images with multiple foreground objects. The Extended Complex Scene Saliency Dataset (ECSSD)[70] and Salient Objects Dataset (SOD)[71] contain 1,000 images and 300 images, respectively. The DUTS dataset[72] is made up of the DUTS-TR (10,553 images for training) and DUTS-TE (5,019 images for testing) sets. In the experiments, the DUTS-TR set was used for the training, and the DUTS-TE set was used for the testing.
4.3.2 Compared Methods
For the SOD task, we selected 16 methods to evaluate the proposed difference-aware decoder: methods based on the VGG-16 backbone, i.e., CPD (VGG-16)[21], EGNet (VGG-16)[1], MINet (VGG-16)[48], GateNet (VGG-16)[50], and PoolNet+ (VGG-16)[73]; methods based on the ResNet-50 backbone, i.e., CPD (ResNet-50)[21], PoolNet (ResNet-50)[74], BANet (ResNet-50)[75], EGNet (ResNet-50)[1], MINet (ResNet-50)[48], GateNet (ResNet-50)[50], PoolNet+ (ResNet-50)[73], and DFI (ResNet-50) [76]; methods based on the Res2Net-50 backbone, i.e., CSF (Res2Net-50) [77] and PoolNet (Res2Net-50) [73]; and a method based on the PVT-v2-b2 backbone, i.e., PVT-SOD (PVT-v2-b2)[78]. We directly used the results or the code provided by the related authors for the comparison. The proposed difference-aware decoder was tested on the VGG-16, ResNet-50, Res2Net-50, and PVT-v2-b2 backbones, for a fair comparison. It is worth noting that PVT-v2-b2 only has a four-layer encoder structure. Therefore, for the proposed DAD, we select the first- and fourth-layer features for Stage A, and the first-, second-, and third-layer features for Stage B.
4.3.3 Visual Results of the SOD
In order to compare the visual effects of the different SOD methods, several salient images are selected for visualization in Fig. 8. Compared with the other state-of-the-art SOD methods, the difference-aware decoder achieves the best segmentation effects. For the first image, the biggest difficulty is the segmentation of the bus roof from the sky, which are highly mixed. The compared methods of GateNet, MINet, and CSF fail to successfully segment the bus roof from the sky. The PoolNet and PoolNet+ methods can somehow deal with this problem, but there are still some flaws. The difference-aware decoder, with ResNet-50, Res2Net-50, and PVT-v2-b2 as the backbone, can perfectly distinguish the clear outline of the bus from the background. For the second and the third images, it is apparent that there are ambiguous regions in the segmentation results of the other methods, but the proposed method with different backbones can achieve maps that are much more similar to the ground truth.
4.3.4 Quantitative Evaluation of the SOD
Table I lists the overall quantitative evaluation results of the different methods obtained on the six datasets. The best score for each backbone is highlighted in bold, and the best score for each metric is marked in red. It is clear that the proposed difference-aware decoder with PVT-v2-b2 backbone achieves the best results, for all the datasets and evaluation metrics. What is more, with the different backbones of VGG-16, ResNet-50, and Res2Net-50, the difference-aware decoder always achieves the best performance, indicating the transferability of the proposed decoder architecture for different backbones.
Compared to the other methods, the difference-aware decoder improves the evaluation metrics by a large margin. The main reason for this advantage is that the proposed difference-aware decoder can more effectively enlarge the difference between the foreground and background, by fusing the guide map and the background-aware features.







Baseline | Backbone | ClinicDB (mDice / mIoU / $F_\beta^w$ / $S_\alpha$ / $\mathcal{M}$) | ColonDB (mDice / mIoU / $F_\beta^w$ / $S_\alpha$ / $\mathcal{M}$) | ETIS (mDice / mIoU / $F_\beta^w$ / $S_\alpha$ / $\mathcal{M}$)
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
UNet | ResNet-50 | 0.823 | 0.755 | 0.811 | 0.889 | 0.019 | 0.512 | 0.444 | 0.498 | 0.712 | 0.061 | 0.398 | 0.335 | 0.366 | 0.684 | 0.036 |
UNet++ | ResNet-50 | 0.794 | 0.729 | 0.785 | 0.873 | 0.022 | 0.483 | 0.410 | 0.467 | 0.691 | 0.064 | 0.401 | 0.344 | 0.390 | 0.683 | 0.035 |
MSEG | ResNet-50 | 0.909 | 0.864 | 0.907 | 0.938 | 0.007 | 0.735 | 0.666 | 0.724 | 0.834 | 0.038 | 0.700 | 0.630 | 0.671 | 0.828 | 0.015 |
ACSNet | ResNet-50 | 0.882 | 0.826 | 0.873 | 0.927 | 0.011 | 0.716 | 0.649 | 0.697 | 0.829 | 0.039 | 0.578 | 0.509 | 0.530 | 0.754 | 0.059 |
PraNet | Res2Net-50 | 0.899 | 0.849 | 0.896 | 0.936 | 0.009 | 0.712 | 0.640 | 0.699 | 0.820 | 0.043 | 0.628 | 0.567 | 0.600 | 0.794 | 0.031 |
EU-Net | ResNet-34 | 0.902 | 0.846 | 0.891 | 0.936 | 0.011 | 0.756 | 0.681 | 0.730 | 0.831 | 0.045 | 0.687 | 0.609 | 0.636 | 0.793 | 0.067 |
SANet | Res2Net-50 | 0.916 | 0.859 | 0.909 | 0.939 | 0.012 | 0.753 | 0.670 | 0.726 | 0.837 | 0.043 | 0.750 | 0.654 | 0.685 | 0.849 | 0.015 |
Polyp-PVT | PVT-v2-b2 | 0.937 | 0.889 | 0.936 | 0.949 | 0.006 | 0.808 | 0.727 | 0.795 | 0.865 | 0.031 | 0.787 | 0.706 | 0.750 | 0.871 | 0.013 |
DAD | PVT-v2-b2 | 0.940 | 0.893 | 0.937 | 0.954 | 0.006 | 0.826 | 0.751 | 0.809 | 0.880 | 0.027 | 0.801 | 0.726 | 0.763 | 0.880 | 0.017 |
4.4 Experiments in COD
4.4.1 Datasets
Four datasets were used to evaluate the proposed model: CHAMELEON [79], CAMO [80], COD10K [2], and NC4K[36]. These four datasets differ significantly. NC4K is the largest dataset, which contains 4,121 images downloaded from the Internet. COD10K includes 78 camouflaged categories, with 3,040 training images and 2,026 test images in total. CAMO provides 1,250 images in total, including 1,000 training images and 250 test images of eight categories. The smallest dataset is the CHAMELEON dataset, which does not provide a training set, and contains only 76 images for testing. Following [11, 33, 2], we used the training images from the CAMO and COD10K datasets as the training set (4,040 images) and the test images from the CHAMELEON [79], CAMO [80], COD10K [2], and NC4K[36] datasets as the test set.
4.4.2 Compared Methods
In order to evaluate the proposed difference-aware decoder, we selected 12 state-of-the-art methods for comparison: methods based on the ResNet-50 backbone, i.e., SINet (ResNet-50) [2], LSR (ResNet-50) [36], MGL-R (ResNet-50)[53], MGL-S (ResNet-50)[53], UGTR (ResNet-50)[81], PFNet (ResNet-50) [52], and EINet (ResNet-50)[82]; methods based on the Res2Net-50 backbone, i.e., SINet V2 (Res2Net-50)[11] and EINet (Res2Net-50)[82]; and methods based on the PVT-v2-b2 backbone, i.e., SINet V2 (PVT-v2-b2)[11], EINet (PVT-v2-b2)[82], and DTIT (PVT-v2-b2) [51]. The proposed difference-aware decoder was tested on the ResNet-50, Res2Net-50, and PVT-v2-b2 backbones, for a fair comparison.
4.4.3 Visual Results of the COD
In order to compare the visual effects of the different methods, we selected several challenging camouflaged images for the visualization, which are shown in Fig. 9. In the first image, the left person is perfectly hidden in the trees. The LSR, PFNet, and SINet V2 methods and the proposed difference-aware decoder with ResNet-50 backbone fail to segment the body of the left person. However, the difference-aware decoder with Res2Net-50 and PVT-v2-b2 backbones can clearly detect both persons. For the second image, the two soldiers are hidden in the rocks, and the other compared methods cannot segment them accurately, but the difference-aware decoder with the different backbones can find them successfully. For the third image, the creatures are mixed with their surroundings and are difficult to distinguish. The segmentation results achieved by the compared methods are incomplete with respect to the ground truth, whereas the difference-aware decoder with Res2Net-50 and PVT-v2-b2 backbones can achieve much better segmentation maps. From our understanding, the backbone is important for the COD task. However, the proposed decoder paradigm can make the best use of the different backbones to achieve a superior accuracy.
4.4.4 Quantitative Evaluation of the COD
Table II lists the overall quantitative evaluation results of the different methods obtained on the four datasets. On the ResNet-50 backbone, the proposed difference-aware decoder achieves the best score in the $E_\phi$ metric for all four datasets. With regard to the $S_\alpha$ and $F_\beta^w$ metrics, the difference-aware decoder achieves the best scores on the CAMO dataset and the second-best scores on the NC4K dataset. With regard to the $\mathcal{M}$ metric, the difference-aware decoder achieves the best scores on the CAMO, CHAMELEON, and NC4K datasets. For the COD task, Res2Net-50 and PVT-v2-b2 are the better backbones[2, 51]. On these two backbones, the proposed difference-aware decoder outperforms SINet V2 and DTIT by a large margin, for all the metrics and datasets. The main reason for this advantage is that the proposed difference-aware decoder can more effectively enlarge the difference between the camouflaged objects and the background, by fusing the guide map and the background-aware features.
4.5 Experiments in Polyp Segmentation
4.5.1 Datasets
4.5.2 Compared Methods
In order to evaluate the proposed difference-aware decoder, we selected eight state-of-the-art methods for comparison: UNet[86], UNet++[87], MSEG[38], ACSNet[39], PraNet[3], EU-Net[40], SANet[41], and Polyp-PVT[12]. Following Polyp-PVT, the difference-aware decoder also used PVT-v2-b2 as the backbone for the experiments.
4.5.3 Visual Results of the Polyp Segmentation
In Fig. 10, two examples are selected for visualization. From these two examples, it can be seen that the polyps are highly mixed with the background. The compared methods of PraNet, EU-Net, and SANet fail to achieve efficient segmentation, and Polyp-PVT can achieve only a rough map. However, the difference-aware decoder can segment the complete outline of the polyps successfully, compared to the ground truth.
4.5.4 Quantitative Evaluation of Polyp Segmentation
Table III lists the overall quantitative evaluation results of the different methods obtained on the three datasets. For the mDice, mIoU, $F_\beta^w$, and $S_\alpha$ metrics, the difference-aware decoder achieves the best performance on the ClinicDB, ColonDB, and ETIS datasets. For the $\mathcal{M}$ metric, the difference-aware decoder obtains the best accuracy on the ClinicDB and ColonDB datasets. These quantitative evaluation results reflect the effectiveness of the proposed decoder architecture in the polyp segmentation task.





4.6 Experiments in Mirror Detection
4.6.1 Datasets
The MSD[4] dataset was chosen for the mirror detection experiments. The MSD dataset consists of 4,018 pairs of images, with 3,063 pairs of images for training and 955 pairs of images for testing.
4.6.2 Compared Methods
In order to evaluate the proposed difference-aware decoder, we selected two general segmentation methods, i.e., PSPNet[26] and DeepLab v3+[24], and three state-of-the-art methods designed for mirror detection, i.e., MirrorNet[4], PMD-Net[88], and LSA[89]. For a fair comparison, we used ResNeXt-101 [90] pre-trained on ImageNet-1K [91] as the backbone for all the methods. A conditional random field (CRF)[92] was also utilized as a common post-processing step for all the methods during inference.
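For reference, CRF post-processing of a predicted foreground probability map is typically performed with the pydensecrf package along the lines of the sketch below; the pairwise kernel parameters shown are common defaults, not necessarily the settings used in these experiments.

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def crf_refine(image, prob, iters=5):
    """Refine a foreground probability map (H x W, values in [0, 1]) with a dense CRF.
    `image` is the corresponding H x W x 3 uint8 RGB array."""
    h, w = prob.shape
    softmax = np.stack([1.0 - prob, prob], axis=0).astype(np.float32)  # background/foreground
    unary = unary_from_softmax(softmax)

    d = dcrf.DenseCRF2D(w, h, 2)
    d.setUnaryEnergy(unary)
    d.addPairwiseGaussian(sxy=3, compat=3)                     # spatial smoothness kernel
    d.addPairwiseBilateral(sxy=60, srgb=5, compat=5,
                           rgbim=np.ascontiguousarray(image))  # appearance kernel
    q = np.array(d.inference(iters)).reshape(2, h, w)
    return q[1]                                                # refined foreground probability
```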
4.6.3 Visual Results of the Mirror Detection
Fig. 11 illustrates several examples for the visual comparison between the proposed method and the other state-of-the-art methods. In these examples, it can be found that the backgrounds of the mirrors are complex and it is difficult for the other methods to segment relatively clear bodies for the mirrors. In contrast, the difference-aware decoder can achieve a segmentation map that is closer to the ground truth, showing the superiority of the proposed method.
4.6.4 Quantitative Evaluation of the Mirror Detection
Following the evaluation strategy used for the other methods[89], we used the F1-score, the intersection over union (IoU), the accuracy (Acc), and the mean absolute error ($\mathcal{M}$) to measure the performance of the proposed difference-aware decoder and the other compared methods. From Table IV, it can be found that the proposed method achieves the best score for all the metrics.
Baseline | Backbone | F1 | IoU (%) | Acc (%) | $\mathcal{M}$
---|---|---|---|---|---|
PSPNet | ResNext-101 | 0.8459 | 67.99 | 92.19 | 0.07875 |
Deeplab v3+ | ResNext-101 | 0.8750 | 77.48 | 94.13 | 0.05932 |
MirrorNet | ResNext-101 | 0.8597 | 77.41 | 92.75 | 0.07257 |
PMD-Net | ResNext-101 | 0.8691 | 76.88 | 93.94 | 0.06130 |
LSA | ResNext-101 | 0.8887 | 79.85 | 94.63 | 0.05421 |
DAD | ResNext-101 | 0.8910 | 82.90 | 95.37 | 0.04644 |
4.7 Parametric Analysis
In order to verify the effectiveness of the parameter selection in each module in the experiments, we used Res2Net-50 as the backbone and selected two COD datasets (CAMO [80], COD10K [2]) and two SOD datasets (ECSSD[70], PASCAL-S[69]) for the validation experiments.
4.7.1 Multi-Layer Feature Partition
Layers | CAMO | COD10K | ECSSD | PASCAL-S | ||||
---|---|---|---|---|---|---|---|---|
2+5 | 0.890 | 0.069 | 0.904 | 0.032 | 0.951 | 0.033 | 0.903 | 0.061 |
3+5 | 0.894 | 0.065 | 0.901 | 0.033 | 0.947 | 0.031 | 0.899 | 0.062 |
4+5 | 0.876 | 0.072 | 0.882 | 0.038 | 0.957 | 0.029 | 0.902 | 0.060 |
5 | 0.865 | 0.079 | 0.875 | 0.042 | 0.927 | 0.037 | 0.888 | 0.069 |
1+2+5 | 0.880 | 0.071 | 0.909 | 0.032 | 0.953 | 0.030 | 0.900 | 0.059 |
1+3+5 | 0.887 | 0.065 | 0.905 | 0.031 | 0.955 | 0.031 | 0.902 | 0.060 |
1+4+5 | 0.883 | 0.069 | 0.904 | 0.032 | 0.953 | 0.029 | 0.897 | 0.061 |
1+2+4+5 | 0.885 | 0.071 | 0.895 | 0.035 | 0.954 | 0.029 | 0.895 | 0.060 |
1+2+3+5 | 0.888 | 0.068 | 0.899 | 0.033 | 0.952 | 0.030 | 0.902 | 0.060 |
1+3+4+5 | 0.880 | 0.072 | 0.904 | 0.032 | 0.954 | 0.029 | 0.902 | 0.059 |
1+5(proposed) | 0.895 | 0.063 | 0.905 | 0.032 | 0.963 | 0.028 | 0.909 | 0.058 |
In Stage A, we chose layers 1 and 5 from the backbone for the guide map generation; and in Stage B, we fused layers 2–4 for the background-aware feature extraction. In the following, we attempt to explain why we divided the multi-layer features in this way. From experience [22, 2], the highest-level features are always used for the coarse map generation. For a complete analysis, we combined layer 5 with different layer(s) for Stage A, and used the other layer(s) for Stage B. Table V presents the results of the different combinations of multi-layer features. It can be clearly observed that the strategy of combining layers 1 and 5 for Stage A achieves the best accuracy on the CAMO, ECSSD, and PASCAL-S datasets, while almost achieving the best results on the COD10K dataset. Therefore, in the experiments, we fixed layers 1 and 5 for Stage A. The explanation for this is that the high resolution of layer 1 can supplement layer 5 with more spatial details, which is an extremely effective way to obtain a fine guide map. It needs to be clarified that the multi-layer features were divided without overlapping. This is mainly done to enhance the differences between the guide map from Stage A and the background-aware features from Stage B.
4.7.2 Spatial Size in Middle Feature Fusion (MFF)
Fusion | CAMO | COD10K | ECSSD | PASCAL-S | ||||
---|---|---|---|---|---|---|---|---|
B-U | 0.885 | 0.071 | 0.903 | 0.034 | 0.957 | 0.029 | 0.901 | 0.059 |
T-D | 0.886 | 0.068 | 0.901 | 0.033 | 0.952 | 0.031 | 0.905 | 0.060 |
JPU[93] | 0.884 | 0.071 | 0.900 | 0.034 | 0.953 | 0.029 | 0.904 | 0.060 |
M(proposed) | 0.895 | 0.063 | 0.905 | 0.032 | 0.963 | 0.028 | 0.909 | 0.058 |
In Stage B, the proposed MFF module was used to integrate the features at the different levels. In fact, an upsample/downsample function is utilized to resize layers 2 and 4 to the size of layer 3. In Table VI, Top-Down (T-D) means that layer 2 and layer 3 were downsampled to the same size as layer 4, while Bottom-Up (B-U) means that layer 3 and layer 4 were upsampled to the size of layer 2. Table VI indicates that the proposed strategy can achieve the best detection accuracy. This is mainly because, if the resolution is low, the proportion of the features that relate to the difference is also reduced; in addition, if a high resolution is blindly pursued, it results in the loss of semantic information and an increase in the computational complexity. The joint pyramid upsampling (JPU[93]) method also takes the three feature maps as inputs and utilizes a context module to generate a high-resolution feature map, which is similar to the MFF module. The MFF module was therefore compared with the JPU method [93]. From Table VI, it can be seen that the proposed MFF module obtains better results than the JPU method[93].
4.8 Ablation Study
In order to verify the effectiveness of each module in the proposed difference-aware decoder based on Res2Net-50, ablation experiments were conducted on two COD datasets (CAMO [80], COD10K [2]) and two SOD datasets (ECSSD[70], PASCAL-S[69]).
4.8.1 Field Expansion Module (FEM)
Methods | CAMO | COD10K | ECSSD | PASCAL-S | ||||
---|---|---|---|---|---|---|---|---|
RFB [59] | 0.875 | 0.073 | 0.901 | 0.033 | 0.957 | 0.030 | 0.904 | 0.059 |
ASPP [22] | 0.888 | 0.068 | 0.905 | 0.032 | 0.957 | 0.029 | 0.909 | 0.057 |
D-ASPP[60] | 0.883 | 0.070 | 0.904 | 0.032 | 0.957 | 0.029 | 0.905 | 0.060 |
FEM(w/o DR) | 0.887 | 0.068 | 0.906 | 0.032 | 0.960 | 0.028 | 0.908 | 0.059 |
FEM(proposed) | 0.895 | 0.063 | 0.905 | 0.032 | 0.963 | 0.028 | 0.909 | 0.058 |
The FEM was used in Stages A and B to enlarge the receptive field of the feature map, so as to obtain sufficient contextual information to establish the dependency between the foreground and the background. As is well known, modules such as ASPP [22], Dense ASPP[60], and RFB [59] can have the same effect. We therefore replaced the proposed FEM with ASPP/Dense ASPP/RFB for the ablation study. From Table VII, it can be seen that the proposed FEM achieves the best accuracy in the COD/SOD tasks. Furthermore, we removed the dilation rates in the FEM (FEM w/o DR) and found that the performance dropped. It can be concluded that the consecutive atrous convolutions with different dilation rates are helpful, and that the proposed FEM can outperform the other well-known modules designed for enlarging the receptive field.
4.8.2 Difference Guidance Module (DGM)
Methods | CAMO | COD10K | ECSSD | PASCAL-S | ||||
---|---|---|---|---|---|---|---|---|
w/o DGM | 0.887 | 0.068 | 0.898 | 0.038 | 0.950 | 0.032 | 0.901 | 0.060 |
DGM(proposed) | 0.895 | 0.063 | 0.905 | 0.032 | 0.963 | 0.028 | 0.909 | 0.058 |
In Stage C, the DGM is utilized to enhance the foreground features from the two branches. The DGM can be regarded as a plug-and-play module. Table VIII lists the results of the difference-aware decoder with/without the DGM. It can be seen that the features are enhanced by the guide map in a way similar to cross-attention, and the prior knowledge of the foreground and the background contained in the guide map can be used to screen the features, which is the key to improving the model's performance.
4.8.3 Difference Enhancement Module (DEM)
Methods | CAMO | COD10K | ECSSD | PASCAL-S | ||||
---|---|---|---|---|---|---|---|---|
F | 0.868 | 0.073 | 0.903 | 0.033 | 0.958 | 0.029 | 0.905 | 0.061 |
B | 0.885 | 0.070 | 0.900 | 0.033 | 0.954 | 0.031 | 0.910 | 0.058 |
F-B(proposed) | 0.895 | 0.063 | 0.905 | 0.032 | 0.963 | 0.028 | 0.909 | 0.058 |
Also in Stage C, the DEM is proposed to further enhance the difference between the foreground and the background. We chose the foreground features alone (F), the background features alone (B), and the subtraction of the two (F-B) as the output features. From Table IX, it is clear that F-B achieves the best results, which validates the original intent of the proposed fusion design. It can be concluded that the difference-aware fusion of the two can more efficiently distinguish the background and precisely extract the foreground objects.
4.8.4 Repeat of DAE
Methods | CAMO | COD10K | ECSSD | PASCAL-S | ||||
---|---|---|---|---|---|---|---|---|
1 | 0.887 | 0.068 | 0.902 | 0.033 | 0.955 | 0.029 | 0.907 | 0.058 |
3 | 0.881 | 0.071 | 0.904 | 0.032 | 0.955 | 0.030 | 0.909 | 0.059 |
2(proposed) | 0.895 | 0.063 | 0.905 | 0.032 | 0.963 | 0.028 | 0.909 | 0.058 |
The DAE used in Stage C can be regarded as a plug-and-play module, and can be utilized multiple times. As can be observed in Table X, repeating the DAE three times leads to a performance degradation. From our understanding, the map generated from the output of each DAE is regarded as the guide map for the next extractor. However, after two repetitions, the map output by the DAE is already very similar to the feature map from the MFF module, so a further DAE fails to improve the difference features. From Table X, it is apparent that repeating the DAE twice obtains the best results.
5 Conclusion
In this paper, inspired by the way human eyes detect objects of interest, we have proposed a unified dual-branch decoder paradigm dubbed the difference-aware decoder. The proposed difference-aware decoder consists of three stages. Stage A is used to generate a guide map from layers 1 and 5 of the backbone. In Stage B, layers 2–4 are used to generate the background-aware features. In Stage C, the two outputs are fused to generate the enhanced feature maps between the foreground and the background. The extensive experiments conducted in this study confirmed the superiority of the proposed difference-aware decoder for the SOD, COD, polyp segmentation, and mirror detection tasks. Furthermore, we also proved the effectiveness of the different modules, including the FEM in Stage A, the MFF module in Stage B, and the DGM and DEM in Stage C. In the future, we will extend the proposed difference-aware decoder to other binary segmentation tasks, such as road extraction.
References
- [1] J.-X. Zhao, J.-J. Liu, D.-P. Fan, Y. Cao, J. Yang, and M.-M. Cheng, “Egnet: Edge guidance network for salient object detection,” in CVPR, 2019, pp. 8779–8788.
- [2] D.-P. Fan, G.-P. Ji, G. Sun, M.-M. Cheng, J. Shen, and L. Shao, “Camouflaged object detection,” in CVPR, 2020, pp. 2777–2787.
- [3] D.-P. Fan, G.-P. Ji, T. Zhou, G. Chen, H. Fu, J. Shen, and L. Shao, “Pranet: Parallel reverse attention network for polyp segmentation,” in MICCAI. Springer, 2020, pp. 263–273.
- [4] X. Yang, H. Mei, K. Xu, X. Wei, B. Yin, and R. W. Lau, “Where is my mirror?” in CVPR, 2019, pp. 8809–8818.
- [5] V. Mahadevan and N. Vasconcelos, “Saliency-based discriminant tracking,” in CVPR. IEEE, 2009, pp. 1007–1013.
- [6] C. Craye, D. Filliat, and J.-F. Goudou, “Environment exploration for object-based visual saliency learning,” in ICRA. IEEE, 2016, pp. 2303–2309.
- [7] T. Valentijn, J. Margutti, M. van den Homberg, and J. Laaksonen, “Multi-hazard and spatial transferability of a cnn for automated building damage assessment,” Remote Sensing, vol. 12, no. 17, p. 2839, 2020.
- [8] D.-P. Fan, T. Zhou, G.-P. Ji, Y. Zhou, G. Chen, H. Fu, J. Shen, and L. Shao, “Inf-net: Automatic covid-19 lung infection segmentation from ct images,” IEEE TMI, vol. 39, no. 8, pp. 2626–2637, 2020.
- [9] C. J. Lin and Y. T. Prasetyo, “A metaheuristic-based approach to optimizing color design for military camouflage using particle swarm optimization,” Color Research & Application, vol. 44, no. 5, pp. 740–748, 2019.
- [10] X. Qin, D. Fan, C. Huang, C. Diagne, Z. Zhang, A. C. Sant’Anna, A. Suàrez, M. Jägersand, and L. Shao, “Boundary-aware segmentation network for mobile and web applications,” CoRR, vol. abs/2101.04704, 2021. [Online]. Available: https://arxiv.org/abs/2101.04704
- [11] D.-P. Fan, G.-P. Ji, M.-M. Cheng, and L. Shao, “Concealed object detection,” IEEE TPAMI, pp. 1–1, 2021.
- [12] B. Dong, W. Wang, D.-P. Fan, J. Li, H. Fu, and L. Shao, “Polyp-pvt: Polyp segmentation with pyramid vision transformers,” arXiv preprint arXiv:2108.06932, 2021.
- [13] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Advances in neural information processing systems, vol. 25, 2012.
- [14] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in CVPR, 2015, pp. 3431–3440.
- [15] Z. Zhang, X. Zhang, C. Peng, X. Xue, and J. Sun, “Exfuse: Enhancing feature fusion for semantic segmentation,” in ECCV, September 2018.
- [16] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
- [17] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016, pp. 770–778.
- [18] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
- [19] S. Gao, M.-M. Cheng, K. Zhao, X.-Y. Zhang, M.-H. Yang, and P. H. Torr, “Res2net: A new multi-scale backbone architecture,” IEEE TPAMI, 2019.
- [20] W. Wang, E. Xie, X. Li, D.-P. Fan, K. Song, D. Liang, T. Lu, P. Luo, and L. Shao, “Pyramid vision transformer: A versatile backbone for dense prediction without convolutions,” in ICCV, 2021, pp. 568–578.
- [21] Z. Wu, L. Su, and Q. Huang, “Cascaded partial decoder for fast and accurate salient object detection,” in CVPR, 2019, pp. 3907–3916.
- [22] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE TPAMI, vol. 40, no. 4, pp. 834–848, 2017.
- [23] F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” arXiv preprint arXiv:1511.07122, 2015.
- [24] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder-decoder with atrous separable convolution for semantic image segmentation,” in ECCV, 2018, pp. 801–818.
- [25] H. Zhang, K. Dana, J. Shi, Z. Zhang, X. Wang, A. Tyagi, and A. Agrawal, “Context encoding for semantic segmentation,” in CVPR, 2018, pp. 7151–7160.
- [26] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in CVPR, 2017, pp. 2881–2890.
- [27] C. Peng, X. Zhang, G. Yu, G. Luo, and J. Sun, “Large kernel matters–improve semantic segmentation by global convolutional network,” in CVPR, 2017, pp. 4353–4361.
- [28] J. He, Z. Deng, L. Zhou, Y. Wang, and Y. Qiao, “Adaptive pyramid context network for semantic segmentation,” in CVPR, 2019, pp. 7519–7528.
- [29] J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, and H. Lu, “Dual attention network for scene segmentation,” in CVPR, 2019, pp. 3146–3154.
- [30] G. Lin, A. Milan, C. Shen, and I. Reid, “Refinenet: Multi-path refinement networks for high-resolution semantic segmentation,” in CVPR, 2017, pp. 1925–1934.
- [31] Q. Hou, M.-M. Cheng, X. Hu, A. Borji, Z. Tu, and P. H. Torr, “Deeply supervised salient object detection with short connections,” in CVPR, 2017, pp. 3203–3212.
- [32] N. Liu, J. Han, and M.-H. Yang, “Picanet: Learning pixel-wise contextual attention for saliency detection,” in CVPR, 2018, pp. 3089–3098.
- [33] H. Mei, G.-P. Ji, Z. Wei, X. Yang, X. Wei, and D.-P. Fan, “Camouflaged object segmentation with distraction mining,” in CVPR, 2021, pp. 8772–8781.
- [34] J. Zhao, J.-J. Liu, D.-P. Fan, Y. Cao, J. Yang, and M.-M. Cheng, “Egnet: Edge guidance network for salient object detection,” in ICCV, 2019, pp. 8778–8787.
- [35] A. Li, J. Zhang, Y. Lv, B. Liu, T. Zhang, and Y. Dai, “Uncertainty-aware joint salient object and camouflaged object detection,” in CVPR, 2021, pp. 10 071–10 081.
- [36] Y. Lv, J. Zhang, Y. Dai, A. Li, B. Liu, N. Barnes, and D.-P. Fan, “Simultaneously localize, segment and rank the camouflaged objects,” in CVPR, 2021, pp. 11 591–11 601.
- [37] C. J. Lin, C.-C. Chang, and Y.-H. Lee, “Evaluating camouflage design using eye movement data,” Applied ergonomics, vol. 45, no. 3, pp. 714–723, 2014.
- [38] C.-H. Huang, H.-Y. Wu, and Y.-L. Lin, “Hardnet-mseg: A simple encoder-decoder polyp segmentation neural network that achieves over 0.9 mean dice and 86 fps,” arXiv preprint arXiv:2101.07172, 2021.
- [39] R. Zhang, G. Li, Z. Li, S. Cui, D. Qian, and Y. Yu, “Adaptive context selection for polyp segmentation,” in MICCAI. Springer, 2020, pp. 253–262.
- [40] K. Patel, A. M. Bur, and G. Wang, “Enhanced u-net: A feature enhancement network for polyp segmentation,” in CRV. IEEE, 2021, pp. 181–188.
- [41] J. Wei, Y. Hu, R. Zhang, Z. Li, S. K. Zhou, and S. Cui, “Shallow attention network for polyp segmentation,” in MICCAI. Springer, 2021, pp. 699–708.
- [42] R. Zhao, W. Ouyang, H. Li, and X. Wang, “Saliency detection by multi-context deep learning,” in CVPR, 2015, pp. 1265–1274.
- [43] G. Li and Y. Yu, “Visual saliency based on multiscale deep features,” in CVPR, 2015, pp. 5455–5463.
- [44] N. Liu and J. Han, “Dhsnet: Deep hierarchical saliency network for salient object detection,” in CVPR, 2016, pp. 678–686.
- [45] G. Lee, Y.-W. Tai, and J. Kim, “Deep saliency with encoded low level distance map and high level features,” in CVPR, 2016, pp. 660–668.
- [46] A. Borji, M.-M. Cheng, Q. Hou, H. Jiang, and J. Li, “Salient object detection: A survey,” Computational visual media, vol. 5, no. 2, pp. 117–150, 2019.
- [47] X. Qin, Z. Zhang, C. Huang, C. Gao, M. Dehghan, and M. Jagersand, “Basnet: Boundary-aware salient object detection,” in CVPR, 2019, pp. 7479–7489.
- [48] Y. Pang, X. Zhao, L. Zhang, and H. Lu, “Multi-scale interactive network for salient object detection,” in CVPR, 2020, pp. 9413–9422.
- [49] W. Wang, S. Zhao, J. Shen, S. C. Hoi, and A. Borji, “Salient object detection with pyramid attention and salient edges,” in CVPR, 2019, pp. 1448–1457.
- [50] X. Zhao, Y. Pang, L. Zhang, H. Lu, and L. Zhang, “Suppress and balance: A simple gated network for salient object detection,” in ECCV. Springer, 2020, pp. 35–51.
- [51] Z. Liu, Z. Zhang, and W. Wu, “Boosting camouflaged object detection with dual-task interactive transformer,” arXiv preprint arXiv:2205.10579, 2022.
- [52] H. Mei, G.-P. Ji, Z. Wei, X. Yang, X. Wei, and D.-P. Fan, “Camouflaged object segmentation with distraction mining,” in CVPR, June 2021, pp. 8772–8781.
- [53] Q. Zhai, X. Li, F. Yang, C. Chen, H. Cheng, and D.-P. Fan, “Mutual graph learning for camouflaged object detection,” in CVPR, 2021, pp. 12 997–13 007.
- [54] Y. Sun, G. Chen, T. Zhou, Y. Zhang, and N. Liu, “Context-aware cross-level fusion network for camouflaged object detection,” arXiv preprint arXiv:2105.12555, 2021.
- [55] Z. Zhang, Z. Cui, C. Xu, Y. Yan, N. Sebe, and J. Yang, “Pattern-affinitive propagation across depth, surface normal and semantic segmentation,” in CVPR, 2019, pp. 4106–4115.
- [56] M. Zhen, J. Wang, L. Zhou, S. Li, T. Shen, J. Shang, T. Fang, and L. Quan, “Joint semantic segmentation and boundary detection using iterative pyramid contexts,” in CVPR, 2020, pp. 13 666–13 675.
- [57] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Object detectors emerge in deep scene cnns,” arXiv preprint arXiv:1412.6856, 2014.
- [58] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Semantic image segmentation with deep convolutional nets and fully connected crfs,” arXiv preprint arXiv:1412.7062, 2014.
- [59] S. Liu, D. Huang et al., “Receptive field block net for accurate and fast object detection,” in ECCV, 2018, pp. 385–400.
- [60] M. Yang, K. Yu, C. Zhang, Z. Li, and K. Yang, “Denseaspp for semantic segmentation in street scenes,” in CVPR, 2018, pp. 3684–3692.
- [61] W. Luo, Y. Li, R. Urtasun, and R. Zemel, “Understanding the effective receptive field in deep convolutional neural networks,” Advances in neural information processing systems, vol. 29, 2016.
- [62] J. Wei, S. Wang, and Q. Huang, “F3net: Fusion, feedback and focus for salient object detection,” in AAAI, vol. 34, no. 07, 2020, pp. 12 321–12 328.
- [63] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
- [64] D.-P. Fan, M.-M. Cheng, Y. Liu, T. Li, and A. Borji, “Structure-measure: A new way to evaluate foreground maps,” in ICCV, 2017, pp. 4548–4557.
- [65] D.-P. Fan, G.-P. Ji, X. Qin, and M.-M. Cheng, “Cognitive vision inspired object segmentation metric and loss function,” SCIENTIA SINICA Informationis, vol. 6, 2021.
- [66] R. Margolin, L. Zelnik-Manor, and A. Tal, “How to evaluate foreground maps?” in CVPR, 2014, pp. 248–255.
- [67] F. Milletari, N. Navab, and S.-A. Ahmadi, “V-net: Fully convolutional neural networks for volumetric medical image segmentation,” in 3DV. IEEE, 2016, pp. 565–571.
- [68] C. Yang, L. Zhang, H. Lu, X. Ruan, and M.-H. Yang, “Saliency detection via graph-based manifold ranking,” in CVPR, 2013, pp. 3166–3173.
- [69] Y. Li, X. Hou, C. Koch, J. M. Rehg, and A. L. Yuille, “The secrets of salient object segmentation,” in CVPR, 2014, pp. 280–287.
- [70] Q. Yan, L. Xu, J. Shi, and J. Jia, “Hierarchical saliency detection,” in CVPR, 2013, pp. 1155–1162.
- [71] V. Movahedi and J. H. Elder, “Design and perceptual validation of performance measures for salient object segmentation,” in CVPR WorkShops. IEEE, 2010, pp. 49–56.
- [72] L. Wang, H. Lu, Y. Wang, M. Feng, D. Wang, B. Yin, and X. Ruan, “Learning to detect salient objects with image-level supervision,” in CVPR, 2017, pp. 136–145.
- [73] J.-J. Liu, Q. Hou, Z.-A. Liu, and M.-M. Cheng, “Poolnet+: Exploring the potential of pooling for salient object detection,” IEEE TPAMI, 2022.
- [74] J.-J. Liu, Q. Hou, M.-M. Cheng, J. Feng, and J. Jiang, “A simple pooling-based design for real-time salient object detection,” in CVPR, 2019.
- [75] J. Su, J. Li, Y. Zhang, C. Xia, and Y. Tian, “Selectivity or invariance: Boundary-aware salient object detection,” in CVPR, 2019, pp. 3799–3808.
- [76] J.-J. Liu, Q. Hou, and M.-M. Cheng, “Dynamic feature integration for simultaneous detection of salient object, edge, and skeleton,” IEEE TIP, vol. 29, pp. 8652–8667, 2020.
- [77] S.-H. Gao, Y.-Q. Tan, M.-M. Cheng, C. Lu, Y. Chen, and S. Yan, “Highly efficient salient object detection with 100k parameters,” in ECCV. Springer, 2020, pp. 702–721.
- [78] B. Xu, G. Liu, H. Huang, C. Lu, and Y. Guo, “Semantic distillation guided salient object detection,” arXiv preprint arXiv:2203.04076, 2022.
- [79] P. Skurowski, H. Abdulameer, J. Błaszczyk, T. Depta, A. Kornacki, and P. Kozieł, “Animal camouflage analysis: Chameleon database,” Unpublished Manuscript, vol. 2, no. 6, p. 7, 2018.
- [80] T.-N. Le, T. V. Nguyen, Z. Nie, M.-T. Tran, and A. Sugimoto, “Anabranch network for camouflaged object segmentation,” CVIU, vol. 184, pp. 45–56, 2019.
- [81] F. Yang, Q. Zhai, X. Li, R. Huang, A. Luo, H. Cheng, and D.-P. Fan, “Uncertainty-guided transformer reasoning for camouflaged object detection,” in CVPR, 2021, pp. 4146–4155.
- [82] C. Li and G. Jiao, “Einet: camouflaged object detection with pyramid vision transformer,” Journal of Electronic Imaging, vol. 31, no. 5, p. 053002, 2022.
- [83] J. Bernal, F. J. Sánchez, G. Fernández-Esparrach, D. Gil, C. Rodríguez, and F. Vilariño, “Wm-dova maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians,” Computerized medical imaging and graphics, vol. 43, pp. 99–111, 2015.
- [84] N. Tajbakhsh, S. R. Gurudu, and J. Liang, “Automated polyp detection in colonoscopy videos using shape and context information,” IEEE transactions on medical imaging, vol. 35, no. 2, pp. 630–644, 2015.
- [85] J. Silva, A. Histace, O. Romain, X. Dray, and B. Granado, “Toward embedded detection of polyps in wce images for early diagnosis of colorectal cancer,” International journal of computer assisted radiology and surgery, vol. 9, no. 2, pp. 283–293, 2014.
- [86] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in MICCAI. Springer, 2015, pp. 234–241.
- [87] Z. Zhou, M. M. Rahman Siddiquee, N. Tajbakhsh, and J. Liang, “Unet++: A nested u-net architecture for medical image segmentation,” in DLMIA. Springer, 2018, pp. 3–11.
- [88] J. Lin, G. Wang, and R. W. Lau, “Progressive mirror detection,” in CVPR, 2020, pp. 3697–3705.
- [89] H. Guan, J. Lin, and R. W. Lau, “Learning semantic associations for mirror detection,” in CVPR, June 2022, pp. 5941–5950.
- [90] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” in CVPR, 2017, pp. 1492–1500.
- [91] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in CVPR. IEEE, 2009, pp. 248–255.
- [92] P. Krähenbühl and V. Koltun, “Efficient inference in fully connected crfs with gaussian edge potentials,” Advances in neural information processing systems, vol. 24, 2011.
- [93] H. Wu, J. Zhang, K. Huang, K. Liang, and Y. Yu, “Fastfcn: Rethinking dilated convolution in the backbone for semantic segmentation,” arXiv preprint arXiv:1903.11816, 2019.