Receptive Field Broadening and Boosting for Salient Object Detection

Mingcan Ma¹², Changqun Xia^{2 $*$}, Chenxi Xie¹², Xiaowu Chen¹, Jia Li¹²¹¹1Correspondence should be addressed to Changqun Xia and Jia Li. URL: http://cvteam.net.

Abstract

Salient object detection requires a comprehensive and scalable receptive field to locate the visually significant objects in the image. Recently, the emergence of visual transformers and multi-branch modules has significantly enhanced the ability of neural networks to perceive objects at different scales. However, compared to the traditional backbone, the calculation process of transformers is time-consuming. Moreover, different branches of the multi-branch modules could cause the same error back propagation in each training iteration, which is not conducive to extracting discriminative features. To solve these problems, we propose a bilateral network based on transformer and CNN to efficiently broaden local details and global semantic information simultaneously. Besides, a Multi-Head Boosting (MHB) strategy is proposed to enhance the specificity of different network branches. By calculating the errors of different prediction heads, each branch can separately pay more attention to the pixels that other branches predict incorrectly. Moreover, Unlike multi-path parallel training, MHB randomly selects one branch each time for gradient back propagation in a boosting way. Additionally, an Attention Feature Fusion Module (AF) is proposed to fuse two types of features according to respective characteristics. Comprehensive experiments on five benchmark datasets demonstrate that the proposed method can achieve a significant performance improvement compared with the state-of-the-art methods.

Refer to caption — Figure 1: Comparison of the two types of models. From left to right, it shows the receptive field, features, frequency domain signals and efficiency. ResNet-18 (He et al. 2015) and Swin-Transformer (Liu et al. 2021b) are selected as the basic network to extract details and deep semantics respectively.

Introduction

Recently, CNN-based methods have made great progress in the SOD field due to the powerful feature representation. However, due to the limitations of the convolutional receptive field, it is difficult for the CNN networks to break through the existing performance. The emergence of visual transformers has broken the current visual attention mechanism and can understand images from an overall perspective. However, the price of a larger receptive field is more computational cost, especially when the input image resolution is larger.

Actually, CNN and transformers have their own distinctiveness. As shown in Fig. 1, we observe that: 1. Features are different in the frequency domain. CNN features reveal mostly high frequency signals, while transformer features contain mostly low-frequency signals. 2. The computational cost increases at different rates as the input resolution increases. Two types of models can both run fast on low-resolution images, but when the resolution is large, the calculation cost of Transformer is much higher than that of CNN. 3. There are differences in regard to the receptive field mechanism. CNN expands the receptive field from the local to the whole image through a bottom-up calculation method, while Transformer obtains global information directly from the overall perspective.

Based on the observation of these distinctiveness, we further analyzed the feasibility of complementarity between CNN and transformers from three aspects: 1. The signal difference in frequency domain indicates that the feature characteristics of the two models can be complementary. 2. The processing difference in attention mechanism reflects that their ways of understanding images can be complementary. 3. The difference in processing speed reveals that the input sizes can be complementary. Therefore, we consider using a transformer to extract global semantic information and CNN to extract detailed information together. In this way, transformer does not require a large input size and CNN does not require a deeper convolutional network, which can achieve a better balance of computational costs and performance. To this end, we propose to build an effective bilateral network to balance two kinds of requirements. The transformer model with low-resolution input is used to obtain low-frequency semantic information, and the lightweight CNN with high-resolution input is used to analyze high-frequency details. It is worth noting that the characteristic of the CNN models is that they can establish a perfect channel connection but the spatial receptive field is insufficient, while the transformer models can establish a full-space connection but the channel connection is relatively weak. So, Attention Fusion Module (AF) is proposed to explore a better integration strategy of two types of features.

However, when the input resolution becomes smaller, the transformer’s global attention mechanism degenerates into an association between regions, which cannot capture enough signal range. Therefore, the goal of increasing the receptive field cannot be totally dependent on the transformer branch. To enhance the receptive field of CNN across regions, multiscale modules such as ASPP (Chen et al. 2017) and PPM (Zhao et al. 2017) are good choices due to the strong ability of extracting features of different scales in different branches. But remarkably, the current multi-branch module adopts the strategy of parallel training and synchronous error back propagation. This method is not conducive to extracting discriminative features separately for different branches.

Therefore, we put forward a novel strategy named Multi-Head Boosting (MHB) as shown in Figure 2. MHB mainly has the following improvements: 1. Forward propagation activates all paths, and the error back propagation process only activates a random branch. 2. A separate prediction head is set for each branch, and each branch is back-propagated by pixel weighting to focus on the error parts of other branches. With MHB, we can effectively construct the association between different regions. The combination of bilateral network and MHB can effectively enhance the receptive field of the network and thus achieve higher SOD performance.

Finally, we design ablation experiments to verify the effectiveness of the proposed bilateral framework and multi-head boosting strategy. The results of comprehensive comparison experiments show that our method has greatly exceeded the current state-of-the-art methods. Our main contributions can be summarized as follows:

•

We propose to construct an effective bilateral network based on CNN and transformer. The lightweight CNN is responsible for fast detail extraction at high resolution, and the transformer model can efficiently generate global correlation features at low resolution input.
•

We propose an attention feature fusion module (AF) to fuse the features of the two types of models. AF considers the advantages of the two features in space and channel dimensions to strengthen each other.
•

We propose a novel multi-head boosting strategy, which makes each branch pay more attention to the pixel position of each other’s prediction error and breaks the synchronization relationship of the error back-propagation process.
•

The proposed method achieves a great performance improvement over the state-of-the-art methods. In particular, The MAE metric of our method reach 0.021 on ECSSD and 0.040 on DUT-OMRON, which is 29% and 20% lower than the previous methods.

Related Work

Efficient Two-Stream Network Designs

Computer vision tasks often need to extract detailed information and semantic information of an image. Detailed information does not depend on the depth of the network but requires larger input resolution, while semantic information is just the opposite. In order to balance these two processes, the dual-branch network is proposed to obtain these two features separately, such as BiSeNet (Yu et al. 2018). This method can better balance efficiency and performance. However, the backbones used by the two branches are often similar convolutional neural networks, which often brings a lot of redundancy. For example, StdcNet (Fan et al. 2021) propose that better results can be obtained without a separate detail branch. In fact, this two-stream fusion idea is not only suitable for the fusion between convolutional neural networks. Recently, more and more visual backbones have been proposed, such as ResNet (He et al. 2015), MobileNet (Howard et al. 2017), StdcNet (Fan et al. 2021), ShuffleNet (Zhang et al. 2018), Swin Transformer (Liu et al. 2021b), T2T-VIT (Yuan et al. 2021), etc. These models contain efficient lightweight CNN and high computational cost networks based on visual self-attention. Compared with fusing different convolutional neural networks, we propose to fuse CNN and transformer frameworks with different but complementary mechanisms.

Multi-scale perception

The multi-scale perception ability of neural networks often depends on the range of the receptive field. The multi-branch module is an effective method to enhance the receptive field of convolutional neural networks. For example, Pyramid Pooling Moudle (Zhao et al. 2017) and Atrous Spatial Pyramid Pooling (Chen et al. 2017) enhance the local scale perception ability of the network by designing multi-branch modules with different receptive fields. PoolNet (Liu et al. 2019) has used PPM for SOD for the first time and achieved good results. BaNet (Su et al. 2019) proposes to improve ASPP and achieve the latest new capabilities at the time. The above methods can effectively improve the expressive ability of the network, but there are still shortcomings. The original intention of the multi-branch model is to allow different branches to capture specific features. However, most of the previous methods use a synchronous error back propagation strategy and seldom consider the specificity of the branch separately. In contrast, We use a similar multi-model boosting idea to achieve mutual enhancement of multiple branches. Each branch calculates the error separately and weights the error areas of the other branches.

Accurate Detection of Salient Objects

In the past three years, the upper limit of SOD performance has been continuously broken by the latest methods. For example, F3Net (Wei, Wang, and Huang 2019) proposes Fusion, Feedback and Focus to detect salient objects in 2019 and has achieved the best performance at the time. The MAE reach 0.053 on the DUT-OMRON dataset. In 2020, LDF (Wei et al. 2020) decouples pixels based on the distance from the edge and iteratively optimizes the predicted maps, and its MAE on the same dataset reaches 0.051. In 2021, PA-KRN (Xu et al. 2021) proposes a strategy of positioning first, then segmentation, and its MAE reaches 0.050. VST (Liu et al. 2021a) uses the transformer framework for SOD for the first time, and its MAE reach 0.058. From the above results, it can be seen that the performance of SOD is getting better and better, but the improvement is very slow. We explore the differences between CNN and transformers in terms of features, computational complexity, and attention mechanisms. Based on this, we design a bilateral network and an Attention Feature Fusion Module to balance the two models and propose a Multi-Head Boosting method to compensate for the lack of association between regions. Due to the complementary combination, our method has achieved good results in both performance and efficiency. In particular, the MAE metric of our method on ECSSD has reached 0.040.

Proposed Method

In this section, we will introduce the details of the composition of the bilateral network, the Attention Feature Fusion Module, and the Multi-Head Boosting strategy. The training process of the model is shown in Fig. 3. The verification process only needs to add up the prediction results of each branch and then pass the activation function.

Bilateral Network

As can be seen from Fig. 1, CNN and transformers have both distinctiveness and complementarity. Therefore, we propose a bilateral network as shown in Fig. 3 to fully utilize the complementary advantages. The bilateral network contains a semantic branch and a detail branch. The semantic branch makes full use of the global view of the transformer model to extract global semantic features and enhance the receptive field of the overall network. The detail branch is responsible for extracting high frequency details. The ingenuity of the bilateral network is that it can take advantage of the two types of models at the lowest computational cost. The complex semantic branch is mainly responsible for extracting global attention information so it can be calculated under low-resolution input. The detail branch only requires a lightweight CNN as basic network. Therefore, the combination of transformer and CNN can efficiently balance semantics, details, and their calculation cost.

Specifically, we choose ResNet-18 (He et al. 2015) to build the detail branch. In order to enrich the detailed information extracted by this branch, we remove the first down-sampling operation to obtain more detailed features at high resolution. In this way, the resolution of the features output by each layer is $1/2$ , $1/4$ , $1/8$ , and $1/16$ of the initial resolution. As for the semantic branch, we utilize Swin Transformer (Liu et al. 2021b) to extract global features. This branch first samples the image to a resolution of $56\times 56$ , and then passes through the self-attention modules of the transformer in turn.

The working process of proposed bilateral network can be summarized as follows: Input the $352\times 352$ resolution image into the lightweight CNN to obtain multi-level detailed features $R_{1}$ , $R_{2}$ , $R_{3}$ and $R_{4}$ . Besides, the input image is down-sampled to a resolution of $56\times 56$ . Then we input it to the semantic branch to obtain features $T_{1}$ , $T_{2}$ , $T_{3}$ , and $T_{4}$ and reshape the features. Finally, Use the attention feature fusion module $AF_{i}$ to fuse $T_{i}$ and $R_{i}$ , where $i\in\{1,2,3,4\}$ .

Attention Feature Fusion Module

Unlike the features of CNN, the features of transformer model are composed of vectors stretched by pixel blocks. Given the transformer features, we first restore the previous positional relationship of the vector to get feature $X$ of Fig. 4. The transformer model calculates the correlation between all patch blocks in the space through the vector inner product, while CNN establishes the connection between all channels in the local space. The former constructs better spatial correlation, while the latter has stronger channel correlation. In order to better integrate the two types of features, we design an Attention Feature Fusion Module (AF). AF uses the respective characteristics of the two types of features to enhance the expression of each other. The specific process can be expressed as:

	$\displaystyle F_{mid}$	$\displaystyle=\mathcal{C}_{br}(\mathcal{U}(\mathcal{R}_{ca}(Y)\otimes X)ⓒ(\mathcal{R}_{sa}(X)\otimes Y)),$		(1)
	$\displaystyle F_{out}$	$\displaystyle=\mathcal{C}_{br}(\mathcal{U}(\mathcal{C}(X))\oplus F_{mid}\oplus\mathcal{C}(Y)),$		(2)

where $X$ and $Y$ correspond to the features in Fig. 4. $\mathcal{R}_{sa}$ and $\mathcal{R}_{ca}$ represent the operations corresponding to the crossed arrows in the figure. $\mathcal{U}$ represents the upsampling operation. $\mathcal{C}$ represents convolution, and the subscript $br$ represents a Batch Normalization and a ReLU activation. $\oplus$ , $\otimes$ represent pixel-wise addition and multiplication operations, respectively. $ⓒ$ denotes feature concatenation operation. $F_{mid}$ is the intermediate result, $F_{out}$ is the final output.

Specifically, the feature $X$ obtained by transformer contains rich semantic information, while the low level details are lost. CNN feature $Y$ is obtained by convolution, the spatial receptive field is smaller, but the correlation between channels is stronger. Therefore, we use the features of $X$ to select the spatial Features of $Y$ , and use the channel information of $Y$ to enhance $X$ . Finally, the final result is obtained by fusing all the information through the residual structure.

Multi-Head Boosting

The design idea of Multi-Head Boosting (MHB) is to improve the overall effect of the multi-branch module by decoupling the training process and adding branch complementary losses. It can be roughly divided into two parts: multi-path decoupling and boosting loss. Next, we will introduce these two processes in detail.

Multi-path decoupling aims to decouple multi-branch modules trained in parallel into multiple models. The significance is that the training process of different branches can be isolated, thereby reducing the gradient correlation between each branch and enhancing the scale specificity of different branches. ASPP integrates multi-channel features in the module, single-channel training cannot be performed directly, therefore we cannot use multi-channel features in the testing process. Our solution is to delay feature merging to the final result layer. In this way, the purpose of single-channel training and multi-channel testing can be realized.

Boosting Loss aims to enhance the complementarity between branches. As shown in Fig. 3, the forward propagation process will activate all branches and obtain $N$ predicted maps $\{P_{i}|i=1,2,3,...,N\}$ . The random number $X$ is used to determine which branch performs error back propagation, and the prediction errors of other branches are regarded as weight. In this way, each branch will pay attention to the mispredicted parts of other branches. The calculation process of weight can be expressed as:

\displaystyle W_{x}=\sum_{i=1}^{N}\mathcal{L}_{bce}(P_{i},g)(i\neq X)+1,

(3)

where $g$ represents the ground-truth map. $\mathcal{L}_{bce}$ denotes the pixel-level binary cross-entropy error, which can be expressed as:

\displaystyle\mathcal{L}_{bce}(p,g)=-((g\otimes log(p))\oplus(\overline{g}\otimes log(\overline{p}))),

(4)

where $p$ denotes predicted map. $log$ represents a pixel-by-pixel logarithmic operation. $\overline{g}$ and $\overline{p}$ denote the inversion operation pixel by pixel. The boosting loss can be calculated according to the calculated weight. It can be expressed as:

\displaystyle\mathcal{L}_{b}(p,g,w)=\mathcal{L}_{wbce}(p,g,w)+\mathcal{L}_{wiou}(p,g,w),

(5)

where $\mathcal{L}_{wbce}$ and $\mathcal{L}_{wiou}$ represent weighted binary cross entropy loss and weighted intersection of union loss. They can be expressed as:

\displaystyle\mathcal{L}_{wbce}(p,g,w)=\frac{\mathcal{S}(\mathcal{L}_{bce}(p,g)\otimes w)}{\mathcal{S}(w)},

(6)

\displaystyle\mathcal{L}_{wiou}(p,g,w)=1-\frac{\mathcal{S}(p\otimes g\otimes w)}{\mathcal{S}((p\oplus g)\otimes w)-\mathcal{S}(p\otimes g\otimes w)},

(7)

where $\mathcal{S}$ represents the operation of summing all pixels. Other symbols are consistent with the previous description.

In general, MHB decomposes the training process of multi-scale modules so that multiple branches receive training with random samples and relieves the mutual influence of the error back propagation process. BL further enhances the complementarity between each branch. The experiments prove that MHB and BL can effectively improve SOD performance.

Loss Function

We utilize the sum of Binary Cross Entropy (BCE) and Intersection of Union (IoU) as the loss function, which is widely used in LDF (Wei et al. 2020), etc. In addition to the final prediction map, we will also supervise the output features of each Convolution Block (CB) in Fig. 4. The total loss function can be expressed as:

\displaystyle\mathcal{L}_{tot}=\sum_{i=1}^{M}\mathcal{L}_{bi}(F_{i},g)+\mathcal{L}_{b}(P_{4},g,W_{x}),

(8)

where $F_{i}$ represents the prediction map of each CB module, and $P_{4}$ represents the final result. $M$ denotes the number of CB modules. $\mathcal{L}_{bi}$ can be expressed as:

\displaystyle\mathcal{L}_{bi}(p,g)=\mathcal{L}_{wbce}(p,g,W_{1})+\mathcal{L}_{wiou}(p,g,W_{1}),

(9)

where $W_{1}$ represents a matrix with all $1$ values, which means that the auxiliary loss is not weighted.

Experiment

Experiment setting

The evaluation indicators are mainly as follows: Mean Absolute Error (MAE), max and mean F-measure ( $F_{\beta}^{*},F_{\beta}^{m}$ ) (Yang et al. 2013), Maximum enhanced-alignment measure ( $E_{\xi}$ ) (Fan et al. 2018), structure measure ( $S_{m}$ ) (Fan et al. 2017) and precision-recall (PR).

The experiment involves the following datasets: DUT-OMRON (Yang et al. 2013) (5168), ECSSD (Yan et al. 2013) (1000), HKU-IS (Li and Yu 2015) (4447), PASCAL-S (Li et al. 2014) (850), DUTS-TE (Wang et al. 2017) (5019), DUTS-TR (Wang et al. 2017) (10553). We choose DUTS-TR as the training set and other datasets as the test set.

Table 1: Quantitative comparison table with the latest methods on multiple indicators, including the max and mean F-measure (

F_{\beta}^{*}

and

F_{\beta}^{m}

the larger the better), MAE (the smaller the better), E-measure (

E_{\xi}

, the larger the better), and S-measure (

S_{m}

, the larger the better) . The best and second best results are marked in red, and blue, respectively.

	ECSSD (1000)					HKU-IS (4447)					DUTS-TE (5019)					DUT-OMRON (5168)					PASCAL-S (850)
Method	$F_{\beta}^{*}\uparrow$	$F_{\beta}^{m}\uparrow$	$mae\downarrow$	$E_{\xi}\uparrow$	$S_{m}\uparrow$	$F_{\beta}^{*}\uparrow$	$F_{\beta}^{m}\uparrow$	$mae\downarrow$	$E_{\xi}\uparrow$	$S_{m}\uparrow$	$F_{\beta}^{*}\uparrow$	$F_{\beta}^{m}\uparrow$	$mae\downarrow$	$E_{\xi}\uparrow$	$S_{m}\uparrow$	$F_{\beta}^{*}\uparrow$	$F_{\beta}^{m}\uparrow$	$mae\downarrow$	$E_{\xi}\uparrow$	$S_{m}\uparrow$	$F_{\beta}^{*}\uparrow$	$F_{\beta}^{m}\uparrow$	$mae\downarrow$	$E_{\xi}\uparrow$	$S_{m}\uparrow$
$BASNet_{19}$	.942	.880	.037	.921	.916	.928	.895	.032	.946	.909	.860	.791	.048	.884	.866	.805	.756	.056	.869	.836	.857	.775	.076	.847	.832
$PoolNet_{19}$	.944	.914	.039	.924	.922	.933	.896	.032	.949	.910	.880	.811	.040	.889	.878	.808	.746	.056	.863	.828	.869	.823	.074	.850	.847
$CPD_{19}$	.939	.917	.037	.925	.918	.925	.891	.034	.944	.905	.865	.805	.043	.887	.869	.797	.747	.056	.866	.825	.864	.824	.072	.849	.842
$BANet_{19}$	.945	.880	.035	.928	.916	.931	.895	.032	.950	.909	.872	.791	.040	.892	.866	.803	.756	.059	.860	.836	.870	.775	.070	.855	.832
$GateNet_{20}$	.945	.916	.040	.924	.920	.933	.899	.033	.949	.915	.888	.807	.040	.889	.885	.818	.746	.055	.862	.838	.875	.825	.068	.852	.852
$U2Net_{20}$	.951	.892	.033	.924	.928	.935	.896	.031	.948	.916	.873	.792	.045	.886	.874	.823	.761	.054	.871	.847	.862	.772	.076	.841	.836
$DFI_{20}$	.949	.920	.035	.924	.927	.934	.902	.031	.951	.920	.886	.814	.039	.892	.887	.818	.752	.055	.865	.839	.885	.837	.065	.857	.861
$MINet_{20}$	.947	.924	.033	.927	.925	.935	.909	.029	.953	.919	.884	.828	.037	.898	.884	.810	.755	.055	.865	.833	.865	.835	.064	.852	.851
$GCPANet_{20}$	.948	.919	.035	.920	.927	.938	.898	.031	.949	.920	.888	.817	.038	.891	.891	.812	.748	.056	.860	.839	.876	.833	.061	.850	.861
$ITSDNet_{20}$	.947	.895	.034	.927	.925	.934	.899	.031	.952	.917	.883	.804	.041	.895	.885	.821	.756	.061	.863	.840	.876	.792	.064	.853	.856
$LDF_{20}$	.950	.930	.034	.925	.924	.939	.914	.027	.954	.919	.898	.855	.034	.910	.892	.820	.773	.051	.873	.838	.874	.843	.059	.865	.856
$PFSNet_{21}$	.952	.932	.031	.928	.930	.943	.919	.026	.956	.924	.896	.847	.036	.902	.892	.823	.774	.055	.875	.842	.875	.837	.063	.856	.854
$VST_{21}$	.951	.920	.033	.918	.932	.942	.900	.029	.953	.928	.890	.818	.037	.892	.896	.825	.756	.058	.861	.850	.875	.829	.061	.837	.865
$PAKRN_{21}$	.953	.931	.032	.924	.928	.943	.920	.027	.955	.923	.907	.865	.033	.916	.900	.834	.793	.050	.885	.853	.873	.838	.066	.857	.852
OURS	.964	.949	.022	.932	.941	.953	.936	.020	.965	.933	.920	.890	.025	.925	.910	.838	.804	.040	.878	.847	.890	.862	.050	.866	.867

We use NVDIA GTX 2080Ti to train our network. The input resolution of the detail branch is $352\times 352$ , and the input resolution of the first transformer block is $56\times 56$ . The maximum learning rate of the backbone is 0.004, and the other parts are expanded by ten times. And the learning rate will first increase and then decrease when training our model. The optimization method uses Stochastic Gradient Descent. Batch size is set to 26, and the epoch is set to 32. The data augmentation methods involves multi-scale training, random flipping and cropping. The prediction results do not need any post-processing.

Methods of Comparison

The experimental process involves the state-of-the-art methods of the last three years. Four of them in 2019 include BANet (Su et al. 2019), BASNet (Qin et al. 2019), PoolNet (Liu et al. 2019) and CPD (Wu, Su, and Huang 2019). Seven methods in 2020, including LDF (Wei et al. 2020), MINet (Pang et al. 2020), GCPANet (Chen et al. 2020), GateNet (Zhao et al. 2020), DFI (Liu, Hou, and Cheng 2020), ITSDNet (Zhou et al. 2020) and U2Net (Qin et al. 2020). And there methods published in 2021: PAKRN (Xu et al. 2021), VST (Liu et al. 2021a) and PFSNet (Ma, Xia, and Li 2021). The evaluation process uses a unified evaluation code to evaluate the published saliency map.

Table 2: Ablation experiments of the proposed method. Bi is the abbreviation of Bilateral. The last line is the final result of our method.

	ECSSD		DUTS-TE		DUT-OMRON
Method	$F_{\beta}^{m}\uparrow$	$mae\downarrow$	$F_{\beta}^{m}\uparrow$	$mae\downarrow$	$F_{\beta}^{m}\uparrow$	$mae\downarrow$
Detail branch	.867	.061	.716	.067	.644	.094
Semantic branch	.848	.033	.845	.033	.780	.050
Bilateral	.930	.027	.858	.029	.788	.047
Bi+AF	.940	.025	.869	.028	.790	.045
Bi+AF+MBD2	.941	.025	.874	.027	.798	.045
Bi+AF+MBD3	.946	.024	.882	.028	.799	.044
Bi+AF+MBD4	.947	.023	.888	.026	.801	.041
Bi+AF+MBD5	.946	.024	.886	.027	.800	.042
Bi+AF+ASPP	.941	.025	.864	.028	.793	.044
Bi+AF+MBD4+BL	.949	.022	.890	.025	.804	.040

Performance Comparison.

The verification results prove that the proposed method achieves a breakthrough in performance compared with the latest method. First of all, it can be seen from the PR curve in Fig. 5 that the PR curve of the previous methods has not improved significantly, but our method is significantly better than other methods. In addition, the performance comparison results in Tab. 1 can also verify the effectiveness of the method in this paper. Our method has mostly surpassed the previous method in five commonly used indicators. And the performance improvement is very significant. Taking the ECSSD dataset as an example, the MAE indicator reached $0.033$ in 2020, and it has been reduced to $0.031$ in 2021, a reduction of only $6\%$ . But our method reached $0.022$ , an decrease of $29\%$ compared to last year. The experimental data on DUTS-TE shows that the MAE metric of LDF (Wei et al. 2020) in 2020 has reached $0.034$ , while the result in 2021 is basically unchanged. But our method reduces the MAE indicator to $0.025$ , which is a $24\%$ reduction. From the results of F-measure, the proposed method greatly improves the F-measure value calculated at different thresholds.

Ablation Studies

In order to verify the innovations of this paper one by one, we conduct ablation experiments on the proposed Bilateral Network, Attention Fusion Module AF, Multi-Branch Decoupling, and Boosting Loss. The experimental details are shown in Tab. 2.

Bilateral Network.

The results of the first three rows of Tab. 2 verify the effectiveness of the bilateral network. The detailed branch represents a framework composed of a lightweight CNN and FPN (Lin et al. 2017) decoder. Here, the CNN network selects ResNet-18 (He et al. 2015), and the input resolution is set to $352\times 352$ . The semantic branch selects Swin Transformer (Liu et al. 2021b) as basic network and utilizes the same decoder. The input resolution of semantic branch is set to $56\times 56$ . It can be seen from the experimental results that the performance of a pure CNN or transformer model is not ideal. The lightweight CNN lacks deep semantic information and the transformer model with low-resolution input lacks high-frequency details. When we combine the two models into a bilateral network, the performance can be greatly improved.

Attention Fusion Module (AF)

AF improves their performance through cross-attention compensation. The performance of the fourth row in Tab. 2 is significantly improved than that of the third row. On different datasets, the F-measure metric can be improved by more than 1 percentage point. And the Average Absolute Error shows a downward trend. Experiments have proved that the bilateral network can achieve higher performance under the blessing of AF.

Multi-Branch Decoupling (MBD).

Tab. 2 also verifies the effect of multi-branch decoupling and includes an ablation experiment of the number of branches $N$ . The data in the fourth row to the eighth row shows the change in performance as the number of branches increases. When the number of branches increases to 4, the performance is close to saturation. Therefore, we finally chose to use four branches. In addition, the effect of directly adding ASPP to the bilateral network in the table is not obvious. This verifies the role of MBD in the serial training of multi-branch sub-networks.

Boosting Loss (BL).

In order to verify the effect of BL, we analyze it from both quantitative and qualitative perspectives. The last row in Tab. 2 shows the results of BL acting on the 4 branches. Compared to the model without BL, the average performance has improved and basically exceeded the best result of the ablation test. Fig. 7 shows the visualization results of the first and fourth branches. It is easy to see that different branches can extract different detailed features, and the fusion result can effectively integrate the results of different branches.

Visual comparison.

The latest method and the visualization results of the proposed method are shown in Fig. 6. It is easy to see that the existing methods mainly have the following problems: 1. Some objects are missing in the prediction result when there are multiple salient objects. 2. Incomplete prediction of a single object. 3. Improper handling of details generates noise. Among the methods compared, VST is based on the transformer model, while other methods are based on CNN. We observe that VST can handle problems 1 and 2 better, but the segmentation details are often not ideal.As shown in the second and fourth rows of the figure, the CNN model is difficult to deal with problem 1 and problem 2 due to the limited receptive field. It can be seen from the third column that our model can effectively combine the advantages of the two types of models. This also verifies the expansion and boosting effect of our method on the receptive field. The results of the first two rows verify that the method in this paper can effectively solve the problem of missing objects. The results in the third and fourth rows show that our method has advantages in detail processing. And the data in the last row shows that even if the object is large, our model can still fully predict and accurately segment the results. In short, the visualization results reflect that the proposed method can better alleviate the problems caused by the receptive field.

Conclusion

In this paper, we rethink the role of the receptive field in SOD and propose a bilateral network based on vision transformer and lightweight CNN to broaden the receptive field of the network. in order to better integrate the characteristics of the bilateral network, we design an attention feature fusion module based on the characteristics of the two types of features. In the proposed bilateral network, transformer branches input low-resolution images to efficiently generate semantic information. However, this operation causes the global attention mechanism of transformer branch to degenerate into an association mechanism between pixel regions. To enhance the attention between these regions, we propose a Multi-Head Boosting strategy to compensate for the loss of the global receptive field. Experiments show that our method can achieve impressive results under various evaluation indicators on multiple benchmark datasets.

References

Chen et al. (2017) Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; and Yuille, A. L. 2017. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4): 834–848.
Chen et al. (2020) Chen, Z.; Xu, Q.; Cong, R.; and Huang, Q. 2020. Global context-aware progressive aggregation network for salient object detection. arXiv preprint arXiv:2003.00651.
Fan et al. (2017) Fan, D.; Cheng, M.; Liu, Y.; Li, T.; and Borji, A. 2017. A new way to evaluate foreground maps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 245484557.
Fan et al. (2018) Fan, D.-P.; Gong, C.; Cao, Y.; Ren, B.; Cheng, M.-M.; and Borji, A. 2018. Enhanced-alignment measure for binary foreground map evaluation. arXiv preprint arXiv:1805.10421.
Fan et al. (2021) Fan, M.; Lai, S.; Huang, J.; Wei, X.; Chai, Z.; Luo, J.; and Wei, X. 2021. Rethinking BiSeNet For Real-time Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9716–9725.
He et al. (2015) He, K.; Zhang, X.; Ren, S.; and Sun, J. 2015. Deep Residual Learning for Image Recognition. arXiv:1512.03385.
Howard et al. (2017) Howard, A. G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; and Adam, H. 2017. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
Li and Yu (2015) Li, G.; and Yu, Y. 2015. Visual saliency based on multiscale deep features. In Proceedings of the IEEE conference on computer vision and pattern recognition, 5455–5463.
Li et al. (2014) Li, Y.; Hou, X.; Koch, C.; Rehg, J. M.; and Yuille, A. L. 2014. The secrets of salient object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 280–287.
Lin et al. (2017) Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; and Belongie, S. 2017. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2117–2125.
Liu, Hou, and Cheng (2020) Liu, J.-J.; Hou, Q.; and Cheng, M.-M. 2020. Dynamic Feature Integration for Simultaneous Detection of Salient Object, Edge and Skeleton. arXiv preprint arXiv:2004.08595.
Liu et al. (2019) Liu, J.-J.; Hou, Q.; Cheng, M.-M.; Feng, J.; and Jiang, J. 2019. A Simple Pooling-Based Design for Real-Time Salient Object Detection. In IEEE CVPR.
Liu et al. (2021a) Liu, N.; Zhang, N.; Wan, K.; Han, J.; and Shao, L. 2021a. Visual Saliency Transformer. arXiv preprint arXiv:2104.12099.
Liu et al. (2021b) Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; and Guo, B. 2021b. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. arXiv:2103.14030.
Ma, Xia, and Li (2021) Ma, M.; Xia, C.; and Li, J. 2021. Pyramidal Feature Shrinking for Salient Object Detection. 35: 2311–2318.
Pang et al. (2020) Pang, Y.; Zhao, X.; Zhang, L.; and Lu, H. 2020. Multi-Scale Interactive Network for Salient Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9413–9422.
Qin et al. (2020) Qin, X.; Zhang, Z.; Huang, C.; Dehghan, M.; Zaiane, O. R.; and Jagersand, M. 2020. U2-Net: Going deeper with nested U-structure for salient object detection. Pattern Recognition, 106: 107404.
Qin et al. (2019) Qin, X.; Zhang, Z.; Huang, C.; Gao, C.; Dehghan, M.; and Jagersand, M. 2019. BASNet: Boundary-Aware Salient Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
Su et al. (2019) Su, J.; Li, J.; Zhang, Y.; Xia, C.; and Tian, Y. 2019. Selectivity or Invariance: Boundary-aware Salient Object Detection. In ICCV.
Wang et al. (2017) Wang, L.; Lu, H.; Wang, Y.; Feng, M.; Wang, D.; Yin, B.; and Ruan, X. 2017. Learning to detect salient objects with image-level supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 136–145.
Wei, Wang, and Huang (2019) Wei, J.; Wang, S.; and Huang, Q. 2019. F3Net: Fusion, feedback and focus for salient object detection. arXiv preprint arXiv:1911.11445.
Wei et al. (2020) Wei, J.; Wang, S.; Wu, Z.; Su, C.; Huang, Q.; and Tian, Q. 2020. Label decoupling framework for salient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 13025–13034.
Wu, Su, and Huang (2019) Wu, Z.; Su, L.; and Huang, Q. 2019. Cascaded partial decoder for fast and accurate salient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3907–3916.
Xu et al. (2021) Xu, B.; Liang, H.; Liang, R.; and Chen, P. 2021. Locate Globally, Segment Locally: A Progressive Architecture With Knowledge Review Network for Salient Object Detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 3004–3012.
Yan et al. (2013) Yan, Q.; Xu, L.; Shi, J.; and Jia, J. 2013. Hierarchical saliency detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, 1155–1162.
Yang et al. (2013) Yang, C.; Zhang, L.; Lu, H.; Ruan, X.; and Yang, M.-H. 2013. Saliency detection via graph-based manifold ranking. In Proceedings of the IEEE conference on computer vision and pattern recognition, 3166–3173.
Yu et al. (2018) Yu, C.; Wang, J.; Peng, C.; Gao, C.; Yu, G.; and Sang, N. 2018. BiSeNet: Bilateral Segmentation Network for Real-Time Semantic Segmentation. In Ferrari, V.; Hebert, M.; Sminchisescu, C.; and Weiss, Y., eds., Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XIII, volume 11217 of Lecture Notes in Computer Science, 334–349. Springer.
Yuan et al. (2021) Yuan, L.; Chen, Y.; Wang, T.; Yu, W.; Shi, Y.; Jiang, Z.; Tay, F. E.; Feng, J.; and Yan, S. 2021. Tokens-to-token vit: Training vision transformers from scratch on imagenet. arXiv preprint arXiv:2101.11986.
Zhang et al. (2018) Zhang, X.; Zhou, X.; Lin, M.; and Sun, J. 2018. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE conference on computer vision and pattern recognition, 6848–6856.
Zhao et al. (2017) Zhao, H.; Shi, J.; Qi, X.; Wang, X.; and Jia, J. 2017. Pyramid Scene Parsing Network. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 6230–6239.
Zhao et al. (2020) Zhao, X.; Pang, Y.; Zhang, L.; Lu, H.; and Zhang, L. 2020. Suppress and balance: A simple gated network for salient object detection. arXiv preprint arXiv:2007.08074.
Zhou et al. (2020) Zhou, H.; Xie, X.; Lai, J.-H.; Chen, Z.; and Yang, L. 2020. Interactive Two-Stream Decoder for Accurate and Fast Saliency Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9141–9150.