¹¹institutetext: Dalian University of Technology, China
¹¹email: {lartpang, zxq}@mail.dlut.edu.cn, {zhanglihe, lhchuan}@dlut.edu.cn
²²institutetext: Peng Cheng Laboratory

Hierarchical Dynamic Filtering Network for RGB-D Salient Object Detection

Youwei Pang 11 Lihe Zhang Corresponding author: [email protected] Xiaoqi Zhao 11 Huchuan Lu 1122

Abstract

The main purpose of RGB-D salient object detection (SOD) is how to better integrate and utilize cross-modal fusion information. In this paper, we explore these issues from a new perspective. We integrate the features of different modalities through densely connected structures and use their mixed features to generate dynamic filters with receptive fields of different sizes. In the end, we implement a kind of more flexible and efficient multi-scale cross-modal feature processing, i.e. dynamic dilated pyramid module. In order to make the predictions have sharper edges and consistent saliency regions, we design a hybrid enhanced loss function to further optimize the results. This loss function is also validated to be effective in the single-modal RGB SOD task. In terms of six metrics, the proposed method outperforms the existing twelve methods on eight challenging benchmark datasets. A large number of experiments verify the effectiveness of the proposed module and loss function. Our code, model and results are available at https://github.com/lartpang/HDFNet.

Keywords:

RGB-D Salient Object Detection, Cross-modal Fusion, Dynamic Dilated Pyramid Module, Hybrid Enhanced Loss

1 Introduction

Salient object detection (SOD) aims to model the mechanism of human visual attention and mine the most salient objects or regions in data such as images or videos. SOD has been widely applied in many computer vision tasks, such as scene classification [39], video segmentation [14], semantic segmentation [44], foreground map evaluation [11, 12] visual tracking [31], person re-identification [40] and so on.

With the advent of the fully convolutional network [30], deep learning-based SOD models [18, 28] have made great progress. Some methods [51, 34, 23, 52] have achieved very good performance on the existing benchmark datasets. However, these works are mainly based on RGB data. They still face severe challenges when handling the cluttered or low-contrast scenes. Recently, some works [6, 9, 54, 38, 15, 2, 4] introduce the depth data as an aid to further improve the detection performance. The depth information can more intuitively express spatial structures of the objects in a scene and provide a powerful supplement for the detection and recognition of salient objects. Using complementary modal cues, the scene can be further deeply and intelligently understood. However, limited by the way of using the depth information, RGB-D salient object detection is still great challenging.

Refer to caption — Figure 1: Comparisons in model size and accuracy.

It is well known that RGB images contain rich appearance and detail information while depth images contain more spatial structure information. They complement each other for many vision tasks. RGB-D SOD approaches aim to formulate cross-modal fusion in different manners. Most of them integrate depth and RGB features by element-wise addition [4, 37], concatenation [13, 3] and convolution operations [2, 43]. Some methods compute attention map [50] or saliency map [43] via a shallow or deep CNN network from pure depth images. Because of using the fixed parameters for different samples during the testing phase, the generalization capability of these models is weakened.

Moreover, for dense prediction task, the loss in each spatial position is usually different. Thus, the actual optimization direction of gradients in different positions may be varying. The weight-sharing convolution operation across different positions, which is used in the existing methods, causes that the training process of each parameter relies on the global gradient. This forces the network to learn trade-off and sub-optimal parameters. To address these problems, we propose a dynamic dilated pyramid module (DDPM), which uses RGB-depth mixed features to adaptively adjust convolution kernels for different input samples and processing locations. These kernels can capture rich semantic cues at multiple scales with the help of the pyramid structure and the dilated convolution. This design is capable of making more efficient convolution operations for current RGB features and promotes the network to obtain more flexible and targeted features for saliency prediction. Early deep learning-based SOD models [15], which use fully connected layers, destroy the spatial structure of the data. This issue is alleviated to some extent by using the fully convolutional network. But the intrinsic gridding operation and the repeated down-sampling lead to the loss of numerous details in the predicted results. Although many methods frequently combine shallower features to restore feature resolution, the improvement is still limited. While some approaches [18, 10] leverage CRF post-processing to refine subtle structures, which has a large computational cost. In this work, we design a new hybrid enhanced loss function (HEL). The HEL encourages the consistency between the area around edges and the interior of objects, thereby achieving sharper boundaries and a solid saliency area.

Our main contributions are summarized as follows:

•

We propose a simple yet effective hierarchical dynamic filtering network (HDFNet) for RGB-D SOD. Especially, we provide a new perspective to utilize depth information. The depth and RGB features are combined to generate region-aware dynamic filters to guide the decoding in RGB stream.
•

We propose a hybrid enhanced loss and verify its effectiveness in both RGB and RGB-D SOD tasks. It can effectively optimize the details of predictions and enhance the consistency of salient regions without additional parameters.
•

We compare the proposed method with twelve state-of-the-art methods on eight datasets. It achieves the best performance under six evaluation metrics. Meanwhile, we implement a forward reasoning speed of 52 FPS on an NVIDIA GTX 1080 Ti GPU. The size of our VGG16-based model is about 170 MB (Fig. 1).

2 Related Word

RGB-D Salient Object Detection. The early methods are mainly based on hand-crafted features, such as contrast [6] and shape [8]. Limited by the representation ability of the features, they can not cope with complex scenes. Please refer to [13] for more details about traditional methods. In recent years, FCN-based methods have shown great potential and some of them achieve very good performance in the RGB-D SOD task [50, 37, 13]. Chen and Li [2] progressively combine the current depth/RGB features and the preceding fused feature by a series of convolution and element-wise addition operations to build the cross fusion modules. Recently, they concatenate depth and RGB features and feed them into an additional CNN stream to achieve multi-level cross-modal fusion [3]. Wang and Gong [43] respectively build a saliency prediction stream for RGB and depth inputs and then fuse their predictions and their preceding features to obtain final prediction via several convolutional layers. Zhao et. al [50] insert a lightweight net between adjacent encoding blocks to compute a contrast map from the depth input and use it to enhance the features from the RGB stream. Piao et. al [37] combine multi-level paired complementary features from RGB and depth streams by convolution and nonlinear operations. Fan et. al [13] design a depth depurator to remove the low-quality depth input, and for high-quality one they feed the concatenated 4-channel input into a convolutional neural network to achieve cross-modal fusion. Different from these methods, we use the RGB-depth mixed features to generate “adaptive” multi-scale convolution kernels to filter and enhance the decoding features from the RGB stream.

Dynamic Filters. The works closely related to ours are [21] and [16]. The conception of the dynamic filter is firstly proposed in video and stereo prediction task [21]. The filter is utilized to enhance the representation of its corresponding input in a self-learning manner. While we use multi-modal information to generate multi-scale filters to dynamically strengthen the cross-modal complementarity and suppress the inter-modality incompatibility. Besides, the kernel computation in [21] introduces a large number of parameters and is difficultly extended at multiple scales, which significantly increases parameters and causes optimization difficulties. To efficiently achieve hierarchical dynamic filters, we introduce the idea of depth-wise separable convolution [19] and dilated convolution [49]. In [16], the filters are computed by pooling the input feature. They share kernel parameters across different positions, which is only an image-specific filter generator. In contrast, we design position-specific and image-specific filters to provide cross-modal contextual guidance for the decoder. The parameter update of dynamic filters is determined by the gradients of local neighborhoods to achieve more targeted adjustments and guarantee the overall performance of optimization.

3 Proposed Method

In this section, we first introduce the overall structure of the proposed method and then detail two main components, including the dynamic dilated pyramid module (DDPM) and the hybrid enhanced loss (HEL).

3.1 Two Stream Structure

We build a two-stream network, which structure is shown in Fig. 2. It has two inputs: one is an RGB image and the other is a depth image, which corresponds to the RGB and depth streams, respectively. Through convolution blocks $\{E^{i}_{rgb}\}^{5}_{i=1}$ and $\{E^{i}_{d}\}^{5}_{i=1}$ in two encoding networks, we can obtain the intermediate features with different resolutions, which are recorded as $f^{1}$ , $f^{2}$ , $f^{3}$ , $f^{4}$ , $f^{5}$ from large to small. The third-level features still retain enough valid information. Besides, the shallower features contain more noise and also cause higher computational cost due to the larger resolution. To balance efficiency and effectiveness, we only utilize the features $f^{3}_{d}$ , $f^{4}_{d}$ , $f^{5}_{d}$ from the deepest three blocks in the depth stream. These features are respectively combined with the features $f^{3}_{rgb}$ , $f^{4}_{rgb}$ , $f^{5}_{rgb}$ from the RGB stream. Then, we use a dense block [20] to build the transport layer, which combines rich and various receptive fields and generates powerful mixed features $f_{T_{m}}$ with both spatial structures and appearance details. These features are fed into the DDPM to produce multi-scale convolution kernels that are used to filter the features $f_{D_{rgb}}$ from the decoder. The resulted features $f_{M}$ are merged in the top-down pathway by element-wise addition. After recovering the resolution layer by layer, we obtain the final prediction $P$ , which is supervised by the ground truth $G$ .

3.2 Dynamic Dilated Pyramid Module

Input:

f^{i}_{r}=\mathcal{R}(f^{i}_{D_{rgb}})\in\mathbb{R}^{N\times C^{\prime}\times H^{\prime}\times W^{\prime}}

f^{i}_{g^{j}}\in\mathbb{R}^{N\times(9\times C^{\prime})\times H^{\prime}\times W^{\prime}}

Output:

f^{i}_{B^{j}}\in\mathbb{R}^{N\times C^{\prime}\times H^{\prime}\times W^{\prime}}

d\leftarrow j\times 2-1

;

2 pad

f^{i}_{r}

with

0

from

(H^{\prime},W^{\prime})

(H^{\prime}+2\times d,W^{\prime}+2\times d)

;

3 for $n\leftarrow 0$ to $N-1$ do

4 for $c\leftarrow 0$ to $C^{\prime}-1$ do

5 for $h\leftarrow d$ to $H^{\prime}+d-1$ do

6 for $w\leftarrow d$ to $W^{\prime}+d-1$ do

(f^{i}_{B^{j}})_{[n,c,h,w]}

\leftarrow

\sum^{1}_{l=-1}\sum^{1}_{m=-1}\{(f^{i}_{g^{j}})_{[n,(l+1)\times 3+(m+1),h,w]}

\times(f^{i}_{r})_{[n,c,h+l\times d,w+m\times d]}\}

;

Algorithm 1 The operation process of adaptive convolution

\otimes

related to KTU_j in DDPMⁱ.

In order to make more reasonable and effective use of the mixed features $f_{T_{m}}$ from the dense transport layer, we employ DDPMs to generate the adaptive kernel for decoding RGB features. The DDPMs contain two inputs: the mixed feature $f_{T_{m}}$ and the feature $f_{D_{rgb}}$ from the decoder. On one hand, for specific position in feature maps $f_{D_{rgb}}$ , we use kernel generation units (KGUs) to yield independent weight tensors, i.e. $f_{g}$ , that can cover a $3\times 3$ , $7\times 7$ or $11\times 11$ square neighborhood. KGUs are also a kind of dense structure [20]. The module contains 4 densely connected layers and each layer is connected to all the others in a feed-forward fashion, which can further strengthen feature propagation and expression capabilities, encourage feature reuse and greatly improve parameter efficiency. Then, by recombining kernel tensors and inserting different numbers of zeros, kernel transformation units (KTUs) construct regular convolution kernels with different dilation rates. Please see “KTU” shown in Fig. 3 and introduced in Alg. 1 for a more intuitive presentation. On the other hand, after preliminary dimension reduction, the other input $f_{D_{rgb}}$ is re-weighted and integrated into three parallel branches to obtain the enhanced features $\{f_{B^{j}}\}^{3}_{j=1}$ . Note that this is actually a channel-wise adjustment and the operation of each channel is independent. Finally, after concating and merging $\{f_{B^{j}}\}^{3}_{j=1}$ and the reduced $f_{D_{rgb}}$ , the resulted features $\{f^{i}_{M}\}^{5}_{i=3}$ become more discriminative.

The entire process can be formulated as follows:

\begin{split}f^{i}_{M}&=\mathcal{DDPM}^{i}(f^{i}_{D_{rgb}},f^{i}_{T_{m}})\\ &=\mathcal{F}(\mathcal{C}(\mathcal{R}(f^{i}_{D_{rgb}}),f^{i}_{B^{1}},f^{i}_{B^{2}},f^{i}_{B^{3}})\\ &=\mathcal{F}(\mathcal{C}(\mathcal{R}(f^{i}_{D_{rgb}}),\mathcal{KTU}^{i}_{1}(\mathcal{KGU}^{i}_{1}(f^{i}_{T_{m}}))\otimes\mathcal{R}(f^{i}_{D_{rgb}}),\\ &\quad\quad\quad\mathcal{KTU}^{i}_{2}(\mathcal{KGU}^{i}_{2}(f^{i}_{T_{m}}))\otimes\mathcal{R}(f^{i}_{D_{rgb}}),\\ &\quad\quad\quad\mathcal{KTU}^{i}_{3}(\mathcal{KGU}^{i}_{3}(f^{i}_{T_{m}}))\otimes\mathcal{R}(f^{i}_{D_{rgb}}))),\end{split}

(1)

where $f^{i}_{M}$ represents the feature from the DDPMⁱ related to the $f^{i}_{D_{rgb}}$ . $\mathcal{DDPM}(\cdot)$ , $\mathcal{KGU}(\cdot)$ and $\mathcal{KTU}(\cdot)$ denote the operation of the corresponding module. $\mathcal{R}(\cdot)$ is a $1\times 1$ convolution operation, which is used to reduce the number of channels from 64 to 16. $\otimes$ is an adaptive convolution operation as shown in Alg. 1. $\mathcal{C}(\cdot)$ is a concatenation operation and $\mathcal{F}(\cdot)$ is a $3\times 3$ convolution to fuse the concatenated features from different branches. More details is as shown in Fig. 3.

3.3 Hybrid Enhanced Loss

No matter for RGB or RGB-D based SOD tasks, good prediction requires the salient area to be clearly and completely highlighted. This contains two aspects: one is the sharpness of boundaries and the other is the consistency of intra-class. We start with the loss function and design a new loss to constrain the edges and the fore-/background regions to separately achieve high-contrast predictions.

The common loss function in the SOD task is binary cross entropy (BCE). It is a pixel-level loss, which independently performs error calculation and supervision at different positions. The main form is as follows:

\begin{split}L_{bce}&=\frac{1}{N\times H\times W}\sum^{N}_{n}\sum^{H}_{h}\sum^{W}_{w}\left[g\log p+(1-g)\log(1-p)\right],\end{split}

(2)

where $P=\{p|0<p<1\}\in\mathbb{R}^{N\times 1\times H\times W}$ and $G=\{g|0<g<1\}\in\mathbb{R}^{N\times 1\times H\times W}$ respectively represent the prediction and the corresponding ground truch. $N$ , $H$ and $W$ are the batchsize, height and width of the input data, respectively. It calculates the error between the ground truth $g$ and the prediction $p$ at each position, and the loss $L_{bce}$ accumulates and averages the errors of all positions.

In order to further enhance the strength of supervision at higher levels such as edges and regions, we specially constrain and optimize the regions near the edges. In particular, the loss is formulated as follows:

\displaystyle\begin{split}L_{e}=&\frac{\sum^{H}_{h}\sum^{W}_{w}(e*|p-g|)}{\sum^{H}_{h}\sum^{W}_{w}e},\\ e=&\left\{\begin{matrix}0&\text{ if }(G-\mathcal{P}(G))_{[h,w]}=0,\\ 1&\text{ if }(G-\mathcal{P}(G))_{[h,w]}\neq 0,\end{matrix}\right.\end{split}

(3)

where $L_{e}$ represents the edge enhanced loss (EEL), and $\mathcal{P}(\cdot)$ denotes the average pooling operation with a $5\times 5$ slide window. In Equ. 3, we can obtain the local region near the contour of the ground truth by calculating $e$ . In this region, the difference $L_{e}$ between the prediction $p$ and the ground truth $g$ can be calculated. Through this loss, the optimization process can target the contours of salient objects.

In addition, we also design a region enhanced loss (REL) to constrain the prediction of intra-class. By respectively calculating the prediction errors within the foreground class and the background class, fore-/background predictions can be independently optimized. Specifically, the REL $L_{r}$ is written as:

\displaystyle\begin{split}L_{r}&=\frac{\sum^{N}_{n}(L_{f}+L_{b})}{N},\\ L_{f}&=\frac{\sum^{H}_{h}\sum^{W}_{w}(g-g*p)}{\sum^{H}_{h}\sum^{W}_{w}g},\\ L_{b}&=\frac{\sum^{H}_{h}\sum^{W}_{w}(1-g)*p}{\sum^{H}_{h}\sum^{W}_{w}(1-g)},\end{split}

(4)

where $L_{f}$ and $L_{b}$ denote the fore-/background losses, respectively. The losses compute the normalized prediction errors in the intra-class regions. They depict the region-level supervision. Finally, we integrate these three losses ( $L_{bce}$ , $L_{e}$ and $L_{r}$ ) to obtain the hybrid enhanced loss (HEL), which can optimize the prediction at two different levels. The total loss is expressed as follows:

\begin{split}L=L_{bce}+L_{e}+L_{r}.\end{split}

(5)

4 Experiments

4.1 Datasets

To fully verify the effectiveness of the proposed method, we evaluated the results on eight benchmark datasets. LFSD [25] is a small dataset that contains 100 images with depth information and human-labeled ground truths and is built for saliency detection on the light filed. NJUD [22] contains 1,985 groups of RGB, depth, and label images, which are collected from the Internet, 3D movies, and photographs taken by a Fuji W3 stereo camera. NLPR [35] is also called RGBD1000, which contains 1,000 natural RGBD images captured by Microsoft Kinect together with the human-marked ground truth. RGBD135 [7] is also named DES, which consists of 135 images about indoor scenes collected by Microsoft Kinect. SIP [13] includes 1,000 images with many challenging situations from various outdoor scenarios and these images emphasize salient persons in real-world scenes. SSD [53] contains 80 images picked up from three stereo movies. STEREO [33] is also called SSB, which contains 1,000 stereoscopic images downloaded from the Internet. DUTRGBD [37] is a new and large dataset and contains 800 indoor and 400 outdoor scenes paired with the depth maps and ground truths.

For comprehensively and fairly evaluating different methods, we follow the setting of [37]. On the DUTRGBD, we use 800 images for training and 400 images for testing. For the other seven datasets, we follow the data partition of [2, 4, 15, 37] to use 1,485 samples from the NJUD and 700 samples from the NLPR as the training set and the remaining samples in these datasets are used for testing.

4.2 Evaluation Metrics

Table 1: Results (

\uparrow

F_{max}

F_{ada}

[1],

F_{\beta}^{\omega}

[32],

S_{m}

[11] and

E_{m}

[12];

\downarrow

: MAE [36]) of different RGB-D SOD methods across eight datasets. The best results are highlight in red.

\natural

: Traditional methods.

\dagger

: VGG-16 [41] as backbone.

\ddagger

: VGG-19 [41] as backbone.

\sharp

: ResNet-50 [17] as backbone. -: No data available.

	Metric	DES ${}_{14}^{\natural}$ [6]	DCMC ${}_{16}^{\natural}$ [9]	CDCP ${}_{17}^{\natural}$ [54]	DF ${}_{17}^{\dagger}$ [38]	CTMF ${}_{18}^{\dagger}$ [15]	PCANet ${}_{18}^{\dagger}$ [2]	MMCI ${}_{19}^{\dagger}$ [4]	TANet ${}_{19}^{\dagger}$ [3]	AFNet ${}_{19}^{\dagger}$ [43]	CPFP ${}_{19}^{\dagger}$ [50]	OURS^†	DMRA ${}_{19}^{\ddagger}$ [37]	OURS^‡	D3Net ${}_{19}^{\sharp}$ [13]	OURS^♯
LFSD [25]	$F_{max}$	0.377	0.850	0.680	0.854	0.815	0.829	0.813	0.827	0.780	0.850	0.860	0.872	0.858	0.849	0.883
	$F_{ada}$	0.227	0.815	0.634	0.810	0.781	0.793	0.779	0.794	0.742	0.813	0.831	0.849	0.833	0.801	0.843
	$F_{\beta}^{\omega}$	0.274	0.601	0.518	0.642	0.696	0.716	0.663	0.719	0.671	0.775	0.792	0.811	0.793	0.756	0.806
	MAE	0.416	0.155	0.199	0.142	0.120	0.112	0.132	0.111	0.133	0.088	0.085	0.076	0.083	0.099	0.076
	$S_{m}$	0.440	0.754	0.658	0.786	0.796	0.800	0.787	0.801	0.738	0.828	0.847	0.847	0.844	0.832	0.854
	$E_{m}$	0.492	0.842	0.737	0.841	0.851	0.856	0.840	0.851	0.810	0.867	0.883	0.899	0.886	0.860	0.891
NJUD [22]	$F_{max}$	0.328	0.769	0.661	0.789	0.857	0.887	0.868	0.888	0.804	0.890	0.924	0.896	0.922	0.903	0.922
	$F_{ada}$	0.165	0.715	0.618	0.744	0.788	0.844	0.813	0.844	0.768	0.837	0.894	0.872	0.887	0.840	0.889
	$F_{\beta}^{\omega}$	0.234	0.497	0.510	0.545	0.720	0.803	0.739	0.805	0.696	0.828	0.881	0.847	0.877	0.833	0.877
	MAE	0.448	0.167	0.182	0.151	0.085	0.059	0.079	0.061	0.100	0.053	0.037	0.051	0.038	0.051	0.038
	$S_{m}$	0.413	0.703	0.672	0.735	0.849	0.877	0.859	0.878	0.772	0.878	0.911	0.885	0.911	0.895	0.908
	$E_{m}$	0.491	0.796	0.751	0.818	0.866	0.909	0.882	0.909	0.847	0.900	0.934	0.920	0.932	0.901	0.932
NLPR [35]	$F_{max}$	0.695	0.413	0.687	0.752	0.841	0.864	0.841	0.876	0.816	0.883	0.917	0.888	0.919	0.904	0.927
	$F_{ada}$	0.583	0.328	0.591	0.683	0.724	0.795	0.730	0.796	0.747	0.818	0.878	0.855	0.883	0.834	0.889
	$F_{\beta}^{\omega}$	0.254	0.259	0.501	0.516	0.679	0.762	0.676	0.780	0.693	0.807	0.869	0.839	0.871	0.826	0.882
	MAE	0.300	0.196	0.114	0.100	0.056	0.044	0.059	0.041	0.058	0.038	0.027	0.031	0.027	0.034	0.023
	$S_{m}$	0.582	0.550	0.724	0.769	0.860	0.873	0.856	0.886	0.799	0.884	0.916	0.898	0.915	0.906	0.923
	$E_{m}$	0.760	0.685	0.786	0.840	0.869	0.916	0.872	0.916	0.884	0.920	0.948	0.942	0.951	0.934	0.957
RGBD135 [7]	$F_{max}$	0.800	0.311	0.651	0.625	0.865	0.842	0.839	0.853	0.775	0.882	0.934	0.906	0.941	0.917	0.932
	$F_{ada}$	0.695	0.234	0.594	0.573	0.778	0.774	0.762	0.795	0.730	0.829	0.919	0.867	0.918	0.876	0.912
	$F_{\beta}^{\omega}$	0.301	0.169	0.478	0.392	0.686	0.711	0.650	0.740	0.641	0.787	0.902	0.843	0.913	0.831	0.895
	MAE	0.288	0.196	0.120	0.131	0.055	0.050	0.065	0.046	0.068	0.038	0.020	0.030	0.017	0.030	0.021
	$S_{m}$	0.632	0.469	0.709	0.685	0.863	0.843	0.848	0.858	0.770	0.872	0.932	0.899	0.937	0.904	0.926
	$E_{m}$	0.817	0.676	0.810	0.806	0.911	0.912	0.904	0.919	0.874	0.927	0.973	0.944	0.976	0.956	0.971
SIP [13]	$F_{max}$	0.720	0.680	0.544	0.704	0.720	0.860	0.840	0.851	0.756	0.870	0.904	0.847	0.907	0.882	0.910
	$F_{ada}$	0.644	0.645	0.495	0.673	0.684	0.825	0.795	0.809	0.705	0.819	0.863	0.815	0.870	0.831	0.875
	$F_{\beta}^{\omega}$	0.342	0.413	0.397	0.406	0.535	0.768	0.711	0.748	0.617	0.788	0.835	0.734	0.844	0.793	0.848
	MAE	0.298	0.186	0.224	0.185	0.139	0.071	0.086	0.075	0.118	0.064	0.050	0.088	0.047	0.063	0.047
	$S_{m}$	0.616	0.683	0.595	0.653	0.716	0.842	0.833	0.835	0.720	0.850	0.878	0.800	0.885	0.864	0.886
	$E_{m}$	0.751	0.786	0.722	0.794	0.824	0.900	0.886	0.894	0.815	0.899	0.920	0.858	0.924	0.903	0.924
SSD [53]	$F_{max}$	0.260	0.750	0.576	0.763	0.755	0.844	0.823	0.834	0.735	0.801	0.872	0.858	0.883	0.872	0.885
	$F_{ada}$	0.073	0.684	0.524	0.709	0.709	0.786	0.748	0.766	0.694	0.726	0.844	0.821	0.847	0.793	0.842
	$F_{\beta}^{\omega}$	0.172	0.480	0.429	0.536	0.622	0.733	0.662	0.727	0.589	0.708	0.808	0.787	0.819	0.780	0.821
	MAE	0.500	0.168	0.219	0.151	0.100	0.063	0.082	0.063	0.118	0.082	0.048	0.058	0.046	0.058	0.045
	$S_{m}$	0.341	0.706	0.603	0.741	0.776	0.842	0.813	0.839	0.714	0.807	0.866	0.856	0.875	0.866	0.879
	$E_{m}$	0.475	0.790	0.714	0.801	0.838	0.890	0.860	0.886	0.803	0.832	0.913	0.898	0.911	0.892	0.911
STEREO [33]	$F_{max}$	0.738	0.789	0.704	0.789	0.848	0.875	0.877	0.878	0.848	0.889	0.918	0.802	0.916	0.897	0.910
	$F_{ada}$	0.594	0.742	0.666	0.742	0.771	0.826	0.829	0.835	0.807	0.830	0.879	0.762	0.875	0.833	0.867
	$F_{\beta}^{\omega}$	0.375	0.520	0.558	0.549	0.698	0.778	0.760	0.787	0.752	0.817	0.863	0.647	0.859	0.815	0.853
	MAE	0.295	0.148	0.149	0.141	0.086	0.064	0.068	0.060	0.075	0.051	0.039	0.087	0.040	0.054	0.041
	$S_{m}$	0.642	0.731	0.713	0.757	0.848	0.875	0.873	0.871	0.825	0.879	0.906	0.752	0.903	0.891	0.900
	$E_{m}$	0.696	0.831	0.796	0.838	0.870	0.907	0.905	0.916	0.887	0.907	0.937	0.816	0.934	0.911	0.931
DUTRGBD [37]	$F_{max}$	0.770	0.444	0.658	0.774	0.842	0.809	0.804	0.823	-	0.787	0.926	0.908	0.934	-	0.930
	$F_{ada}$	0.667	0.405	0.633	0.747	0.792	0.760	0.753	0.778	-	0.735	0.892	0.883	0.894	-	0.885
	$F_{\beta}^{\omega}$	0.380	0.284	0.521	0.536	0.682	0.688	0.628	0.705	-	0.638	0.865	0.852	0.871	-	0.864
	MAE	0.280	0.243	0.159	0.145	0.097	0.100	0.112	0.093	-	0.100	0.040	0.048	0.039	-	0.041
	$S_{m}$	0.659	0.499	0.687	0.729	0.831	0.801	0.791	0.808	-	0.749	0.905	0.887	0.911	-	0.907
	$E_{m}$	0.751	0.712	0.794	0.842	0.882	0.863	0.856	0.871	-	0.815	0.938	0.930	0.941	-	0.938
AveMetric	$F_{max}$	0.654	0.666	0.642	0.756	0.811	0.861	0.850	0.862	0.801	0.868	0.914	0.855	0.915	0.893	0.915
	$F_{ada}$	0.534	0.618	0.595	0.714	0.747	0.814	0.794	0.815	0.755	0.813	0.878	0.822	0.878	0.834	0.877
	$F_{\beta}^{\omega}$	0.325	0.425	0.491	0.502	0.652	0.761	0.712	0.764	0.684	0.784	0.857	0.756	0.859	0.810	0.858
	MAE	0.325	0.179	0.174	0.151	0.099	0.068	0.081	0.067	0.093	0.061	0.041	0.069	0.041	0.055	0.041
	$S_{m}$	0.585	0.661	0.669	0.721	0.809	0.853	0.844	0.853	0.773	0.853	0.898	0.824	0.900	0.883	0.899
	$E_{m}$	0.686	0.781	0.765	0.822	0.859	0.899	0.885	0.901	0.853	0.892	0.933	0.876	0.933	0.909	0.932

There are six widely used metrics for evaluating RGB and RGB-D SOD models: Precision-Recall (PR) curve, F-measure [1], weighted F-measure [32], MAE [36], S-measure [11] and E-measure [12]. PR Curve. We use a series of fixed thresholds from 0 to 255 to binarize the gray prediction map, and then calculate several groups of precision ( $Pre$ ) and recall ( $Rec$ ) with ground truth by $Pre=\frac{TP}{TP+FP}$ and $Rec=\frac{TP}{TP+FN}$ . Based on them, we can plot a precision-recall curve to describe the performance of the model. F-measure [1]. It is a region-based similarity metric and is formulated as the weighted harmonic mean (the weight is set to 0.3) of $Pre$ and $Rec$ . In this paper, we employ the threshold changing from 0 to 255 to get $F_{max}$ , and use twice the mean value of the prediction $P$ as the threshold to obtain $F_{ada}$ . In addition, since F-measure reflects the performance of the binary predictions under different thresholds, we evaluate the consistency and uniformity at the regional level according to F-measure threshold curves. weighted F-measure ( $F^{\omega}_{\beta}$ ) [32]. It is proposed to improve the existing metric F-measure. It defines a weighted precision, which is a measure of exactness, and a weighted recall, which is a measure of completeness and follows the form of F-measure. MAE [36]. This metric estimates the approximation degree between the saliency map and ground-truth map, and it is normalized to $[0,1]$ . It focuses on pixel-level performance. S-measure ( $S_{m}$ ) [11]. It calculates the object-/region-aware structure similarities $S_{o}$ / $S_{r}$ between prediction and ground truth by the equation: $S_{m}=\alpha\cdot S_{o}+(1-\alpha)\cdot S_{r},\,\alpha=0.5$ . E-measure ( $E_{m}$ ) [12]. This measure utilizes the mean-removed predictions and ground truths to compute the similarity, which characterizes both image-level statistics and local pixel matching.

4.3 Implementation Details

Parameter setting. Two encoders of the proposed model are based on the same model, such as VGG-16 [41], VGG-19 [41], and ResNet-50 [17]. In both encoders, only the convolutional layers in corresponding classification networks are retained, and the last pooling layer of VGG-16 and VGG-19 is removed at the same time. During the training phase, we use the weight parameters pre-trained on the ImageNet to initialize the encoders. Also, since the depth image is a single channel data, we change the channel number of its corresponding input layer from 3 to 1, and its parameters are initialized randomly by PyTorch. The parameters of the remaining structures are all initialized randomly.

Training setting. During the training stage, we apply random horizontal flipping, random rotating as data augmentation for RGB images and depth images. In addition, we employ random color jittering and normalization for RGB images. We use the momentum SGD optimizer with a weight decay of 5e-4, an initial learning rate of 5e-3, and a momentum of 0.9. Besides, we apply a “poly” strategy [29] with a factor of 0.9. The input images are resized to $320\times 320$ . We train the model for 30 epochs on an NVIDIA GTX 1080 Ti GPU with a batch size of 4 to obtain the final model.

Testing details. During the testing stage, we resize RGB and depth images to $320\times 320$ and normalize RGB images. Besides, the final prediction is rescaled to the original size for evaluation.

4.4 Comparisons

In order to fully demonstrate the effectiveness of the proposed method, we compared it with the existing twelve RGB-D based SOD models, including DES [6], DCMC [9], CDCP [54], DF [38], CTMF [15], PCANet [2], MMCI [4], TANet [3], AFNet [43], CPFP [50], DMRA [37] and D3Net [13]. For fair comparisons, all saliency maps of these methods are directly provided by authors or computed by their released codes. Besides, the codes and results of AFNet [43] and D3Net [13] on the DUTRGBD [37] dataset are not publicly available. Therefore, their results on this dataset are not listed.

Quantitative Evaluation. In Tab. 1, we list the results of all competitors on eight datasets and six metrics. It can be seen that the proposed method performs best on most datasets and achieve significant performance improvement. On the DUTRGBD [37], our models based on VGG-16, VGG-19 and ResNet-50 have surpassed the second-best model DMRA [37] by 2.02%, 2.85% and 2.45% on $F_{max}$ , and 16.09%, 17.88% and 13.56% on MAE. At the same time, on the recent dataset SIP [13], they have increased by 3.83%, 4.65% and 5.23% on $F_{ada}$ , 5.22%, 6.37% and 6.84% on $F^{\omega}_{\beta}$ , and 20.94%, 24.65% and 24.91% on MAE, over the D3Net [13]. Because the existing RGB-D SOD datasets are relatively small, we propose a new calculation method to measure the performance of models. According to the proportion of each testing set in all testing datasets, the results on all datasets are weighted and summed to obtain an overall performance evaluation, which is listed in the row “AveMetric” in Tab. 1. It can be seen that our structure achieves similar and excellent results on different backbones, which shows that our structure has less dependence on the performance of the backbone. In addition, we show a scatter plot based on the average performance of each model on all datasets and the model size in Fig. 1. Our model has the smallest size while achieving the best result. We demonstrate the PR curves and the F-measure curves in Fig. 4 and Fig. 5. Our approach (red solid line) achieves very good results on these datasets. As shown in Fig. 5, our results are much flatter at most thresholds, which reflects that our prediction results are more uniform and consistent.

Qualitative Evaluation. In Fig. 6, we list some representative results. These examples include scenarios with varying complexity, as well as different types of objects, including cluttered background (Column 1 and 2), simple scene (Column 3 and 4), small objects (Column 5), complex objects (Column 6 and 7), large objects (Column 8), multiple objects (Column 9 and 10) and low contrast between foreground and background (Column 11 and 12). It can be seen that the proposed method can consistently produce more accurate and complete saliency maps with higher contrast.

4.5 Ablation Study

In this section, we perform ablation analysis over the main components of the HDFNet and further investigate their importance and contributions. Our baseline model, i.e. Model 1, uses the commonly used encoder-decoder structure, and all ablation experiments are based on the VGG-16 backbone. In the baseline model, the output features of the last three stages in the depth stream are added to the decoder after compressing the channel to 64 through an independent $1\times 1$ convolution. In order to evaluate the benefits of cross-modal fusion at the dense transport layer (i.e. Model 6), we feed single-modal features into this layer to build Model 2 (i.e. “+T_d”) and Model 4 (i.e. “+T_rgb”). Thus, the followed dynamic filters in the DDPM will be determined only by depth features or RGB features, respectively.

Dynamic Dilated Pyramid Module. Based on Model 2, Model 4, and Model 6, we add the dynamic dilated pyramid module to obtain Model 3, Model 5, and Model 7, respectively. In Tab. 2, we show the performance improvement contributed by different structures in terms of the weighted average metrics “AveMetric”. It can be seen that the DDPM significantly improves performance. Specifically, by comparing Model 3, 5 and 7 with Model 2, 4 and 6, we achieve a relative improvement of 1.47%, 3.11% and 2.11% in terms of $F^{\omega}_{\beta}$ and 5.01%, 10.29% and 6.77% in terms of MAE, respectively. We can see that even without the HEL, the average performance of Model 7 already exceeds these existing models. More comparisons can be found in Appendix 0.A.

In addition, we compare the design of the dynamic filter in DCM [16] with ours. It can be seen that the proposed DDPM (Model 7) has obvious advantages over the DCM (Model 8), and it respectively increases by 3.91%, 5.60%, and 18.07% in terms of $F_{ada}$ , $F^{\omega}_{\beta}$ and MAE. In Fig. 7, we can see that the noise in depth images interferes with the final predictions. By the cross-modal guidance from the DDPMs, the interference is effectively suppressed.

Table 2: Ablation experiments. +T_d: Using a dense transport layer for depth features. +T_rgb: Using a dense transport layer for RGB features. +DDPM: Using a DDPM after the transport layer. +DCM: Using the DCM [16] after the transport layer. +L_e: Using the edge loss as the auxiliary loss. +L_f: Using the foreground loss as the auxiliary loss. +L_b: Using the background loss as the auxiliary loss.

Model	No.	Baseline	+T_d	+T_rgb	+DDPM	+DCM	+L_e	+L_f	+L_b	$F_{max}$	$F_{ada}$	$F^{\omega}_{\beta}$	MAE	$S_{m}$	$E_{m}$
Ours^†	1	✔								0.875	0.819	0.768	0.067	0.865	0.898
	2	✔	✔							0.879	0.820	0.768	0.066	0.868	0.899
	3	✔	✔		✔					0.882	0.820	0.780	0.063	0.873	0.900
	4	✔		✔						0.884	0.839	0.787	0.060	0.874	0.909
	5	✔		✔	✔					0.896	0.852	0.811	0.054	0.886	0.916
	6	✔	✔	✔						0.898	0.846	0.803	0.056	0.884	0.913
	7	✔	✔	✔	✔					0.904	0.856	0.820	0.052	0.893	0.918
	8	✔	✔	✔		✔				0.878	0.823	0.777	0.064	0.871	0.903
	9	✔	✔	✔	✔		✔			0.909	0.878	0.849	0.044	0.898	0.929
	10	✔	✔	✔	✔			✔		0.909	0.845	0.827	0.050	0.887	0.916
	11	✔	✔	✔	✔				✔	0.907	0.874	0.836	0.048	0.895	0.926
	12	✔	✔	✔	✔		✔	✔	✔	0.914	0.878	0.857	0.041	0.898	0.933
R3Net₁₈ [10]	13									0.828	0.714	0.716	0.072	0.831	0.830
R3Net₁₈ [10]	14						✔	✔	✔	0.832	0.731	0.740	0.069	0.835	0.844
CPD₁₉ [45]	15									0.848	0.790	0.769	0.052	0.856	0.889
CPD₁₉ [45]	16						✔	✔	✔	0.849	0.804	0.792	0.049	0.857	0.898
PoolNet₁₉ [27]	15									0.832	0.755	0.728	0.060	0.841	0.865
PoolNet₁₉ [27]	16						✔	✔	✔	0.861	0.811	0.799	0.046	0.862	0.902
GCPANet₂₀ [5]	17									0.847	0.766	0.744	0.061	0.854	0.869
GCPANet₂₀ [5]	18						✔	✔	✔	0.854	0.779	0.773	0.055	0.856	0.880

Hybrid Enhanced Loss. As shown in Tab. 2, the proposed hybrid enhanced loss brings huge performance improvements by comparing Model 7 with Model 12. We evaluate each component in the HEL (Model 9, 10, and 11) and all of them contribute to the final performance. In addition, the benefits of this loss are also clearly reflected in Fig. 5 where the curves of the proposed model are more straight, and Fig. 7 where the predictions of the model “B+R+D+M+L” have higher contrast than ones of the model “B+D+R+M”. Since the design goal of the HEL is to solve the general requirements of SOD tasks, we evaluate its effectiveness on several recent RGB SOD models [10, 45, 27, 5]. For a fair comparison, we retrain these models using the released code. Most of hyper-parameters are the same as the default values given by their corresponding code. The average performance “AveMetric” on five main RGB SOD datasets (DUTS [42], ECSSD [47], HKU-IS [24], PASCAL-S [26] and DUT-OMRON [48]) is shown in Tab. 2. More experimental details and results can be found in Appendix 0.A.

5 Conclusions

In this paper, we revisit the role that depth information should play in the RGB-D based SOD task. We consider the characteristics of spatial structures contained in depth information and combine it with RGB information with rich appearance details. After that, the model generates adaptive filters with different receptive field sizes through the dynamic dilated pyramid module. It can make full use of semantic cues from multi-modal mixed features to achieve multi-scale cross-modal guidance, thereby enhancing the representation capabilities of the decoder. At the same time, we can obtain clearer predictions with the aid of additional region-level supervision to the regions around the edges and fore-/background regions. Expensive experiments on eight datasets and six metrics demonstrate the effectiveness of the designed components. The proposed approach achieves state-of-the-art performance with small model size and high running speed.

Acknowledgements. This work was supported in part by the National Key R&D Program of China #2018AAA0102003, National Natural Science Foundation of China #61876202, #61725202, #61751212 and #61829102, the Dalian Science and Technology Innovation Foundation #2019J12GX039, and the Fundamental Research Funds for the Central Universities #DUT20ZD212.

Appendix 0.A Appendix

In Table 2 of the original paper, we show the weighted average results of each model in terms of six metrics. In Sec. 0.A.1 and Sec 0.A.2 of this document, we respectively list the results of these models across different datasets.

This supplementary document is organized as follows:

•

More details about the performance contributed by different components in the proposed HDFNet.
•

More detailed comparisons of RGB SOD models with the HEL and without the HEL.

0.A.1 Ablation Study

Tab. 3 shows the performance improvement contributed by different components. Note that Model 2, 4 and 6 without the DDPM directly combine the features of dense transport layer into the decoder by element-wise addition instead of using convolution operation of Model 3, 5, 7.

The experiments in Tab 3 are divided into different groups:

1.

Model 2 vs. Model 3: Effectiveness of DDPM using only depth features to compute dynamic filters.
2.

Model 4 vs. Model 5: Effectiveness of DDPM using only RGB features to compute dynamic filters.
3.

Model 6 vs. Model 7 vs. Model 8: Effectiveness of DDPM using two-modality features to compute dynamic filters.
4.

Model 9 vs. Model 10 vs. Model 11 vs. Model 12: Effectiveness of three components in the HEL (L_e, L_f and L_b) and the overall HEL.

0.A.2 Effectiveness of the HEL

Tab 4 shows the performance gains of the proposed loss function in some recent RGB saliency models [10, 45, 27, 5]. Here, “AveMetric” still denotes the weighted average results on all datasets and is consistent with the data in Table 2 of the original paper. It is worth noting that there are some differences in our experimental settings for these models. 1) R3Net [10]: We use ResNeXt-101 [46] as the backbone as the original paper. We only change the supervision for the final prediction to the proposed HEL. 2) CPD [45]: ResNet-50 [17] is used as the backbone. For each branch, we use the proposed HEL to supervise the prediction. 3) PoolNet [27]: The backbone network is ResNet-50. We do not use the strategy of joint training with the edge and we apply the HEL on the final prediction. 4) GCPANet [5]: We also use ResNet-50, and the HEL to supervise the final result with the same resolution as the input.

Table 3: Ablation experiments of different components. “Model

\star

” corresponds to the model “No.

\star

” in Table 2 of the original paper. In each set of comparative experiments, we emphasize the best test results in red.

	Metric	Baseline	DDPM							HEL
			Depth Input		RGB Input		RGB-D Input			L_e	L_f	L_b	ALL
		Model 1	Model 2	Model 3	Model 4	Model 5	Model 6	Model 7	Model 8	Model 9	Model 10	Model 11	Model 12
LFSD [25]	$F_{max}$	0.808	0.819	0.832	0.792	0.825	0.846	0.851	0.809	0.846	0.859	0.858	0.860
	$F_{ada}$	0.776	0.778	0.788	0.762	0.796	0.816	0.812	0.760	0.819	0.806	0.837	0.831
	$F^{\omega}_{\beta}$	0.703	0.698	0.716	0.676	0.722	0.745	0.747	0.685	0.772	0.765	0.773	0.792
	MAE	0.116	0.117	0.113	0.124	0.114	0.099	0.101	0.129	0.092	0.092	0.089	0.085
	$S_{m}$	0.800	0.801	0.815	0.782	0.816	0.833	0.835	0.791	0.838	0.838	0.847	0.847
	$E_{m}$	0.846	0.845	0.847	0.832	0.852	0.867	0.860	0.826	0.869	0.867	0.878	0.883
NJUD [22]	$F_{max}$	0.889	0.897	0.902	0.891	0.899	0.909	0.916	0.892	0.925	0.920	0.916	0.924
	$F_{ada}$	0.840	0.848	0.849	0.851	0.856	0.858	0.872	0.843	0.895	0.867	0.886	0.894
	$F^{\omega}_{\beta}$	0.801	0.807	0.819	0.809	0.826	0.827	0.846	0.808	0.876	0.858	0.857	0.881
	MAE	0.061	0.059	0.055	0.057	0.054	0.051	0.047	0.058	0.039	0.043	0.045	0.037
	$S_{m}$	0.880	0.887	0.891	0.884	0.892	0.898	0.905	0.884	0.911	0.903	0.906	0.911
	$E_{m}$	0.897	0.903	0.905	0.907	0.907	0.908	0.916	0.903	0.932	0.921	0.923	0.934
NLPR [35]	$F_{max}$	0.882	0.886	0.889	0.890	0.899	0.900	0.908	0.885	0.910	0.907	0.912	0.917
	$F_{ada}$	0.819	0.810	0.810	0.830	0.845	0.828	0.848	0.819	0.882	0.829	0.869	0.878
	$F^{\omega}_{\beta}$	0.790	0.786	0.801	0.802	0.824	0.809	0.835	0.804	0.864	0.834	0.847	0.869
	MAE	0.042	0.042	0.039	0.038	0.034	0.038	0.033	0.039	0.029	0.032	0.030	0.027
	$S_{m}$	0.889	0.892	0.899	0.897	0.906	0.900	0.915	0.897	0.916	0.902	0.912	0.916
	$E_{m}$	0.925	0.921	0.921	0.929	0.935	0.927	0.936	0.925	0.948	0.929	0.949	0.948
RGBD135 [7]	$F_{max}$	0.892	0.910	0.904	0.889	0.894	0.917	0.927	0.879	0.931	0.933	0.926	0.934
	$F_{ada}$	0.834	0.864	0.865	0.847	0.841	0.883	0.890	0.828	0.917	0.873	0.906	0.919
	$F^{\omega}_{\beta}$	0.770	0.807	0.819	0.774	0.790	0.835	0.853	0.770	0.889	0.873	0.855	0.902
	MAE	0.040	0.035	0.034	0.039	0.037	0.029	0.027	0.042	0.022	0.023	0.026	0.020
	$S_{m}$	0.873	0.898	0.907	0.873	0.885	0.910	0.922	0.878	0.925	0.925	0.915	0.932
	$E_{m}$	0.929	0.950	0.959	0.934	0.934	0.962	0.965	0.930	0.974	0.959	0.971	0.973
SIP [13]	$F_{max}$	0.865	0.875	0.888	0.875	0.886	0.888	0.889	0.861	0.897	0.899	0.897	0.904
	$F_{ada}$	0.810	0.821	0.832	0.832	0.843	0.844	0.844	0.806	0.865	0.840	0.863	0.863
	$F^{\omega}_{\beta}$	0.753	0.761	0.783	0.772	0.795	0.788	0.797	0.749	0.829	0.812	0.819	0.835
	MAE	0.074	0.072	0.064	0.069	0.063	0.065	0.062	0.074	0.052	0.058	0.056	0.050
	$S_{m}$	0.853	0.858	0.872	0.858	0.867	0.867	0.872	0.849	0.879	0.870	0.878	0.878
	$E_{m}$	0.893	0.894	0.903	0.902	0.909	0.903	0.906	0.893	0.916	0.909	0.914	0.920
SSD [53]	$F_{max}$	0.844	0.859	0.863	0.825	0.853	0.862	0.861	0.841	0.877	0.880	0.875	0.872
	$F_{ada}$	0.770	0.790	0.786	0.782	0.802	0.799	0.818	0.791	0.843	0.811	0.828	0.844
	$F^{\omega}_{\beta}$	0.730	0.744	0.750	0.735	0.758	0.759	0.784	0.739	0.806	0.787	0.789	0.808
	MAE	0.072	0.067	0.063	0.068	0.059	0.058	0.054	0.067	0.047	0.054	0.049	0.048
	$S_{m}$	0.841	0.854	0.859	0.840	0.855	0.859	0.871	0.846	0.871	0.863	0.870	0.866
	$E_{m}$	0.863	0.874	0.876	0.871	0.897	0.887	0.902	0.885	0.911	0.888	0.899	0.913
STEREO [33]	$F_{max}$	0.883	0.871	0.861	0.898	0.906	0.900	0.909	0.895	0.912	0.911	0.910	0.918
	$F_{ada}$	0.822	0.806	0.787	0.853	0.863	0.846	0.857	0.843	0.877	0.841	0.875	0.879
	$F^{\omega}_{\beta}$	0.780	0.762	0.753	0.807	0.829	0.809	0.829	0.805	0.854	0.826	0.843	0.863
	MAE	0.062	0.068	0.071	0.055	0.048	0.053	0.049	0.055	0.042	0.048	0.045	0.039
	$S_{m}$	0.873	0.865	0.858	0.888	0.901	0.892	0.901	0.890	0.903	0.891	0.903	0.906
	$E_{m}$	0.901	0.896	0.883	0.919	0.924	0.918	0.922	0.914	0.932	0.917	0.931	0.937
DUTRGBD [37]	$F_{max}$	0.875	0.886	0.901	0.893	0.916	0.911	0.920	0.872	0.926	0.922	0.924	0.926
	$F_{ada}$	0.817	0.822	0.846	0.839	0.872	0.858	0.871	0.812	0.892	0.861	0.890	0.892
	$F^{\omega}_{\beta}$	0.740	0.749	0.784	0.772	0.817	0.800	0.820	0.742	0.857	0.827	0.840	0.865
	MAE	0.079	0.076	0.066	0.067	0.054	0.059	0.054	0.077	0.044	0.052	0.050	0.040
	$S_{m}$	0.853	0.865	0.877	0.874	0.892	0.886	0.895	0.859	0.908	0.891	0.896	0.905
	$E_{m}$	0.893	0.899	0.912	0.911	0.928	0.915	0.922	0.896	0.937	0.921	0.930	0.938
AveMetric	$F_{max}$	0.875	0.879	0.882	0.884	0.896	0.898	0.904	0.878	0.909	0.909	0.907	0.914
	$F_{ada}$	0.819	0.820	0.820	0.839	0.852	0.846	0.856	0.823	0.878	0.845	0.874	0.878
	$F^{\omega}_{\beta}$	0.768	0.768	0.780	0.787	0.811	0.803	0.820	0.777	0.849	0.827	0.836	0.857
	MAE	0.067	0.066	0.063	0.060	0.054	0.056	0.052	0.064	0.044	0.050	0.048	0.041
	$S_{m}$	0.865	0.868	0.873	0.874	0.886	0.884	0.893	0.871	0.898	0.887	0.895	0.898
	$E_{m}$	0.898	0.899	0.900	0.909	0.916	0.913	0.918	0.903	0.929	0.916	0.926	0.933

Table 4: Comparisons of RGB SOD models with the HEL (w) and without the HEL (w/o). The best result of each group is highlight in red.

	Metric	R3Net₁₈ [10]		CPD₁₉ [45]		PoolNet₁₉ [27]		GCPANet₂₀ [5]
	Metric	w/o	w	w/o	w	w/o	w	w/o	w
DUTS [42]	$F_{max}$	0.823	0.827	0.858	0.859	0.844	0.879	0.856	0.865
	$F_{ada}$	0.688	0.710	0.786	0.807	0.748	0.818	0.759	0.777
	$F^{\omega}_{\beta}$	0.701	0.726	0.774	0.800	0.733	0.814	0.744	0.778
	MAE	0.071	0.069	0.047	0.045	0.055	0.040	0.055	0.049
	$S_{m}$	0.821	0.826	0.863	0.865	0.849	0.873	0.860	0.864
	$E_{m}$	0.818	0.833	0.889	0.903	0.864	0.909	0.867	0.881
DUT-OMRON [48]	$F_{max}$	0.785	0.794	0.799	0.797	0.778	0.807	0.798	0.806
	$F_{ada}$	0.668	0.685	0.739	0.749	0.699	0.754	0.713	0.726
	$F^{\omega}_{\beta}$	0.669	0.693	0.714	0.732	0.668	0.736	0.689	0.715
	MAE	0.079	0.078	0.059	0.058	0.066	0.054	0.070	0.066
	$S_{m}$	0.812	0.816	0.828	0.826	0.810	0.829	0.826	0.826
	$E_{m}$	0.804	0.818	0.864	0.869	0.839	0.874	0.841	0.851
ECSSD [47]	$F_{max}$	0.927	0.932	0.934	0.939	0.921	0.943	0.933	0.933
	$F_{ada}$	0.858	0.866	0.908	0.918	0.882	0.919	0.892	0.896
	$F^{\omega}_{\beta}$	0.858	0.882	0.881	0.906	0.846	0.905	0.863	0.882
	MAE	0.053	0.045	0.044	0.035	0.054	0.037	0.049	0.042
	$S_{m}$	0.908	0.910	0.911	0.918	0.897	0.918	0.912	0.912
	$E_{m}$	0.911	0.918	0.941	0.951	0.920	0.947	0.930	0.935
HKU-IS [24]	$F_{max}$	0.914	0.917	0.922	0.927	0.915	0.931	0.923	0.928
	$F_{ada}$	0.842	0.856	0.886	0.899	0.869	0.903	0.878	0.885
	$F^{\omega}_{\beta}$	0.831	0.857	0.864	0.892	0.838	0.893	0.851	0.876
	MAE	0.047	0.040	0.037	0.030	0.043	0.029	0.041	0.035
	$S_{m}$	0.890	0.895	0.904	0.911	0.896	0.911	0.907	0.910
	$E_{m}$	0.918	0.929	0.945	0.956	0.937	0.958	0.943	0.947
PASCAL-S [26]	$F_{max}$	0.844	0.836	0.868	0.867	0.850	0.873	0.856	0.857
	$F_{ada}$	0.757	0.764	0.819	0.829	0.786	0.829	0.798	0.803
	$F^{\omega}_{\beta}$	0.733	0.750	0.784	0.806	0.745	0.806	0.764	0.782
	MAE	0.101	0.091	0.079	0.072	0.093	0.073	0.087	0.081
	$S_{m}$	0.813	0.822	0.842	0.844	0.822	0.843	0.837	0.835
	$E_{m}$	0.823	0.832	0.872	0.889	0.844	0.878	0.856	0.863
AveMetric	$F_{max}$	0.828	0.832	0.848	0.849	0.832	0.861	0.847	0.854
	$F_{ada}$	0.714	0.731	0.790	0.804	0.755	0.811	0.766	0.779
	$F^{\omega}_{\beta}$	0.716	0.740	0.769	0.792	0.728	0.799	0.744	0.773
	MAE	0.072	0.069	0.052	0.049	0.060	0.046	0.061	0.055
	$S_{m}$	0.831	0.835	0.856	0.857	0.841	0.862	0.854	0.856
	$E_{m}$	0.830	0.844	0.889	0.898	0.865	0.902	0.869	0.880

References

[1] Achanta, R., Hemami, S., Estrada, F., Süsstrunk, S.: Frequency-tuned salient region detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 1597–1604. No. CONF (2009)
[2] Chen, H., Li, Y.: Progressively complementarity-aware fusion network for rgb-d salient object detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 3051–3060 (2018)
[3] Chen, H., Li, Y.: Three-stream attention-aware network for rgb-d salient object detection. IEEE Transactions on Image Processing 28(6), 2825–2835 (2019)
[4] Chen, H., Li, Y., Su, D.: Multi-modal fusion network with multi-scale multi-path and cross-modal interactions for rgb-d salient object detection. Pattern Recognition 86, 376–385 (2019)
[5] Chen, Z., Xu, Q., Cong, R., Huang, Q.: Global context-aware progressive aggregation network for salient object detection. In: AAAI Conference on Artificial Intelligence (2020)
[6] Cheng, Y., Fu, H., Wei, X., Xiao, J., Cao, X.: Depth enhanced saliency detection method. In: Proceedings of the International Conference on Internet Multimedia Computing and Service. pp. 23–27 (2014)
[7] Cheng, Y., Fu, H., Wei, X., Xiao, J., Cao, X.: Depth enhanced saliency detection method. In: Proceedings of the International Conference on Internet Multimedia Computing and Service. pp. 23–27 (2014)
[8] Ciptadi, A., Hermans, T., Rehg, J.M.: An in depth view of saliency (2013)
[9] Cong, R., Lei, J., Zhang, C., Huang, Q., Cao, X., Hou, C.: Saliency detection for stereoscopic images based on depth confidence analysis and multiple cues fusion. IEEE Signal Processing Letters 23(6), 819–823 (2016)
[10] Deng, Z., Hu, X., Zhu, L., Xu, X., Qin, J., Han, G., Heng, P.A.: R3net: Recurrent residual refinement network for saliency detection. In: International Joint Conference on Artificial Intelligence. pp. 684–690 (2018)
[11] Fan, D.P., Cheng, M.M., Liu, Y., Li, T., Borji, A.: Structure-measure: A new way to evaluate foreground maps. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 4548–4557 (2017)
[12] Fan, D.P., Gong, C., Cao, Y., Ren, B., Cheng, M.M., Borji, A.: Enhanced-alignment measure for binary foreground map evaluation. In: International Joint Conference on Artificial Intelligence. pp. 698–704 (2018)
[13] Fan, D.P., Lin, Z., Zhao, J.X., Liu, Y., Zhang, Z., Hou, Q., Zhu, M., Cheng, M.M.: Rethinking rgb-d salient object detection: Models, datasets, and large-scale benchmarks. arXiv preprint arXiv:1907.06781 (2019)
[14] Fan, D.P., Wang, W., Cheng, M.M., Shen, J.: Shifting more attention to video salient object detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 8554–8564 (2019)
[15] Han, J., Chen, H., Liu, N., Yan, C., Li, X.: Cnns-based rgb-d saliency detection via cross-view transfer and multiview fusion. IEEE Transactions on Cybernetics 48(11), 3171–3183 (2017)
[16] He, J., Deng, Z., Qiao, Y.: Dynamic multi-scale filters for semantic segmentation. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 3562–3572 (2019)
[17] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 770–778 (2016)
[18] Hou, Q., Cheng, M.M., Hu, X., Borji, A., Tu, Z., Torr, P.H.: Deeply supervised salient object detection with short connections. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 3203–3212 (2017)
[19] Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017)
[20] Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 4700–4708 (2017)
[21] Jia, X., De Brabandere, B., Tuytelaars, T., Gool, L.V.: Dynamic filter networks. In: Conference and Workshop on Neural Information Processing Systems. pp. 667–675 (2016)
[22] Ju, R., Liu, Y., Ren, T., Ge, L., Wu, G.: Depth-aware salient object detection using anisotropic center-surround difference. Signal Processing: Image Communication 38, 115–126 (2015)
[23] Jun Wei, Shuhui Wang, Q.H.: F3net: Fusion, feedback and focus for salient object detection. In: AAAI Conference on Artificial Intelligence (2020)
[24] Li, G., Yu, Y.: Visual saliency based on multiscale deep features. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 5455–5463 (2015)
[25] Li, N., Ye, J., Ji, Y., Ling, H., Yu, J.: Saliency detection on light field. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 2806–2813 (2014)
[26] Li, Y., Hou, X., Koch, C., Rehg, J.M., Yuille, A.L.: The secrets of salient object segmentation. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 280–287 (2014)
[27] Liu, J.J., Hou, Q., Cheng, M.M., Feng, J., Jiang, J.: A simple pooling-based design for real-time salient object detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2019)
[28] Liu, N., Han, J., Yang, M.H.: Picanet: Learning pixel-wise contextual attention for saliency detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 3089–3098 (2018)
[29] Liu, W., Rabinovich, A., Berg, A.C.: Parsenet: Looking wider to see better. arXiv preprint arXiv:1506.04579 (2015)
[30] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 3431–3440 (2015)
[31] Mahadevan, V., Vasconcelos, N.: Saliency-based discriminant tracking. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2009)
[32] Margolin, R., Zelnik-Manor, L., Tal, A.: How to evaluate foreground maps? In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 248–255 (2014)
[33] Niu, Y., Geng, Y., Li, X., Liu, F.: Leveraging stereopsis for saliency analysis. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 454–461 (2012)
[34] Pang, Y., Zhao, X., Zhang, L., Lu, H.: Multi-scale interactive network for salient object detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2020)
[35] Peng, H., Li, B., Xiong, W., Hu, W., Ji, R.: Rgbd salient object detection: a benchmark and algorithms. In: Proceedings of European Conference on Computer Vision. pp. 92–109 (2014)
[36] Perazzi, F., Krähenbühl, P., Pritch, Y., Hornung, A.: Saliency filters: Contrast based filtering for salient region detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 733–740 (2012)
[37] Piao, Y., Ji, W., Li, J., Zhang, M., Lu, H.: Depth-induced multi-scale recurrent attention network for saliency detection. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 7254–7263 (2019)
[38] Qu, L., He, S., Zhang, J., Tian, J., Tang, Y., Yang, Q.: Rgbd salient object detection via deep fusion. IEEE Transactions on Image Processing 26(5), 2274–2285 (2017)
[39] Ren, Z., Gao, S., Chia, L.T., Tsang, I.W.H.: Region-based saliency detection and its application in object recognition. IEEE Transactions on Circuits and Systems for Video Technology 24(5), 769–779 (2013)
[40] Rui, Z., Ouyang, W., Wang, X.: Unsupervised salience learning for person re-identification. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2013)
[41] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
[42] Wang, L., Lu, H., Wang, Y., Feng, M., Wang, D., Yin, B., Ruan, X.: Learning to detect salient objects with image-level supervision. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 136–145 (2017)
[43] Wang, N., Gong, X.: Adaptive fusion for rgb-d salient object detection. IEEE Access 7, 55277–55284 (2019)
[44] Wei, Y., Liang, X., Chen, Y., Shen, X., Cheng, M.M., Feng, J., Zhao, Y., Yan, S.: Stc: A simple to complex framework for weakly-supervised semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 39(11), 2314–2320 (2016)
[45] Wu, Z., Su, L., Huang, Q.: Cascaded partial decoder for fast and accurate salient object detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 3907–3916 (2019)
[46] Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 1492–1500 (2017)
[47] Yan, Q., Xu, L., Shi, J., Jia, J.: Hierarchical saliency detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 1155–1162 (2013)
[48] Yang, C., Zhang, L., Lu, H., Ruan, X., Yang, M.H.: Saliency detection via graph-based manifold ranking. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 3166–3173 (2013)
[49] Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. In: Bengio, Y., LeCun, Y. (eds.) International Conference on Learning Representations (2016), http://arxiv.org/abs/1511.07122
[50] Zhao, J.X., Cao, Y., Fan, D.P., Cheng, M.M., Li, X.Y., Zhang, L.: Contrast prior and fluid pyramid integration for rgbd salient object detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 3927–3936 (2019)
[51] Zhao, J.X., Liu, J.J., Fan, D.P., Cao, Y., Yang, J., Cheng, M.M.: Egnet:edge guidance network for salient object detection. In: Proceedings of the IEEE International Conference on Computer Vision (Oct 2019)
[52] Zhao, X., Pang, Y., Zhang, L., Lu, H., Zhang, L.: Suppress and balance: A simple gated network for salient object detection. In: Proceedings of European Conference on Computer Vision (2020)
[53] Zhu, C., Li, G.: A three-pathway psychobiological framework of salient object detection using stereoscopic technology. In: International Conference on Computer Vision Workshops. pp. 3008–3014 (2017)
[54] Zhu, C., Li, G., Wang, W., Wang, R.: An innovative salient object detection using center-dark channel prior. In: International Conference on Computer Vision Workshops. pp. 1509–1515 (2017)