
1 Dalian University of Technology, China   2 Peng Cheng Laboratory   3 Dept. of Computing, The Hong Kong Polytechnic University, China   4 DAMO Academy, Alibaba Group
Email: {zxq,lartpang}@mail.dlut.edu.cn, {zhanglihe,lhchuan}@dlut.edu.cn, [email protected]
https://github.com/Xiaoqi-Zhao-DLUT/GateNet-RGB-Saliency

Suppress and Balance: A Simple Gated Network for Salient Object Detection

Xiaoqi Zhao 1*, Youwei Pang 1*, Lihe Zhang 1 (corresponding author), Huchuan Lu 1,2 and Lei Zhang 3,4
* These authors contributed equally to this work.
Abstract

Most salient object detection approaches use U-Net or feature pyramid networks (FPN) as their basic structures. These methods ignore two key problems when the encoder exchanges information with the decoder: one is the lack of interference control between them, and the other is the failure to account for the different contributions of individual encoder blocks. In this work, we propose a simple gated network (GateNet) to solve both issues at once. With the help of multilevel gate units, the valuable context information from the encoder can be optimally transmitted to the decoder. We design a novel gated dual-branch structure to build cooperation among features of different levels and improve the discriminability of the whole network. Through the dual-branch design, more details of the saliency map can be restored. In addition, we adopt atrous spatial pyramid pooling based on the proposed "Fold" operation (Fold-ASPP) to accurately localize salient objects of various scales. Extensive experiments on five challenging datasets demonstrate that the proposed model performs favorably against most state-of-the-art methods under different evaluation metrics.

Keywords:
Salient Object Detection \cdot Gated Network \cdot Dual Branch \cdot Fold-ASPP

1 Introduction

Salient object detection aims to identify the visually distinctive regions or objects in a scene and then accurately segment them. It is used as a pre-processing step in many computer vision applications, such as scene classification [39], visual tracking [32], person re-identification [41], light field image segmentation [52] and image captioning [17].

Figure 1: Visual comparison of different CNN based methods.

With the development of deep learning, salient object detection has gradually evolved from traditional methods based on hand-designed features to deep learning methods. In recent years, U-shape based structures [40, 28] have received the most attention due to their ability to utilize multilevel information to reconstruct high-resolution feature maps. Therefore, most state-of-the-art saliency detection networks [29, 20, 69, 53, 67, 70, 57, 37, 33] adopt the U-shape encoder-decoder architecture, and many methods aim at combining multilevel features in either the encoder [69, 53, 67, 57, 37, 59] or the decoder [29, 20, 70, 59]. For each convolutional block, they separately formulate the relationships of internal features for the forward update. It is well known that the high-quality saliency maps predicted in the decoder rely heavily on the effective features provided by the encoder. Nevertheless, the aforementioned methods directly use an all-pass skip-layer structure to concatenate the features of the encoder to the decoder, and the effectiveness of feature aggregation at different levels is not quantified. These restrictions not only introduce misleading context information into the decoder but also prevent the truly useful features from being adequately utilized. In cognitive science, Yang et al. [64] show that inhibitory neurons play an important role in how the human brain chooses to process the most important information among all the information presented to it: inhibitory neurons ensure that humans respond appropriately to external stimuli by inhibiting other neurons and balancing the excitatory neurons that drive neuronal activity. Inspired by this work, we believe that it is necessary to set up an information screening unit between each pair of encoder and decoder blocks in saliency detection. It can help highlight the most discriminative features of salient regions and suppress background interference, as shown in Fig. 1, where the images have easily-confused backgrounds or low-contrast objects.

Moreover, due to the limited receptive field, it is difficult for a single-scale convolutional kernel to capture context information of size-varying objects. This motivates some efforts [12, 67] to investigate multiscale feature extraction. These methods directly equip an atrous spatial pyramid pooling (ASPP) module [6] in their networks. However, when using a convolution with a large dilation rate, the information under the kernel seriously lacks correlation because too many zeros are inserted, which may be detrimental to the discrimination of subtle image structures.

In this paper, we propose a simple gated network (GateNet) for salient object detection. Based on the feature pyramid network (FPN), we construct multilevel gate units to combine the features from the decoder and the encoder. We use convolution operation and nonlinear functions to calculate the correlations among features and assign gate values to different blocks. In this process, a partnership is established between different blocks by using weight distribution and the decoder can obtain more efficient information from the encoder and pay more attention to the salient regions. Since the top-layer features of the encoder network contain rich contextual information, we construct a folded atrous spatial pyramid pooling (Fold-ASPP) module to gather multiscale high-level saliency cues. With the “Fold” operation, the atrous convolution is implemented on a group of local neighborhoods rather than a group of isolated sampling points, which can help generate more stable features and more adequately depict finer structure. In addition, we design a parallel branch by concatenating the output of the FPN branch and the features of the gated encoder, so that the residual information complementary to the FPN branch is supplemented to generate the final saliency map.

Our main contributions can be summarized as follows.

  • We propose a simple gated network to adaptively control the amount of information that flows into the decoder from each encoder block. With multilevel gate units, the network can balance the contribution of each encoder block to the decoder block and suppress the features of non-salient regions.

  • We design a Fold-ASPP module to capture richer context information and localize salient objects of various sizes. By the “Fold” operation, we can obtain more effective feature representation.

  • We build a dual-branch architecture. The two branches form a residual structure, complement each other through the gated processing and generate better results.

We compare the proposed model with seventeen state-of-the-art methods on five challenging datasets. The results show that our method performs much better than the other competitors. In addition, it achieves a real-time speed of 30 fps.

2 Related Work

2.1 Salient Object Detection

Early saliency detection methods are based on low-level features and heuristic prior knowledge, such as color contrast [1], background prior [62] and center prior [22]. Most of them use hand-crafted features; more details about the traditional methods are discussed in [54].

With the breakthrough of deep learning in computer vision, a large number of convolutional neural network-based salient object detection methods have been proposed and their performance has gradually improved. In particular, fully convolutional networks (FCNs), which avoid the problems caused by fully-connected layers, have become the mainstream for dense prediction tasks. Wang et al. [50] use weight sharing to iteratively refine features and promote mutual fusion between features. Hou et al. [20] achieve efficient feature expression by continuously blending features from deep layers into shallow layers. However, single-scale features cannot comprehensively characterize various objects and image contexts. How to extract multiscale features and integrate context information is an important problem in saliency detection.

2.2 Multiscale Feature Extraction

Recently, the atrous spatial pyramid pooling (ASPP) module [6] has been widely applied in many tasks and networks. Atrous convolution can enlarge the receptive field to obtain large-scale features without increasing the computational cost. Therefore, it is often used in saliency detection networks. Zhang et al. [67] insert several ASPP modules into encoder blocks of different levels, while Deng et al. [12] install one on the highest-level encoder block. Nevertheless, the repeated stride and pooling operations already make the top-layer features lose much fine information. As the atrous rate increases, the correlation among sampling points further degrades, which makes it difficult to capture the changes of image details (e.g., lathy background regions between adjacent objects or spindly parts of objects). In this work, we propose a folded ASPP to alleviate these issues and achieve a local-in-local effect.

2.3 Gated Mechanisms

The gated mechanism plays an important role in controlling the flow of information and is widely used in long short-term memory (LSTM) networks. In [2], the gate unit combines two consecutive feature maps of different resolutions from the encoder to generate rich contextual information. Zhang et al. [67] adopt a gate function to control message passing when combining feature maps at all levels of the encoder. Due to its ability to filter information, the gated mechanism can also be seen as a special kind of attention mechanism. Some saliency methods [7, 70, 57] employ attention networks. Zhang et al. [70] apply both spatial and channel attention to each layer of the decoder. Wang et al. [57] exploit a pyramid attention module to enhance the saliency representation of each layer in the encoder and enlarge the receptive field. The above methods all unilaterally consider the information interaction between different levels either in the encoder or in the decoder. We instead integrate features from both the encoder and the decoder to formulate the gate function, which plays the role of block-wise attention and models the overall distribution of all blocks in the network from a global perspective, whereas previous methods utilize only the block-specific features to compute dense attention weights for the corresponding block. Moreover, in order to take advantage of the rich contextual information in the encoder, those methods directly feed the encoder features into the decoder without considering their mutual interference. Our proposed gate unit naturally balances their contributions, thereby suppressing the response of the encoder to non-salient regions. Experimental results in Fig. 4 and Fig. 9 intuitively demonstrate the effect of the multilevel gate units on these two aspects, respectively.

3 Proposed Method

Figure 2: The overall architecture of the gated network. It consists of the VGG-16 encoder ($\mathbf{E}^{1}\sim\mathbf{E}^{5}$), five transition layers ($\mathbf{T}^{1}\sim\mathbf{T}^{5}$), five gate units ($\mathbf{G}^{1}\sim\mathbf{G}^{5}$), five decoder blocks ($\mathbf{D}^{1}\sim\mathbf{D}^{5}$) and the Fold-ASPP module. Supervision is applied twice: once at the end of the FPN branch ($D^{1}$) and once to guide the fusion of the two branches.

The gated network architecture is shown in Fig. 2, in which encoder blocks, transition layers, decoder blocks and gate units are respectively denoted as $\mathbf{E}^{i}$, $\mathbf{T}^{i}$, $\mathbf{D}^{i}$ and $\mathbf{G}^{i}$ ($i\in\{1,2,3,4,5\}$ indexes different levels), and their output feature maps are denoted as $E^{i}$, $T^{i}$, $D^{i}$ and $G^{i}$, respectively. The final prediction is obtained by combining the FPN branch and the parallel branch. In this section, we first describe the overall architecture, then detail the gated dual-branch structure and the folded atrous spatial pyramid pooling module.

3.1 Network Overview

Encoder Network. In our model, the encoder is based on a common pretrained backbone network, e.g., VGG [43], ResNet [19] or ResNeXt [60]. We take the VGG-16 network as an example, which contains thirteen convolutional layers, five max-pooling layers and three fully-connected layers. In order to fit the saliency detection task, similar to most previous approaches [69, 20, 70, 67], we cast away all the fully-connected layers of VGG-16 and remove the last pooling layer to retain the details of the last convolutional layer.
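For concreteness, the following PyTorch sketch (not the authors' released code) shows one way to build such a truncated VGG-16 encoder; the block split indices for $E^{1}\sim E^{5}$ are our assumption:

```python
import torch
import torchvision

def build_vgg16_encoder():
    # Pretrained VGG-16 feature extractor; drop the final max-pooling layer
    # (index 30) so that conv5_3 keeps a stride of 16 instead of 32.
    features = torchvision.models.vgg16(pretrained=True).features
    return torch.nn.Sequential(*list(features.children())[:-1])

# Assumed slices of the truncated feature stack for the five encoder blocks E1..E5.
VGG16_BLOCK_SLICES = [(0, 4), (4, 9), (9, 16), (16, 23), (23, 30)]
```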

Decoder Network. The decoder comprises three main components. i) The FPN branch, which continually fuses features of different levels from $T^{1}\sim T^{5}$ by element-wise addition. ii) The parallel branch, which combines the saliency map of the FPN branch with the feature maps of the transition layers by cross-channel concatenation. At the same time, multilevel gate units ($\mathbf{G}^{1}\sim\mathbf{G}^{5}$) are inserted between the transition layers and the decoder layers. iii) The Fold-ASPP module, which improves the original atrous spatial pyramid pooling (ASPP) by using a "Fold" operation. It takes advantage of the semantic features learned from $E^{5}$ to provide multiscale information to the decoder.

3.2 Gated Dual Branch

Figure 3: Detailed illustration of the gate unit. $E^{i}$ and $D^{i+1}$ indicate the feature maps of the current encoder block and those of the previous decoder block, respectively. Ⓢ denotes the sigmoid function.

Figure 4: The distributions of the gate weights on five datasets. We calculate the average gate values for each level of the FPN branch and the parallel branch across all images in every dataset. For the FPN branch, the low-level gate values are significantly smaller than the high-level ones. For the parallel branch, the gate values gradually decrease as the level rises.

The gate unit can control the message passing between scale-matched encoder and decoder blocks. By combining the feature maps of the previous decoder block, the gate value also characterizes the contribution that the current encoder block can provide. Fig. 3 shows the internal structure of the proposed gate unit. In particular, the encoder feature $E^{i}$ and the decoder feature $D^{i+1}$ are integrated to obtain the feature $F^{i}$, which is then fed into two branches, each consisting of a series of convolution, activation and pooling operations, to compute a pair of gate values $G^{i}$. The entire gating process can be formulated as

G^{i}=\begin{cases} P(S(Conv(Cat(E^{i},D^{i+1})))) & \text{if } i=1,2,3,4 \\ P(S(Conv(Cat(E^{i},T^{i})))) & \text{if } i=5 \end{cases}    (1)

where $Cat(\cdot)$ is the concatenation operation along the channel axis, $Conv(\cdot)$ refers to the convolution layer, $S(\cdot)$ is the element-wise sigmoid function, and $P(\cdot)$ is the global average pooling. The output channel number of $Conv(\cdot)$ is 2. The resulting gate vector $G^{i}$ has two elements, which correspond to the two gate values in Fig. 3.
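A hedged PyTorch sketch of the gate unit of Eq. (1) follows; the interface (concatenate, convolve to 2 channels, sigmoid, global average pooling) matches the description above, while the internal convolution stack is simplified to a single layer for illustration:

```python
import torch
import torch.nn as nn

class GateUnit(nn.Module):
    def __init__(self, enc_channels, dec_channels):
        super().__init__()
        # Conv(.) in Eq. (1): maps the concatenated features to 2 channels.
        self.conv = nn.Conv2d(enc_channels + dec_channels, 2, kernel_size=3, padding=1)

    def forward(self, e_i, d_next):
        # For i = 5, the transition feature T^5 takes the place of D^{i+1}.
        f_i = torch.cat([e_i, d_next], dim=1)        # Cat(E^i, D^{i+1})
        gates = torch.sigmoid(self.conv(f_i))        # S(Conv(.))
        gates = gates.mean(dim=(2, 3))               # P(.): global average pooling
        # Two scalar gate values per image: G_1^i for the FPN branch,
        # G_2^i for the parallel branch.
        g1 = gates[:, 0:1, None, None]
        g2 = gates[:, 1:2, None, None]
        return g1, g2
```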

Given the gate values, they are applied to the FPN branch and the parallel branch to weight the transition-layer features $T^{1}\sim T^{5}$, which are generated by exploiting $3\times 3$ convolutions to reduce the dimension of $E^{1}\sim E^{4}$ and by the Fold-ASPP to finely process $E^{5}$ (please see Fig. 2 for details). Through multilevel gate units, we can suppress and balance the information flowing from different encoder blocks to the decoder.

In Fig. 4, we plot the statistics of the gate values with the convolutional level as the horizontal axis. It can be seen that, in the FPN branch, the high-level encoder features contribute more contextual guidance to the decoder than the low-level encoder features. This trend is exactly the opposite in the parallel branch. This is because the FPN branch is responsible for predicting the main body of the salient object by progressively combining multilevel features, which requires more high-level semantic knowledge, while the parallel branch, as a residual structure, aims to fill in the details, which are mainly contained in the low-level features. In addition, the visual examples shown in Fig. 9 demonstrate that multilevel gate units can significantly suppress the interference from each encoder block and enhance the contrast between salient and non-salient regions. Since the proposed gate unit is simple yet effective, a raw FPN network equipped with multilevel gate units can be viewed as a new baseline for the saliency detection task.

Figure 5: Illustration of different decoder architectures. (a) Progressive structure, (b) parallel structure and (c) our dual-branch structure.

Most existing models use either a progressive decoder [67, 53, 70, 57] or a parallel decoder [12, 72], as shown in Fig. 5. The progressive structure begins with the top layer and gradually utilizes the output of the higher layer as prior knowledge to fuse the encoder features. This mechanism is not conducive to the recovery of details because the high-level features lack fine information. The parallel structure, on the other hand, easily results in inaccurate localization of objects, since the low-level features without semantic information directly interfere with the capture of global structure cues. In this work, we mix the two structures to build a dual-branch decoder that overcomes the above restrictions. We briefly describe the FPN branch. Taking $D^{i}$ as an example, we first apply bilinear interpolation to upsample the higher-level feature $D^{i+1}$ to the same size as $T^{i}$. Next, to decrease the number of parameters, $T^{i}$ is reduced to 32 channels and fed into the gate unit $\mathbf{G}^{i}$. Lastly, the gated feature is fused with the upsampled feature of $D^{i+1}$ by element-wise addition and convolutional layers. This process can be formulated as follows:

D^{i}=\begin{cases} Conv(G_{1}^{i}\cdot T^{i}+Up(D^{i+1})) & \text{if } i=1,2,3,4 \\ Conv(G_{1}^{i}\cdot T^{i}) & \text{if } i=5, \end{cases}    (2)

where $Up(\cdot)$ denotes bilinear upsampling and $D^{1}$ is a single-channel feature map with the same size as the input image.
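The following minimal sketch illustrates one FPN-branch decoder step of Eq. (2); the 32-channel width follows the text, while the convolution stack after fusion is an illustrative assumption:

```python
import torch.nn as nn
import torch.nn.functional as F

class DecoderBlock(nn.Module):
    def __init__(self, channels=32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, t_i, g1_i, d_next=None):
        x = g1_i * t_i                               # gate the transition features
        if d_next is not None:                       # i = 1..4: add upsampled D^{i+1}
            x = x + F.interpolate(d_next, size=t_i.shape[2:],
                                  mode='bilinear', align_corners=False)
        return self.conv(x)                          # Conv(.) in Eq. (2)
```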

In the parallel branch, we first upsample $T^{1}\sim T^{5}$ to the same size as $D^{1}$. Next, the multilevel gate units are applied to weight the corresponding transition-layer features. Lastly, we combine $D^{1}$ and the gated features by cross-channel concatenation. The whole process is written as follows:

F_{Cat}=Cat(D^{1},Up(G_{2}^{1}\cdot T^{1}),Up(G_{2}^{2}\cdot T^{2}),Up(G_{2}^{3}\cdot T^{3}),Up(G_{2}^{4}\cdot T^{4}),Up(G_{2}^{5}\cdot T^{5})).    (3)

The final saliency map $S^{F}$ is generated by integrating the predictions of the two branches with a residual connection, as shown in Fig. 5(c):

S^{F}=S(Conv(F_{Cat})+D^{1}),    (4)

where $S(\cdot)$ is the element-wise sigmoid function.
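A possible implementation of the parallel branch and the residual fusion of Eqs. (3)-(4) is sketched below; the layer widths are assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualFusion(nn.Module):
    def __init__(self, trans_channels=32, num_levels=5):
        super().__init__()
        # Conv(.) in Eq. (4): squeeze the concatenation to a 1-channel residual map.
        self.conv = nn.Conv2d(1 + num_levels * trans_channels, 1, kernel_size=3, padding=1)

    def forward(self, d1, trans_feats, g2_gates):
        size = d1.shape[2:]
        gated = [F.interpolate(g * t, size=size, mode='bilinear', align_corners=False)
                 for t, g in zip(trans_feats, g2_gates)]   # Up(G_2^i * T^i)
        f_cat = torch.cat([d1] + gated, dim=1)             # Eq. (3)
        return torch.sigmoid(self.conv(f_cat) + d1)        # Eq. (4): S^F
```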

3.3 Folded Atrous Spatial Pyramid Pooling

In order to obtain robust segmentation results by integrating multiscale information, atrous spatial pyramid pooling (ASPP) was proposed in DeepLab [6], and some works [67, 12] have also shown its effectiveness in saliency detection. The ASPP uses multiple parallel atrous convolutional layers with different dilation rates. However, the sparsity of the atrous convolution kernel, especially with a large dilation rate, means that the sampling points are too weakly correlated to extract stable features. In this paper, we apply a simple "Fold" operation to effectively relieve this issue. We visualize the folded convolution structure in Fig. 6, which not only further enlarges the receptive field but also extends each valid sampling position from an isolated point to a $2\times 2$ connected region.

Figure 6: Illustration of the folded convolution. We use ①, ② and ③ to respectively indicate the "Fold" operation, the atrous convolution and the "Unfold" operation. ④ shows the comparison between the atrous convolution (left) and the folded atrous convolution (right).

Let $\mathbf{X}$ represent feature maps of size $N\times N\times C$ ($C$ is the number of channels). We slide a $2\times 2$ window over $\mathbf{X}$ with stride 2 and then conduct atrous convolution with kernel size $K\times K$ at different dilation rates. Fig. 6 shows the computational process when $K=3$ and the dilation rate is 2. First, we collect the $2\times 2\times C$ feature points in each window of $\mathbf{X}$ and stack them along the channel direction; we call this operation "Fold", as shown in Fig. 6 ①. After the fold operation, we obtain new feature maps of size $N/2\times N/2\times 4C$. A point on the new feature maps corresponds to a $2\times 2$ area on the original feature maps. Second, we apply an atrous convolution with a kernel size of $3\times 3$ and a dilation rate of 2. Followed by the reverse process of "Fold", which we call the "Unfold" operation, the final feature maps are obtained. With the folded atrous convolution, more context is merged during information transfer across convolution layers while a certain local correlation is preserved, which provides fault-tolerance capability for subsequent operations.
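Assuming the "Fold" and "Unfold" steps behave like space-to-depth and depth-to-space with a 2x2 window, the folded atrous convolution can be sketched in PyTorch as follows (PixelUnshuffle/PixelShuffle are used as stand-ins; the channel ordering inside each 2x2 cell does not affect the idea):

```python
import torch.nn as nn

class FoldedAtrousConv(nn.Module):
    def __init__(self, in_channels, out_channels, dilation):
        super().__init__()
        self.fold = nn.PixelUnshuffle(2)    # (1) Fold: N x N x C -> N/2 x N/2 x 4C
        self.conv = nn.Conv2d(4 * in_channels, 4 * out_channels, kernel_size=3,
                              dilation=dilation, padding=dilation)  # (2) atrous conv
        self.unfold = nn.PixelShuffle(2)    # (3) Unfold: back to N x N x out_channels

    def forward(self, x):
        return self.unfold(self.conv(self.fold(x)))
```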

As shown in Fig. 2, the Fold-ASPP is only equipped on the top of the encoder network. It consists of three folded convolutional layers with dilation rates of $[2,4,6]$ to fit the size of the feature maps. Just as group convolution [60] is a trade-off between depthwise convolution [10, 21] and vanilla convolution in the channel dimension, the proposed folded convolution is a trade-off between atrous convolution and vanilla convolution in the spatial dimension.
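Reusing the FoldedAtrousConv sketch above, a Fold-ASPP module with dilation rates 2, 4 and 6 might look as follows; concatenating the three branches and fusing them with a 1x1 convolution is our assumption about the aggregation step:

```python
import torch
import torch.nn as nn

class FoldASPP(nn.Module):
    def __init__(self, in_channels=512, out_channels=32, rates=(2, 4, 6)):
        super().__init__()
        # Three parallel folded atrous convolutions on the top-level feature E^5.
        self.branches = nn.ModuleList(
            [FoldedAtrousConv(in_channels, out_channels, r) for r in rates])
        self.fuse = nn.Conv2d(len(rates) * out_channels, out_channels, kernel_size=1)

    def forward(self, e5):
        return self.fuse(torch.cat([b(e5) for b in self.branches], dim=1))
```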

3.4 Supervision

As shown in Fig. 2, we use the cross-entropy loss for both the intermediate prediction from the FPN branch and the final prediction from the dual branch. In the dual-branch decoder, since the FPN branch gradually combines gated encoding and decoding features at all levels, it has very powerful prediction ability. We expect it to predict salient objects as accurately as possible under the supervision of the ground truth. The parallel branch, in contrast, only combines the gated encoding features, which helps remedy the ignored details through the residual structure. Moreover, the supervision on $D^{1}$ drives the gate units to learn the weight of the contribution of each encoder block to the final prediction. The total loss $L$ can be written as:

L=l_{s1}+l_{sf},    (5)

where $l_{s1}$ and $l_{sf}$ are used to regularize the output of the FPN branch and the final prediction, respectively. The cross-entropy loss is computed as:

l=-\left[Y\log P+(1-Y)\log(1-P)\right],    (6)

where $P$ and $Y$ denote the predicted map and the ground truth, respectively.
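A minimal sketch of the two-term objective in Eqs. (5)-(6), assuming both predictions are probability maps in [0, 1]:

```python
import torch.nn.functional as F

def gatenet_loss(pred_fpn, pred_final, gt):
    # l_s1 supervises the FPN-branch prediction D^1, l_sf the final map S^F.
    l_s1 = F.binary_cross_entropy(pred_fpn, gt)
    l_sf = F.binary_cross_entropy(pred_final, gt)
    return l_s1 + l_sf
```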

4 Experiments

4.1 Experimental Setup

Datasets. We evaluate the proposed model on five benchmark datasets. ECSSD [61] contains 1,000 semantically meaningful and complex images with pixel-accurate ground-truth annotations. HKU-IS [25] has 4,447 challenging images with multiple disconnected salient objects overlapping the image boundary. PASCAL-S [27] contains 850 images selected from the PASCAL VOC 2010 segmentation dataset. DUT-OMRON [63] includes 5,168 challenging images, each of which usually has a complicated background and one or more foreground objects. DUTS [49] is the largest salient object detection dataset, containing 10,553 training and 5,019 test images. These images contain very complex scenarios with high-diversity contents.
Evaluation Metrics. For quantitative evaluation, we adopt four widely-used metrics: the precision-recall (PR) curve, the F-measure score, the mean absolute error (MAE) and the S-measure score. Precision-recall curve: pairs of precision and recall are calculated by comparing the binarized saliency maps with the ground truth, where the binarization threshold slides from 0 to 255. The closer the PR curve is to the upper-right corner, the better the performance. F-measure: an overall performance measurement that synthetically considers both precision and recall:

F_{\beta}=\frac{(1+\beta^{2})\cdot\text{precision}\cdot\text{recall}}{\beta^{2}\cdot\text{precision}+\text{recall}},    (7)

where $\beta^{2}$ is set to 0.3 as suggested in [1] to emphasize precision. In this paper, we report the maximum F-measure score across the binary maps obtained at different thresholds. Mean absolute error: as a supplement to the PR curve and the F-measure, it computes the average absolute difference between the saliency map and the ground truth pixel by pixel. S-measure: it is more sensitive to foreground structural information than the F-measure. It considers the region-aware structural similarity $S_{r}$ and the object-aware structural similarity $S_{o}$:

S_{m}=\alpha\cdot S_{o}+(1-\alpha)\cdot S_{r},    (8)

where $\alpha$ is set to 0.5 [14].
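For reference, the MAE and the maximum F-measure described above can be computed as in the following sketch (thresholds slide over 0-255 and beta^2 = 0.3):

```python
import numpy as np

def mae(pred, gt):
    # pred and gt are arrays in [0, 1] with the same shape.
    return np.abs(pred - gt).mean()

def max_f_measure(pred, gt, beta2=0.3):
    gt_bin = gt > 0.5
    scores = []
    for t in range(256):
        pred_bin = pred >= t / 255.0
        tp = np.logical_and(pred_bin, gt_bin).sum()
        precision = tp / (pred_bin.sum() + 1e-8)
        recall = tp / (gt_bin.sum() + 1e-8)
        scores.append((1 + beta2) * precision * recall /
                      (beta2 * precision + recall + 1e-8))
    return max(scores)
```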
Implementation Details. We follow most state-of-the-art saliency detection methods [45, 37, 55, 59, 53, 57, 66, 70, 67] and use DUTS-TR, which contains 10,553 images, as the training dataset. Our model is implemented in PyTorch and the hyper-parameters are set as follows: we train the GateNet on a PC with a GTX 1080 Ti GPU for 40 epochs with a mini-batch size of 4. For the optimizer, we adopt stochastic gradient descent (SGD). The momentum, weight decay and learning rate are set to 0.9, 0.0005 and 0.001, respectively. The "poly" policy [30] with a power of 0.9 is used to adjust the learning rate. We adopt several data augmentation techniques to avoid overfitting and make the learned model more robust, including random horizontal flipping, random rotation, and random brightness, saturation and contrast changes. In order to preserve the integrity of the image semantic information, we only resize the image to $384\times 384$ instead of using a random crop.
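A minimal sketch of this optimization setup, with the "poly" schedule written out explicitly:

```python
import torch

def build_optimizer(model):
    # SGD with momentum 0.9, weight decay 0.0005 and base learning rate 0.001.
    return torch.optim.SGD(model.parameters(), lr=1e-3,
                           momentum=0.9, weight_decay=5e-4)

def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    # "poly" policy [30]: lr = base_lr * (1 - iter / max_iter) ** power.
    return base_lr * (1.0 - cur_iter / max_iter) ** power
```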

4.2 Performance Comparison with State-of-the-art

We compare the proposed algorithm with seventeen state-of-the-art saliency detection methods: DCL [26], DSS [20], Amulet [69], SRM [51], DGRL [53], RAS [7], PAGRN [70], BMPM [67], R3Net [12], HRS [66], MLMS [58], PAGE [57], ICNet [55], CPD [59], BANet [45], BASNet [37] and CapSal [68]. For fair comparisons, all the saliency maps of these methods are directly provided by their respective authors or computed by their released codes. To further show the effectiveness of our GateNet, we also test its performance on RGB-D SOD and video object segmentation tasks and include the results in the appendix.

Quantitative Evaluation. Tab. 1 shows the experimental comparison in terms of the F-measure, S-measure and MAE scores, from which we can see that the GateNet consistently outperforms the other approaches across all five datasets and different metrics. In particular, the GateNet achieves a significant F-measure improvement over the second-best method BANet [45] on the challenging DUTS-test (0.870 vs. 0.852 and 0.888 vs. 0.872) and PASCAL-S (0.882 vs. 0.866 and 0.883 vs. 0.877) datasets. This clearly demonstrates its superior performance in complex scenes. Moreover, some methods [26, 20, 51, 12] apply post-processing techniques to refine their saliency maps; our GateNet still performs better than them without any post-processing. We also evaluate the different algorithms using the standard PR curves in Fig. 7. It can be seen that our PR curves are significantly higher than those of the other methods on the five datasets.

Qualitative Evaluation. Fig. 1 and Fig. 8 illustrate some visual comparisons. In Fig. 1, the other methods are severely disturbed by branches and weeds while ours precisely identifies the whole objects. The GateNet can significantly suppress background regions with shapes similar to salient objects (see the 1st row in Fig. 8). Since the Fold-ASPP obtains more stable structural features, it helps to accurately locate objects and separate adjacent objects well, whereas some competitors make adjacent objects stick together (see the 3rd and 4th rows in Fig. 8). Besides, the proposed parallel branch restores more details, so the boundary information is retained well.

4.3 Ablation Studies

We detail the contribution of each component to the overall network.

Effectiveness of Backbones. Tab. 1 demonstrates that the performance of the gated network can be significantly improved by using better backbones such as ResNet-50, ResNet-101 or ResNeXt-101.

Table 1: Quantitative comparisons. Blue indicates the best performance under each backbone setting, while red indicates the best performance among all settings. The subscript in the first column denotes the publication year. "†", "S" and "X" mean using post-processing, the ResNet-101 backbone and the ResNeXt-101 backbone, respectively. "—" represents that the results are not available. ↑ and ↓ indicate that larger and smaller scores are better, respectively.
Method DUTS-test DUT-OMRON PASCAL-S HKU-IS ECSSD
Fβ↑ Sm↑ MAE↓ Fβ↑ Sm↑ MAE↓ Fβ↑ Sm↑ MAE↓ Fβ↑ Sm↑ MAE↓ Fβ↑ Sm↑ MAE↓
VGG-16 backbone
DCL16† 0.782 0.796 0.088 0.757 0.770 0.080 0.829 0.793 0.109 0.907 0.877 0.048 0.901 0.868 0.068
DSS17† — — — 0.781 0.789 0.063 0.840 0.792 0.098 0.916 0.878 0.040 0.921 0.882 0.052
Amulet17 0.778 0.804 0.085 0.743 0.780 0.098 0.839 0.819 0.099 0.899 0.886 0.050 0.915 0.894 0.059
BMPM18 0.852 0.860 0.049 0.774 0.808 0.064 0.862 0.842 0.076 0.921 0.906 0.039 0.928 0.911 0.045
RAS18 0.831 0.838 0.059 0.786 0.813 0.062 0.836 0.793 0.106 0.913 0.887 0.045 0.921 0.893 0.056
PAGRN18 0.854 0.837 0.056 0.771 0.774 0.071 0.855 0.814 0.095 0.919 0.889 0.048 0.927 0.889 0.061
HRS19 0.843 0.828 0.051 0.762 0.771 0.066 0.850 0.798 0.092 0.913 0.882 0.042 0.920 0.883 0.054
MLMS19 0.852 0.861 0.049 0.774 0.808 0.064 0.864 0.844 0.075 0.921 0.906 0.039 0.928 0.911 0.045
PAGE19 0.838 0.853 0.052 0.792 0.824 0.062 0.858 0.837 0.079 0.920 0.904 0.036 0.931 0.912 0.042
BANet19 0.852 0.860 0.046 0.793 0.822 0.061 0.866 0.838 0.079 0.919 0.901 0.037 0.935 0.913 0.041
GateNet 0.870 0.869 0.045 0.794 0.820 0.061 0.882 0.855 0.070 0.928 0.909 0.035 0.941 0.917 0.041
ResNet-50 backbone
SRM17† 0.826 0.835 0.059 0.769 0.797 0.069 0.848 0.830 0.087 0.906 0.886 0.046 0.917 0.895 0.054
DGRL18 0.828 0.841 0.050 0.774 0.805 0.062 0.856 0.836 0.073 0.911 0.895 0.036 0.922 0.903 0.041
CPD19 0.865 0.868 0.043 0.797 0.824 0.056 0.870 0.844 0.074 0.925 0.906 0.034 0.939 0.918 0.037
ICNet19 0.855 0.864 0.048 0.813 0.837 0.061 0.865 0.849 0.072 0.925 0.908 0.037 0.938 0.918 0.041
BASNet19 0.860 0.864 0.048 0.805 0.835 0.057 0.860 0.834 0.079 0.930 0.907 0.033 0.943 0.916 0.037
BANet19 0.872 0.878 0.040 0.803 0.832 0.059 0.877 0.851 0.072 0.930 0.913 0.033 0.944 0.924 0.035
GateNet 0.888 0.884 0.040 0.818 0.837 0.055 0.883 0.857 0.069 0.933 0.915 0.033 0.945 0.920 0.040
ResNet/ResNeXt-101 backbone
R3Net18†X 0.819 0.827 0.063 0.795 0.816 0.063 0.844 0.802 0.095 0.915 0.895 0.035 0.934 0.910 0.040
Capsal19S 0.819 0.818 0.063 0.639 0.673 0.101 0.869 0.837 0.074 0.883 0.851 0.058 0.863 0.826 0.077
GateNetS 0.893 0.889 0.038 0.821 0.844 0.054 0.883 0.862 0.067 0.937 0.920 0.031 0.951 0.930 0.035
GateNetX 0.898 0.895 0.035 0.829 0.848 0.051 0.888 0.865 0.065 0.943 0.925 0.029 0.952 0.929 0.035
Figure 7: Precision (vertical axis) recall (horizontal axis) curves on six popular RGB salient object detection datasets.

Figure 8: Visual comparison between our results and state-of-the-art methods.

Figure 9: Visual comparison of feature maps showing the effect of the multilevel gate units. D5 ~ D1 represent the feature maps of each decoder block from high level to low level. Odd rows and even rows are the results of the FPN baseline without and with multilevel gate units, respectively.

Effectiveness of Components. We quantitatively show the benefit of each component in Tab. 2. We take the results of the VGG-16 backbone with the FPN branch as the baseline. First, the multilevel gate units are added to the baseline network. The performance is significantly improved, with gains of 2.94%, 2.17% and 11.67% in terms of the F-measure, S-measure and MAE, respectively. To show the effect of the gate units more intuitively, we visualize the features of different levels in Fig. 9. It can be observed that even if the dog has a very low contrast with the chair or the billboard (see the 1st ~ 4th rows), by using multilevel gate units, a high contrast between the object region and the background is always maintained at each layer while the detail information is continually regained, thereby making salient objects effectively distinguishable. Besides, the gate units can avoid excessive suppression of the slender parts of objects (see the 5th ~ 8th rows): the corners of the poster and the limbs and even antennae of the mantis are retained well. Second, based on the gated baseline network, we design a series of experiments to verify the effectiveness of the folded convolution and the Fold-ASPP.

Table 2: Ablation analysis on the DUTS dataset.
Fβ Sm MAE
Baseline (FPN) 0.816 0.829 0.060
+ Gate Units 0.840 0.847 0.053
+ Fold-ASPP 0.866 0.863 0.047
+ Parallel Branch 0.870 0.869 0.045
Table 3: Evaluation of the folded convolution and Fold-ASPP. (x) stands for different sampling rates of atrous convolution.
Atrous(2) Atrous(4) Atrous(6) Fold(2) Fold(4) Fold(6) ASPP Fold-ASPP
Fβ 0.840 0.845 0.848 0.853 0.856 0.860 0.856 0.866
MAE 0.055 0.053 0.051 0.051 0.050 0.048 0.051 0.047
Sm 0.847 0.849 0.851 0.856 0.858 0.859 0.860 0.863

Tab. 3 details the results. We adopt atrous convolutions with dilation rates of $[2,4,6]$ and apply the same dilation rates to the folded convolutions. It can be observed that the folded convolution consistently yields significant performance improvements over the corresponding atrous convolution at each dilation rate in terms of all three metrics, and the single-layer Fold(6) already performs better than the ASPP that aggregates three atrous convolution layers. The Fold-ASPP also naturally outperforms the ASPP, with gains of 1.17% and 8.0% in terms of the F-measure and MAE, respectively. Finally, we add the parallel branch to further restore the details of objects. In this process, the gate units, Fold-ASPP and parallel branch complement each other without conflict.

5 Conclusions

In this paper, we propose a novel gated network architecture for saliency detection. We first adopt multilevel gate units to balance the contribution of each encoder block and suppress the activation of features in non-salient regions, which provides useful context information for the decoder while minimizing interference. The gate unit is simple yet effective; therefore, a gated FPN network can serve as a new baseline for dense prediction tasks. Next, we use the Fold-ASPP to gather multiscale semantic information for the decoder. Through the "Fold" operation, the atrous convolution achieves a local-in-local effect, which not only expands the receptive field but also retains the correlation among local sampling points. Finally, to further supplement the details, we combine all encoder features in parallel and construct a residual structure. Experimental results on five benchmark datasets demonstrate that the proposed model outperforms seventeen state-of-the-art methods under different evaluation metrics.

Acknowledgements. This work was supported in part by the National Natural Science Foundation of China #61876202, #61725202, #61751212 and #61829102, the Dalian Science and Technology Innovation Foundation #2019J12GX039, and the Fundamental Research Funds for the Central Universities # DUT20ZD212.

Appendix A

We expand our GateNet to other tasks including RGB-D Salient Object Detection (SOD) and Video Object Segmentation (VOS) to further demonstrate its effectiveness.

A.1 Network Architecture

Fig. 10 shows our proposed dual-branch gated FPN network for RGB-D SOD and VOS. Compared with the RGB SOD network, we only add an extra encoder to extract features of other modalities such as depth or optical flow. This dual-branch GateNet is easy to follow and can be used as a new baseline.
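A hedged sketch of this dual-branch extension is given below; merging the two modalities by level-wise element-wise addition is our assumption, since only the presence of an extra encoder is specified:

```python
import torch.nn as nn

class DualBranchEncoder(nn.Module):
    def __init__(self, rgb_encoder, aux_encoder):
        super().__init__()
        self.rgb_encoder = rgb_encoder    # returns a list of features E^1..E^5
        self.aux_encoder = aux_encoder    # same structure, fed with depth / optical flow

    def forward(self, rgb, aux):
        rgb_feats = self.rgb_encoder(rgb)
        aux_feats = self.aux_encoder(aux)
        # Merge the two modalities level by level (assumed: element-wise addition),
        # then pass the fused features to the gated decoder as in the RGB model.
        return [r + a for r, a in zip(rgb_feats, aux_feats)]
```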

Figure 10: Network pipeline.

A.2 RGB-D Salient Object Detection

Datasets. There are five main RGB-D SOD datasets: NJUD [23], RGBD135 [9], NLPR [34], SSD [74] and SIP [16]. We adopt the same data split as [3, 5, 18, 71, 36] to guarantee a fair comparison: 1,485 samples from NJUD and 700 samples from NLPR are used for training a new model. The remaining images in these two datasets and the other three datasets are all used for testing to verify the generalization ability of saliency models.
Evaluation Metrics. We adopt several metrics widely used in RGB-D SOD for quantitative evaluation: the F-measure score, the mean absolute error (MAE, $\mathcal{M}$), and the recently released S-measure ($S_{m}$) [14] and E-measure ($E_{m}$) [15] scores. A lower value is better for the MAE, and higher is better for the others.
Comparison with State-of-the-art Results. The performance of the proposed model is compared with ten state-of-the-art approaches on five benchmark datasets, including DES [9], DCMC [11], CDCP [75], DF [38], CTMF [18], PCA [3], MMCI [5], TANet [4], CPFP [71] and DMRA [36]. For fair comparisons, all the saliency maps of these methods are directly provided by the authors or computed by their released codes. We take VGG-16 as the backbone of each stream. Tab. 4 shows the performance comparison in terms of the maximum F-measure, mean F-measure, weighted F-measure, S-measure, E-measure and MAE scores. It can be seen that our GateNet is very competitive, and we believe that future works based on GateNet can further improve the performance and easily become state-of-the-art RGB-D SOD models.

A.3 Video Object Segmentation

According to whether the mask of the first frame of the video is provided at test time, video object segmentation (VOS) can be divided into zero-shot VOS and one-shot VOS. In this paper, we mainly use the dual-branch GateNet structure shown in Fig. 10 for zero-shot VOS.
Dataset and Metrics. DAVIS-16 [35] is one of the most popular benchmark datasets for video object segmentation. It consists of 50 high-quality video sequences (30 for training and 20 for validation) in total. We follow the training strategy of AGS [56], COSNet [31], PDB [44] and MATNet [73] in using extra datasets: the image saliency datasets MSRA10K [8] and DUT-OMRON [63] are used to pretrain our RGB branch, and the whole model is then trained with the training videos of DAVIS-16. For quantitative evaluation, we adopt two metrics, namely region similarity $\mathcal{J}$ and boundary accuracy $\mathcal{F}$.
Comparison with State-of-the-art Results. The performance of the proposed model is compared with ten state-of-the-art approaches on the DAVIS-16 dataset, including LVO [47], ARP [24], PDB [44], LSMO [48], MotAdapt [42], EPO [13], AGS [56], COSNet [31], AnDiff [65] and MATNet [73]. We follow most methods [73, 65, 31, 48] in taking ResNet-101 as the backbone. Tab. 5 shows the performance comparison in terms of $\mathcal{J}$ and $\mathcal{F}$. It should be noted that our method only performs feature extraction on the optical flow maps generated by PWC-Net [46] in order to supplement the motion information of the current frame. Without adding more cross-modal fusion techniques or using other tracking or detection models, our GateNet achieves competitive performance with most zero-shot VOS methods.

Table 4: Quantitative comparison. \uparrow and \downarrow indicate that the larger and smaller scores are better, respectively. Among the CNN-based methods, the best results are shown in red. The subscript in each model name is the publication year.
Metric Traditional Methods CNNs-Based Models
DES14 DCMC16 CDCP17 DF17 CTMF18 PCANet18 MMCI19 TANet19 CPFP19 DMRA19 GateNet
 [9]  [11]  [75]  [38]  [18]  [3]  [5]  [4]  [71]  [36] Ours
SSD [74] Fβmax↑ 0.260 0.750 0.576 0.763 0.755 0.844 0.823 0.835 0.801 0.858 0.868
Fβmean↑ 0.073 0.684 0.524 0.709 0.709 0.786 0.748 0.767 0.726 0.821 0.822
Fβw↑ 0.172 0.480 0.429 0.536 0.622 0.733 0.662 0.727 0.709 0.787 0.785
Sm↑ 0.341 0.706 0.603 0.741 0.776 0.842 0.813 0.839 0.807 0.856 0.870
Em↑ 0.475 0.790 0.714 0.801 0.838 0.890 0.860 0.886 0.832 0.898 0.901
M↓ 0.500 0.168 0.219 0.151 0.100 0.063 0.082 0.063 0.082 0.059 0.055
NJUD [23] Fβmax↑ 0.328 0.769 0.661 0.789 0.857 0.888 0.868 0.888 0.890 0.896 0.914
Fβmean↑ 0.165 0.715 0.618 0.744 0.788 0.844 0.813 0.844 0.837 0.871 0.879
Fβw↑ 0.234 0.497 0.510 0.545 0.720 0.803 0.739 0.805 0.828 0.847 0.849
Sm↑ 0.413 0.703 0.672 0.735 0.849 0.877 0.859 0.878 0.878 0.885 0.902
Em↑ 0.491 0.796 0.751 0.818 0.866 0.909 0.882 0.909 0.900 0.920 0.922
M↓ 0.448 0.167 0.182 0.151 0.085 0.059 0.079 0.061 0.053 0.051 0.047
RGBD135 [9] Fβmax↑ 0.800 0.311 0.651 0.625 0.865 0.842 0.839 0.853 0.882 0.906 0.919
Fβmean↑ 0.695 0.234 0.594 0.573 0.778 0.774 0.762 0.795 0.829 0.867 0.891
Fβw↑ 0.301 0.169 0.478 0.392 0.687 0.711 0.650 0.740 0.787 0.843 0.838
Sm↑ 0.632 0.469 0.709 0.685 0.863 0.843 0.848 0.858 0.872 0.899 0.905
Em↑ 0.817 0.676 0.810 0.806 0.911 0.912 0.904 0.919 0.927 0.944 0.966
M↓ 0.289 0.196 0.120 0.131 0.055 0.050 0.065 0.046 0.038 0.030 0.030
NLPR [34] Fβmax↑ 0.695 0.413 0.687 0.752 0.841 0.864 0.841 0.876 0.884 0.888 0.904
Fβmean↑ 0.583 0.328 0.592 0.683 0.724 0.795 0.730 0.796 0.818 0.855 0.854
Fβw↑ 0.254 0.259 0.501 0.516 0.679 0.762 0.676 0.780 0.807 0.840 0.838
Sm↑ 0.582 0.550 0.724 0.769 0.860 0.874 0.856 0.886 0.884 0.898 0.910
Em↑ 0.760 0.685 0.786 0.840 0.869 0.916 0.872 0.916 0.920 0.942 0.942
M↓ 0.301 0.196 0.115 0.100 0.056 0.044 0.059 0.041 0.038 0.031 0.032
SIP [16] Fβmax↑ 0.720 0.680 0.544 0.704 0.720 0.861 0.840 0.851 0.870 0.847 0.894
Fβmean↑ 0.644 0.645 0.495 0.673 0.684 0.825 0.795 0.809 0.819 0.815 0.856
Fβw↑ 0.342 0.414 0.397 0.406 0.535 0.768 0.712 0.748 0.788 0.734 0.810
Sm↑ 0.616 0.683 0.595 0.653 0.716 0.842 0.833 0.835 0.850 0.800 0.874
Em↑ 0.751 0.787 0.722 0.794 0.824 0.900 0.886 0.894 0.899 0.858 0.914
M↓ 0.298 0.186 0.224 0.185 0.139 0.071 0.086 0.075 0.064 0.088 0.057
Table 5: Quantitative comparison of Zero-shot VOS methods on the DAVIS-16 validation set. \uparrow and \downarrow indicate that the larger and smaller scores are better, respectively. The best results are shown in red. The subscript in each model name is the publication year.
Metric LVO17 ARP17 PDB18 LSMO19 MotAdapt19 EPO20 AGS19 COSNet19 AnDiff19 MATNet20 GateNet
 [47]  [24]  [44]  [48]  [42]  [13]  [56]  [31]  [65]  [73] Ours
$\mathcal{J}$ Mean↑ 75.9 76.2 77.2 78.2 77.2 80.6 79.7 80.5 81.7 82.4 80.9
Recall↑ 89.1 91.1 90.1 89.1 87.8 95.2 91.1 93.1 90.9 94.5 94.3
Decay↓ 0.0 7.0 0.9 4.1 5.0 2.2 1.9 4.4 2.2 5.5 3.3
$\mathcal{F}$ Mean↑ 72.1 70.6 74.5 75.9 77.4 75.5 77.4 79.5 80.5 80.7 79.4
Recall↑ 83.4 83.5 84.4 84.7 84.4 87.9 85.8 89.5 85.1 90.2 89.2
Decay↓ 1.3 7.9 -0.2 3.5 3.3 2.4 1.6 5.0 0.6 4.5 2.9

References

  • [1] Achanta, R., Hemami, S., Estrada, F., Süsstrunk, S.: Frequency-tuned salient region detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 1597–1604 (2009)
  • [2] Amirul Islam, M., Rochan, M., Bruce, N.D., Wang, Y.: Gated feedback refinement network for dense image labeling. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 3751–3759 (2017)
  • [3] Chen, H., Li, Y.: Progressively complementarity-aware fusion network for rgb-d salient object detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 3051–3060 (2018)
  • [4] Chen, H., Li, Y.: Three-stream attention-aware network for rgb-d salient object detection. IEEE Transactions on Image Processing 28(6), 2825–2835 (2019)
  • [5] Chen, H., Li, Y., Su, D.: Multi-modal fusion network with multi-scale multi-path and cross-modal interactions for rgb-d salient object detection. Pattern Recognition 86, 376–385 (2019)
  • [6] Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(4), 834–848 (2017)
  • [7] Chen, S., Tan, X., Wang, B., Hu, X.: Reverse attention for salient object detection. In: Proceedings of European Conference on Computer Vision. pp. 234–250 (2018)
  • [8] Cheng, M.M., Mitra, N.J., Huang, X., Torr, P.H.S., Hu, S.M.: Global contrast based salient region detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 37(3), 569–582 (2015)
  • [9] Cheng, Y., Fu, H., Wei, X., Xiao, J., Cao, X.: Depth enhanced saliency detection method. In: International Conference on Internet Multimedia Computing and Service. p. 23 (2014)
  • [10] Chollet, F.: Xception: Deep learning with depthwise separable convolutions. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 1251–1258 (2017)
  • [11] Cong, R., Lei, J., Zhang, C., Huang, Q., Cao, X., Hou, C.: Saliency detection for stereoscopic images based on depth confidence analysis and multiple cues fusion. IEEE Signal Processing Letters 23(6), 819–823 (2016)
  • [12] Deng, Z., Hu, X., Zhu, L., Xu, X., Qin, J., Han, G., Heng, P.A.: R3net: Recurrent residual refinement network for saliency detection. In: Proceedings of International Joint Conference on Artificial Intelligence. pp. 684–690 (2018)
  • [13] Faisal, M., Akhter, I., Ali, M., Hartley, R.: Exploiting geometric constraints on dense trajectories for motion saliency. arXiv preprint arXiv:1909.13258 (2019)
  • [14] Fan, D.P., Cheng, M.M., Liu, Y., Li, T., Borji, A.: Structure-measure: A new way to evaluate foreground maps. In: Proceedings of IEEE International Conference on Computer Vision. pp. 4548–4557 (2017)
  • [15] Fan, D.P., Gong, C., Cao, Y., Ren, B., Cheng, M.M., Borji, A.: Enhanced-alignment measure for binary foreground map evaluation. arXiv preprint arXiv:1805.10421 (2018)
  • [16] Fan, D.P., Lin, Z., Zhao, J.X., Liu, Y., Zhang, Z., Hou, Q., Zhu, M., Cheng, M.M.: Rethinking rgb-d salient object detection: Models, datasets, and large-scale benchmarks. arXiv preprint arXiv:1907.06781 (2019)
  • [17] Fang, H., Gupta, S., Iandola, F., Srivastava, R.K., Deng, L., Dollár, P., Gao, J., He, X., Mitchell, M., Platt, J.C., et al.: From captions to visual concepts and back. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 1473–1482 (2015)
  • [18] Han, J., Chen, H., Liu, N., Yan, C., Li, X.: Cnns-based rgb-d saliency detection via cross-view transfer and multiview fusion. IEEE Transactions on Cybernetics 48(11), 3171–3183 (2017)
  • [19] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 770–778 (2016)
  • [20] Hou, Q., Cheng, M.M., Hu, X., Borji, A., Tu, Z., Torr, P.H.: Deeply supervised salient object detection with short connections. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 3203–3212 (2017)
  • [21] Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017)
  • [22] Jiang, Z., Davis, L.S.: Submodular salient region detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 2043–2050 (2013)
  • [23] Ju, R., Ge, L., Geng, W., Ren, T., Wu, G.: Depth saliency based on anisotropic center-surround difference. In: Proceedings of International Conference on Image Processing. pp. 1115–1119 (2014)
  • [24] Jun Koh, Y., Kim, C.S.: Primary object segmentation in videos based on region augmentation and reduction. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 3442–3450 (2017)
  • [25] Li, G., Yu, Y.: Visual saliency based on multiscale deep features. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 5455–5463 (2015)
  • [26] Li, G., Yu, Y.: Deep contrast learning for salient object detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 478–487 (2016)
  • [27] Li, Y., Hou, X., Koch, C., Rehg, J.M., Yuille, A.L.: The secrets of salient object segmentation. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 280–287 (2014)
  • [28] Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 2117–2125 (2017)
  • [29] Liu, N., Han, J.: Dhsnet: Deep hierarchical saliency network for salient object detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 678–686 (2016)
  • [30] Liu, W., Rabinovich, A., Berg, A.C.: Parsenet: Looking wider to see better. arXiv preprint arXiv:1506.04579 (2015)
  • [31] Lu, X., Wang, W., Ma, C., Shen, J., Shao, L., Porikli, F.: See more, know more: Unsupervised video object segmentation with co-attention siamese networks. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 3623–3632 (2019)
  • [32] Mahadevan, V., Vasconcelos, N.: Saliency-based discriminant tracking. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2009)
  • [33] Pang, Y., Zhao, X., Zhang, L., Lu, H.: Multi-scale interactive network for salient object detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 9413–9422 (2020)
  • [34] Peng, H., Li, B., Xiong, W., Hu, W., Ji, R.: Rgbd salient object detection: A benchmark and algorithms. In: Proceedings of European Conference on Computer Vision. pp. 92–109 (2014)
  • [35] Perazzi, F., Pont-Tuset, J., McWilliams, B., Van Gool, L., Gross, M., Sorkine-Hornung, A.: A benchmark dataset and evaluation methodology for video object segmentation. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 724–732 (2016)
  • [36] Piao, Y., Ji, W., Li, J., Zhang, M., Lu, H.: Depth-induced multi-scale recurrent attention network for saliency detection. In: Proceedings of IEEE International Conference on Computer Vision. pp. 7254–7263 (2019)
  • [37] Qin, X., Zhang, Z., Huang, C., Gao, C., Dehghan, M., Jagersand, M.: Basnet: Boundary-aware salient object detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 7479–7489 (2019)
  • [38] Qu, L., He, S., Zhang, J., Tian, J., Tang, Y., Yang, Q.: Rgbd salient object detection via deep fusion. IEEE Transactions on Image Processing 26(5), 2274–2285 (2017)
  • [39] Ren, Z., Gao, S., Chia, L.T., Tsang, I.W.H.: Region-based saliency detection and its application in object recognition. IEEE Transactions on Circuits and Systems for Video Technology 24(5), 769–779 (2013)
  • [40] Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: Proceedings of International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 234–241 (2015)
  • [41] Rui, Z., Ouyang, W., Wang, X.: Unsupervised salience learning for person re-identification. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2013)
  • [42] Siam, M., Jiang, C., Lu, S., Petrich, L., Gamal, M., Elhoseiny, M., Jagersand, M.: Video object segmentation using teacher-student adaptation in a human robot interaction (hri) setting. In: 2019 International Conference on Robotics and Automation (ICRA). pp. 50–56. IEEE (2019)
  • [43] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
  • [44] Song, H., Wang, W., Zhao, S., Shen, J., Lam, K.M.: Pyramid dilated deeper convlstm for video salient object detection. In: Proceedings of European Conference on Computer Vision. pp. 715–731 (2018)
  • [45] Su, J., Li, J., Zhang, Y., Xia, C., Tian, Y.: Selectivity or invariance: Boundary-aware salient object detection. In: Proceedings of IEEE International Conference on Computer Vision. pp. 3799–3808 (2019)
  • [46] Sun, D., Yang, X., Liu, M., Kautz, J.: Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. p. 8934–8943 (2018)
  • [47] Tokmakov, P., Alahari, K., Schmid, C.: Learning video object segmentation with visual memory. In: Proceedings of IEEE International Conference on Computer Vision. pp. 4481–4490 (2017)
  • [48] Tokmakov, P., Schmid, C., Alahari, K.: Learning to segment moving objects. International Journal of Computer Vision 127(3), 282–301 (2019)
  • [49] Wang, L., Lu, H., Wang, Y., Feng, M., Wang, D., Yin, B., Ruan, X.: Learning to detect salient objects with image-level supervision. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 136–145 (2017)
  • [50] Wang, L., Wang, L., Lu, H., Zhang, P., Ruan, X.: Saliency detection with recurrent fully convolutional networks. In: Proceedings of European Conference on Computer Vision. pp. 825–841 (2016)
  • [51] Wang, T., Borji, A., Zhang, L., Zhang, P., Lu, H.: A stagewise refinement model for detecting salient objects in images. In: Proceedings of IEEE International Conference on Computer Vision. pp. 4019–4028 (2017)
  • [52] Wang, T., Piao, Y., Li, X., Zhang, L., Lu, H.: Deep learning for light field saliency detection. In: Proceedings of IEEE International Conference on Computer Vision. pp. 8838–8848 (2019)
  • [53] Wang, T., Zhang, L., Wang, S., Lu, H., Yang, G., Ruan, X., Borji, A.: Detect globally, refine locally: A novel approach to saliency detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 3127–3135 (2018)
  • [54] Wang, W., Lai, Q., Fu, H., Shen, J., Ling, H.: Salient object detection in the deep learning era: An in-depth survey. arXiv preprint arXiv:1904.09146 (2019)
  • [55] Wang, W., Shen, J., Cheng, M.M., Shao, L.: An iterative and cooperative top-down and bottom-up inference network for salient object detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 5968–5977 (2019)
  • [56] Wang, W., Song, H., Zhao, S., Shen, J., Zhao, S., Hoi, S.C., Ling, H.: Learning unsupervised video object segmentation through visual attention. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 3064–3074 (2019)
  • [57] Wang, W., Zhao, S., Shen, J., Hoi, S.C., Borji, A.: Salient object detection with pyramid attention and salient edges. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 1448–1457 (2019)
  • [58] Wu, R., Feng, M., Guan, W., Wang, D., Lu, H., Ding, E.: A mutual learning method for salient object detection with intertwined multi-supervision. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 8150–8159 (2019)
  • [59] Wu, Z., Su, L., Huang, Q.: Cascaded partial decoder for fast and accurate salient object detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 3907–3916 (2019)
  • [60] Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 1492–1500 (2017)
  • [61] Yan, Q., Xu, L., Shi, J., Jia, J.: Hierarchical saliency detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 1155–1162 (2013)
  • [62] Yang, C., Zhang, L., Lu, H., Ruan, X., Yang, M.H.: Saliency detection via graph-based manifold ranking. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 3166–3173 (2013)
  • [63] Yang, C., Zhang, L., Lu, H., Ruan, X., Yang, M.H.: Saliency detection via graph-based manifold ranking. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 3166–3173 (2013)
  • [64] Yang, G.R., Murray, J.D., Wang, X.J.: A dendritic disinhibitory circuit mechanism for pathway-specific gating. Nature communications 7, 12815 (2016)
  • [65] Yang, Z., Wang, Q., Bertinetto, L., Hu, W., Bai, S., Torr, P.H.: Anchor diffusion for unsupervised video object segmentation. In: Proceedings of IEEE International Conference on Computer Vision. pp. 931–940 (2019)
  • [66] Zeng, Y., Zhang, P., Zhang, J., Lin, Z., Lu, H.: Towards high-resolution salient object detection. In: Proceedings of IEEE International Conference on Computer Vision. pp. 7234–7243 (2019)
  • [67] Zhang, L., Dai, J., Lu, H., He, Y., Wang, G.: A bi-directional message passing model for salient object detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 1741–1750 (2018)
  • [68] Zhang, L., Zhang, J., Lin, Z., Lu, H., He, Y.: Capsal: Leveraging captioning to boost semantics for salient object detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 6024–6033 (2019)
  • [69] Zhang, P., Wang, D., Lu, H., Wang, H., Ruan, X.: Amulet: Aggregating multi-level convolutional features for salient object detection. In: Proceedings of IEEE International Conference on Computer Vision. pp. 202–211 (2017)
  • [70] Zhang, X., Wang, T., Qi, J., Lu, H., Wang, G.: Progressive attention guided recurrent network for salient object detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 714–722 (2018)
  • [71] Zhao, J.X., Cao, Y., Fan, D.P., Cheng, M.M., Li, X.Y., Zhang, L.: Contrast prior and fluid pyramid integration for rgbd salient object detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2019)
  • [72] Zhao, T., Wu, X.: Pyramid feature attention network for saliency detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 3085–3094 (2019)
  • [73] Zhou, T., Wang, S., Zhou, Y., Yao, Y., Li, J., Shao, L.: Motion-attentive transition for zero-shot video object segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 2, p. 3 (2020)
  • [74] Zhu, C., Li, G.: A three-pathway psychobiological framework of salient object detection using stereoscopic technology. In: Proceedings of IEEE International Conference on Computer Vision. pp. 3008–3014 (2017)
  • [75] Zhu, C., Li, G., Wang, W., Wang, R.: An innovative salient object detection using center-dark channel prior. In: Proceedings of IEEE International Conference on Computer Vision. pp. 1509–1515 (2017)