
1 Dalian University of Technology, China   2 Peng Cheng Laboratory   3 Dept. of Computing, The Hong Kong Polytechnic University, China   4 DAMO Academy, Alibaba Group
Email: {zxq,lartpang}@mail.dlut.edu.cn, {zhanglihe,lhchuan}@dlut.edu.cn, [email protected]
https://github.com/Xiaoqi-Zhao-DLUT/GateNet-RGB-Saliency

Suppress and Balance: A Simple Gated Network for Salient Object Detection

Xiaoqi Zhao 1*, Youwei Pang 1*, Lihe Zhang 1 (corresponding author), Huchuan Lu 1,2 and Lei Zhang 3,4
* These authors contributed equally to this work.
Abstract

Most salient object detection approaches use U-Net or feature pyramid networks (FPN) as their basic structures. These methods ignore two key problems when the encoder exchanges information with the decoder: one is the lack of interference control between them, and the other is the failure to account for the different contributions of individual encoder blocks. In this work, we propose a simple gated network (GateNet) to solve both issues at once. With the help of multilevel gate units, the valuable context information from the encoder can be optimally transmitted to the decoder. We design a novel gated dual-branch structure to build cooperation among features of different levels and improve the discriminability of the whole network. Through the dual-branch design, more details of the saliency map can be restored. In addition, we adopt atrous spatial pyramid pooling based on the proposed "Fold" operation (Fold-ASPP) to accurately localize salient objects of various scales. Extensive experiments on five challenging datasets demonstrate that the proposed model performs favorably against most state-of-the-art methods under different evaluation metrics.

Keywords:
Salient Object Detection \cdot Gated Network \cdot Dual Branch \cdot Fold-ASPP

1 Introduction

Salient object detection aims to identify the visually distinctive regions or objects in a scene and then accurately segment them. It is used as a pre-processing step in many computer vision applications, such as scene classification [39], visual tracking [32], person re-identification [41], light field image segmentation [52] and image captioning [17].

Figure 1: Visual comparison of different CNN based methods.

With the development of deep learning, salient object detection has gradually evolved from traditional methods based on hand-designed features to deep learning methods. In recent years, U-shape based structures [40, 28] have received the most attention due to their ability to utilize multilevel information to reconstruct high-resolution feature maps. Therefore, most state-of-the-art saliency detection networks [29, 20, 69, 53, 67, 70, 57, 37, 33] adopt the U-shape encoder-decoder architecture, and many methods aim at combining multilevel features in either the encoder [69, 53, 67, 57, 37, 59] or the decoder [29, 20, 70, 59]. For each convolutional block, they separately formulate the relationships of internal features for the forward update. It is well known that the high-quality saliency maps predicted in the decoder rely heavily on the effective features provided by the encoder. Nevertheless, the aforementioned methods directly use an all-pass skip-layer structure to concatenate the features of the encoder to the decoder, and the effectiveness of feature aggregation at different levels is not quantified. These restrictions not only introduce misleading context information into the decoder but also prevent the truly useful features from being adequately utilized. In cognitive science, Yang et al. [64] show that inhibitory neurons play an important role in how the human brain chooses to process the most important information among all the information presented to it: inhibitory neurons ensure that humans respond appropriately to external stimuli by inhibiting other neurons and balancing the excitatory neurons that drive neuronal activity. Inspired by this work, we believe that it is necessary to set up an information screening unit between each pair of encoder and decoder blocks in saliency detection. It can help highlight the most discriminative features of salient regions and suppress background interference, as shown in Fig. 1, where the images have easily-confused backgrounds or low-contrast objects.

Moreover, due to the limited receptive field, it is difficult for a single-scale convolutional kernel to capture context information of size-varying objects. This motivates some efforts [12, 67] to investigate multiscale feature extraction. These methods directly equip an atrous spatial pyramid pooling (ASPP) module [6] in their networks. However, when using a convolution with a large dilation rate, the information under the kernel seriously lacks correlation because too many zeros are inserted, which may be detrimental to the discrimination of subtle image structures.

In this paper, we propose a simple gated network (GateNet) for salient object detection. Based on the feature pyramid network (FPN), we construct multilevel gate units to combine the features from the decoder and the encoder. We use convolution operation and nonlinear functions to calculate the correlations among features and assign gate values to different blocks. In this process, a partnership is established between different blocks by using weight distribution and the decoder can obtain more efficient information from the encoder and pay more attention to the salient regions. Since the top-layer features of the encoder network contain rich contextual information, we construct a folded atrous spatial pyramid pooling (Fold-ASPP) module to gather multiscale high-level saliency cues. With the “Fold” operation, the atrous convolution is implemented on a group of local neighborhoods rather than a group of isolated sampling points, which can help generate more stable features and more adequately depict finer structure. In addition, we design a parallel branch by concatenating the output of the FPN branch and the features of the gated encoder, so that the residual information complementary to the FPN branch is supplemented to generate the final saliency map.

Our main contributions can be summarized as follows.

  • We propose a simple gated network to adaptively control the amount of information that flows into the decoder from each encoder block. With multilevel gate units, the network can balance the contribution of each encoder block to the decoder block and suppress the features of non-salient regions.

  • We design a Fold-ASPP module to capture richer context information and localize salient objects of various sizes. By the “Fold” operation, we can obtain more effective feature representation.

  • We build a dual-branch architecture. The two branches form a residual structure, complement each other through the gated processing and generate better results.

We compare the proposed model with seventeen state-of-the-art methods on five challenging datasets. The results show that our method performs much better than the other competitors. In addition, it achieves a real-time speed of 30 fps.

2 Related Work

2.1 Salient Object Detection

Early saliency detection methods are based on low-level features and heuristic prior knowledge, such as color contrast [1], background prior [62] and center prior [22]. Most of them use hand-crafted features; more details about the traditional methods are discussed in [54].

With the breakthrough of deep learning in computer vision, a large number of convolutional neural network-based salient object detection methods have been proposed and their performance has gradually improved. In particular, fully convolutional networks (FCNs), which avoid the problems caused by fully-connected layers, have become the mainstream for dense prediction tasks. Wang et al. [50] use weight sharing to iteratively refine features and promote mutual fusion between features. Hou et al. [20] achieve efficient feature expression by continuously blending features from deep layers into shallow layers. However, single-scale features cannot comprehensively characterize various objects and image contexts. How to extract multiscale features and integrate context information is an important problem in saliency detection.

2.2 Multiscale Feature Extraction

Recently, the atrous spatial pyramid pooling (ASPP) module [6] has been widely applied in many tasks and networks. Atrous convolution can enlarge the receptive field to obtain large-scale features without increasing the computational cost. Therefore, it is often used in saliency detection networks. Zhang et al. [67] insert several ASPP modules into encoder blocks of different levels, while Deng et al. [12] install one on the highest-level encoder block. Nevertheless, the repeated stride and pooling operations already make the top-layer features lose much fine information. As the atrous rate increases, the correlation among sampling points further degrades, which makes it difficult to capture the changes of image details (e.g., lathy background regions between adjacent objects or spindly parts of objects). In this work, we propose a folded ASPP to alleviate these issues and achieve a local-in-local effect.

2.3 Gated Mechanisms

The gated mechanism plays an important role in controlling the flow of information and is widely used in long short-term memory (LSTM) networks. In [2], the gate unit combines two consecutive feature maps of different resolutions from the encoder to generate rich contextual information. Zhang et al. [67] adopt a gate function to control message passing when combining feature maps at all levels of the encoder. Due to its ability to filter information, the gated mechanism can also be seen as a special kind of attention mechanism. Some saliency methods [7, 70, 57] employ attention networks. Zhang et al. [70] apply both spatial and channel attention to each layer of the decoder. Wang et al. [57] exploit a pyramid attention module to enhance the saliency representation of each layer in the encoder and enlarge the receptive field. The above methods all unilaterally consider the information interaction between different levels either in the encoder or in the decoder. We instead integrate features from both the encoder and the decoder to formulate the gate function, which plays the role of block-wise attention and models the overall distribution of all blocks in the network from a global perspective, whereas previous methods utilize only the block-specific features to compute dense attention weights for the corresponding block. Moreover, in order to take advantage of the rich contextual information in the encoder, those methods directly feed the encoder features into the decoder without considering their mutual interference. Our proposed gate unit naturally balances their contributions, thereby suppressing the response of the encoder to non-salient regions. Experimental results in Fig. 4 and Fig. 9 intuitively demonstrate the effect of the multilevel gate units on these two aspects, respectively.

3 Proposed Method

Figure 2: The overall architecture of the gated network. It consists of the VGG-16 encoder ($\mathbf{E}^{1}\sim\mathbf{E}^{5}$), five transition layers ($\mathbf{T}^{1}\sim\mathbf{T}^{5}$), five gate units ($\mathbf{G}^{1}\sim\mathbf{G}^{5}$), five decoder blocks ($\mathbf{D}^{1}\sim\mathbf{D}^{5}$) and the Fold-ASPP module. Supervision is applied twice: once at the end of the FPN branch ($D^{1}$) and once to guide the fusion of the two branches.

The gated network architecture is shown in Fig. 2, in which encoder blocks, transition layers, decoder blocks and gate units are respectively denoted as $\mathbf{E}^{i}$, $\mathbf{T}^{i}$, $\mathbf{D}^{i}$ and $\mathbf{G}^{i}$ ($i\in\{1,2,3,4,5\}$ indexes different levels), and their output feature maps are denoted as $E^{i}$, $T^{i}$, $D^{i}$ and $G^{i}$, respectively. The final prediction is obtained by combining the FPN branch and the parallel branch. In this section, we first describe the overall architecture, then detail the gated dual-branch structure and the folded atrous spatial pyramid pooling module.

3.1 Network Overview

Encoder Network. In our model, the encoder is based on a common pretrained backbone network, e.g., VGG [43], ResNet [19] or ResNeXt [60]. We take the VGG-16 network as an example, which contains thirteen convolutional layers, five max-pooling layers and three fully-connected layers. In order to fit the saliency detection task, similar to most previous approaches [69, 20, 70, 67], we cast away all the fully-connected layers of VGG-16 and remove the last pooling layer to retain the details of the last convolutional layer.
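For concreteness, the following PyTorch sketch (not the authors' released code) shows one way to build such a truncated VGG-16 encoder; the block split indices for $E^{1}\sim E^{5}$ are our assumption:

```python
import torch
import torchvision

def build_vgg16_encoder():
    # Pretrained VGG-16 feature extractor; drop the final max-pooling layer
    # (index 30) so that conv5_3 keeps a stride of 16 instead of 32.
    features = torchvision.models.vgg16(pretrained=True).features
    return torch.nn.Sequential(*list(features.children())[:-1])

# Assumed slices of the truncated feature stack for the five encoder blocks E1..E5.
VGG16_BLOCK_SLICES = [(0, 4), (4, 9), (9, 16), (16, 23), (23, 30)]
```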

Decoder Network. The decoder comprises three main components. i) The FPN branch, which continually fuses features of different levels from $T^{1}\sim T^{5}$ by element-wise addition. ii) The parallel branch, which combines the saliency map of the FPN branch with the feature maps of the transition layers by cross-channel concatenation. At the same time, multilevel gate units ($\mathbf{G}^{1}\sim\mathbf{G}^{5}$) are inserted between the transition layers and the decoder layers. iii) The Fold-ASPP module, which improves the original atrous spatial pyramid pooling (ASPP) by using a "Fold" operation. It takes advantage of the semantic features learned from $E^{5}$ to provide multiscale information to the decoder.

3.2 Gated Dual Branch

Figure 3: Detailed illustration of the gate unit. $E^{i}$ and $D^{i+1}$ indicate the feature maps of the current encoder block and those of the previous decoder block, respectively. Ⓢ denotes the sigmoid function.

Figure 4: The distributions of the gate weights on five datasets. We calculate the average gate values for each level of the FPN branch and the parallel branch across all images in every dataset. For the FPN branch, the low-level gate values are significantly smaller than the high-level ones. For the parallel branch, the gate values gradually decrease as the level rises.

The gate unit can control the message passing between scale-matched encoder and decoder blocks. By combining the feature maps of the previous decoder block, the gate value also characterizes the contribution that the current encoder block can provide. Fig. 3 shows the internal structure of the proposed gate unit. In particular, the encoder feature $E^{i}$ and the decoder feature $D^{i+1}$ are integrated to obtain the feature $F^{i}$, which is then fed into two branches, each consisting of a series of convolution, activation and pooling operations, to compute a pair of gate values $G^{i}$. The entire gating process can be formulated as

G^{i}=\begin{cases} P(S(Conv(Cat(E^{i},D^{i+1})))) & \text{if } i=1,2,3,4 \\ P(S(Conv(Cat(E^{i},T^{i})))) & \text{if } i=5 \end{cases}    (1)

where $Cat(\cdot)$ is the concatenation operation along the channel axis, $Conv(\cdot)$ refers to the convolution layer, $S(\cdot)$ is the element-wise sigmoid function, and $P(\cdot)$ is the global average pooling. The output channel number of $Conv(\cdot)$ is 2. The resulting gate vector $G^{i}$ has two elements, which correspond to the two gate values in Fig. 3.
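A hedged PyTorch sketch of the gate unit of Eq. (1) follows; the interface (concatenate, convolve to 2 channels, sigmoid, global average pooling) matches the description above, while the internal convolution stack is simplified to a single layer for illustration:

```python
import torch
import torch.nn as nn

class GateUnit(nn.Module):
    def __init__(self, enc_channels, dec_channels):
        super().__init__()
        # Conv(.) in Eq. (1): maps the concatenated features to 2 channels.
        self.conv = nn.Conv2d(enc_channels + dec_channels, 2, kernel_size=3, padding=1)

    def forward(self, e_i, d_next):
        # For i = 5, the transition feature T^5 takes the place of D^{i+1}.
        f_i = torch.cat([e_i, d_next], dim=1)        # Cat(E^i, D^{i+1})
        gates = torch.sigmoid(self.conv(f_i))        # S(Conv(.))
        gates = gates.mean(dim=(2, 3))               # P(.): global average pooling
        # Two scalar gate values per image: G_1^i for the FPN branch,
        # G_2^i for the parallel branch.
        g1 = gates[:, 0:1, None, None]
        g2 = gates[:, 1:2, None, None]
        return g1, g2
```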

Given the gate values, they are applied to the FPN branch and the parallel branch to weight the transition-layer features $T^{1}\sim T^{5}$, which are generated by exploiting $3\times 3$ convolutions to reduce the dimension of $E^{1}\sim E^{4}$ and by the Fold-ASPP to finely process $E^{5}$ (please see Fig. 2 for details). Through multilevel gate units, we can suppress and balance the information flowing from different encoder blocks to the decoder.

In Fig. 4, we plot the statistics of the gate values with the convolutional level as the horizontal axis. It can be seen that, in the FPN branch, the high-level encoder features contribute more contextual guidance to the decoder than the low-level encoder features. This trend is exactly the opposite in the parallel branch. This is because the FPN branch is responsible for predicting the main body of the salient object by progressively combining multilevel features, which requires more high-level semantic knowledge, while the parallel branch, as a residual structure, aims to fill in the details, which are mainly contained in the low-level features. In addition, the visual examples shown in Fig. 9 demonstrate that multilevel gate units can significantly suppress the interference from each encoder block and enhance the contrast between salient and non-salient regions. Since the proposed gate unit is simple yet effective, a raw FPN network equipped with multilevel gate units can be viewed as a new baseline for the saliency detection task.

Figure 5: Illustration of different decoder architectures. (a) Progressive structure, (b) parallel structure and (c) our dual-branch structure.

Most existing models use either a progressive decoder [67, 53, 70, 57] or a parallel decoder [12, 72], as shown in Fig. 5. The progressive structure begins with the top layer and gradually utilizes the output of the higher layer as prior knowledge to fuse the encoder features. This mechanism is not conducive to the recovery of details because the high-level features lack fine information. The parallel structure, on the other hand, easily results in inaccurate localization of objects, since the low-level features without semantic information directly interfere with the capture of global structure cues. In this work, we mix the two structures to build a dual-branch decoder that overcomes the above restrictions. We briefly describe the FPN branch. Taking $D^{i}$ as an example, we first apply bilinear interpolation to upsample the higher-level feature $D^{i+1}$ to the same size as $T^{i}$. Next, to decrease the number of parameters, $T^{i}$ is reduced to 32 channels and fed into the gate unit $\mathbf{G}^{i}$. Lastly, the gated feature is fused with the upsampled feature of $D^{i+1}$ by element-wise addition and convolutional layers. This process can be formulated as follows:

D^{i}=\begin{cases} Conv(G_{1}^{i}\cdot T^{i}+Up(D^{i+1})) & \text{if } i=1,2,3,4 \\ Conv(G_{1}^{i}\cdot T^{i}) & \text{if } i=5, \end{cases}    (2)

where $Up(\cdot)$ denotes bilinear upsampling and $D^{1}$ is a single-channel feature map with the same size as the input image.
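The following minimal sketch illustrates one FPN-branch decoder step of Eq. (2); the 32-channel width follows the text, while the convolution stack after fusion is an illustrative assumption:

```python
import torch.nn as nn
import torch.nn.functional as F

class DecoderBlock(nn.Module):
    def __init__(self, channels=32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, t_i, g1_i, d_next=None):
        x = g1_i * t_i                               # gate the transition features
        if d_next is not None:                       # i = 1..4: add upsampled D^{i+1}
            x = x + F.interpolate(d_next, size=t_i.shape[2:],
                                  mode='bilinear', align_corners=False)
        return self.conv(x)                          # Conv(.) in Eq. (2)
```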

In the parallel branch, we first upsample $T^{1}\sim T^{5}$ to the same size as $D^{1}$. Next, the multilevel gate units are applied to weight the corresponding transition-layer features. Lastly, we combine $D^{1}$ and the gated features by cross-channel concatenation. The whole process is written as follows:

F_{Cat}=Cat(D^{1},Up(G_{2}^{1}\cdot T^{1}),Up(G_{2}^{2}\cdot T^{2}),Up(G_{2}^{3}\cdot T^{3}),Up(G_{2}^{4}\cdot T^{4}),Up(G_{2}^{5}\cdot T^{5})).    (3)

The final saliency map $S^{F}$ is generated by integrating the predictions of the two branches with a residual connection, as shown in Fig. 5(c):

S^{F}=S(Conv(F_{Cat})+D^{1}),    (4)

where $S(\cdot)$ is the element-wise sigmoid function.
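A possible implementation of the parallel branch and the residual fusion of Eqs. (3)-(4) is sketched below; the layer widths are assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualFusion(nn.Module):
    def __init__(self, trans_channels=32, num_levels=5):
        super().__init__()
        # Conv(.) in Eq. (4): squeeze the concatenation to a 1-channel residual map.
        self.conv = nn.Conv2d(1 + num_levels * trans_channels, 1, kernel_size=3, padding=1)

    def forward(self, d1, trans_feats, g2_gates):
        size = d1.shape[2:]
        gated = [F.interpolate(g * t, size=size, mode='bilinear', align_corners=False)
                 for t, g in zip(trans_feats, g2_gates)]   # Up(G_2^i * T^i)
        f_cat = torch.cat([d1] + gated, dim=1)             # Eq. (3)
        return torch.sigmoid(self.conv(f_cat) + d1)        # Eq. (4): S^F
```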

3.3 Folded Atrous Spatial Pyramid Pooling

In order to obtain robust segmentation results by integrating multiscale information, atrous spatial pyramid pooling (ASPP) was proposed in DeepLab [6], and some works [67, 12] have also shown its effectiveness in saliency detection. The ASPP uses multiple parallel atrous convolutional layers with different dilation rates. However, the sparsity of the atrous convolution kernel, especially with a large dilation rate, means that the sampling points are too weakly correlated to extract stable features. In this paper, we apply a simple "Fold" operation to effectively relieve this issue. We visualize the folded convolution structure in Fig. 6, which not only further enlarges the receptive field but also extends each valid sampling position from an isolated point to a $2\times 2$ connected region.

Figure 6: Illustration of the folded convolution. We use ①, ② and ③ to respectively indicate the "Fold" operation, the atrous convolution and the "Unfold" operation. ④ shows the comparison between the atrous convolution (left) and the folded atrous convolution (right).

Let $\mathbf{X}$ represent feature maps of size $N\times N\times C$ ($C$ is the number of channels). We slide a $2\times 2$ window over $\mathbf{X}$ with stride 2 and then conduct atrous convolution with kernel size $K\times K$ at different dilation rates. Fig. 6 shows the computational process when $K=3$ and the dilation rate is 2. First, we collect the $2\times 2\times C$ feature points in each window of $\mathbf{X}$ and stack them along the channel direction; we call this operation "Fold", as shown in Fig. 6 ①. After the fold operation, we obtain new feature maps of size $N/2\times N/2\times 4C$. A point on the new feature maps corresponds to a $2\times 2$ area on the original feature maps. Second, we apply an atrous convolution with a kernel size of $3\times 3$ and a dilation rate of 2. Followed by the reverse process of "Fold", which we call the "Unfold" operation, the final feature maps are obtained. With the folded atrous convolution, more context is merged during information transfer across convolution layers while a certain local correlation is preserved, which provides fault-tolerance capability for subsequent operations.
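Assuming the "Fold" and "Unfold" steps behave like space-to-depth and depth-to-space with a 2x2 window, the folded atrous convolution can be sketched in PyTorch as follows (PixelUnshuffle/PixelShuffle are used as stand-ins; the channel ordering inside each 2x2 cell does not affect the idea):

```python
import torch.nn as nn

class FoldedAtrousConv(nn.Module):
    def __init__(self, in_channels, out_channels, dilation):
        super().__init__()
        self.fold = nn.PixelUnshuffle(2)    # (1) Fold: N x N x C -> N/2 x N/2 x 4C
        self.conv = nn.Conv2d(4 * in_channels, 4 * out_channels, kernel_size=3,
                              dilation=dilation, padding=dilation)  # (2) atrous conv
        self.unfold = nn.PixelShuffle(2)    # (3) Unfold: back to N x N x out_channels

    def forward(self, x):
        return self.unfold(self.conv(self.fold(x)))
```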

As shown in Fig. 2, the Fold-ASPP is only equipped on the top of the encoder network. It consists of three folded convolutional layers with dilation rates of $[2,4,6]$ to fit the size of the feature maps. Just as group convolution [60] is a trade-off between depthwise convolution [10, 21] and vanilla convolution in the channel dimension, the proposed folded convolution is a trade-off between atrous convolution and vanilla convolution in the spatial dimension.
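Reusing the FoldedAtrousConv sketch above, a Fold-ASPP module with dilation rates 2, 4 and 6 might look as follows; concatenating the three branches and fusing them with a 1x1 convolution is our assumption about the aggregation step:

```python
import torch
import torch.nn as nn

class FoldASPP(nn.Module):
    def __init__(self, in_channels=512, out_channels=32, rates=(2, 4, 6)):
        super().__init__()
        # Three parallel folded atrous convolutions on the top-level feature E^5.
        self.branches = nn.ModuleList(
            [FoldedAtrousConv(in_channels, out_channels, r) for r in rates])
        self.fuse = nn.Conv2d(len(rates) * out_channels, out_channels, kernel_size=1)

    def forward(self, e5):
        return self.fuse(torch.cat([b(e5) for b in self.branches], dim=1))
```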

3.4 Supervision

As shown in Fig. 2, we use the cross-entropy loss for both the intermediate prediction from the FPN branch and the final prediction from the dual branch. In the dual-branch decoder, since the FPN branch gradually combines gated encoding and decoding features at all levels, it has very powerful prediction ability. We expect it to predict salient objects as accurately as possible under the supervision of the ground truth. The parallel branch, in contrast, only combines the gated encoding features, which helps remedy the ignored details through the residual structure. Moreover, the supervision on $D^{1}$ drives the gate units to learn the weight of the contribution of each encoder block to the final prediction. The total loss $L$ can be written as:

L=l_{s1}+l_{sf},    (5)

where $l_{s1}$ and $l_{sf}$ are used to regularize the output of the FPN branch and the final prediction, respectively. The cross-entropy loss is computed as:

l=-\left[Y\log P+(1-Y)\log(1-P)\right],    (6)

where $P$ and $Y$ denote the predicted map and the ground truth, respectively.
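A minimal sketch of the two-term objective in Eqs. (5)-(6), assuming both predictions are probability maps in [0, 1]:

```python
import torch.nn.functional as F

def gatenet_loss(pred_fpn, pred_final, gt):
    # l_s1 supervises the FPN-branch prediction D^1, l_sf the final map S^F.
    l_s1 = F.binary_cross_entropy(pred_fpn, gt)
    l_sf = F.binary_cross_entropy(pred_final, gt)
    return l_s1 + l_sf
```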

4 Experiments

4.1 Experimental Setup

Datasets. We evaluate the proposed model on five benchmark datasets. ECSSD [61] contains 1,000 semantically meaningful and complex images with pixel-accurate ground-truth annotations. HKU-IS [25] has 4,447 challenging images with multiple disconnected salient objects overlapping the image boundary. PASCAL-S [27] contains 850 images selected from the PASCAL VOC 2010 segmentation dataset. DUT-OMRON [63] includes 5,168 challenging images, each of which usually has a complicated background and one or more foreground objects. DUTS [49] is the largest salient object detection dataset, containing 10,553 training and 5,019 test images. These images contain very complex scenarios with high-diversity contents.
Evaluation Metrics. For quantitative evaluation, we adopt four widely-used metrics: the precision-recall (PR) curve, the F-measure score, the mean absolute error (MAE) and the S-measure score. Precision-recall curve: pairs of precision and recall are calculated by comparing the binarized saliency maps with the ground truth, where the binarization threshold slides from 0 to 255. The closer the PR curve is to the upper-right corner, the better the performance. F-measure: an overall performance measurement that synthetically considers both precision and recall:

F_{\beta}=\frac{(1+\beta^{2})\cdot\text{precision}\cdot\text{recall}}{\beta^{2}\cdot\text{precision}+\text{recall}},    (7)

where $\beta^{2}$ is set to 0.3 as suggested in [1] to emphasize precision. In this paper, we report the maximum F-measure score across the binary maps obtained at different thresholds. Mean absolute error: as a supplement to the PR curve and the F-measure, it computes the average absolute difference between the saliency map and the ground truth pixel by pixel. S-measure: it is more sensitive to foreground structural information than the F-measure. It considers the region-aware structural similarity $S_{r}$ and the object-aware structural similarity $S_{o}$:

S_{m}=\alpha\cdot S_{o}+(1-\alpha)\cdot S_{r},    (8)

where $\alpha$ is set to 0.5 [14].
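For reference, the MAE and the maximum F-measure described above can be computed as in the following sketch (thresholds slide over 0-255 and beta^2 = 0.3):

```python
import numpy as np

def mae(pred, gt):
    # pred and gt are arrays in [0, 1] with the same shape.
    return np.abs(pred - gt).mean()

def max_f_measure(pred, gt, beta2=0.3):
    gt_bin = gt > 0.5
    scores = []
    for t in range(256):
        pred_bin = pred >= t / 255.0
        tp = np.logical_and(pred_bin, gt_bin).sum()
        precision = tp / (pred_bin.sum() + 1e-8)
        recall = tp / (gt_bin.sum() + 1e-8)
        scores.append((1 + beta2) * precision * recall /
                      (beta2 * precision + recall + 1e-8))
    return max(scores)
```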
Implementation Details. We follow most state-of-the-art saliency detection methods [45, 37, 55, 59, 53, 57, 66, 70, 67] and use DUTS-TR, which contains 10,553 images, as the training dataset. Our model is implemented in PyTorch and the hyper-parameters are set as follows: we train the GateNet on a PC with a GTX 1080 Ti GPU for 40 epochs with a mini-batch size of 4. For the optimizer, we adopt stochastic gradient descent (SGD). The momentum, weight decay and learning rate are set to 0.9, 0.0005 and 0.001, respectively. The "poly" policy [30] with a power of 0.9 is used to adjust the learning rate. We adopt several data augmentation techniques to avoid overfitting and make the learned model more robust, including random horizontal flipping, random rotation, and random brightness, saturation and contrast changes. In order to preserve the integrity of the image semantic information, we only resize the image to $384\times 384$ instead of using a random crop.
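A minimal sketch of this optimization setup, with the "poly" schedule written out explicitly:

```python
import torch

def build_optimizer(model):
    # SGD with momentum 0.9, weight decay 0.0005 and base learning rate 0.001.
    return torch.optim.SGD(model.parameters(), lr=1e-3,
                           momentum=0.9, weight_decay=5e-4)

def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    # "poly" policy [30]: lr = base_lr * (1 - iter / max_iter) ** power.
    return base_lr * (1.0 - cur_iter / max_iter) ** power
```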

4.2 Performance Comparison with State-of-the-art

We compare the proposed algorithm with seventeen state-of-the-art saliency detection methods: DCL [26], DSS [20], Amulet [69], SRM [51], DGRL [53], RAS [7], PAGRN [70], BMPM [67], R3Net [12], HRS [66], MLMS [58], PAGE [57], ICNet [55], CPD [59], BANet [45], BASNet [37] and CapSal [68]. For fair comparisons, all the saliency maps of these methods are directly provided by their respective authors or computed by their released codes. To further show the effectiveness of our GateNet, we also test its performance on RGB-D SOD and video object segmentation tasks and include the results in the appendix.

Quantitative Evaluation. Tab. 1 shows the experimental comparison in terms of the F-measure, S-measure and MAE scores, from which we can see that the GateNet consistently outperforms the other approaches across all five datasets and different metrics. In particular, the GateNet achieves a significant F-measure improvement over the second-best method BANet [45] on the challenging DUTS-test (0.870 vs. 0.852 and 0.888 vs. 0.872) and PASCAL-S (0.882 vs. 0.866 and 0.883 vs. 0.877) datasets. This clearly demonstrates its superior performance in complex scenes. Moreover, some methods [26, 20, 51, 12] apply post-processing techniques to refine their saliency maps; our GateNet still performs better than them without any post-processing. We also evaluate the different algorithms using the standard PR curves in Fig. 7. It can be seen that our PR curves are significantly higher than those of the other methods on the five datasets.

Qualitative Evaluation. Fig. 1 and Fig. 8 illustrate some visual comparisons. In Fig. 1, the other methods are severely disturbed by branches and weeds while ours precisely identifies the whole objects. The GateNet can significantly suppress background regions with shapes similar to salient objects (see the 1st row in Fig. 8). Since the Fold-ASPP obtains more stable structural features, it helps to accurately locate objects and separate adjacent objects well, whereas some competitors make adjacent objects stick together (see the 3rd and 4th rows in Fig. 8). Besides, the proposed parallel branch restores more details, so the boundary information is retained well.

4.3 Ablation Studies

We detail the contribution of each component to the overall network.

Effectiveness of Backbones. Tab. 1 demonstrates that the performance of the gated network can be significantly improved by using better backbones such as ResNet-50, ResNet-101 or ResNeXt-101.

Table 1: Quantitative comparisons. Blue indicates the best performance under each backbone setting, while red indicates the best performance among all settings. The subscript in the first column denotes the publication year. "†", "S" and "X" mean using post-processing, the ResNet-101 backbone and the ResNeXt-101 backbone, respectively. "—" represents that the results are not available. ↑ and ↓ indicate that larger and smaller scores are better, respectively.
Method DUTS-test DUT-OMRON PASCAL-S HKU-IS ECSSD
Fβ↑ Sm↑ MAE↓ Fβ↑ Sm↑ MAE↓ Fβ↑ Sm↑ MAE↓ Fβ↑ Sm↑ MAE↓ Fβ↑ Sm↑ MAE↓
VGG-16 backbone
DCL16† 0.782 0.796 0.088 0.757 0.770 0.080 0.829 0.793 0.109 0.907 0.877 0.048 0.901 0.868 0.068
DSS17† — — — 0.781 0.789 0.063 0.840 0.792 0.098 0.916 0.878 0.040 0.921 0.882 0.052
Amulet17 0.778 0.804 0.085 0.743 0.780 0.098 0.839 0.819 0.099 0.899 0.886 0.050 0.915 0.894 0.059
BMPM18 0.852 0.860 0.049 0.774 0.808 0.064 0.862 0.842 0.076 0.921 0.906 0.039 0.928 0.911 0.045
RAS18 0.831 0.838 0.059 0.786 0.813 0.062 0.836 0.793 0.106 0.913 0.887 0.045 0.921 0.893 0.056
PAGRN18 0.854 0.837 0.056 0.771 0.774 0.071 0.855 0.814 0.095 0.919 0.889 0.048 0.927 0.889 0.061
HRS19 0.843 0.828 0.051 0.762 0.771 0.066 0.850 0.798 0.092 0.913 0.882 0.042 0.920 0.883 0.054
MLMS19 0.852 0.861 0.049 0.774 0.808 0.064 0.864 0.844 0.075 0.921 0.906 0.039 0.928 0.911 0.045
PAGE19 0.838 0.853 0.052 0.792 0.824 0.062 0.858 0.837 0.079 0.920 0.904 0.036 0.931 0.912 0.042
BANet19 0.852 0.860 0.046 0.793 0.822 0.061 0.866 0.838 0.079 0.919 0.901 0.037 0.935 0.913 0.041
GateNet 0.870 0.869 0.045 0.794 0.820 0.061 0.882 0.855 0.070 0.928 0.909 0.035 0.941 0.917 0.041
ResNet-50 backbone
SRM17† 0.826 0.835 0.059 0.769 0.797 0.069 0.848 0.830 0.087 0.906 0.886 0.046 0.917 0.895 0.054
DGRL18 0.828 0.841 0.050 0.774 0.805 0.062 0.856 0.836 0.073 0.911 0.895 0.036 0.922 0.903 0.041
CPD19 0.865 0.868 0.043 0.797 0.824 0.056 0.870 0.844 0.074 0.925 0.906 0.034 0.939 0.918 0.037
ICNet19 0.855 0.864 0.048 0.813 0.837 0.061 0.865 0.849 0.072 0.925 0.908 0.037 0.938 0.918 0.041
BASNet19 0.860 0.864 0.048 0.805 0.835 0.057 0.860 0.834 0.079 0.930 0.907 0.033 0.943 0.916 0.037
BANet19 0.872 0.878 0.040 0.803 0.832 0.059 0.877 0.851 0.072 0.930 0.913 0.033 0.944 0.924 0.035
GateNet 0.888 0.884 0.040 0.818 0.837 0.055 0.883 0.857 0.069 0.933 0.915 0.033 0.945 0.920 0.040
ResNet/ResNeXt-101 backbone
R3Net18†X 0.819 0.827 0.063 0.795 0.816 0.063 0.844 0.802 0.095 0.915 0.895 0.035 0.934 0.910 0.040
Capsal19S 0.819 0.818 0.063 0.639 0.673 0.101 0.869 0.837 0.074 0.883 0.851 0.058 0.863 0.826 0.077
GateNetS 0.893 0.889 0.038 0.821 0.844 0.054 0.883 0.862 0.067 0.937 0.920 0.031 0.951 0.930 0.035
GateNetX 0.898 0.895 0.035 0.829 0.848 0.051 0.888 0.865 0.065 0.943 0.925 0.029 0.952 0.929 0.035
Figure 7: Precision (vertical axis) recall (horizontal axis) curves on six popular RGB salient object detection datasets.

Figure 8: Visual comparison between our results and state-of-the-art methods.

Figure 9: Visual comparison of feature maps showing the effect of the multilevel gate units. D5 ~ D1 represent the feature maps of each decoder block from high level to low level. Odd rows and even rows are the results of the FPN baseline without and with multilevel gate units, respectively.

Effectiveness of Components. We quantitatively show the benefit of each component in Tab. 2. We take the results of the VGG-16 backbone with the FPN branch as the baseline. First, the multilevel gate units are added to the baseline network. The performance is significantly improved, with gains of 2.94%, 2.17% and 11.67% in terms of the F-measure, S-measure and MAE, respectively. To show the effect of the gate units more intuitively, we visualize the features of different levels in Fig. 9. It can be observed that even if the dog has a very low contrast with the chair or the billboard (see the 1st ~ 4th rows), by using multilevel gate units, a high contrast between the object region and the background is always maintained at each layer while the detail information is continually regained, thereby making salient objects effectively distinguishable. Besides, the gate units can avoid excessive suppression of the slender parts of objects (see the 5th ~ 8th rows): the corners of the poster and the limbs and even antennae of the mantis are retained well. Second, based on the gated baseline network, we design a series of experiments to verify the effectiveness of the folded convolution and the Fold-ASPP.

Table 2: Ablation analysis on the DUTS dataset.
Fβ Sm MAE
Baseline (FPN) 0.816 0.829 0.060
+ Gate Units 0.840 0.847 0.053
+ Fold-ASPP 0.866 0.863 0.047
+ Parallel Branch 0.870 0.869 0.045
Table 3: Evaluation of the folded convolution and Fold-ASPP. (x) stands for different sampling rates of atrous convolution.
Atrous(2) Atrous(4) Atrous(6) Fold(2) Fold(4) Fold(6) ASPP Fold-ASPP
Fβ 0.840 0.845 0.848 0.853 0.856 0.860 0.856 0.866
MAE 0.055 0.053 0.051 0.051 0.050 0.048 0.051 0.047
Sm 0.847 0.849 0.851 0.856 0.858 0.859 0.860 0.863

Tab. 3 details the results. We adopt atrous convolutions with dilation rates of $[2,4,6]$ and apply the same dilation rates to the folded convolutions. It can be observed that the folded convolution consistently yields significant performance improvements over the corresponding atrous convolution at each dilation rate in terms of all three metrics, and the single-layer Fold(6) already performs better than the ASPP that aggregates three atrous convolution layers. The Fold-ASPP also naturally outperforms the ASPP, with gains of 1.17% and 8.0% in terms of the F-measure and MAE, respectively. Finally, we add the parallel branch to further restore the details of objects. In this process, the gate units, Fold-ASPP and parallel branch complement each other without conflict.

5 Conclusions

In this paper, we propose a novel gated network architecture for saliency detection. We first adopt multilevel gate units to balance the contribution of each encoder block and suppress the activation of features in non-salient regions, which provides useful context information for the decoder while minimizing interference. The gate unit is simple yet effective; therefore, a gated FPN network can serve as a new baseline for dense prediction tasks. Next, we use the Fold-ASPP to gather multiscale semantic information for the decoder. Through the "Fold" operation, the atrous convolution achieves a local-in-local effect, which not only expands the receptive field but also retains the correlation among local sampling points. Finally, to further supplement the details, we combine all encoder features in parallel and construct a residual structure. Experimental results on five benchmark datasets demonstrate that the proposed model outperforms seventeen state-of-the-art methods under different evaluation metrics.

Acknowledgements. This work was supported in part by the National Natural Science Foundation of China #61876202, #61725202, #61751212 and #61829102, the Dalian Science and Technology Innovation Foundation #2019J12GX039, and the Fundamental Research Funds for the Central Universities # DUT20ZD212.

Appendix A

We expand our GateNet to other tasks including RGB-D Salient Object Detection (SOD) and Video Object Segmentation (VOS) to further demonstrate its effectiveness.

A.1 Network Architecture

Fig. 10 shows our proposed dual-branch gated FPN network for RGB-D SOD and VOS. Compared with the RGB SOD network, we only add an extra encoder to extract features of other modalities such as depth or optical flow. This dual-branch GateNet is easy to follow and can be used as a new baseline.
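A hedged sketch of this dual-branch extension is given below; merging the two modalities by level-wise element-wise addition is our assumption, since only the presence of an extra encoder is specified:

```python
import torch.nn as nn

class DualBranchEncoder(nn.Module):
    def __init__(self, rgb_encoder, aux_encoder):
        super().__init__()
        self.rgb_encoder = rgb_encoder    # returns a list of features E^1..E^5
        self.aux_encoder = aux_encoder    # same structure, fed with depth / optical flow

    def forward(self, rgb, aux):
        rgb_feats = self.rgb_encoder(rgb)
        aux_feats = self.aux_encoder(aux)
        # Merge the two modalities level by level (assumed: element-wise addition),
        # then pass the fused features to the gated decoder as in the RGB model.
        return [r + a for r, a in zip(rgb_feats, aux_feats)]
```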

Figure 10: Network pipeline.

A.2 RGB-D Salient Object Detection

Datasets. There are five main RGB-D SOD datasets: NJUD [23], RGBD135 [9], NLPR [34], SSD [74] and SIP [16]. We adopt the same data split as [3, 5, 18, 71, 36] to guarantee a fair comparison: 1,485 samples from NJUD and 700 samples from NLPR are used for training a new model. The remaining images in these two datasets and the other three datasets are all used for testing to verify the generalization ability of saliency models.
Evaluation Metrics. We adopt several metrics widely used in RGB-D SOD for quantitative evaluation: the F-measure score, the mean absolute error (MAE, $\mathcal{M}$), and the recently released S-measure ($S_{m}$) [14] and E-measure ($E_{m}$) [15] scores. A lower value is better for the MAE, and higher is better for the others.
Comparison with State-of-the-art Results. The performance of the proposed model is compared with ten state-of-the-art approaches on five benchmark datasets, including DES [9], DCMC [11], CDCP [75], DF [38], CTMF [18], PCA [3], MMCI [5], TANet [4], CPFP [71] and DMRA [36]. For fair comparisons, all the saliency maps of these methods are directly provided by the authors or computed by their released codes. We take VGG-16 as the backbone of each stream. Tab. 4 shows the performance comparison in terms of the maximum F-measure, mean F-measure, weighted F-measure, S-measure, E-measure and MAE scores. It can be seen that our GateNet is very competitive, and we believe that future works based on GateNet can further improve the performance and easily become state-of-the-art RGB-D SOD models.

A.3 Video Object Segmentation

According to whether the mask of the first frame of the video is provided at test time, video object segmentation (VOS) can be divided into zero-shot VOS and one-shot VOS. In this paper, we mainly use the dual-branch GateNet structure shown in Fig. 10 for zero-shot VOS.
Dataset and Metrics. DAVIS-16 [35] is one of the most popular benchmark datasets for video object segmentation. It consists of 50 high-quality video sequences (30 for training and 20 for validation) in total. We follow the training strategy of AGS [56], COSNet [31], PDB [44] and MATNet [73] in using extra datasets: the image saliency datasets MSRA10K [8] and DUT-OMRON [63] are used to pretrain our RGB branch, and the whole model is then trained with the training videos of DAVIS-16. For quantitative evaluation, we adopt two metrics, namely region similarity $\mathcal{J}$ and boundary accuracy $\mathcal{F}$.
Comparison with State-of-the-art Results. The performance of the proposed model is compared with ten state-of-the-art approaches on the DAVIS-16 dataset, including LVO [47], ARP [24], PDB [44], LSMO [48], MotAdapt [42], EPO [13], AGS [56], COSNet [31], AnDiff [65] and MATNet [73]. We follow most methods [73, 65, 31, 48] in taking ResNet-101 as the backbone. Tab. 5 shows the performance comparison in terms of $\mathcal{J}$ and $\mathcal{F}$. It should be noted that our method only performs feature extraction on the optical flow maps generated by PWC-Net [46] in order to supplement the motion information of the current frame. Without adding more cross-modal fusion techniques or using other tracking or detection models, our GateNet achieves competitive performance with most zero-shot VOS methods.

Table 4: Quantitative comparison. \uparrow and \downarrow indicate that the larger and smaller scores are better, respectively. Among the CNN-based methods, the best results are shown in red. The subscript in each model name is the publication year.
Metric Traditional Methods CNNs-Based Models
DES14 DCMC16 CDCP17 DF17 CTMF18 PCANet18 MMCI19 TANet19 CPFP19 DMRA19 GateNet
 [9]  [11]  [75]  [38]  [18]  [3]  [5]  [4]  [71]  [36] Ours
SSD [74] Fβmax↑ 0.260 0.750 0.576 0.763 0.755 0.844 0.823 0.835 0.801 0.858 0.868
Fβmean↑ 0.073 0.684 0.524 0.709 0.709 0.786 0.748 0.767 0.726 0.821 0.822
Fβw↑ 0.172 0.480 0.429 0.536 0.622 0.733 0.662 0.727 0.709 0.787 0.785
Sm↑ 0.341 0.706 0.603 0.741 0.776 0.842 0.813 0.839 0.807 0.856 0.870
Em↑ 0.475 0.790 0.714 0.801 0.838 0.890 0.860 0.886 0.832 0.898 0.901
M↓ 0.500 0.168 0.219 0.151 0.100 0.063 0.082 0.063 0.082 0.059 0.055
NJUD [23] Fβmax↑ 0.328 0.769 0.661 0.789 0.857 0.888 0.868 0.888 0.890 0.896 0.914
Fβmean↑ 0.165 0.715 0.618 0.744 0.788 0.844 0.813 0.844 0.837 0.871 0.879
Fβw↑ 0.234 0.497 0.510 0.545 0.720 0.803 0.739 0.805 0.828 0.847 0.849
Sm↑ 0.413 0.703 0.672 0.735 0.849 0.877 0.859 0.878 0.878 0.885 0.902
Em↑ 0.491 0.796 0.751 0.818 0.866 0.909 0.882 0.909 0.900 0.920 0.922
M↓ 0.448 0.167 0.182 0.151 0.085 0.059 0.079 0.061 0.053 0.051 0.047
RGBD135 [9] Fβmax↑ 0.800 0.311 0.651 0.625 0.865 0.842 0.839 0.853 0.882 0.906 0.919
Fβmean↑ 0.695 0.234 0.594 0.573 0.778 0.774 0.762 0.795 0.829 0.867 0.891
Fβw↑ 0.301 0.169 0.478 0.392 0.687 0.711 0.650 0.740 0.787 0.843 0.838
Sm↑ 0.632 0.469 0.709 0.685 0.863 0.843 0.848 0.858 0.872 0.899 0.905
Em↑ 0.817 0.676 0.810 0.806 0.911 0.912 0.904 0.919 0.927 0.944 0.966
M↓ 0.289 0.196 0.120 0.131 0.055 0.050 0.065 0.046 0.038 0.030 0.030
NLPR [34] Fβmax↑ 0.695 0.413 0.687 0.752 0.841 0.864 0.841 0.876 0.884 0.888 0.904
Fβmean↑ 0.583 0.328 0.592 0.683 0.724 0.795 0.730 0.796 0.818 0.855 0.854
Fβw↑ 0.254 0.259 0.501 0.516 0.679 0.762 0.676 0.780 0.807 0.840 0.838
Sm↑ 0.582 0.550 0.724 0.769 0.860 0.874 0.856 0.886 0.884 0.898 0.910
Em↑ 0.760 0.685 0.786 0.840 0.869 0.916 0.872 0.916 0.920 0.942 0.942
M↓ 0.301 0.196 0.115 0.100 0.056 0.044 0.059 0.041 0.038 0.031 0.032
SIP [16] Fβmax↑ 0.720 0.680 0.544 0.704 0.720 0.861 0.840 0.851 0.870 0.847 0.894
Fβmean↑ 0.644 0.645 0.495 0.673 0.684 0.825 0.795 0.809 0.819 0.815 0.856
Fβw↑ 0.342 0.414 0.397 0.406 0.535 0.768 0.712 0.748 0.788 0.734 0.810
Sm↑ 0.616 0.683 0.595 0.653 0.716 0.842 0.833 0.835 0.850 0.800 0.874
Em↑ 0.751 0.787 0.722 0.794 0.824 0.900 0.886 0.894 0.899 0.858 0.914
M↓ 0.298 0.186 0.224 0.185 0.139 0.071 0.086 0.075 0.064 0.088 0.057
Table 5: Quantitative comparison of Zero-shot VOS methods on the DAVIS-16 validation set. \uparrow and \downarrow indicate that the larger and smaller scores are better, respectively. The best results are shown in red. The subscript in each model name is the publication year.
Metric LVO17 ARP17 PDB18 LSMO19 MotAdapt19 EPO20 AGS19 COSNet19 AnDiff19 MATNet20 GateNet
 [47]  [24]  [44]  [48]  [42]  [13]  [56]  [31]  [65]  [73] Ours
$\mathcal{J}$ Mean↑ 75.9 76.2 77.2 78.2 77.2 80.6 79.7 80.5 81.7 82.4 80.9
Recall↑ 89.1 91.1 90.1 89.1 87.8 95.2 91.1 93.1 90.9 94.5 94.3
Decay↓ 0.0 7.0 0.9 4.1 5.0 2.2 1.9 4.4 2.2 5.5 3.3
$\mathcal{F}$ Mean↑ 72.1 70.6 74.5 75.9 77.4 75.5 77.4 79.5 80.5 80.7 79.4
Recall↑ 83.4 83.5 84.4 84.7 84.4 87.9 85.8 89.5 85.1 90.2 89.2
Decay↓ 1.3 7.9 -0.2 3.5 3.3 2.4 1.6 5.0 0.6 4.5 2.9

References

  • [1] Achanta, R., Hemami, S., Estrada, F., Süsstrunk, S.: Frequency-tuned salient region detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 1597–1604 (2009)
  • [2] Amirul Islam, M., Rochan, M., Bruce, N.D., Wang, Y.: Gated feedback refinement network for dense image labeling. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 3751–3759 (2017)
  • [3] Chen, H., Li, Y.: Progressively complementarity-aware fusion network for rgb-d salient object detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 3051–3060 (2018)
  • [4] Chen, H., Li, Y.: Three-stream attention-aware network for rgb-d salient object detection. IEEE Transactions on Image Processing 28(6), 2825–2835 (2019)
  • [5] Chen, H., Li, Y., Su, D.: Multi-modal fusion network with multi-scale multi-path and cross-modal interactions for rgb-d salient object detection. Pattern Recognition 86, 376–385 (2019)
  • [6] Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(4), 834–848 (2017)
  • [7] Chen, S., Tan, X., Wang, B., Hu, X.: Reverse attention for salient object detection. In: Proceedings of European Conference on Computer Vision. pp. 234–250 (2018)
  • [8] Cheng, M.M., Mitra, N.J., Huang, X., Torr, P.H.S., Hu, S.M.: Global contrast based salient region detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 37(3), 569–582 (2015)
  • [9] Cheng, Y., Fu, H., Wei, X., Xiao, J., Cao, X.: Depth enhanced saliency detection method. In: International Conference on Internet Multimedia Computing and Service. p. 23 (2014)
  • [10] Chollet, F.: Xception: Deep learning with depthwise separable convolutions. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 1251–1258 (2017)
  • [11] Cong, R., Lei, J., Zhang, C., Huang, Q., Cao, X., Hou, C.: Saliency detection for stereoscopic images based on depth confidence analysis and multiple cues fusion. IEEE Signal Processing Letters 23(6), 819–823 (2016)
  • [12] Deng, Z., Hu, X., Zhu, L., Xu, X., Qin, J., Han, G., Heng, P.A.: R3net: Recurrent residual refinement network for saliency detection. In: Proceedings of International Joint Conference on Artificial Intelligence. pp. 684–690 (2018)
  • [13] Faisal, M., Akhter, I., Ali, M., Hartley, R.: Exploiting geometric constraints on dense trajectories for motion saliency. arXiv preprint arXiv:1909.13258 (2019)
  • [14] Fan, D.P., Cheng, M.M., Liu, Y., Li, T., Borji, A.: Structure-measure: A new way to evaluate foreground maps. In: Proceedings of IEEE International Conference on Computer Vision. pp. 4548–4557 (2017)
  • [15] Fan, D.P., Gong, C., Cao, Y., Ren, B., Cheng, M.M., Borji, A.: Enhanced-alignment measure for binary foreground map evaluation. arXiv preprint arXiv:1805.10421 (2018)
  • [16] Fan, D.P., Lin, Z., Zhao, J.X., Liu, Y., Zhang, Z., Hou, Q., Zhu, M., Cheng, M.M.: Rethinking rgb-d salient object detection: Models, datasets, and large-scale benchmarks. arXiv preprint arXiv:1907.06781 (2019)
  • [17] Fang, H., Gupta, S., Iandola, F., Srivastava, R.K., Deng, L., Dollár, P., Gao, J., He, X., Mitchell, M., Platt, J.C., et al.: From captions to visual concepts and back. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 1473–1482 (2015)
  • [18] Han, J., Chen, H., Liu, N., Yan, C., Li, X.: Cnns-based rgb-d saliency detection via cross-view transfer and multiview fusion. IEEE Transactions on Cybernetics 48(11), 3171–3183 (2017)
  • [19] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 770–778 (2016)
  • [20] Hou, Q., Cheng, M.M., Hu, X., Borji, A., Tu, Z., Torr, P.H.: Deeply supervised salient object detection with short connections. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 3203–3212 (2017)
  • [21] Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017)
  • [22] Jiang, Z., Davis, L.S.: Submodular salient region detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 2043–2050 (2013)
  • [23] Ju, R., Ge, L., Geng, W., Ren, T., Wu, G.: Depth saliency based on anisotropic center-surround difference. In: Proceedings of International Conference on Image Processing. pp. 1115–1119 (2014)
  • [24] Jun Koh, Y., Kim, C.S.: Primary object segmentation in videos based on region augmentation and reduction. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 3442–3450 (2017)
  • [25] Li, G., Yu, Y.: Visual saliency based on multiscale deep features. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 5455–5463 (2015)
  • [26] Li, G., Yu, Y.: Deep contrast learning for salient object detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 478–487 (2016)
  • [27] Li, Y., Hou, X., Koch, C., Rehg, J.M., Yuille, A.L.: The secrets of salient object segmentation. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 280–287 (2014)
  • [28] Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 2117–2125 (2017)
  • [29] Liu, N., Han, J.: Dhsnet: Deep hierarchical saliency network for salient object detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 678–686 (2016)
  • [30] Liu, W., Rabinovich, A., Berg, A.C.: Parsenet: Looking wider to see better. arXiv preprint arXiv:1506.04579 (2015)
  • [31] Lu, X., Wang, W., Ma, C., Shen, J., Shao, L., Porikli, F.: See more, know more: Unsupervised video object segmentation with co-attention siamese networks. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 3623–3632 (2019)
  • [32] Mahadevan, V., Vasconcelos, N.: Saliency-based discriminant tracking. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2009)
  • [33] Pang, Y., Zhao, X., Zhang, L., Lu, H.: Multi-scale interactive network for salient object detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 9413–9422 (2020)
  • [34] Peng, H., Li, B., Xiong, W., Hu, W., Ji, R.: Rgbd salient object detection: A benchmark and algorithms. In: Proceedings of European Conference on Computer Vision. pp. 92–109 (2014)
  • [35] Perazzi, F., Pont-Tuset, J., McWilliams, B., Van Gool, L., Gross, M., Sorkine-Hornung, A.: A benchmark dataset and evaluation methodology for video object segmentation. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 724–732 (2016)
  • [36] Piao, Y., Ji, W., Li, J., Zhang, M., Lu, H.: Depth-induced multi-scale recurrent attention network for saliency detection. In: Proceedings of IEEE International Conference on Computer Vision. pp. 7254–7263 (2019)
  • [37] Qin, X., Zhang, Z., Huang, C., Gao, C., Dehghan, M., Jagersand, M.: Basnet: Boundary-aware salient object detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 7479–7489 (2019)
  • [38] Qu, L., He, S., Zhang, J., Tian, J., Tang, Y., Yang, Q.: Rgbd salient object detection via deep fusion. IEEE Transactions on Image Processing 26(5), 2274–2285 (2017)
  • [39] Ren, Z., Gao, S., Chia, L.T., Tsang, I.W.H.: Region-based saliency detection and its application in object recognition. IEEE Transactions on Circuits and Systems for Video Technology 24(5), 769–779 (2013)
  • [40] Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: Proceedings of International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 234–241 (2015)
  • [41] Rui, Z., Ouyang, W., Wang, X.: Unsupervised salience learning for person re-identification. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2013)
  • [42] Siam, M., Jiang, C., Lu, S., Petrich, L., Gamal, M., Elhoseiny, M., Jagersand, M.: Video object segmentation using teacher-student adaptation in a human robot interaction (hri) setting. In: 2019 International Conference on Robotics and Automation (ICRA). pp. 50–56. IEEE (2019)
  • [43] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
  • [44] Song, H., Wang, W., Zhao, S., Shen, J., Lam, K.M.: Pyramid dilated deeper convlstm for video salient object detection. In: Proceedings of European Conference on Computer Vision. pp. 715–731 (2018)
  • [45] Su, J., Li, J., Zhang, Y., Xia, C., Tian, Y.: Selectivity or invariance: Boundary-aware salient object detection. In: Proceedings of IEEE International Conference on Computer Vision. pp. 3799–3808 (2019)
  • [46] Sun, D., Yang, X., Liu, M., Kautz, J.: Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. p. 8934–8943 (2018)
  • [47] Tokmakov, P., Alahari, K., Schmid, C.: Learning video object segmentation with visual memory. In: Proceedings of IEEE International Conference on Computer Vision. pp. 4481–4490 (2017)
  • [48] Tokmakov, P., Schmid, C., Alahari, K.: Learning to segment moving objects. International Journal of Computer Vision 127(3), 282–301 (2019)
  • [49] Wang, L., Lu, H., Wang, Y., Feng, M., Wang, D., Yin, B., Ruan, X.: Learning to detect salient objects with image-level supervision. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 136–145 (2017)
  • [50] Wang, L., Wang, L., Lu, H., Zhang, P., Ruan, X.: Saliency detection with recurrent fully convolutional networks. In: Proceedings of European Conference on Computer Vision. pp. 825–841 (2016)
  • [51] Wang, T., Borji, A., Zhang, L., Zhang, P., Lu, H.: A stagewise refinement model for detecting salient objects in images. In: Proceedings of IEEE International Conference on Computer Vision. pp. 4019–4028 (2017)
  • [52] Wang, T., Piao, Y., Li, X., Zhang, L., Lu, H.: Deep learning for light field saliency detection. In: Proceedings of IEEE International Conference on Computer Vision. pp. 8838–8848 (2019)
  • [53] Wang, T., Zhang, L., Wang, S., Lu, H., Yang, G., Ruan, X., Borji, A.: Detect globally, refine locally: A novel approach to saliency detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 3127–3135 (2018)
  • [54] Wang, W., Lai, Q., Fu, H., Shen, J., Ling, H.: Salient object detection in the deep learning era: An in-depth survey. arXiv preprint arXiv:1904.09146 (2019)
  • [55] Wang, W., Shen, J., Cheng, M.M., Shao, L.: An iterative and cooperative top-down and bottom-up inference network for salient object detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 5968–5977 (2019)
  • [56] Wang, W., Song, H., Zhao, S., Shen, J., Zhao, S., Hoi, S.C., Ling, H.: Learning unsupervised video object segmentation through visual attention. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 3064–3074 (2019)
  • [57] Wang, W., Zhao, S., Shen, J., Hoi, S.C., Borji, A.: Salient object detection with pyramid attention and salient edges. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 1448–1457 (2019)
  • [58] Wu, R., Feng, M., Guan, W., Wang, D., Lu, H., Ding, E.: A mutual learning method for salient object detection with intertwined multi-supervision. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 8150–8159 (2019)
  • [59] Wu, Z., Su, L., Huang, Q.: Cascaded partial decoder for fast and accurate salient object detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 3907–3916 (2019)
  • [60] Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 1492–1500 (2017)
  • [61] Yan, Q., Xu, L., Shi, J., Jia, J.: Hierarchical saliency detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 1155–1162 (2013)
  • [62] Yang, C., Zhang, L., Lu, H., Ruan, X., Yang, M.H.: Saliency detection via graph-based manifold ranking. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 3166–3173 (2013)
  • [63] Yang, C., Zhang, L., Lu, H., Ruan, X., Yang, M.H.: Saliency detection via graph-based manifold ranking. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 3166–3173 (2013)
  • [64] Yang, G.R., Murray, J.D., Wang, X.J.: A dendritic disinhibitory circuit mechanism for pathway-specific gating. Nature communications 7, 12815 (2016)
  • [65] Yang, Z., Wang, Q., Bertinetto, L., Hu, W., Bai, S., Torr, P.H.: Anchor diffusion for unsupervised video object segmentation. In: Proceedings of IEEE International Conference on Computer Vision. pp. 931–940 (2019)
  • [66] Zeng, Y., Zhang, P., Zhang, J., Lin, Z., Lu, H.: Towards high-resolution salient object detection. In: Proceedings of IEEE International Conference on Computer Vision. pp. 7234–7243 (2019)
  • [67] Zhang, L., Dai, J., Lu, H., He, Y., Wang, G.: A bi-directional message passing model for salient object detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 1741–1750 (2018)
  • [68] Zhang, L., Zhang, J., Lin, Z., Lu, H., He, Y.: Capsal: Leveraging captioning to boost semantics for salient object detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 6024–6033 (2019)
  • [69] Zhang, P., Wang, D., Lu, H., Wang, H., Ruan, X.: Amulet: Aggregating multi-level convolutional features for salient object detection. In: Proceedings of IEEE International Conference on Computer Vision. pp. 202–211 (2017)
  • [70] Zhang, X., Wang, T., Qi, J., Lu, H., Wang, G.: Progressive attention guided recurrent network for salient object detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 714–722 (2018)
  • [71] Zhao, J.X., Cao, Y., Fan, D.P., Cheng, M.M., Li, X.Y., Zhang, L.: Contrast prior and fluid pyramid integration for rgbd salient object detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2019)
  • [72] Zhao, T., Wu, X.: Pyramid feature attention network for saliency detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. pp. 3085–3094 (2019)
  • [73] Zhou, T., Wang, S., Zhou, Y., Yao, Y., Li, J., Shao, L.: Motion-attentive transition for zero-shot video object segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 2, p. 3 (2020)
  • [74] Zhu, C., Li, G.: A three-pathway psychobiological framework of salient object detection using stereoscopic technology. In: Proceedings of IEEE International Conference on Computer Vision. pp. 3008–3014 (2017)
  • [75] Zhu, C., Li, G., Wang, W., Wang, R.: An innovative salient object detection using center-dark channel prior. In: Proceedings of IEEE International Conference on Computer Vision. pp. 1509–1515 (2017)