Salient Object Detection via Bounding-box Supervision
Abstract
The success of fully supervised saliency detection models depends on a large amount of pixel-wise labeling. In this paper, we work on bounding-box based weakly-supervised saliency detection to relieve the labeling effort. Given the bounding box annotation, we observe that pixels inside the bounding box may contain extensive labeling noise. However, as a large amount of background is excluded, the foreground bounding box region contains a less complex background, making it possible to perform handcrafted-feature-based saliency detection using only the cropped foreground region. As the conventional handcrafted features are not representative enough, leading to noisy saliency maps, we further introduce a structure-aware self-supervised loss to regularize the structure of the prediction. Further, since pixels outside the bounding box should be background, a partial cross-entropy loss function can be used to accurately localize the background region. Experimental results on six benchmark RGB saliency datasets illustrate the effectiveness of our model.
Index Terms— Weakly supervised learning, Bounding-box annotation, Structure-aware self-supervised loss
1 Introduction
Salient object detection aims to localize the full scope of the salient foreground, and is usually defined as a binary segmentation task. Most conventional techniques are fully-supervised [1, 2, 3], where pixel-wise annotations are needed as supervision to train a mapping from the input image space to the output saliency space. The strong dependency on pixel-wise labeling poses both efficiency and budget challenges for existing fully-supervised saliency detection techniques. To relieve the labeling effort, some un-/weakly-supervised salient object detection techniques [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14] have been introduced, which aim to learn saliency from easy-to-obtain labeling, including image-level labels [5], scribbles [4, 13], bounding box supervision [12] or noisy labeling [9, 7, 6, 8, 14]. In this paper, we present a new bounding box supervision based weakly-supervised saliency detection model.
Fig. 1: From left to right: Image, bounding box annotation (bndBox), and the pseudo saliency map generated by GrabCut (GrabCut).
The advantage of bounding-box supervision is that the region outside the bounding box is pure background. Although some background still exists within the bounding-box foreground annotation, the significantly reduced background region makes it possible to apply conventional handcrafted-feature-based saliency detection methods to the cropped foreground region. In Fig. 1, we show the pseudo label generated by combining the bounding-box annotation with a conventional handcrafted-feature-based method, which clearly shows its superiority.
As a deep neural network can fit any type of noise [15], directly using the generated pseudo label as ground truth will lead to a biased model that over-fits the noisy pseudo label. To further constrain the structure of the prediction, we present a structure-aware self-supervised loss function. Its main goal is to constrain the prediction to have a structure well-aligned with the input image. In this way, we aim to obtain structure-accurate predictions (see Fig. 3).
2 Related Work
Fully-supervised saliency models: The main focus of fully-supervised saliency detection models is effective feature aggregation [16, 17, 18]. Due to the use of stride operations, the resulting saliency maps usually have low resolution. To produce structure-accurate saliency predictions, some methods use edge supervision to learn more features about the object boundary and refine the saliency predictions with better object structure [19, 20, 21].
3 Our Method
3.1 Overview
In this paper, we study weakly-supervised saliency detection with bounding-box supervision. Specifically, given bounding-box supervision for the training dataset, the model is trained to produce accurate saliency maps for testing images. Let us define the training dataset as $D=\{x_i, y_i\}_{i=1}^{N}$ of size $N$, where $x_i$ and $y_i$ are the input RGB image and its corresponding bounding box supervision, and $i$ indexes the images. To generate $y_i$, each non-overlapping salient instance is annotated with a separate bounding box, and we generate one single bounding box for overlapping salient instances. In this way, the foreground of $y_i$ contains different levels of noise depending on the position and shape of the bounding box, while the background of $y_i$ is accurate background.
Four main steps are included in our method: (1) a pseudo saliency map generator, which uses GrabCut to generate pseudo saliency maps given the bounding box supervision; (2) a saliency prediction network (SPN) to produce a saliency map, supervised with the above pseudo saliency maps; (3) a structure-aware loss to optimize the foreground predictions; and (4) a partial cross-entropy based background loss to optimize the background prediction.
3.2 Pseudo Saliency Map Generator via GrabCut
Given the bounding box supervision, we first generate pseudo saliency maps with GrabCut [23] (see Fig. 1). Compared with the direct bounding box supervision $y$, the generated pseudo saliency map $y^{gc}$ is more accurate in structure, making it suitable to serve as the pseudo label for saliency prediction. Specifically, we repeat the GrabCut operation several times until we obtain a reasonably accurate pseudo label.
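Below is a minimal sketch of such a pseudo-label generator, assuming OpenCV's `cv2.grabCut` and bounding boxes given as `(x, y, w, h)` tuples; the function and parameter names are illustrative and not taken from our implementation.

```python
# Sketch: generate a binary pseudo saliency map by running GrabCut inside
# each bounding box and merging the per-box foreground masks.
import cv2
import numpy as np

def grabcut_pseudo_label(image, boxes, iters=5, repeats=3):
    """image: uint8 BGR image; boxes: list of (x, y, w, h) bounding boxes."""
    h, w = image.shape[:2]
    pseudo = np.zeros((h, w), dtype=np.uint8)
    for (x, y, bw, bh) in boxes:
        mask = np.zeros((h, w), dtype=np.uint8)
        bgd_model = np.zeros((1, 65), dtype=np.float64)
        fgd_model = np.zeros((1, 65), dtype=np.float64)
        # Initialize from the rectangle, then repeat GrabCut a few times
        # re-using the previous mask, as described in Sec. 3.2.
        cv2.grabCut(image, mask, (x, y, bw, bh), bgd_model, fgd_model,
                    iters, cv2.GC_INIT_WITH_RECT)
        for _ in range(repeats - 1):
            cv2.grabCut(image, mask, None, bgd_model, fgd_model,
                        iters, cv2.GC_INIT_WITH_MASK)
        fg = (mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD)
        pseudo = np.maximum(pseudo, fg.astype(np.uint8))
    return pseudo
```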
3.3 Saliency Prediction Network
Following conventional fully-supervised saliency detection, we design a "saliency prediction network" to generate saliency maps. Specifically, we take ResNet50 as the backbone. Given the backbone features $\{f_k\}_{k=1}^{4}=f_\theta(x)$, where $\theta$ denotes the parameters of the encoder (the backbone model), the saliency prediction network aims to generate the saliency map $s=g_\gamma(\{f_k\}_{k=1}^{4})$, with $\gamma$ as the parameters of the decoder. Specifically, we feed each $f_k$ to a convolutional layer to generate a new backbone feature $f'_k$ of the same channel size $C$. Then, we adopt the decoder from [25], which takes $\{f'_k\}_{k=1}^{4}$ as input and generates $s$. Note that we define the parameters of the above four convolutional layers as part of the parameters within $\gamma$.
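The PyTorch sketch below illustrates one possible instantiation of the saliency prediction network, assuming the four backbone features are the outputs of the four ResNet50 stages, a channel size of C=64, and a simple top-down fusion decoder standing in for the decoder of [25]; all module names and sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50  # torchvision >= 0.13 assumed

class SaliencyPredictionNetwork(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        backbone = resnet50(weights="IMAGENET1K_V1")  # ImageNet pre-trained
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1,
                                  backbone.relu, backbone.maxpool)
        self.stages = nn.ModuleList([backbone.layer1, backbone.layer2,
                                     backbone.layer3, backbone.layer4])
        # Four convolutions mapping each backbone feature to C channels.
        in_dims = [256, 512, 1024, 2048]
        self.reduce = nn.ModuleList(
            [nn.Conv2d(d, channels, 3, padding=1) for d in in_dims])
        self.fuse = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=1) for _ in range(3)])
        self.head = nn.Conv2d(channels, 1, 3, padding=1)

    def forward(self, x):
        feats, f = [], self.stem(x)
        for stage, red in zip(self.stages, self.reduce):
            f = stage(f)
            feats.append(red(f))        # new backbone features of size C
        # Top-down fusion from the deepest feature to the shallowest.
        out = feats[-1]
        for i in range(2, -1, -1):
            out = F.interpolate(out, size=feats[i].shape[2:],
                                mode='bilinear', align_corners=False)
            out = self.fuse[i](out + feats[i])
        s = self.head(out)              # single-channel saliency logits
        return F.interpolate(s, scale_factor=4, mode='bilinear',
                             align_corners=False)
```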
Given the saliency prediction $s$, we first design a task-related loss function aiming to regress the pseudo saliency map from GrabCut. Specifically, we adopt the symmetric cross-entropy loss [26], which is proven relatively robust to labeling noise:
$\mathcal{L}_{sce} = \alpha \cdot \mathcal{L}_{ce}(s, y^{gc}) + \beta \cdot \mathcal{L}_{ce}(y^{gc}, s)$    (1)
where $\mathcal{L}_{ce}$ is the cross-entropy loss, $\alpha$ and $\beta$ are used to weigh the contribution of each term and are set empirically in this paper, and $s$ and $y^{gc}$ are the model prediction and the pseudo saliency map from GrabCut, respectively.
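For clarity, a pixel-wise binary variant of the symmetric cross-entropy objective in Eq. (1) could look as follows; the clipping constant and the default weights are assumptions, not the values used in the paper.

```python
import torch

def symmetric_bce(pred, pseudo, alpha=1.0, beta=1.0, eps=1e-4):
    """pred: sigmoid saliency map in [0, 1]; pseudo: GrabCut pseudo label."""
    pred = pred.clamp(eps, 1.0 - eps)
    pseudo_c = pseudo.clamp(eps, 1.0 - eps)   # avoid log(0) in the reverse term
    ce = -(pseudo * torch.log(pred) + (1 - pseudo) * torch.log(1 - pred))
    rce = -(pred * torch.log(pseudo_c) + (1 - pred) * torch.log(1 - pseudo_c))
    return alpha * ce.mean() + beta * rce.mean()
```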
3.4 Foreground Structure Constraint
Although GrabCut generates relatively better pseudo labels compared with the original bounding box supervision, complex saliency foregrounds still lead to noisy supervision after GrabCut. To constrain the prediction within the foreground bounding box region, we adopt the smoothness loss [22] to produce structure-accurate predictions.
The smoothness loss is defined as:
$\mathcal{L}_{fg} = \sum_{u,v} g_{u,v} \sum_{d \in \{x, y\}} \Psi\left(|\partial_d s_{u,v}| \, e^{-|\partial_d I_{u,v}|}\right)$    (2)
where $\Psi$ is defined as $\Psi(a)=\sqrt{a^2+\epsilon}$ with a small constant $\epsilon$ to avoid calculating the square root of zero, $I_{u,v}$ is the image intensity value at pixel $(u,v)$, and $\partial_x$, $\partial_y$ indicate the partial derivatives in the $x$ and $y$ directions. Different from the conventional smoothness loss in [22], we introduce a gate $g$ into the calculation of the smoothness loss to pay attention to the bounding box foreground region, inspired by [4].
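A possible implementation of the gated smoothness loss in Eq. (2) is sketched below, assuming `gate` is the binary foreground bounding-box mask and using a Charbonnier penalty for $\Psi$; the edge-weighting factor `alpha` is an assumption (1.0 matches Eq. (2) as written).

```python
import torch

def charbonnier(x, eps=1e-6):
    return torch.sqrt(x * x + eps)        # avoids the square root of zero

def gated_smoothness_loss(pred, image_gray, gate, alpha=1.0):
    """pred, image_gray, gate: tensors of shape (B, 1, H, W)."""
    loss = 0.0
    for dim in (2, 3):                    # y and x directions
        d_pred = torch.abs(torch.diff(pred, dim=dim))
        d_img = torch.abs(torch.diff(image_gray, dim=dim))
        g = gate.narrow(dim, 0, gate.size(dim) - 1)
        # Penalize saliency gradients where the image is smooth,
        # only inside the foreground bounding box region.
        loss = loss + (g * charbonnier(d_pred * torch.exp(-alpha * d_img))).mean()
    return loss
```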
Table 1: Quantitative comparison with fully- and weakly-supervised benchmark models.
 | DUTS [5] | | | | ECSSD [27] | | | | DUT [28] | | | | HKU-IS [29] | | | | PASCAL-S [30] | | | | SOD [31] | | |
Method | $S_\alpha\uparrow$ | $F_\beta\uparrow$ | $E_\xi\uparrow$ | $\mathcal{M}\downarrow$ | $S_\alpha\uparrow$ | $F_\beta\uparrow$ | $E_\xi\uparrow$ | $\mathcal{M}\downarrow$ | $S_\alpha\uparrow$ | $F_\beta\uparrow$ | $E_\xi\uparrow$ | $\mathcal{M}\downarrow$ | $S_\alpha\uparrow$ | $F_\beta\uparrow$ | $E_\xi\uparrow$ | $\mathcal{M}\downarrow$ | $S_\alpha\uparrow$ | $F_\beta\uparrow$ | $E_\xi\uparrow$ | $\mathcal{M}\downarrow$ | $S_\alpha\uparrow$ | $F_\beta\uparrow$ | $E_\xi\uparrow$ | $\mathcal{M}\downarrow$
SCRN [32] | .885 | .833 | .900 | .040 | .920 | .910 | .933 | .041 | .837 | .749 | .847 | .056 | .916 | .894 | .935 | .034 | .869 | .833 | .892 | .063 | .817 | .790 | .829 | .087 |
F3Net [33] | .888 | .852 | .920 | .035 | .919 | .921 | .943 | .036 | .839 | .766 | .864 | .053 | .917 | .910 | .952 | .028 | .861 | .835 | .898 | .062 | .824 | .814 | .850 | .077 |
ITSD [34] | .886 | .841 | .917 | .039 | .920 | .916 | .943 | .037 | .842 | .767 | .867 | .056 | .921 | .906 | .950 | .030 | .860 | .830 | .894 | .066 | .836 | .829 | .867 | .076 |
PAKRN [35] | .900 | .876 | .935 | .033 | .928 | .930 | .951 | .032 | .853 | .796 | .888 | .050 | .923 | .919 | .955 | .028 | .859 | .856 | .898 | .068 | .833 | .836 | .866 | .074 |
MSFNet [36] | .877 | .855 | .927 | .034 | .915 | .927 | .951 | .033 | .832 | .772 | .873 | .050 | .909 | .913 | .957 | .027 | .849 | .855 | .900 | .064 | .813 | .822 | .852 | .077 |
CTDNet[37] | .893 | .862 | .928 | .034 | .925 | .928 | .950 | .032 | .844 | .779 | .874 | .052 | .919 | .915 | .954 | .028 | .861 | .856 | .901 | .064 | .829 | .832 | .858 | .074 |
VST[38] | .896 | .842 | .918 | .037 | .932 | .911 | .943 | .034 | .850 | .771 | .869 | .058 | .928 | .903 | .950 | .030 | .873 | .832 | .900 | .067 | .854 | .833 | .879 | .065 |
GTSOD [39] | .908 | .875 | .942 | .029 | .935 | .935 | .962 | .026 | .858 | .797 | .892 | .051 | .930 | .922 | .964 | .023 | .877 | .855 | .915 | .054 | .860 | .860 | .898 | .061 |
SSAL [4] | .803 | .747 | .865 | .062 | .863 | .865 | .908 | .061 | .785 | .702 | .835 | .068 | .865 | .858 | .923 | .047 | .798 | .773 | .854 | .093 | .750 | .743 | .801 | .108 |
WSS [5] | .748 | .633 | .806 | .100 | .808 | .774 | .801 | .106 | .730 | .590 | .729 | .110 | .822 | .773 | .819 | .079 | .701 | .691 | .687 | .187 | .698 | .635 | .687 | .152 |
C2S [40] | .805 | .718 | .845 | .071 | - | - | - | - | .773 | .665 | .810 | .082 | .869 | .837 | .910 | .053 | .784 | .806 | .813 | .130 | .770 | .741 | .799 | .117 |
SCWS [13] | .841 | .818 | .901 | .049 | .879 | .894 | .924 | .051 | .813 | .751 | .856 | .060 | .883 | .892 | .938 | .038 | .821 | .815 | .877 | .078 | .782 | .791 | .833 | .090 |
Ours | .796 | .715 | .821 | .070 | .856 | .838 | .873 | .069 | .792 | .695 | .812 | .072 | .855 | .826 | .882 | .057 | .748 | .760 | .772 | .147 | .740 | .704 | .758 | .119 |
3.5 Background Accuracy Constraint
As the background of the bounding box supervision is accurate background for saliency prediction, we adopt a partial cross-entropy loss to constrain the accuracy of the prediction within the background region defined by the bounding box annotation. Specifically, given the bounding box annotation $y$ and the model prediction $s$ from the saliency prediction network, we define the background loss as:
$\mathcal{L}_{bg} = \frac{1}{H \times W - N_{fg}} \sum_{(u,v):\, y_{u,v}=0} \mathcal{L}_{ce}(s_{u,v}, z_{u,v})$    (3)
where $H$ and $W$ represent the image size, $N_{fg}$ is the number of pixels covered by the foreground bounding boxes, and $z$ is an all-zero matrix of the same size as $s$.
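A minimal sketch of the background loss in Eq. (3) is given below, assuming the prediction has already been passed through a sigmoid and that `bbox_mask` is the binarized bounding box annotation (1 inside any foreground box, 0 outside).

```python
import torch
import torch.nn.functional as F

def background_loss(pred, bbox_mask, eps=1e-6):
    """pred: sigmoid saliency map (B, 1, H, W); bbox_mask: binary box mask."""
    bg = 1.0 - bbox_mask                          # pixels known to be background
    target = torch.zeros_like(pred)               # all-zero matrix z
    ce = F.binary_cross_entropy(pred, target, reduction='none')
    # Average over the H*W - N_fg background pixels only.
    return (bg * ce).sum() / (bg.sum() + eps)
```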
3.6 Training the Model
With the saliency regression loss $\mathcal{L}_{sce}$, the foreground structure loss $\mathcal{L}_{fg}$ and the background accuracy loss $\mathcal{L}_{bg}$, our final loss function is defined as:
$\mathcal{L} = \mathcal{L}_{sce} + \lambda_1 \mathcal{L}_{fg} + \lambda_2 \mathcal{L}_{bg}$    (4)
where $\lambda_1$ and $\lambda_2$ are used to control the contributions of the foreground loss and the background loss; both are set empirically.
We train our final model for 40 epochs. The saliency prediction network is initialized with ResNet50 pre-trained on ImageNet. The initial learning rate is 2.5e-4, the decay epoch is 20, and the decay rate is 0.9. With a training batch size of 16 on a PC with an NVIDIA GeForce GTX 1080Ti GPU, training the network takes 8 hours.
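Tying the pieces together, one training step with the combined objective of Eq. (4) might look like the sketch below; the optimizer choice and the loss weights `lambda_fg`/`lambda_bg` are assumptions, since only the learning rate, batch size and epoch schedule are specified above.

```python
import torch

model = SaliencyPredictionNetwork().cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=2.5e-4)

def train_step(image, image_gray, pseudo_label, bbox_mask,
               lambda_fg=0.3, lambda_bg=1.0):
    """One optimization step with the combined loss of Eq. (4)."""
    pred = torch.sigmoid(model(image))
    loss = (symmetric_bce(pred, pseudo_label)
            + lambda_fg * gated_smoothness_loss(pred, image_gray, bbox_mask)
            + lambda_bg * background_loss(pred, bbox_mask))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```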
4 Experimental Results
4.1 Setup
Data set: We train our models using the DUTS training dataset [5], and test on six widely used datasets: the DUTS testing dataset, ECSSD [27], DUT [28], HKU-IS [29], PASCAL-S [30] and the SOD testing dataset [31]. The supervision in our case is the bounding box annotation.
Table 2: Ablation study on the main components of our model.
 | DUTS [5] | | | | ECSSD [27] | | | | DUT [28] | | | | HKU-IS [29] | | | | PASCAL-S [30] | | | | SOD [31] | | |
Method | $S_\alpha\uparrow$ | $F_\beta\uparrow$ | $E_\xi\uparrow$ | $\mathcal{M}\downarrow$ | $S_\alpha\uparrow$ | $F_\beta\uparrow$ | $E_\xi\uparrow$ | $\mathcal{M}\downarrow$ | $S_\alpha\uparrow$ | $F_\beta\uparrow$ | $E_\xi\uparrow$ | $\mathcal{M}\downarrow$ | $S_\alpha\uparrow$ | $F_\beta\uparrow$ | $E_\xi\uparrow$ | $\mathcal{M}\downarrow$ | $S_\alpha\uparrow$ | $F_\beta\uparrow$ | $E_\xi\uparrow$ | $\mathcal{M}\downarrow$ | $S_\alpha\uparrow$ | $F_\beta\uparrow$ | $E_\xi\uparrow$ | $\mathcal{M}\downarrow$
BndBox | .642 | .514 | .673 | .167 | .683 | .620 | .712 | .185 | .671 | .533 | .701 | .149 | .683 | .597 | .722 | .167 | .621 | .608 | .669 | .232 | .625 | .552 | .669 | .203 |
GCut | .795 | .707 | .814 | .071 | .856 | .834 | .871 | .069 | .787 | .681 | .800 | .075 | .852 | .820 | .874 | .059 | .745 | .751 | .762 | .151 | .733 | .689 | .747 | .120 |
FGCut | .787 | .715 | .823 | .069 | .854 | .846 | .883 | .068 | .778 | .687 | .806 | .070 | .851 | .835 | .892 | .055 | .733 | .749 | .760 | .157 | .724 | .694 | .750 | .124 |
Ours | .796 | .715 | .821 | .070 | .856 | .838 | .873 | .069 | .792 | .695 | .812 | .072 | .855 | .826 | .882 | .057 | .748 | .760 | .772 | .147 | .740 | .704 | .758 | .119 |
Evaluation Metrics: Four evaluation metrics are used, including Mean Absolute Error (MAE, $\mathcal{M}$), mean F-measure ($F_\beta$), mean E-measure ($E_\xi$) and the S-measure ($S_\alpha$) [41].
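As a reference, a small NumPy sketch of two of these metrics (MAE and an adaptive-threshold F-measure) for a single prediction/ground-truth pair is given below; the adaptive threshold and $\beta^2 = 0.3$ follow common practice and are assumptions about the exact protocol.

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error between prediction and binary ground truth in [0, 1]."""
    return np.abs(pred.astype(np.float64) - gt.astype(np.float64)).mean()

def f_measure(pred, gt, beta2=0.3):
    """F-measure with an adaptive threshold (twice the mean saliency value)."""
    thresh = min(2.0 * pred.mean(), 1.0)
    binary = (pred >= thresh).astype(np.float64)
    tp = (binary * gt).sum()
    precision = tp / (binary.sum() + 1e-8)
    recall = tp / (gt.sum() + 1e-8)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)
```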
4.2 Performance Comparison
Quantitative comparison: We show the performance of our model in Table 1, where models in the top two blocks are fully supervised (models in the middle block are transformer [42] based), and models in the last block (other than "Ours") are weakly supervised. The comparison shows the competitive performance of our model, making it an alternative weakly supervised saliency detection model.
Qualitative comparison: We compare predictions of our model with four benchmark models and show results in Fig. 3, which further illustrates that, with both the foreground and background constraints, our weakly supervised model can obtain relatively structure-accurate predictions.
Qualitative results (Fig. 3 and Fig. 4), columns: Image | GT | BndBox | GCut | FGCut | Ours
Running time comparison: As both the foreground and background related losses are only used during training, at test time we produce saliency maps with the saliency prediction network alone, leading to an average inference time of 0.02s per image, which is comparable with existing techniques.
4.3 Ablation Study
We conducted further experiments to explain the contribution of each component of the proposed model, and show performance of the related experiments in Table 2.
Training directly with bounding box supervision: Given bounding box supervision, a straightforward solution is training directly with the binary bounding box as supervision. We show its performance as "BndBox".
Training directly with the pseudo label from GrabCut: With the refined pseudo label obtained with GrabCut, we can train another model using it directly as the pseudo label. The performance is shown as "GCut".
Contribution of foreground structure loss: We further add the foreground structure loss to “GCut”, leading to “FGCut”.
Analysis: As shown in Table 2, directly training with bounding box supervision yields unsatisfactory results, where the model learns to regress the bounding box region (see "BndBox" in Fig. 4). Although the pseudo saliency map from GrabCut is noisy, the model trained on it can still generate reasonable predictions (see "GCut" in both Table 2 and Fig. 4). Then, with the proposed foreground and background losses as constraints ($\mathcal{L}_{fg}$ and $\mathcal{L}_{bg}$), we obtain better performance with more accurate structures.
4.4 Model Analysis
We analyse the model further to explain the advantages and limitations of the proposed method.
Impact of the maximum training epoch: We set the maximum epoch to 40 in this paper. As shown in the existing noisy-label literature, longer training is harmful in a noisy labeling setting. We further performed experiments with both longer and shorter training schedules, and observed a similar conclusion. We will investigate the optimal number of training epochs for better performance.
Impact of the channel size $C$ for new backbone feature generation in the "saliency prediction network": For dimension reduction, we feed the backbone features to four different convolutional layers to generate the new backbone features of channel size $C$. We find that model performance is influenced by $C$: a larger $C$ leads to better overall performance, but the size of the model is also significantly enlarged. To achieve a trade-off between model performance and training/testing time, we set $C$ to a moderate value. We will investigate the optimal $C$ in the future.
Edge detection as an auxiliary task for prediction structure recovery: [4] introduced an auxiliary edge detection module into their weakly supervised learning framework for structure recovery. We tried the same strategy and observed no significant performance improvement in our setting. In a multi-task learning framework, the convergence rate of each task is especially important for the final performance, and a more sophisticated multi-task learning solution is needed to fully explore the contribution of auxiliary edge detection for weakly supervised learning.
5 Conclusion
In this paper, we introduce a bounding box based weakly supervised saliency detection model. Due to the different levels of accuracy of the foreground and background annotations, we introduce two sets of loss functions to constrain the predictions within the foreground and background bounding box regions. Experimental results demonstrate the effectiveness of the proposed solution, making it an alternative for weakly supervised saliency detection. However, we observe that model performance is sensitive to the maximum number of training epochs: longer training leads to over-fitting on the noisy labels, while shorter training may lead to a less effective model. Optimal training mechanisms for noisy label learning should be investigated further.
References
- [1] Jun Wei, Shuhui Wang, and Qingming Huang, “F3net: Fusion, feedback and focus for salient object detection,” in AAAI, 2020.
- [2] Bo Wang, Quan Chen, Min Zhou, Zhiqiang Zhang, Xiaogang Jin, and Kun Gai, “Progressive feature polishing network for salient object detection.,” in AAAI, 2020, pp. 12128–12135.
- [3] Jun Wei, Shuhui Wang, Zhe Wu, Chi Su, Qingming Huang, and Qi Tian, “Label decoupling framework for salient object detection,” in CVPR, June 2020.
- [4] Jing Zhang, Xin Yu, Aixuan Li, Peipei Song, Bowen Liu, and Yuchao Dai, “Weakly-supervised salient object detection via scribble annotations,” in CVPR, 2020.
- [5] Lijun Wang, Huchuan Lu, Yifan Wang, Mengyang Feng, Dong Wang, Baocai Yin, and Xiang Ruan, “Learning to detect salient objects with image-level supervision,” in CVPR, 2017, pp. 136–145.
- [6] Duc Tam Nguyen, Maximilian Dax, Chaithanya Kumar Mummadi, Thi-Phuong-Nhung Ngo, Thi Hoai Phuong Nguyen, Zhongyu Lou, and Thomas Brox, “Deepusps: Deep robust unsupervised saliency prediction via self-supervision,” in NeurIPS, 2019, pp. 204–214.
- [7] Jing Zhang, Tong Zhang, Yuchao Dai, Mehrtash Harandi, and Richard Hartley, “Deep unsupervised saliency detection: A multiple noisy labeling perspective,” in CVPR, 2018, pp. 9029–9038.
- [8] Jing Zhang, Jianwen Xie, and Nick Barnes, “Learning noise-aware encoder-decoder from noisy labels by alternating back-propagation for saliency detection,” in ECCV, 2020.
- [9] Dingwen Zhang, Junwei Han, and Yu Zhang, “Supervision by fusion: Towards unsupervised learning of deep salient object detector,” in ICCV, 2017, pp. 4048–4056.
- [10] Guanbin Li, Yuan Xie, and Liang Lin, “Weakly supervised salient object detection using image labels,” in AAAI, 2018.
- [11] Yu Zeng, Yunzhi Zhuge, Huchuan Lu, Lihe Zhang, Mingyang Qian, and Yizhou Yu, “Multi-source weak supervision for saliency detection,” in CVPR, 2019, pp. 6074–6083.
- [12] Yuxuan Liu, Pengjie Wang, Ying Cao, Zijian Liang, and Rynson W. H. Lau, “Weakly-supervised salient object detection with saliency bounding boxes,” TIP, pp. 4423–4435, 2021.
- [13] Siyue Yu, Bingfeng Zhang, Jimin Xiao, and Eng Gee Lim, “Structure-consistent weakly supervised salient object detection with local saliency coherence,” in AAAI, 2021, pp. 3234–3242.
- [14] Jing Zhang, Yuchao Dai, Tong Zhang, Mehrtash Harandi, Nick Barnes, and Richard Hartley, “Learning saliency from single noisy labelling: A robust model fitting perspective,” TPAMI, vol. 43, no. 8, pp. 2866–2873, 2020.
- [15] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger, “On calibration of modern neural networks,” in ICML, 2017, pp. 1321–1330.
- [16] Zuyao Chen, Qianqian Xu, Runmin Cong, and Qingming Huang, “Global context-aware progressive aggregation network for salient object detection,” in AAAI, 2020, pp. 10599–10606.
- [17] Xiaoning Zhang, Tiantian Wang, Jinqing Qi, Huchuan Lu, and Gang Wang, “Progressive attention guided recurrent network for salient object detection,” in CVPR, 2018, pp. 714–722.
- [18] Youwei Pang, Xiaoqi Zhao, Lihe Zhang, and Huchuan Lu, “Multi-scale interactive network for salient object detection,” in CVPR, 2020, pp. 9413–9422.
- [19] Siyue Yu, Bingfeng Zhang, Jimin Xiao, and Eng Gee Lim, “Structure-consistent weakly supervised salient object detection with local saliency coherence,” in AAAI, 2021.
- [20] Wenguan Wang, Shuyang Zhao, Jianbing Shen, Steven CH Hoi, and Ali Borji, “Salient object detection with pyramid attention and salient edges,” in CVPR, 2019, pp. 1448–1457.
- [21] Jia-Xing Zhao, Jiang-Jiang Liu, Deng-Ping Fan, Yang Cao, Jufeng Yang, and Ming-Ming Cheng, “Egnet: Edge guidance network for salient object detection,” in ICCV, 2019, pp. 8779–8788.
- [22] Yang Wang, Yi Yang, Zhenheng Yang, Liang Zhao, Peng Wang, and Wei Xu, “Occlusion aware unsupervised learning of optical flow,” in CVPR, 2018, pp. 4884–4893.
- [23] Carsten Rother, Vladimir Kolmogorov, and Andrew Blake, ““GrabCut”: Interactive foreground extraction using iterated graph cuts,” ACM Transactions on Graphics (TOG), vol. 23, no. 3, pp. 309–314, 2004.
- [24] Cheng-Chun Hsu, Kuang-Jui Hsu, Chung-Chi Tsai, Yen-Yu Lin, and Yung-Yu Chuang, “Weakly supervised instance segmentation using the bounding box tightness prior,” in NeurIPS, 2019, vol. 32.
- [25] René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun, “Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer,” TPAMI, 2020.
- [26] Yisen Wang, Xingjun Ma, Zaiyi Chen, Yuan Luo, Jinfeng Yi, and James Bailey, “Symmetric cross entropy for robust learning with noisy labels,” in ICCV, 2019.
- [27] Qiong Yan, Li Xu, Jianping Shi, and Jiaya Jia, “Hierarchical saliency detection,” in CVPR, 2013, pp. 1155–1162.
- [28] C. Yang, L. Zhang, H. Lu, X. Ruan, and M. Yang, “Saliency detection via graph-based manifold ranking,” in CVPR, 2013, pp. 3166–3173.
- [29] Guanbin Li and Yizhou Yu, “Visual saliency based on multiscale deep features,” in CVPR, 2015, pp. 5455–5463.
- [30] Yin Li, Xiaodi Hou, Christof Koch, James M Rehg, and Alan L Yuille, “The secrets of salient object segmentation,” in CVPR, 2014, pp. 280–287.
- [31] Vida Movahedi and James H Elder, “Design and perceptual validation of performance measures for salient object segmentation,” in CVPR Workshop, 2010, pp. 49–56.
- [32] Zhe Wu, Li Su, and Qingming Huang, “Stacked cross refinement network for edge-aware salient object detection,” in ICCV, 2019, pp. 7264–7273.
- [33] Jun Wei, Shuhui Wang, and Qingming Huang, “F3net: Fusion, feedback and focus for salient object detection,” in AAAI, 2020, pp. 12321–12328.
- [34] Huajun Zhou, Xiaohua Xie, Jian-Huang Lai, Zixuan Chen, and Lingxiao Yang, “Interactive two-stream decoder for accurate and fast saliency detection,” in CVPR, 2020, pp. 9141–9150.
- [35] Binwei Xu, Haoran Liang, Ronghua Liang, and Peng Chen, “Locate globally, segment locally: A progressive architecture with knowledge review network for salient object detection,” in AAAI, 2021, pp. 3004–3012.
- [36] Miao Zhang, Tingwei Liu, Yongri Piao, Shunyu Yao, and Huchuan Lu, “Auto-msfnet: Search multi-scale fusion network for salient object detection,” in ACM Int. Conf. Multimedia, 2021, pp. 667–676.
- [37] Zhirui Zhao, Changqun Xia, Chenxi Xie, and Jia Li, “Complementary trilateral decoder for fast and accurate salient object detection,” in ACM Int. Conf. Multimedia, 2021, pp. 4967–4975.
- [38] Nian Liu, Ni Zhang, Kaiyuan Wan, Ling Shao, and Junwei Han, “Visual saliency transformer,” in ICCV, 2021, pp. 4722–4732.
- [39] Jing Zhang, Jianwen Xie, Nick Barnes, and Ping Li, “Learning generative vision transformer with energy-based latent space for saliency prediction,” in NeurIPS, 2021, vol. 34.
- [40] Xin Li, Fan Yang, Hong Cheng, Wei Liu, and Dinggang Shen, “Contour knowledge transfer for salient object detection,” in ECCV, 2018, pp. 355–370.
- [41] Deng-Ping Fan, Ge-Peng Ji, Xuebin Qin, and Ming-Ming Cheng, “Cognitive vision inspired object segmentation metric and loss function,” SCIENTIA SINICA Informationis, 2021.
- [42] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin, “Attention is all you need,” in NeurIPS, 2017.