
$^{1}$Johns Hopkins University, Baltimore, MD 21218, USA, {ryasarl1,vpatel36}@jhu.edu
$^{2}$AIBEE Inc., Palo Alto, CA 94306, USA, {rlweng,wgchoi,amirabs}@aibee.com

3SD: Self-Supervised Saliency Detection With No Labels

Rajeev Yasarla$^{1}$, Renliang Weng$^{2}$, Wongun Choi$^{2}$, Vishal Patel$^{1}$, Amir Sadeghian$^{2}$
Abstract

We present a conceptually simple self-supervised method for saliency detection. Our method generates and uses pseudo-ground truth labels for training. The generated pseudo-GT labels do not require any kind of human annotation (e.g., pixel-wise labels or weak labels like scribbles). Recent works show that features extracted from classification tasks provide important saliency cues, such as the structure and semantics of salient objects in the image. Our method, called 3SD, exploits this idea by adding a branch for a self-supervised classification task in parallel with salient object detection, to obtain class activation maps (CAM maps). These CAM maps, along with the edges of the input image, are used to generate the pseudo-GT saliency maps to train our 3SD network. Specifically, we propose a contrastive learning-based training on multiple image patches for the classification task. We show that multi-patch classification with a contrastive loss improves the quality of the CAM maps compared to naive classification on the entire image. Experiments on six benchmark datasets demonstrate that, without any labels, our 3SD method outperforms all existing weakly supervised and unsupervised methods, and its performance is on par with the fully-supervised methods. Code is available at: https://github.com/rajeevyasarla/3SD

Keywords:
salient object detection, self-supervised classification, CAM map, gated edge, pseudo-GT.
Figure 1: Comparisons of saliency maps. Self-supervised: 3SD (ours). Weakly supervised: MSW [46], Scribble [49], MFNet [32]. Unsupervised: SBF [47], MNL [50], EDNS [48]. Fully-supervised: U2Net [33], LDF [40], VST [25].

1 Introduction

Salient object detection (SOD) is the task of pixel-wise segmentation of the interesting regions that capture human attention in an image. It is widely used as a prior to improve many computer vision tasks such as visual tracking and segmentation. Early methods based on hand-crafted features, such as histograms [28], boundary connectivity [56], and high-dimensional color transforms [20], may fail to produce high-quality saliency maps on cluttered images where the foreground object is similar to the background. In recent years, deep convolutional neural networks (CNNs), and in particular fully convolutional networks (FCNs) [27], have provided excellent image segmentation and salient object detection performance.

In general, CNN-based salient object detection methods can be classified into three groups: (i) fully-supervised methods (which require large-scale datasets with pixel-wise annotations), (ii) weakly supervised methods, and (iii) unsupervised or self-supervised methods (which do not require actual pixel-wise annotations of salient objects). The main drawback of the fully-supervised methods [34, 42, 33, 40, 25] is that they require a large amount of pixel-wise annotations of salient objects, which is time-consuming and expensive. On the other hand, to minimize human effort in labeling datasets, weakly-supervised approaches [21, 46] have been proposed which address saliency detection using weak sparse labels such as image class labels [21] or image captions [46]. Alternatively, Zhang et al. [49] present a weakly-supervised SOD method based on scribble annotations. Piao et al. [32] propose a learnable directive filter-based method that extracts saliency cues using multiple labels from the attentions. Note that Piao et al. [32] and MSW [46] rely on the features or attention maps obtained from a classification task, and might fail to produce high-quality pseudo-GTs since the classification task is trained with a global class label or image caption. For example, Fig. 1 shows that the outputs of [46, 32] are not sharp and miss fine details like legs and ears. Although these weakly-supervised methods reduce the amount of labeling required for SOD, they still require labeling resources to obtain image captions [46], image class labels [21, 32], or accurate scribble annotations [49]. On the other hand, unsupervised methods [50, 30, 48] devise a refinement procedure or a generative (noise-aware) approach that utilizes hand-crafted features and/or noisy annotations. Note that the performance of these unsupervised methods relies heavily on the noisy annotations, and they might struggle to produce high-quality saliency maps if they fail to recover the underlying semantics from the noisy annotations. For instance, we can observe in Fig. 1 that the saliency outputs of [50, 30, 48] miss parts like legs and ears.

In an attempt to overcome these issues, we propose our Self-Supervised Saliency Detection (3SD) method. Our framework follows the conventional encoder-decoder structure to generate the saliency map. For the encoder, we present a novel architecture which consists of a local encoder and a global context encoder. The local encoder learns pixel-wise relationships among neighboring pixels, while the global encoder encodes the global context. The outputs of these encoders are concatenated and subsequently fed to the decoder stream. By fusing both local features and global context, we are able to extract fine-grained contour details while adhering to the underlying object structure.

For the decoder, we follow the literature [3] by adding an auxiliary classification task to capture important saliency cues, such as the semantics and segmentation of the salient object in the image, which can be extracted in the form of a class activation map (CAM map). However, performing self-supervised classification with a single global class label might result in a low-quality class activation map (illustrated in Fig. 2f, where the CAM map is incomplete). To address this issue, we propose a contrastive learning [5] based patch-wise self-supervision for the classification task, where we perform patch-wise ($32\times 32$ pixels in our implementation) classification and train it with the proposed self-supervised contrastive loss. Specifically, positive patches (patches similar to the salient object) and negative patches (patches dissimilar to the salient object) are identified. Positive patches are pulled together, and they are pushed away from the negative patches. In this way, the network strives to learn the attentions or semantic information responsible for classifying the salient object at a fine-grained patch level.

With the novel designs of the encoder and decoder, our 3SD is able to generate high-quality CAM maps (see Fig. 2). While the generated CAM maps provide salient object information, they might not have proper boundaries corresponding to the salient object. To deal with this issue, we fuse the CAM map with a gated edge map of the input image to generate the pseudo-GT saliency map (Fig. 2d).


Figure 2: The CAM map comparisons with different patch-wise auxiliary classification tasks: (a) input image, (b) ground-truth, (c) edge map of the input image, (d) pseudo-GT computed using our 3SD method, (e) saliency map computed using 3SD where we use $32\times 32$ patch-wise classification, (f)-(j) CAM maps computed when the auxiliary classification task uses a global class label, and patch-wise classification with patch sizes $16\times 16$, $24\times 24$, $32\times 32$, and $48\times 48$, respectively.

Fig. 1 compares sample results of the proposed 3SD with existing SOTA weakly supervised and unsupervised methods [47, 50, 46, 48, 49] and a SOTA fully-supervised method [40]. One can clearly observe that [47, 50, 46, 48, 49] fail to produce sharp edges and proper saliency maps. In contrast, our method provides sharper and better results. To summarize, the main contributions of our paper are as follows:

  • We propose a self-supervised 3SD method that requires no human annotations for training an SOD model. Our 3SD method is trained using high-quality pseudo-GT saliency maps generated from CAM maps using a novel self-supervised classification task.

  • We present a patch-wise self-supervised contrastive learning paradigm, which substantially improves the quality of pseudo-GT saliency maps and boosts 3SD performance.

  • We construct a novel encoder architecture for SOD that attends to features locally (pixel-level understanding) and globally (patch-level understanding).

  • Extensive experiments on six benchmark datasets show that the proposed 3SD method outperforms the SOTA weakly/unsupervised methods.

2 Related Work

Classical image processing methods address SOD using histograms [28], boundary connectivity [56], high-dimensional color transforms [20], hand-crafted features like foreground consistency [51], and similarity in superpixels [53]. In recent years, various supervised CNN-based methods have been proposed for SOD [52, 29, 39, 41, 42, 12, 34, 40] which extensively study architectural changes, attention mechanisms, multi-scale contextual information extraction, boundary-aware designs, label decoupling, etc. In contrast, our 3SD method is a self-supervised method trained using pseudo-GT data. In what follows, we will review recent weakly/unsupervised SOD methods as well as the self-supervised methods.

Weakly supervised SOD methods: To reduce the human effort and expense of pixel-level labeling, various weakly supervised methods have been proposed. These methods use high-level labels such as image class labels and image captions [21, 46], or scribble annotations [49]. Dai et al. [7] and Khoreva et al. [19] follow a bounding-box label approach for weakly-supervised segmentation. Wang et al. [38] extract cues for foreground salient objects using image-level labels. Hsu et al. [18] propose a category-based saliency map generator using image-level labels. [4, 21, 31] propose CRF-based methods for weakly supervised SOD. Zeng et al. [46] train a network with multiple source labels, such as category labels and image captions, to perform saliency detection. Zhang et al. [49] introduce scribble annotations for SOD. Unlike these methods, the proposed 3SD method does not require any kind of human annotations, noisy labels, scribble annotations, or hand-crafted features to perform SOD.

Unsupervised SOD methods: Zhang et al. [47] devise a fusion process that employs unsupervised saliency models to generate supervision. Nguyen et al. [30] propose an incremental refinement technique that employs noisy pseudo labels generated from different handcrafted methods. Zhang et al. [50] and Zhang et al. [48] propose a saliency prediction network and a noise modeling module that jointly learn from the noisy labels generated by multiple “weak” and “noisy” unsupervised handcrafted saliency methods. Unlike these unsupervised methods that rely heavily on noisy annotations, we propose a novel pseudo-GT generation technique using a patch-wise contrastive learning based self-supervised classification task.

Self-supervised/Contrastive learning methods: Several approaches explore discriminative formulations of self-supervised instance classification [9, 43]. These methods treat each image as a different class. Their main limitation is that they require comparing features across images for discrimination, which can be costly on a large-scale dataset. To address this challenge, Grill et al. [13] introduce a metric-learning approach, called BYOL, where better feature representations are learned by matching the outputs of momentum encoders. Subsequently, Caron et al. [3] propose the DINO method, which is based on Mean Teacher [36] style self-distillation without labels. Recently, [15, 5, 16, 37] propose self-supervised contrastive learning methods whose losses are inspired by noise-contrastive estimation [14], the triplet loss [17], and the N-pair loss [35]. Motivated by [1, 5], we perform patch-wise contrastive learning within the self-supervised classification framework to obtain high-quality CAM maps, leading to good-quality pseudo-GTs.

Figure 3: Overview of the 3SD method in computing the pseudo-GT for a given input image.

3 Proposed Method

The proposed 3SD method is a fully self-supervised approach that does not require any human annotations or noisy labels. As shown in Fig. 3, 3SD consists of a base network (BN) and a pseudo-label generator. The base network outputs the saliency map along with the class labels which are used to generate the CAM map. BN utilizes a self-supervised classification task to extract the semantic information for the CAM map. The pseudo-label generator fuses the gated edge map of the input image with the CAM map to compute the pseudo-GT for training. In this section, we discuss: (i) the construction of the base network (BN), and (ii) the pseudo-label generator.

3.1 Base Network

We construct our framework with two encoders (a local encoder $E_L$ and a global encoder $E_G$) and two decoders (a saliency decoder $De_S$ and a classification decoder $De_C$) as shown in Fig. 3. Both local features and global context are vital for the SOD task. To learn pixel-wise relationships with local features, the local encoder $E_L$ is constructed using a structure similar to U2Net [33]. Specifically, $E_L$ has a nested two-level U-structure network with ResUBlocks to capture contextual information at different scales. Even though $E_L$ learns the inter-dependency between neighboring pixels within its receptive field (approximately $96\times 96$), it fails to capture the global context for high-resolution images. To remedy this issue, we introduce a transformer-based encoder $E_G$ to model long-range relationships across patches (inspired by ViT [8]). By combining the outputs of both encoders, our 3SD is able to capture local fine-grained details and reason globally. The combined encoded features are fed to the saliency decoder $De_S$ to obtain the saliency map, and to the classification decoder $De_C$ to obtain the class label map. We use an architecture similar to the one proposed in [33] for the saliency decoder $De_S$. Our major contribution is in the classification decoder $De_C$. In contrast to the conventional single-image-single-label classification design, our decoder performs self-supervised learning patch-wise. To enhance the feature representation and fully exploit the semantics, patches belonging to the object are encouraged to be differentiated from background patches using our novel contrastive learning paradigm. More details about $E_L$, $E_G$, $De_S$, and $De_C$ are provided in the supplementary document.

As shown in Fig. 3, given an input image, BN predicts the saliency map $\hat{s}$ and the patch-wise class logits map $C$. It is trained in a fully self-supervised way using our proposed pseudo-label generator, which is described in the next subsection.
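To make this data flow concrete, a minimal PyTorch-style sketch of the BN forward pass is given below. The module wrappers, the feature shapes, and the assumption that $E_G$'s output is reshaped to match $E_L$'s spatial resolution are illustrative, not the exact implementation.

import torch
import torch.nn as nn

class BaseNetwork(nn.Module):
    # Illustrative sketch of BN: local encoder E_L, transformer-based global
    # encoder E_G, saliency decoder De_S, and patch-wise classification
    # decoder De_C. The four sub-modules are assumed to be defined elsewhere.
    def __init__(self, local_enc, global_enc, sal_dec, cls_dec):
        super().__init__()
        self.E_L, self.E_G = local_enc, global_enc
        self.De_S, self.De_C = sal_dec, cls_dec

    def forward(self, x):                              # x: (B, 3, H, W)
        f_local = self.E_L(x)                          # pixel-wise local features
        f_global = self.E_G(x)                         # long-range patch context,
                                                       # assumed reshaped to f_local's size
        feats = torch.cat([f_local, f_global], dim=1)  # fuse along channels
        s_hat = torch.sigmoid(self.De_S(feats))        # saliency map, (B, 1, H, W)
        C = self.De_C(feats)                           # patch-wise class logits,
                                                       # e.g. (B, K, H/32, W/32)
        return s_hat, C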


Figure 4: The CAM map comparison with different patch-wise auxiliary classification tasks. (a) input image, (b) CAM map computed when the auxiliary classification uses one global class label, and patch-wise classification with patch sizes (c) $16\times 16$, (d) $24\times 24$, (e) $32\times 32$, (f) $48\times 48$.

3.2 Pseudo-label generator

The main goal of 3SD is to train an SOD network without any GT labels of salient objects. To achieve this, 3SD should be able to extract structural or semantic information about the object in the image. From earlier works [32, 21] it is evident that attention maps from a classification task provide important cues for salient object detection. In contrast to [32, 21], which require image class and/or caption labels, we train our BN with a student-teacher knowledge-distillation technique and find that the features from the encoders of BN contain structural and semantic information about the salient object in the image. These semantics are the key to generating high-quality CAM maps [55]. However, as shown in Fig. 4, the quality of CAM maps obtained from the self-supervised image-wise classification task, when trained with a single global class label, might not be high enough to produce a pseudo-GT. This is because a single-label classification task does not need to attend to all object regions; instead, it drives the model to focus on the most discriminative object parts. To address this issue, we propose contrastive learning based patch-wise classification on image patches, as shown in Fig. 5. Patch-wise learning drives 3SD to capture the local structures that constitute the object. Fig. 4 (b)-(e) show a CAM map comparison between the different classification tasks. Finally, by guiding the CAM map with the edge information of the input image, we obtain a high-quality pseudo-GT ($s_{pseudo}$). This pseudo-GT is taken as the training label to update the parameters of the 3SD network.

3.2.1 Self-supervised classification.

Student-teacher knowledge distillation is a well-known learning paradigm that is commonly used for self-supervised classification tasks. As shown in Fig. 4, the self-supervised classification task as done in [3] fails to produce high-quality pseudo-GTs. This is because image-wise self-supervised classification with one global class only requires a few important activations on the salient object, leading to an incomplete saliency map. To overcome this hurdle, our proposed self-supervised classification contains (i) global-level self-supervision using the $\mathcal{L}_{st}$ loss, and (ii) patch-wise contrastive learning based self-supervision using the $\mathcal{L}_{\rho}$ loss.

Figure 5: Self-distillation based training for classification. Here, “ema” denotes the EMA update rule used to update the teacher network’s parameters, and “sg” denotes stop gradient.

In our 3SD, the student and teacher networks share the same architecture as the BN, and we denote the student and teacher networks as $f_{\theta_s}$ and $f_{\theta_t}$, respectively, with $\theta_s$ and $\theta_t$ as their corresponding parameters. Given an input image, we compute the class logits maps $C_s$ and $C_t$ for both networks, as well as their corresponding softmax probability outputs $P(C_s)$ and $P(C_t)$. Meanwhile, we obtain image-wise class logits for the student model: $c_s=\sum_{\{p_i\}}C_s(p_i)$, where $p_i$ is the $i^{th}$ patch in the image. A similar expression holds for the teacher's logits $c_t$. Given a fixed teacher network $f_{\theta_t}$, we match the image-level class probability distributions $P(c_s)$ and $P(c_t)$ by minimizing the cross-entropy loss to update the parameters of the student network $f_{\theta_s}$:

\mathcal{L}_{st} = Q(P(c_{s}),P(c_{t})), \quad (1)

where $Q(a,b)=-a\log b$. Note that $\mathcal{L}_{st}$ is used to update the student network's parameters $\theta_s$ with stochastic gradient descent. The teacher weights are updated using the exponential moving average (EMA) update rule as in [3, 13], i.e., $\theta_t \leftarrow \lambda\theta_t+(1-\lambda)\theta_s$, with $\lambda$ following a cosine schedule from 0.996 to 1 during training. Additionally, in order to improve the quality of the CAM maps, we perform patch-wise contrastive learning for the classification task, where we identify the top $M_\rho$ positive patches (patches similar to the salient object) and $M_\rho$ negative patches (patches dissimilar to the salient object) from $C_s$ and $C_t$. Using these positive and negative pairs, we construct the following contrastive learning [5] based loss,

\mathcal{L}_{\rho}=\frac{1}{M_{\rho}^{2}}\sum_{p^{s}_{i}\in P^{s}}\sum_{p^{t}_{j}\in P^{t}}-\log\frac{\exp\left(\operatorname{sim}\left(C_{s}(p^{s}_{i}),C_{t}(p^{t}_{j})\right)/\tau\right)}{\sum_{n_{k}\in N^{t}\cup N^{s}}\exp\left(\operatorname{sim}\left(C_{s}(p^{s}_{i}),C_{t}(n_{k})\right)/\tau\right)} \quad (2)

where $P^s$ and $N^s$ are the sets of $M_\rho$ positive and negative patches from $C_s$, respectively, and $P^t$ and $N^t$ are the sets of $M_\rho$ positive and negative patches from $C_t$, respectively. Concretely, $P^s$ and $N^s$ are extracted from $C_s$ by comparing $C_s(p_i)$ with $c_s$ using a normalized dot product. Similarly, $P^t$ and $N^t$ are obtained from $C_t$. From Fig. 4, we can observe that adding the loss $\mathcal{L}_\rho$ drives the student network to extract class-sensitive local features across the whole object region, which results in better quality CAM maps.
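A minimal sketch of how Eq. (2) can be computed is shown below. It assumes the logits maps have already been flattened to one vector per patch, takes $\operatorname{sim}(\cdot,\cdot)$ to be cosine similarity (the normalized dot product mentioned above), and uses illustrative function names; it is a sketch, not the exact implementation.

import torch
import torch.nn.functional as F

def select_patches(C, c_img, m_rho=10):
    # C: (N, K) per-patch logits; c_img: (K,) image-level logits (sum over patches).
    # Rank patches by normalized dot product with the image-level logits: the most
    # similar patches are taken as positives, the least similar as negatives.
    sim = F.cosine_similarity(C, c_img.unsqueeze(0), dim=1)        # (N,)
    order = sim.argsort(descending=True)
    return C[order[:m_rho]], C[order[-m_rho:]]                     # positives, negatives

def contrastive_loss(C_s, C_t, c_s, c_t, m_rho=10, tau=0.1):
    # Eq. (2): pull student positives toward teacher positives and push them
    # away from the negatives pooled from both networks.
    P_s, N_s = select_patches(C_s, c_s, m_rho)
    P_t, N_t = select_patches(C_t, c_t, m_rho)
    negatives = torch.cat([N_t, N_s], dim=0)                       # (2*m_rho, K)
    loss = 0.0
    for p_s in P_s:
        num = torch.exp(F.cosine_similarity(p_s.unsqueeze(0), P_t, dim=1) / tau)
        den = torch.exp(F.cosine_similarity(p_s.unsqueeze(0), negatives, dim=1) / tau).sum()
        loss = loss - torch.log(num / den).sum()
    return loss / (m_rho ** 2)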

In summary, the following steps are performed in the self-supervised classification task to learn the student network weights: (i) Augment the image with two different augmentations and randomly crop the results to obtain $x^{\prime l}$ (local crop, smaller resolution) and $x^{\prime\prime g}$ (global crop, larger resolution) as shown in Fig. 5. (ii) Fix the teacher network's parameters. (iii) Pass $x^{\prime l}$ as input to the student network and $x^{\prime\prime g}$ as input to the teacher network, to obtain the output class logits maps $C^{\prime}_s$ and $C^{\prime\prime}_t$, respectively. (iv) Match the outputs by minimizing the loss $\mathcal{L}_{st}$ and update the student network's parameters $\theta_s$. (v) Additionally, extract positive and negative patches from $C_s$ and $C_t$, and update the student network's parameters $\theta_s$ using $\mathcal{L}_\rho$. (vi) Finally, employ the EMA update rule to update the teacher network's parameters $\theta_t$. Fig. 5 shows an overview diagram of these steps for updating the student network's parameters.
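A condensed sketch of steps (i)-(vi) is shown below. The augmentation and cropping functions are placeholders, the teacher centering of Sec. 4.3 is omitted, and, following the DINO convention [3], the teacher distribution is treated as the fixed target of the cross-entropy in Eq. (1); shapes and names are assumptions rather than the exact implementation.

import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(student, teacher, lam):
    # Step (vi): theta_t <- lam * theta_t + (1 - lam) * theta_s.
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(lam).add_(p_s, alpha=1.0 - lam)

def distillation_step(student, teacher, x, augment1, augment2, local_crop, global_crop):
    # Steps (i)-(iv): two augmented views, local crop to the student, global crop
    # to the frozen teacher, then match the image-level class distributions.
    x_l = local_crop(augment1(x))
    x_g = global_crop(augment2(x))
    _, C_s = student(x_l)                              # patch-wise logits (B, K, h, w)
    with torch.no_grad():
        _, C_t = teacher(x_g)
    c_s = C_s.flatten(2).sum(dim=2)                    # image-level logits (B, K)
    c_t = C_t.flatten(2).sum(dim=2)
    loss_st = -(F.softmax(c_t, dim=1) * F.log_softmax(c_s, dim=1)).sum(dim=1).mean()
    return loss_st, C_s, C_t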

3.2.2 Pseudo-GT

When 3SD is trained with self-supervised classification, the features from the classification decoder explicitly contain the structural and semantic information of the salient object in the image. Hence, we compute the CAM (using the technique from [55]) to obtain this semantic information about the salient object. We can clearly observe in Fig. 6 (c) and (d) that the obtained CAM map does not depict a proper or sharp boundary. To overcome this problem, we compute an edge map using the RCF edge detection network [26]. We then dilate the thresholded CAM map and use it to gate the edge map, obtaining the edges of the salient objects. This design effectively removes the background edges. The pseudo-GT ($s_{pseudo}$) is then defined as the union of the gated edge map and the CAM map, i.e., $s_{pseudo}=CM\cup g_e$, where $CM$ is the CAM map and $g_e$ is the gated edge map. Fig. 6 (g) and (h) show the pseudo-GT ($s_{pseudo}$) and its thresholded version (with threshold 0.5). We compute the cross-entropy between the pseudo-GT ($s_{pseudo}$) and the predicted saliency map $\hat{s}$ of 3SD,

\mathcal{L}_{pgt}=-\sum_{(j,k)}^{(H,W)}\left[s_{pseudo}(j,k)\log\hat{s}(j,k)+\left(1-s_{pseudo}(j,k)\right)\log\left(1-\hat{s}(j,k)\right)\right], \quad (3)

where $(j,k)$ are pixel coordinates, and $(H,W)$ are the height and width of the image. We minimize $\mathcal{L}_{pgt}$ and update the 3SD parameters to perform SOD.
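A sketch of the pseudo-GT construction and the loss in Eq. (3) is given below. The CAM threshold, the dilation kernel size, the use of OpenCV for dilation, and the implementation of the union as an element-wise maximum are assumptions made for illustration; the edge map itself comes from the RCF network [26].

import cv2
import numpy as np
import torch.nn.functional as F

def make_pseudo_gt(cam, edge, thresh=0.5, kernel=15):
    # cam, edge: HxW numpy arrays in [0, 1]. Dilate the thresholded CAM map,
    # use it to gate the edge map (suppressing background edges), and take the
    # union of the CAM map and the gated edges (here, an element-wise maximum).
    cam_bin = (cam > thresh).astype(np.uint8)
    gate = cv2.dilate(cam_bin, np.ones((kernel, kernel), np.uint8))
    g_e = edge * gate                      # gated edge map
    return np.maximum(cam, g_e)            # s_pseudo = CM "union" g_e

def pseudo_gt_loss(s_hat, s_pseudo):
    # Eq. (3): pixel-wise binary cross-entropy between the predicted saliency
    # map and the pseudo-GT; both are torch tensors of the same shape in [0, 1].
    return F.binary_cross_entropy(s_hat, s_pseudo)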


Figure 6: (a) input image, (b) edge map, (c) CAM map computed by 3SD, (d) thresholded CAM map, (e) dilated CAM map, (f) gated edge map, (g) pseudo-GT, (h) thresholded pseudo-GT.

3.3 Loss

To improve the boundary estimation for SOD, we further introduce the gated structure-aware loss ($\mathcal{L}_{gs}$) proposed by [49]. The gated structure-aware loss ($\mathcal{L}_{gs}$) is defined as follows,

\mathcal{L}_{gs}=\sum_{(i,j)}\sum_{d\in(\overrightarrow{h},\overrightarrow{v})}\Psi\left(|\partial_{d}s(i,j)|\exp\left(-0.5|\partial_{d}(g\cdot x(i,j))|\right)\right), \quad (4)

where $\Psi(s)=\sqrt{s^{2}+e^{-6}}$, $(i,j)$ are pixel coordinates, $(\overrightarrow{h},\overrightarrow{v})$ are the horizontal and vertical directions, $\partial$ is the derivative operator, and $g$ is the gate for the structure loss (see Fig. 6(e)).
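A sketch of Eq. (4) is shown below. It assumes the saliency map and the gated image $g\cdot x$ are 4-D tensors, approximates the derivatives with first-order finite differences, and averages the image gradient over color channels; these choices are illustrative assumptions.

import math
import torch

def gated_structure_loss(s, gx, eps=math.exp(-6)):
    # s: predicted saliency map (B, 1, H, W); gx: gated input image g * x (B, C, H, W).
    # Penalize saliency gradients except where the gated image itself has edges.
    def grad_h(t): return t[..., :, 1:] - t[..., :, :-1]    # horizontal derivative
    def grad_v(t): return t[..., 1:, :] - t[..., :-1, :]    # vertical derivative
    def psi(t): return torch.sqrt(t ** 2 + eps)             # Psi(s) = sqrt(s^2 + e^-6)
    loss_h = psi(grad_h(s).abs()
                 * torch.exp(-0.5 * grad_h(gx).abs().mean(dim=1, keepdim=True)))
    loss_v = psi(grad_v(s).abs()
                 * torch.exp(-0.5 * grad_v(gx).abs().mean(dim=1, keepdim=True)))
    return loss_h.sum() + loss_v.sum()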

The overall loss function used for training the 3SD network is defined as follows,

\mathcal{L}_{total}=\mathcal{L}_{st}+\mathcal{L}_{\rho}+\mathcal{L}_{pgt}+\beta_{1}\mathcal{L}_{gs}(\hat{s},x). \quad (5)

We set $\beta_{1}=0.3$ in all our experiments.
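Using the loss sketches introduced above (all of them illustrative), Eq. (5) can be assembled as follows; here loss_st and loss_rho come from the distillation and contrastive sketches, and gx denotes the gated input $g\cdot x$.

def total_loss(loss_st, loss_rho, s_hat, s_pseudo, gx, beta1=0.3):
    # Eq. (5): L_total = L_st + L_rho + L_pgt + beta1 * L_gs.
    return (loss_st + loss_rho
            + pseudo_gt_loss(s_hat, s_pseudo)
            + beta1 * gated_structure_loss(s_hat, gx))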

4 Experiments and results

In this section, we present the details of various experiments conducted on six benchmarks to demonstrate the effectiveness of the proposed 3SD method. Additionally, we perform extensive ablation studies which show the factors and parameters that influence the performance of the proposed 3SD method.

4.1 Datasets

Training dataset: We use images from the DUTS-TR dataset which is the training dataset of DUTS [38]. Note that we only use the images to train 3SD, i.e., we don’t use the corresponding salient object annotations. The DUTS-TR dataset contains 10,553 images, and it is the largest and most commonly used dataset for training SOD.

Evaluation datasets: We evaluate the performance of 3SD on six benchmark test datasets: (i) The DUTS-TE testing dataset (5,019 images) [38], (ii) DUT-OMRON (5,168 test images) [45], (iii) HKU-IS [22] (4,447 test images), (iv) PASCAL-S [23] (850 test images), (v) ECSSD [44] (1,000 images), and (vi) THUR [6] (6,233 images).

4.2 Comparison methods and metrics

We compare the performance of 3SD with seven SOTA weakly/unsupervised methods [47, 21, 50, 46, 49, 48], and eleven SOTA fully-supervised methods [52, 29, 39, 24, 54, 41, 42, 12, 34, 33, 40]. We use the following five evaluation metrics for comparison: structure measure [10] ($S_m$), saliency structure measure [49] ($B_\mu$), mean F-score ($F_\beta$), mean E-measure [11] ($E_\eta$), and mean absolute error ($MAE$). Additionally, we plot precision vs. recall curves to show the effectiveness of our 3SD method.
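For reference, a short sketch of two of these metrics is given below; the fixed threshold and $\beta^2=0.3$ are the common choices but are assumptions here, and the exact evaluation protocol (e.g., adaptive vs. fixed thresholds for $F_\beta$) can differ across papers.

import numpy as np

def mae(pred, gt):
    # Mean absolute error between a saliency map and the ground truth (both in [0, 1]).
    return np.abs(pred - gt).mean()

def f_measure(pred, gt, thresh=0.5, beta2=0.3):
    # F-beta of the thresholded saliency map; beta^2 = 0.3 weights precision higher.
    binary = pred >= thresh
    gt_bin = gt > 0.5
    tp = np.logical_and(binary, gt_bin).sum()
    precision = tp / (binary.sum() + 1e-8)
    recall = tp / (gt_bin.sum() + 1e-8)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)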

4.3 Implementation Details

Given a training image, we randomly crop $320\times 320$ pixels and use it for training 3SD. As explained in the earlier sections, we perform patch-wise contrastive learning based self-supervised classification without labels. Specifically, we learn the teacher centers $T_c$ (a $K$-dimensional vector) to sharpen the teacher's output before computing $\mathcal{L}_{st}$. Here, we sharpen the teacher's output class logits map with $T_c$ as shown in Fig. 5, i.e., $C_t(p_i)=C_t(p_i)+T_c$. In all of our experiments, we perform $32\times 32$ patch-wise self-supervised classification in our 3SD method. For augmentations, we use the augmentations of BYOL [13] (Gaussian blur, solarization, and color jittering) and multi-crop [2] with bi-cubic interpolation to resize the crops. In our 3SD method, the student and teacher networks share the same architecture as BN with different weights. Initially, we train the student and teacher networks for self-supervised classification, as explained in Section 3.2, for 30 epochs. Later, we perform patch-wise self-supervised classification and pseudo-GT based salient object detection in an iterative manner, as explained in the algorithm provided in the supplementary document. We set $M_\rho$ to 10 for all our experiments. We train 3SD for 150 epochs using a stochastic gradient descent optimizer with batch size 8 and learning rate 0.001. Following DUTS [38], we set $K=200$ classes in all of our experiments. The student network ($f_{\theta_s}$) in 3SD is used for all evaluations during inference. Algorithm 1 shows the steps involved in training our 3SD method.

Input: Set of training images $\mathcal{D}=\{x^{i}\}_{i=1}^{N}$; teacher centers $T_c$, a $K$-dimensional vector.
Result: $\hat{\theta}_{s}$, optimized network parameters of 3SD
for every epoch do
       for $x\in\mathcal{D}$ do
             ### self-supervised classification ###
             # augment images #
             $x^{\prime},\,x^{\prime\prime}$ = augment1($x$), augment2($x$)
             $x^{\prime l},\,x^{\prime\prime g}$ = localview($x^{\prime}$), globalview($x^{\prime\prime}$)
             $s_{s},C_{s}=f_{\theta_{s}}(x^{\prime l})$  # student output
             $s_{t},C_{t}=f_{\theta_{t}}(x^{\prime\prime g})$  # teacher output
             extract $P^{s}$ and $N^{s}$ from $C_{s}$; extract $P^{t}$ and $N^{t}$ from $C_{t}$
             compute $\mathcal{L}_{st}+\mathcal{L}_{\rho}$; update $\theta_{s}$
             # EMA update of teacher weights and centers #
             $\theta_{t}\leftarrow\lambda\theta_{t}+(1-\lambda)\theta_{s}$; $\lambda\in[0.996,1]$
             $T_{c}\leftarrow\lambda T_{c}+(1-\lambda)T_{c,t}$
             ### pseudo-GT generation ###
             compute the CAM map ($CM$) for image $x$ using the classification branch of $f_{\theta_{s}}$
             obtain $g_{e}$ using $CM$ and the edge map $e$ (computed using [26])
             $s_{pseudo}=CM\cup g_{e}$
             ### update student network weights $f_{\theta_{s}}$ for the SOD task ###
             $\hat{s},\hat{c}=f_{\theta_{s}}(x)$
             using $s_{pseudo}$, compute $\mathcal{L}_{pgt}$ and $\mathcal{L}_{gs}(\hat{s},x)$
             update $\theta_{s}$ using $\mathcal{L}_{pgt}+\beta_{1}\mathcal{L}_{gs}(\hat{s},x)$
       end for
end for
Algorithm 1: Pseudo code for training 3SD.
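A small sketch of the teacher centering used in Algorithm 1 and described above is shown below; following the text, the center $T_c$ is added to the teacher's class logits map, and is itself updated with an EMA of the teacher's outputs (the batch-mean estimate of $T_{c,t}$ and the fixed momentum are assumptions).

import torch

class TeacherCenter:
    # Maintains the K-dimensional teacher center T_c and applies it to the
    # teacher's class logits map before the distillation loss is computed.
    def __init__(self, num_classes=200, momentum=0.996):
        self.T_c = torch.zeros(num_classes)
        self.momentum = momentum

    def sharpen(self, C_t):
        # C_t: (B, K, h, w) teacher logits map; add the center per class channel,
        # i.e. C_t(p_i) = C_t(p_i) + T_c.
        return C_t + self.T_c.view(1, -1, 1, 1)

    @torch.no_grad()
    def update(self, C_t):
        # EMA update T_c <- lam * T_c + (1 - lam) * T_{c,t}, with T_{c,t} taken
        # here as the batch mean of the teacher's logits.
        batch_center = C_t.mean(dim=(0, 2, 3))
        self.T_c = self.momentum * self.T_c + (1 - self.momentum) * batch_center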

4.4 Comparisons

Quantitative results: As shown in Table 1, our 3SD method consistently outperforms the SOTA weakly/unsupervised methods on all datasets and in all metrics. This shows that training with our pseudo-GT generation is superior to the existing weakly/unsupervised SOD techniques. Moreover, although 3SD is not trained with any human annotations or weak labels (e.g., image captions, handcrafted features, or scribble annotations), its performance is on par with the fully-supervised methods. In some cases, the performance of 3SD is even better than that of fully-supervised methods such as [29, 24, 54]. Similar behavior can be observed in the precision vs. recall curves shown in Fig. 8. The curves corresponding to our method are close to those of the SOTA fully-supervised methods, while the curves corresponding to the SOTA weakly/unsupervised methods lie far below ours.

Qualitative results: Fig. 7 illustrates qualitative comparisons of our 3SD method with SOTA methods on 4 sample images from DUTS-TE, DUT-OMRON, HKU-IS, and ECSSD. It can be seen that the outputs of [47, 46, 49, 48] are blurred or incomplete, and include parts of non-salient objects. In contrast, the 3SD outputs are accurate, clear, and sharp. For example, the output saliency maps of [47, 46, 49, 48] in the second row of Fig. 7 miss parts of the horses' legs, and those in the third row contain artifacts or parts of non-salient objects. In contrast, the saliency maps of our 3SD method delineate the horses' legs properly and are free of artifacts. More qualitative comparisons are provided in the supplementary material.

Table 1: Comparison of our method with SOTA methods on six benchmark datasets (DUTS-TE, DUT-OMRON, HKU-IS, PASCAL, ECSSD and THUR) using the metrics $S_m$, $B_\mu$, $F_\beta$, $E_\eta$, and $MAE$, where $\uparrow$ and $\downarrow$ denote that larger and smaller is better, respectively.
Columns (left to right). Fully supervised: PiCANet [24] (CVPR18), MSNet [41] (CVPR19), CPD [42] (CVPR19), BASNet [34] (CVPR19), U2Net [33] (PR2020), LDF [40] (CVPR20), VST [25] (ICCV21), BN (ours). Weakly supervised: WSS [38] (CVPR17), WSI [21] (AAAI18), MSW [46] (CVPR19), Scribble_S [49] (CVPR20), MFNet [32] (ICCV21). Unsupervised: SBF [47] (ICCV17), MNL [50] (CVPR18), EDNS [48] (ECCV20), 3SD (ours).

DUTS-TE
$S_m\uparrow$ 0.851 0.851 0.867 0.866 0.861 0.881 0.885 0.883 | 0.748 0.697 0.759 0.793 0.775 | 0.739 0.813 0.828 0.844
$B_\mu\downarrow$ 0.635 0.582 0.462 0.400 0.365 0.384 0.347 0.358 | 0.780 0.879 0.829 0.603 0.610 | 0.808 0.712 0.628 0.443
$F_\beta\uparrow$ 0.757 0.792 0.8246 0.823 0.804 0.855 0.870 0.829 | 0.633 0.569 0.648 0.746 0.770 | 0.622 0.725 0.747 0.765
$E_\eta\uparrow$ 0.853 0.883 0.902 0.896 0.897 0.910 0.939 0.913 | 0.806 0.690 0.742 0.865 0.839 | 0.763 0.853 0.859 0.875
$MAE\downarrow$ 0.062 0.050 0.043 0.048 0.044 0.034 0.037 0.036 | 0.100 0.116 0.091 0.062 0.076 | 0.107 0.075 0.060 0.044

DUT-OMRON
$S_m\uparrow$ 0.826 0.809 0.818 0.836 0.842 0.847 0.839 0.843 | 0.730 0.759 0.756 0.771 0.742 | 0.731 0.733 0.791 0.803
$B_\mu\downarrow$ 0.685 0.642 0.549 0.480 0.438 0.432 0.455 0.429 | 0.830 0.839 0.890 0.655 0.658 | 0.812 0.776 0.689 0.501
$F_\beta\uparrow$ 0.710 0.709 0.739 0.767 0.757 0.773 0.800 0.777 | 0.590 0.641 0.597 0.702 0.646 | 0.612 0.597 0.701 0.735
$E_\eta\uparrow$ 0.823 0.830 0.845 0.865 0.867 0.873 0.883 0.869 | 0.729 0.761 0.728 0.835 0.803 | 0.763 0.712 0.816 0.845
$MAE\downarrow$ 0.072 0.064 0.057 0.057 0.054 0.051 0.058 0.049 | 0.110 0.100 0.109 0.068 0.087 | 0.108 0.103 0.070 0.660

HKU-IS
$S_m\uparrow$ 0.906 0.907 0.904 0.909 0.916 0.919 0.919 0.918 | 0.822 0.808 0.818 0.855 0.846 | 0.812 0.860 0.890 0.904
$B_\mu\downarrow$ 0.561 0.498 0.421 0.359 0.325 0.337 0.338 0.321 | 0.752 0.782 0.830 0.537 0.522 | 0.734 0.627 0.567 0.381
$F_\beta\uparrow$ 0.854 0.878 0.895 0.903 0.890 0.914 0.922 0.918 | 0.773 0.763 0.734 0.857 0.851 | 0.783 0.820 0.878 0.885
$E_\eta\uparrow$ 0.909 0.930 0.940 0.943 0.945 0.954 0.962 0.959 | 0.819 0.800 0.786 0.923 0.921 | 0.855 0.858 0.919 0.935
$MAE\downarrow$ 0.047 0.039 0.033 0.032 0.031 0.027 0.030 0.024 | 0.079 0.089 0.084 0.047 0.059 | 0.075 0.065 0.043 0.039

PASCAL
$S_m\uparrow$ 0.848 0.844 0.848 0.838 0.844 0.851 0.863 0.856 | - - 0.697 0.742 0.770 | 0.712 0.728 0.750 0.761
$B_\mu\downarrow$ 0.704 0.671 0.616 0.582 0.513 0.512 0.504 0.509 | 0.831 0.855 0.870 0.665 0.679 | 0.815 0.776 0.739 0.531
$F_\beta\uparrow$ 0.799 0.813 0.822 0.821 0.797 0.848 0.829 0.832 | 0.698 0.653 0.685 0.788 0.751 | 0.735 0.748 0.759 0.763
$E_\eta\uparrow$ 0.804 0.822 0.819 0.821 0.831 0.865 0.865 0.849 | 0.690 0.647 0.693 0.798 0.817 | 0.746 0.741 0.794 0.810
$MAE\downarrow$ 0.129 0.119 0.122 0.122 0.074 0.060 0.067 0.068 | 0.184 0.206 0.178 0.140 0.115 | 0.167 0.158 0.142 0.137

ECSSD
$S_m\uparrow$ 0.867 0.905 0.905 0.910 0.918 0.912 0.917 0.924 | 0.808 0.805 0.825 0.854 0.834 | 0.813 0.845 0.860 0.883
$B_\mu\downarrow$ 0.592 0.542 0.434 0.364 0.321 0.323 0.348 0.319 | 0.808 0.801 0.851 0.550 0.581 | 0.759 0.681 0.600 0.387
$F_\beta\uparrow$ 0.871 0.885 0.907 0.913 0.910 0.930 0.929 0.933 | 0.767 0.762 0.761 0.865 0.854 | 0.782 0.810 0.852 0.887
$E_\eta\uparrow$ 0.909 0.922 0.932 0.938 0.936 0.925 0.945 0.940 | 0.796 0.792 0.787 0.908 0.885 | 0.835 0.836 0.883 0.928
$MAE\downarrow$ 0.054 0.048 0.043 0.040 0.033 0.034 0.034 0.032 | 0.108 0.068 0.098 0.061 0.084 | 0.096 0.090 0.071 0.049

THUR
$S_m\uparrow$ - 0.819 0.831 0.823 0.828 0.847 - 0.852 | 0.775 - - 0.794 - | 0.762 0.804 0.810 0.827
$B_\mu\downarrow$ 0.659 0.620 0.525 0.489 0.531 0.469 - 0.464 | 0.788 - - 0.596 - | 0.785 0.717 0.656 0.487
$F_\beta\uparrow$ 0.710 0.718 0.750 0.737 0.749 0.764 - 0.769 | 0.653 - - 0.718 - | 0.627 0.691 0.719 0.746
$E_\eta\uparrow$ 0.821 0.829 0.851 0.841 0.843 0.842 - 0.846 | 0.775 - - 0.837 - | 0.770 0.807 0.838 0.840
$MAE\downarrow$ 0.084 0.079 0.094 0.073 0.075 0.064 - 0.061 | 0.097 - - 0.077 - | 0.107 0.086 0.070 0.064

Parameter size (MB) 197 - 183 349 176 98 82 219 | 53 - 119 68 47 | 376 170 - 219

Image       GT       [40]   BN(ours)    [49]       [32]      [47]       [46]      [48]  3SD(ours)

Figure 7: Qualitative comparison of the 3SD method against SOTA fully-supervised and SOTA weakly/unsupervised methods.
Figure 8: Precision vs. recall curve comparison of our method with SOTA methods on the ECSSD, HKU-IS, PASCAL, DUTS-TE, and DUT-OMRON datasets. In the precision vs. recall graphs, we represent all supervised methods with thick lines and weakly/unsupervised methods with dotted lines.

4.5 Ablation study

The goal of these ablation experiments is to analyze the components in the pseudo-GT generation that affect the performance of 3SD. We perform the following experiments: patch-wise contrastive learning based self-supervised classification with different patch sizes (PCL16: $16\times 16$, PCL24: $24\times 24$, PCL32: $32\times 32$, PCL48: $48\times 48$ pixels); self-supervised classification with one global class label (GCL); PCM: generating the pseudo-GT without using the edge map $e$ (using $CM$ as $s_{pseudo}$); PGE: generating the pseudo-GT without using the CAM map $CM$ (using $g_e$ as $s_{pseudo}$); and LPG: training 3SD with the pseudo-GT ($s_{pseudo}=CM\cup g_e$) but without $\mathcal{L}_{gs}$.

Impact of patch-wise contrastive learning based self-supervised classification: As can be seen from Table 2, GCL (global class-wise classification) fails to produce proper CAM maps (see Fig. 4), which results in lower performance compared to patch-wise contrastive learning based self-supervised classification with patch sizes PCL16, PCL24, PCL32, and PCL48 ($16\times 16$, $24\times 24$, $32\times 32$, and $48\times 48$, respectively). From Table 2 and Fig. 4, it is evident that when we increase the patch size in patch-wise self-supervised classification from $16\times 16$ to $32\times 32$, we obtain better quality CAM maps, which in turn result in a better pseudo-GT and, as a result, better SOD performance. We obtained the best performance with the $32\times 32$ setting. Note that a larger patch size does not further improve the performance of the 3SD method.

Impact of Pseudo-GT: As explained in Section 3.2, the pseudo-GT is defined as $s_{pseudo}=CM\cup g_e$. Here we perform experiments to validate the important roles played by the CAM map ($CM$) and the gated edge map ($g_e$) in the construction of the pseudo-GT. From the PCM and PGE columns of Table 2, we can clearly observe a large improvement in performance when we use $CM$ as the pseudo-GT instead of $g_e$. This shows that the $CM$ obtained from patch-wise self-supervised classification contains more consistent semantic information than the gated edge map ($g_e$). Furthermore, in the LPG column of Table 2, the combination of both $CM$ and $g_e$ brings a further increase in performance. The PCL32 column in Table 2 corresponds to the case where we train 3SD with the pseudo-GT ($s_{pseudo}=CM\cup g_e$) and the additional boundary-aware loss ($\mathcal{L}_{gs}$). In this case, we found a small improvement in the performance. Results corresponding to these ablation experiments are also shown as a bar graph in Fig. 9.

Base Network (BN): We perform three experiments to evaluate the effectiveness of the constructed Base Network (BN). We use the following definitions in this experiment: 1) B0: BN with one encoder (local encoder $E_L$) and one decoder (saliency decoder $De_S$). 2) B1: BN with two encoders ($E_L$ and $E_G$) and one decoder ($De_S$), i.e., adding the global encoder ($E_G$) to BN. 3) B2: BN with two encoders and two decoders (adding the classification decoder), as shown in Fig. 3. For this experiment, B0, B1, and B2 are trained in a supervised fashion using the actual ground-truth labels with the cross-entropy loss. As can be seen from Table 2, we obtain an improvement when we add the transformer-based global encoder ($E_G$) to BN (B1 in Table 2). This implies that $E_G$ is efficient at capturing patch-wise relations to obtain better saliency maps. Furthermore, adding the classification decoder improves BN's performance (B2 in Table 2).

Figure 9: Bar graph showing the improvements obtained when using $CM$ (CAM maps computed with the help of the self-supervised classification task) in the proposed pseudo-GT generation.
Table 2: Ablation study on the DUTS-TE, DUT-OMRON, ECSSD, and HKU-IS datasets. $\uparrow$ and $\downarrow$ denote that larger and smaller is better, respectively.
Columns (left to right). Self- or unsupervised: PCL16, PCL24, PCL32, PCL48, GCL, PCM, PGE, LPG. Fully-supervised Base Network (BN): B0, B1, B2.

DUTS-TE
$S_m\uparrow$ 0.832 0.841 0.844 0.842 0.804 0.791 0.712 0.828 | 0.836 0.861 0.883
$B_\mu\downarrow$ 0.451 0.438 0.443 0.432 0.463 0.621 0.658 0.501 | 0.368 0.364 0.358
$F_\beta\uparrow$ 0.759 0.762 0.765 0.760 0.748 0.738 0.652 0.756 | 0.798 0.807 0.829
$E_\eta\uparrow$ 0.866 0.874 0.875 0.869 0.844 0.832 0.740 0.858 | 0.884 0.897 0.913
$MAE\downarrow$ 0.047 0.044 0.044 0.052 0.058 0.072 0.119 0.052 | 0.052 0.046 0.036

DUT-OMRON
$S_m\uparrow$ 0.792 0.796 0.803 0.795 0.782 0.767 0.672 0.783 | 0.828 0.832 0.843
$B_\mu\downarrow$ 0.504 0.501 0.501 0.508 0.511 0.701 0.687 0.518 | 0.460 0.451 0.429
$F_\beta\uparrow$ 0.726 0.738 0.735 0.732 0.715 0.691 0.614 0.718 | 0.747 0.756 0.777
$E_\eta\uparrow$ 0.832 0.838 0.845 0.840 0.826 0.810 0.731 0.817 | 0.854 0.859 0.869
$MAE\downarrow$ 0.062 0.062 0.060 0.062 0.070 0.077 0.130 0.063 | 0.060 0.054 0.049

ECSSD
$S_m\uparrow$ 0.878 0.881 0.883 0.879 0.869 0.852 0.770 0.876 | 0.895 0.905 0.924
$B_\mu\downarrow$ 0.395 0.389 0.387 0.391 0.410 0.590 0.662 0.400 | 0.346 0.332 0.319
$F_\beta\uparrow$ 0.881 0.891 0.897 0.894 0.873 0.843 0.756 0.892 | 0.891 0.905 0.933
$E_\eta\uparrow$ 0.921 0.926 0.928 0.928 0.904 0.869 0.854 0.922 | 0.929 0.934 0.940
$MAE\downarrow$ 0.050 0.049 0.049 0.049 0.052 0.077 0.128 0.049 | 0.047 0.043 0.032

HKU-IS
$S_m\uparrow$ 0.862 0.891 0.904 0.896 0.863 0.841 0.747 0.892 | 0.890 0.902 0.918
$B_\mu\downarrow$ 0.389 0.384 0.381 0.383 0.394 0.543 0.657 0.397 | 0.354 0.338 0.321
$F_\beta\uparrow$ 0.864 0.881 0.885 0.883 0.859 0.849 0.734 0.878 | 0.893 0.903 0.918
$E_\eta\uparrow$ 0.920 0.928 0.935 0.932 0.916 0.901 0.829 0.931 | 0.941 0.946 0.959
$MAE\downarrow$ 0.042 0.038 0.039 0.040 0.048 0.055 0.125 0.040 | 0.035 0.031 0.024

5 Conclusion

We presented a Self-Supervised Saliency Detection method (3SD) that does not require any labels, i.e., neither human annotations nor weak labels like image captions, handcrafted features, or scribble annotations. The cornerstone of a successful self-supervised SOD approach is the generation of high-quality pseudo-GTs. Our novel patch-wise contrastive learning paradigm effectively captures the semantics of salient objects, and this is the key to the superior performance verified on six benchmarks.

References

  • [1] Brendel, W., Bethge, M.: Approximating cnns with bag-of-local-features models works surprisingly well on imagenet. arXiv preprint arXiv:1904.00760 (2019)
  • [2] Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., Joulin, A.: Unsupervised learning of visual features by contrasting cluster assignments. arXiv preprint arXiv:2006.09882 (2020)
  • [3] Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. arXiv preprint arXiv:2104.14294 (2021)
  • [4] Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. TPAMI 40(4), 834–848 (2017)
  • [5] Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: ICML. pp. 1597–1607. PMLR (2020)
  • [6] Cheng, M.M., Mitra, N.J., Huang, X., Hu, S.M.: Salientshape: group saliency in image collections. The visual computer 30(4), 443–453 (2014)
  • [7] Dai, J., He, K., Sun, J.: Boxsup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In: ICCV. pp. 1635–1643 (2015)
  • [8] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
  • [9] Dosovitskiy, A., Fischer, P., Springenberg, J.T., Riedmiller, M., Brox, T.: Discriminative unsupervised feature learning with exemplar convolutional neural networks. TPAMI 38(9), 1734–1747 (2015)
  • [10] Fan, D.P., Cheng, M.M., Liu, Y., Li, T., Borji, A.: Structure-measure: A new way to evaluate foreground maps. In: ICCV. pp. 4548–4557 (2017)
  • [11] Fan, D.P., Gong, C., Cao, Y., Ren, B., Cheng, M.M., Borji, A.: Enhanced-alignment measure for binary foreground map evaluation. arXiv preprint arXiv:1805.10421 (2018)
  • [12] Feng, M., Lu, H., Ding, E.: Attentive feedback network for boundary-aware salient object detection. In: CVPR. pp. 1623–1632 (2019)
  • [13] Grill, J.B., Strub, F., Altché, F., Tallec, C., Richemond, P.H., Buchatskaya, E., Doersch, C., Pires, B.A., Guo, Z.D., Azar, M.G., Piot, B., Kavukcuoglu, K., Munos, R., Valko, M.: Bootstrap your own latent: A new approach to self-supervised learning. NIPS (2020)
  • [14] Gutmann, M., Hyvärinen, A.: Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In: Proceedings of the thirteenth international conference on artificial intelligence and statistics. pp. 297–304 (2010)
  • [15] He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: CVPR. pp. 9729–9738 (2020)
  • [16] Henaff, O.: Data-efficient image recognition with contrastive predictive coding. In: ICML. pp. 4182–4192. PMLR (2020)
  • [17] Hermans, A., Beyer, L., Leibe, B.: In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737 (2017)
  • [18] Hsu, K.J., Lin, Y.Y., Chuang, Y.Y.: Weakly supervised saliency detection with a category-driven map generator. BMVC (2017)
  • [19] Khoreva, A., Benenson, R., Hosang, J., Hein, M., Schiele, B.: Simple does it: Weakly supervised instance and semantic segmentation. In: CVPR. pp. 876–885 (2017)
  • [20] Kim, J., Han, D., Tai, Y.W., Kim, J.: Salient region detection via high-dimensional color transform. In: CVPR. pp. 883–890 (2014)
  • [21] Li, G., Xie, Y., Lin, L.: Weakly supervised salient object detection using image labels. In: AAAI. pp. 7024–7031 (2018)
  • [22] Li, G., Yu, Y.: Visual saliency based on multiscale deep features. In: CVPR. pp. 5455–5463 (2015)
  • [23] Li, Y., Hou, X., Koch, C., Rehg, J.M., Yuille, A.L.: The secrets of salient object segmentation. In: CVPR. pp. 280–287 (2014)
  • [24] Liu, N., Han, J., Yang, M.H.: Picanet: Learning pixel-wise contextual attention for saliency detection. In: CVPR. pp. 3089–3098 (2018)
  • [25] Liu, N., Zhang, N., Wan, K., Shao, L., Han, J.: Visual saliency transformer. In: ICCV. pp. 4722–4732 (2021)
  • [26] Liu, Y., Cheng, M.M., Hu, X., Wang, K., Bai, X.: Richer convolutional features for edge detection. In: CVPR. pp. 3000–3009 (2017)
  • [27] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR. pp. 3431–3440 (2015)
  • [28] Lu, S., Tan, C., Lim, J.H.: Robust and efficient saliency modeling from image co-occurrence histograms. TPAMI 36(1), 195–201 (2013)
  • [29] Luo, Z., Mishra, A., Achkar, A., Eichel, J., Li, S., Jodoin, P.M.: Non-local deep features for salient object detection. In: CVPR. pp. 6609–6617 (2017)
  • [30] Nguyen, D.T., Dax, M., Mummadi, C.K., Ngo, T.P.N., Nguyen, T.H.P., Lou, Z., Brox, T.: Deepusps: Deep robust unsupervised saliency prediction with self-supervision. arXiv preprint arXiv:1909.13055 (2019)
  • [31] Obukhov, A., Georgoulis, S., Dai, D., Van Gool, L.: Gated crf loss for weakly supervised semantic image segmentation. arXiv preprint arXiv:1906.04651 (2019)
  • [32] Piao, Y., Wang, J., Zhang, M., Lu, H.: Mfnet: Multi-filter directive network for weakly supervised salient object detection. In: ICCV. pp. 4136–4145 (October 2021)
  • [33] Qin, X., Zhang, Z., Huang, C., Dehghan, M., Zaiane, O.R., Jagersand, M.: U2-net: Going deeper with nested u-structure for salient object detection. Pattern Recognition 106, 107404 (2020)
  • [34] Qin, X., Zhang, Z., Huang, C., Gao, C., Dehghan, M., Jagersand, M.: Basnet: Boundary-aware salient object detection. In: CVPR. pp. 7479–7489 (June 2019)
  • [35] Sohn, K.: Improved deep metric learning with multi-class n-pair loss objective. NIPS 29 (2016)
  • [36] Tarvainen, A., Valpola, H.: Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. arXiv preprint arXiv:1703.01780 (2017)
  • [37] Tian, Y., Krishnan, D., Isola, P.: Contrastive multiview coding. In: ECCV. pp. 776–794. Springer (2020)
  • [38] Wang, L., Lu, H., Wang, Y., Feng, M., Wang, D., Yin, B., Ruan, X.: Learning to detect salient objects with image-level supervision. In: CVPR. pp. 136–145 (2017)
  • [39] Wang, T., Zhang, L., Wang, S., Lu, H., Yang, G., Ruan, X., Borji, A.: Detect globally, refine locally: A novel approach to saliency detection. In: CVPR. pp. 3127–3135 (2018)
  • [40] Wei, J., Wang, S., Wu, Z., Su, C., Huang, Q., Tian, Q.: Label decoupling framework for salient object detection. In: CVPR. pp. 13022–13031 (June 2020)
  • [41] Wu, R., Feng, M., Guan, W., Wang, D., Lu, H., Ding, E.: A mutual learning method for salient object detection with intertwined multi-supervision. In: CVPR. pp. 8150–8159 (2019)
  • [42] Wu, Z., Su, L., Huang, Q.: Cascaded partial decoder for fast and accurate salient object detection. In: CVPR. pp. 3907–3916 (2019)
  • [43] Wu, Z., Xiong, Y., Yu, S.X., Lin, D.: Unsupervised feature learning via non-parametric instance discrimination. In: CVPR. pp. 3733–3742 (2018)
  • [44] Yan, Q., Xu, L., Shi, J., Jia, J.: Hierarchical saliency detection. In: CVPR. pp. 1155–1162 (2013)
  • [45] Yang, C., Zhang, L., Lu, H., Ruan, X., Yang, M.H.: Saliency detection via graph-based manifold ranking. In: CVPR. pp. 3166–3173 (2013)
  • [46] Zeng, Y., Zhuge, Y., Lu, H., Zhang, L., Qian, M., Yu, Y.: Multi-source weak supervision for saliency detection. In: CVPR. pp. 6074–6083 (2019)
  • [47] Zhang, D., Han, J., Zhang, Y.: Supervision by fusion: Towards unsupervised learning of deep salient object detector. In: ICCV. pp. 4048–4056 (2017)
  • [48] Zhang, J., Xie, J., Barnes, N.: Learning noise-aware encoder-decoder from noisy labels by alternating back-propagation for saliency detection. In: ECCV. pp. 349–366. Springer (2020)
  • [49] Zhang, J., Yu, X., Li, A., Song, P., Liu, B., Dai, Y.: Weakly-supervised salient object detection via scribble annotations. In: CVPR. pp. 12546–12555 (2020)
  • [50] Zhang, J., Zhang, T., Dai, Y., Harandi, M., Hartley, R.: Deep unsupervised saliency detection: A multiple noisy labeling perspective. In: CVPR. pp. 9029–9038 (2018)
  • [51] Zhang, J., Ehinger, K.A., Wei, H., Zhang, K., Yang, J.: A novel graph-based optimization framework for salient object detection. Pattern Recognition 64, 39–50 (2017)
  • [52] Zhang, P., Wang, D., Lu, H., Wang, H., Yin, B.: Learning uncertain convolutional features for accurate saliency detection. In: ICCV. pp. 212–221 (2017)
  • [53] Zhang, Q., Huo, Z., Liu, Y., Pan, Y., Shan, C., Han, J.: Salient object detection employing a local tree-structured low-rank representation and foreground consistency. Pattern Recognition 92, 119–134 (2019)
  • [54] Zhang, X., Wang, T., Qi, J., Lu, H., Wang, G.: Progressive attention guided recurrent network for salient object detection. In: CVPR. pp. 714–722 (2018)
  • [55] Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep features for discriminative localization. In: CVPR. pp. 2921–2929 (2016)
  • [56] Zhu, W., Liang, S., Wei, Y., Sun, J.: Saliency optimization from robust background detection. In: CVPR. pp. 2814–2821 (2014)