Exploring Rich and Efficient Spatial Temporal Interactions for Real Time Video Salient Object Detection
Abstract
The current mainstream methods formulate their video saliency mainly from two independent branches, i.e., the spatial and temporal branches. As a complementary component, the main task of the temporal branch is to intermittently focus the spatial branch on those regions with salient movements. In this way, even though the overall video saliency quality is heavily dependent on the spatial branch, the performance of the temporal branch still matters. Thus, the key to improving the overall video saliency is how to further boost the performance of both branches efficiently. In this paper, we propose a novel spatiotemporal network to achieve such improvement in a fully interactive fashion. We integrate a lightweight temporal model into the spatial branch to coarsely locate those spatially salient regions that are correlated with trustworthy salient movements. Meanwhile, the spatial branch itself is able to recurrently refine the temporal model in a multi-scale manner. In this way, the spatial and temporal branches are able to interact with each other, achieving mutual performance improvement. Our method is easy to implement yet effective, achieving high-quality video saliency detection in real time at 50 FPS.
Index Terms:
Video Saliency Detection; lightweight temporal model; fast temporal shuffle scheme; multi-scale spatiotemporal deep features
I Introduction and Motivation
The problem of video salient object detection aims to extract the most visually distinctive objects in video data. Previous works [3, 4, 5, 6] generate their video saliency maps mainly by fusing the saliency cues that are respectively revealed by the spatial and temporal branches. In such a case, the main task of the spatial branch is to estimate the color saliency cue in a single frame, while the temporal branch estimates the motion saliency cue between multiple consecutive frames.
After entering the deep learning era, we can use off-the-shelf image saliency deep models [7, 8, 9] to serve as the spatial branch, so we omit a detailed introduction of this topic, which is beyond our scope. On the other hand, most previous works have adopted optical flow to sense the motion cues, and these motion cues are later fed into the temporal branch to coarsely locate those regions with salient movements [10, 11]. However, the computationally heavy optical flow is the major performance bottleneck, see the pictorial demonstration in the left part of Fig. 1.
To further improve, recent works have focused their video saliency on the spatial branch, while the temporal branch becomes a subordinate of the spatial branch, aiming to intermittently shrink the spatial problem domain [12, 2, 1]. However, for video data, the color information is frequently more stable than the motion information; e.g., the motion information may be absent completely if the salient object stays static without any movement for a long period [10]. Thus, instead of using a full interaction strategy, the SOTA deep learning based methods either follow a single-direction interaction that biases the spatiotemporal trade-off toward the spatial branch (e.g., MGA [1]), or choose a single-scale interaction to enable fast end-to-end network training/testing (e.g., SSAV [2]). To further clarify this issue, we demonstrate the 2 most representative SOTA architectures in the middle column of Fig. 1.
The first architecture (sub-figure B in Fig. 1) continues to use optical flow as the input of its temporal branch, while it treats the motion saliency cues provided by the temporal branch as side-layer attentions to facilitate the spatial deep feature computation in a multi-scale manner [1]. Although such multi-scale spatiotemporal fusion can indeed improve the overall performance, the performance is still limited by the optical flow usage. Though FlowNet can conduct the optical flow computation almost in real time, its computational cost is still the major bottleneck for extremely fast video saliency detection. Meanwhile, the motion information provided by optical flow is occasionally inaccurate, which further degenerates the overall robustness.
The second architecture abandons the optical flow usage, adopting an end-to-end ConvLSTM network to sense the temporal saliency cues [12, 2]. As a subsequent component of the spatial branch, the ConvLSTM network takes the output of its precedent spatial branch as input, seeking the spatial saliency that is consistent over the temporal scale as the spatiotemporal video saliency, see sub-figure C in Fig. 1. Despite its merit of fast computation, such a single-stream structure has the following limitations: 1) The ConvLSTM network is heavily dependent on its precedent spatial branch, which easily leads to a performance bottleneck when the given video data is dominated by motion cues; 2) Due to the heavy network architecture of the ConvLSTM, the overall video saliency computation speed is limited to about 10 FPS; 3) The temporal branch (ConvLSTM) cannot make full use of the multi-scale spatial deep features of its precedent spatial branch, although these two branches should interact with each other to boost their performance mutually [13, 14].
Thus, in this paper, we propose a novel spatiotemporal network for video saliency detection, whose key component is a newly designed temporal model that senses the temporal information at an extremely fast speed, see sub-figure D in Fig. 1. In sharp contrast to all previous works, our method attempts full interaction between the spatial and temporal branches. The reason is that our temporal module is lightweight yet has a strong temporal sensing ability, so it can be directly inserted into each UNet decoder layer, receiving multi-scale spatial information to boost its robustness, whereas conventional temporal branches (e.g., ConvLSTM) cannot receive such multi-scale spatial information. Our spatiotemporal network receives 3 consecutive video frames each time, and its temporal model mainly consists of simple operations, i.e., sequential 3D convolutions with temporal shuffle operations. Benefiting from such a lightweight design, it is feasible to integrate the temporal model into each spatial feature layer. In this way, we can make full use of the multi-scale spatial deep features while sensing the motion saliency cues over the temporal scale. Meanwhile, as an additional convolutional part of each spatial feature layer, the temporal model is able to facilitate the spatial deep feature computation in a recurrent manner.
In summary, compared to the current SOTA methods, our spatiotemporal network has three prominent advantages:
• Instead of using the time-consuming optical flow for temporal information, we have devised a novel temporal module, which is very fast and compatible with the current mainstream encoder-decoder structures, achieving real-time detection at 50 FPS;
• We have inserted the novel temporal module into the decoder layers of a vanilla UNet, which can take full advantage of the multi-scale spatial information to alleviate the object boundary blur induced by the temporal module;
• Moreover, because our temporal branch is more accurate than those of previous works, we can use the features propagated from the temporal branch to further improve the performance of our spatial branch, achieving full spatiotemporal interaction.

II Related Work
II-A Handcrafted Feature Based Methods
Conventional methods estimate the temporal information mainly through contrast computation over hand-crafted features. Liu et al. [16] computed superpixel-wise spatiotemporal contrast histograms to sense the temporal information. Similarly, Liu et al. [4] proposed intra-frame similarity matrices to perform bi-directional temporal propagation as the temporal information. Wang et al. [17] acquired the spatiotemporal feature by computing a robust geodesic measurement to locate spatial edges and motion boundaries. Wang et al. [3] adopted the gradient flow field to obtain intra-frame boundary information as well as inter-frame motion information. Further, Chen et al. [18, 19] adopted long-term information to enhance the spatiotemporal saliency consistency.
II-B Deep Learning Based Methods
Benefiting from the development of deep learning techniques, the current SOTA methods have widely adopted the bi-stream network structure, in which one stream estimates the color saliency over the spatial information, and the other stream extracts the motion saliency over the temporal scale. Le et al. [14] adopted such a bi-stream structure, in which one stream computes superpixel-wise spatial saliency cues, and the other performs the temporal saliency computation by applying 3D convolution directly over multiple video frames. However, its direct usage of 3D convolution totally overlooks the spatial information, severely obscuring its detection boundaries. Wang et al. [20] proposed to use a 2D convolution network to sense the differences between two adjacent frames as the temporal information, whose underlying rationale is quite similar to deep learning based optical flow computation. Song et al. [12] adopted several dilated ConvLSTMs to extract multi-scale spatiotemporal information and fed it into a bi-directional ConvLSTM to obtain the spatiotemporal information. Wang et al. [21] and Fan et al. [2] further adopted human visual fixation to enable video saliency shifting between different objects. Li et al. [22] adopted FlowNet based optical flow to sense the temporal information, which is then used to enhance the saliency consistency over the temporal scale via a ConvLSTM network. Most recently, Li et al. [1] developed a multi-task network. As usual, it adopts optical flow to sense the temporal information, and its major highlight is that it utilizes its temporal branch to assist its spatial branch, achieving a significant performance improvement. In [1], though the temporal branch is able to affect the spatial branch, the usage of the spatial branch to interact with the temporal branch is overlooked, whereas the color saliency obtained by the spatial branch can indeed affect the temporal branch positively.
III Spatiotemporal Network Overview
The classic UNet [15] adopts an encoder-decoder network structure, which has a remarkable learning ability to simultaneously extract both high-level semantic information and low-level tiny spatial details. Thus, we choose it as the baseline network, and its overall architecture can be found in Fig. 2.
In the case of a single image, the high-level semantic information in UNet tends to decrease as the number of decoder layers increases, and the performance of these later decoder layers may degenerate due to the reduced high-level semantic information. To alleviate this, the widely used scheme is to integrate the deep features of each encoder layer, which are supposed to contain abundant high-level semantic information, into the corresponding decoder layer.
In the case of video data, we input 3 video frames into our baseline network each time. For each encoder layer, we represent its feature block as $F_i = \{f_i^1, f_i^2, f_i^3\}$, where $i$ is the encoder layer index and the superscript indexes the spatial deep features of the 3 input frames. As shown in Fig. 2, we respectively assign one temporal model (marked by a yellow rectangle) to each decoder layer. The temporal model takes the spatial deep features as input, aiming to reveal additional high-level semantic information over the temporal scale (i.e., between the 3 input frames). With the help of this temporally related high-level semantic information, the spatial deep features in the decoder layers can be improved significantly, and thus we name these fused deep features the spatiotemporal deep features, which are able to sense both the spatial and temporal saliency cues at the same time.
Meanwhile, to make full use of both the multi-scale spatial deep features and the high-level attention based spatial localization information, the temporal model also takes the spatial attention maps as input; these attention maps ($A$) are computed by applying dilated convolutions over the spatial deep features of the last encoder layer. Thus, the computation procedure of the spatiotemporal deep feature ($ST_3$) derived from the 3rd encoder layer can be detailed as Eq. 1.
$ST_3 = \mathrm{Conv}\Big(F_3 \oplus TM\big(F_3 \oplus \mathrm{UP}(A)\big)\Big) \qquad (1)$
where $\mathrm{Conv}(\cdot)$ denotes the feature convolution operation, $\oplus$ is the feature concatenation operation, $TM(\cdot)$ denotes the temporal model, and $\mathrm{UP}(\cdot)$ denotes the up-sampling operation.
Moreover, the spatiotemporal deep features ($ST_i$) will be recurrently integrated into the next decoder layer for robust temporal deep feature computation. Thus, the complete spatiotemporal deep feature computation procedure should be updated to Eq. 2.
$ST_i = \mathrm{Conv}\Big(F_i \oplus TM\big(F_i \oplus \mathrm{UP}(A) \oplus R_i\big)\Big) \qquad (2)$
$R_i = \mathrm{UP}(ST_{i+1}) \qquad (3)$
where $R_i$ represents the recurrent features from the previous decoder layer, which can be obtained by Eq. 3. In this way, we achieve the fully interactive status between the spatial branch (i.e., the encoder layers of UNet) and the temporal model.
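To make the above data flow concrete, the following PyTorch sketch shows how one decoder layer could realize Eq. 2 and Eq. 3 under our notation; the module name, channel arguments, and the bilinear up-sampling are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DecoderFusion(nn.Module):
    """One decoder layer of the spatiotemporal UNet (illustrative sketch of Eq. 2/3).

    The encoder features F_i, the up-sampled attention maps UP(A), and the recurrent
    features R_i = UP(ST_{i+1}) from the previous (coarser) decoder layer are concatenated,
    passed through the temporal model TM (Sec. IV), and fused with F_i via a convolution.
    """

    def __init__(self, feat_ch, tm_out_ch, out_ch, temporal_model):
        super().__init__()
        self.temporal_model = temporal_model                              # TM(.)
        self.fuse = nn.Conv2d(feat_ch + tm_out_ch, out_ch, 3, padding=1)  # Conv(.)

    def forward(self, feat_i, att, rec=None):
        # feat_i: (B*3, C, H, W)   spatial features of the 3 input frames (frames folded into batch)
        # att:    (B*3, Ca, h, w)  attention maps computed from the last encoder layer
        # rec:    (B*3, Cr, h', w') spatiotemporal features ST_{i+1} of the previous decoder layer
        # the temporal model is assumed to regroup the 3 frames into its own layout internally
        att = F.interpolate(att, size=feat_i.shape[-2:], mode='bilinear', align_corners=False)
        inputs = [feat_i, att]                                            # F_i ⊕ UP(A)
        if rec is not None:
            rec = F.interpolate(rec, size=feat_i.shape[-2:], mode='bilinear', align_corners=False)
            inputs.append(rec)                                            # ... ⊕ R_i  (Eq. 3)
        tm_out = self.temporal_model(torch.cat(inputs, dim=1))            # TM(F_i ⊕ UP(A) ⊕ R_i)
        return self.fuse(torch.cat([feat_i, tm_out], dim=1))              # ST_i  (Eq. 2)
```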

IV Our Temporal Model
IV-A Fast 3D Convolution
Compared to the conventional 2D convolution (i.e., a flat kernel), which can only sense the spatial information within a single video frame, the 3D convolution (i.e., a cubic kernel) is additionally able to sense the temporal information.
As a basic computational unit, our 3D convolution itself is exactly the same as a plain 3D convolution; the major difference lies in how we use it to capture temporal information. In general, a single plain 3D convolution has only limited temporal sensing ability; in contrast, our 3D convolution with the fast cyclic padding scheme kills two birds with one stone: 1) it avoids the performance degradation induced by conventional padding when using sequential 3D convolutions to enhance the temporal sensing ability; 2) it cyclically uses the other frames as the padding data, enhancing the temporal ability naturally without additional computational costs.
Therefore, for each encoder spatial layer, we use the 3D convolution to sense the spatial consistency shared over the temporal scale as the spatiotemporal deep features, e.g., the $ST_i$ in Eq. 2, which is also marked by the red dashed line in Fig. 2. As mentioned in Eq. 2, the input data of our temporal model consists of 3 parts: the multi-scale spatial deep features ($F_i$), the attention maps ($A$), and the recurrent data from the precedent decoder layer ($R_i$). For simplicity, we use $X^i = \{x_1, x_2, x_3\}$ to denote the input deep features (i.e., of the 3 frames) of the temporal model in the $i$-th decoder layer. To reveal the temporal information, we use a sliding-window scheme to apply the 3D convolution ($\mathcal{C}_{3D}$) over $X^i$, and thus the temporal model output ($Y^i$) can be computed quickly by Eq. 4.
$Y^i = \{y_1, y_2, y_3\}, \quad y_t = \mathcal{C}_{3D}\big(x_{t-1}, x_t, x_{t+1}\big), \; t \in \{1,2,3\} \qquad (4)$
where $x_0$ and $x_4$ denote the padding data. Thus far, we have applied the 3D convolution over spatial deep features to sense temporal information; however, we found that a single 3D convolution is incapable of obtaining temporal information accurately. Thus, in our implementation, we use 3 sequential 3D convolutions in our temporal model, and the spatiotemporal deep features after the 2nd 3D convolution can be updated by Eq. 5, in which the superscript indexes the sequential 3D convolutions (e.g., $y_t^1$ denotes the output of the 1st one).
$y_t^2 = \mathcal{C}_{3D}\big(y_{t-1}^{1}, y_t^{1}, y_{t+1}^{1}\big), \; t \in \{1,2,3\} \qquad (5)$
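As a minimal illustration of Eq. 4, the snippet below applies a single plain 3D convolution over the stacked features of the 3 input frames; the channel width and kernel size are assumptions for demonstration, and the zero temporal padding shown here is exactly the naive scheme that the cyclic padding below replaces.

```python
import torch
import torch.nn as nn

# A single plain 3D convolution over the 3-frame feature stack (Eq. 4);
# the channel width (64) and 3x3x3 kernel are illustrative assumptions.
conv3d = nn.Conv3d(in_channels=64, out_channels=64,
                   kernel_size=(3, 3, 3), padding=(1, 1, 1))

x = torch.randn(1, 64, 3, 32, 32)   # (batch, channels, 3 frames, H, W)
y = conv3d(x)                       # (1, 64, 3, 32, 32); y_t mixes x_{t-1}, x_t, x_{t+1}
# with padding=(1, 1, 1) the temporal padding data x_0 and x_4 are zero slices,
# i.e., the naive zero-padding scheme discussed in problem 2) below
```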
As shown in Eq. 4 and Eq. 5, there exist 2 major problems with the sequential usage of multiple 3D convolutions: 1) due to the misaligned spatial information between different video frames, the direct usage of multiple 3D convolutions easily loses the tiny spatial details, blurring the object boundaries in the detection results; 2) the naive padding scheme (e.g., zero padding) may make the computed spatiotemporal deep features problematic after multiple sequential 3D convolutions.
To solve problem 1), we simply add each 3D convolution output with the deep features computed by a 2D convolution, e.g., $y_t^1 \leftarrow y_t^1 + \mathcal{C}_{2D}(x_t)$. Meanwhile, we use the cyclic padding scheme to handle problem 2), see Eq. 6.
$x_0 = x_3, \qquad x_4 = x_1 \qquad (6)$
However, such a cyclic padding scheme is time-consuming if we explicitly reorganize the input data. Therefore, we directly use the GPU repeat operation to tile the original $X^i$ 3 times into $\tilde{X}$, and the cyclic padding can then be fulfilled by sliding a window over these expanded features; this implementation is 5 times faster than the conventional feature reorganization, see the example in Eq. 7.
$\tilde{X} = \{x_1, x_2, x_3, x_1, x_2, x_3, x_1, x_2, x_3\}, \quad y_t = \mathcal{C}_{3D}\big(\tilde{X}[t+2], \tilde{X}[t+3], \tilde{X}[t+4]\big) \qquad (7)$
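A minimal sketch of this repeat-and-slide implementation of Eq. 6 and Eq. 7 is given below, assuming the features of the 3 frames are stacked along the temporal axis of a 5-D tensor; the tensor layout and channel width are illustrative.

```python
import torch
import torch.nn as nn

def cyclic_pad_conv3d(conv3d, x):
    """Apply a temporal-kernel-3 3D convolution with cyclic temporal padding (Eq. 6/7).

    x: (B, C, 3, H, W) features of the 3 input frames. The 3 frames are repeated
    along the temporal axis, so a single slice [x3, x1, x2, x3, x1] realizes the
    cyclic padding without re-ordering the frames for every window.
    """
    expanded = x.repeat(1, 1, 3, 1, 1)     # (B, C, 9, H, W): x1 x2 x3 x1 x2 x3 x1 x2 x3
    window = expanded[:, :, 2:7]           # x3 x1 x2 x3 x1 -> cyclic padding of (x1, x2, x3)
    return conv3d(window)                  # valid conv along time: output length 3

conv3d = nn.Conv3d(64, 64, kernel_size=(3, 3, 3), padding=(0, 1, 1))  # no temporal padding
x = torch.randn(1, 64, 3, 32, 32)
y = cyclic_pad_conv3d(conv3d, x)           # (1, 64, 3, 32, 32), matching Eq. 7
```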
Also, we show the complete data flow of our temporal model in Fig. 3.
IV-B Fast Temporal Shuffle
The temporal information sensed by the 3D convolutions mainly comes from the 3rd dimension of the adopted 3D kernels, whose temporal extent is much smaller than their spatial extent; this may bias the final spatiotemporal deep features toward the spatial domain, degenerating the performance of the temporal model. To solve this, we propose a fast temporal shuffle scheme to enhance the temporal sensing ability of the temporal model.
Inspired by ShuffleNet [23], which enhances its spatial feature diversity by scrambling the order of its feature channels, here we enhance the temporal part of our temporal model by swapping deep features between consecutive video frames. For example, as shown in Eq. 8, we swap the deep feature $x_1^{j}$ in frame 1 with the corresponding deep feature $x_2^{j}$ in frame 2, and we swap $x_2^{k}$ with $x_3^{k}$ as well, where $x_t^{j}$ denotes the $j$-th feature channel of frame $t$.
$\mathrm{TS}:\; x_1^{j} \leftrightarrow x_2^{j}, \qquad x_2^{k} \leftrightarrow x_3^{k} \qquad (8)$
In fact, the above temporal shuffle can be fully implemented on the GPU, which makes it extremely simple and fast, see the pictorial demonstration in Fig. 4. We first sequentially divide the original 192 deep features into 64 groups, each of which includes 3 deep features. Next, we reshape the original deep features into a 64×3 matrix according to their group order, then transpose it into a 3×64 matrix, and finally flatten it back to 192 deep features. In this way, we have automatically inserted the temporally neighbored spatial features into the current frame.
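The reshape-transpose-flatten procedure above can be sketched in a few lines of PyTorch; the channel layout (192 channels grouped as 64×3) follows the description above, while the function name and the assumption that the 3 frames' channels are interleaved are ours.

```python
import torch

def temporal_shuffle(x, groups=64):
    """Temporal shuffle (Sec. IV-B): reshape -> transpose -> flatten on the channel axis.

    x: (B, C, H, W) with C = groups * 3 (e.g., 192 = 64 x 3); each consecutive triple
    of channels is assumed to form one group, as described in the text.
    """
    b, c, h, w = x.shape
    t = c // groups                         # 3 deep features per group
    x = x.view(b, groups, t, h, w)          # (B, 64, 3, H, W): grouping
    x = x.transpose(1, 2).contiguous()      # (B, 3, 64, H, W): transposition
    return x.view(b, c, h, w)               # flatten back to 192 channels

x = torch.randn(2, 192, 32, 32)
y = temporal_shuffle(x)                     # temporally neighbored channels are mixed in
```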
In our implementation, we apply the above temporal shuffle 2 times, i.e., once on the output of the first 3D convolution and once on the output of the second 3D convolution. Thus, the complete data flow in the temporal model can be represented by Eq. 9.
$Y = \mathcal{C}_{3D}^{3}\Big(\mathrm{TS}\big(\mathcal{C}_{3D}^{2}\big(\mathrm{TS}\big(\mathcal{C}_{3D}^{1}(X)\big)\big)\big)\Big) \qquad (9)$
where $X$ and $Y$ respectively denote the input and output of the temporal model, $\mathcal{C}_{3D}^{k}$ denotes the $k$-th sequential 3D convolution, and $\mathrm{TS}(\cdot)$ represents the temporal shuffle operation.
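Putting the pieces together, a hedged sketch of the whole temporal model of Eq. 9 (reusing the cyclic_pad_conv3d and temporal_shuffle helpers sketched above) could look as follows; the channel widths and the exact placement of the 2D residual convolutions are our assumptions.

```python
import torch
import torch.nn as nn

class TemporalModel(nn.Module):
    """Sketch of the temporal model data flow in Eq. 9: three sequential 3D convolutions
    with cyclic padding, a temporal shuffle after the 1st and 2nd of them, and a
    2D-convolution residual intended to preserve the tiny spatial details."""

    def __init__(self, ch=64):
        super().__init__()
        self.conv3d = nn.ModuleList(
            [nn.Conv3d(ch, ch, kernel_size=3, padding=(0, 1, 1)) for _ in range(3)])
        self.conv2d = nn.ModuleList(
            [nn.Conv2d(ch, ch, kernel_size=3, padding=1) for _ in range(3)])

    def forward(self, x):                                      # x: (B, C, 3, H, W)
        b, c, t, h, w = x.shape
        for i in range(3):
            y = cyclic_pad_conv3d(self.conv3d[i], x)           # sequential 3D conv (Eq. 6/7)
            y2d = self.conv2d[i](x.transpose(1, 2).reshape(b * t, c, h, w))
            y = y + y2d.reshape(b, t, c, h, w).transpose(1, 2) # add 2D-convolved features
            if i < 2:                                          # TS after the 1st and 2nd conv
                flat = y.transpose(1, 2).reshape(b, t * c, h, w)
                flat = temporal_shuffle(flat, groups=c)        # 64 groups of 3 features
                y = flat.reshape(b, t, c, h, w).transpose(1, 2)
            x = y
        return x                                               # Y: (B, C, 3, H, W)
```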

DataSets | - | DAVIS-T [24] | | | SegTrack-V2 [25] | | | ViSal [3] | | | FBMS-T [26] | | | VOS-T [27] | | | DAVSOD-T [2] | | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Metric | Year | F-max | S-measure | MAE | F-max | S-measure | MAE | F-max | S-measure | MAE | F-max | S-measure | MAE | F-max | S-measure | MAE | F-max | S-measure | MAE |
OUR | - | 0.865 | 0.892 | 0.023 | 0.860 | 0.891 | 0.017 | 0.952 | 0.952 | 0.013 | 0.856 | 0.872 | 0.038 | 0.791 | 0.850 | 0.058 | 0.651 | 0.746 | 0.086 |
MGA [1]* | 2019 | 0.892 | 0.910 | 0.023 | 0.821 | 0.865 | 0.030 | 0.933 | 0.936 | 0.017 | 0.899 | 0.904 | 0.028 | 0.735 | 0.792 | 0.075 | 0.640 | 0.738 | 0.084 |
AGS [21]* | 2019 | 0.873 | 0.898 | 0.026 | 0.816 | 0.858 | 0.022 | 0.960 | 0.960 | 0.014 | 0.840 | 0.874 | 0.048 | 0.774 | 0.840 | 0.066 | 0.661 | 0.759 | 0.090 |
SSAV [2]* | 2019 | 0.861 | 0.893 | 0.028 | 0.801 | 0.851 | 0.023 | 0.939 | 0.943 | 0.020 | 0.865 | 0.879 | 0.040 | 0.742 | 0.819 | 0.073 | 0.603 | 0.724 | 0.092 |
CPD [28]** | 2019 | 0.778 | 0.859 | 0.032 | 0.778 | 0.841 | 0.023 | 0.941 | 0.942 | 0.016 | 0.810 | 0.846 | 0.048 | 0.735 | 0.818 | 0.068 | 0.608 | 0.724 | 0.092 |
PoolNet [29]** | 2019 | 0.827 | 0.860 | 0.044 | 0.782 | 0.843 | 0.020 | 0.945 | 0.945 | 0.015 | 0.856 | 0.878 | 0.037 | 0.719 | 0.796 | 0.076 | 0.612 | 0.731 | 0.088 |
EGNet [30]** | 2019 | 0.767 | 0.828 | 0.057 | 0.774 | 0.848 | 0.024 | 0.941 | 0.946 | 0.015 | 0.848 | 0.878 | 0.044 | 0.698 | 0.793 | 0.082 | 0.604 | 0.719 | 0.101 |
PDBM [12]* | 2018 | 0.855 | 0.882 | 0.028 | 0.800 | 0.864 | 0.024 | 0.888 | 0.907 | 0.032 | 0.821 | 0.851 | 0.064 | 0.742 | 0.818 | 0.078 | 0.572 | 0.698 | 0.116 |
MBNM [31]* | 2018 | 0.861 | 0.887 | 0.031 | 0.716 | 0.809 | 0.026 | 0.883 | 0.898 | 0.020 | 0.816 | 0.857 | 0.047 | 0.670 | 0.742 | 0.099 | 0.520 | 0.637 | 0.159 |
FGRN [22]* | 2018 | 0.783 | 0.838 | 0.043 | — | — | — | 0.848 | 0.861 | 0.045 | 0.767 | 0.809 | 0.088 | 0.669 | 0.715 | 0.097 | 0.573 | 0.693 | 0.098 |
DLVS [20]* | 2018 | 0.708 | 0.794 | 0.061 | — | — | — | 0.852 | 0.881 | 0.048 | 0.759 | 0.794 | 0.091 | 0.675 | 0.760 | 0.099 | 0.521 | 0.657 | 0.129 |
SCNN [32]* | 2018 | 0.714 | 0.783 | 0.064 | — | — | — | 0.831 | 0.847 | 0.071 | 0.762 | 0.794 | 0.095 | 0.609 | 0.704 | 0.109 | 0.532 | 0.674 | 0.128 |
SCOM [33]* | 2018 | 0.783 | 0.832 | 0.048 | 0.764 | 0.815 | 0.030 | 0.831 | 0.762 | 0.122 | 0.797 | 0.794 | 0.079 | 0.690 | 0.712 | 0.162 | 0.464 | 0.599 | 0.220 |
SFLR [19] | 2017 | 0.727 | 0.790 | 0.056 | 0.745 | 0.804 | 0.037 | 0.779 | 0.814 | 0.062 | 0.660 | 0.699 | 0.117 | 0.546 | 0.624 | 0.145 | 0.478 | 0.624 | 0.132 |
SGSP [4] | 2017 | 0.655 | 0.692 | 0.138 | 0.673 | 0.681 | 0.124 | 0.677 | 0.706 | 0.165 | 0.630 | 0.661 | 0.172 | 0.426 | 0.557 | 0.236 | 0.426 | 0.577 | 0.207 |
STBP [34] | 2017 | 0.544 | 0.677 | 0.096 | 0.640 | 0.735 | 0.061 | 0.622 | 0.629 | 0.163 | 0.595 | 0.627 | 0.152 | 0.526 | 0.576 | 0.163 | 0.410 | 0.568 | 0.160 |
MSTM [35] | 2016 | 0.429 | 0.583 | 0.165 | 0.526 | 0.643 | 0.114 | 0.673 | 0.749 | 0.095 | 0.500 | 0.613 | 0.177 | 0.567 | 0.657 | 0.144 | 0.344 | 0.532 | 0.211 |
GFVM [3] | 2015 | 0.569 | 0.687 | 0.103 | 0.592 | 0.699 | 0.091 | 0.683 | 0.757 | 0.107 | 0.571 | 0.651 | 0.160 | 0.506 | 0.615 | 0.162 | 0.334 | 0.553 | 0.167 |
SAGM [17] | 2015 | 0.515 | 0.676 | 0.103 | 0.634 | 0.719 | 0.081 | 0.688 | 0.749 | 0.105 | 0.564 | 0.659 | 0.161 | 0.482 | 0.619 | 0.172 | 0.370 | 0.565 | 0.184 |
MB+M [36] | 2015 | 0.470 | 0.597 | 0.177 | 0.554 | 0.618 | 0.146 | 0.692 | 0.726 | 0.129 | 0.487 | 0.609 | 0.206 | 0.562 | 0.661 | 0.158 | 0.342 | 0.538 | 0.228 |
RWRV [37] | 2015 | 0.345 | 0.556 | 0.199 | 0.438 | 0.583 | 0.162 | 0.440 | 0.595 | 0.188 | 0.336 | 0.521 | 0.242 | 0.422 | 0.552 | 0.211 | 0.283 | 0.504 | 0.245 |
SPVM [16] | 2014 | 0.390 | 0.592 | 0.146 | 0.618 | 0.668 | 0.108 | 0.700 | 0.724 | 0.133 | 0.330 | 0.515 | 0.209 | 0.351 | 0.511 | 0.223 | 0.358 | 0.538 | 0.202 |
TIMP [38] | 2014 | 0.448 | 0.593 | 0.172 | 0.573 | 0.644 | 0.116 | 0.479 | 0.612 | 0.170 | 0.456 | 0.576 | 0.192 | 0.401 | 0.575 | 0.215 | 0.395 | 0.563 | 0.195 |
SIVM [39] | 2010 | 0.450 | 0.557 | 0.212 | 0.581 | 0.605 | 0.251 | 0.522 | 0.606 | 0.197 | 0.426 | 0.545 | 0.236 | 0.439 | 0.558 | 0.217 | 0.298 | 0.486 | 0.288 |
V Experiments
V-A Datasets and Evaluation Criteria
V-A1 Evaluation Datasets.
We evaluate our method on 6 widely used public benchmarks: DAVIS-T [24], SegTrack-V2 [25], ViSal [3], FBMS-T [26], VOS-T [27], and DAVSOD-T [2].
V-A2 Evaluation Metrics.
We use 3 widely adopted standard metrics in our quantitative evaluations: F-measure [40]; Structure Measure (S-measure) [41]; Mean Absolute Error (MAE) [42].
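For reference, the MAE and (maximum) F-measure can be computed as in the following sketch, which follows the common evaluation protocol with β² = 0.3 [40]; this is not the authors' evaluation code, and the S-measure [41] is omitted for brevity.

```python
import numpy as np

def mae(sal, gt):
    """Mean Absolute Error between a saliency map and its ground truth, both scaled to [0, 1]."""
    return np.abs(sal.astype(np.float64) - gt.astype(np.float64)).mean()

def max_f_measure(sal, gt, beta2=0.3):
    """Maximum F-measure over 255 uniformly sampled thresholds (beta^2 = 0.3 as in [40])."""
    gt = gt > 0.5
    eps = 1e-8
    scores = []
    for th in np.linspace(0.0, 1.0, 255):
        pred = sal >= th
        tp = np.logical_and(pred, gt).sum()
        precision = tp / (pred.sum() + eps)
        recall = tp / (gt.sum() + eps)
        scores.append((1 + beta2) * precision * recall / (beta2 * precision + recall + eps))
    return max(scores)
```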
Data | OUR | MGA19 [1] | SSAV19 [2] |
---|---|---|---|
Videos | DAVSOD(5.5K)+DAVIS(2K)=7.5K | DAVIS(2K)+FBMS(0.5K)=2.5K | DAVSOD(5.5K)+DAVIS(2K)=7.5K |
Images | MSRA10K(4K)+HKU-IS(3K)+DUTOMRON(2.5K)=9.5K | DUTS(10.5K) | DUTOMRON(5K) |
Fixation | - | - | DAVSOD(5K) |
Total | Videos(7.5K)+Images(9.5K)=17K | Videos(2.5K)+Images(10.5K)=13K | Videos(7.5K)+Images(5K)+Fixation(5K)=17.5K |
Data | PDBM18 [12] | AGS19 [21] | DLVS18 [20] |
Videos | DAVIS(2K) | - | SegTrack-V2(1K)+FBMS(0.5K)=1.5K |
Images | MSRA10K(10K)+DUTOMRON(5K)=15K | DUTOMRON(5K)+PASCAL-S:(1K)=6K | MSRA10K(10K)+DUTOMRON(5K)=15K |
Fixation | - | DAVIS(5.5K)+SegTrack-V2(1K)=6.5K | - |
Total | Videos(2K)+Images(15K)=17K | Images(6K)+Fixation(6.5K)=12.5K | Videos(1.5K)+Images(15K)=16.5K |
V-A3 Training Set.
Since video saliency detection requires much more training data than conventional image saliency detection, previous works [1, 12, 2] have followed a stage-wise training protocol: pre-train the video saliency deep model using image data first and fine-tune it using video data later. As shown in Tab. II, we list the detailed training sets adopted by the current SOTA methods. We pre-trained our model using 9.5K images selected from DUTOMRON (2.5K) [44], HKU-IS (3K) [48], and MSRA10K (4K) [43], in which we removed those images that do not contain any movable objects. Then, we fine-tuned our model using 7.5K video frames, including the widely used DAVIS-TR (2K) [24] and the recently proposed DAVSOD (5.5K) [2]. It should also be noted that our training did not include the fixation data of the DAVSOD dataset.
V-A4 Training Details.
We first use the entire training set (all 17K data, including both images and videos) to pre-train our spatial branch (33,000 epochs). All images/frames are resized to 256×256, and we empirically set the batch size to 16. Next, based on the above pre-trained model, we train the whole spatiotemporal model using the same training set (including both images and videos). Since our spatiotemporal network takes 3 frames as input each time, each static image is copied three times to meet the input size requirement. This training stage takes almost 8,500 epochs, and we decrease the batch size from 16 to 4 here because the spatiotemporal training takes more GPU memory.
Our network training uses stochastic gradient descent (SGD) with a momentum of 0.9 and a weight decay of 5e-4, and we set the initial learning rate to 5e-3. To relieve overfitting, we use random horizontal flips to augment the image training set. Meanwhile, we re-sample the video data to different frame rates using intervals {0,1,2,3,4,5,6} to augment the video training set.
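The listed hyper-parameters translate into the following illustrative PyTorch configuration; the model placeholder and the clip-sampling helper are our own assumptions, and only the optimizer settings and the augmentation strategies come from the text.

```python
import random
import torch
import torchvision.transforms as T

# Placeholder network; only the optimizer settings and augmentations follow Sec. V-A4.
model = torch.nn.Conv2d(3, 1, kernel_size=3, padding=1)
optimizer = torch.optim.SGD(model.parameters(),
                            lr=5e-3, momentum=0.9, weight_decay=5e-4)

flip = T.RandomHorizontalFlip(p=0.5)   # image-set augmentation (the GT mask must be flipped consistently)

def sample_training_clip(frames):
    """Video augmentation: re-sample 3 frames with a random frame interval in {0,...,6}."""
    interval = random.choice([i for i in range(7) if 2 * (i + 1) < len(frames)])
    start = random.randrange(len(frames) - 2 * (interval + 1))
    return [frames[start + k * (interval + 1)] for k in range(3)]
```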

V-B Performance Comparisons
We have compared our method with 23 SOTA saliency detection methods, 20 of which are video saliency detection methods: MGA [1], AGS [21], SSAV [2], PDBM [12], MBNM [31], FGRN [22], DLVS [20], SCNN [32], SCOM [33], SFLR [19], SGSP [4], STBP [34], MSTM [35], GFVM [3], SAGM [17], MB+M [36], RWRV [37], SPVM [16], TIMP [38], and SIVM [39]. We also compare our method with 3 of the most recent SOTA image saliency detection methods: CPD [28], PoolNet [29], and EGNet [30].
V-B1 Quantitative Evaluation
We employ three common metrics, F-max, S-measure, and MAE, for quantitative evaluation, and Tab. I shows the comparison details. As shown in Tab. I, our method consistently ranks among the top three on all the tested datasets; in particular, it shows the best performance on the SegTrack-V2 and VOS-T datasets.
It may be possible for our method to achieve even more competitive results; e.g., we might achieve the best performance on the FBMS and DAVSOD datasets if we included FBMS-T and the fixation data in our training set as the MGA [1] and SSAV [2] methods do. However, since our key focus is to design a general video saliency framework with extremely fast speed (the highest FPS), it is not strictly necessary to pursue leading performance on all tested datasets.
Meanwhile, we also provide a brief qualitative comparison in Fig. 5, showing the advantage of our method in handling video scenes with cluttered backgrounds and moderate motions, e.g., the breakdance sequence demonstrated in the 2nd row.
V-B2 Discussion
In fact, human fixation is extremely important for shifting attention between different objects. Because AGS [21] trained its model using the fixation data provided by the DAVSOD dataset, it achieves the leading performance on the DAVSOD-T dataset. On the other hand, MGA [1] achieves the best performance on the DAVIS-T dataset because it adopts the powerful off-the-shelf FlowNet2.0 [49], which was pre-trained using massive additional training data, to sense the temporal information, and the DAVIS-T dataset is dominated by motion information. It should be noted that the leading performance of MGA on the FBMS dataset is mainly induced by its usage of the FBMS training set during network training. As for our spatiotemporal model, it achieves the leading performance on the SegTrack-V2 and VOS-T datasets, which are respectively dominated by temporal and spatial information, showing the robustness of our method. Also, we provide the qualitative comparisons in Fig. 5.
V-B3 Efficiency Comparison.
We also report the runtime and network size comparisons in Tab. III and Tab. IV. As shown in Tab. III, though our method was evaluated on a machine with a GTX1080Ti GPU, it achieves the highest FPS among all the compared models, even though several of them were evaluated on more powerful GPUs.
Tab. IV compares the network parameter amounts of our model and the most representative SOTA models, in which our method has the lightest network architecture.
DataSets | DAVIS-T [24] | | | SegTrack-V2 [25] | | |
---|---|---|---|---|---|---|
Metric | F-max | S-measure | MAE | F-max | S-measure | MAE |
+3D+S+MA | 0.865 | 0.892 | 0.023 | 0.860 | 0.891 | 0.017 |
+3D+S | 0.858 | 0.890 | 0.024 | 0.855 | 0.891 | 0.017 |
+3D | 0.855 | 0.895 | 0.027 | 0.841 | 0.884 | 0.017 |
Baseline | 0.837 | 0.878 | 0.032 | 0.822 | 0.876 | 0.022 |
DataSets | ViSal [3] | | | FBMS-T [26] | | |
Metric | F-max | S-measure | MAE | F-max | S-measure | MAE |
+3D+S+MA | 0.952 | 0.952 | 0.013 | 0.856 | 0.872 | 0.038 |
+3D+S | 0.951 | 0.949 | 0.017 | 0.852 | 0.871 | 0.039 |
+3D | 0.942 | 0.943 | 0.016 | 0.853 | 0.870 | 0.038 |
Baseline | 0.912 | 0.924 | 0.025 | 0.839 | 0.867 | 0.047 |
DataSets | VOS-T [27] | | | DAVSOD [2] | | |
Metric | F-max | S-measure | MAE | F-max | S-measure | MAE |
+3D+S+MA | 0.791 | 0.850 | 0.058 | 0.651 | 0.746 | 0.086 |
+3D+S | 0.782 | 0.847 | 0.062 | 0.650 | 0.747 | 0.085 |
+3D | 0.791 | 0.851 | 0.060 | 0.650 | 0.748 | 0.086 |
Baseline | 0.771 | 0.839 | 0.062 | 0.629 | 0.725 | 0.099 |
V-C Component Evaluation
In Tab. V, we conduct multiple component evaluations to verify the effectiveness of each component, in which the baseline represents the original spatial branch using 2D convolutions only.
Regarding our temporal shuffle, it indeed introduces some noisy information; thus, although it enhances the temporal sensing ability, its standalone gain reported in Tab. V (+3D+S) is not significant. We therefore use the attention model to compensate for the location information lost during the temporal shuffle, and the performance improvement after adding the attention model (i.e., +3D+S+MA) is significant.
The qualitative comparisons between different components can be found in Fig. 6.
V-C1 Effectiveness of 3D Convolution (Sec. IV-A)
As shown in Tab. V, the purely 2D convolution based baseline shows the worst performance on all the tested datasets. Benefiting from the temporal information provided by the 3D convolution, the overall performance of the baseline network achieves an average improvement of about 2%, as expected.
DataSets | DAVIS-T [24] | | | SegTrack-V2 [25] | | |
---|---|---|---|---|---|---|
Metric | F-max | S-measure | MAE | F-max | S-measure | MAE |
3D_R5 | 0.841 | 0.882 | 0.025 | 0.832 | 0.890 | 0.015 |
3D_R3 | 0.855 | 0.895 | 0.027 | 0.841 | 0.884 | 0.017 |
3D_R1 | 0.839 | 0.896 | 0.028 | 0.847 | 0.873 | 0.021 |
Baseline | 0.837 | 0.878 | 0.032 | 0.822 | 0.876 | 0.022 |
DataSets | ViSal [3] | | | FBMS-T [26] | | |
Metric | F-max | S-measure | MAE | F-max | S-measure | MAE |
3D_R5 | 0.937 | 0.941 | 0.013 | 0.837 | 0.866 | 0.041 |
3D_R3 | 0.942 | 0.943 | 0.016 | 0.853 | 0.870 | 0.038 |
3D_R1 | 0.921 | 0.948 | 0.018 | 0.851 | 0.869 | 0.043 |
Baseline | 0.912 | 0.924 | 0.025 | 0.839 | 0.867 | 0.047 |
DataSets | VOS-T [27] | | | DAVSOD [2] | | |
Metric | F-max | S-measure | MAE | F-max | S-measure | MAE |
3D_R5 | 0.780 | 0.842 | 0.055 | 0.647 | 0.742 | 0.083 |
3D_R3 | 0.791 | 0.851 | 0.060 | 0.650 | 0.748 | 0.086 |
3D_R1 | 0.789 | 0.853 | 0.061 | 0.648 | 0.744 | 0.087 |
Baseline | 0.771 | 0.839 | 0.062 | 0.629 | 0.725 | 0.099 |
Meanwhile, we have conducted an ablation study on the number of adopted sequential 3D convolutions. As shown in Tab. VI, the overall performance improves gradually as we increase the number of sequential 3D convolutions, e.g., from 3D_R1 to 3D_R3. However, we notice a slight performance degeneration when using 5 sequential 3D convolutions in our temporal model (i.e., 3D_R5), which may be induced by the misaligned spatial information accumulated over these sequential 3D convolutions, leading to an extremely large problem domain for the spatiotemporal deep feature computation. Therefore, we use 3 sequential 3D convolutions in our method.
Also, we have compared the performance obtained with different padding schemes in our temporal model, listing the detailed results of our cyclic padding and the conventional zero padding. As shown in Tab. VII, our cyclic padding scheme effectively boosts the performance mainly on the SegTrack-V2 dataset, while the improvement on the remaining datasets is marginal. This is mainly because, among all the tested datasets, only SegTrack-V2 is fully dominated by fast object movements, and the cyclic padding scheme makes the temporal model more robust to fast movements (e.g., the fast-moving sequences in the SegTrack-V2 dataset); thus we achieve a significant performance improvement on SegTrack-V2.
DataSets | DAVIS-T [24] | | | SegTrack-V2 [25] | | |
---|---|---|---|---|---|---|
Metric | F-max | S-measure | MAE | F-max | S-measure | MAE |
CyclicPadding | 0.865 | 0.892 | 0.023 | 0.860 | 0.891 | 0.017 |
ZeroPadding | 0.857 | 0.889 | 0.024 | 0.848 | 0.890 | 0.018 |
DataSets | ViSal [3] | | | FBMS-T [26] | | |
Metric | F-max | S-measure | MAE | F-max | S-measure | MAE |
CyclicPadding | 0.952 | 0.952 | 0.013 | 0.856 | 0.872 | 0.038 |
ZeroPadding | 0.938 | 0.944 | 0.016 | 0.852 | 0.871 | 0.039 |
DataSets | VOS-T [27] | | | DAVSOD [2] | | |
Metric | F-max | S-measure | MAE | F-max | S-measure | MAE |
CyclicPadding | 0.791 | 0.850 | 0.058 | 0.651 | 0.746 | 0.086 |
ZeroPadding | 0.790 | 0.848 | 0.058 | 0.649 | 0.745 | 0.088 |
V-C2 Effectiveness of Temporal Shuffle (Sec. IV-B)
As shown in Tab. V, the overall performance is further improved by integrating the temporal shuffle scheme into our temporal model. The major highlight of the temporal shuffle is that it integrates the temporally neighbored spatial information into the current spatial domain, which biases the adopted 3D kernels toward the temporal scale. Consequently, the temporal shuffle operation enhances the temporal sensing ability of the temporal model, and the overall performance gains a large improvement on those datasets dominated by temporal information, e.g., the DAVIS-T and SegTrack-V2 datasets.
V-C3 Effectiveness of Multi-scale Attention (Eq. 1)
As shown in Tab. V, the overall performance is further improved by integrating the multi-scale attention into each decoder layer. Though attention has been adopted by the most recent work [2] in a single-scale manner, our multi-scale spatiotemporal deep feature computation integrates such attention in a multi-scale way, which further improves the overall detection performance.
DataSets | DAVIS-T [24] | | | SegTrack-V2 [25] | | |
---|---|---|---|---|---|---|
Metric | F-max | S-measure | MAE | F-max | S-measure | MAE |
DeCoder_1 | 0.865 | 0.892 | 0.023 | 0.860 | 0.891 | 0.017 |
DeCoder_2 | 0.863 | 0.891 | 0.022 | 0.857 | 0.891 | 0.016 |
DeCoder_3 | 0.861 | 0.890 | 0.024 | 0.853 | 0.885 | 0.017 |
DeCoder_4 | 0.839 | 0.874 | 0.026 | 0.827 | 0.863 | 0.020 |
DeCoder_5 | 0.806 | 0.855 | 0.032 | 0.792 | 0.839 | 0.024 |
DataSets | ViSal [3] | | | FBMS-T [26] | | |
Metric | F-max | S-measure | MAE | F-max | S-measure | MAE |
DeCoder_1 | 0.952 | 0.952 | 0.013 | 0.856 | 0.872 | 0.038 |
DeCoder_2 | 0.948 | 0.949 | 0.013 | 0.853 | 0.782 | 0.038 |
DeCoder_3 | 0.942 | 0.945 | 0.015 | 0.848 | 0.868 | 0.040 |
DeCoder_4 | 0.923 | 0.932 | 0.018 | 0.832 | 0.854 | 0.044 |
DeCoder_5 | 0.899 | 0.915 | 0.024 | 0.807 | 0.833 | 0.050 |
DataSets | VOS-T [27] | | | DAVSOD [2] | | |
Metric | F-max | S-measure | MAE | F-max | S-measure | MAE |
DeCoder_1 | 0.791 | 0.850 | 0.058 | 0.651 | 0.746 | 0.086 |
DeCoder_2 | 0.790 | 0.849 | 0.058 | 0.651 | 0.746 | 0.086 |
DeCoder_3 | 0.787 | 0.847 | 0.058 | 0.648 | 0.744 | 0.087 |
DeCoder_4 | 0.769 | 0.836 | 0.059 | 0.639 | 0.736 | 0.087 |
DeCoder_5 | 0.745 | 0.816 | 0.064 | 0.619 | 0.725 | 0.090 |
V-C4 Effectiveness of Multi-scale Spatiotemporal Recurrent (Eq. 2)
To verify the effectiveness of our spatiotemporal recurrence, we have respectively tested the performance of each decoder layer, i.e., DeCoder_1 to DeCoder_5 in Fig. 2. As shown in Tab. VIII, the last recurrent layer (DeCoder_1) achieves the best result.

V-C5 The Reasons Why We Choose A Small Feature Size For Our Spatial-branch
In SSAV [2] and PDBM [12], the interactions between the spatial and temporal branches are limited: they solely feed the last output of the spatial branch into the temporal branch, and such a "single-scale interaction" has one critical weakness, i.e., their final detections are frequently associated with blurred object boundaries, which is mainly induced by their temporal branches (i.e., ConvLSTM). Therefore, these two works resort to the attention model to compensate for the lost object boundaries, and their dilated convolutions generally must adopt a relatively large feature size (60×60) to ensure the effectiveness of their attention models.
In sharp contrast to SSAV and PDBM, our method introduces "multi-scale" spatial information into the temporal branch by using the side-outputs of different spatial layers (Fig. 2), which ensures detection results with sharp object boundaries. The attention module adopted in our method aims to provide "location information" for the temporal branch, and thus it is totally acceptable to design our spatial branch with a small feature size.
V-D Limitations

Because our method only takes 3 consecutive video frames as input each time, its sensing scope for temporal information is quite limited, leaving those regions that stay static for a long period undetected, see the pictorial demonstrations in Fig. 7. This limitation is also quite common among the SOTA methods, and we believe it may be alleviated by introducing long-term spatiotemporal information, which deserves our future investigation.
VI Conclusion
In this paper, we have proposed an extremely fast end-to-end video saliency detection method. The major highlights of our method can be summarized in three solid aspects: 1) we have devised a lightweight temporal model, which can be inserted into each decoder layer to obtain multi-scale spatiotemporal deep features; 2) we have provided a feasible way to apply a sequence of 3D convolutions to sense the temporal information; 3) we have introduced a fast temporal shuffle scheme to enhance the temporal sensing ability of the 3D convolutions. Also, we have conducted extensive quantitative evaluations to verify the effectiveness of each component of our method, and the quantitative comparisons indicate that our method outperforms the current SOTA methods in both detection performance and speed.
References
- [1] H. Li, G. Chen, G. Li, and Y. Yu, “Motion guided attention for video salient object detection,” ICCV, 2019.
- [2] D. Fan, W. Wang, M. Cheng, and J. Shen, “Shifting more attention to video salient object detection,” in CVPR, 2019, pp. 8554–8564.
- [3] W. Wang, J. Shen, and L. Shao, “Consistent video saliency using local gradient flow optimization and global refinement,” IEEE TIP, vol. 24, no. 11, pp. 4185–4196, 2015.
- [4] Z. Liu, J. Li, L. Ye, G. Sun, and L. Shen, “Saliency detection for unconstrained videos using superpixel-level graph and spatiotemporal propagation,” IEEE TCSVT, vol. 27, no. 12, pp. 2527–2542, 2016.
- [5] C. Chen, G. Wang, C. Peng, X. Zhang, and H. Qin, “Improved robust video saliency detection based on long-term spatial-temporal information,” IEEE TIP, vol. 29, no. 3, pp. 1090–1100, 2019.
- [6] Y. Li, S. Li, C. Chen, H. Qin, and A. Hao, “Accurate and robust video saliency detection via selfpaced diffusion,” IEEE Trans. on Multimedia (TMM), p. early access, 2019.
- [7] G. Ma, C. Chen, S. Li, C. Peng, A. Hao, and H. Qin, “Salient object detection via multiple instance joint re-learning,” IEEE Trans. on Multimedia (TMM), 2019.
- [8] C. Chen, S. Li, H. Qin, and A. Hao, “Structure-sensitive saliency detection via multilevel rank analysis in intrinsic feature space,” IEEE Transactions on Image Processing, vol. 24, no. 8, pp. 2303–2316, 2015.
- [9] C. Chen, J. Wei, C. Peng, W. Zhang, and H. Qin, “Improved saliency detection in rgb-d images using two-phase depth estimation and selective deep fusion,” IEEE Trans. on Image Process. (TIP), vol. 29, pp. 4296–4307, 2020.
- [10] C. Chen, S. Li, and H. Qin, “Robust salient motion detection in non-stationary videos via novel integrated strategies of spatio-temporal coherency clues and low-rank analysis,” Pattern Recognition, vol. 52, pp. 410–432, 2016.
- [11] C. Chen, Y. Li, S. Li, H. Qin, and A. Hao, “A novel bottom-up saliency detection method for video with dynamic background,” IEEE Signal Processing Letters, vol. 25, no. 2, pp. 154–158, 2018.
- [12] H. Song, W. Wang, S. Zhao, J. Shen, and K. Lam, “Pyramid dilated deeper convlstm for video salient object detection,” in ECCV, 2018, pp. 715–731.
- [13] Y. Fang, Z. Wang, W. Lin, and Z. Fang, “Video saliency incorporating spatiotemporal cues and uncertainty weighting,” IEEE TIP, vol. 23, no. 9, pp. 3910–3921, 2014.
- [14] T. Le and A. Sugimoto, “Video salient object detection using spatiotemporal deep features,” IEEE TIP, vol. 27, no. 10, pp. 5002–5015, 2018.
- [15] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer assisted intervention. Springer, 2015, pp. 234–241.
- [16] Z. Liu, X. Zhang, S. Luo, and O. LeMeur, “Superpixel-based spatiotemporal saliency detection,” IEEE TCSVT, vol. 24, no. 9, pp. 1522–1540, 2014.
- [17] W. Wang, J. Shen, and F. Porikli, “Saliency-aware geodesic video object segmentation,” in CVPR, 2015, pp. 3395–3402.
- [18] C. Chen, S. Li, H. Qin, Z. Pan, and G. Yang, “Bilevel feature learning for video saliency detection,” IEEE TMM, vol. 20, no. 12, pp. 3324–3336, 2018.
- [19] C. Chen, S. Li, Y. Wang, H. Qin, and A. Hao, “Video saliency detection via spatial-temporal fusion and low-rank coherency diffusion,” IEEE TIP, vol. 26, no. 7, pp. 3156–3170, 2017.
- [20] W. Wang, J. Shen, and L. Shao, “Video salient object detection via fully convolutional networks,” IEEE TIP, vol. 27, no. 1, pp. 38–49, 2017.
- [21] W. Wang, H. Song, S. Zhao, J. Shen, S. Zhao, S. Hoi, and H. Ling, “Learning unsupervised video object segmentation through visual attention,” in CVPR, 2019, pp. 3064–3074.
- [22] G. Li, Y. Xie, T. Wei, K. Wang, and L. Lin, “Flow guided recurrent neural encoder for video salient object detection,” in CVPR, 2018, pp. 3243–3252.
- [23] X. Zhang, X. Zhou, M. Lin, and J. Sun, “Shufflenet: An extremely efficient convolutional neural network for mobile devices,” in CVPR, 2018, pp. 6848–6856.
- [24] F. Perazzi, J. PontTuset, B. McWilliams, L. VanGool, M. Gross, and A. SorkineHornung, “A benchmark dataset and evaluation methodology for video object segmentation,” in CVPR, 2016, pp. 724–732.
- [25] F. Li, T. Kim, A. Humayun, D. Tsai, and J. Rehg, “Video segmentation by tracking many figure-ground segments,” in ICCV, 2013, pp. 2192–2199.
- [26] P. Ochs, J. Malik, and T. Brox, “Segmentation of moving objects by long term video analysis,” IEEE PAMI, vol. 36, no. 6, pp. 1187–1200, 2013.
- [27] J. Li, C. Xia, and X. Chen, “A benchmark dataset and saliency-guided stacked autoencoders for video-based salient object detection,” IEEE TIP, vol. 27, no. 1, pp. 349–364, 2017.
- [28] Z. Wu, L. Su, and Q. Huang, “Cascaded partial decoder for fast and accurate salient object detection,” in CVPR, 2019, pp. 3907–3916.
- [29] J. Liu, Q. Hou, M. Cheng, J. Feng, and J. Jiang, “A simple pooling-based design for real-time salient object detection,” arXiv preprint arXiv:1904.09569, 2019.
- [30] J. Zhao, J. Liu, D. Fan, Y. Cao, J. Yang, and M. Cheng, “Egnet: Edge guidance network for salient object detection,” arXiv preprint arXiv:1908.08297, 2019.
- [31] S. Li, B. Seybold, A. Vorobyov, X. Lei, and J. Kuo, “Unsupervised video object segmentation with motion-based bilateral networks,” in ECCV, 2018, pp. 207–223.
- [32] Y. Tang, W. Zou, Z. Jin, Y. Chen, Y. Hua, and X. Li, “Weakly supervised salient object detection with spatiotemporal cascade neural networks,” IEEE TCSVT, 2018.
- [33] Y. Chen, W. Zou, Y. Tang, X. Li, C. Xu, and N. Komodakis, “Scom: Spatiotemporal constrained optimization for salient object detection,” IEEE TIP, vol. 27, no. 7, pp. 3345–3357, 2018.
- [34] T. Xi, W. Zhao, H. Wang, and W. Lin, “Salient object detection with spatiotemporal background priors for video,” IEEE TIP, vol. 26, no. 7, pp. 3425–3436, 2016.
- [35] W. C. Tu, S. He, Q. Yang, and S. Chien, “Real time salient object detection with a minimum spanning tree,” in CVPR, 2016, pp. 2334–2342.
- [36] J. Zhang, S. Sclaroff, Z. Lin, X. Shen, B. Price, and R. Mech, “Minimum barrier salient object detection at 80 fps,” in ICCV, 2015, pp. 1404–1412.
- [37] H. Kim, Y. Kim, J. Sim, and C. Kim, “Spatiotemporal saliency detection for video sequences based on random walk with restart,” IEEE TIP, vol. 24, no. 8, pp. 2552–2564, 2015.
- [38] F. Zhou, S. Bing Kang, and M. Cohen, “Time-mapping using space-time saliency,” in CVPR, 2014, pp. 3358–3365.
- [39] E. Rahtu, J. Kannala, M. Salo, and J. Heikkila, “Segmenting salient objects from images and videos,” in ECCV. Springer, 2010, pp. 366–379.
- [40] R. Achanta, S. Hemami, F. Estrada, and S. Susstrunk, “Frequency-tuned salient region detection,” in CVPR, 2009, pp. 1597–1604.
- [41] D. Fan, M. Cheng, Y. Liu, T. Li, and A. Borji, “Structure-measure: A new way to evaluate foreground maps,” in ICCV, 2017, pp. 4548–4557.
- [42] F. Perazzi, P. Krahenbuhl, Y. Pritch, and A. Hornung, “Saliency filters: Contrast based filtering for salient region detection,” in CVPR. IEEE, 2012, pp. 733–740.
- [43] M. Cheng, N. Mitra, X. Huang, P. Torr, and S. Hu, “Global contrast based salient region detection,” IEEE PAMI, vol. 37, no. 3, pp. 569–582, 2014.
- [44] C. Yang, L. Zhang, H. Lu, X. Ruan, and M.-H. Yang, "Saliency detection via graph-based manifold ranking," in CVPR, 2013, pp. 3166–3173.
- [45] Y. Li, X. Hou, C. Koch, J. Rehg, and A. Yuille, “The secrets of salient object segmentation,” in CVPR, 2014, pp. 280–287.
- [46] G. Li and Y. Yu, “Visual saliency based on multiscale deep features,” in CVPR, 2015, pp. 5455–5463.
- [47] L. Wang, H. Lu, Y. Wang, M. Feng, D. Wang, B. Yin, and X. Ruan, “Learning to detect salient objects with image-level supervision,” in CVPR, 2017, pp. 136–145.
- [48] G. Li and Y. Yu, “Visual saliency detection based on multiscale deep cnn features,” IEEE TIP, vol. 25, no. 11, pp. 5012–5024, 2016.
- [49] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox, “Flownet 2.0: Evolution of optical flow estimation with deep networks,” in CVPR, 2017, pp. 2462–2470.
- [50] X. Lu, W. Wang, C. Ma, J. Shen, L. Shao, and F. Porikli, “See more, know more: Unsupervised video object segmentation with co-attention siamese networks,” in CVPR, 2019, pp. 3623–3632.