
Co-occurrence Background Model with Superpixels for Robust Background Initialization

Wenjun Zhou1, Yuheng Deng1, Bo Peng1, Dong Liang2 and Shun’ichi Kaneko3 1School of Computer Science, Southwest Petroleum University, Chengdu, China 610500
Email: [email protected], [email protected]
2Nanjing University of Aeronautics and Astronautics, China 3Hokkaido University, Japan
Abstract

Background initialization is an important step in many high-level video processing applications, ranging from video surveillance to video inpainting. However, this process is often affected by practical challenges such as illumination changes, background motion, camera jitter and intermittent movement. In this paper, we develop a co-occurrence background model with superpixel segmentation for robust background initialization. We first introduce a novel co-occurrence background modeling method called Co-occurrence Pixel-Block Pairs (CPB) to generate a reliable initial background model, and superpixel segmentation is utilized to further acquire the spatial texture information of foreground and background. Then, the initial background can be determined by combining the foreground extraction results with the superpixel segmentation information. Experimental results on the challenging SBMnet benchmark validate its performance under various challenges.

I Introduction

As a widely used technique in computer vision and video processing applications[1, 2], scene background initialization plays an active role in object detection[3], video segmentation[4], video coding[5, 6] and video inpainting[7, 8]. Scene background initialization describes the scene without any foreground objects and generates a clean background to facilitate more efficient follow-up processing in computer vision and video processing applications. Bouwmans et al. surveyed the many traditional and recent approaches that have been proposed for scene background initialization[2], and previous works[9, 10] have analyzed its challenges. However, background initialization still faces several severe practical challenges[11], which include:

  • Illumination changes: for example, light intensity typically varies during the day.

  • Background motion: some movements in a scene should be classified as background, e.g., swaying trees, rippling water, or ever-changing advertising boards.

  • Camera jitter: in video surveillance, camera jitter is a severe issue that must be handled during background initialization.

  • Intermittent movement: a scene with abandoned objects that stop for a short while and then move away. Under this condition, differentiating between the foreground and the abandoned objects is difficult.

Fig. 1 shows typical examples of these challenges.

Figure 1: Typical examples of these challenges: (a) illumination changes, (b) background motion, (c) camera jitter, (d) intermittent movement.

To handle the above challenges, we propose a robust background initialization approach based on the co-occurrence background model (Co-occurrence Pixel-Block Pairs: CPB) with superpixels. CPB has been described in our previous work[12, 13]. As an intuitive and robust background model, CPB was originally designed for foreground detection under dramatic background changes, such as illumination changes and background motion. Here, CPB is utilized as the background model for scene background initialization. Then, in order to further obtain the spatial texture information of foreground and background for efficient background generation, the superpixel algorithm simple linear iterative clustering (SLIC) [14] is introduced to exploit the spatial correlations and temporal motion differences between foreground and background for motion detection. The main contributions of this work are as follows:

  1. The proposed approach effectively acquires the spatial-temporal information of foreground and background and sensitively distinguishes the difference between them, so it is highly efficient for motion detection in a scene under complex challenges, especially strong background changes (e.g., illumination changes and background motion) or intermittent motion.

  2. The proposed approach provides a low-complexity and efficient strategy for robust background initialization. In particular, compared with neural network (NN) based approaches[15, 16], it has low cost because it can be trained without any teacher signals.

The rest of this paper is organized as follows. The proposed approach is described in Section II. Section III analyzes the experimental results on the SBMnet dataset[17]. Conclusions are drawn in Section IV.

II Methodology

In this section, the proposed approach is described in detail. It consists of three steps: (1) CPB background modeling; (2) motion detection; (3) background generation, as shown in Fig. 2.

Figure 2: Overview of background initialization by the proposed approach.

II-A Co-occurrence Background Model

The working diagram of CPB background modeling is illustrated in Fig. 3 and consists of a training process and a detecting process. In this work, the target pixel $p$ is compared with blocks $Q^B$, and we define $\{Q_k^B\}_{k=1,2,\dots,K}=\{Q_1^B, Q_2^B, \dots, Q_K^B\}$ to denote the set of supporting blocks for the target pixel $p$. Each frame is divided into blocks $Q^B$ of size $m \times n$ pixels:

$Q^{B}=\begin{Bmatrix} Q_{11} & Q_{12} & \dots & Q_{1n} \\ Q_{21} & Q_{22} & \dots & Q_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ Q_{m1} & Q_{m2} & \dots & Q_{mn} \end{Bmatrix}.$ (1)
Figure 3: Working diagram of the CPB background model, using the PETS2001 dataset as a demonstration.
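As an illustration, the following is a minimal sketch of this block division, assuming grayscale frames stored as NumPy arrays; the $8 \times 8$ block size follows the experimental setting in Section III, and the function name is ours.

```python
import numpy as np

def block_means(frame: np.ndarray, m: int = 8, n: int = 8) -> np.ndarray:
    """Divide a frame into m x n blocks and return each block's mean intensity."""
    H, W = frame.shape
    # Crop so the frame divides evenly into blocks; boundary handling is
    # an assumption, as the paper does not specify it.
    frame = frame[:H - H % m, :W - W % n]
    blocks = frame.reshape(H // m, m, W // n, n)
    return blocks.mean(axis=(1, 3))  # one mean intensity per block Q^B

if __name__ == "__main__":
    frame = np.random.randint(0, 256, (240, 320)).astype(np.float64)
    print(block_means(frame).shape)  # (30, 40): a 320x240 frame yields 1200 blocks
```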

Background changes in a scene can affect the current intensity of the target pixel $p$ in foreground detection. Hence, it is quite natural that a block $Q^B$ that is strongly correlated with the target pixel $p$ can be used to determine the state of the latter. Block $Q^B$ can be introduced as a reference to estimate the current intensity of the target pixel $p$; that is, there exists a correlation between pixel $p$ and block $Q^B$: $I_p = \bar{I}_Q + \Delta_k$, where $\bar{I}_Q$ is the average intensity of block $Q^B$ in the current frame. In order to reduce the risk of individual errors and obtain a robust background model, it is necessary to select a sufficient number of blocks $Q^B$ with high correlation as supporting blocks, defined as follows:

$\{Q_{k}^{B}\}_{k=1,2,\dots,K} = \{Q^{B} \mid \gamma(p, Q^{B}) \text{ is among the } K \text{ highest}\},$ (2)

where

$\gamma(p, Q_{k}^{B}) = \dfrac{C_{p,\bar{Q}_{k}}}{\sigma_{p}\,\sigma_{\bar{Q}_{k}}},$ (3)

where $\gamma$ is Pearson's product-moment correlation coefficient, $C_{p,\bar{Q}_k}$ is the covariance between the intensity of pixel $p$ and the mean intensity of block $Q_k^B$ over the training frames, and $\sigma_p$, $\sigma_{\bar{Q}_k}$ are the corresponding standard deviations. Then, a Gaussian model is used to construct the co-occurrence model for each pixel-block pair:

$\Delta_{k} \sim N(b_{k}, \sigma_{k}^{2}), \quad \Delta_{k} = I_{p} - \bar{I}_{Q_{k}},$ (4)

where $I_p$ is the intensity of pixel $p$ at frame $t$ and $\bar{I}_{Q_k}$ is the average intensity of block $Q_k^B$ at frame $t$. The background model is built as a list $[I^{P}, u_{k}, v_{k}, b_{k}, \sigma_{k}]$, where $I^{P}$ is the average intensity of the target pixel $p$ over the $T$ training frames ($T$ is the number of training frames) and $(u_{k}, v_{k})$ are the coordinates of the supporting blocks.
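To make the training step concrete, here is a hedged sketch, for a single target pixel, of selecting the supporting blocks by Pearson correlation (Eqs. (2)-(3)) and fitting the Gaussian difference model (Eq. (4)); the variable and function names are ours.

```python
import numpy as np

def train_cpb_pixel(pixel_series: np.ndarray, block_series: np.ndarray, K: int = 20):
    """pixel_series: (T,) intensities of target pixel p over T training frames.
    block_series: (T, B) mean intensities of the B candidate blocks.
    Returns supporting-block indices and the fitted (b_k, sigma_k) per pair."""
    T, B = block_series.shape
    # Pearson correlation gamma(p, Q^B) between the pixel's intensity series
    # and each block-mean series (Eq. (3)).
    gamma = np.array([np.corrcoef(pixel_series, block_series[:, j])[0, 1]
                      for j in range(B)])
    support = np.argsort(gamma)[-K:]  # the K highest correlations (Eq. (2))
    # Delta_k = I_p - mean(I_{Q_k}) per training frame, one column per pair.
    deltas = pixel_series[:, None] - block_series[:, support]
    b_k = deltas.mean(axis=0)      # Gaussian mean of Eq. (4)
    sigma_k = deltas.std(axis=0)   # Gaussian standard deviation of Eq. (4)
    return support, b_k, sigma_k
```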

In the detecting process, we use a correlation dependent decision to identify the state of the target pixel $p$, as shown in Fig. 3; more details are described in[13].
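As one plausible reading of that decision (the exact rule is given in [13]), each supporting block can vote on whether the observed difference $\Delta_k$ still fits its trained Gaussian model, with $\eta$ and $\lambda$ taken from Table II; this simplified sketch is our assumption, not the exact implementation.

```python
import numpy as np

def is_foreground(I_p: float, block_means_t: np.ndarray, support: np.ndarray,
                  b_k: np.ndarray, sigma_k: np.ndarray,
                  eta: float = 2.5, lam: float = 0.5) -> bool:
    """Classify target pixel p at the current frame t (simplified vote)."""
    deltas = I_p - block_means_t[support]  # observed Delta_k at frame t
    # A supporting block votes "background" when Delta_k lies within eta
    # standard deviations of its trained Gaussian model.
    votes = np.abs(deltas - b_k) <= eta * sigma_k
    return votes.mean() < lam  # foreground when too few pairs agree
```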

Figure 4: Representative results on different challenging sequences.
TABLE I: Results on different challenges from the SBMnet dataset (the best result per metric within each challenge is marked with *)

Challenge              Method                      AGE       pEPs      pCEPs     MS-SSIM   PSNR      CQM
Basic                  LaBGen-OF                   1.8388    0.0026    0.0017    0.9899    34.6563   35.4184
                       MSCL                        2.3728    0.0027    0.0016    0.9866    34.0810   34.7595
                       FSBE                        3.0236    0.0055    0.0035    0.9821    33.6317   34.2344
                       LaBGen-P-Semantic (MP+U)    1.9743    0.0024    0.0015    0.9899    34.8111   35.5647
                       SPMD                        2.1919    0.0004    0.0000*   0.9935    38.6807   38.9381
                       Our approach                1.4275*   0.0002*   0.0000*   0.9983*   42.3151*  42.2216*
Illumination changes   LaBGen-OF                   19.6355   0.4062    0.2597    0.9346    19.4204   20.9417
                       MSCL                        2.8098*   0.0043*   0.0000*   0.9913*   34.9208*  35.5259*
                       FSBE                        6.6733    0.0177    0.0002    0.9817    29.2464   30.2773
                       LaBGen-P-Semantic (MP+U)    17.6197   0.2733    0.1829    0.8641    18.4939   20.0830
                       SPMD                        6.0889    0.0540    0.0129    0.9755    26.9955   28.1438
                       Our approach                15.2618   0.1657    0.0130    0.9451    21.3651   22.3365
Background motion      LaBGen-OF                   1.7604    0.0022    0.0005    0.9893    38.6184   39.0805
                       MSCL                        2.1299    0.0016    0.0005    0.9962    36.6006   36.8315
                       FSBE                        1.8453    0.0029    0.0003    0.9814    37.9984   37.9817
                       LaBGen-P-Semantic (MP+U)    1.5156*   0.0000*   0.0000*   0.9970*   41.4472*  41.4719*
                       SPMD                        2.2313    0.0035    0.0002    0.9823    36.8531   36.1390
                       Our approach                1.7742    0.0000*   0.0000*   0.9965    39.7339   39.9130
Camera jitter          LaBGen-OF                   11.9868   0.1590    0.0267    0.8719    20.2275   21.7778
                       MSCL                        5.8660    0.0471    0.0067    0.9699    26.0077   27.1642
                       FSBE                        10.1060   0.1413    0.0283    0.9003    22.5280   23.8107
                       LaBGen-P-Semantic (MP+U)    11.1637   0.1466    0.0281    0.8619    20.4535   21.8627
                       SPMD                        1.3573*   0.0001*   0.0000*   0.9979*   42.1226*  42.1988*
                       Our approach                9.4038    0.1205    0.0133    0.9235    22.6436   24.0308
Intermittent movement  LaBGen-OF                   2.3248    0.0043    0.0021    0.9948    36.5121   36.8640
                       MSCL                        1.8481    0.0026    0.0011    0.9943    37.9796   38.1597
                       FSBE                        3.8068    0.0263    0.0173    0.9432    27.9022   28.9156
                       LaBGen-P-Semantic (MP+U)    2.1082    0.0031    0.0016    0.9945    37.5222   37.7290
                       SPMD                        2.1629    0.0032    0.0017    0.9940    37.2778   37.5754
                       Our approach                1.6250*   0.0012*   0.0000*   0.9957*   38.4293*  38.7184*

II-B Motion Detection Combined with Superpixels

Superpixel segmentation has attracted interest in many computer vision applications, as it provides an effective strategy to estimate image features and reduce the complexity of subsequent image processing tasks[18]. Superpixels have been applied in various fields, including object recognition[19, 20], image segmentation[21] and object tracking[22].

Because most optical flow techniques assume a smooth motion field [23], the estimated motion near the boundary between foreground and background tends to be over-smoothed and blurred. Motion boundaries are the most important regions, and incorrect motion estimates near these areas often lead to incorrect motion estimation results. For effective motion estimation in a scene, we introduce a superpixel segmentation algorithm into the proposed approach to further acquire and differentiate the spatial texture information of foreground and background[24, 11]. Here, the SLIC algorithm[14] is utilized on account of its low computational complexity and high memory efficiency.

The steps of motion detection are as follows:

  1. Record the pixels $\{p(x_i, y_j)\}$ of the foreground detected by CPB;

  2. Estimate the set $V$ of superpixel regions $S$ that cover these pixels $\{p(x_i, y_j)\}$;

  3. Detect the motion and acquire the motion mask $M$, where, for every pixel $p(x, y)$ in the current frame,

     $m(x, y) = \begin{cases} 1 & \text{if } p(x, y) \in V \\ 0 & \text{otherwise} \end{cases}.$ (5)

The motion mask is $M = \{m(x, y)\}$. With the help of superpixel segmentation, the proposed approach can further acquire the spatial information of each pixel and distinguish the different motion information of foreground and background. On this basis, the proposed approach reinforces the original CPB in extracting motion and avoids errors when extracting information from individual pixels.
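A minimal sketch of this mask construction follows, using scikit-image's SLIC; reading $V$ as the set of superpixels that sufficiently overlap the CPB foreground is our assumption (the overlap ratio of 0.5 is illustrative, as is the function name).

```python
import numpy as np
from skimage.segmentation import slic

def motion_mask(frame_rgb: np.ndarray, cpb_foreground: np.ndarray,
                n_segments: int = 400, overlap: float = 0.5) -> np.ndarray:
    """frame_rgb: (H, W, 3) image; cpb_foreground: (H, W) boolean CPB result."""
    labels = slic(frame_rgb, n_segments=n_segments, compactness=10)
    mask = np.zeros(cpb_foreground.shape, dtype=bool)
    for s in np.unique(labels):
        region = labels == s
        # Mark the whole superpixel as motion when enough of its pixels
        # were detected as foreground by CPB (Eq. (5): p(x, y) in V).
        if cpb_foreground[region].mean() >= overlap:
            mask |= region
    return mask  # M = {m(x, y)}
```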

II-C Final Background Generation

Finally, we replace the regions covered by the motion mask with the initial CPB background model to generate the final background, as shown in Fig. 3.
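In code, this replacement step can be as simple as the sketch below, assuming the initial CPB background model is available as an image of the same size (names are illustrative).

```python
import numpy as np

def generate_background(current_frame: np.ndarray, cpb_background: np.ndarray,
                        mask: np.ndarray) -> np.ndarray:
    """Fill motion-mask regions from the initial CPB background model."""
    out = current_frame.copy()
    out[mask] = cpb_background[mask]  # replace masked regions only
    return out
```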

III Experiments

III-A Experiment Setup

In order to evaluate the proposed approach fairly and without loss of generality, we consider several challenges faced by background initialization algorithms[17]. The following sequences are selected from SBMnet for evaluation:

  • Basic: PETS2006, a mixture of mild challenges such as shadows and intermittent movement.

  • Illumination changes: Dataset3Camera2, with illumination changes over the course of the day.

  • Background motion: advertisementBoard contains an ever-changing advertising board in the scene.

  • Camera jitter: boulevard, containing videos captured by an unstable outdoor camera.

  • Intermittent movement: the sofa sequence, in which abandoned objects move, stop for a short while, and then move again.

III-B Evaluation Measurement

Six metrics commonly used to evaluate background initialization algorithms [17, 11] are adopted for performance evaluation in this paper. They are explained as follows:

  • AGE (Average Gray-level Error): average of the absolute difference between GT and BI.

  • pEPs (Percentage of Error Pixels): percentage of pixels in BI whose value differs from the value of the corresponding pixel in GT by more than a threshold $\tau$, which is set to 20 in[17].

  • pCEPs (Percentage of Clustered Error Pixels): percentage of CEPs (number of pixels whose 4-connected neighbors are also error pixels) with respect to the total number of pixels in the image.

  • PSNR (Peak Signal-to-Noise Ratio): widely used to measure the quality of BI compared with GT, defined as $PSNR = 10 \cdot \log_{10}\left(\dfrac{255^{2}}{MSE}\right)$.

  • MS-SSIM (Multi-scale Structural Similarity Index): estimate of the perceived visual distortion defined in [25].

  • CQM (Color image Quality Measure): defined in [26]. It takes values in dB, and the higher the CQM value, the better the background estimate.

Here, GT denotes the ground-truth background image and BI denotes the background image generated by a background initialization approach.
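For reference, a minimal sketch of four of these metrics is given below, assuming 8-bit grayscale GT and BI arrays and the threshold $\tau = 20$ from [17]; the boundary handling in pCEPs is our simplification.

```python
import numpy as np

def evaluate(gt: np.ndarray, bi: np.ndarray, tau: float = 20.0):
    diff = np.abs(gt.astype(np.float64) - bi.astype(np.float64))
    age = diff.mean()                  # AGE: mean absolute gray-level error
    eps = diff > tau                   # error pixels
    peps = eps.mean()                  # pEPs: percentage of error pixels
    # Clustered error pixels: error pixels whose four 4-connected
    # neighbours are also error pixels (image border ignored here).
    ceps = (eps[1:-1, 1:-1] & eps[:-2, 1:-1] & eps[2:, 1:-1]
            & eps[1:-1, :-2] & eps[1:-1, 2:])
    pceps = ceps.sum() / eps.size      # pCEPs over the total pixel count
    psnr = 10 * np.log10(255.0 ** 2 / (diff ** 2).mean())  # PSNR in dB
    return age, peps, pceps, psnr
```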

III-C Result Evaluation

In this section, the proposed approach is compared with five state-of-the-art techniques selected from the SBMnet benchmark: LaBGen-OF[27], MSCL[28], FSBE[29], LaBGen-P-Semantic (MP+U) [30] and SPMD[11]. Four of them are leading techniques for background initialization on the SBMnet benchmark, especially MSCL[28], which is currently the top-ranked technique. All results of these five techniques come from the SBMnet benchmark.

In the experiments, each block is set to $8 \times 8$ pixels with an input frame size of $320 \times 240$ for CPB. All parameters are listed in Table II, and a detailed discussion of the parameters can be found in[12]. Experimental results of background initialization are presented in Fig. 4, and Table I lists the overall evaluation of these approaches on the different challenges. As shown in Fig. 4 and Table I, our approach outperforms the other techniques on the Basic and Intermittent Movement challenges, and on Background Motion it performs close to LaBGen-P-Semantic (MP+U), the best technique for that challenge. On the other two challenges, our approach achieves intermediate performance compared with the other techniques, which is acceptable. The comparison shows that our approach is robust and effective for background initialization under different challenges.

The processing time for background initialization is close to 0.15 seconds for a frame size of $320 \times 240$ on the MATLAB platform (Intel i7 at 2.40 GHz, 16 GB RAM).

TABLE II: Parameter settings in CPB

Parameter                                                   Value
Number of supporting blocks $K$                             20
Threshold of the Gaussian model $\eta$                      2.5
Threshold of the correlation dependent decision $\lambda$   0.5

IV Conclusions

In this paper, we propose a new approach for robust background initialization in complex scenes based on the co-occurrence background model (CPB) with superpixel segmentation. It is designed to handle the severe challenges in background initialization, such as illumination changes, background motion, camera jitter and intermittent movement. Video sequences contain temporal context information, which the CPB model learns from the training data to resist interference in the scene. Furthermore, superpixel segmentation helps acquire additional spatial texture information to facilitate the motion differentiation between foreground and background. The experimental results under different challenges validate the comprehensive performance of the proposed approach. More details, including the source code, are released at: https://github.com/zwj1archer/CPB-superpixel.git.

Acknowledgment

This work is supported by the scientific research starting project of SWPU (No. 2019QHZ017).

References

  • [1] T. Bouwmans, “Traditional and recent approaches in background modeling for foreground detection: An overview,” Computer Science Review, vol. 11, pp. 31–66, 2014.
  • [2] T. Bouwmans, L. Maddalena, and A. Petrosino, “Scene background initialization: A taxonomy,” Pattern Recognition Letters, vol. 96, pp. 3–11, 2017.
  • [3] X. Zhang, C. Zhu, S. Wang, Y. Liu, and M. Ye, “A Bayesian approach to camouflaged moving object detection,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 27, no. 9, pp. 2001–2013, 2017.
  • [4] C. Chiu, M. Ku, and L. Liang, “A robust object segmentation system using a probability-based background extraction algorithm,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 20, no. 4, pp. 518–528, 2010.
  • [5] M. Paul, “Efficient video coding using optimal compression plane and background modelling,” IET Image Processing, vol. 6, no. 9, pp. 1311–1318, 2012.
  • [6] X. Li et al., “Background-foreground information based bit allocation algorithm for surveillance video on high efficiency video coding (HEVC),” in 2016 Visual Communications and Image Processing (VCIP).   IEEE, 2016, pp. 1–4.
  • [7] A. Colombari, M. Cristani, V. Murino, and A. Fusiello, “Exemplar-based background model initialization,” in Proceedings of the third ACM international workshop on Video surveillance & sensor networks, 2005, pp. 29–36.
  • [8] X. Chen, Y. Shen, and Y. H. Yang, “Background estimation using graph cuts and inpainting,” in Proceedings of Graphics Interface 2010, 2010, pp. 97–103.
  • [9] L. Maddalena and A. Petrosino, “Towards benchmarking scene background initialization,” in International conference on image analysis and processing.   Springer, 2015, pp. 469–476.
  • [10] P.-M. Jodoin, L. Maddalena, A. Petrosino, and Y. Wang, “Extensive benchmark and survey of modeling methods for scene background initialization,” IEEE Transactions on Image Processing, vol. 26, no. 11, pp. 5244–5256, 2017.
  • [11] Z. Xu, B. Min, and R. C. Cheung, “A robust background initialization algorithm with superpixel motion detection,” Signal Processing: Image Communication, vol. 71, pp. 1–12, 2019.
  • [12] W. Zhou, S. Kaneko, D. Liang, M. Hashimoto, and Y. Satoh, “Background subtraction based on co-occurrence pixel-block pairs for robust object detection in dynamic scenes,” IIEEJ Transactions on Image Electronics and Visual Computing, vol. 5, no. 2, pp. 146–159, 2017.
  • [13] W. Zhou, S. Kaneko, M. Hashimoto, Y. Satoh, and D. Liang, “Foreground detection based on co-occurrence background model with hypothesis on degradation modification in dynamic scenes,” Signal Processing, vol. 160, pp. 66–79, 2019.
  • [14] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk, “SLIC superpixels compared to state-of-the-art superpixel methods,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 11, pp. 2274–2282, 2012.
  • [15] P. Xu, M. Ye, Q. Liu, X. Li, L. Pei, and J. Ding, “Motion detection via a couple of auto-encoder networks,” in 2014 IEEE International Conference on Multimedia and Expo (ICME).   IEEE, 2014, pp. 1–6.
  • [16] I. Halfaoui, F. Bouzaraa, and O. Urfalioglu, “CNN-based initial background estimation,” in 2016 23rd International Conference on Pattern Recognition (ICPR).   IEEE, 2016, pp. 101–106.
  • [17] “A dataset for testing background estimation algorithms,” http://scenebackgroundmodeling.net/.
  • [18] M. Wang, X. Liu, Y. Gao, X. Ma, and N. Q. Soomro, “Superpixel segmentation: A benchmark,” Signal Processing-image Communication, vol. 56, pp. 28–39, 2017.
  • [19] H. Lu, X. Feng, X. Li, and L. Zhang, “Superpixel level object recognition under local learning framework,” Neurocomputing, vol. 120, pp. 203–213, 2013.
  • [20] D. Giordano, F. Murabito, S. Palazzo, and C. Spampinato, “Superpixel-based video object segmentation using perceptual organization and location prior,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 4814–4822.
  • [21] T. Lei, X. Jia, Y. Zhang, S. Liu, H. Meng, and A. K. Nandi, “Superpixel-based fast fuzzy c-means clustering for color image segmentation,” IEEE Transactions on Fuzzy Systems, vol. 27, no. 9, pp. 1753–1766, 2018.
  • [22] F. Yang, H. Lu, and M.-H. Yang, “Robust superpixel tracking,” IEEE Transactions on Image Processing, vol. 23, no. 4, pp. 1639–1651, 2014.
  • [23] S. Baker, D. Scharstein, J. Lewis, S. Roth, M. J. Black, and R. Szeliski, “A database and evaluation methodology for optical flow,” International journal of computer vision, vol. 92, no. 1, pp. 1–31, 2011.
  • [24] J. Lim and B. Han, “Generalized background subtraction using superpixels with label integrated motion estimation,” in European Conference on Computer Vision.   Springer, 2014, pp. 173–187.
  • [25] Z. Wang, E. P. Simoncelli, and A. C. Bovik, “Multiscale structural similarity for image quality assessment,” in The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, vol. 2.   IEEE, 2003, pp. 1398–1402.
  • [26] Y. Yalman and İ. Ertürk, “A new color image quality measure based on YUV transformation and PSNR for human vision system,” Turkish Journal of Electrical Engineering & Computer Sciences, vol. 21, no. 2, pp. 603–612, 2013.
  • [27] B. Laugraud and M. Van Droogenbroeck, “Is a memoryless motion detection truly relevant for background generation with LaBGen?” in International Conference on Advanced Concepts for Intelligent Vision Systems.   Springer, 2017, pp. 443–454.
  • [28] S. Javed, A. Mahmood, T. Bouwmans, and S. K. Jung, “Background–foreground modeling based on spatiotemporal sparse subspace clustering,” IEEE Transactions on Image Processing, vol. 26, no. 12, pp. 5840–5854, 2017.
  • [29] A. Djerida, Z. Zhao, and J. Zhao, “Robust background generation based on an effective frames selection method and an efficient background estimation procedure (fsbe),” Signal Processing: Image Communication, vol. 78, pp. 21–31, 2019.
  • [30] B. Laugraud, S. Piérard, and M. Van Droogenbroeck, “LaBGen-P-Semantic: A first step for leveraging semantic segmentation in background generation,” Journal of Imaging, vol. 4, no. 7, p. 86, 2018.