Weakly Supervised Realtime Dynamic Background Subtraction

Fateme Bahri and Nilanjan Ray
Department of Computing Science
University of Alberta, Canada
{fbahri,nray1}@ualberta.ca

Abstract

Background subtraction is a fundamental task in computer vision with numerous real-world applications, ranging from object tracking to video surveillance. However, dynamic backgrounds can pose a significant challenge in this problem. While various methods have been proposed for background subtraction, supervised deep learning-based techniques are currently considered state-of-the-art. However, these methods require pixel-wise ground-truth labeling, which can be time-consuming and expensive. In this work, we propose a weakly supervised framework that can perform background subtraction without requiring per-pixel ground-truth labels. Our framework is trained on a moving object-free sequence of images and comprises two networks. The first network is an autoencoder that generates static background images and prepares dynamic background images for training the second network. The dynamic background images are obtained by thresholding the background-subtracted images. The second network is a U-Net that uses the same moving object-free video for training and the dynamic background images as pixel-wise ground-truth labels. During the test phase, the input images are processed by the autoencoder and U-Net, which generate static and dynamic background images, respectively. The dynamic background image helps remove dynamic motion from the background subtracted image, enabling us to obtain a foreground image that is free of dynamic artifacts. To demonstrate the effectiveness of our method, we conducted experiments on selected categories of the CDnet 2014 dataset and the I2R dataset. Our method outperformed all top-ranked unsupervised methods. It also surpassed one of the two existing weakly supervised methods, while achieving comparable results to the other method but with a shorter running time. Our proposed method is online, realtime, efficient, and requires minimal frame-level annotation, making it suitable for a wide range of real-world applications.

I Introduction

Background subtraction is a crucial problem in computer vision that has practical applications in various domains like video surveillance, human-computer interaction, traffic monitoring, and autonomous navigation [1, 2]. Dealing with dynamic backgrounds is a significant challenge in background subtraction, where a background pixel’s value can change due to periodical or irregular movements [3]. Although various methods have been proposed for background subtraction, not all of them can effectively handle sequences with dynamic backgrounds. Scenes with dynamic elements like fountains, waving trees, and water motions are prime examples of dynamic backgrounds. Detecting these dynamic variations as parts of the foreground negatively impacts the performance of the methods.

Statistical methods are among the simplest approaches to dealing with dynamic backgrounds. These methods utilize statistical modeling of pixel value distributions. Examples include the Gaussian mixture model (GMM) [4] and its improved variations [5, 6, 7, 8], as well as kernel density estimation (KDE) [9].

Another class of methods involves dynamically adjusting their parameters through feedback mechanisms. The SuBSENSE method [10] was the pioneering method in this category, and it has inspired several other methods, including PAWCS [11], SWCD [12], and CVABS [13].

Because of the effectiveness of deep learning methods in computer vision, numerous neural network models have been developed for the purpose of foreground and background segmentation. These models have ranked highly among the evaluated methods in the CDnet 2014 dataset. However, they require supervised training, which involves manual annotation at the pixel level. This process is time-consuming and expensive, and may not be practical for every situation.

In recent years, weakly supervised methods have gained popularity and have demonstrated impressive performance in various tasks. One of their primary advantages is that they achieve satisfactory performance without relying on costly pixel-wise ground truth annotations. In the context of background subtraction, two new methods have been proposed by Zhang et al. [14] and Minematsu et al. [15]. Both methods are trained using frame-level labels, which is a less demanding and more cost-effective labeling approach compared to pixel-level annotation.

We present a new approach to background subtraction that learns the dynamic background component in a weakly supervised manner. It uses a fully connected autoencoder and a U-Net convolutional neural network [16]. To explain the overall working principle, let us first consider the scenario where no moving object is present in a sequence. In this case, the autoencoder takes in the sequence and produces the static background images. The difference between the input image and the output of the autoencoder contains dynamic clutter. The U-Net takes in the same sequence of images and is expected to produce only the dynamic clutter, which in this paper is referred to as the dynamic background image. So, when we subtract the autoencoder output from the input image and multiply it with the inverted output of the U-Net, we ideally obtain a zero image showing no moving objects or dynamic background. In the second scenario, when moving objects are present, the autoencoder output again contains static background and the output of the U-Net is still expected to produce only a dynamic background image. So, when we subtract the output of the autoencoder from the input image and multiply it with the inverted output of the U-Net, it will show the moving objects only.

The autoencoder is trained on a moving object-free sequence to produce static background images that capture some of the temporal and spatial variations in the scene. We obtain a binary dynamic background image by subtracting autoencoder output from the input image and applying a threshold. Then, We train the U-Net on the same object-free sequence using the binary dynamic background images from the previous step as the target output for the U-Net. Thus, the U-Net learns the temporal and spatial variations of the dynamic background in the scene.

During the testing phase, the autoencoder and U-Net generate the static and dynamic background images, respectively. Multiplying the inverted dynamic background with the static background-subtracted image produces the foreground image. Finally, we apply pixel-wise thresholding to the foreground image and use some simple post-processing techniques to enhance the final result. Fig. 1 illustrates an overview of our proposed method.

Refer to caption — Figure 1: Overview of the proposed method. The top two boxes show the training phase and the below box shows the test phase. To handle dynamic background, our method combines an autoencoder and a U-Net. The autoencoder and U-Net are trained to generate the static and dynamic background images, respectively. By subtracting the autoencoder’s output from the input image and multiplying it with the inverted output of the U-Net, we can obtain an image that shows only the moving objects, free from dynamic artifacts.

Our proposed approach offers several key contributions. First, it is a practical and cost-effective weakly supervised framework, which eliminates the need for costly pixel-wise annotations by using only an object-free training sequence. Second, it effectively learns and predicts the dynamic background component in each image and segments it from the foreground. Finally, our experimental results show that the proposed algorithm achieves superior performance in dynamic background scenes compared to other state-of-the-art unsupervised and weakly supervised methods, both quantitatively and qualitatively.

The paper is structured as follows: Section II reviews other background subtraction methods. Section III provides a detailed explanation of our framework, including its training and test phases. In Section IV, we present our implementation details, and experimental results, and compare them with state-of-the-art methods. Finally, Section V concludes the paper with a summary of our findings and future research directions.

II Related Works

II-A Statistical Methods

There is a category of methods that use statistical approaches based on probability density estimation of pixel values. The most basic of these is the single Gaussian model [17]. However, this approach has limitations as a single function cannot account for all variations in pixel values. To overcome this, the Gaussian mixture model (GMM) [4] was proposed, which uses several Gaussian densities. Various improved versions of this traditional and widely used method have been presented [5, 6, 7, 8] with better results. Flux Tensor with Split Gaussian models (FTSG) [18] is a state-of-the-art method that uses flux tensor-based motion segmentation and GMM-based background modeling separately and then merges the results. Finally, it enhances the results using a multi-cue appearance comparison. However, parametric methods such as GMM and its successors are unable to handle sudden changes in a scene. To address this issue, a statistical non-parametric algorithm called KDE [9] was introduced, which uses kernel density estimation to model the probability of pixel values.

II-B Methods Based on Dynamic Feedback Mechanism

One of the main categories of methods for background modeling involves using controller parameters that update the background model based on dynamic feedback mechanisms. One such method, SuBSENSE [10], incorporates color channel intensity and spatiotemporal binary features and adjusts its parameters using pixel-wise feedback loops based on segmentation noise. PAWCS [11], a newer and more advanced method, extends the capabilities of SuBSENSE by generating a strong and persistent dictionary model based on spatiotemporal features and color. Similar to SuBSENSE, PAWCS also employs automatic feedback mechanisms to adjust its parameters. Another method, SWCD [12], combines the dynamic controllers of SuBSENSE with a sliding window approach to update background frames. Finally, CVABS [13], a recent subspace-based method, utilizes dynamic self-adjustment mechanisms like SuBSENSE and PAWCS to update the background model.

II-C Ensemble Methods

Ensemble methods have emerged as a new approach for change detection algorithms. The authors of [19, 20] have recently introduced a method named IUTIS (In Unity There Is Strength) that utilizes genetic programming (GP) to combine different algorithms and maximize their individual strengths. By selecting the best methods, combining them in various ways, and applying appropriate post-processing techniques, GP enables IUTIS to achieve high performance. The method shows promising performance by integrating several top-ranked methods evaluated on CDnet 2014 ([21]).

II-D Deep Learning Methods

Several deep neural networks (NN) have been proposed in recent years for foreground segmentation, owing to the success of deep learning in computer vision. FgSegNet and its variations [22, 23, 24] are presently considered state-of-the-art based on their performance on CDnet 2014. Motion U-Net [25] is another deep NN method that requires fewer parameters than FgSegNet. BSPVGAN [26] employs Bayesian Generative Adversarial Networks (GANs) to create a background subtraction model. Another technique called Cascade CNN [27] uses a multi-resolution convolutional neural network (CNN) to segment moving objects. In DeepBS [28], a CNN is trained using patches of input images, which are then merged to reconstruct the frame. Temporal and spatial median filtering is utilized to enhance the segmentation outcomes. Another supervised approach, BSUV-Net [29, 30], is trained on some image sequences along with their spatiotemporal data augmentations, and exhibits good performance on unseen videos after training. Among the assessed methods on CDnet 2014, the aforementioned neural network techniques are ranked at the top. However, they require supervised training, which entails pixel-wise annotated ground truth, a time-consuming and impractical task in many situations.

A number of recently developed techniques, including SemanticBGS [31] and its variations RT-SBS-v1 and RT-SBS-v2 [32], integrate semantic segmentation with background subtraction methods. They employ the information from a semantic segmentation algorithm to obtain a pixel-wise probability that enhances the output of any background subtraction method. However, we cannot compare them to our method because they rely on pixel-level information as input, even though they are not trained using ground-truth labels.

II-E Weakly Supervised Methods

In recent years, some weakly supervised methods have emerged that solely rely on image-level tags, which indicate whether a foreground object is present in the image [14, 15]. The method proposed in [15] generates a binary mask image by subtracting a background image from an input image to identify foreground regions. It then uses intermediate feature maps of a CNN to refine the foreground locations. However, image-level supervision presents a challenge due to the lack of location information in training the network. To address this, the method introduces some constraints that help to locate foreground pixels.

Another recent technique, LDB [14], adopts a tensor-based decomposition framework to represent the background as a low-rank tensor and classify the sparse noise as foreground. Additionally, it trains a two-stream neural network using an object-free video to explicitly learn the dynamic background. The dynamic background component of LDB leads to a more precise decomposition of the background and foreground, making it the current state-of-the-art method in weakly supervised moving object detection, particularly in dynamic background scenes.

Our proposed method is also based on explicit modeling of the dynamic background using a neural network. However, our framework differs from LDB in that it relies exclusively on neural networks for the segmentation of the background and foreground, rather than using a low-rank-based approach. As a result, we gain significant advantages in reducing the running time once our networks are trained. Further, LDB uses a very light CNN to model dynamic background. Instead, we use a U-Net to model the same. Because of a large number of parameters, U-Net has significant representative power, and it overfits the scene sequence to generate the dynamic background. We make use of this overfitting, because for a sequence containing moving objects the U-Net should ignore the moving objects and output only the dynamic background.

III Proposed Method

Our proposed method, depicted in Fig. 1, is based on two neural networks and is trained in a weakly supervised way. The first network is an autoencoder that generates static background images and is trained in an unsupervised manner. The second network is a U-Net [16], which requires pixel-wise labels for training. Using the background images generated by the autoencoder, ground-truth labels for the U-Net are acquired. In the following sections, we explain the training and testing phases, as well as the roles of each network in our framework.

III-A Background Generation

Our framework uses an autoencoder to generate static background images. Autoencoders are a type of neural network that consists of two components: an encoder and a decoder. The encoder maps the input to a compressed code, and the decoder reconstructs the input from the code, aiming to make the output as close to the input as possible [33]. Consequently, autoencoders learn a compressed and meaningful representation of the input data. This results in the removal of insignificant data and noise from the reconstructed input [34].

The autoencoder used in our method for generating the static background images is a fully connected one with dense layers, each followed by a SELU [35] activation function. The only exception is the last layer, which uses a Sigmoid activation function to limit the output values between zero and one. Fig. 2 shows the details of the architecture of the autoencoder used in our approach.

The $L_{recons}$ loss term, which is responsible for constructing the background images, is defined as follows (Eq. 1):

L_{recons}=\sum_{t=1}^{N}\|I_{t}-B_{t}\|_{1},

(1)

Here, $I_{t}$ and $B_{t}$ are the $t^{\text{th}}$ input and output of the autoencoder, respectively, and $\|.\|_{1}$ denotes the $L_{1}$ -norm. We used the $L_{1}$ -norm instead of the $L_{2}$ -norm in $L_{recons}$ because it encourages sparsity [36].

The autoencoder can learn a low-dimensional manifold of the data distribution by applying constraints such as limiting the network’s capacity and choosing a small code size [37]. The network can extract the most salient features of the data, and the $L_{recons}$ loss term imposes similarity between the input and reconstructed frames, allowing the autoencoder to learn a background model during optimization. This is possible because the input image sequence is temporally correlated, and the background of the images is common among them [38].

III-B Dynamic Background Data Preparation

Autoencoders can be optimized in an unsupervised way and do not require labeled data. However, our second network, U-Net, requires pixel-wise labeled training data to be trained. A moving object-free sequence of input frames is used as training data. During the training phase, the autoencoder generates static background images of the training frames. The static background images are then subtracted from the input images, and after applying a threshold on them, the binary images are obtained. These binary images exhibit a dynamic background since they are extracted from training images without any foreground object. The entire process is illustrated in Fig. 1.

III-C Dynamic Background Prediction

The second network in our proposed method is a U-Net, which was originally developed for image segmentation tasks and has shown great success in medical image analysis [16]. Its architecture resembles a U-shape and consists of two paths: a contracting path and an expansive path. The contracting path is composed of convolutional layers followed by ReLU activation functions and max-pooling layers, where the number of features gets doubled in each contracting step. The expansive path consists of up-convolutional layers for upsampling and halving the number of features, concatenations of features from the contracting path, convolutional layers, and ReLU activation functions. In our method, we used the basic U-Net architecture as described in [16].

U-Net is a network that requires pixel-wise labels for supervised training. However, in our framework, it is trained using the prepared binary images that were explained in the previous section, which makes our method weakly supervised. All we need for training is a moving object-free sequence. In other words, if we have a training sequence with frame-level tags whether the frame contains only background or not, we select only the frames with a tag value of zero as the training data. The output of U-Net is a binary image with pixel values of zero or one, where each pixel labeled as one indicates the presence of dynamic background.

III-D Training and Test Phases

III-D1 Training Phase

In the training phase, our proposed method first optimizes the autoencoder on an object-free training sequence to generate static background images. Then, a threshold is applied to the background-subtracted images to obtain dynamic background binary images. The U-Net network is then trained on the same object-free training sequence using the dynamic background binary images as its target output to enable it to predict dynamic background pixels. We train the U-Net model long enough until it overfits to the training sequence. We exploit this overfitting to our advantage because during testing, the U-Net should identify only the dynamic background pixels as label one, while ignoring the moving objects present in the video sequence. All the steps of the training phase are illustrated in Fig. 1.

III-D2 Test Phase

During the test phase, an input test image $I_{t}$ is fed into the autoencoder to obtain the static background image $B_{t}$ . The same input image $I_{t}$ is also fed into the U-Net, which produces a dynamic background image $D_{t}$ . Next, the background subtracted image, $I_{t}-B_{t}$ , is multiplied by the inverted dynamic background, $1-D_{t}$ , as shown in equation 2. This step generates the foreground image $F_{t}$ , which excludes the dynamic background artifacts. Then, a pixel-wise thresholding technique is applied to $F_{t}$ to obtain the initial segmented image, $S_{t}$ , as described in the next section. Finally, two standard post-processing techniques, median blurring, and morphological closing are applied to $S_{t}$ to improve the results, and the final segmented image, $S_{t}^{PostProc}$ , is obtained. The entire process is illustrated in Fig. 1.

F_{t}=(I_{t}-B_{t})\times(1-D_{t})

(2)

III-E Foreground Segmentation with Pixel-wise thresholding

Although most of the dynamic background pixels are detected by the U-Net, there is still a possibility that some of them may be missed due to the selected threshold when preparing the dynamic background ground-truth images. To address this, we use a per-pixel thresholding technique inspired by the SuBSENSE method [10] to obtain the foreground masks. This technique calculates the dynamic entropy of each pixel, represented by the dynamic entropy map $C(x)$ , to detect blinking pixels. The dynamic entropy map tracks how often a pixel changes from being a foreground pixel to a background pixel, or vice versa, between consecutive frames.

The calculation of $C(x)$ is based on the XOR operator and is given by:

C(x)=\frac{1}{N-1}\sum_{t=2}^{N}XOR(S_{t}^{init}(x),S_{t-1}^{init}(x)),

(3)

Here, $x$ represents a pixel, $S_{t}^{init}$ is the binary result of the $t^{\text{th}}$ frame in the sequence after an initial segmentation, and $N$ is the total number of frames in the sequence. The initial segmentation uses $\alpha\max(F)$ as the threshold, $\max(F)$ is the maximum value of the foreground frames $F$ , and $\alpha$ is a coefficient. The dynamic entropy map values, $C$ , range from 0 to 1.

In the next step, we compute the pixel-wise distance thresholds using the following equation:

R(x)=\beta\max(F)+C(x),

(4)

Here, $\max(F)$ is the maximum value of the foreground frames $F$ , and $\beta$ is a coefficient. The binary segmented result $S_{t}$ is obtained by applying the distance threshold $R(x)$ to the foreground $F_{t}(x)$ .

IV Experimental Results and Discussion

IV-A implementation details

Our method was implemented using the Keras deep learning framework [39]. The autoencoder architecture shown in Figure 2 consists of densely connected layers with 64, 32, 16, 4, 16, 32, and 64 units, respectively. All layers use the scaled exponential linear unit (SELU) activation function [35], except for the final layer which uses the sigmoid activation function to produce output values within the range of $[0,1]$ .

The U-Net’s contracting path consists of four steps, each composed of two convolutional layers with a $3\times 3$ kernel size and ReLU activation function, followed by a $2\times 2$ max pooling operation with a stride of 2. The numbers of features are 64, 128, 256, 512, and 1024 for the top-to-bottom steps. The expansive path mirrors the contracting path but with two differences: first, features of the same contracting level are concatenated to the feature channels. Second, the max pooling operation is replaced with a transposed convolution layer with a $3\times 3$ kernel size and stride 2. Consequently, the number of features is halved in each expansive step. The final layer of the model is a convolutional layer with a $1\times 1$ kernel size and two features that construct the binary output image. Our U-Net architecture has the same design as the basic U-Net proposed in [16], except that we use convolutional and transposed convolution layers with the padding mode set to ’same’, which eliminates the need for the cropping operation in [16].

The hyper-parameters $\alpha$ and $\beta$ were set to $0.2$ and $0.08$ , respectively, after conducting several trial and error experiments. We used the Adam optimization algorithm [40] with learning rates of $0.0001$ and $0.005$ for the autoencoder and U-Net, respectively. Both networks were trained for 50 epochs. During the testing phase, we achieved an average processing speed of 107 frames per second on the CDnet 2014 dataset [21] using a GeForce GTX 1080 Ti GPU.

IV-B Datasets

We evaluated the effectiveness of our approach on various video datasets to demonstrate its suitability for real-world scenarios.

IV-B1 CDnet 2014

To show the effectiveness of our method in challenging dynamic background scenarios, we conducted evaluations on the Dynamic Background category of the CDnet 2014 dataset [21]. This category consists of six videos with different types of dynamic background motions. The videos “fountain01” and “fountain02” feature a dynamic water background, while “canoe” and “boats” depict water surface movement. “Overpass” and “fall” exhibit waving trees in the background. Additionally, we evaluated our approach on the Bad Weather category of the same dataset, which features sparse dynamic variations in the background caused by snow and rain, making it a challenging category. The four videos in this category are “blizzard”, “skating”, “snowFall”, and “wetSnow”.

We manually selected the frames without objects for the training data for each sequence. These frames were chosen from the frames before the starting frame in the temporal ROI. In the case of the “WetSnow” video, no background images were available in the initial frames of the sequence. Therefore, we chose 10 object-free frames from the sequence after frame number 2000. We used a maximum of 300 frames for training, or the number of available frames, whichever was less.

IV-B2 I2R Dataset

The I2R dataset [41] is a widely recognized benchmark for background subtraction tasks, consisting of 10 real videos shot in indoor and outdoor settings. These videos contain challenging conditions like bootstrapping, shadows, camouflage, lighting changes, noise, weather, and dynamic backgrounds. To assess our approach, we chose three outdoor scenes with dynamic backgrounds: “Campus,” “Fountain,” and “WaterSurface.” As outlined in the previous section, we manually selected the training frames.

IV-C Evaluation Metric

To evaluate the performance of our method, we utilize the F-Measure (FM) metric, which is a widely used performance indicator for moving object detection and background subtraction algorithms. The F-measure is calculated using the equation shown in (5), which combines the recall and precision scores. In order to maintain consistency with existing methods, we compute all evaluation metrics according to the definitions provided in [21].

\text{F-measure}=2*\frac{\text{Recall}*\text{Precision}}{\text{Recall}+\text{Precision}}

(5)

IV-D Qualitative Results

In Fig. 3, we present the intermediate and final qualitative results of our method’s steps on the videos. The first six rows depict the Dynamic Background category, followed by the next four rows from the Bad Weather category of the CDnet 2014 dataset. The last three rows show videos from the I2R dataset.

The first three columns display the input frame, the autoencoder-generated background, and the background-subtracted image, respectively. The fourth column exhibits the dynamic background image predicted by the U-Net. The next column displays the foreground image obtained by multiplying the background-subtracted image with the inverted version of the dynamic background image. The sixth and seventh rows showcase the initial segmented image after thresholding and the final segmented image after post-processing, respectively. The last column shows the ground-truth images.

In Fig. 3, it is evident from the 4th column that the U-Net model can efficiently capture the dynamic background motion and create a precise representation of the dynamic background image, particularly for the Dynamic Background videos. By comparing the background-subtracted image in the third column with the obtained foreground in the fifth column, it proves our method can effectively decompose the dynamic background from the foreground objects.

For the Bad Weather videos, our method is able to predict some of the dynamic background pixels. However, due to the nature of our autoencoder, it tends to absorb snow noise in the generated static background image, leading to a lack of visible snow pixels in the dynamic background image. Nevertheless, our method is still able to produce high-quality results, demonstrating its effectiveness in handling challenging weather scenarios.

Regarding the I2R dataset, our U-Net accurately predicts the dynamic background pixels in the “Campus” sequence and some of the dynamic background pixels in the “WaterSurface” sequence. However, in the “Fountain” sequence, the U-Net is not able to predict the fountain pixels in the dynamic background image since they are already detected as part of the background generated by the autoencoder. This is because the values of the fountain pixels remain constant in consecutive frames, and therefore, they are absorbed in the static background image.

The key aspect is to effectively separate the foreground pixels from the dynamic background pixels, which our framework achieves well through the use of both the autoencoder and the U-Net models. This ultimately leads to the superior performance of our method.

IV-E Quantitative Results

In this section, we present the quantitative results of our method compared to the top-performing methods on the CDnet 2014 dataset [21] listed on ChangeDetection.net website. Specifically, we chose the top 30 methods based on their average F-measure (FM) performance on the Dynamic Background and Bad Weather categories, excluding supervised methods and the ensemble method IUTIS [19]. We also included the results of the LDB weakly supervised method [14], as well as the CANDID algorithm [42], which was specifically proposed for dynamic background subtraction.

The results are shown in Table I. The second to eighth columns display the results on Dynamic Background videos, and the ninth to thirteenth columns show the results on Bad Weather videos. The methods are sorted based on their average FM on Dynamic Background videos, which is listed in the eighth column. Our method’s results are reported in the last row of the table.

As shown in Table I, our method achieves an average FM of 0.91 on Dynamic Background, outperforming all unsupervised methods and the LDB [14] method. Our method also achieves the highest FM on the “fall” video and performs the best on “fountain01” along with the FTSG method [18]. These results demonstrate the practicality of our method for dynamic background scenes with only the cost of frame-level tag training data.

On Bad Weather sequences, our method achieves an average FM of 0.89, outperforming all unsupervised top-ranked methods, but is surpassed by the LDB weakly supervised method [14], which has an FM of 0.91. Our method also achieves the best FM on the “blizzard” video, while LDB obtains the best FM on the “WetSnow” and “snowFall” sequences.

To compare our method to LDB more comprehensively, we also perform experiments on the I2R dataset and report the results in Table II. As shown, our method achieves the same average FM as LDB, but we obtain slightly better results on the “Fountain” and “WaterSurface” sequences, while LDB performs slightly better on the “Campus” sequence.

A comparison of our method and LDB on various videos shows that our method performs better in the Dynamic Background category, while LDB performs better in the Bad Weather category. For the I2R sequences, both methods achieve similar performance. It is worth noting that LDB uses a low-rank tensor decomposition and is a batch method, whereas our method is an online method achieving realtime performance where an input image is fed through the two networks to obtain the result.

We also compared our method to another weakly supervised method [15] described in the related work section. Minematsu et al. performed experiments on eight categories of the CDnet 2014 dataset [21], but only selected some of the sequences for each category. We performed the same experiments and report the results in Table III. As shown in the table, our method achieves an average FM of $0.72$ , while Minematsu et al. [15] obtain an average FM of $0.66$ . Our method outperforms theirs on the Bad Weather, Dynamic Background, Shadow, Thermal, and Turbulence sequences, while they achieve better performance on the Baseline, Camera Jitter, and Night Videos sequences.

TABLE I: Performance comparison of the top-ranked methods, evaluated on CDnet 2014 Dynamic Background and bad Weather categories, in terms of F-measure. The best performance achieved, in each column, is shown in bold. Methods are sorted based on their Average F-Measure on Dynamic Background Category

Methods	Dynamic Background							Bad Weather
Methods	fount01	fount02	canoe	boats	overpass	fall	Avg	wetSnow	snowFall	blizzard	skating	Avg
GraphCutDiff [43]	0.08	0.91	0.57	0.12	0.84	0.72	0.54	0.83	0.90	0.86	0.92	0.88
CL-VID [44]	0.05	0.45	0.93	0.81	0.85	0.23	0.55	0.54	0.79	0.75	0.87	0.74
C-EFIC [45]	0.27	0.34	0.93	0.37	0.90	0.56	0.56	0.65	0.74	0.86	0.90	0.79
EFIC [46]	0.23	0.91	0.36	0.36	0.88	0.72	0.58	0.62	0.71	0.86	0.92	0.78
Multi_ST_BG [47]	0.14	0.82	0.48	0.89	0.84	0.41	0.60	0.53	0.71	0.71	0.59	0.64
KDE-ElGamm [9]	0.11	0.82	0.88	0.63	0.82	0.31	0.60	0.57	0.78	0.77	0.91	0.76
CP3-online [48]	0.54	0.91	0.63	0.17	0.64	0.77	0.61	0.75	0.76	0.85	0.63	0.75
DCB [49]	0.40	0.83	0.45	0.87	0.83	0.30	0.61	0.30	0.34	0.41	0.49	0.38
GMM_Zivk [5]	0.08	0.79	0.89	0.75	0.87	0.42	0.63	0.58	0.76	0.86	0.76	0.74
GMM_Grim [4]	0.08	0.80	0.88	0.73	0.87	0.44	0.63	0.61	0.73	0.88	0.74	0.74
SOBS_CF [50]	0.11	0.83	0.95	0.91	0.85	0.26	0.65	0.50	0.62	0.67	0.76	0.64
SC_SOBS [51]	0.12	0.89	0.95	0.90	0.88	0.28	0.67	0.50	0.60	0.66	0.90	0.66
AAPSA [52]	0.44	0.36	0.89	0.76	0.82	0.75	0.67	0.63	0.80	0.82	0.85	0.77
M4CD_V2 [53]	0.17	0.93	0.61	0.95	0.95	0.50	0.69	0.69	0.81	0.81	0.94	0.81
RMoG [54]	0.20	0.87	0.94	0.83	0.90	0.67	0.74	0.60	0.58	0.76	0.79	0.68
WeSamBE [55]	0.73	0.94	0.61	0.64	0.72	0.81	0.74	0.81	0.87	0.90	0.86	0.86
Spectral360 [56]	0.47	0.92	0.88	0.69	0.81	0.90	0.78	0.65	0.79	0.67	0.92	0.76
LDB Weak-Supervised [14]	0.14	0.93	0.92	0.92	0.95	0.79	0.78	0.89	0.93	0.90	0.90	0.91
MBS_V0 [57]	0.52	0.92	0.93	0.90	0.90	0.57	0.79	0.43	0.88	0.86	0.92	0.77
MBS [58]	0.52	0.92	0.93	0.90	0.90	0.57	0.79	0.53	0.88	0.86	0.92	0.80
BMOG [59]	0.38	0.93	0.95	0.84	0.96	0.69	0.79	0.69	0.73	0.79	0.92	0.78
CANDID [42]	0.55	0.92	0.91	0.66	0.92	0.81	0.80	0.83	0.78	0.87	0.92	0.85
SBBS [60]	0.73	0.93	0.49	0.94	0.91	0.88	0.81	0.45	0.79	0.81	0.90	0.74
SuBSENSE [10]	0.75	0.94	0.79	0.69	0.86	0.87	0.82	0.80	0.89	0.85	0.91	0.86
SharedModel [61]	0.78	0.94	0.62	0.88	0.82	0.89	0.82	0.73	0.89	0.91	0.86	0.85
CwisarDH [62]	0.61	0.93	0.94	0.84	0.90	0.75	0.83	0.32	0.75	0.91	0.77	0.68
WisenetMD [63]	0.75	0.95	0.87	0.71	0.87	0.87	0.84	0.80	0.89	0.85	0.91	0.86
AMBER [64]	0.77	0.93	0.93	0.85	0.95	0.63	0.84	0.65	0.72	0.79	0.91	0.77
CwisarDRP [65]	0.69	0.92	0.91	0.84	0.92	0.82	0.85	0.71	0.80	0.91	0.78	0.80
CVABS [13]	0.77	0.94	0.88	0.81	0.86	0.91	0.86	0.83	0.84	0.87	0.89	0.86
SWCD [12]	0.76	0.93	0.92	0.85	0.85	0.88	0.86	0.78	0.83	0.82	0.86	0.82
DBSGen [66]	0.73	0.80	0.90	0.91	0.87	0.93	0.86	0.82	0.76	0.80	0.86	0.81
FTSG [18]	0.81	0.95	0.69	0.95	0.94	0.93	0.88	0.71	0.82	0.85	0.91	0.82
PAWCS [11]	0.78	0.94	0.94	0.84	0.96	0.91	0.89	0.75	0.77	0.84	0.90	0.82
Our Method	0.81	0.94	0.92	0.94	0.91	0.94	0.91	0.83	0.89	0.92	0.93	0.89

TABLE II: Performance comparison with LDB weakly supervised method [14], evaluated on three dynamic background videos of I2R dataset, in terms of F-measure. The best performance achieved, in each column, is shown in bold.

Methods	Campus	Fountain	WaterSurface	Avg
LDB [14]	0.85	0.85	0.94	0.88
Our Method	0.83	0.86	0.95	0.88

TABLE III: Performance comparison wit the Weakly Supervised method proposed in [15], evaluated on eight categories of CDnet 2014 , in terms of F-measure. The best performance achieved, in each column, is shown in bold.

Methods	BadWeather	Baseline	CameraJitter	DynamicBg.	NightVideos	Shadow	Thermal	Turbulence	Avg
Weak-Supervised method [15]	0.72	0.97	0.61	0.82	0.38	0.56	0.66	0.58	0.66
Our Method	0.89	0.92	0.42	0.91	0.29	0.92	0.78	0.59	0.72

V Conclusion

In this paper, we presented a novel weakly supervised realtime method for dynamic background subtraction, which utilizes two neural networks: an autoencoder for static background image generation and a U-Net for dynamic background image generation. While the autoencoder learns in an unsupervised manner, the U-Net requires pixel-wise ground-truth labels for supervised training. However, obtaining pixel-wise annotations can be an expensive and time-consuming task. To overcome this challenge, we prepared these labels in a weakly supervised way by selecting training frames that do not contain any moving objects. The autoencoder can generate the static background image by leveraging the temporal correlation between frames. After performing background subtraction and thresholding, the resulting image represents the dynamic background since the input image is moving object-free and only contains dynamic and static background. The U-Net then trains with the same moving object-free sequence of images and the binary dynamic background images as the ground-truth labels. During testing, we can feed an input image to the networks and obtain the static and dynamic background images in the output, resulting in a clean foreground image without dynamic background motions.

Our experiments on various sequences demonstrated that our method is effective in real-world scenarios. Our algorithm outperformed all top-ranked unsupervised methods as well as a weakly supervised method. We performed equally to another state-of-the-art weakly supervised method [14], which is specifically designed for handling dynamic background scenes.

In summary, our proposed method has a training phase followed by an online test phase, during which it can effectively detect dynamic background artifacts and separate them from the moving object foreground.

For future work, we plan to incorporate data augmentation techniques to acquire more comprehensive training data that includes pixel-wise dynamic background labels for images containing moving objects. Additionally, we aim to incorporate data augmentation with different brightness levels to handle illumination changes effectively.

References

[1] T. Bouwmans, S. Javed, M. Sultana, and S. K. Jung, “Deep neural network concepts for background subtraction: A systematic review and comparative evaluation,” Neural Networks, vol. 117, pp. 8–66, 2019.
[2] B. Garcia-Garcia, T. Bouwmans, and A. J. R. Silva, “Background subtraction in real applications: Challenges, current models and future directions,” Computer Science Review, vol. 35, p. 100204, 2020.
[3] Y. Xu, J. Dong, B. Zhang, and D. Xu, “Background modeling methods in video analysis: A review and comparative evaluation,” CAAI Transactions on Intelligence Technology, vol. 1, no. 1, pp. 43–60, 2016.
[4] C. Stauffer and W. E. L. Grimson, “Adaptive background mixture models for real-time tracking,” in Proceedings. 1999 IEEE computer society conference on computer vision and pattern recognition (Cat. No PR00149), vol. 2. IEEE, 1999, pp. 246–252.
[5] Z. Zivkovic, “Improved adaptive gaussian mixture model for background subtraction,” in Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004., vol. 2. IEEE, 2004, pp. 28–31.
[6] Z. Zivkovic and F. Van Der Heijden, “Efficient adaptive density estimation per image pixel for the task of background subtraction,” Pattern recognition letters, vol. 27, no. 7, pp. 773–780, 2006.
[7] D.-S. Lee, “Effective gaussian mixture learning for video background subtraction,” IEEE transactions on pattern analysis and machine intelligence, vol. 27, no. 5, pp. 827–832, 2005.
[8] P. KaewTraKulPong and R. Bowden, “An improved adaptive background mixture model for real-time tracking with shadow detection,” in Video-based surveillance systems. Springer, 2002, pp. 135–144.
[9] A. Elgammal, D. Harwood, and L. Davis, “Non-parametric model for background subtraction,” in European conference on computer vision. Springer, 2000, pp. 751–767.
[10] P.-L. St-Charles, G.-A. Bilodeau, and R. Bergevin, “Subsense: A universal change detection method with local adaptive sensitivity,” IEEE Transactions on Image Processing, vol. 24, no. 1, pp. 359–373, 2014.
[11] ——, “A self-adjusting approach to change detection based on background word consensus,” in 2015 IEEE winter conference on applications of computer vision. IEEE, 2015, pp. 990–997.
[12] Ş. Işık, K. Özkan, S. Günal, and Ö. Nezih Gerek, “Swcd: A sliding window and self-regulated learning-based background updating method for change detection in videos,” Journal of Electronic Imaging, vol. 27, no. 2, p. 023002, 2018.
[13] Ş. Işık, K. Özkan, and Ö. Nezih Gerek, “Cvabs: moving object segmentation with common vector approach for videos,” IET Computer Vision, vol. 13, no. 8, pp. 719–729, 2019.
[14] Z. Zhang, Y. Chang, S. Zhong, L. Yan, and X. Zou, “Learning dynamic background for weakly supervised moving object detection,” Image and Vision Computing, vol. 121, p. 104425, 2022.
[15] T. Minematsu, A. Shimada, and R.-i. Taniguchi, “Simple background subtraction constraint for weakly supervised background subtraction network,” in 2019 16th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS). IEEE, 2019, pp. 1–8.
[16] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18. Springer, 2015, pp. 234–241.
[17] C. R. Wren, A. Azarbayejani, T. Darrell, and A. P. Pentland, “Pfinder: Real-time tracking of the human body,” IEEE Transactions on pattern analysis and machine intelligence, vol. 19, no. 7, pp. 780–785, 1997.
[18] R. Wang, F. Bunyak, G. Seetharaman, and K. Palaniappan, “Static and moving object detection using flux tensor with split gaussian models,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2014, pp. 414–418.
[19] S. Bianco, G. Ciocca, and R. Schettini, “Combination of video change detection algorithms by genetic programming,” IEEE Transactions on Evolutionary Computation, vol. 21, no. 6, pp. 914–928, 2017.
[20] ——, “How far can you get by combining change detection algorithms?” in International conference on image analysis and processing. Springer, 2017, pp. 96–107.
[21] Y. Wang, P.-M. Jodoin, F. Porikli, J. Konrad, Y. Benezeth, and P. Ishwar, “Cdnet 2014: An expanded change detection benchmark dataset,” in Proceedings of the IEEE conference on computer vision and pattern recognition workshops, 2014, pp. 387–394.
[22] F. Gao, Y. Li, and S. Lu, “Extracting moving objects more accurately: a cda contour optimizer,” IEEE Transactions on Circuits and Systems for Video Technology, 2021.
[23] L. A. Lim and H. Y. Keles, “Foreground segmentation using convolutional neural networks for multiscale feature encoding,” Pattern Recognition Letters, vol. 112, pp. 256–262, 2018.
[24] ——, “Learning multi-scale features for foreground segmentation,” Pattern Analysis and Applications, vol. 23, no. 3, pp. 1369–1380, 2020.
[25] G. Rahmon, F. Bunyak, G. Seetharaman, and K. Palaniappan, “Motion u-net: Multi-cue encoder-decoder network for motion segmentation,” in 2020 25th International Conference on Pattern Recognition (ICPR). IEEE, 2021, pp. 8125–8132.
[26] W. Zheng, K. Wang, and F.-Y. Wang, “A novel background subtraction algorithm based on parallel vision and bayesian gans,” Neurocomputing, vol. 394, pp. 178–200, 2020.
[27] Y. Wang, Z. Luo, and P.-M. Jodoin, “Interactive deep learning method for segmenting moving objects,” Pattern Recognition Letters, vol. 96, pp. 66–75, 2017.
[28] M. Babaee, D. T. Dinh, and G. Rigoll, “A deep convolutional neural network for video sequence background subtraction,” Pattern Recognition, vol. 76, pp. 635–649, 2018.
[29] O. Tezcan, P. Ishwar, and J. Konrad, “Bsuv-net: A fully-convolutional neural network for background subtraction of unseen videos,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2020, pp. 2774–2783.
[30] M. O. Tezcan, P. Ishwar, and J. Konrad, “Bsuv-net 2.0: Spatio-temporal data augmentations for video-agnostic supervised background subtraction,” IEEE Access, vol. 9, pp. 53 849–53 860, 2021.
[31] M. Braham, S. Pierard, and M. Van Droogenbroeck, “Semantic background subtraction,” in 2017 IEEE International Conference on Image Processing (ICIP). IEEE, 2017, pp. 4552–4556.
[32] A. Cioppa, M. Van Droogenbroeck, and M. Braham, “Real-time semantic background subtraction,” in 2020 IEEE International Conference on Image Processing (ICIP). IEEE, 2020, pp. 3214–3218.
[33] D. Bank, N. Koenigstein, and R. Giryes, “Autoencoders,” arXiv preprint arXiv:2003.05991, 2020.
[34] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting and composing robust features with denoising autoencoders,” in Proceedings of the 25th international conference on Machine learning, 2008, pp. 1096–1103.
[35] G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter, “Self-normalizing neural networks,” Advances in neural information processing systems, vol. 30, 2017.
[36] E. J. Candes, M. B. Wakin, and S. P. Boyd, “Enhancing sparsity by reweighted $\ell$ 1 minimization,” Journal of Fourier analysis and applications, vol. 14, no. 5-6, pp. 877–905, 2008.
[37] I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio, Deep learning. MIT press Cambridge, 2016, vol. 1.
[38] F. Bahri, M. Shakeri, and N. Ray, “Online illumination invariant moving object detection by generative neural network,” in Proceedings of the 11th Indian Conference on Computer Vision, Graphics and Image Processing, 2018, pp. 1–8.
[39] F. Chollet et al., “Keras,” https://keras.io, 2015.
[40] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[41] L. Li, W. Huang, I. Y.-H. Gu, and Q. Tian, “Statistical modeling of complex backgrounds for foreground object detection,” IEEE Transactions on Image Processing, vol. 13, no. 11, pp. 1459–1472, 2004.
[42] M. Mandal, P. Saxena, S. K. Vipparthi, and S. Murala, “Candid: Robust change dynamics and deterministic update policy for dynamic background subtraction,” in 2018 24th international conference on pattern recognition (ICPR). IEEE, 2018, pp. 2468–2473.
[43] A. Miron and A. Badii, “Change detection based on graph cuts,” in 2015 International Conference on Systems, Signals and Image Processing (IWSSIP). IEEE, 2015, pp. 273–276.
[44] E. López-Rubio, M. A. Molina-Cabello, R. M. Luque-Baena, and E. Domínguez, “Foreground detection by competitive learning for varying input distributions,” International journal of neural systems, vol. 28, no. 05, p. 1750056, 2018.
[45] G. Allebosch, D. Van Hamme, F. Deboeverie, P. Veelaert, and W. Philips, “C-efic: Color and edge based foreground background segmentation with interior classification,” in International joint conference on computer vision, imaging and computer graphics. Springer, 2015, pp. 433–454.
[46] G. Allebosch, F. Deboeverie, P. Veelaert, and W. Philips, “Efic: edge based foreground background segmentation and interior classification for dynamic camera viewpoints,” in International conference on advanced concepts for intelligent vision systems. Springer, 2015, pp. 130–141.
[47] X. Lu, “A multiscale spatio-temporal background model for motion detection,” in 2014 IEEE International Conference on Image Processing (ICIP). IEEE, 2014, pp. 3268–3271.
[48] D. Liang, M. Hashimoto, K. Iwata, X. Zhao et al., “Co-occurrence probability-based pixel pairs background model for robust object detection in dynamic scenes,” Pattern Recognition, vol. 48, no. 4, pp. 1374–1390, 2015.
[49] R. Krungkaew and W. Kusakunniran, “Foreground segmentation in a video by using a novel dynamic codebook,” in 2016 13th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON). IEEE, 2016, pp. 1–6.
[50] L. Maddalena and A. Petrosino, “A fuzzy spatial coherence-based approach to background/foreground separation for moving object detection,” Neural Computing and Applications, vol. 19, no. 2, pp. 179–186, 2010.
[51] ——, “The sobs algorithm: What are the limits?” in 2012 IEEE computer society conference on computer vision and pattern recognition workshops. IEEE, 2012, pp. 21–26.
[52] G. Ramirez-Alonso and M. I. Chacon-Murguia, “Auto-adaptive parallel som architecture with a modular analysis for dynamic object segmentation in videos,” Neurocomputing, vol. 175, pp. 990–1000, 2016.
[53] K. Wang, C. Gou, and F.-Y. Wang, “ $m^{4}cd$ : A robust change detection method for intelligent visual surveillance,” IEEE Access, vol. 6, pp. 15 505–15 520, 2018.
[54] S. Varadarajan, P. Miller, and H. Zhou, “Spatial mixture of gaussians for dynamic background modelling,” in 2013 10th IEEE International Conference on Advanced Video and Signal Based Surveillance. IEEE, 2013, pp. 63–68.
[55] S. Jiang and X. Lu, “Wesambe: A weight-sample-based method for background subtraction,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 9, pp. 2105–2115, 2017.
[56] M. Sedky, M. Moniri, and C. C. Chibelushi, “Spectral-360: A physics-based technique for change detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2014, pp. 399–402.
[57] H. Sajid and S.-C. S. Cheung, “Background subtraction for static & moving camera,” in 2015 IEEE International Conference on Image Processing (ICIP). IEEE, 2015, pp. 4530–4534.
[58] ——, “Universal multimode background subtraction,” IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3249–3260, 2017.
[59] I. Martins, P. Carvalho, L. Corte-Real, and J. L. Alba-Castro, “Bmog: boosted gaussian mixture model with controlled complexity,” in Iberian conference on pattern recognition and image analysis. Springer, 2017, pp. 50–57.
[60] A. Varghese and G. Sreelekha, “Sample-based integrated background subtraction and shadow detection,” IPSJ Transactions on Computer Vision and Applications, vol. 9, no. 1, pp. 1–12, 2017.
[61] Y. Chen, J. Wang, and H. Lu, “Learning sharable models for robust background subtraction,” in 2015 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2015, pp. 1–6.
[62] M. De Gregorio and M. Giordano, “Change detection with weightless neural networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition workshops, 2014, pp. 403–407.
[63] S.-h. Lee, G.-c. Lee, J. Yoo, and S. Kwon, “Wisenetmd: Motion detection using dynamic background region analysis,” Symmetry, vol. 11, no. 5, p. 621, 2019.
[64] B. Wang and P. Dudek, “A fast self-tuning background subtraction algorithm,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2014, pp. 395–398.
[65] M. De Gregorio and M. Giordano, “Wisardrp for change detection in video sequences.” in ESANN, 2017.
[66] F. Bahri and N. Ray, “Dynamic background subtraction by generative neural networks,” in 2022 18th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS). IEEE, 2022, pp. 1–8.