RANDOMSEMO: NORMALITY LEARNING OF MOVING OBJECTS
FOR VIDEO ANOMALY DETECTION
Abstract
Recent anomaly detection algorithms have shown powerful performance by adopting frame-predicting autoencoders. However, these methods face two challenging circumstances. First, they are likely to be trained to be excessively powerful, generating even abnormal frames well, which leads to failure in detecting anomalies. Second, they are distracted by the large number of objects captured in both the foreground and the background. To solve these problems, we propose a novel superpixel-based video data transformation technique named Random Superpixel Erasing on Moving Objects (RandomSEMO) and a Moving Object Loss (MOLoss), built on top of a simple lightweight autoencoder. RandomSEMO is applied to the moving object regions by randomly erasing their superpixels. It forces the network to pay attention to the foreground objects and learn normal features more effectively, rather than simply predicting the future frame. Moreover, MOLoss urges the model to focus on learning the normal objects captured by RandomSEMO by amplifying the loss on the pixels near the moving objects. The experimental results show that our model outperforms state-of-the-art methods on three benchmarks.
Index Terms— Video Anomaly Detection, Data Transformation, Frame Prediction, Superpixel, Moving Object
1 Introduction
Video anomaly detection (VAD) is a computer vision task that discriminates abnormal behaviors within captured scenes. It is gaining interest due to the high cost of monitoring surveillance videos. Because the occurrence of abnormal events is rare compared to normal ones, which is known as the data imbalance problem, the datasets used to develop anomaly detection models consist of a normal training set and a test set containing both normal and abnormal frames [1, 2, 3]. Following this setting, most deep-learning-based VAD models [4, 5, 6, 7, 8, 9, 10] are trained to recognize normal patterns during training and to detect frames with outlying patterns during testing. Some methods [4, 5, 6, 9] utilize frame-predicting autoencoders (AEs), which generate the successive frame from the input frames based on the learned normal patterns. These approaches discriminate abnormal scenes by the quality of the predicted output, assuming that an unusual frame is likely to be generated poorly.


However, these approaches face two major challenges. First, the AEs tend to perform excessively well, generating even abnormal frames with high quality, which hinders anomaly detection. Second, the large number of objects in the foreground and background distracts the AE from focusing on the prominent moving objects, even though abnormal behaviors generally take place in the moving foreground rather than among the background objects.
To cope with these problems, inspired by recently proposed data transformation algorithms [9, 12, 13], we propose a novel video transformation technique named RandomSEMO. It is applied in the data pre-processing stage, before feeding the input to the frame-predicting AE, during the training phase. RandomSEMO erases random superpixels found on the moving object regions. It works under the assumption that the model is likely to learn the most prominent objects' normal patterns when it is given partially insufficient data for the frame prediction task.
We also propose a Moving Object Loss (MOLoss) to maximize the effect of RandomSEMO. MOLoss amplifies the loss of the pixels on the moving objects by generating a weight map. With the proposed loss function, our model focuses on the foreground regions and effectively extracts crucial features with less distraction from background objects.
Our contributions are summarized as follows: (1) We propose a superpixel-based video transformation method called RandomSEMO. (2) We maximize the advantage of RandomSEMO with the proposed loss function, MOLoss. (3) Our model surpasses or performs comparably to other methods.
2 Related Works
Various data transformation techniques, such as augmentations, have been proposed to boost feature learning for computer vision tasks. At the image level, Zhong et al. [14] introduced Random Erasing, which occludes a random patch in the image. DeVries and Taylor [15] suggested CutOut, a method that deletes a box at a random location. Both methods were experimentally shown to enhance performance on various tasks.
A few VAD methods have been proposed using data transformation. Georgescu et al. [16] used sequence-reversed frames to embed motion feature learning. Park et al. [9] devised SRT and TMT, which rotate random rectangular patches in the spatial dimension and shuffle the sequence of random patches, respectively. Our work is closely related to this method. However, compared to the work of Park et al. [9], our proposed RandomSEMO is fully aware of the moving objects and is applied only to the corresponding regions. It transforms the superpixels of those regions, which is more sophisticated and foreground-aware than applying transformations to random rectangular patches. Therefore, our method is superior in localizing ability, as discussed in Sec. 4.2 and Fig. 5.
3 Method
The overall framework is described in Fig. 2. We resize $t$ consecutive frames to $H \times W$ and stack them to make a 4D cuboid $X$. During training, RandomSEMO takes place to transform $X$ into $X'$. Then, $X'$ is fed to the AE, which generates a prediction $\hat{I}_{t+1}$ of the future frame. The AE is trained with the proposed MOLoss between $\hat{I}_{t+1}$ and the ground truth $I_{t+1}$. During inference, RandomSEMO is detached; $X$ is fed to the AE, and the normality score is calculated from $\hat{I}_{t+1}$ to distinguish the abnormal frames.
3.1 RandomSEMO
We propose RandomSEMO, a transformation method that eliminates the details of the foreground by randomly erasing the superpixels of the moving objects. The purpose of RandomSEMO is to let the AE be trained to extract prototypical normal features of the foreground objects rather than simply learning to predict frames well.
To apply RandomSEMO, we first localize the moving objects. Given a sequence of frames $I_1, I_2, \dots, I_t$, we compute the optical flow between each pair of successive frames (Fig. 1 (2)). Then, we generate moving object masks $M_1, \dots, M_{t-1}$ (Fig. 1 (3)) by thresholding the magnitude of the optical flow. Next, we apply SLIC [17] on the masked regions of the former frames to generate superpixel maps (Fig. 1 (4)) with at most $K$ components. Finally, we randomly erase the superpixels by replacing their values with zero (Fig. 1 (5)); each superpixel is erased with probability $p$.
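For concreteness, the following is a minimal sketch of this procedure in Python using scikit-image's SLIC. The function name random_semo, the flow-magnitude threshold mag_thresh, and the default values of n_segments (standing in for the maximum number of components $K$) and the erasing probability $p$ are illustrative assumptions; the optical flow is assumed to be precomputed (e.g., by FlowNet2).

```python
# A minimal sketch of RandomSEMO; hyperparameter values are assumptions.
import numpy as np
from skimage.segmentation import slic

def random_semo(frames, flows, mag_thresh=1.0, n_segments=100, p=0.5, rng=None):
    """Randomly erase superpixels of moving objects in a frame sequence.

    frames: list of t frames, each (H, W, 3)
    flows:  list of t-1 optical flow maps, each (H, W, 2), between successive frames
    Returns the transformed frames and the per-frame moving-object masks M_i.
    """
    rng = rng or np.random.default_rng()
    out, masks = [], []
    for frame, flow in zip(frames[:-1], flows):
        # (1) moving-object mask from the optical-flow magnitude
        mag = np.linalg.norm(flow, axis=-1)
        mask = mag > mag_thresh
        masks.append(mask)

        erased = frame.copy()
        if mask.any():
            # (2) SLIC superpixels restricted to the moving-object region
            seg = slic(frame, n_segments=n_segments, mask=mask, start_label=1)
            # (3) erase each superpixel independently with probability p
            for label in np.unique(seg[seg > 0]):
                if rng.random() < p:
                    erased[seg == label] = 0
        out.append(erased)
    out.append(frames[-1])  # last input frame has no following flow; kept unchanged
    return out, masks
```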
By utilizing RandomSEMO, the network is forced to learn the appearance and motion features of the moving objects to infer the future frame from the randomly erased information, as seen in Fig. 1 (b). This solves the aforementioned excessive predicting power problem of previous methods, because our method makes the model focus on normal feature learning rather than simply extrapolating the upcoming frame from cues in the input frames. Also, the AE extracts normal features effectively because RandomSEMO nudges the model to pay attention to the moving objects, which are more likely to be abnormal than the background.

3.2 MOLoss
To maximize the effect of RandomSEMO, we optimize our model with the newly proposed MOLoss. MOLoss is a weighted sum of a moving object weighted L1 loss $\mathcal{L}_{MO}$ and an SSIM loss [18] $\mathcal{L}_{SSIM}$, both computed between $\hat{I}_{t+1}$ and $I_{t+1}$. $\mathcal{L}_{MO}$ amplifies the loss on the pixels near the moving objects by using a Moving Object Weight Map ($W_{mo}$). Fig. 3 shows an example of the MOWM and its generating process. We sum up all the masks $M_i$ obtained during RandomSEMO to generate a summed binary mask $M_{sum}$ (Fig. 3 (b)), which represents the moving object region of the input frame cuboid $X$. Therefore, $M_{sum}$ is expressed as:
$M_{sum} = \min\!\left(\sum_{i=1}^{t-1} M_i,\; 1\right)$   (1)
Then, we apply a distance function to $M_{sum}$ to generate $W_{mo}$, which encodes the inverse distance from each pixel to the foreground. Thereby, the values near the foreground pixels are large and vice versa. $W_{mo}$ is defined as:
$W_{mo}(i,j) = \dfrac{1}{1 + \lVert q_{i,j} - q^{*}_{i,j} \rVert_2}$   (2)
where $q_{i,j}$ and $q^{*}_{i,j}$ indicate each pixel in $M_{sum}$ and its nearest foreground pixel, respectively. Evidently, the pixels near the foreground have relatively large values at the corresponding locations in $W_{mo}$. Then, $W_{mo}$ is multiplied in the L1 loss calculation, amplifying the loss values near the moving objects. During this stage, we add a small value to the tensorized $W_{mo}$ to avoid zero values. Because $\mathcal{L}_{MO}$ is a pixel-level loss function, we combine it with $\mathcal{L}_{SSIM}$ to guarantee the similarity of $\hat{I}_{t+1}$ and $I_{t+1}$ at the feature level. The equations are as follows:
$\mathcal{L}_{MO} = \dfrac{1}{MN}\sum_{i=1}^{M}\sum_{j=1}^{N} W_{mo}(i,j)\,\bigl|\hat{I}_{t+1}(i,j) - I_{t+1}(i,j)\bigr|$   (3)
$\mathcal{L}_{SSIM} = 1 - \dfrac{(2\mu_{\hat{I}}\mu_{I} + c_1)(2\sigma_{\hat{I}I} + c_2)}{(\mu_{\hat{I}}^2 + \mu_{I}^2 + c_1)(\sigma_{\hat{I}}^2 + \sigma_{I}^2 + c_2)}$   (4)
In Eq. 3, $M$ and $N$ represent the number of pixels along the width and height axes, respectively, and $i$ and $j$ index the location of each pixel along the width and height axes. In Eq. 4, $\mu$ and $\sigma^2$ denote the average and variance of each frame, respectively, and $\sigma_{\hat{I}I}$ represents the covariance. $c_1$ and $c_2$ are variables that stabilize the division. Finally, MOLoss is described as:
$\mathcal{L}_{MOLoss} = \lambda_1 \mathcal{L}_{MO} + \lambda_2 \mathcal{L}_{SSIM}$   (5)
where $\lambda_1$ and $\lambda_2$ denote the weights that control the contribution of $\mathcal{L}_{MO}$ and $\mathcal{L}_{SSIM}$, respectively.
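A minimal PyTorch sketch of Eqs. 1-5 is given below. The helper names, the Euclidean distance transform used to realize Eq. 2, the epsilon value, and the default loss weights are assumptions; SSIM is evaluated globally per frame, as written in Eq. 4.

```python
# A minimal sketch of MOLoss (Eqs. 1-5); constants and shapes are assumptions.
import numpy as np
import torch
from scipy.ndimage import distance_transform_edt

def moving_object_weight_map(masks, eps=1e-3):
    """Eqs. 1-2: summed binary mask -> inverse-distance weight map W_mo."""
    m_sum = np.clip(np.sum(np.stack(masks, 0), axis=0), 0, 1)       # Eq. 1
    # distance from every pixel to its nearest foreground (moving-object) pixel
    dist = distance_transform_edt(1 - m_sum)
    w_mo = 1.0 / (1.0 + dist)                                        # Eq. 2
    return torch.from_numpy(w_mo).float() + eps  # small value avoids zeros

def mo_loss(pred, gt, w_mo, lambda_mo=1.0, lambda_ssim=1.0, c1=1e-4, c2=9e-4):
    """Eqs. 3-5: moving-object-weighted L1 plus a global SSIM loss."""
    l_mo = torch.mean(w_mo * torch.abs(pred - gt))                   # Eq. 3
    mu_p, mu_g = pred.mean(), gt.mean()
    var_p, var_g = pred.var(), gt.var()
    cov = ((pred - mu_p) * (gt - mu_g)).mean()
    ssim = ((2 * mu_p * mu_g + c1) * (2 * cov + c2)) / \
           ((mu_p ** 2 + mu_g ** 2 + c1) * (var_p + var_g + c2))
    l_ssim = 1.0 - ssim                                              # Eq. 4
    return lambda_mo * l_mo + lambda_ssim * l_ssim                   # Eq. 5
```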
3.3 AE Architecture
The AE used in our method is trained to recognize the likelihood of normal frames and to predict the upcoming frame $\hat{I}_{t+1}$ from the transformed input frame cuboid $X'$ using the learned normal features. To avoid excessive generalization, we design a lightweight yet effective AE. As shown in Fig. 2, our AE is composed of a three-layer encoder and a three-layer decoder with skip connections in between to supplement the features lost during downsampling. Each encoder layer consists of 3D convolution, batch normalization, and LeakyReLU activation. After the last encoder layer, an Atrous Spatial Pyramid Pooling (ASPP) [21] layer enlarges the receptive field. The decoder layers consist of 3D deconvolution, batch normalization, and ReLU activation.
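A rough PyTorch sketch of this architecture is shown below. Only the overall structure (a three-layer 3D convolutional encoder and decoder with skip connections and an ASPP bottleneck) follows the description above; the channel widths, kernel sizes, strides, dilation rates, and the way the predicted future frame is read out are illustrative assumptions.

```python
# A minimal sketch of the lightweight prediction AE; layer sizes are assumptions.
import torch
import torch.nn as nn

def enc_block(c_in, c_out):
    # 3D convolution + batch norm + LeakyReLU, spatially downsampling by 2
    return nn.Sequential(
        nn.Conv3d(c_in, c_out, kernel_size=3, stride=(1, 2, 2), padding=1),
        nn.BatchNorm3d(c_out), nn.LeakyReLU(0.2, inplace=True))

def dec_block(c_in, c_out):
    # 3D deconvolution + batch norm + ReLU, spatially upsampling by 2
    return nn.Sequential(
        nn.ConvTranspose3d(c_in, c_out, kernel_size=3, stride=(1, 2, 2),
                           padding=1, output_padding=(0, 1, 1)),
        nn.BatchNorm3d(c_out), nn.ReLU(inplace=True))

class ASPP3D(nn.Module):
    """Parallel dilated 3D convolutions fused by a 1x1x1 convolution."""
    def __init__(self, c, rates=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv3d(c, c, 3, padding=r, dilation=r) for r in rates])
        self.fuse = nn.Conv3d(c * len(rates), c, 1)

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

class PredictionAE(nn.Module):
    def __init__(self, in_ch=3):
        super().__init__()
        self.e1, self.e2, self.e3 = enc_block(in_ch, 32), enc_block(32, 64), enc_block(64, 128)
        self.aspp = ASPP3D(128)
        self.d3, self.d2, self.d1 = dec_block(128, 64), dec_block(64 + 64, 32), dec_block(32 + 32, 32)
        self.out = nn.Conv3d(32, in_ch, kernel_size=3, padding=1)

    def forward(self, x):                        # x: (B, C, t, H, W) input cuboid
        f1 = self.e1(x)
        f2 = self.e2(f1)
        f3 = self.e3(f2)
        y = self.d3(self.aspp(f3))
        y = self.d2(torch.cat([y, f2], dim=1))   # skip connection
        y = self.d1(torch.cat([y, f1], dim=1))   # skip connection
        return self.out(y)[:, :, -1]             # predicted future frame (B, C, H, W)
```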
3.4 Normality Score
During inference, we adopt the peak signal-to-noise ratio (PSNR) to estimate the abnormality of each frame. It is defined as $\mathrm{PSNR}(I_{t+1}, \hat{I}_{t+1}) = 10\log_{10}\dfrac{\max(\hat{I}_{t+1})^2}{\frac{1}{N}\sum_{i=1}^{N}\bigl(I_{t+1}^{\,i} - \hat{I}_{t+1}^{\,i}\bigr)^2}$, where $N$ is the number of pixels in the frame. When $I_{t+1}$ contains anomalies that our network has never seen during training, our network is incapable of predicting a clear $\hat{I}_{t+1}$, resulting in a low PSNR, and vice versa. Following [2, 11, 4, 7, 5, 10, 9, 8], we normalize the PSNR of each video clip to the range $[0, 1]$ to obtain the final normality score $S_t$:
$S_t = \dfrac{\mathrm{PSNR}(I_t, \hat{I}_t) - \min_t \mathrm{PSNR}(I_t, \hat{I}_t)}{\max_t \mathrm{PSNR}(I_t, \hat{I}_t) - \min_t \mathrm{PSNR}(I_t, \hat{I}_t)}$   (6)
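The following is a minimal sketch of this scoring step, assuming pixel values in $[0, 1]$ and a list of per-frame PSNR values collected for one clip.

```python
# A minimal sketch of the frame-level normality score (PSNR and Eq. 6).
import torch

def psnr(pred, gt):
    """PSNR between predicted and ground-truth frames with pixel values in [0, 1]."""
    mse = torch.mean((pred - gt) ** 2)
    return 10.0 * torch.log10(1.0 / mse)

def normality_scores(psnr_values):
    """Eq. 6: min-max normalize the PSNR values of one clip to [0, 1]."""
    p = torch.tensor([float(v) for v in psnr_values])
    return (p - p.min()) / (p.max() - p.min() + 1e-8)
```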
Table 1. Frame-level AUC (%) on three benchmarks and inference speed (FPS). R and M denote RandomSEMO and MOLoss, respectively.

| | STAN [10] | AbnGAN [25] | HyAE [7] | MemAE [11] | AD [26] | IntAE [6] | MNAD-Recon [5] | FastAno [9] | Base | Base+R | Base+R+M (Ours) |
| Avenue [1] | 81.7 | - | 82.8 | 83.3 | - | 83.7 | 82.8 | 85.3 | 84.1 | 83.9 | 85.4 |
| ST [2] | - | - | - | 71.2 | - | 71.5 | 69.8 | 72.2 | 71.6 | 72.2 | 72.4 |
| Ped2 [3] | 92.2 | 93.5 | 84.3 | 94.1 | 95.5 | 96.2 | 90.2 | 96.3 | 93.7 | 94.2 | 95.8 |
| FPS | 50 | - | - | 38 | 2 | 30 | 67 | 195 | 127 | 127 | 127 |
4 Experiments
Dataset. We validate our network on three popular benchmarks: CUHK Avenue [1], ShanghaiTech Campus (ST) [2], and UCSD Ped2 [3]. These datasets are all acquired with fixed cameras in the real world. Among the three, ST [2] is the largest and most complex, containing 330 training videos and 107 testing videos captured from 13 different scenes.
Evaluation metric. For the evaluation, we adopt the area under curve (AUC) obtained from the frame-level scores and the ground truth labels. This metric is used in most studies [4, 7, 6, 9, 11, 12] on VAD. Since the first five frames of each clip cannot be predicted, they are ignored in the evaluation, following other prediction-based methods [4, 6, 5, 9].
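For illustration, a minimal sketch of this evaluation with scikit-learn is given below; the convention of turning the normality score into an anomaly score by negation is an assumption, and the first five frames of each clip are assumed to be removed beforehand.

```python
# A minimal sketch of frame-level AUC evaluation; conventions are assumptions.
from sklearn.metrics import roc_auc_score

def frame_level_auc(scores, labels):
    """scores: per-frame normality scores in [0, 1]; labels: 1 = abnormal, 0 = normal."""
    # A lower normality score should indicate an anomaly, so use 1 - score.
    return roc_auc_score(labels, [1.0 - s for s in scores])
```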
Settings. We implement our experiments in PyTorch on an NVIDIA RTX A6000 GPU, and we use a pre-trained FlowNet2 [23] to generate optical flow maps. The network is optimized with Adam [24]; the batch size, the initial learning rate, and the momentum parameters $\beta_1$ and $\beta_2$ are set empirically. The network is trained for 40 epochs on Ped2 [3] and Avenue [1], and the number of epochs for ST [2], as well as the hyperparameters $p$, $K$, $\lambda_1$, and $\lambda_2$, is likewise chosen empirically.
4.1 Quantitative Results
The quantitative results are shown in Table 1. For a fair comparison, we compare our network with eight VAD networks that do not require any pre-trained network, such as object detectors or optical flow networks, during inference. The accuracy of our model is superior to most of the compared methods on the three datasets. Furthermore, our proposed network is capable of detecting anomalies at 127 frames per second (FPS), faster than most methods. Because accuracy and speed are both mandatory in real-world applications, our network is more practical than other methods that show low accuracy or slow speed.
4.2 Qualitative Results
Fig. 5 shows the output and difference maps of our network compared to those of FastAno [9]. The difference map shows that our network localizes the abnormal object more precisely. Only the biker is saturated in our proposed model, whereas the background pixels near the biker are unnecessarily highlighted in FastAno [9], reflecting the effect of our moving-object-aware RandomSEMO and MOLoss. Furthermore, the difference between foreground and background is more pronounced in our results.
4.3 Impact of RandomSEMO probability
We search for the optimal value of the erasing probability $p$ by varying it over a range of values. Table 2 and Fig. 4 show the results of the experiments conducted on Avenue [1] and Ped2 [3]. For both datasets, the accuracy peaks at intermediate values of $p$. A low $p$ leads to relatively low performance because only a few superpixels are obscured, so the transformed frames are practically no different from the original images. Similarly, the accuracy drops when $p$ is high because too much information is lost.
4.4 Impact of MOLoss
We conducted ablation studies to validate the contribution of MOLoss. The results are compared in Table 3. We experimented by changing the loss function to a plain L1 loss, an SSIM loss, and a weighted sum of L1 and SSIM losses with the same weights as MOLoss. The results show that using MOLoss strongly boosts the performance, especially compared to using a single loss.
5 Conclusion
We propose a novel superpixel-based video transformation method called RandomSEMO for anomaly detection, together with a loss function named MOLoss that boosts the effect of RandomSEMO by amplifying the loss values on the pixels near the moving objects. The experimental results show that our network performs competitively with or surpasses previous methods. Furthermore, we validated the impact of RandomSEMO and MOLoss with ablation studies and output comparisons. As future work, we will verify the localizing ability of our model for more extensive use.
References
- [1] Cewu Lu, Jianping Shi, and Jiaya Jia, “Abnormal event detection at 150 fps in matlab,” in ICCV, 2013, pp. 2720–2727.
- [2] Weixin Luo, Wen Liu, and Shenghua Gao, “A revisit of sparse coding based anomaly detection in stacked rnn framework,” in ICCV, 2017, pp. 341–349.
- [3] Vijay Mahadevan, Weixin Li, Viral Bhalodia, and Nuno Vasconcelos, “Anomaly detection in crowded scenes,” in CVPR. IEEE, 2010, pp. 1975–1981.
- [4] Wen Liu, Weixin Luo, Dongze Lian, and Shenghua Gao, “Future frame prediction for anomaly detection–a new baseline,” in CVPR, 2018, pp. 6536–6545.
- [5] Hyunjong Park, Jongyoun Noh, and Bumsub Ham, “Learning memory-guided normality for anomaly detection,” in CVPR, 2020, pp. 14372–14381.
- [6] Yao Tang, Lin Zhao, Shanshan Zhang, Chen Gong, Guangyu Li, and Jian Yang, “Integrating prediction and reconstruction for anomaly detection,” PR Letters, vol. 129, pp. 123–130, 2020.
- [7] Trong Nguyen Nguyen and Jean Meunier, “Hybrid deep network for anomaly detection,” BMVC, 2019.
- [8] MyeongAh Cho, Taeoh Kim, Ig-Jae Kim, and Sangyoun Lee, “Unsupervised video anomaly detection via normalizing flows with implicit latent features,” arXiv preprint arXiv:2010.07524, 2020.
- [9] Chaewon Park, MyeongAh Cho, Minhyeok Lee, and Sangyoun Lee, “Fastano: Fast anomaly detection via spatio-temporal patch transformation,” in WACV, 2022, pp. 2249–2259.
- [10] Sangmin Lee, Hak Gu Kim, and Yong Man Ro, “Stan: Spatio-temporal adversarial networks for abnormal event detection,” in ICASSP. IEEE, 2018, pp. 1323–1327.
- [11] Dong Gong, Lingqiao Liu, Vuong Le, Budhaditya Saha, Moussa Reda Mansour, Svetha Venkatesh, and Anton van den Hengel, “Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection,” in ICCV, 2019, pp. 1705–1714.
- [12] Muhammad Zaigham Zaheer, Jin-ha Lee, Marcella Astrid, and Seung-Ik Lee, “Old is gold: Redefining the adversarially learned one-class classifier training paradigm,” in CVPR, 2020, pp. 14183–14193.
- [13] Abhishek Joshi and Vinay P Namboodiri, “Unsupervised synthesis of anomalies in videos: transforming the normal,” in IJCNN. IEEE, 2019, pp. 1–8.
- [14] Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang, “Random erasing data augmentation,” in AAAI, 2020, vol. 34, pp. 13001–13008.
- [15] Terrance DeVries and Graham W Taylor, “Improved regularization of convolutional neural networks with cutout,” arXiv preprint arXiv:1708.04552, 2017.
- [16] Mariana-Iuliana Georgescu, Antonio Barbalau, Radu Tudor Ionescu, Fahad Shahbaz Khan, Marius Popescu, and Mubarak Shah, “Anomaly detection in video via self-supervised and multi-task learning,” in CVPR, 2021, pp. 12742–12752.
- [17] Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurelien Lucchi, Pascal Fua, and Sabine Süsstrunk, “Slic superpixels,” Tech. Rep., 2010.
- [18] Hang Zhao, Orazio Gallo, Iuri Frosio, and Jan Kautz, “Loss functions for image restoration with neural networks,” IEEE Transactions on computational imaging, vol. 3, no. 1, pp. 47–57, 2016.
- [19] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri, “Learning spatiotemporal features with 3d convolutional networks,” in ICCV, 2015, pp. 4489–4497.
- [20] Andrew L Maas, Awni Y Hannun, Andrew Y Ng, et al., “Rectifier nonlinearities improve neural network acoustic models,” in ICML. Citeseer, 2013, vol. 30, p. 3.
- [21] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” TPAMI, vol. 40, no. 4, pp. 834–848, 2017.
- [22] Vinod Nair and Geoffrey E Hinton, “Rectified linear units improve restricted boltzmann machines,” in ICML, 2010.
- [23] Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox, “Flownet 2.0: Evolution of optical flow estimation with deep networks,” in CVPR, 2017, pp. 2462–2470.
- [24] Diederik P Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
- [25] Mahdyar Ravanbakhsh, Moin Nabi, Enver Sangineto, Lucio Marcenaro, Carlo Regazzoni, and Nicu Sebe, “Abnormal event detection in videos using generative adversarial nets,” in ICIP. IEEE, 2017, pp. 1577–1581.
- [26] Mahdyar Ravanbakhsh, Enver Sangineto, Moin Nabi, and Nicu Sebe, “Training adversarial discriminators for cross-channel abnormal event detection in crowds,” in WACV. IEEE, 2019, pp. 1896–1904.
- [27] Fei Dong, Yu Zhang, and Xiushan Nie, “Dual discriminator generative adversarial network for video anomaly detection,” IEEE Access, vol. 8, pp. 88170–88176, 2020.